
(and it is a graph)
Importing the Stack Overflow Data into Neo4j
Importing the millions of Stack Overflow questions, users, answers and comments into Neo4j has been a long-time goal of mine. One of the distractions that kept me from doing it was answering many of the 8,200 Neo4j questions out there. Two weeks ago, Damien at Linkurious pinged me in our public Slack channel. He asked about Neo4j’s import performance for ingesting the full Stack Exchange data dump into Neo4j. After a quick discussion, I pointed him to Neo4j’s CSV import tool, which is perfect for the task as the dump consists of only relational tables wrapped in XML. So Damien wrote a small Python script to extract the CSV from XML and with the necessary headers the neo4j-import tool did the grunt work of creating a graph out of huge tables. You can find the script and instructions on GitHub here. Importing the smaller Stack Exchange community data only takes a few seconds. Amazingly, the full Stack Overflow dump with users, questions and answers takes 80 minutes to convert back to CSV and then only 3 minutes to import into Neo4j on a regular laptop with an SSD. Here is how we did it:Download Stack Exchange Dump Files
First, we downloaded the dump files from the Internet archive for the Stack Overflow community (total 11 GB) into a directory:- 7.3G stackoverflow.com-Posts.7z
- 576K stackoverflow.com-Tags.7z
- 154M stackoverflow.com-Users.7z
- 91M stackoverflow.com-Badges.7z
- 2.0G stackoverflow.com-Comments.7z
- 36M stackoverflow.com-PostLinks.7z
- 501M stackoverflow.com-Votes.7z
Unzip the .7z Files
for i in *.7z; do 7za -y -oextracted x $i; doneThis extracts the files into an
extracted
directory and takes 20 minutes and uses 66GB on disk.
Clone Damien’s GitHub repository
The next step was to clone Damien’s GitHub repo:git clone https://github.com/mdamien/stackoverflow-neo4jNote: This command uses Python 3, so you have to install
xmltodict
.
sudo apt-get install python3-setuptools easy_install3 xmltodict
Run the XML-to-CSV Conversion
After that, we ran the conversion of XML to CSV.python3 to_csv.py extractedThe conversion ran for 80 minutes on my system and resulted in 9.5GB CSV files, which were compressed to 3.4G. This is the data structure imported into Neo4j. The header lines of the CSV files provide the mapping. Nodes:
posts.csv postId:ID(Post),title,postType:INT,createdAt,score:INT,views:INT, answers:INT,comments:INT,favorites:INT,updatedAt,body users.csv userId:ID(User),name,reputation:INT,createdAt,accessedAt,url,location, views:INT,upvotes:INT,downvotes:INT,age:INT,accountId:INT tags.csv tagId:ID(Tag),count:INT,wikiPostId:INTRelationships:
posts_answers.csv:ANSWER -> :START_ID(Post),:END_ID(Post) posts_rel.csv:PARENT_OF -> :START_ID(Post),:END_ID(Post) tags_posts_rel.csv:HAS_TAG -> :START_ID(Post),:END_ID(Tag) users_posts_rel.csv:POSTED -> :START_ID(User),:END_ID(Post)
Import into Neo4j
We then used the Neo4j import toolneo/bin/neo4j-import
to ingest Posts, Users, Tags and the relationships between them.
../neo/bin/neo4j-import \ --into ../neo/data/graph.db \ --id-type string \ --nodes:Post csvs/posts.csv \ --nodes:User csvs/users.csv \ --nodes:Tag csvs/tags.csv \ --relationships:PARENT_OF csvs/posts_rel.csv \ --relationships:ANSWER csvs/posts_answers.csv \ --relationships:HAS_TAG csvs/tags_posts_rel.csv \ --relationships:POSTED csvs/users_posts_rel.csvThe actual import only takes 3 minutes, creating a graph store of 18 GB.
IMPORT DONE in 3m 48s 579ms. Imported: 31138559 nodes 77930024 relationships 260665346 properties
Neo4j Configuration
We then wanted to adapt Neo4j’s config inconf/neo4j.properties
to increase the dbms.pagecache.memory
option to 10G. We also edited the conf/neo4j-wrapper.conf
to provide some more heap, like 4G or 8G.
Then we started the Neo4j server with ../neo/bin/neo4j start
Adding Indexes
We then had the option of running the next queries either directly in Neo4j’s server UI or on the command-line with../neo/bin/neo4j-shell
which connects to the running server.
Here’s how much data we had in there:
neo4j-sh (?)$ match (n) return head(labels(n)) as label, count(*); +-------------------+ | label | count(*) | +-------------------+ | "Tag" | 41719 | | "User" | 4551115 | | "Post" | 26545725 | +-------------------+ 3 rowsNext, we created some indexes and constraints for later use:
create index on :Post(title); create index on :Post(createdAt); create index on :Post(score); create index on :Post(views); create index on :Post(favorites); create index on :Post(answers); create index on :Post(score); create index on :User(name); create index on :User(createdAt); create index on :User(reputation); create index on :User(age); create index on :Tag(count); create constraint on (t:Tag) assert t.tagId is unique; create constraint on (u:User) assert u.userId is unique; create constraint on (p:Post) assert p.postId is unique;We then waited for the indexes to be finished.
schema awaitPlease note: Neo4j as a graph database wasn’t originally built for these global-aggregating queries. That’s why the responses are not instant.
Getting Insights with Cypher Queries
Below are just some of the insights we gleaned from the Stack Overflow data using Cypher queries:The Top 10 Stack Overflow Users
match (u:User) with u,size( (u)-[:POSTED]->()) as posts order by posts desc limit 10 return u.name, posts; +---------------------------+ | u.name | posts | +---------------------------+ | "Jon Skeet" | 32174 | | "Gordon Linoff" | 20989 | | "Darin Dimitrov" | 20871 | | "BalusC" | 16579 | | "CommonsWare" | 15493 | | "anubhava" | 15207 | | "Hans Passant" | 15156 | | "Martijn Pieters" | 14167 | | "SLaks" | 14118 | | "Marc Gravell" | 13400 | +---------------------------+ 10 rows 7342 ms
The Top 5 tags That Jon Skeet Used in Asking Questions
It seems he never really asked questions, but only answered. 🙂match (u:User)-[:POSTED]->()-[:HAS_TAG]->(t:Tag) where u.name = "Jon Skeet" return t,count(*) as posts order by posts desc limit 5; +------------------------------------------------+ | t | posts | +------------------------------------------------+ | Node[31096861]{tagId:"c#"} | 14 | | Node[31096855]{tagId:".net"} | 7 | | Node[31101268]{tagId:".net-4.0"} | 4 | | Node[31118174]{tagId:"c#-4.0"} | 4 | | Node[31096911]{tagId:"asp.net"} | 3 | +------------------------------------------------+ 10 rows 36 ms
The Top 5 Tags that BalusC Answered
match (u:User)-[:POSTED]->()-[:HAS_TAG]->(t:Tag) where u.name = "BalusC" return t.tagId,count(*) as posts order by posts desc limit 5; +------------------------+ | t.tagId | posts | +------------------------+ | "java" | 5 | | "jsf" | 3 | | "managed-bean" | 2 | | "eclipse" | 2 | | "cdi" | 2 | +------------------------+ 5 rows 23 ms
How am I Connected to Darin Dimitrov
MATCH path = allShortestPaths( (u:User {name:"Darin Dimitrov"})-[*]-(me:User {name:"Michael Hunger"})) RETURN path;
Result Visualisation in Neo4j Browser
Which Mark Answered the Most Questions about neo4j?
MATCH (u:User)-[:POSTED]->(answer)<-[:PARENT_OF]-()-[:HAS_TAG]-(:Tag {tagId:"neo4j"}) WHERE u.name like "Mark %" RETURN u.name, u.reputation,u.location,count(distinct answer) AS answers ORDER BY answers DESC; +--------------------------------------------------------------------------+ | u.name | u.reputation | u.location | answers | +--------------------------------------------------------------------------+ | "Mark Needham" | 1352 | "United Kingdom" | 36 | | "Mark Leighton Fisher" | 4065 | "Indianapolis, IN" | 3 | | "Mark Byers" | 377313 | "Denmark" | 2 | | "Mark Whitfield" | 899 | <null> | 1 | | "Mark Wojciechowicz" | 1473 | <null> | 1 | | "Mark Hughes" | 586 | "London, UK" | 1 | | "Mark Mandel" | 859 | "Melbourne, Australia" | 1 | | "Mark Jackson" | 56 | "Atlanta, GA" | 1 | +--------------------------------------------------------------------------+ 8 rows 38 ms
Top 20 paths rendered as graph
The Top 5 Tags of All Time
match (t:Tag) with t order by t.count desc limit 5 return t.tagId, t.count; +------------------------+ | t.tagId | t.count | +------------------------+ | "javascript" | 917772 | | "java" | 907289 | | "c#" | 833458 | | "php" | 791534 | | "android" | 710585 | +------------------------+ 5 rows 30 ms
Co-occurrence of the javascript Tag
match (t:Tag {tagId:"javascript"})<-[:HAS_TAG]-()-[:HAS_TAG]->(other:Tag) WITH other, count(*) as freq order by freq desc limit 5 RETURN other.tagId,freq; +----------------------+ | other.tagId | freq | +----------------------+ | "jquery" | 318868 | | "html" | 165725 | | "css" | 76259 | | "php" | 65615 | | "ajax" | 52080 | +----------------------+ 5 rows
The Most Active Answerers for the neo4j Tag
Quick aside: Thank you to everyone who answered Neo4j questions!match (t:Tag {tagId:"neo4j"})<-[:HAS_TAG]-() -[:PARENT_OF]->()<-[:POSTED]-(u:User) WITH u, count(*) as freq order by freq desc limit 10 RETURN u.name,freq; +-------------------------------+ | u.name | freq | +-------------------------------+ | "Michael Hunger" | 1352 | | "Stefan Armbruster" | 760 | | "Peter Neubauer" | 308 | | "Wes Freeman" | 277 | | "FrobberOfBits" | 277 | | "cybersam" | 277 | | "Luanne" | 235 | | "Christophe Willemsen" | 190 | | "Brian Underwood" | 169 | | "jjaderberg" | 161 | +-------------------------------+ 10 rows 45 ms
Where Else Were the Top Answerers Also Active?
MATCH (neo:Tag {tagId:"neo4j"})<-[:HAS_TAG]-() -[:PARENT_OF]->()<-[:POSTED]-(u:User) WITH neo,u, count(*) as freq order by freq desc limit 10 MATCH (u)-[:POSTED]->()<-[:PARENT_OF]-(p)-[:HAS_TAG]->(other:Tag) WHERE NOT (p)-[:HAS_TAG]->(neo) WITH u,other,count(*) as freq2 order by freq2 desc RETURN u.name,collect(distinct other.tagId)[1..5] as tags; +----------------------------------------------------------------------------------------+ | u.name | tags | +----------------------------------------------------------------------------------------+ | "cybersam" | ["java","javascript","node.js","arrays"] | | "Luanne" | ["spring-data-neo4j","java","cypher","spring"] | | "Wes Freeman" | ["go","node.js","java","php"] | | "Peter Neubauer" | ["graph","nosql","data-structures","java"] | | "Brian Underwood" | ["ruby-on-rails","neo4j.rb","ruby-on-rails-3","activerecord"] | | "Michael Hunger" | ["spring-data-neo4j","nosql","cypher","graph-databases"] | | "Christophe Willemsen" | ["php","forms","doctrine2","sonata"] | | "Stefan Armbruster" | ["groovy","intellij-idea","tomcat","grails-plugin"] | | "FrobberOfBits" | ["python","xsd","xml","django"] | | "jjaderberg" | ["vim","logging","python","maven"] | +----------------------------------------------------------------------------------------+ 10 rows 84 msNote that this Cypher query above contains the equivalent of 14 SQL joins.
Rendered in Linkurious Visualizer
People Who Posted the Most Questions about Neo4j
MATCH (t:Tag {tagId:'neo4j'})<-[:HAS_TAG]-(:Post)<-[:POSTED]-(u:User) RETURN u.name,count(*) as count ORDER BY count DESC LIMIT 10; +------------------------+ | c.name | count | +------------------------+ | "LDB" | 39 | | "deemeetree" | 39 | | "alexanoid" | 38 | | "MonkeyBonkey" | 35 | | "Badmiral" | 35 | | "Mik378" | 27 | | "Kiran" | 25 | | "red-devil" | 24 | | "raHul" | 23 | | "Sovos" | 23 | +------------------------+ 10 rows 42 ms
The Top Answerers for the py2neo Tag
MATCH (:Tag {tagId:'py2neo'})<-[:HAS_TAG]-()-[:PARENT_OF]->() <-[:POSTED]-(u:User) RETURN u.name,count(*) as count ORDER BY count DESC LIMIT 10; +--------------------------------+ | u.name | count | +--------------------------------+ | "Nigel Small" | 88 | | "Martin Preusse" | 24 | | "Michael Hunger" | 22 | | "Nicole White" | 9 | | "Stefan Armbruster" | 8 | | "FrobberOfBits" | 6 | | "Peter Neubauer" | 5 | | "Christophe Willemsen" | 5 | | "cybersam" | 4 | | "Wes Freeman" | 4 | +--------------------------------+ 10 rows 2 ms
Which Users Answered Their Own Question
This global graph query takes a bit of time as it touches 200 million paths in the database, it returns after about 60 seconds.If you would want to execute it only on a subset of the 4.5M users you could add a filtering condition, e.g. on reputation.
MATCH (u:User) WHERE u.reputation > 20000 MATCH (u)-[:POSTED]->(question)-[:ANSWER]->(answer)<-[:POSTED]-(u) WITH u,count(distinct question) AS questions ORDER BY questions DESC LIMIT 5 RETURN u.name, u.reputation, questions; +---------------------------------------------+ | u.name | u.reputation | questions | +---------------------------------------------+ | "Stefan Kendall" | 31622 | 133 | | "prosseek" | 31411 | 114 | | "Cheeso" | 100779 | 107 | | "Chase Florell" | 21207 | 99 | | "Shimmy" | 29175 | 96 | +---------------------------------------------+ 5 rows 10 seconds
More Information
We’re happy to provide you with the graph database of the Stack Overflow dump here:- Neo4j database dump for 2.3-SNAPSHOT or 2.2.4
- Running Neo4j Server to explore the data (read-only)
- CSV Files
- LOAD JSON from URL AS Data
- Making Master Data Management Fun with Neo4j
- Visualizing Stack Overflow
- Embrace Relationships with Neo4J, R & Java
- Please also check out the Stack Overflow developer survey. It’s a very interesting read.
Explore: csv import cypher javascript Linkurious neo4j py2neo python stack exchange stack overflow xml to csv
About the Author
Michael Hunger , Developer Relations

Michael Hunger has been passionate about software development for a very long time. For the last few years he has been working on the open source Neo4j graph database filling many roles.
As caretaker of the Neo4j community and ecosystem he especially loves to work with graph-related projects, users and contributors. As a developer, Michael enjoys many aspects of programming languages, learning new things every day, participating in exciting and ambitious open source projects and contributing and writing software related books and articles.
12 Comments
Great article but it would be clearer if the Relationship headers were listed in the article as …
Relationships:
posts_answers.csv
:ANSWER -> :START_ID(Post),:END_ID(Post)
posts_rel.csv
:PARENT_OF -> :START_ID(Post),:END_ID(Post)
tags_posts_rel.csv
:HAS_TAG -> :START_ID(Post),:END_ID(Tag)
users_posts_rel.csv
:POSTED -> :START_ID(User),:END_ID(Post)
Great! I was looking for starting off with Neo4J and what an article i got for start. Well explained and detailed out.
i would like to replicate this.
my problem is: your descriptions are for linux and i am working on windows…
everything is a little different there.
I managed to:
1. install python
2. install setup tools (D:\Programme\dev\Python36-32>python -m pip install -U pip setuptools)
3. install xmltodict (D:\Programme\dev\Python36-32\Scripts>easy_install.exe xmltodict)
But now i am stuck what is to_csv.py and where do i find it?
D:\Programme\dev\Python36-32>python.exe to_csv.py
python.exe: can’t open file ‘to_csv.py’: [Errno 2] No such file or directory
Hi in latest version please app encoding=”utf8″ to open statement to get it working …Great Post BTW
This doc needs update, `like` is no longer a valid cypher in the latest neo4j version.
The generated files csv files are different from what listed in this doc, should remove `–relationships:ANSWER csvs/posts_answers.csv \`
I’m having trouble executing the command:
for i in *.7z; do 7za -y -oextracted x $i; done
When I run it in the directory where my files are, I get 3 errors. When I run it in the directory before I get one error.
The error says:
-bash: 7za: command not found
I am getting the error
“-bash: 7za: command not found”
How do I fix this?
I am having a problem with running the command to convert from xml to csv
python3 to_csv.py extracted
If I have a to_csv.py file in the directory I run the command in, the command goes thru with no errors but doesn’t do anything.
If I run the command without the to_csv.py file existing, I get an error.
Could someone help with this error (warning):
‘utf-8’ codec can’t decode byte 0xc2 in position 975…
after execute command: python3 to_csv.py extracted
Hi. Amazing post and it covers almost every aspect that a beginner like me needs to know while starting with neo4j or graph databases in general. I am currently trying to work on various graph isomorphism algorithms and compare different indexing techniques which would improve the performance. While trying to search for the dataset to work on, I landed on this page and trying to analyze if using stack overflow data would be a wise choice for the experimentation. Also, how many disjointed graphs the database contains? Is it more than one or just one single huge graph of data?
They need to become ready with new approaches to keep their potential customers
interested. It also tells about the hundred miles being taken for your birds
to discover their way home again. Either we can easily
do this using tracing it using illustrator or we can accomplish that naturally by hand.
Leave a Reply

Upcoming Event
Have a Graph Question?
Reach out and connect with the Neo4j staff.
Stack OverflowCommunity Forums
Contact Us
Share your Graph Story?
Email us: content@neo4j.com
The graphviz are awesome!
How many times does it take to build the indexes?