Learn How to Export CSV Data from Neo4j Using Curl, Cypher and jq

I've often been asked how to export CSV data from Neo4j. Since we support multiple ways of importing CSV, there should naturally be an easy way to export CSV as well.

Here’s how you do it:

You can already export CSV from the Neo4j Browser by clicking the download icon on the table view of your Cypher query results. Alternatively, you can use my neo4j-shell-tools to export the results of a Cypher query to a CSV file.

Something I had wanted to do was to use the awesome jq tool for this purpose: a very powerful and fast command-line JSON processor. With just a few selectors you can transform JSON, extract and convert values, and much more.
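As a tiny taste of jq, here is a toy example (unrelated to Neo4j) that shows the selector-plus-@csv style of pipeline we are about to build:

echo '{"rows":[[1,"a"],[2,"b"]]}' | jq -r '.rows[] | @csv'

1,"a"
2,"b"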

In this example, we’ll use the Pokec social network dataset (from Stanford SNAP), which contains about 1.6M people and 30M connections.

We will execute a Cypher statement against the transactional HTTP API endpoint, which is what most Neo4j drivers use as well.

This is the query showing who is connected to whom:

MATCH (p1:PROFILES)-[:RELATION]-(p2)
RETURN p1._key as id1, p1.AGE as age1, p1.gender as gender1,
       p2._key as id2, p2.AGE as age2, p2.gender as gender2
LIMIT 4

The result looks like this:

id1  age1  gender1  id2  age2  gender2
1    26    1        16   23    1
1    26    1        10   22    0
1    26    1        9    0     0
1    26    1        12   26    1

We can POST that query using curl to the transactional endpoint:

curl -H accept:application/json -H content-type:application/json \
  -d '{"statements":[{"statement":"MATCH ... RETURN ..."}]}' \
  http://localhost:7474/db/data/transaction/commit

Here we pass only a single statement in the request payload (the endpoint accepts multiple statements per request) and no query parameters.
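If you do want to use parameters, each statement object in the payload can carry its own parameters map. Here is a sketch (the parameter name limit and its value are just illustrative; depending on your Neo4j version the placeholder is written {limit} or $limit):

curl -H accept:application/json -H content-type:application/json \
  -d '{"statements":[{"statement":"MATCH (p1:PROFILES)-[:RELATION]-(p2) RETURN p1._key as id1 LIMIT {limit}","parameters":{"limit":4}}]}' \
  http://localhost:7474/db/data/transaction/commit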
For our original query, the resulting JSON looks like this:

{"results":
  {"columns":["id1","age1","gender1","id2","age2","gender2"],
   "data":[{"row":[1,26,1,16,23,1]},
           {"row":[1,26,1,10,22,0]},
		   {"row":[1,26,1,9,0,0]},
		   {"row":[1,26,1,12,26,1]}]
   ],
   "errors":[]}

Now we have to take this JSON and transform it to CSV using jq. Follow these simple steps:

  1. Select the results with .results
  2. Take the first entry, since we only sent one statement: .results[0]
  3. Take the columns array out of it (.results[0].columns) and convert it to CSV by piping it to the @csv transformer: .results[0].columns | @csv
  4. Do the same for the data field: .results[0].data
  5. While we're there, grab each entry and extract the content of its row field (.results[0].data[].row) and convert it to CSV too: .results[0].data[].row | @csv
  6. Combine both by selecting the first results object once and piping it into both selectors, separated by a comma, (.results[0]) | .columns,.data[].row, and finally convert everything to CSV: jq '(.results[0]) | .columns,.data[].row | @csv'
  7. Finally, pass the -r parameter so jq outputs raw strings and the CSV quotes are not escaped a second time (each selector is shown on its own right after this list)
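If you want to try those steps one at a time, save the response to a file first (the file name response.json here is just an example) and run each selector on its own:

curl -H accept:application/json -H content-type:application/json \
     -d '{"statements":[{"statement":"MATCH (p1:PROFILES)-[:RELATION]-(p2) RETURN ... LIMIT 4"}]}' \
     http://localhost:7474/db/data/transaction/commit > response.json

jq '.results[0].columns' response.json                           # the column names as a JSON array
jq '.results[0].data[].row' response.json                        # one JSON array per result row
jq -r '.results[0] | .columns,.data[].row | @csv' response.json  # header plus data rows as CSV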
So when we execute our curl command and pipe the result to jq, it looks like this:

curl -H accept:application/json -H content-type:application/json \
     -d '{"statements":[{"statement":"MATCH (p1:PROFILES)-[:RELATION]-(p2) RETURN ... LIMIT 4"}]}' \
     http://localhost:7474/db/data/transaction/commit \
     | jq -r '(.results[0]) | .columns,.data[].row | @csv'

"id1","age1","gender1","id2","age2","gender2"
1,26,1,16,23,1
1,26,1,10,22,0
1,26,1,9,0,0
1,26,1,12,26,1
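To keep the CSV instead of just printing it to the terminal, redirect the output of the same pipeline into a file (the file name connections.csv is only an example):

curl -H accept:application/json -H content-type:application/json \
     -d '{"statements":[{"statement":"MATCH (p1:PROFILES)-[:RELATION]-(p2) RETURN ... LIMIT 4"}]}' \
     http://localhost:7474/db/data/transaction/commit \
     | jq -r '(.results[0]) | .columns,.data[].row | @csv' > connections.csv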

Of course, we also want to see how quickly we can extract data from Neo4j. Without the LIMIT clause our query returns 60 million rows (twice the 30 million relationships, since we match them in both directions). I don't really want to store all that data, which is why I redirect the output to /dev/null:

curl -H accept:application/json -H content-type:application/json \
     -d '{"statements":[{"statement":"MATCH (p1:PROFILES)-[:RELATION]-(p2) RETURN ..."}]}' \
     http://localhost:7474/db/data/transaction/commit \
     | jq -r '(.results[0]) | .columns,.data[].row | @csv' > /dev/null

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100  970M    0  970M  100   184  6118k      1  0:03:04  0:02:42  0:00:22 6035k

This transfers 60 million rows, totaling about 1 gigabyte, at an average of 6.2 MB per second, finishing in about 3 minutes.

Not bad.

About the Author

Michael Hunger, Developer Relations

Michael Hunger has been passionate about software development for a very long time. For the last few years he has been working on the open source Neo4j graph database, filling many roles.

As caretaker of the Neo4j community and ecosystem, he especially loves to work with graph-related projects, users and contributors. As a developer, Michael enjoys many aspects of programming languages, learning new things every day, participating in exciting and ambitious open source projects, and contributing to and writing software-related books and articles.


3 Comments

Sergei Wallace says:

How would this work if I have my cypher query in a .cql file? Also, does this approach work on Windows command prompt?

Kevin Reschke says:

Is there a streaming or batch configuration for large result sets which don’t fit in Neo4j memory?

My graph has 4.7 million nodes, but the nodes are fairly heavy, and the full graph.db directory weighs 30G.

I’ve tried adapting your approach to dump all of my nodes to a json-lines file, using the query “MATCH (n) RETURN n”. But it hangs, not a single line of output is written, then eventually it crashes.

I found a simple solution which works very well: break the job into batches, as shown below:

```
#!/bin/bash
for i in `seq 0 47`; # I know I have just under 4.7 million nodes.
do curl -H accept:application/json -H content-type:application/json -d "{\"statements\":[{\"statement\":\"MATCH (n) RETURN n SKIP $((100000 * $i)) LIMIT 100000\"}]}" http://localhost:7474/db/data/transaction/commit | jq --compact-output '(.results[]) | (.data[]) | .row[0]' >> all_nodes.jsonl;
done;
```

The above command only takes a few minutes and produces a 4.6G output file.

My question: Is there a better way? Some native Neo4j configuration for dumping large amounts of data, where I don’t need to explicitly use SKIP and LIMIT for batching?

Michael Hunger says:

Yes, batching is the way to go there. This approach is not meant for many gigabytes. I'd rather add an export.csv procedure that does that.
