Updated: Efficient Neo4j Data Import Using Cypher-Scripts


How the new Cypher parser in Neo4j 4.2 made parameterized imports more than 4x faster


This is an updated version of an article by Andrea Santurbano. Please read that article before proceeding with this post.

In this updated version, we run the benchmarks on Neo4j 3.5, 4.1, and 4.2.

What’s New in Neo4j 4.2?

In the new Neo4j 4.2 release, the Cypher parser has been rewritten (moving from Parboiled to JavaCC), which brings a significant improvement in the import benchmark.

Small Recap

We will be exporting and importing data using three different optimization types. Andrea’s article explains all three in more depth.

No Optimization

The generated file will contain a CREATE statement for each node to be imported.

CREATE (:Foo:`UNIQUE IMPORT LABEL` {name:"foo", `UNIQUE IMPORT ID`:0});
CREATE (:Foo:`UNIQUE IMPORT LABEL` {name:"bar", `UNIQUE IMPORT ID`:1});
...
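To make the shape of the unoptimized file concrete, here is a minimal Python sketch that produces one CREATE statement per node. It is not how APOC generates the file; `cypher_literal` and `create_statements` are hypothetical helpers, and property keys are assumed to be plain identifiers that need no escaping.

```python
import json


def cypher_literal(value):
    """Render a Python scalar as a Cypher literal (strings, numbers, booleans only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # json.dumps gives a double-quoted string with backslash escapes.
    return json.dumps(str(value))


def create_statements(nodes):
    """Yield one CREATE statement per (label, properties) pair."""
    for import_id, (label, props) in enumerate(nodes):
        pairs = [f"{key}:{cypher_literal(val)}" for key, val in props.items()]
        pairs.append(f"`UNIQUE IMPORT ID`:{import_id}")
        yield f"CREATE (:{label}:`UNIQUE IMPORT LABEL` {{{', '.join(pairs)}}});"


nodes = [("Foo", {"name": "foo"}), ("Foo", {"name": "bar"})]
for statement in create_statements(nodes):
    print(statement)
```

Every node becomes its own statement, so the shell has to parse and send as many statements as there are nodes. That is the cost the two optimizations below try to reduce.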

Unwind Batch

UNWIND gives a more efficient statement structure: it turns a batch list of data entries into individual rows, each of which carries the data for one CREATE.

UNWIND [{_id:3, properties:{age:12}}] as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Bar;
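The batching idea can be sketched in Python as follows. This is an illustration under assumptions, not APOC's implementation: `to_cypher` and `unwind_statements` are hypothetical names, map keys are assumed to be plain identifiers, and `batch_size` plays the role that `unwindBatchSize` plays in the export calls.

```python
import json


def to_cypher(value):
    """Render nested Python data as a Cypher literal (maps use unquoted keys)."""
    if isinstance(value, dict):
        return "{" + ", ".join(f"{k}:{to_cypher(v)}" for k, v in value.items()) + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(to_cypher(v) for v in value) + "]"
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return repr(value)
    return json.dumps(str(value))


def unwind_statements(rows, label, batch_size):
    """Yield one UNWIND ... CREATE statement per batch of rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield (
            f"UNWIND {to_cypher(batch)} AS row\n"
            "CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) "
            f"SET n += row.properties SET n:{label};"
        )


rows = [{"_id": i, "properties": {"age": 10 + i}} for i in range(5)]
for statement in unwind_statements(rows, "Bar", batch_size=2):
    print(statement)
```

With five rows and a batch size of two, the sketch emits three statements instead of five. The data is still inlined as literals, so each statement text is different from the others.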

Unwind Batch Parameters

This is the same as the previous approach, but it uses the shell’s query parameter support to speed things up.

:param rows => [{_id:4, properties:{age:12}}, {_id:5, properties:{age:4}}]

UNWIND $rows AS row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Bar;
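A small Python sketch of this layout is shown below. It assumes rows with integer ids and integer property values only, and `render_row` and `param_script` are hypothetical helpers. It is meant to show the structure of the script rather than reproduce the exact file APOC writes.

```python
# The query text is fixed; only the value bound to $rows changes per batch.
QUERY = (
    "UNWIND $rows AS row\n"
    "CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) "
    "SET n += row.properties SET n:Bar;"
)


def render_row(row_id, properties):
    """Render one row as a Cypher map; property values are assumed to be integers."""
    props = ", ".join(f"{key}:{int(value)}" for key, value in properties.items())
    return f"{{_id:{int(row_id)}, properties:{{{props}}}}}"


def param_script(rows, batch_size):
    """Build a script with one :param line followed by the fixed query per batch."""
    lines = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        rendered = ", ".join(render_row(row_id, props) for row_id, props in batch)
        lines.append(f":param rows => [{rendered}]")
        lines.append(QUERY)
    return "\n".join(lines)


rows = [(4, {"age": 12}), (5, {"age": 4}), (6, {"age": 30})]
script = param_script(rows, batch_size=2)
print(script)
```

The difference from the plain UNWIND batch is that the statement after each `:param` line is always the same text. Identical query text is generally what lets a database reuse work between executions instead of treating each batch as a new query.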

Neo4j 3.5

We used the 3.5.22 release. This is the export benchmark.

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('3.5_exportDataCypherShellNoOptimizations.cypher',{format:'cypher-shell', useOptimizations: {type: 'none'}, batchSize:100})"
real 0m44.871s
user 0m1.354s
sys 0m0.178s

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('3.5_exportDataCypherShellUnwindBatch.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch', unwindBatchSize: 20}, batchSize:100})"
real 0m29.257s
user 0m1.397s
sys 0m0.181s

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('3.5_exportDataCypherShellUnwindBatchParams.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch_params', unwindBatchSize:100}})"
real 0m25.333s
user 0m1.393s
sys 0m0.182s

The import benchmark for Neo4j 3.5:

$ time cypher-shell -u neo4j -p davide < "import/3.5_exportDataCypherShellNoOptimizations.cypher"
real 100m24.805s
user 5m39.444s
sys 4m7.330s

$ time cypher-shell -u neo4j -p davide < "import/3.5_exportDataCypherShellUnwindBatch.cypher"
real 31m33.870s
user 1m12.383s
sys 0m30.247s

$ time cypher-shell -u neo4j -p davide < "import/3.5_exportDataCypherShellUnwindBatchParams.cypher"
real 10m28.723s
user 8m4.257s
sys 0m5.748s

Neo4j 4.1

We used the 4.1.4 release. Here is the export benchmark.

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('4.1_exportDataCypherShellNoOptimizations.cypher',{format:'cypher-shell', useOptimizations: {type: 'none'}, batchSize:100})"
real 0m42.675s
user 0m1.437s
sys 0m0.218s

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('4.1_exportDataCypherShellUnwindBatch.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch', unwindBatchSize: 20}, batchSize:100})"
real 0m30.574s
user 0m1.399s
sys 0m0.214s

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('4.1_exportDataCypherShellUnwindBatchParams.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch_params', unwindBatchSize:100}})"
real 0m25.393s
user 0m1.376s
sys 0m0.221s

Import:

$ time cypher-shell -u neo4j -p davide < "import/4.1_exportDataCypherShellNoOptimizations.cypher"
real 135m37.920s
user 4m32.836s
sys 3m43.420s

$ time cypher-shell -u neo4j -p davide < "import/4.1_exportDataCypherShellUnwindBatch.cypher"
real 44m13.016s
user 0m53.779s
sys 0m28.362s

$ time cypher-shell -u neo4j -p davide < "import/4.1_exportDataCypherShellUnwindBatchParams.cypher"
real 10m8.991s
user 8m39.109s
sys 0m5.342s

Neo4j 4.2

We used the 4.2 release. Here’s the export benchmark.

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('4.2_exportDataCypherShellNoOptimizations.cypher',{format:'cypher-shell', useOptimizations: {type: 'none'}, batchSize:100})"
real 0m42.951s
user 0m1.379s
sys 0m0.207s

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('4.2_exportDataCypherShellUnwindBatch.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch', unwindBatchSize: 20}, batchSize:100})"
real 0m29.523s
user 0m1.392s
sys 0m0.213s

$ time cypher-shell -u neo4j -p davide "call apoc.export.cypher.all('4.2_exportDataCypherShellUnwindBatchParams.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch_params', unwindBatchSize:100}})"
real 0m25.900s
user 0m1.381s
sys 0m0.203s

Import:

$ time cypher-shell -u neo4j -p davide < "import/4.2_exportDataCypherShellNoOptimizations.cypher"
real 122m23.241s
user 4m28.974s
sys 3m40.094s

$ time cypher-shell -u neo4j -p davide < "import/4.2_exportDataCypherShellUnwindBatch.cypher"
real 36m51.066s
user 0m51.777s
sys 0m27.773s

$ time cypher-shell -u neo4j -p davide < "import/4.2_exportDataCypherShellUnwindBatchParams.cypher"
real 2m21.473s
user 0m42.900s
sys 0m3.190s
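The `user` figures are worth a second look. `time` reports `user` as the CPU time spent in user mode by the timed process, which here is the cypher-shell client, not the database server. The sketch below takes the `real` and `user` values of the three shell-parameters imports and computes what share of the wall-clock time was client CPU; `to_seconds` and `client_cpu_share` are hypothetical helper names.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '10m28.723s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (real, user) for the unwind_batch_params imports, copied from the runs above.
PARAMS_IMPORT = {
    "3.5": ("10m28.723s", "8m4.257s"),
    "4.1": ("10m8.991s", "8m39.109s"),
    "4.2": ("2m21.473s", "0m42.900s"),
}


def client_cpu_share(real, user):
    """Fraction of wall-clock time that cypher-shell spent on user-mode CPU."""
    return to_seconds(user) / to_seconds(real)


for version, (real, user) in PARAMS_IMPORT.items():
    share = client_cpu_share(real, user)
    print(f"Neo4j {version}: {share:.0%} of wall time was client user CPU")
```

The shares come out at roughly 77% for 3.5, 85% for 4.1, and 30% for 4.2. One plausible reading is that in the older versions most of the import time went to client-side work in cypher-shell, which would fit the parser rewrite explanation. The benchmark does not isolate this, so treat it as a hint rather than a measurement.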

Conclusions

Let’s take a look at the results:

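As you can see, export times are practically the same across the three versions. Neo4j 4.2 is marginally slower than 4.1 in two of the three cases, but the difference is about half a second or less, so it is nothing to worry about.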

The import results are more interesting. Look at the import with shell parameters: it went from about 10 minutes in Neo4j 4.1 to a bit over two minutes in Neo4j 4.2. The import is faster overall than in Neo4j 4.1. Compared to 3.5, the first two import modes are actually somewhat slower in 4.2, but the shell-parameters import is more than four times faster.
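For readers who want the ratios rather than the raw timings, the following Python sketch computes the import speedups from the `real` values reported in the benchmark sections. The numbers are copied from the runs; the `seconds` and `speedup` helpers are only illustrative.

```python
# Import wall-clock ("real") times from the benchmarks, as (minutes, seconds).
IMPORT_REAL = {
    "none": {"3.5": (100, 24.805), "4.1": (135, 37.920), "4.2": (122, 23.241)},
    "unwind_batch": {"3.5": (31, 33.870), "4.1": (44, 13.016), "4.2": (36, 51.066)},
    "unwind_batch_params": {"3.5": (10, 28.723), "4.1": (10, 8.991), "4.2": (2, 21.473)},
}


def seconds(minutes_seconds):
    """Convert a (minutes, seconds) pair to seconds."""
    minutes, secs = minutes_seconds
    return minutes * 60 + secs


def speedup(mode, old, new):
    """How many times faster version `new` is than `old`; below 1 means slower."""
    times = IMPORT_REAL[mode]
    return seconds(times[old]) / seconds(times[new])


for mode in IMPORT_REAL:
    print(
        f"{mode:>20}: 4.1 -> 4.2 {speedup(mode, '4.1', '4.2'):.2f}x, "
        f"3.5 -> 4.2 {speedup(mode, '3.5', '4.2'):.2f}x"
    )
```

For `unwind_batch_params` the ratio is about 4.3x against 4.1 and about 4.4x against 3.5. For the other two modes, 4.2 is roughly 1.1x to 1.2x faster than 4.1 and comes out below 1x against 3.5, meaning it is slower there. These are single runs on one machine, so small differences should not be over-interpreted.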


Updated: Efficient Neo4j Data Import Using Cypher-Scripts was originally published in Neo4j Developer Blog on Medium.