Importing Data Into Neo4j via CSV

January 14, 2015

Originally posted on the GrapheneDB Blog

This post will explain how to import data from a CSV file into Neo4j. After outlining the steps to take, we list some special considerations for GrapheneDB users.

One of the most important steps when evaluating a new technology for your stack is importing existing data. CSV is one of the most popular standards for data exchange and most of the popular database engines support exporting data in CSV format.

Starting with 2.1, Neo4j includes a LOAD CSV [Neo4j Docs] Cypher clause for data import, which is a powerful ETL tool:

It can load a CSV file from the local filesystem or from a remote URI (e.g. S3, Dropbox, GitHub, etc.)
It can perform multiple operations in a single statement
It can be combined with USING PERIODIC COMMIT to group the operations on multiple rows in transactions to load large amounts of data [Neo4j Docs]
Input data is mapped directly into a complex graph structure as outlined by the user
It’s possible to manipulate or compute values at runtime
It allows merging existing data (nodes, relationships, properties) rather than just adding it to the store

Steps

Have your graph data model ready

Before running the import process you will need to know how you want to map your data onto the graph. What are the nodes and relationships, and which properties will they have?

Tune cache and heap configuration

Make sure to increase the heap size generously, especially if importing large datasets, and also make sure the file buffer caches are large enough to fit the entire dataset.

You can estimate the size of your dataset on disk after the import by using the table in the official Neo4j docs.

Let’s assume we are going to store 100K nodes, 1M relationships and one fixed-size property per node and per relationship (e.g. an integer), which makes 1.1M properties in total:

Node store: 100,000 * 15B = 1.5 MB
Relationship store: 1,000,000 * 34B = 34 MB
Property store: 1,100,000 * 41B = 45.1 MB

Those are the minimum values that you should use in your file buffer cache configuration.
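The arithmetic above can be reproduced with a short script. This is only a sketch: the per-record sizes (15, 34 and 41 bytes) are the ones quoted in this post, the function name is made up for illustration, and 1 MB is taken as 1,000,000 bytes. Check the Neo4j docs for the record sizes that apply to your version.

```python
# Per-record sizes in bytes, as quoted in the estimate above.
# Assumption: these match the Neo4j version you are running.
NODE_BYTES = 15
RELATIONSHIP_BYTES = 34
PROPERTY_BYTES = 41


def estimate_store_sizes(nodes, relationships, properties):
    """Return the estimated size of each store in MB (1 MB = 10**6 bytes)."""
    return {
        "node_store_mb": nodes * NODE_BYTES / 1e6,
        "relationship_store_mb": relationships * RELATIONSHIP_BYTES / 1e6,
        "property_store_mb": properties * PROPERTY_BYTES / 1e6,
    }


# 100K nodes, 1M relationships, one property on each of them.
sizes = estimate_store_sizes(100_000, 1_000_000, 100_000 + 1_000_000)
for store, megabytes in sizes.items():
    print(f"{store}: {megabytes:.1f}")
```

Plugging in your own node, relationship and property counts gives a lower bound for each cache setting.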

Set up indexes and constraints

Indexes will make lookups faster during and after the load process. Make sure to include an index for every property used to locate nodes in MERGE queries.

An index can be created with the CREATE INDEX clause. Example:

CREATE INDEX ON :User(name);

If a property must be unique, adding a constraint will also implicitly create an index. For example, if we want to make sure we don’t store any duplicate user nodes, we could use a constraint on the email property.

CREATE CONSTRAINT ON (u:User) ASSERT u.email IS UNIQUE;

Loading and mapping data

The easiest way to load data from CSV is to use the LOAD CSV statement. It supports the usual options, such as accessing values by column header or column index and configuring the field terminator character. Please refer to the official docs for further details.

To speed up the process, make sure to use USING PERIODIC COMMIT, which groups the operations on multiple rows (1000 by default) into a single transaction and reduces the number of times Neo4j has to hit the disk to commit the changes.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///tmp/users.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (u:User { email: csvLine.email, username: csvLine.username, name: csvLine.name });

Please note that values are read as strings, so make sure you convert them where appropriate, e.g. with toInt(csvLine.columnName) when loading integer numbers.
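Since every value arrives as a string, it can help to check the file before the import for values that won't convert cleanly. The following is a rough sketch rather than part of the Neo4j tooling: the function name, the sample data and the age column are all hypothetical, and Python's int() is only an approximation of what toInt() accepts.

```python
import csv
import io


def find_non_integer_rows(csv_file, column, delimiter=";"):
    """Return (line_number, value) pairs whose value in `column` is not an integer."""
    bad_rows = []
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    # Start at 2 because line 1 of the file is the header row.
    # Line numbers assume no quoted field spans several lines.
    for line_number, row in enumerate(reader, start=2):
        value = (row[column] or "").strip()
        try:
            int(value)
        except ValueError:
            bad_rows.append((line_number, value))
    return bad_rows


# Hypothetical sample using the same ';' field terminator as the import above.
sample = io.StringIO(
    "email;username;name;age\n"
    "ann@example.com;ann;Ann;34\n"
    "bob@example.com;bob;Bob;unknown\n"
)
problems = find_non_integer_rows(sample, "age")
print(problems)
```

For a real file, pass an open file object instead of the in-memory sample, e.g. open("/tmp/users.csv", newline="").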

The load process can be run from the Neo4j shell, either interactively, or by loading the Cypher code from a file using the option -file filename and -q to quit when finished.

Alternatively, the code can be entered manually into the shell or the browser UI.

Considerations for GrapheneDB users

A few considerations when loading data into your GrapheneDB Neo4j instance:

Caches and heap can only be configured on the Standard plans and higher; they are fixed on the lower-end plans.
neo4j-shell does not support authentication, so it can’t be used to load data into an instance hosted on GrapheneDB or otherwise secured with authentication credentials.
When running the command from the browser UI, bear in mind that Neo4j won’t be able to access your local filesystem. You should provide a publicly available URL instead, e.g. a file hosted on AWS S3.
For larger datasets, we recommend running the import process locally and, once it has completed, performing a restore on your GrapheneDB instance.

For a comprehensive tutorial, including tools to clean up the CSV files, common pitfalls and more advanced tools like the super fast batch importer, please refer to this CSV import guide.

Please don’t hesitate to post any comments or contact our support team if you are having issues loading data into your GrapheneDB instance.
