By Neo4j Staff | January 14, 2015
Importing Data Into Neo4j via CSV

Originally posted on the GrapheneDB Blog

This post will explain how to import data from a CSV file into Neo4j. After outlining the steps to take, we list some special considerations for GrapheneDB users.

One of the most important steps when evaluating a new technology for your stack is importing existing data. CSV is one of the most popular standards for data exchange and most of the popular database engines support exporting data in CSV format.

Starting with 2.1, Neo4j includes a LOAD CSV [Neo4j Docs] Cypher clause for data import, which is a powerful ETL tool:
- It can load a CSV file from the local filesystem or from a remote URI (e.g. S3, Dropbox, GitHub, etc.)
- It can perform multiple operations in a single statement
- It can be combined with USING PERIODIC COMMIT to group the operations on multiple rows into transactions when loading large amounts of data [Neo4j Docs]
- Input data is mapped directly into a complex graph structure as outlined by the user
- It’s possible to manipulate or compute values at runtime
- It allows merging existing data (nodes, relationships, properties) rather than just adding it to the store
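
As a minimal sketch of several of these points together, the statement below loads a remote file, computes a value at runtime and merges nodes and relationships in a single statement. The URL, the column names (person_id, first_name, last_name, company) and the Person/Company/WORKS_AT model are hypothetical, chosen only for illustration.

```cypher
// Hypothetical CSV with header: person_id,first_name,last_name,company
LOAD CSV WITH HEADERS FROM "https://example.com/people.csv" AS row
// MERGE reuses an existing node with this id instead of creating a duplicate
MERGE (p:Person {id: row.person_id})
// Value computed at runtime from two CSV columns
ON CREATE SET p.name = row.first_name + " " + row.last_name
MERGE (c:Company {name: row.company})
// Map each row onto the graph structure: a person works at a company
MERGE (p)-[:WORKS_AT]->(c);
```

Because every write uses MERGE, running the statement a second time over the same file should not create duplicate nodes or relationships.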
Have your graph data model ready

Before running the import process you will need to know how you want to map your data onto the graph. What are the nodes and relationships, and which properties will they have?
Tune cache and heap configuration

Make sure to increase the heap size generously, especially when importing large datasets, and also make sure the file buffer caches fit the entire dataset. You can estimate the size of your dataset on disk after the import by using the table in the official Neo4j docs. Let’s assume we are going to store 100K nodes, 1M relationships and one fixed-size property per node and per relationship (e.g. an integer number):
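- Node store: 100,000 * 15 B = 1.5 MB
- Relationship store: 1,000,000 * 34 B = 34 MB
- Property store: 1,100,000 * 41 B = 45.1 MB

That adds up to roughly 80.6 MB on disk, which is the amount the file buffer caches should be able to hold.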
Set up indexes and constraints

Indexes will make lookups faster during and after the load process. Make sure to include an index for every property used to locate nodes in MERGE queries. An index can be created with the CREATE INDEX clause.
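
A short sketch of what this can look like, using the Neo4j 2.x syntax current when this post was written (later versions changed it). The labels and property names are hypothetical.

```cypher
// Index on the property used to locate Person nodes in MERGE
CREATE INDEX ON :Person(id);

// A uniqueness constraint also creates a backing index, and it
// rejects duplicate Company names during the load
CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
```

Run each statement separately, and create indexes and constraints before starting the import so that MERGE lookups can use them from the first row.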
Loading and mapping data

The easiest way to load data from CSV is to use the LOAD CSV statement. It supports common options, such as accessing values by column header or by column index and configuring the field terminator character. Please refer to the official docs for further details. To speed up the process, make sure to use USING PERIODIC COMMIT, which will group multiple operations (by default 1000 rows) into transactions and reduce the number of times Neo4j has to hit the disk to commit the changes.
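
For illustration, a periodic-commit load might look like the sketch below. The file path, column names and the batch size of 500 are assumptions made for the example, not recommendations.

```cypher
// Commit after every 500 rows instead of the default 1000
USING PERIODIC COMMIT 500
// Hypothetical local file with header: person_id,name
LOAD CSV WITH HEADERS FROM "file:///tmp/people.csv" AS row
CREATE (:Person {id: row.person_id, name: row.name});
```

Leaving out the number after USING PERIODIC COMMIT keeps the default batch size.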
LOAD CSV reads every field as a string, so use toInt(csv.columns) when loading integer numbers.

The load process can be run from the Neo4j shell, either interactively, or by loading the Cypher code from a file using the option -q to quit when finished. Alternatively, the code can be entered manually into the shell or the browser UI.
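
The following sketch shows the integer conversion together with access by column index and a custom field terminator. The file, its semicolon-separated layout (a person id followed by a rating) and the rating property are hypothetical.

```cypher
// Hypothetical file without a header line, fields separated by ';'
// Each row arrives as a list of strings: line[0] = person id, line[1] = rating
LOAD CSV FROM "file:///tmp/ratings.csv" AS line FIELDTERMINATOR ';'
MATCH (p:Person {id: line[0]})
// Without toInt() the rating would be stored as a string
SET p.rating = toInt(line[1]);
```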
Considerations for GrapheneDB users

A few considerations when loading data into your GrapheneDB Neo4j instance:
- Caches and heap can only be configured on the Standard plans and higher. They are fixed on the lower-end plans
- neo4j-shell does not support authentication and thus it can’t be used to load data into an instance hosted on GrapheneDB or otherwise secured with authentication credentials
- When running the command from the browser UI, bear in mind Neo4j won’t be able to access your filesystem. You should provide a publicly available URL instead, e.g. a file hosted on AWS S3
- For larger datasets, we recommend running the import process locally and, once it has completed, performing a restore on your GrapheneDB instance