Neo4j Admin import

This tutorial provides detailed examples to illustrate the capabilities of importing data from CSV files with the command neo4j-admin import.

The neo4j-admin import command loads large amounts of data from CSV files into an unused database. It can only be run once against an unused database, so it is suited to initial graph population only. The command can be used on the local Neo4j instance whether or not the instance is running.

The neo4j-admin import command does not create a database; it only imports data and makes it available to the database. You can create the database either before or after running neo4j-admin import. If the database already exists, it must be in a state where no data has been introduced.

Relationships are created by connecting node IDs, so each node should have a unique ID that can be referenced when creating relationships between nodes. In the examples below, the node IDs are stored as properties on the nodes. If you do not want the IDs to persist as properties after the import completes, do not specify a property name in the :ID field.

The following information is useful when working through the examples:

  • The details of CSV file header format can be found at CSV header format.

  • To show available databases, use the Cypher query SHOW DATABASES against the system database.

  • To remove a database, use the Cypher query DROP DATABASE database_name against the system database.

  • To create a database, use the Cypher query CREATE DATABASE database_name against the system database.

1. Import a small data set

In this example you will import a small data set containing nodes and relationships. This example introduces the neo4j-admin import command with a basic setup of CSV files for the data set. The data set is split into three CSV files, each of which has a header row describing the data.

It is possible to split the data set into several files and also have the header row in a specific file for ease of working with large data sets. It is also possible to define the label for nodes and type for relationships as an optional argument.

The data

The data set contains information about movies, actors, and roles. Data for movies and actors are stored as nodes and the roles are stored as relationships.

The files you want to import data from are:

  • movies.csv

  • actors.csv

  • roles.csv

Each movie in movies.csv has an ID, a title, and a year, stored as properties in the node. All the nodes in movies.csv also have the label Movie. A node can have several labels; in movies.csv, some nodes also have the label Sequel. Node labels are optional, but they are very useful for grouping nodes into sets, where all nodes with a certain label belong to the same set.

movies.csv
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

The actors data in actors.csv consists of an ID and a name, stored as properties in the node. The ID in this case is a shorthand of the actor's name. All the nodes in actors.csv have the label Actor.

actors.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

The roles data in roles.csv have only one property, role. Roles are represented by relationship data that connects actor nodes with movie nodes.

There are three mandatory fields for relationship data:

  1. :START_ID, the ID of the node where the relationship starts.

  2. :END_ID, the ID of the node where the relationship ends.

  3. :TYPE, the relationship type.

In order to create a relationship between two nodes, the IDs defined in actors.csv and movies.csv are used for the :START_ID and :END_ID fields. You also need to provide a relationship type (in this case ACTED_IN) for the :TYPE field.

roles.csv
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

  • Paths to node data are defined with the --nodes option.

  • Paths to relationship data are defined with the --relationships option.

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies.csv --nodes=import/actors.csv --relationships=import/roles.csv

Query the data

To query the data, start Neo4j.

The default username and password are both neo4j.

shell
bin/neo4j start

To query the imported data in the graph, try a simple Cypher query.

shell
bin/cypher-shell --database=neo4j "MATCH (n) RETURN count(n) as nodes"

Stop Neo4j.

shell
bin/neo4j stop

2. CSV file delimiters

We can customize the configuration options that the import tool uses (see Options) if our data does not fit the default format.

The details of CSV file header format can be found at CSV header format.

The data

The following CSV files are formatted for these option values:

  • --delimiter=";"

  • --array-delimiter="|"

  • --quote="'"

movies2.csv
movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel
actors2.csv
personId:ID;name;:LABEL
keanu;'Keanu Reeves';Actor
laurence;'Laurence Fishburne';Actor
carrieanne;'Carrie-Anne Moss';Actor
roles2.csv
:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
keanu;'Neo';tt0242653;ACTED_IN
laurence;'Morpheus';tt0133093;ACTED_IN
laurence;'Morpheus';tt0234215;ACTED_IN
laurence;'Morpheus';tt0242653;ACTED_IN
carrieanne;'Trinity';tt0133093;ACTED_IN
carrieanne;'Trinity';tt0234215;ACTED_IN
carrieanne;'Trinity';tt0242653;ACTED_IN
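
Before running the import, it can help to confirm that every row splits into the same number of fields as the header under the chosen delimiter. The following is a minimal sketch and not part of the import tool: it recreates movies2.csv in a scratch directory (the directory and the variable names bad_rows and labels are illustrative) and uses awk with ; as the field separator. It assumes that no field contains the delimiter inside quotes, which plain awk splitting would not handle.

```shell
# Recreate movies2.csv in a scratch directory (illustrative location).
workdir=$(mktemp -d)
cat > "$workdir/movies2.csv" <<'EOF'
movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel
EOF

# Report rows whose field count differs from the header's field count.
# Assumption: the delimiter never occurs inside a quoted value.
bad_rows=$(awk -F';' 'NR == 1 { n = NF; next } NF != n { print FILENAME ":" NR }' "$workdir/movies2.csv")

# Split the :LABEL value of the second data row on the array delimiter.
labels=$(awk -F';' 'NR == 3 { print $4 }' "$workdir/movies2.csv" | tr '|' '\n')

echo "rows with unexpected field count: ${bad_rows:-none}"
echo "$labels"
```

If a row is reported here, the value passed to --delimiter probably does not match the file, or a quoted value contains the delimiter.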

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --delimiter=";" --array-delimiter="|" --quote="'" --nodes=import/movies2.csv --nodes=import/actors2.csv --relationships=import/roles2.csv

3. Using separate header files

When dealing with very large CSV files it is more convenient to have the header in a separate file. This makes it easier to edit the header as you avoid having to open a huge data file just to change it. The header file must be specified before the rest of the files in each file group.

The import tool can also process single-file compressed archives, for example:

  • --nodes=import/nodes.csv.gz

  • --relationships=import/relationships.zip

The data

We will use the same data set as in the previous example but put the headers in separate files.

movies3-header.csv
movieId:ID,title,year:int,:LABEL
movies3.csv
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors3-header.csv
personId:ID,name,:LABEL
actors3.csv
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
roles3-header.csv
:START_ID,role,:END_ID,:TYPE
roles3.csv
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
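
If a data set currently has its header as the first line of the data file, the two can be separated with standard tools. The following is a minimal sketch, assuming a POSIX shell with head and tail and a header that occupies exactly one line. The scratch directory is illustrative, and the input is the movies.csv file from the first example.

```shell
# Recreate movies.csv from the first example in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/movies.csv" <<'EOF'
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
EOF

# The first line becomes the dedicated header file.
head -n 1 "$workdir/movies.csv" > "$workdir/movies3-header.csv"
# Everything from line 2 onward becomes the data file.
tail -n +2 "$workdir/movies.csv" > "$workdir/movies3.csv"

header_lines=$(wc -l < "$workdir/movies3-header.csv" | tr -d ' ')
data_lines=$(wc -l < "$workdir/movies3.csv" | tr -d ' ')
echo "header lines: $header_lines, data lines: $data_lines"
```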

Importing the data

The header line for a file group, whether it is the first line of a file in the group or a dedicated header file, must be the first line in the file group.

The call to neo4j-admin import would look as follows. Note how each file group lists the header file first, followed by the data file, separated by a comma:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies3-header.csv,import/movies3.csv --nodes=import/actors3-header.csv,import/actors3.csv --relationships=import/roles3-header.csv,import/roles3.csv

4. Multiple input files

In addition to using a separate header file, you can also provide multiple nodes or relationships files. Files within such an input group can be specified with multiple match strings, delimited by a comma (,), where each match string is either an exact file name or a regular expression matching one or more files. Multiple matching files are sorted by the characters in their names, with numbers inside file names sorted in natural numeric order.

The data

movies4-header.csv
movieId:ID,title,year:int,:LABEL
movies4-part1.csv
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
movies4-part2.csv
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors4-header.csv
personId:ID,name,:LABEL
actors4-part1.csv
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
actors4-part2.csv
carrieanne,"Carrie-Anne Moss",Actor
roles4-header.csv
:START_ID,role,:END_ID,:TYPE
roles4-part1.csv
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
roles4-part2.csv
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv --nodes=import/actors4-header.csv,import/actors4-part1.csv,import/actors4-part2.csv --relationships=import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv

Regular expressions

File names can be specified using regular expressions in order to simplify using the command line when there are many data source files. Each file name that matches the regular expression will be included.

As mentioned in a previous section, for the import to work correctly, the header file must be first in the file group. When using regular expressions to specify the input files, the list of files will be sorted according to the names of the files that match the expression. The matching is aware of numbers inside the file names and will sort them accordingly, without the need for padding with zeros.

Example 1. Match order

For example, let’s assume that we have the following files:

  • movies4-header.csv

  • movies4-data1.csv

  • movies4-data2.csv

  • movies4-data12.csv

If we use the regular expression movies4.*, the sorting will place the header file last and the import will fail. A better alternative is to name the header file explicitly and use a regular expression that only matches the names of the data files. For example, --nodes "import/movies4-header.csv,import/movies4-data.*" will accomplish this.

Using regular expressions, the call to neo4j-admin import for the files in this section can be simplified to:

shell
bin/neo4j-admin import --database=neo4j --nodes="import/movies4-header.csv,import/movies4-part.*" --nodes="import/actors4-header.csv,import/actors4-part.*" --relationships="import/roles4-header.csv,import/roles4-part.*"

The use of regular expressions should not be confused with file globbing.

The expression .* means "zero or more occurrences of any character except line break". Therefore, the regular expression movies4.* matches all files whose names start with movies4. In contrast, with file globbing, ls movies4.* lists only files whose names start with movies4 followed by a literal period.

Another important difference is the sorting order. Regular expression matching places the file movies4-part2.csv before the file movies4-part12.csv. In contrast, running ls movies4-part* in a directory containing both files lists movies4-part12.csv before movies4-part2.csv.
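
Both differences can be observed without running an import. The sketch below only imitates the two behaviours with standard tools and is not how neo4j-admin import is implemented. It assumes a sort that supports the -V (version sort) option, as in GNU coreutils, to approximate number-aware ordering, and it uses grep -E to stand in for regular expression matching on file names. The scratch directory, the extra file movies4-part12.csv, and the variable names are illustrative.

```shell
# Create empty files with the names under discussion.
workdir=$(mktemp -d)
touch "$workdir/movies4-header.csv" "$workdir/movies4-part1.csv" \
      "$workdir/movies4-part2.csv" "$workdir/movies4-part12.csv"

# Regular expression movies4.* matches every name starting with movies4.
regex_matches=$(ls "$workdir" | grep -c -E '^movies4.*$')

# Glob movies4.* needs a literal period right after movies4, so nothing matches.
glob_matches=0
for f in "$workdir"/movies4.*; do
  if [ -e "$f" ]; then
    glob_matches=$((glob_matches + 1))
  fi
done

# Plain character order, as in the C locale: part12 comes before part2.
lexical=$(ls "$workdir" | grep -E '^movies4-part' | LC_ALL=C sort | paste -s -d ' ' -)
# Number-aware order (assumes sort -V is available): part2 comes before part12.
natural=$(ls "$workdir" | grep -E '^movies4-part' | LC_ALL=C sort -V | paste -s -d ' ' -)

echo "regex matches: $regex_matches, glob matches: $glob_matches"
echo "lexical: $lexical"
echo "natural: $natural"
```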

5. Using the same label for every node

If you want to use the same node label(s) for every node in your nodes file, specify the label(s) as part of the --nodes option to neo4j-admin import. There is then no need to include a :LABEL column in the header, and the labels given on the command line are applied to every row (node).

Example 2. Specify node labels option

--nodes=LabelOne:LabelTwo=import/example-header.csv,import/example-data1.csv

It is possible to apply both the label provided in the file and the one provided on the command line to the node.

The data

In this example we want to have the label Movie on every node specified in movies5a.csv, and we put the labels Movie and Sequel on the nodes specified in sequels5a.csv.

movies5a.csv
movieId:ID,title,year:int
tt0133093,"The Matrix",1999
sequels5a.csv
movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003
actors5a.csv
personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"
roles5a.csv
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=Movie=import/movies5a.csv --nodes=Movie:Sequel=import/sequels5a.csv --nodes=Actor=import/actors5a.csv --relationships=import/roles5a.csv

6. Using the same relationship type for every relationship

If you want to use the same relationship type for every relationship in your relationships file this can be done by specifying the appropriate value as an option to neo4j-admin import.

Example 3. Specify relationship type option

--relationships=TYPE=import/example-header.csv,import/example-data1.csv

If you provide a relationship type both on the command line and in the relationships file, the one in the file will be applied.

The data

In this example we want the relationship type ACTED_IN to be applied on every relationship specified in roles5b.csv.

movies5b.csv
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors5b.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
roles5b.csv
:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
carrieanne,"Trinity",tt0234215
carrieanne,"Trinity",tt0242653

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies5b.csv --nodes=import/actors5b.csv --relationships=ACTED_IN=import/roles5b.csv

7. Properties

Nodes and relationships can have properties. Property types are specified in the CSV header row; see CSV header format.

The data

The following example creates a small graph containing one actor and one movie connected by one relationship.

There is a roles property on the relationship which contains an array of the characters played by the actor in a movie:

movies6.csv
movieId:ID,title,year:int,:LABEL
tt0099892,"Joe Versus the Volcano",1990,Movie
actors6.csv
personId:ID,name,:LABEL
meg,"Meg Ryan",Actor
roles6.csv
:START_ID,roles:string[],:END_ID,:TYPE
meg,"DeDe;Angelica Graynamore;Patricia Graynamore",tt0099892,ACTED_IN
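
To see how the roles value is divided into array elements, the sketch below splits the quoted field from roles6.csv on the default array delimiter ; using tr. It only mimics the splitting for illustration, and the variable names are hypothetical.

```shell
# The roles:string[] value from roles6.csv, without the surrounding quotes.
roles_field='DeDe;Angelica Graynamore;Patricia Graynamore'

# Split on the array delimiter ';' to get one element per line.
elements=$(printf '%s\n' "$roles_field" | tr ';' '\n')
element_count=$(printf '%s\n' "$elements" | wc -l | tr -d ' ')

echo "$elements"
echo "number of array elements: $element_count"
```

The whole value is quoted in the CSV file so that it is read as a single field, which is then split into the individual array elements.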

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies6.csv --nodes=import/actors6.csv --relationships=import/roles6.csv

8. ID space

The import tool makes the assumption that identifiers are unique across node files. This may not be the case for data sets which use sequential, auto incremented or otherwise colliding identifiers. Those data sets can define ID spaces where identifiers are unique within their respective ID space.

In cases where the node ID is only unique within files, using ID spaces is a way to ensure uniqueness across all nodes files. See Using ID spaces.

Each node processed by neo4j-admin import must provide an ID if it is to be connected in any relationships. The node ID is used to find the start node and end node when creating a relationship.

Example 4. ID space

To define an ID space Movie-ID for movieId:ID, use the syntax movieId:ID(Movie-ID).

The data

For example, if movies and people both use sequential identifiers, then we would define the Movie-ID and Actor-ID ID spaces.

movies7.csv
movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
actors7.csv
personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor

We also need to reference the appropriate ID space in our relationships file so that the import tool knows which nodes to connect.

roles7.csv
:START_ID(Actor-ID),role,:END_ID(Movie-ID)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3
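
The effect of ID spaces can be illustrated by counting collisions with and without the ID space as part of the key. The following is a rough sketch using awk, not the import tool's own logic: it reads the ID space name from the parentheses in the first header field and assumes that the ID is always the first column and that no quoted value contains a comma. The scratch directory and variable names are illustrative.

```shell
# Recreate the two node files in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/movies7.csv" <<'EOF'
movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
EOF
cat > "$workdir/actors7.csv" <<'EOF'
personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
EOF

# Count IDs that collide when compared raw, and when qualified by ID space.
summary=$(awk -F',' '
  FNR == 1 {
    # Take the ID space name from between the parentheses of the first header field.
    space = $1
    sub(/^[^(]*\(/, "", space)
    sub(/\).*$/, "", space)
    next
  }
  { raw[$1]++; qualified[space "/" $1]++ }
  END {
    for (k in raw) if (raw[k] > 1) raw_dups++
    for (k in qualified) if (qualified[k] > 1) qualified_dups++
    print raw_dups + 0, qualified_dups + 0
  }
' "$workdir/movies7.csv" "$workdir/actors7.csv")

echo "colliding raw IDs, colliding (ID space, ID) pairs: $summary"
```

The first number counts IDs that appear in both files when the ID space is ignored; the second counts collisions once each ID is paired with its ID space.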

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies7.csv --nodes=import/actors7.csv --relationships=ACTED_IN=import/roles7.csv

9. Skip relationships referring to missing nodes

By default, the import tool has no tolerance for bad entities (relationships or nodes) and fails the import on the first bad entity. You can explicitly specify that it should ignore rows that contain bad entities.

There are two different types of bad input:

  1. Bad relationships.

  2. Bad nodes.

Relationships that refer to missing node IDs, either for :START_ID or :END_ID, are considered bad relationships. Whether such relationships are skipped is controlled by the --skip-bad-relationships flag, which accepts the value true or false; specifying the flag without a value means true. The default is false, which means that any bad relationship is considered an error and fails the import. For more information, see the --skip-bad-relationships option.

The data

In the following example there is a missing emil node referenced in the roles file.

movies8a.csv
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors8a.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
roles8a.csv
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
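
A relationship like this can also be found before the import is attempted. The following pre-flight check is a sketch under stated assumptions, not a feature of neo4j-admin import: it assumes a single global ID space, node IDs in the first column of each node file, :START_ID and :END_ID in columns 1 and 3 of the relationships file, and no commas inside quoted values. The scratch directory and the variable name missing are illustrative.

```shell
# Recreate the three files in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/movies8a.csv" <<'EOF'
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
EOF
cat > "$workdir/actors8a.csv" <<'EOF'
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
EOF
cat > "$workdir/roles8a.csv" <<'EOF'
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
EOF

# Collect node IDs from the first two files, then check the relationships file.
missing=$(awk -F',' '
  FNR == 1 { file++; next }            # skip each header line, count input files
  file <= 2 { ids[$1] = 1; next }      # node files: column 1 holds the :ID value
  {
    name = FILENAME
    sub(/.*\//, "", name)              # keep only the file name for the message
    if (!($1 in ids)) print name ":" FNR ": missing start node " $1
    if (!($3 in ids)) print name ":" FNR ": missing end node " $3
  }
' "$workdir/movies8a.csv" "$workdir/actors8a.csv" "$workdir/roles8a.csv")

echo "${missing:-all relationships refer to existing nodes}"
```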

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv

Since there is a bad relationship in the input data, the import process fails completely.

Let’s see what happens if we append the --skip-bad-relationships flag:

shell
bin/neo4j-admin import --database=neo4j --skip-bad-relationships --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv

The data files are successfully imported and the bad relationship is ignored. An entry is written to the import.report file.

ignore bad relationships
InputRelationship:
   source: roles8a.csv:11
   properties: [role, Emil]
   startNode: emil (global id space)
   endNode: tt0133093 (global id space)
   type: ACTED_IN
 referring to missing node emil

10. Skip nodes with same ID

Nodes that specify an :ID that has already been specified within the ID space are considered bad nodes. Whether such nodes are skipped is controlled by the --skip-duplicate-nodes flag, which accepts the value true or false; specifying the flag without a value means true. The default is false, which means that any duplicate node is considered an error and fails the import. For more information, see the --skip-duplicate-nodes option.

The data

In the following example there is a node ID, laurence, that is specified twice within the same ID space.

actors8b.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
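
Duplicate IDs within a single node file can be spotted ahead of time with a short awk script. This is an illustrative sketch rather than the import tool's own check: it assumes the ID is the first column, a single global ID space, and no commas inside quoted values. The scratch directory and the variable name duplicates are hypothetical.

```shell
# Recreate actors8b.csv in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/actors8b.csv" <<'EOF'
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
EOF

# Remember the line where each ID first appears and report any repeats.
duplicates=$(awk -F',' '
  NR == 1 { next }
  $1 in seen { print $1 " first seen on line " seen[$1] ", repeated on line " NR; next }
  { seen[$1] = NR }
' "$workdir/actors8b.csv")

echo "${duplicates:-no duplicate IDs found}"
```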

Importing the data

The call to neo4j-admin import would look like this:

shell
bin/neo4j-admin import --database=neo4j --nodes=import/actors8b.csv

Since there is a bad node in the input data, the import process fails completely.

Let’s see what happens if we append the --skip-duplicate-nodes flag:

shell
bin/neo4j-admin import --database=neo4j --skip-duplicate-nodes --nodes=import/actors8b.csv

The data files are successfully imported and the bad node is ignored. An entry is written to the import.report file.

ignore bad nodes
ID 'laurence' is defined more than once in global ID space, at least at actors8b.csv:3 and actors8b.csv:5