Import

There are two ways to import data into a Neo4j database. You can use the Cypher command LOAD CSV or the neo4j-admin database import command.

With LOAD CSV, you can import small to medium-sized CSV files into an existing database. LOAD CSV can be run as many times as needed and does not require an empty database.

With the neo4j-admin database import command, you can do batch imports of large amounts of data from CSV files. It is generally faster than LOAD CSV because it is run against a stopped or a non-existent empty database.

The neo4j-admin database import command has two modes:

  • full — used to initially import data into a non-existent empty database.

  • incremental — used to incrementally import data into an existing database.

The user running neo4j-admin database import must have WRITE capabilities into server.directories.data and server.directories.log.

This section describes the neo4j-admin database import command.

For information on LOAD CSV, see the Cypher Manual → LOAD CSV. For in-depth examples of using the command neo4j-admin database import, refer to the Tutorials → Neo4j Admin import.

These are some things you need to keep in mind when creating your input files:

  • Fields are comma-separated by default but a different delimiter can be specified.

  • All files must use the same delimiter.

  • Multiple data sources can be used for both nodes and relationships.

  • A data source can optionally be provided using multiple files.

  • A separate file with a header that provides information on the data fields must be the first specified file of each data source.

  • Fields without corresponding information in the header will not be read.

  • UTF-8 encoding is used.

  • By default, the importer trims extra whitespace at the beginning and end of strings. Quote your data to preserve leading and trailing whitespaces.
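The trimming behavior can be illustrated with a small Python sketch (Python's csv module stands in for whatever tool produces your files; the importer itself is not involved): writing a field with quoting keeps whitespace that the importer would otherwise trim away.

```python
import csv
import io

# Write one row where the name carries a significant trailing space.
# Quoting the field tells the importer not to trim it; an unquoted
# field would have its leading/trailing whitespace removed.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["p1", "Jane Doe "])  # trailing space is significant

print(buf.getvalue().strip())
# -> "p1","Jane Doe "
```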

Indexes and constraints

Indexes and constraints are not created during the import. Instead, you have to add these afterward (see Cypher Manual → Indexes).

Full import

Syntax

The syntax for importing a set of CSV files is:

neo4j-admin database import full [-h]
                                 [--expand-commands]
                                 [--verbose]
                                 [--auto-skip-subsequent-headers[=true|false]]
                                 [--ignore-empty-strings[=true|false]]
                                 [--ignore-extra-columns[=true|false]]
                                 [--legacy-style-quoting[=true|false]]
                                 [--multiline-fields[=true|false]]
                                 [--normalize-types[=true|false]]
                                 [--overwrite-destination[=true|false]]
                                 [--skip-bad-entries-logging[=true|false]]
                                 [--skip-bad-relationships[=true|false]]
                                 [--skip-duplicate-nodes[=true|false]]
                                 [--trim-strings[=true|false]]
                                 [--additional-config=<file>]
                                 [--array-delimiter=<char>]
                                 [--bad-tolerance=<num>]
                                 [--delimiter=<char>]
                                 [--format=<format>]
                                 [--high-parallel-io=on|off|auto]
                                 [--id-type=string|integer|actual]
                                 [--input-encoding=<character-set>]
                                 [--max-off-heap-memory=<size>]
                                 [--quote=<char>]
                                 [--read-buffer-size=<size>]
                                 [--report-file=<path>]
                                 [--threads=<num>]
                                 --nodes=[<label>[:<label>]...=]<files>...
                                 [--nodes=[<label>[:<label>]...=]<files>...]...
                                 [--relationships=[<type>=]<files>...]...
                                 <database>

Examples

Example 1. Import data from CSV files

Assume that you have formatted your data as per CSV header format so that you have it in six different files:

  1. movies_header.csv

  2. movies.csv

  3. actors_header.csv

  4. actors.csv

  5. roles_header.csv

  6. roles.csv

The following command imports the three datasets:

neo4j_home$ bin/neo4j-admin database import full --nodes import/movies_header.csv,import/movies.csv \
--nodes import/actors_header.csv,import/actors.csv \
--relationships import/roles_header.csv,import/roles.csv
Example 2. Import data from CSV files using regular expression

Assume that you want to include a header and then multiple files that match a pattern, e.g. files whose names contain numbers. In this case, a regular expression can be used. Groups of digits are guaranteed to be sorted in numeric order, as opposed to lexicographic order.

For example:

neo4j_home$ bin/neo4j-admin database import full --nodes import/node_header.csv,import/node_data_\d+\.csv
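The numeric ordering guarantee can be reproduced with a sort key that compares the digit group as an integer rather than as text (a sketch with hypothetical file names, not the importer's actual code):

```python
import re

# Hypothetical files matched by the pattern node_data_\d+\.csv.
files = ["node_data_10.csv", "node_data_2.csv", "node_data_1.csv"]

# Lexicographic order would place "10" before "2"; sorting on the digit
# group as an integer gives the numeric order the importer guarantees.
def numeric_key(name):
    return int(re.search(r"(\d+)", name).group(1))

print(sorted(files, key=numeric_key))
# -> ['node_data_1.csv', 'node_data_2.csv', 'node_data_10.csv']
```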
Example 3. Import data from CSV files using a more complex regular expression

If a regular expression pattern contains a comma, which is also the delimiter between files in a group, the pattern can be quoted to preserve it.

For example:

neo4j_home$ bin/neo4j-admin database import full --nodes import/node_header.csv,'import/node_data_\d{1,5}.csv'

If you import into a database that has not explicitly been created prior to the import, you must create it afterwards in order to use it.

Parameters and options

<database>

Name of the database to import. If the database does not exist prior to importing, you must create it subsequently using CREATE DATABASE.

Default: neo4j.

Some of the options below are marked as Advanced. These options should not be used for experimentation.

For more information, please contact Neo4j Professional Services.

Table 1. neo4j-admin database import full options
Option Description Default

--additional-config=<file>

Path to a configuration file that contains additional configuration options.

--array-delimiter=<char>

Delimiter character between array elements within a value in CSV data. Also accepts 'TAB' and e.g. 'U+20AC' for specifying the character using Unicode.

  • ASCII character — e.g. --array-delimiter=";".

  • \ID — Unicode character with ID, e.g. --array-delimiter="\59".

  • U+XXXX — Unicode character specified with 4 HEX characters, e.g. --array-delimiter="U+20AC".

  • \t — horizontal tabulation (HT), e.g. --array-delimiter="\t".

For horizontal tabulation (HT), use \t or the Unicode character ID \9.

Unicode character ID can be used if prepended by \.

;

--auto-skip-subsequent-headers

Automatically skip accidental header lines in subsequent files in file groups with more than one file.

false

--bad-tolerance=<num>

Number of bad entries before the import is considered failed.

This tolerance threshold is about relationships referring to missing nodes. Format errors in input data are still treated as errors.

1000

--delimiter=<char>

Determines the delimiter between values in CSV data.

  • ASCII character — e.g. --delimiter=",".

  • \ID — Unicode character with ID, e.g. --delimiter="\44".

  • U+XXXX — Unicode character specified with 4 HEX characters, e.g. --delimiter="U+20AC".

  • \t — horizontal tabulation (HT), e.g. --delimiter="\t".

For horizontal tabulation (HT), use \t or the Unicode character ID \9.

Unicode character ID can be used if prepended by \.

,

--expand-commands

Allow command expansion in config value evaluation.

--format=<format>

Name of the database format. The imported database will be created using the specified format, or the format from the configuration if none is specified.

-h, --help

Show this help message and exit.

--high-parallel-io[=on/off/auto]

Ignore environment-based heuristics and specify whether the target storage subsystem can support parallel IO with high throughput.

Typically this is on for SSDs, large raid arrays, and network-attached storage.

auto

--id-type=<string|integer|actual>

Each node must provide a unique ID in order to be used for creating relationships during the import.

Possible values are:

  • string — arbitrary strings for identifying nodes.

  • integer — arbitrary integer values for identifying nodes.

  • actual — actual node IDs. Advanced

string

--ignore-empty-strings[=<true/false>]

Determines whether or not empty string fields, such as "", from the input source are ignored (treated as null).

false

--ignore-extra-columns[=<true/false>]

Determines whether columns that are not specified in the header should be ignored during the import.

false

--input-encoding=<character-set>

Character set that input data is encoded in.

UTF-8

--legacy-style-quoting[=<true/false>]

Determines whether or not a backslash-escaped quote, e.g. \", is interpreted as an inner quote.

false

--max-off-heap-memory=<size>

Maximum off-heap memory that neo4j-admin can use for various data structures and caching to improve performance.

Values can be plain numbers such as 10000000, or 20G for 20 gigabytes. It can also be specified as a percentage of the available memory, for example 70%.

90%

--multiline-fields[=<true/false>]

Determines whether or not fields from the input source can span multiple lines, i.e. contain newline characters.

Setting --multiline-fields=true can severely degrade the performance of the importer. Therefore, use it with care, especially with large imports.

false

--nodes=[<label>[:<label>]…​=]<files>…​

Node CSV header and data.

  • Multiple files will be logically seen as one big file from the perspective of the importer.

  • The first line must contain the header.

  • Multiple data sources like these can be specified in one import, where each data source has its own header.

  • Files can also be specified using regular expressions.

--normalize-types[=<true/false>]

Determines whether or not to normalize property types to Cypher types, e.g. int becomes long and float becomes double.

true

--overwrite-destination[=<true/false>]

Deletes any existing database files prior to the import.

Use --overwrite-destination=true to delete all files of the specified database and then import new data. For example:

  • When using Neo4j Community Edition. Since the Community Edition only supports one database and does not support DROP DATABASE name, the only way to re-import data using neo4j-admin database import is to use --overwrite-destination=true.

  • When you first want to see how the data would get imported and maybe do some tweaking before you import your actual data. For example, you can first import a small batch of data (e.g., 1000 rows) and examine it. And then, tweak your actual data (e.g., 10 million rows) and use the option --overwrite-destination=true to re-import it.

false

--quote=<char>

Character to treat as quotation character for values in CSV data.

Quotes can be escaped as per RFC 4180 by doubling them, for example "" would be interpreted as a literal ".

You cannot escape using \.

"
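The RFC 4180 doubling rule can be seen with Python's csv module, which follows the same convention (a sketch to illustrate the rule; the importer itself is not involved):

```python
import csv
import io

# Per RFC 4180, a quote inside a quoted field is escaped by doubling it.
# The doubled quotes around Matrix are read back as single literal quotes.
row = next(csv.reader(io.StringIO('tt0133093,"The ""Matrix""",1999')))

print(row)
# -> ['tt0133093', 'The "Matrix"', '1999']
```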

--read-buffer-size=<size>

Size of each buffer for reading input data.

It must be at least large enough to hold the biggest single value in the input data. The value can be a plain number or a byte units string, e.g. 128k, 1m.

4194304

--relationships=[<type>=]<files>…​

Relationship CSV header and data.

  • Multiple files will be logically seen as one big file from the perspective of the importer.

  • The first line must contain the header.

  • Multiple data sources like these can be specified in one import, where each data source has its own header.

  • Files can also be specified using regular expressions.

--report-file=<path>

File in which to store the report of the csv-import.

The location of the import log file can be controlled using the --report-file option. If you run large imports of CSV files that have low data quality, the import log file can grow very large. For example, CSV files that contain duplicate node IDs, or that attempt to create relationships between non-existent nodes, could be classed as having low data quality. In these cases, you may wish to direct the output to a location that can handle the large log file.

If you are running on a UNIX-like system and you are not interested in the output, you can get rid of it altogether by directing the report file to /dev/null.

If you need to debug the import, it might be useful to collect the stack trace. This is done by using --verbose option.

import.report

--skip-bad-entries-logging[=<true/false>]

Determines whether or not to skip logging bad entries detected during import.

false

--skip-bad-relationships[=<true/false>]

Determines whether or not to skip importing relationships that refer to missing node IDs, i.e. either the start or end node ID/group refers to a node that was not specified by the node input data.

Skipped relationships will be logged, containing at most the number of entities specified by --bad-tolerance, unless otherwise specified by the --skip-bad-entries-logging option.

false

--skip-duplicate-nodes[=<true/false>]

Determines whether or not to skip importing nodes that have the same ID/group.

In the event of multiple nodes within the same group having the same ID, the first encountered will be imported, whereas consecutive such nodes will be skipped.

Skipped nodes will be logged, containing at most the number of entities specified by --bad-tolerance, unless otherwise specified by the --skip-bad-entries-logging option.

false
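The first-encountered-wins rule for duplicate nodes can be sketched in Python (hypothetical rows; ID groups are omitted for simplicity, and this is not the importer's actual code):

```python
# Hypothetical node rows: (id, name). With --skip-duplicate-nodes=true,
# the first node with a given ID is imported; later duplicates are skipped.
rows = [
    ("keanu", "Keanu Reeves"),
    ("keanu", "K. Reeves"),        # duplicate ID: skipped
    ("laurence", "Laurence Fishburne"),
]

imported, skipped = {}, []
for node_id, name in rows:
    if node_id in imported:
        skipped.append((node_id, name))  # would be logged, up to --bad-tolerance
    else:
        imported[node_id] = name

print(imported)
# -> {'keanu': 'Keanu Reeves', 'laurence': 'Laurence Fishburne'}
print(skipped)
# -> [('keanu', 'K. Reeves')]
```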

--threads=<num> Advanced

Max number of worker threads used by the importer.

Defaults to the number of available processors reported by the JVM. Because the importer needs a certain minimum number of threads, there is no lower bound for this value.

For optimal performance, this value shouldn’t be greater than the number of available processors.

--trim-strings[=<true/false>]

Determines whether or not strings should be trimmed for whitespaces.

false

--verbose

Enable verbose output.

Heap size for the import

Set the maximum heap size to a value appropriate for the import by defining the HEAP_SIZE environment variable before starting the import. For example, 2G is an appropriate value for smaller imports.

For imports on the order of 100 billion entities, 20G is an appropriate value.
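A minimal sketch of setting the heap before the import (the file paths and database name are placeholders for your own setup):

```shell
# 2G suits smaller imports; scale up for larger data sets.
# HEAP_SIZE is read by the neo4j-admin launcher.
export HEAP_SIZE=2G
bin/neo4j-admin database import full --nodes=import/movies_header.csv,import/movies.csv neo4j
```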

Record format

If your import data results in a graph that is larger than 34 billion nodes, 34 billion relationships, or 68 billion properties, you will need to configure the importer to use the high_limit record format. This is achieved by using the format option of the import command and setting the value to high_limit:

neo4j-admin database import full --format=high_limit

The high_limit format is available in Enterprise Edition only.

Incremental import

The neo4j-admin database import incremental command imports a large amount of data from CSV files into an existing, non-empty database.

Incremental import must be used with care. These options should not be used for experimentation.

You must pass the --force option with every incremental import command.

For more information, please contact Neo4j Professional Services.

Syntax

neo4j-admin database import incremental [-h]
                                        [--expand-commands]
                                        --force
                                        [--verbose]
                                        [--auto-skip-subsequent-headers[=true|false]]
                                        [--ignore-empty-strings[=true|false]]
                                        [--ignore-extra-columns[=true|false]]
                                        [--legacy-style-quoting[=true|false]]
                                        [--multiline-fields[=true|false]]
                                        [--normalize-types[=true|false]]
                                        [--skip-bad-entries-logging[=true|false]]
                                        [--skip-bad-relationships[=true|false]]
                                        [--skip-duplicate-nodes[=true|false]]
                                        [--trim-strings[=true|false]]
                                        [--additional-config=<file>]
                                        [--array-delimiter=<char>]
                                        [--bad-tolerance=<num>]
                                        [--delimiter=<char>]
                                        [--high-parallel-io=on|off|auto]
                                        [--id-type=string|integer|actual]
                                        [--input-encoding=<character-set>]
                                        [--max-off-heap-memory=<size>]
                                        [--quote=<char>]
                                        [--read-buffer-size=<size>]
                                        [--report-file=<path>]
                                        [--stage=all|prepare|build|merge]
                                        [--threads=<num>]
                                        --nodes=[<label>[:<label>]...=]<files>...
                                        [--nodes=[<label>[:<label>]...=]<files>...]...
                                        [--relationships=[<type>=]<files>...]...
                                        <database>

Usage and limitations

The incremental import command can be used to add:

  • New nodes with labels and properties.

  • New relationships between existing or new nodes.

The incremental import command cannot be used to:

  • Add new properties to existing nodes or relationships.

  • Update or delete properties in nodes or relationships.

  • Update or delete labels in nodes.

  • Delete existing nodes and relationships.

The importer works well on single instances. In clustering environments with multiple copies of the database, the updated database must be reseeded.

Examples

There are two ways of importing data incrementally:

  • If downtime is not a concern, you can run a single command with the option --stage=all. This option requires the database to be stopped.

  • If you cannot afford a full downtime of your database, you can run the import in three stages:

    • prepare stage:

      During this stage, the import tool analyzes the CSV headers and copies the relevant data over to the new increment database path. The import command is run with the option --stage=prepare and the database must be stopped.

    • build stage:

      During this stage, the import tool imports the data into the database. This is the longest stage and you can put the database in read-only mode to allow read access. The import command is run with the option --stage=build.

    • merge stage:

      During this stage, the import tool merges the new with the existing data in the database. It also updates the affected indexes and upholds the affected uniqueness constraints and property existence constraints. The import command is run with the option --stage=merge and the database must be stopped.

Example 4. Incremental import in a single command
neo4j@system> STOP DATABASE db1 WAIT;
...
$ bin/neo4j-admin database import incremental --stage=all --nodes=N1=../../raw-data/incremental-import/b.csv db1
Example 5. Incremental import in stages
  1. prepare stage:

    1. Stop the database with the WAIT option to ensure a checkpoint happens before you run the incremental import command. The database must be stopped to run --stage=prepare.

      neo4j@system> STOP DATABASE db1 WAIT;
    2. Run the incremental import command with the --stage=prepare option:

      $ bin/neo4j-admin database import incremental --stage=prepare --nodes=N1=../../raw-data/incremental-import/c.csv db1
  2. build stage:

    1. Put the database in read-only mode:

      ALTER DATABASE db1 SET ACCESS READ ONLY;
    2. Run the incremental import command with the --stage=build option:

      $ bin/neo4j-admin database import incremental --stage=build --nodes=N1=../../raw-data/incremental-import/c.csv db1
  3. merge stage:

    It is not necessary to include the --nodes or --relationships options when using --stage=merge.

    1. Stop the database with the WAIT option to ensure a checkpoint happens before you run the incremental import command.

      neo4j@system> STOP DATABASE db1 WAIT;
    2. Run the incremental import command with the --stage=merge option:

      $ bin/neo4j-admin database import incremental --stage=merge db1

Parameters and options

<database>

Name of the database to import. If the database does not exist prior to importing, you must create it subsequently using CREATE DATABASE.

Default: neo4j.

Table 2. neo4j-admin database import incremental options
Option Description Default

--additional-config=<file>

Configuration file with additional configuration.

--array-delimiter=<char>

Determines the array delimiter within a value in CSV data.

  • ASCII character — e.g. --array-delimiter=";".

  • \ID — Unicode character with ID, e.g. --array-delimiter="\59".

  • U+XXXX — Unicode character specified with 4 HEX characters, e.g. --array-delimiter="U+20AC".

  • \t — horizontal tabulation (HT), e.g. --array-delimiter="\t".

For horizontal tabulation (HT), use \t or the Unicode character ID \9.

Unicode character ID can be used if prepended by \.

;

--auto-skip-subsequent-headers[=<true/false>]

Automatically skip accidental header lines in subsequent files in file groups with more than one file.

false

--bad-tolerance=<num>

Number of bad entries before the import is considered failed.

This tolerance threshold is about relationships referring to missing nodes. Format errors in input data are still treated as errors.

1000

--delimiter=<char>

Determines the delimiter between values in CSV data.

  • ASCII character — e.g. --delimiter=",".

  • \ID — Unicode character with ID, e.g. --delimiter="\44".

  • U+XXXX — Unicode character specified with 4 HEX characters, e.g. --delimiter="U+20AC".

  • \t — horizontal tabulation (HT), e.g. --delimiter="\t".

For horizontal tabulation (HT), use \t or the Unicode character ID \9.

Unicode character ID can be used if prepended by \.

,

--expand-commands

Allow command expansion in config value evaluation.

--force

Confirm incremental import by setting this flag.

-h, --help

Show this help message and exit.

--high-parallel-io=on/off/auto

Ignore environment-based heuristics and specify whether the target storage subsystem can support parallel IO with high throughput.

Typically this is on for SSDs, large raid arrays, and network-attached storage.

auto

--id-type=string|integer|actual

Each node must provide a unique ID in order to be used for creating relationships during the import.

Possible values are:

  • string — arbitrary strings for identifying nodes.

  • integer — arbitrary integer values for identifying nodes.

  • actual — actual node IDs. Advanced

string

--ignore-empty-strings[=<true/false>]

Determines whether or not empty string fields, such as "", from the input source are ignored (treated as null).

false

--ignore-extra-columns[=<true/false>]

Determines whether columns that are not specified in the header should be ignored during the import.

false

--input-encoding=<character-set>

Character set that input data is encoded in.

UTF-8

--legacy-style-quoting[=<true/false>]

Determines whether or not backslash-escaped quote e.g. \" is interpreted as an inner quote.

false

--max-off-heap-memory=<size>

Maximum off-heap memory that neo4j-admin can use for various data structures and caching to improve performance.

Values can be plain numbers such as 10000000, or 20G for 20 gigabytes. It can also be specified as a percentage of the available memory, for example 70%.

90%

--multiline-fields[=<true/false>]

Determines whether or not fields from the input source can span multiple lines, i.e. contain newline characters.

Setting --multiline-fields=true can severely degrade the performance of the importer. Therefore, use it with care, especially with large imports.

false

--nodes=[<label>[:<label>]…​=]<files>…​

Node CSV header and data.

  • Multiple files will be logically seen as one big file from the perspective of the importer.

  • The first line must contain the header.

  • Multiple data sources like these can be specified in one import, where each data source has its own header.

  • Files can also be specified using regular expressions.

--normalize-types[=<true/false>]

Determines whether or not to normalize property types to Cypher types, e.g. int becomes long and float becomes double.

true

--quote=<char>

Character to treat as quotation character for values in CSV data.

Quotes can be escaped as per RFC 4180 by doubling them, for example "" would be interpreted as a literal ".

You cannot escape using \.

"

--read-buffer-size=<size>

Size of each buffer for reading input data.

It must be at least large enough to hold the biggest single value in the input data. The value can be a plain number or a byte units string, e.g. 128k, 1m.

4194304

--relationships=[<type>=]<files>…​

Relationship CSV header and data.

  • Multiple files will be logically seen as one big file from the perspective of the importer.

  • The first line must contain the header.

  • Multiple data sources like these can be specified in one import, where each data source has its own header.

  • Files can also be specified using regular expressions.

--report-file=<path>

File in which to store the report of the csv-import.

The location of the import log file can be controlled using the --report-file option. If you run large imports of CSV files that have low data quality, the import log file can grow very large. For example, CSV files that contain duplicate node IDs, or that attempt to create relationships between non-existent nodes, could be classed as having low data quality. In these cases, you may wish to direct the output to a location that can handle the large log file.

If you are running on a UNIX-like system and you are not interested in the output, you can get rid of it altogether by directing the report file to /dev/null.

If you need to debug the import, it might be useful to collect the stack trace. This is done by using --verbose option.

import.report

--skip-bad-entries-logging[=<true/false>]

Determines whether or not to skip logging bad entries detected during import.

false

--skip-bad-relationships[=<true/false>]

Determines whether or not to skip importing relationships that refer to missing node IDs, i.e. either start or end node ID/group referring to a node that was not specified by the node input data.

Skipped relationships will be logged, containing at most the number of entities specified by --bad-tolerance, unless otherwise specified by the --skip-bad-entries-logging option.

false

--skip-duplicate-nodes[=<true/false>]

Determines whether or not to skip importing nodes that have the same ID/group.

In the event of multiple nodes within the same group having the same ID, the first encountered will be imported, whereas consecutive such nodes will be skipped.

Skipped nodes will be logged, containing at most the number of entities specified by --bad-tolerance, unless otherwise specified by the --skip-bad-entries-logging option.

false

--stage=all|prepare|build|merge

Stage of incremental import.

For incremental import into an existing database use all (which requires the database to be stopped).

For semi-online incremental import, run prepare (on a stopped database), followed by build (on a potentially running database), and finally merge (on a stopped database).

all

--threads=<num> Advanced

Max number of worker threads used by the importer.

Defaults to the number of available processors reported by the JVM. Because the importer needs a certain minimum number of threads, there is no lower bound for this value.

For optimal performance, this value shouldn’t be greater than the number of available processors.

10

--trim-strings[=<true/false>]

Determines whether or not strings should be trimmed for whitespaces.

false

--verbose

Enable verbose output.

CSV header format

The header file of each data source specifies how the data fields should be interpreted. You must use the same delimiter for the header file and the data files.

The header contains information for each field, with the format <name>:<field_type>. The <name> is used for properties and node IDs. In all other cases, the <name> part of the field is ignored.

When using incremental import, you must have node uniqueness constraints in place for the property key and label combinations that form the primary key, i.e. that uniquely identify the nodes. For example, when importing nodes with a Person label that are uniquely identified by a uuid property key, the header should have the format uuid:ID{label:Person}.

This is also true when working with multiple groups. For example, you can use uuid:ID(Person){label:Person}, where the relationship CSV data can refer to different groups for its :START_ID and :END_ID, just like the full import method.
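For example, the node and relationship headers might look like this (hypothetical files, following the header rules above):

```csv
persons_header.csv:
uuid:ID(Person){label:Person},name

knows_header.csv:
:START_ID(Person),:END_ID(Person),:TYPE
```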

Node files

Files containing node data can have an ID field, a LABEL field as well as properties.

ID

Each node must have a unique ID if it is to be connected by any relationships created in the import. The IDs are used to find the correct nodes when creating relationships. Note that the ID must be unique across all nodes in the import, even for nodes with different labels.

The unique ID can be persisted in a property whose name is defined by the <name> part of the field definition <name>:ID. If no such property name is defined, the unique ID is used for the import but is not available for reference later.

If no ID is specified, the node is imported but cannot be connected by any relationships during the import.

When a property name is provided, the type of that property can only be configured globally via the --id-type option; it cannot be specified with a <field_type> in the header field (as is possible for properties).

LABEL

Read one or more labels from this field. Like array values, multiple labels are separated by ;, or by the character specified with --array-delimiter.

Example 6. Define nodes files

You define the headers for movies in the movies_header.csv file. Movies have the properties movieId, year, and title. You also specify a field for labels.

movieId:ID,title,year:int,:LABEL

You define three movies in the movies.csv file. They contain all the properties defined in the header file. All the movies are given the label Movie. Two of them are also given the label Sequel.

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

Similarly, you also define three actors in the actors_header.csv and actors.csv files. They all have the properties personId and name, and the label Actor.

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

Relationship files

Files containing relationship data have three mandatory fields and can also have properties. The mandatory fields are:

TYPE

The relationship type to use for this relationship.

START_ID

The ID of the start node for this relationship.

END_ID

The ID of the end node for this relationship.

The START_ID and END_ID refer to the unique node ID defined in one of the node data sources, as explained in the previous section. None of these take a name, e.g. if <name>:START_ID or <name>:END_ID is defined, the <name> part will be ignored. Nor do they take a <field_type>, e.g. if :START_ID:int or :END_ID:int is defined, the :int part does not have any meaning in the context of type information.

Example 7. Define relationships files

In this example, you assume that the two node files from the previous example are used together with the following relationships file.

You define relationships between actors and movies in the files roles_header.csv and roles.csv. Each row connects a start node and an end node with a relationship of relationship type ACTED_IN. Notice how you use the unique identifiers personId and movieId from the nodes files above. The name of the character that the actor is playing in this movie is stored as a role property on the relationship.

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Properties

For properties, the <name> part of the field designates the property key, while the <field_type> part assigns a data type (see below). You can have properties in both node data files and relationship data files.

Data types

Use one of int, long, float, double, boolean, byte, short, char, string, point, date, localtime, time, localdatetime, datetime, and duration to designate the data type for properties. If no data type is given, it defaults to string.

To define an array type, append [] to the type. By default, array values are separated by ;, but a different delimiter can be specified with --array-delimiter.

Boolean values are true only if they match the text true exactly; all other values are false. Values that contain the delimiter character must be escaped by enclosing them in double quotation marks, or a different delimiter character can be chosen with the --delimiter option.

Example 8. Header format with data types

This example illustrates several different data types specified in the CSV header.

:ID,name,joined:date,active:boolean,points:int
user01,Joe Soap,2017-05-05,true,10
user02,Jane Doe,2017-08-21,true,15
user03,Moe Know,2018-02-17,false,7
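The array and boolean rules described above can be sketched with a small simulation. This is an illustration of the documented semantics only, not the import tool's actual implementation, and the function names are hypothetical.

```python
# Illustration of the documented parsing rules: booleans are true only on
# an exact "true" match, and array values are split on ";" by default
# (configurable with --array-delimiter). Not the importer's real code.

def parse_boolean(value):
    # Only the exact text "true" is true; everything else is false.
    return value == "true"

def parse_array(value, array_delimiter=";"):
    # Array values are separated by the array delimiter (";" by default).
    return value.split(array_delimiter)

# Hypothetical values for a header like  :ID,active:boolean,tags:string[]
print(parse_boolean("true"))    # → True  (exact match)
print(parse_boolean("TRUE"))    # → False (not an exact match)
print(parse_array("sci-fi;action;classic"))  # → ['sci-fi', 'action', 'classic']
```

Note that values such as TRUE, True, or 1 all parse as false, which is easy to miss when exporting data from systems that use other boolean spellings.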
Special considerations for the point data type

A point is specified using the Cypher syntax for maps. The map allows the same keys as the input to the Cypher Manual → Point function. The point data type in the header can be amended with a map of default values that apply to all values in that column, e.g. point{crs: 'WGS-84'}. Specifying the header this way allows an incomplete map in the value position of the data file. A value in the data file may also override the default values from the header.

Example 9. Property format for point data type

This example illustrates various ways of using the point data type in the import header and the data files.

You are going to import the name and location coordinates for cities. First, you define the header as:

:ID,name,location:point{crs:WGS-84}

You then define cities in the data file.

  • The first city’s location is defined using latitude and longitude, as expected when using the coordinate system defined in the header.

  • The second city uses x and y instead. This would normally produce a point in the cartesian coordinate reference system. However, since the header defines crs:WGS-84, the WGS-84 coordinate reference system is used.

  • The third city overrides the coordinate reference system defined in the header and sets it explicitly to WGS-84-3D.

:ID,name,location:point{crs:WGS-84}
city01,"Malmö","{latitude:55.6121514, longitude:12.9950357}"
city02,"London","{y:51.507222, x:-0.1275}"
city03,"San Mateo","{latitude:37.554167, longitude:-122.313056, height: 100, crs:'WGS-84-3D'}"

Note that all point maps are within double quotation marks " in order to prevent the enclosed , character from being interpreted as a column separator. An alternative approach would be to use --delimiter='\t' and reformat the file with tab separators, in which case the " characters are not required.

:ID name    location:point{crs:WGS-84}
city01  Malmö   {latitude:55.6121514, longitude:12.9950357}
city02  London  {y:51.507222, x:-0.1275}
city03  San Mateo   {latitude:37.554167, longitude:-122.313056, height: 100, crs:'WGS-84-3D'}
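How a header default map combines with the map in each value can be sketched as follows. This is an illustration of the merge behaviour described above, with a hypothetical function name; it is not the import tool's code.

```python
# Sketch of how a header default map such as point{crs:'WGS-84'} combines
# with the map given in each value: keys present in the value override the
# header's defaults. Illustrative only, not the importer's implementation.

def merge_point(header_defaults, value_map):
    merged = dict(header_defaults)
    merged.update(value_map)  # keys in the value override the defaults
    return merged

header = {"crs": "WGS-84"}

# A city with latitude/longitude and no crs: the header default applies.
print(merge_point(header, {"latitude": 55.6121514, "longitude": 12.9950357}))

# A city that overrides the crs explicitly.
print(merge_point(header, {"latitude": 37.554167, "longitude": -122.313056,
                           "height": 100, "crs": "WGS-84-3D"}))
```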
Special considerations for temporal data types

The format for all temporal data types must be defined as described in Cypher Manual → Temporal instants syntax and Cypher Manual → Durations syntax. Two of the temporal types, Time and DateTime, take a time zone parameter that may be common to all or many of the values in the data file. It is therefore possible to specify a default time zone for Time and DateTime values in the header, for example time{timezone:+02:00} and datetime{timezone:Europe/Stockholm}. If no default time zone is specified, it is determined by the db.temporal.timezone configuration setting. The default time zone can be explicitly overridden by individual values in the data file.

Example 10. Property format for temporal data types

This example illustrates various ways of using the datetime data type in the import header and the data files.

First, you define the header with two DateTime columns. The first one defines a time zone, but the second one does not:

:ID,date1:datetime{timezone:Europe/Stockholm},date2:datetime

You then define dates in the data file.

  • The first row has two values that do not specify an explicit timezone. The value for date1 will use the Europe/Stockholm time zone that was specified for that field in the header. The value for date2 will use the configured default time zone of the database.

  • In the second row, both date1 and date2 set the time zone explicitly to be Europe/Berlin. This overrides the header definition for date1, as well as the configured default time zone of the database.

1,2018-05-10T10:30,2018-05-10T12:30
2,2018-05-10T10:30[Europe/Berlin],2018-05-10T12:30[Europe/Berlin]
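The precedence rules above (explicit value, then header default, then the db.temporal.timezone setting) can be sketched as a small simulation. This illustrates the documented behaviour only; the function name is hypothetical and the code is not the importer's implementation.

```python
# Sketch of the default-time-zone rule: a value that carries its own zone
# (in square brackets) keeps it; otherwise the header default applies, and
# failing that, the database default. Illustrative only.

def effective_timezone(value, header_default, db_default="Z"):
    if "[" in value:  # e.g. 2018-05-10T10:30[Europe/Berlin]
        return value[value.index("[") + 1 : value.index("]")]
    return header_default if header_default is not None else db_default

# date1 has a header default of Europe/Stockholm; date2 has none.
print(effective_timezone("2018-05-10T10:30", "Europe/Stockholm"))
# → Europe/Stockholm (header default)
print(effective_timezone("2018-05-10T12:30", None))
# → Z (database default, assumed UTC here)
print(effective_timezone("2018-05-10T10:30[Europe/Berlin]", "Europe/Stockholm"))
# → Europe/Berlin (explicit value wins)
```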

Using ID spaces

By default, the import tool assumes that node identifiers are unique across node files. In many cases, the ID is unique only within each entity file, for example, when your CSV files contain data extracted from a relational database and the ID field is pulled from the primary key column of the corresponding table. To handle this situation, you define ID spaces. ID spaces are defined in the ID field of node files using the syntax ID(<ID space identifier>). To reference an ID of an ID space in a relationship file, use the syntax START_ID(<ID space identifier>) and END_ID(<ID space identifier>).

Example 11. Define and use ID spaces

Define a Movie-ID ID space in the movies_header.csv file.

movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel

Define an Actor-ID ID space in the header of the actors_header.csv file.

personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor

Now use the previously defined ID spaces when connecting the actors to movies.

:START_ID(Actor-ID),role,:END_ID(Movie-ID),:TYPE
1,"Neo",1,ACTED_IN
1,"Neo",2,ACTED_IN
1,"Neo",3,ACTED_IN
2,"Morpheus",1,ACTED_IN
2,"Morpheus",2,ACTED_IN
2,"Morpheus",3,ACTED_IN
3,"Trinity",1,ACTED_IN
3,"Trinity",2,ACTED_IN
3,"Trinity",3,ACTED_IN
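The effect of ID spaces can be sketched as follows: identifiers behave as if they were keyed by the pair (ID space, ID), so equal values in different spaces never collide. This is an illustration of the concept, not the importer's internal data structure.

```python
# Sketch of ID-space resolution: node IDs are effectively keyed by
# (ID space, ID), so the value "1" can exist in both the Movie-ID and the
# Actor-ID space without clashing. Illustrative only.

nodes: dict[tuple[str, str], str] = {}

for movie_id, title in [("1", "The Matrix"), ("2", "The Matrix Reloaded")]:
    nodes[("Movie-ID", movie_id)] = title

for person_id, name in [("1", "Keanu Reeves"), ("2", "Laurence Fishburne")]:
    nodes[("Actor-ID", person_id)] = name

# A relationship row such as  1,"Neo",1,ACTED_IN  resolves each endpoint in
# its declared ID space, so the two "1"s refer to different nodes.
start = nodes[("Actor-ID", "1")]
end = nodes[("Movie-ID", "1")]
print(f"{start} ACTED_IN {end}")  # → Keanu Reeves ACTED_IN The Matrix
```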

Skipping columns

IGNORE

If there are fields in the data that you wish to ignore completely, use the IGNORE keyword in the header file. IGNORE must be prefixed with a : (that is, :IGNORE).

Example 12. Skip a column

In this example, you are not interested in the data in the third column of the nodes file and wish to skip over it. Note that the IGNORE keyword is prefixed with a :.

personId:ID,name,:IGNORE,:LABEL
keanu,"Keanu Reeves","male",Actor
laurence,"Laurence Fishburne","male",Actor
carrieanne,"Carrie-Anne Moss","female",Actor
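The skipping behaviour can be sketched with a few lines of Python: any column whose header field is :IGNORE is simply dropped. This is an illustration of the documented rule, not the importer's code.

```python
# Sketch of the :IGNORE rule: pair each value with its header field and
# drop the columns marked :IGNORE. Illustrative only.

header = ["personId:ID", "name", ":IGNORE", ":LABEL"]
row = ["keanu", "Keanu Reeves", "male", "Actor"]

kept = {h: v for h, v in zip(header, row) if h != ":IGNORE"}
print(kept)
# → {'personId:ID': 'keanu', 'name': 'Keanu Reeves', ':LABEL': 'Actor'}
```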

If all your superfluous data is placed in columns located to the right of all the columns that you wish to import, you can instead use the command line option --ignore-extra-columns.

Import compressed files

The import tool can handle files compressed with zip or gzip. Each compressed file must contain a single file.

Example 13. Perform an import using compressed files
neo4j_home$ ls import
actors-header.csv  actors.csv.zip  movies-header.csv  movies.csv.gz  roles-header.csv  roles.csv.gz
neo4j_home$ bin/neo4j-admin database import --nodes import/movies-header.csv,import/movies.csv.gz --nodes import/actors-header.csv,import/actors.csv.zip --relationships import/roles-header.csv,import/roles.csv.gz
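If you prepare compressed input programmatically, a single-file gzip archive of the kind the importer accepts can be produced, for example, with Python's standard gzip and csv modules. The file name and row below are illustrative.

```python
# Sketch: write one CSV data file inside a gzip archive, which is the form
# the import tool accepts (each compressed file must contain a single file).
import csv
import gzip

with gzip.open("movies.csv.gz", "wt", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(["tt0133093", "The Matrix", "1999", "Movie"])

# Read it back to confirm a single, intact CSV stream.
with gzip.open("movies.csv.gz", "rt", encoding="utf-8", newline="") as f:
    print(next(csv.reader(f)))  # → ['tt0133093', 'The Matrix', '1999', 'Movie']
```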

Resuming a stopped or canceled import

An import that is stopped or fails before completing can be resumed from a point closer to where it was stopped. An import can be resumed from the following points:

  • Linking of relationships

  • Post-processing