Chapter 2. Deployment

This chapter covers deploying Neo4j: capacity planning, single-instance and cluster installation, and post-installation tasks.

2.1. System requirements

CPU

Performance is generally memory or I/O bound for large graphs, and compute bound for graphs that fit in memory.

Minimum: Intel Core i3
Recommended: Intel Core i7, IBM POWER8

Memory

More memory allows for larger graphs, but it needs to be configured properly to avoid disruptive garbage collection operations. See Section 6.3, “Memory tuning” for suggestions.

Minimum: 2GB
Recommended: 16-32GB or more

Disk

Aside from capacity, the performance characteristics of the disk are the most important when selecting storage. Neo4j workloads tend significantly toward random reads. Select media with low average seek time: SSD over spinning disks. Consult Section 6.7, “Disks, RAM and other tips” for more details.

Minimum: 10GB SATA
Recommended: SSD w/ SATA

Filesystem

For proper ACID behavior, the filesystem must support flush (fsync, fdatasync). See Section 6.6, “Linux file system tuning” for a discussion on how to configure the filesystem in Linux for optimal performance.

Minimum: ext4 (or similar)
Recommended: ext4, ZFS

Software

Neo4j requires a Java Virtual Machine (JVM) to operate. Community Edition installers for Windows and Mac include a JVM for convenience. All other distributions, including all distributions of Neo4j Enterprise Edition, require a pre-installed JVM.

Java

OpenJDK 8 or Oracle Java 8

IBM Java 8

Operating Systems

Linux (Ubuntu, Debian)

Windows Server 2012

Architectures

x86

OpenPOWER (POWER8)

2.2. File locations

This table shows where important files can be found by default in various Neo4j distribution packages.

Package: Linux or OS X tarball
  Configuration: <neo4j-home>/conf/neo4j.conf
  Data:          <neo4j-home>/data
  Logs:          <neo4j-home>/logs
  Metrics:       <neo4j-home>/metrics
  Import:        <neo4j-home>/import
  Bin:           <neo4j-home>/bin
  Lib:           <neo4j-home>/lib
  Plugins:       <neo4j-home>/plugins

Package: Windows zip
  Configuration: <neo4j-home>\conf\neo4j.conf
  Data:          <neo4j-home>\data
  Logs:          <neo4j-home>\logs
  Metrics:       <neo4j-home>\metrics
  Import:        <neo4j-home>\import
  Bin:           <neo4j-home>\bin
  Lib:           <neo4j-home>\lib
  Plugins:       <neo4j-home>\plugins

Package: Debian/Ubuntu .deb
  Configuration: /etc/neo4j/neo4j.conf
  Data:          /var/lib/neo4j/data
  Logs:          /var/log/neo4j
  Metrics:       /var/lib/neo4j/metrics
  Import:        /var/lib/neo4j/import
  Bin:           /var/lib/neo4j/bin
  Lib:           /var/lib/neo4j/lib
  Plugins:       /var/lib/neo4j/plugins

Package: Windows desktop
  Configuration: %APPDATA%\Neo4j Community Edition\neo4j.conf
  Data:          %APPDATA%\Neo4j Community Edition
  Logs:          %APPDATA%\Neo4j Community Edition\logs
  Metrics:       %APPDATA%\Neo4j Community Edition\metrics
  Import:        %APPDATA%\Neo4j Community Edition\import
  Bin:           %ProgramFiles%\Neo4j CE 3.0\bin
  Lib:           (in package)
  Plugins:       %ProgramFiles%\Neo4j CE 3.0\plugins

Package: OS X desktop
  Configuration: ${HOME}/Documents/Neo4j/neo4j.conf
  Data:          ${HOME}/Documents/Neo4j
  Logs:          ${HOME}/Documents/Neo4j/logs
  Metrics:       ${HOME}/Documents/Neo4j/metrics
  Import:        ${HOME}/Documents/Neo4j/import
  Bin:           (in package)
  Lib:           (in package)
  Plugins:       (in package)

Please note that the data directory is internal to Neo4j and its structure is subject to change between versions without notice.

2.2.1. Log Files

Filename           Description

neo4j.log          The standard log, where general information about Neo4j is written.

debug.log          Information useful when debugging problems with Neo4j.

http.log           Request log for the HTTP API.

gc.log             Garbage collection logging provided by the JVM.

query.log          Log of executed queries that take longer than a specified threshold. (Enterprise Edition only.)

service-error.log  Log of errors encountered when installing or running the Windows service. (Windows only.)

2.2.2. Configuration

Some of these paths are configurable with dbms.directories.* settings; see Section A.1, “Configuration settings reference” for details.

The locations of <neo4j-home>, bin and conf can be configured using environment variables.

Location: <neo4j-home>
  Default:              parent of bin
  Environment variable: NEO4J_HOME
  Notes:                Must be set explicitly if bin is not a subdirectory.

Location: bin
  Default:              directory where the neo4j script is located
  Environment variable: NEO4J_BIN
  Notes:                Must be set explicitly if the neo4j script is invoked as a symlink.

Location: conf
  Default:              <neo4j-home>/conf
  Environment variable: NEO4J_CONF
  Notes:                Must be set explicitly if it is not a subdirectory of <neo4j-home>.

2.2.3. Permissions

The user that Neo4j runs as must have the following permissions:

Read only
  • conf
  • import
  • bin
  • lib
  • plugins
Read and write
  • data
  • logs
  • metrics
Execute
  • all files in bin

2.3. Single-instance installation

2.3.1. Linux installation

2.3.1.1. Linux Packages

After installation you may have to do some platform specific configuration and performance tuning. For that, refer to Section 2.5, “Post-installation tasks”.

2.3.1.2. Unix Console Application

  1. Download the latest release from http://neo4j.com/download/.

    • Select the appropriate tar.gz distribution for your platform.
  2. Extract the contents of the archive, using: tar -xf <filename>

    • Refer to the top-level extracted directory as: NEO4J_HOME
  3. Change directory to: $NEO4J_HOME

    • Run: ./bin/neo4j console
  4. Stop the server by typing Ctrl-C in the console.

2.3.1.3. Linux Service

The neo4j command can also be used with start, stop, restart or status instead of console. By using these actions, you can create a Neo4j service.

This approach to running Neo4j as a service is deprecated. We strongly advise you to run Neo4j from a package where feasible.

You can build your own init.d script. See for instance the Linux Standard Base specification on system initialization, or one of the many samples and tutorials.

2.3.2. OS X installation

2.3.2.1. Mac OS X installer

  1. Download the .dmg installer that you want from http://neo4j.com/download/.
  2. Click the downloaded installer file.
  3. Drag the Neo4j icon into the Applications folder.

If you install Neo4j using the Mac installer and already have an existing instance of Neo4j, the installer will ensure that both the old and new versions can co-exist on your system.

2.3.2.2. Running Neo4j from the Terminal

The server can be started in the background from the terminal with the command neo4j start, and then stopped again with neo4j stop. The server can also be started in the foreground with neo4j console; its log output will then be printed to the terminal.

2.3.2.3. OS X service

Use the standard OS X system tools to create a service based on the neo4j command.

2.3.3. Windows installation

2.3.3.1. Windows Installer

  1. Download the version that you want from http://neo4j.com/download/.

    • Select the appropriate version and architecture for your platform.
  2. Double-click the downloaded installer file.
  3. Follow the prompts.

The installer will prompt to be granted Administrator privileges. Newer versions of Windows come with a SmartScreen feature that may prevent the installer from running — you can make it run anyway by clicking "More info" on the "Windows protected your PC" screen.

If you install Neo4j using the Windows installer and you already have an existing instance of Neo4j, the installer will select a new install directory by default. If you specify the same directory, it will ask if you want to upgrade. This should proceed without issue, although some users have reported a "JRE is damaged" error. If you see this error, simply install Neo4j into a different location.

2.3.3.2. Windows Console Application

  1. Download the latest release from http://neo4j.com/download/.

    • Select the appropriate Zip distribution.
  2. Right-click the downloaded file, click Extract All.
  3. Change directory to top-level extracted directory.

    • Run bin\neo4j console
  4. Stop the server by typing Ctrl-C in the console.

2.3.3.3. Windows service

Neo4j can also be run as a Windows service. Install the service with bin\neo4j install-service and start it with bin\neo4j start. Other commands available are stop, restart, status and uninstall-service.

2.3.3.4. Windows PowerShell module

The Neo4j PowerShell module allows administrators to:

  • install, start and stop Neo4j Windows® Services
  • start tools, such as Neo4j Shell and Neo4j Import

The PowerShell module is installed as part of the ZIP file distributions of Neo4j.

2.3.3.4.1. System Requirements
  • Requires PowerShell v2.0 or above.
  • Supported on either 32 or 64 bit operating systems.
2.3.3.4.2. Managing Neo4j on Windows

On Windows it is sometimes necessary to unblock a downloaded zip file before you can import its contents as a module. Right-click the zip file and choose "Properties"; at the bottom right of the resulting dialog you will find an "Unblock" button. Click it, and you should then be able to import the module.

Running scripts has to be enabled on the system. This can for example be achieved by executing the following from an elevated PowerShell prompt:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned

For more information see About execution policies.

The PowerShell module will display a warning if it detects that you do not have administrative rights.

2.3.3.4.3. How do I import the module?

The module file is located in the bin directory of your Neo4j installation, i.e. where you unzipped the downloaded file. For example, if Neo4j was installed in C:\Neo4j then the module would be imported like this:

Import-Module C:\Neo4j\bin\Neo4j-Management.psd1

This will add the module to the current session.

Once the module has been imported you can start an interactive console version of a Neo4j Server like this:

Invoke-Neo4j console

To stop the server, issue Ctrl-C in the console window that was created by the command.

2.3.3.4.4. How do I get help about the module?

Once the module is imported you can query the available commands like this:

Get-Command -Module Neo4j-Management

The output should be similar to the following:

CommandType     Name                                Version    Source
-----------     ----                                -------    ------
Function        Invoke-Neo4j                        3.0.0      Neo4j-Management
Function        Invoke-Neo4jAdmin                   3.0.0      Neo4j-Management
Function        Invoke-Neo4jBackup                  3.0.0      Neo4j-Management
Function        Invoke-Neo4jImport                  3.0.0      Neo4j-Management
Function        Invoke-Neo4jShell                   3.0.0      Neo4j-Management

The module also supports the standard PowerShell help commands.

Get-Help Invoke-Neo4j

To see examples for a command, use:

Get-Help Invoke-Neo4j -examples
2.3.3.4.5. Example usage
  • List of available commands:

    Invoke-Neo4j
  • Current status of the Neo4j service:

    Invoke-Neo4j status
  • Install the service with verbose output:

    Invoke-Neo4j install-service -Verbose
  • Available commands for administrative tasks:

    Invoke-Neo4jAdmin
2.3.3.4.6. Common PowerShell parameters

The module commands support the common PowerShell parameter of Verbose.

2.4. Cluster installation

2.4.1. Setup and configuration

Neo4j can be configured in cluster mode to accommodate differing requirements for load, fault tolerance and available hardware. Refer to design considerations for a discussion on different design options.

Follow these steps in order to configure a Neo4j cluster:

  1. Download and install the Neo4j Enterprise Edition on each of the servers to be included in the cluster.
  2. If applicable, decide which server(s) are to be configured as arbiter instance(s).
  3. Edit the Neo4j configuration file on each of the servers to accommodate the design decisions.
  4. Follow installation instructions for a single instance installation.
  5. Modify the configuration files on each server as outlined in the section below. There are many parameters that can be modified to achieve a certain behavior. However, the only ones mandatory for an initial cluster are: dbms.mode, ha.server_id and ha.initial_hosts.
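As a sketch, the three mandatory settings might look like this in neo4j.conf on the first instance (the server ID and addresses below are illustrative assumptions, not values prescribed by this guide):

```properties
# Minimal cluster configuration for one instance
dbms.mode=HA
ha.server_id=1
ha.initial_hosts=192.168.33.21:5001,192.168.33.22:5001,192.168.33.23:5001
```

Each of the other instances would use the same ha.initial_hosts list but a different ha.server_id.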

2.4.1.1. Important configuration settings

Each instance in a Neo4j HA cluster must be assigned an integer ID, which serves as its unique identifier. At startup, a Neo4j instance contacts the other instances specified in the ha.initial_hosts configuration option.

When an instance establishes a connection to any other, it determines the current state of the cluster and ensures that it is eligible to join. To be eligible the Neo4j instance must host the same database store as other members of the cluster (although it is allowed to be in an older state), or be a new deployment without a database store.

Please note that IP addresses or hostnames should be explicitly configured for the machines participating in the cluster. Neo4j will attempt to configure IP addresses for itself in the absence of explicit configuration.

2.4.1.1.1. dbms.mode

dbms.mode configures the operating mode of the database.

For cluster mode it is set to: dbms.mode=HA

2.4.1.1.2. ha.server_id

ha.server_id is the cluster identifier for each instance. It must be a positive integer and must be unique among all Neo4j instances in the cluster.

For example, ha.server_id=1.

2.4.1.1.3. ha.host.coordination

ha.host.coordination is an address/port setting that specifies where the Neo4j instance will listen for cluster communications (such as heartbeat messages). The default port is 5001. In the absence of a specified IP address, Neo4j will attempt to find a valid interface for binding. While this behavior typically results in a well-behaved server, it is strongly recommended that users explicitly choose an IP address bound to the network interface of their choosing to ensure a coherent cluster deployment.

For example, ha.host.coordination=192.168.33.22:5001 will listen for cluster communications on the network interface bound to the 192.168.33.0 subnet on port 5001.

2.4.1.1.4. ha.initial_hosts

ha.initial_hosts is a comma separated list of address/port pairs, which specify how to reach other Neo4j instances in the cluster (as configured via their ha.host.coordination option). These hostname/ports will be used when the Neo4j instances start, to allow them to find and join the cluster. Specifying an instance’s own address is permitted. Do not use any whitespace in this configuration option.

For example, ha.initial_hosts=192.168.33.22:5001,192.168.33.21:5001 will attempt to reach Neo4j instances listening on 192.168.33.22 on port 5001 and 192.168.33.21 on port 5001 on the 192.168.33.0 subnet.

2.4.1.1.5. ha.host.data

ha.host.data is an address/port setting that specifies where the Neo4j instance will listen for transactions from the cluster master. The default port is 6001. In the absence of a specified IP address, Neo4j will attempt to find a valid interface for binding. While this behavior typically results in a well-behaved server, it is strongly recommended that users explicitly choose an IP address bound to the network interface of their choosing to ensure a coherent cluster topology.

ha.host.data must use a different port from ha.host.coordination.

For example, ha.host.data=192.168.33.22:6001 will listen for transactions from the cluster master on the network interface bound to the 192.168.33.0 subnet on port 6001.

The ha.host.coordination and ha.host.data configuration options are specified as <IP address>:<port>.

For ha.host.data the IP address must be the address assigned to one of the host’s network interfaces.

For ha.host.coordination the IP address must be the address assigned to one of the host’s network interfaces, or the value 0.0.0.0, which will cause Neo4j to listen on every network interface.

Either the address or the port can be omitted, in which case the default for that part will be used. If the address is omitted, then the port must be preceded with a colon (e.g. :5001).

The syntax for setting the port range is: <hostname>:<first port>[-<second port>]. In this case, Neo4j will test each port in sequence, and select the first that is unused. Note that this usage is not permitted when the hostname is specified as 0.0.0.0 (the "all interfaces" address).
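For instance, an omitted address and a port range might be written like this (the address is an illustrative assumption):

```properties
# Address omitted: Neo4j chooses an interface, listening on port 5001
ha.host.coordination=:5001
# Port range: the first unused port between 5001 and 5010 is selected
ha.host.coordination=192.168.33.22:5001-5010
```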

For a hands-on tutorial for setting up a Neo4j cluster, see Section B.1, “Set up a Neo4j cluster”.

Review the Section A.1, “Configuration settings reference” section for a list of all available configuration settings.

2.4.2. Arbiter instances

A typical deployment of Neo4j will use a cluster of 3 machines to provide fault-tolerance and read scalability. This setup is described in Section B.1, “Set up a Neo4j cluster”.

While having at least 3 instances is necessary for failover to happen in case the master becomes unavailable, it is not required for all instances to run the full Neo4j stack. Instead, something called arbiter instances can be deployed. They are regarded as cluster participants in that their role is to take part in master elections, with the single purpose of breaking ties in the election process. This makes it possible to run a cluster of 2 Neo4j database instances plus an additional arbiter instance, and still tolerate the failure of any single one of the 3 instances.

Arbiter instances are configured in neo4j.conf using the same settings as standard Neo4j cluster members. The instance is configured to be an arbiter by setting the dbms.mode option (Table A.51, “dbms.mode”) to ARBITER. Settings that are not cluster-specific are of course ignored, so you can easily start up an arbiter instance in place of a properly configured Neo4j instance.

To start the arbiter instance, run neo4j as normal:

neo4j_home$ ./bin/neo4j start

You can stop, install and remove it as a service and ask for its status in exactly the same way as for other Neo4j instances.

2.4.3. Endpoints for status information

2.4.3.1. Introduction

A common use case for Neo4j HA clusters is to direct all write requests to the master while using slaves for read operations, distributing the read load across the cluster and gaining failover capabilities for your deployment. The most common way to achieve this is to place a load balancer in front of the HA cluster, as shown with HAProxy in Section 2.4.4, “HAProxy for load balancing”. As you can see in that guide, it makes use of an HTTP endpoint to discover which instance is the master and direct write load to it. In this section, we’ll deal with this HTTP endpoint and explain its semantics.

2.4.3.2. The endpoints

Each HA instance comes with 3 endpoints regarding its HA status. They are complementary, but each may be used depending on your load balancing needs and your production setup. They are:

  • /db/manage/server/ha/master
  • /db/manage/server/ha/slave
  • /db/manage/server/ha/available

The /master and /slave endpoints can be used to direct write and non-write traffic respectively to specific instances. This is the optimal way to take advantage of Neo4j’s scaling characteristics. The /available endpoint exists for the general case of directing arbitrary request types to instances that are available for transaction processing.

To use the endpoints, perform an HTTP GET operation on any of them; the following will be returned:

Table 2.1. HA HTTP endpoint responses

Endpoint                        Instance State   Returned Code   Body text

/db/manage/server/ha/master     Master           200 OK          true
                                Slave            404 Not Found   false
                                Unknown          404 Not Found   UNKNOWN

/db/manage/server/ha/slave      Master           404 Not Found   false
                                Slave            200 OK          true
                                Unknown          404 Not Found   UNKNOWN

/db/manage/server/ha/available  Master           200 OK          master
                                Slave            200 OK          slave
                                Unknown          404 Not Found   UNKNOWN

2.4.3.3. Examples

From the command line, a common way to query those endpoints is to use curl. With no arguments, curl will perform an HTTP GET on the URI provided and output the body text, if any. If you also want the response code, just add the -v flag for verbose output. Here are some examples:

  • Requesting master endpoint on a running master with verbose output
#> curl -v localhost:7474/db/manage/server/ha/master
* About to connect() to localhost port 7474 (#0)
*   Trying ::1...
* connected
* Connected to localhost (::1) port 7474 (#0)
> GET /db/manage/server/ha/master HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:7474
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Access-Control-Allow-Origin: *
< Transfer-Encoding: chunked
< Server: Jetty(6.1.25)
<
* Connection #0 to host localhost left intact
true* Closing connection #0
  • Requesting slave endpoint on a running master without verbose output:
#> curl localhost:7474/db/manage/server/ha/slave
false
  • Finally, requesting the master endpoint on a slave with verbose output
#> curl -v localhost:7475/db/manage/server/ha/master
* About to connect() to localhost port 7475 (#0)
*   Trying ::1...
* connected
* Connected to localhost (::1) port 7475 (#0)
> GET /db/manage/server/ha/master HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:7475
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Content-Type: text/plain
< Access-Control-Allow-Origin: *
< Transfer-Encoding: chunked
< Server: Jetty(6.1.25)
<
* Connection #0 to host localhost left intact
false* Closing connection #0

The UNKNOWN status exists to describe when a Neo4j instance is neither master nor slave. For example, the instance could be transitioning between states (master to slave in a recovery scenario, or slave being promoted to master in the event of failure), or the instance could be an arbiter instance. If the UNKNOWN status is returned, the client should not treat the instance as a master or a slave; instead it should pick another instance in the cluster to use, wait for the instance to transition out of the UNKNOWN state, or undertake restorative action via a systems administrator.

If the Neo4j server has Basic Security enabled, the HA status endpoints will also require authentication credentials. For some load balancers and proxy servers, providing this with the request is not an option. For those situations, consider disabling authentication of the HA status endpoints by setting dbms.security.ha_status_auth_enabled=false in the neo4j.conf configuration file.

2.4.4. HAProxy for load balancing

In the Neo4j HA architecture, the cluster is typically fronted by a load balancer. In this section we will explore how to set up HAProxy to perform load balancing across the HA cluster.

For this tutorial we will assume a Linux environment with HAProxy already installed. See http://www.haproxy.org/ for downloads and installation instructions.

2.4.4.1. Configuring HAProxy for the Bolt Protocol

In a typical HA deployment, HAProxy will be configured with two open ports, one for routing write operations to the master and one for load balancing read operations over slaves. Each application will have two driver instances, one connected to the master port for performing writes and one connected to the slave port for performing reads.

Let’s first set up the mode and timeouts. The settings below will kill the connection if a server or a client is idle for longer than two hours. Long-running queries may take more time than that, but this can be taken care of by enabling HAProxy’s TCP heartbeat feature.

defaults
    mode        tcp

    timeout connect 30s

    timeout client 2h
    timeout server 2h

Set up where drivers wanting to perform writes will connect:

frontend neo4j-write
    bind *:7680
    default_backend current-master

Now, let’s set up the backend that points to the current master instance.

backend current-master
    option  httpchk HEAD /db/manage/server/ha/master HTTP/1.0

    server db01 10.0.1.10:7687 check port 7474
    server db02 10.0.1.11:7687 check port 7474
    server db03 10.0.1.12:7687 check port 7474

In the example above, httpchk is configured the way you would if authentication has been disabled for Neo4j. By default, however, authentication is enabled and you will need to pass in an authentication header. This would be along the lines of option httpchk HEAD /db/manage/server/ha/master HTTP/1.0\r\nAuthorization:\ Basic\ bmVvNGo6bmVvNGo=, where the last part has to be replaced with a base64-encoded value for your username and password.
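One way to generate that base64 value is shown below; the neo4j/neo4j credentials here are only an example and should be replaced with your own:

```shell
# Encode "username:password" for an Authorization: Basic header.
# printf is used instead of echo to avoid encoding a trailing newline.
printf 'neo4j:neo4j' | base64
# prints: bmVvNGo6bmVvNGo=
```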

Configure where drivers wanting to perform reads will connect:

frontend neo4j-read
    bind *:7681
    default_backend slaves

Finally, configure a backend that points to slaves in a round-robin fashion:

backend slaves
    balance roundrobin
    option  httpchk HEAD /db/manage/server/ha/slave HTTP/1.0

    server db01 10.0.1.10:7687 check port 7474
    server db02 10.0.1.11:7687 check port 7474
    server db03 10.0.1.12:7687 check port 7474

Note that the servers in the slave backend are configured the same way as in the current-master backend.

Then by putting all the above configurations into one file, we get a basic workable HAProxy configuration to perform load balancing for applications using the Bolt Protocol.

By default, encryption is enabled between servers and drivers. With encryption turned on, the HAProxy configuration constructed above needs no change to work directly in a TLS/SSL passthrough layout for HAProxy. However, depending on the driver authentication strategy adopted, some special requirements might apply to the server certificates.

For drivers using the trust-on-first-use authentication strategy, each driver registers the HAProxy port it connects to with the first certificate received from the cluster. For all subsequent connections, the driver will only establish connections with a server whose certificate matches the registered one. Therefore, in order for a driver to be able to establish connections with all instances in the cluster, this mode requires all instances in the cluster to share the same certificate.

If drivers are configured to run in trusted-certificate mode, then the certificate known to the drivers should be a root certificate for all the certificates installed on the servers in the cluster. Alternatively, for drivers such as the Java driver that support registering multiple certificates as trusted, the drivers also work well with a cluster if all the server certificates used in the cluster are registered as trusted certificates.

To use HAProxy with other encryption layouts, please refer to the full documentation on the HAProxy website.

2.4.4.2. Configuring HAProxy for the HTTP API

HAProxy can be configured in many ways. The full documentation is available at their website.

For this example, we will configure HAProxy to load balance requests to three HA servers. Simply write the following configuration to /etc/haproxy.cfg:

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j

backend neo4j
    option httpchk GET /db/manage/server/ha/available
    server s1 10.0.1.10:7474 maxconn 32
    server s2 10.0.1.11:7474 maxconn 32
    server s3 10.0.1.12:7474 maxconn 32

listen admin
    bind *:8080
    stats enable

HAProxy can now be started by running:

/usr/sbin/haproxy -f /etc/haproxy.cfg

You can connect to http://<ha-proxy-ip>:8080/haproxy?stats to view the status dashboard. This dashboard can be moved to run on port 80, and authentication can also be added. See the HAProxy documentation for details on this.
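As a minimal sketch, the stats auth directive can password-protect the dashboard (the admin:secret credentials below are placeholders):

```properties
listen admin
    bind *:8080
    stats enable
    stats auth admin:secret
```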

2.4.4.3. Optimizing for reads and writes

Neo4j provides a catalogue of health check URLs (see Section 2.4.3, “Endpoints for status information”) that HAProxy (or any load balancer for that matter) can use to distinguish machines using HTTP response codes. In the example above we used the /available endpoint, which directs requests to machines that are generally available for transaction processing (they are alive!).

However, it is possible to have requests directed to slaves only, or to the master only. If you are able to distinguish in your application between requests that write, and requests that only read, then you can take advantage of two (logical) load balancers: one that sends all your writes to the master, and one that sends all your read-only requests to a slave. In HAProxy you build logical load balancers by adding multiple backends.

The trade-off here is that while Neo4j allows slaves to proxy writes for you, this indirection unnecessarily ties up resources on the slave and adds latency to your write requests. Conversely, you don’t particularly want read traffic to tie up resources on the master; Neo4j allows you to scale out for reads, but writes are still constrained to a single instance. If possible, that instance should exclusively do writes to ensure maximum write performance.

The following example excludes the master from the set of machines using the /slave endpoint.

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j-slaves

backend neo4j-slaves
    option httpchk GET /db/manage/server/ha/slave
    server s1 10.0.1.10:7474 maxconn 32 check
    server s2 10.0.1.11:7474 maxconn 32 check
    server s3 10.0.1.12:7474 maxconn 32 check

listen admin
    bind *:8080
    stats enable

In practice, writing to a slave is uncommon. While writing to slaves has the benefit of ensuring that data is persisted in two places (the slave and the master), it comes at a cost. The cost is that the slave must immediately become consistent with the master by applying any missing transactions and then synchronously apply the new transaction with the master. This is a more expensive operation than writing to the master and having the master push changes to one or more slaves.

2.4.4.4. Cache-based sharding with HAProxy

Neo4j HA enables what is called cache-based sharding. If the dataset is too big to fit into the cache of any single machine, then by applying a consistent routing algorithm to requests, the caches on each machine will actually cache different parts of the graph. A typical routing key could be user ID.

In this example, the user ID is a query parameter in the URL being requested. This will route the same user to the same machine for each request.

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j-slaves

backend neo4j-slaves
    balance url_param user_id
    server s1 10.0.1.10:7474 maxconn 32
    server s2 10.0.1.11:7474 maxconn 32
    server s3 10.0.1.12:7474 maxconn 32

listen admin
    bind *:8080
    stats enable

Naturally, the health check and query parameter-based routing can be combined to only route requests to slaves by user ID. Other load balancing algorithms are also available, such as routing by source IP (source), by URI (uri), or by an HTTP header (hdr(name)).
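For example, switching the backend above from the user_id query parameter to consistent routing by client source IP is a one-line change (a sketch, with the same illustrative server addresses):

```properties
backend neo4j-slaves
    # consistent routing by client source IP instead of a query parameter
    balance source
    server s1 10.0.1.10:7474 maxconn 32
    server s2 10.0.1.11:7474 maxconn 32
    server s3 10.0.1.12:7474 maxconn 32
```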

2.5. Post-installation tasks

2.5.1. Waiting for Neo4j to start

After starting Neo4j it may take some time before the database is ready to serve requests. Systems that depend on the database should be able to retry if it is unavailable in order to cope with network glitches and other brief outages. To specifically wait for Neo4j to be available after starting, poll the Bolt or HTTP endpoint until it gives a successful response.

The details of how to poll depend on:

  • Whether the client uses HTTP or Bolt.
  • Whether encryption or authentication are enabled.

It’s important to include a timeout in case Neo4j fails to start. Normally ten seconds should be sufficient, but database recovery or upgrade may take much longer depending on the size of the store. If the instance is part of a cluster then the endpoint will not be available until other instances have started up and the cluster has formed.

Here is an example of polling written in Bash using the HTTP endpoint, with encryption and authentication disabled.

end="$((SECONDS+10))"
while true; do
    [[ "200" = "$(curl --silent --write-out %{http_code} --output /dev/null http://localhost:7474)" ]] && break
    [[ "${SECONDS}" -ge "${end}" ]] && exit 1
    sleep 1
done
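For Bolt, a similar sketch can poll the port directly from Bash; the function below uses Bash's built-in /dev/tcp pseudo-device, and the name wait_for_bolt is illustrative, not part of Neo4j.

```shell
# Illustrative sketch: wait until a TCP port (e.g. Bolt on 7687) accepts
# connections, or give up after a timeout in seconds.
wait_for_bolt() {
    local host="$1" port="$2" timeout="$3"
    local end=$((SECONDS + timeout))
    # Opening /dev/tcp/<host>/<port> in a subshell attempts a TCP connection.
    until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
        if (( SECONDS >= end )); then
            return 1
        fi
        sleep 1
    done
}
```

For example, `wait_for_bolt localhost 7687 10 || exit 1` gives up after roughly ten seconds, mirroring the HTTP example above.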

2.5.2. Setting the number of open files

Linux platforms impose an upper limit on the number of files a user may have open concurrently. This number is reported for the current user and session with the ulimit -n command:

user@localhost:~$ ulimit -n
1024

The usual default of 1024 is often not enough. This is especially true when many indexes are used or a server installation handles many simultaneous connections. Network sockets count against the limit as well. Users are therefore encouraged to increase the limit to a healthy value of 40 000 or more, depending on usage patterns. It is possible to set the limit with the ulimit command, but only for the root user, and it only affects the current session. To set the value system-wide, follow the instructions for your platform.
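A quick way to inspect both the soft and hard limits for the current session is sketched below; the 40 000 threshold matches the recommendation above.

```shell
# Show the current soft and hard open-file limits for this session.
soft_limit=$(ulimit -Sn)
hard_limit=$(ulimit -Hn)
echo "soft=${soft_limit} hard=${hard_limit}"

# Warn when the soft limit is below the recommended 40000.
if [ "${soft_limit}" != "unlimited" ] && [ "${soft_limit}" -lt 40000 ]; then
    echo "open file limit below 40000; consider raising it for the neo4j user"
fi
```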

What follows is the procedure to set the open file descriptor limit to 40 000 for user neo4j under Ubuntu 10.04 and later.

If you opted to run the neo4j service as a different user, change the first field in step 2 accordingly.

  1. Become root, since all operations that follow require editing protected system files.

    user@localhost:~$ sudo su -
    Password:
    root@localhost:~$
  2. Edit /etc/security/limits.conf and add these two lines:

    neo4j	soft	nofile	40000
    neo4j	hard	nofile	40000
  3. Edit /etc/pam.d/su and uncomment or add the following line:

    session    required   pam_limits.so
  4. A restart is required for the settings to take effect.

    After the above procedure, the neo4j user will have a limit of 40 000 simultaneous open files. If you continue experiencing exceptions on Too many open files or Could not stat() directory, you may have to raise the limit further.

2.5.3. Setup for remote debugging

In order to configure the Neo4j server for remote debugging sessions, the Java debugging parameters need to be passed to the Java process through the configuration. They live in the conf/neo4j-wrapper.conf file.

In order to specify the parameters, add a line for the additional Java arguments like this:

dbms.jvm.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

This configuration will start a Neo4j server ready for remote debugging attachment at localhost on port 5005. Use these parameters to attach to the process from Eclipse, IntelliJ or your remote debugger of choice after starting the server.
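With the server started this way, a debugger can attach to the listed port. As a minimal sketch using the JDK's command-line debugger (the address matches the configuration above):

```
jdb -attach localhost:5005
```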

2.5.4. Usage Data Collector

The Neo4j Usage Data Collector is a sub-system that gathers usage data, reporting it to the UDC-server at udc.neo4j.org. It is easy to disable, and does not collect any data that is confidential. For more information about what is being sent, see below.

The Neo4j team uses this information as a form of automatic, effortless feedback from the Neo4j community. We want to verify that we are doing the right thing by matching download statistics with usage statistics. After each release, we can see if there is a larger retention span of the server software.

The data collected is clearly stated here. If any future versions of this system collect additional data, we will clearly announce those changes.

The Neo4j team is very concerned about your privacy. We do not disclose any personally identifiable information.

2.5.4.1. Technical Information

To gather good statistics about Neo4j usage, UDC collects this information:

  • Kernel version: The build number, and if there are any modifications to the kernel.
  • Store id: A randomized globally unique id created at the same time a database is created.
  • Ping count: UDC holds an internal counter which is incremented for every ping, and reset for every restart of the kernel.
  • Source: This is either "neo4j" or "maven". If you downloaded Neo4j from the Neo4j website, it’s "neo4j", if you are using Maven to get Neo4j, it will be "maven".
  • Java version: The referrer string shows which version of Java is being used.
  • Registration id: For registered server instances.
  • Tags about the execution context (e.g. test, language, web-container, app-container, spring, ejb).
  • Neo4j Edition (community, enterprise).
  • A hash of the current cluster name (if any).
  • Distribution information for Linux (rpm, dpkg, unknown).
  • User-Agent header for tracking usage of REST client drivers
  • MAC address to uniquely identify instances behind firewalls.
  • The number of processors on the server.
  • The amount of memory on the server.
  • The JVM heap size.
  • The number of nodes, relationships, labels and properties in the database.

After startup, UDC waits for ten minutes before sending the first ping. It does this for two reasons: first, we don’t want the startup to be slower because of UDC, and second, we want to keep pings from automatic tests to a minimum. The ping to the UDC servers is done with an HTTP GET.

2.5.4.2. How to disable UDC

UDC is easily turned off by disabling it in the database configuration, in neo4j.conf for Neo4j server or in the configuration passed to the database in embedded mode. See UDC Configuration in the configuration section for details.

2.5.5. Configuring Neo4j connectors

Three different Neo4j connectors are configured by default:

  • One Bolt connector with the default name of bolt.
  • One HTTP connector with the default name of http.
  • One HTTPS connector with the default name of https.

The following shows the default configuration for connectors:

# Bolt connector
dbms.connector.bolt.type=BOLT
dbms.connector.bolt.enabled=true
dbms.connector.bolt.tls_level=OPTIONAL
# To have Bolt accept non-local connections, uncomment this line
# dbms.connector.bolt.address=0.0.0.0:7687

# HTTP Connector
dbms.connector.http.type=HTTP
dbms.connector.http.enabled=true
#dbms.connector.http.encryption=NONE
# To have HTTP accept non-local connections, uncomment this line
#dbms.connector.http.address=0.0.0.0:7474

# HTTPS Connector
dbms.connector.https.type=HTTP
dbms.connector.https.enabled=true
dbms.connector.https.encryption=TLS
dbms.connector.https.address=localhost:7473

The sections below describe the connectors and how they can be modified.

2.5.5.1. Bolt

Bolt connectors are ports that accept connections via the Bolt database protocol. This is the protocol used by the official Neo4j drivers. There must be at least one Bolt connector.

Neo4j can also be configured with multiple Bolt connectors, which allows for separate remote and local connections that may have different encryption requirements. Each connector has a unique name to identify it, denoted <bolt-connector-name> below. For example, a connector intended for external use may be named "bolt-public". The name of the Bolt connector in the default configuration is bolt.

Table 2.2. Configuration options for Bolt connectors. <bolt-connector-name> is a placeholder for a unique name for the connector.
Name Description Valid values Default value

dbms.connector.<bolt-connector-name>.address

Address the connector should bind to.

<host>:<port>

localhost:7687

dbms.connector.<bolt-connector-name>.enabled

Enable this connector.

true or false

false

dbms.connector.<bolt-connector-name>.tls_level

Encryption level to require this connector to use.

REQUIRED, OPTIONAL, or DISABLED

OPTIONAL

dbms.connector.<bolt-connector-name>.type

Connector type.

BOLT or HTTP

The value BOLT is mandatory to configure the Bolt connector.
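Combining the options above, a second Bolt connector for external traffic might be declared as follows. The name bolt-public, the port 7688 and the TLS requirement are illustrative assumptions, not defaults:

```
# Hypothetical additional Bolt connector named "bolt-public".
dbms.connector.bolt-public.type=BOLT
dbms.connector.bolt-public.enabled=true
dbms.connector.bolt-public.tls_level=REQUIRED
dbms.connector.bolt-public.address=0.0.0.0:7688
```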

2.5.5.2. HTTP

HTTP connectors expose Neo4j’s HTTP endpoints. HTTPS connectors are configured by setting a connector to require encryption. There must be exactly one HTTP connector and zero or one HTTPS connectors.

Each connector has a unique name to identify it, denoted <http-connector-name> below. For example, a connector intended for external use may be named "http-public". The name of the HTTP connector in the default configuration is http, and the name of the HTTPS connector in the default configuration is https.

Table 2.3. Configuration options for HTTP connectors. <http-connector-name> is a placeholder for a unique name for the connector.
Name Description Valid values Default value

dbms.connector.<http-connector-name>.address

Address the connector should bind to.

<host>:<port>

localhost:7474

dbms.connector.<http-connector-name>.enabled

Enable this connector.

true or false

false

dbms.connector.<http-connector-name>.encryption

Enable TLS for this connector.

NONE or TLS

NONE

dbms.connector.<http-connector-name>.type

Connector type.

BOLT or HTTP

The value HTTP is mandatory to configure the HTTP connector.

2.6. Upgrade

2.6.1. Single-instance upgrade

This section describes upgrading a single Neo4j instance. To upgrade a Neo4j HA cluster (Neo4j Enterprise Edition), a very specific procedure must be followed. Please see Section 2.6.2, “Neo4j cluster upgrade”.

Throughout this instruction, the files used to store the Neo4j data are referred to as database files. These files are found in the directory specified by dbms.directories.data in neo4j.conf.

An upgrade requires substantial free disk space, as it makes an entire copy of the database. The upgraded database may also require larger data files overall.

It is recommended to make available an extra 50% disk space on top of the existing database files. Determine the pre-upgrade database size by summing up the sizes of the NEO4J_HOME/data/databases/graph.db/*store.db* files. Then add 50% to this number.

In addition to this, don’t forget to reserve the disk space needed for the pre-upgrade backup.
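The estimate above can be scripted. The sketch below assumes GNU du (for the -b apparent-size flag) and the default store location; the helper name estimate_upgrade_space is hypothetical.

```shell
# Illustrative sketch: sum the store file sizes and add 50% headroom.
estimate_upgrade_space() {
    local db_dir="$1"
    local bytes
    # Sum apparent sizes of all *store.db* files; empty if none exist.
    bytes=$(du -cb "${db_dir}"/*store.db* 2>/dev/null | awk 'END { print $1 }')
    bytes=${bytes:-0}
    # Pre-upgrade size plus 50%, in bytes.
    echo $(( bytes + bytes / 2 ))
}
```

For example, `estimate_upgrade_space "$NEO4J_HOME/data/databases/graph.db"` prints the recommended free space in bytes, not counting the pre-upgrade backup.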

2.6.1.1. Supported upgrade paths

Before upgrading to a new major or minor release, the database must first be upgraded to the latest version within the relevant release. The latest version is available at this page: http://neo4j.com/download/other-releases. The following Neo4j upgrade paths are supported:

  • 2.0.latest → 3.0.4
  • 2.1.latest → 3.0.4
  • 2.2.latest → 3.0.4
  • 2.3.latest → 3.0.4
  • 3.0.any → 3.0.4

2.6.1.2. Upgrade instructions

2.6.1.2.1. Upgrade from 2.x
  1. Cleanly shut down the database if it is running.
  2. Make a backup copy of the database files. If using the online backup tool available with Neo4j Enterprise Edition, ensure that backups have completed successfully.
  3. Install Neo4j 3.0.4.
  4. Review the settings in the configuration files of the previous installation and transfer any custom settings to the 3.0.4 installation. Since many settings have been changed between Neo4j 2.x and 3.0.4, it is advisable to use the config-migrator to migrate the config files for you. The config-migrator can be found in the tools directory, and can be invoked with a command like: java -jar config-migrator.jar path/to/neo4j2.3 path/to/neo4j3.0. Take note of any warnings printed, and manually review the edited config files produced.
  5. Import your data from the old installation using neo4j-admin import --mode=database --database=<database-name> --from=<source-directory>.
  6. If the database is not called graph.db, set dbms.active_database in neo4j.conf to the name of the database.
  7. Set dbms.allow_format_migration=true in neo4j.conf of the 3.0.4 installation. Neo4j will fail to start without this configuration.
  8. Start up Neo4j 3.0.4.
  9. The database upgrade will take place during startup.
  10. Information about the upgrade and a progress indicator are logged into debug.log.
  11. When the upgrade has finished, dbms.allow_format_migration should be set to false or be removed.
  12. It is good practice to make a full backup immediately after the upgrade.

The Cypher language may evolve between Neo4j versions. For backward compatibility, Neo4j provides directives which allow explicitly selecting a previous Cypher language version. This is possible to do globally or for individual statements, as described in the Neo4j Developer Manual.

2.6.1.2.2. Upgrade from 3.x
  1. Cleanly shut down the database if it is running.
  2. Make a backup copy of the database files. If using the online backup tool available with Neo4j Enterprise Edition, ensure that backups have completed successfully.
  3. Install Neo4j 3.0.4.
  4. Review the settings in the configuration files of the previous installation and transfer any custom settings to the 3.0.4 installation.
  5. When using the default data directory, copy it from the old installation to the new. If databases are stored in a custom location, configure Table A.21, “dbms.directories.data” for the new installation to point to this custom location.
  6. If the database is not called graph.db, set dbms.active_database in neo4j.conf to the name of the database.
  7. Set dbms.allow_format_migration=true in neo4j.conf of the 3.0.4 installation. Neo4j will fail to start without this configuration.
  8. Start up Neo4j 3.0.4.
  9. The database upgrade will take place during startup.
  10. Information about the upgrade and a progress indicator are logged into debug.log.
  11. When the upgrade has finished, dbms.allow_format_migration should be set to false or be removed.
  12. It is good practice to make a full backup immediately after the upgrade.

2.6.2. Neo4j cluster upgrade

Upgrading a Neo4j HA cluster to Neo4j 3.0.4 requires following a specific process in order to ensure that the cluster remains consistent, and that all cluster instances are able to join and participate in the cluster following their upgrade. Neo4j 3.0.4 does not support rolling upgrades.

2.6.2.1. Back up the Neo4j database

  • Before starting any upgrade procedure, it is very important to make a full backup of your database.
  • For detailed instructions on backing up your Neo4j database, refer to the backup guide.

2.6.2.2. Shut down the cluster

  • Shut down the slave instances one by one.
  • Shut down the master last.

2.6.2.3. Upgrade the master

  1. Install Neo4j 3.0.4 on the master, keeping the database files untouched.
  2. Disable HA in the configuration, by setting dbms.mode=SINGLE in neo4j.conf.
  3. Upgrade as described for a single instance of Neo4j.
  4. When upgrade has finished, shut down Neo4j again.
  5. Re-enable HA in the configuration by setting dbms.mode=HA in neo4j.conf.
  6. Make a full backup of the Neo4j database. Please note that backups from before the upgrade are no longer valid for update via the incremental online backup. Therefore it is important to perform a full backup, using an empty target directory, at this point.

2.6.2.4. Upgrade the slaves

On each slave:

  1. Remove all database files.
  2. Install Neo4j 3.0.4.
  3. Review the settings in the configuration files in the previous installation, and transfer any custom settings to the 3.0.4 installation. Be aware of settings that have changed name between versions.
  4. If the database is not called graph.db, set dbms.active_database in neo4j.conf to the name of the database.
  5. If applicable, copy the security configuration from the master, since this is not propagated automatically.

As an alternative, the database files can at this point be copied manually from the master to the slaves. Doing so avoids the need to sync from the master when starting, which can save considerable time when upgrading large databases.

2.6.2.5. Restart the cluster

  1. Start the master instance.
  2. Start the slaves, one by one. Once a slave has joined the cluster, it will sync the database from the master instance.

2.7. Import tool

The import tool is used to create a new Neo4j database from data in CSV files.

This chapter explains how to use the tool and how to format the input data, and concludes with an example bringing everything together.

These are some things you’ll need to keep in mind when creating your input files:

  • Fields are comma separated by default but a different delimiter can be specified.
  • All files must use the same delimiter.
  • Multiple data sources can be used for both nodes and relationships.
  • A data source can optionally be provided using multiple files.
  • A header which provides information on the data fields must be on the first row of each data source.
  • Fields without corresponding information in the header will not be read.
  • UTF-8 encoding is used.

Indexes are not created during the import. Instead, you will need to add indexes afterwards (see Developer Manual → Indexes).

Data cannot be imported into an existing database using this tool. If you want to load small to medium sized CSV files use LOAD CSV (see Developer Manual → LOAD CSV).

2.7.1. CSV file header format

The header row of each data source specifies how the fields should be interpreted. The same delimiter is used for the header row as for the rest of the data.

The header contains information for each field, with the format: <name>:<field_type>. The <name> is used as the property key for values, and ignored in other cases. The following <field_type> settings can be used for both nodes and relationships:

Property value
Use one of int, long, float, double, boolean, byte, short, char, string to designate the data type. If no data type is given, this defaults to string. To define an array type, append [] to the type. By default, array values are separated by ;. A different delimiter can be specified with --array-delimiter.
IGNORE
Ignore this field completely.
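For instance, a header combining these settings might look as follows; the field names are illustrative:

```
name,age:int,tags:string[],internalNote:IGNORE
```

Here name defaults to string, age is typed as int, tags is a string array whose values are split on ;, and internalNote is skipped entirely.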

See below for the specifics of node and relationship data source headers.

2.7.1.1. Nodes

The following field types additionally apply to node data sources:

ID
Each node must have a unique id which is used during the import. The ids are used to find the correct nodes when creating relationships. Note that the id has to be unique across all nodes in the import, even nodes with different labels.
LABEL
Read one or more labels from this field. Like array values, multiple labels are separated by ;, or by the character specified with --array-delimiter.

2.7.1.2. Relationships

For relationship data sources, there are three mandatory fields:

TYPE
The relationship type to use for the relationship.
START_ID
The id of the start node of the relationship to create.
END_ID
The id of the end node of the relationship to create.

2.7.1.3. ID spaces

The import tool assumes that node identifiers are unique across node files. If this isn’t the case then we can define an id space. Id spaces are defined in the ID field of node files.

For example, to specify the Person id space we would use the field type ID(Person) in our persons node file. We also need to reference that id space in our relationships file i.e. START_ID(Person) or END_ID(Person).
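As a sketch, the corresponding headers could look like this; the file names and the Movie id space are illustrative:

```
persons_header.csv:
personId:ID(Person),name

movies_header.csv:
movieId:ID(Movie),title

roles_header.csv:
:START_ID(Person),role,:END_ID(Movie),:TYPE
```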

2.7.2. Command line usage

2.7.2.1. Linux

Under Unix/Linux/OSX, the command is named neo4j-import. Depending on the installation type, the tool is either available globally, or used by executing ./bin/neo4j-import from inside the installation directory.

2.7.2.2. Windows

Under Windows, the tool is used by executing bin\neo4j-import from inside the installation directory.

For help with running the import tool under Windows, see the reference in Windows.

2.7.2.3. Options

--into <store-dir>
Database directory to import into. Must not contain an existing database.
--nodes[:Label1:Label2] "<file1>,<file2>,…​"
Node CSV header and data. Multiple files will be logically seen as one big file from the perspective of the importer. The first line must contain the header. Multiple data sources like these can be specified in one import, where each data source has its own header. Note that file groups must be enclosed in quotation marks.
--relationships[:RELATIONSHIP_TYPE] "<file1>,<file2>,…​"
Relationship CSV header and data. Multiple files will be logically seen as one big file from the perspective of the importer. The first line must contain the header. Multiple data sources like these can be specified in one import, where each data source has its own header. Note that file groups must be enclosed in quotation marks.
--delimiter <delimiter-character>
Delimiter character, or TAB, between values in CSV data. The default option is ,.
--array-delimiter <array-delimiter-character>
Delimiter character, or TAB, between array elements within a value in CSV data. The default option is ;.
--quote <quotation-character>
Character to treat as quotation character for values in CSV data. The default option is ". Quotes inside quotes escaped like """Go away"", he said." and "\"Go away\", he said." are supported. If you have set ' to be used as the quotation character, you could write the previous example like this instead: '"Go away", he said.'
--multiline-fields <true/false>
Whether or not fields from input source can span multiple lines, i.e. contain newline characters. Default value: false
--input-encoding <character set>
Character set that input data is encoded in. Provided value must be one out of the available character sets in the JVM, as provided by Charset#availableCharsets(). If no input encoding is provided, the default character set of the JVM will be used.
--ignore-empty-strings <true/false>
Whether or not empty string fields ("") from input source are ignored, i.e. treated as null. Default value: false
--id-type <id-type>
One out of [STRING, INTEGER, ACTUAL] and specifies how ids in node/relationship input files are treated. STRING: arbitrary strings for identifying nodes. INTEGER: arbitrary integer values for identifying nodes. ACTUAL: (advanced) actual node ids. Default value: STRING
--processors <max processor count>
(advanced) Maximum number of processors used by the importer. Defaults to the number of available processors reported by the JVM. A certain minimum number of threads is needed, so there is no lower bound for this value. For optimal performance this value shouldn’t be greater than the number of available processors.
--stacktrace <true/false>
Enable printing of error stack traces.
--bad-tolerance <max number of bad entries>
Number of bad entries before the import is considered failed. This tolerance threshold is about relationships referring to missing nodes. Format errors in input data are still treated as errors. Default value: 1000
--skip-bad-relationships <true/false>
Whether or not to skip importing relationships that refer to missing node ids, i.e. where either the start or end node id/group refers to a node that wasn’t specified by the node input data. Skipped relationships will be logged, containing at most the number of entities specified by bad-tolerance. Default value: true
--skip-duplicate-nodes <true/false>
Whether or not to skip importing nodes that have the same id/group. In the event of multiple nodes within the same group having the same id, the first encountered will be imported whereas subsequent such nodes will be skipped. Skipped nodes will be logged, containing at most the number of entities specified by bad-tolerance. Default value: false
--ignore-extra-columns <true/false>
Whether or not to ignore extra columns in the data not specified by the header. Skipped columns will be logged, containing at most the number of entities specified by bad-tolerance. Default value: false
--db-config <path/to/neo4j.conf>

(advanced) File specifying database-specific configuration. For more information, consult the manual about available configuration options for a neo4j configuration file. Only configuration affecting the store at time of creation will be read.

2.7.2.4. Output and statistics

While an import is running through its different stages, some statistics and figures are printed in the console. To interpret the output, look at the horizontal line, which is divided into sections, each section representing one type of work going on in parallel with the other sections. The wider a section is, the more time is spent there relative to the other sections; the widest section is the bottleneck, and is also marked with *. If a section has a double line instead of a single line, multiple threads are executing the work in that section. To the far right, a number is displayed telling how many entities (nodes or relationships) have been processed by that stage.

As an example:

[*>:20,25 MB/s------------------|PREPARE(3)====================|RELATIONSHIP(2)===============] 16M

Would be interpreted as:

  • > data being read, and perhaps parsed, at 20,25 MB/s, which is passed on to …​
  • PREPARE preparing the data for …​
  • RELATIONSHIP creating actual relationship records and …​
  • v writing the relationships to the store. This step isn’t visible in this example, because it’s so cheap compared to the other sections.

Observing the section sizes can give hints about where performance can be improved. In the example above, the bottleneck is the data read section (marked with >), which might indicate that the disk is being slow, or is poorly handling simultaneous read and write operations (since the last section often revolves around writing to disk).

2.7.2.5. Verbose error information

In some cases if an unexpected error occurs it might be useful to supply the command line option --stacktrace to the import (and rerun the import to actually see the additional information). This will have the error printed with additional debug information, useful for both developers and issue reporting.

2.7.3. Import tool examples

Let’s look at a few examples. We’ll use a data set containing movies, actors and roles.

While you’ll usually want to store your node identifier as a property on the node for looking it up later, it’s not mandatory. If you don’t want the identifier to be persisted then don’t specify a property name in the :ID field.

2.7.3.1. Basic example

First we’ll look at the movies. Each movie has an id, which is used to refer to it in other data sources, a title and a year. Along with these properties we’ll also add the node labels Movie and Sequel.

By default the import tool expects CSV files to be comma delimited.

movies.csv. 

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

Next up are the actors. They have an id - in this case a shorthand - and a name and all have the Actor label.

actors.csv. 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

Finally we have the roles that an actor plays in a movie, which will be represented by relationships in the database. In order to create a relationship between nodes we refer to the ids used in actors.csv and movies.csv in the START_ID and END_ID fields. We also need to provide a relationship type (in this case ACTED_IN) in the :TYPE field.

roles.csv. 

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

With all data in place, we execute the following command:

neo4j-import --into path_to_target_directory --nodes movies.csv --nodes actors.csv --relationships roles.csv

We’re now ready to start up a database from the target directory. (see Section 2.3, “Single-instance installation”)

Once we’ve got the database up and running we can add appropriate indexes. (see Developer Manual → Constraints and indexes.)

It is possible to import only nodes using the import tool - just don’t specify a relationships file when calling neo4j-import. If you do this you’ll need to create relationships later by another method - the import tool only works for initial graph population.

2.7.3.2. Customizing configuration options

We can customize the configuration options that the import tool uses (see Section 2.7.2.3, “Options”) if our data doesn’t fit the default format. The following CSV files are delimited by ;, use | as their array delimiter and use ' for quotes.

movies2.csv. 

movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel

actors2.csv. 

personId:ID;name;:LABEL
keanu;'Keanu Reeves';Actor
laurence;'Laurence Fishburne';Actor
carrieanne;'Carrie-Anne Moss';Actor

roles2.csv. 

:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
keanu;'Neo';tt0242653;ACTED_IN
laurence;'Morpheus';tt0133093;ACTED_IN
laurence;'Morpheus';tt0234215;ACTED_IN
laurence;'Morpheus';tt0242653;ACTED_IN
carrieanne;'Trinity';tt0133093;ACTED_IN
carrieanne;'Trinity';tt0234215;ACTED_IN
carrieanne;'Trinity';tt0242653;ACTED_IN

We can then import these files with the following command line options:

neo4j-import --into path_to_target_directory --nodes movies2.csv --nodes actors2.csv --relationships roles2.csv --delimiter ";" --array-delimiter "|" --quote "'"

2.7.3.3. Using separate header files

When dealing with very large CSV files it’s more convenient to have the header in a separate file. This makes it easier to edit the header as you avoid having to open a huge data file just to change it.

The import tool can also process single-file compressed archives, e.g. --nodes nodes.csv.gz or --relationships rels.zip

We’ll use the same data as in the previous example but put the headers in separate files.

movies3-header.csv. 

movieId:ID,title,year:int,:LABEL

movies3.csv. 

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors3-header.csv. 

personId:ID,name,:LABEL

actors3.csv. 

keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles3-header.csv. 

:START_ID,role,:END_ID,:TYPE

roles3.csv. 

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Note how the file groups are enclosed in quotation marks in the command:

neo4j-import --into path_to_target_directory --nodes "movies3-header.csv,movies3.csv" --nodes "actors3-header.csv,actors3.csv" --relationships "roles3-header.csv,roles3.csv"

2.7.3.4. Multiple input files

As well as using a separate header file, you can also provide multiple nodes or relationships files. This may be useful when processing the output from a Hadoop pipeline, for example. Files within such an input group can be specified with multiple match strings, delimited by ,, where each match string can be either an exact file name or a regular expression matching one or more files. Multiple matching files will be sorted by character order, using natural number ordering for file names that contain numbers.

movies4-header.csv. 

movieId:ID,title,year:int,:LABEL

movies4-part1.csv. 

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel

movies4-part2.csv. 

tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors4-header.csv. 

personId:ID,name,:LABEL

actors4-part1.csv. 

keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor

actors4-part2.csv. 

carrieanne,"Carrie-Anne Moss",Actor

roles4-header.csv. 

:START_ID,role,:END_ID,:TYPE

roles4-part1.csv. 

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN

roles4-part2.csv. 

laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

The call to neo4j-import would look like this:

neo4j-import --into path_to_target_directory --nodes "movies4-header.csv,movies4-part1.csv,movies4-part2.csv" --nodes "actors4-header.csv,actors4-part1.csv,actors4-part2.csv" --relationships "roles4-header.csv,roles4-part1.csv,roles4-part2.csv"
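Since match strings may be regular expressions, the same import could be sketched more compactly. The patterns below are illustrative and assume no other files in the directory match them:

```
neo4j-import --into path_to_target_directory --nodes "movies4-header.csv,movies4-part.*" --nodes "actors4-header.csv,actors4-part.*" --relationships "roles4-header.csv,roles4-part.*"
```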

2.7.3.5. Types and labels

2.7.3.5.1. Using the same label for every node

If you want to use the same node label(s) for every node in your nodes file you can do this by specifying the appropriate value as an option to neo4j-import. In this example we’ll put the label Movie on every node specified in movies5.csv:

movies5.csv. 

movieId:ID,title,year:int
tt0133093,"The Matrix",1999

If you pass the label as a command line option, there is no need to specify the :LABEL field in the node file. If you specify both, the label provided in the file and the one provided on the command line will both be added to the node.

In this case, we’ll put the labels Movie and Sequel on the nodes specified in sequels5.csv.

sequels5.csv. 

movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003

actors5.csv. 

personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"

roles5.csv. 

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

The call to neo4j-import would look like this:

neo4j-import --into path_to_target_directory --nodes:Movie movies5.csv --nodes:Movie:Sequel sequels5.csv --nodes:Actor actors5.csv --relationships roles5.csv

2.7.3.5.2. Using the same relationship type for every relationship

If you want to use the same relationship type for every relationship in your relationships file, you can specify the appropriate value as an option to neo4j-import. In this example we’ll put the relationship type ACTED_IN on every relationship specified in roles6.csv:

movies6.csv. 

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors6.csv. 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles6.csv. 

:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
carrieanne,"Trinity",tt0234215
carrieanne,"Trinity",tt0242653

If you provide a relationship type both on the command line and in the relationships file, the one in the file will be applied.

The call to neo4j-import would look like this:

neo4j-import --into path_to_target_directory --nodes movies6.csv --nodes actors6.csv --relationships:ACTED_IN roles6.csv

2.7.3.6. Property types

The type of properties specified in node and relationship files is defined in the header row (see Section 2.7.1, “CSV file header format”).

The following example creates a small graph containing one actor and one movie connected by an ACTED_IN relationship. There is a roles property on the relationship which contains an array of the characters played by the actor in a movie.

movies7.csv. 

movieId:ID,title,year:int,:LABEL
tt0099892,"Joe Versus the Volcano",1990,Movie

actors7.csv. 

personId:ID,name,:LABEL
meg,"Meg Ryan",Actor

roles7.csv. 

:START_ID,roles:string[],:END_ID,:TYPE
meg,"DeDe;Angelica Graynamore;Patricia Graynamore",tt0099892,ACTED_IN

The arguments to neo4j-import would be the following:

neo4j-import --into path_to_target_directory --nodes movies7.csv --nodes actors7.csv --relationships roles7.csv

2.7.3.7. ID handling

Each node processed by neo4j-import must provide a unique id. We use this id to find the correct nodes when creating relationships.

2.7.3.7.1. Working with sequential or auto-incrementing identifiers

The import tool assumes that identifiers are unique across node files. This may not be the case for data sets that use sequential, auto-incremented, or otherwise colliding identifiers. Such data sets can define id spaces, within which identifiers are unique in their respective id space.

For example, if movies and people both use sequential identifiers, we would define separate Movie and Actor id spaces.

movies8.csv. 

movieId:ID(Movie),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel

actors8.csv. 

personId:ID(Actor),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor

We also need to reference the appropriate id space in our relationships file, so that the tool knows which nodes to connect:

roles8.csv. 

:START_ID(Actor),role,:END_ID(Movie)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3

The command line arguments would remain the same as before:

neo4j-import --into path_to_target_directory --nodes movies8.csv --nodes actors8.csv --relationships:ACTED_IN roles8.csv
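Conceptually, an id space turns each identifier into a (space, id) pair, so that Movie 1 and Actor 1 no longer collide. A minimal Python sketch of that idea (not the tool's internals):

```python
# Node lookup keyed by (id_space, raw_id): the same raw id "1"
# can exist in both spaces without colliding.
nodes = {
    ("Movie", "1"): "The Matrix",
    ("Actor", "1"): "Keanu Reeves",
}

# A relationship row resolves its endpoints within the declared spaces,
# matching :START_ID(Actor) and :END_ID(Movie) in roles8.csv.
start = nodes[("Actor", "1")]
end = nodes[("Movie", "1")]
print(f"{start} ACTED_IN {end}")  # Keanu Reeves ACTED_IN The Matrix
```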

2.7.3.8. Bad input data

The import tool has a threshold for how many bad entities (nodes or relationships) to tolerate and skip before failing the import. By default, 1000 bad entities are tolerated. For example, a bad tolerance of 0 will fail the import on the first bad entity. For more information, see the --bad-tolerance option.

There are different types of bad input, which we will look into.

2.7.3.8.1. Relationships referring to missing nodes

Relationships that refer to missing node ids, either for :START_ID or :END_ID, are considered bad relationships. Whether or not such relationships are skipped is controlled with the --skip-bad-relationships flag, which can have the value true, false, or no value (which means true). Specifying false means that any bad relationship is considered an error and will fail the import. For more information, see the --skip-bad-relationships option.

In the following example there is a missing emil node referenced in the roles file.

movies9.csv. 

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors9.csv. 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles9.csv. 

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN

The command line arguments would remain the same as before:

neo4j-import --into path_to_target_directory --nodes movies9.csv --nodes actors9.csv --relationships roles9.csv

Since there was only one bad relationship, the import process will complete successfully and a not-imported.bad file will be created and populated with the bad relationship.

not-imported.bad. 

InputRelationship:
   source: roles9.csv:11
   properties: [role, Emil]
   startNode: emil
   endNode: tt0133093
   type: ACTED_IN
 refering to missing node emil
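A pre-import sanity check along these lines can be sketched in Python. The data mirrors actors9.csv and roles9.csv above (held in memory for brevity); the check itself is hypothetical, not part of the tool:

```python
import csv
import io

# In-memory stand-ins for the id columns of actors9.csv and roles9.csv.
actors = "keanu\nlaurence\ncarrieanne\n"
roles = "keanu,Neo,tt0133093\nemil,Emil,tt0133093\n"

known_ids = {row[0] for row in csv.reader(io.StringIO(actors))}

# Collect relationship rows whose start node id is unknown -- these are
# the rows the import tool would report in not-imported.bad.
bad = [row for row in csv.reader(io.StringIO(roles))
       if row[0] not in known_ids]
print(bad)  # [['emil', 'Emil', 'tt0133093']]
```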

2.7.3.8.2. Multiple nodes with same id within same id space

Nodes that specify an :ID which has already been seen within the id space are considered bad nodes. Whether or not such nodes are skipped is controlled with the --skip-duplicate-nodes flag, which can have the value true, false, or no value (which means true). Specifying false means that any duplicate node is considered an error and will fail the import. For more information, see the --skip-duplicate-nodes option.

In the following example, a node id is specified twice within the same id space.

actors10.csv. 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor

neo4j-import --into path_to_target_directory --nodes actors10.csv --skip-duplicate-nodes

Since there was only one bad node, the import process will complete successfully and a not-imported.bad file will be created and populated with the bad node.

not-imported.bad. 

Id 'laurence' is defined more than once in global id space, at least at actors10.csv:3 and actors10.csv:5