Importing CSV data into Neo4j

This article demonstrates different approaches to importing CSV data into Neo4j and provides solutions to potential issues that might arise during the process.

CSV is a file of comma-separated values, often viewed in Excel or some other spreadsheet tool. There can be other types of values as the delimiter, but the most standard is the comma. Many systems and processes today already convert their data into CSV format for file outputs to other systems, human-friendly reports, and other needs. It is a standard file format that humans and systems are already familiar with using and handling.

Ways to import CSV files

There are a few different approaches to get CSV data into Neo4j, each with varying criteria and functionality. The option you choose depends on the dataset size, as well as your degree of comfort with various tools.

Let us see some of the ways Neo4j can read and import CSV files.

  1. LOAD CSV Cypher® command: this command is a great starting point and handles small- to medium-sized datasets (up to 10 million records). Works with any setup, including AuraDB.

  2. The neo4j-admin database import command: command-line tool useful for straightforward loading of large datasets. Works with Neo4j Desktop, Neo4j EE Docker image and local installations.

  3. Neo4j ETL tool: Neo4j Labs project. For more details and documentation, visit Neo4j ETL Tool page.

  4. Kettle import tool: maps and executes steps for the data process flow and works well for very large datasets, especially if you are already familiar with using this tool. Works with any setup, including AuraDB.

In the below section, you can find a brief oview of the LOAD CSV Cypher command and neo4j-admin database import command, how they operate, and how to get started with a general use case. Data quality can also be an issue for any type of data import to any system, so a few of these potential difficulties and ways to overcome them will be covered in this section.

LOAD CSV command with Cypher

The LOAD CSV clause is a part of the Cypher query language. For more information about the LOAD CSV clause, see the Cypher Manual → LOAD CSV. It is widely applicable. LOAD CSV is not simply a basic data ingestion mechanism. It performs multiple actions in a single operation:

  • Supports loading/ingesting CSV data from a URI.

  • Directly maps input data into complex graph/domain structures.

  • Handles data conversion.

  • Supports complex computations.

  • Creates or merges entities, relationships, and structures.

For better control, you can run LOAD CSV command with Cypher Shell instead of in Neo4j Browser. For more information about Cypher Shell, see Operations Manual → Cypher shell.

Reading CSV files

LOAD CSV can handle local and remote files, and there is some syntax associated with each. This can be an easy thing to miss and end up with an access error, so the rules are clarified here.

Local files may be loaded using a file:/// prefix before the file name.
Since AuraDB is cloud based, this local file approach does not work with AuraDB.

Due to security reasons, local files, by default, can only be read from the Neo4j import directory, which is different based on your operating system. File locations for each OS are listed in our Neo4j Operations Manual → File locations. It is recommended to put files in Neo4j’s import directory, as it keeps the environment secure. However, if you need to access files in other locations, you can find out which setting to alter in our Cypher manual → LOAD CSV introduction.

Examples
//Example 1 - file directly placed in import directory (import/data.csv)
LOAD CSV FROM "file:///data.csv"

//Example 2 - file placed in subdirectory within import directory (import/northwind/customers.csv)
LOAD CSV FROM "file:///northwind/customers.csv"

Web-hosted files can be referenced directly with their URL, like https://host/path/data.csv. However, permissions must be set so that an external source can read the file. To read files from your local file system you need to check that the configuration setting dbms.security.allow_csv_import_from_file_urls is set to true. For more information about access related to online file imports, see this Knowledge Base article. But keep in mind that in Neo4j v5 configuration settings have been renamed, and dbms.directories.import was changed to server.directories.import.

Examples
//Example 1 - website
LOAD CSV FROM 'https://data.neo4j.com/northwind/customers.csv'

//Example 2 - Google
LOAD CSV WITH HEADERS FROM 'https://docs.google.com/spreadsheets/d/<yourFilePath>/export?format=csv'

Important tips for LOAD CSV

There are a few things to keep in mind with LOAD CSV and a few helpful tips for handling the variety of data scenarios you are likely to encounter.

  • All data from the CSV file is read as a string, so you need to use toInteger(), toFloat(), split(), or similar functions to convert values.

  • Check your Cypher import statement for typos. Labels, property names, relationship types, and variables are case-sensitive.

  • The cleaner the data, the easier the load. Try to handle complex cleanup/manipulation before load.

Converting data values with LOAD CSV

Cypher has some scrubbing and conversion capabilities to help with data cleanup. These are extremely useful for handling missing data or splitting a field into multiple values for the graph.

First, remember that Neo4j does not store null values. Null or empty fields in a CSV files can be skipped or replaced with default values in LOAD CSV.

Suppose you have this CSV file:

companies.csv
Id,Name,Location,Email,BusinessType
1,Neo4j,San Mateo,contact@neo4j.com,P
2,AAA,,info@aaa.com,
3,BBB,Chicago,,G

The default location for CSV files for import is the import directory for your Neo4j instance.

Here are some examples of importing this data.

Examples
//skip null values
LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
WITH row WHERE row.Id IS NOT NULL
MERGE (c:Company {companyId: row.Id});

// clear data
MATCH (n:Company) DELETE n;

//set default for null values
LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})

// clear data
MATCH (n:Company) DELETE n;

//change empty strings to null values (not stored)
LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
MERGE (c:Company {companyId: row.Id})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END

Next, if you have a field in the CSV that is a list of items that you want to split, you can use the Cypher split() function to separate arrays in a cell.

Suppose you have this CSV file:

employees.csv
Id,Name,Skills,Email
1,Joe Smith,Cypher:Java:JavaScript,joe@neo4j.com
2,Mary Jones,Java,mary@neo4j.com
3,Trevor Scott,Java:JavaScript,trevor@neo4j.com
Example
LOAD CSV WITH HEADERS FROM 'file:///employees.csv' AS row
MERGE (e:Employee {employeeId: row.Id, email: row.Email})
WITH e, row
UNWIND split(row.Skills, ':') AS skill
MERGE (s:Skill {name: skill})
MERGE (e)-[r:HAS_EXPERIENCE]->(s)

Conditional conversions can be achieved with CASE. You saw one example of this when we were checking for null values or empty strings, but let us look at another example.

Example
// clear data
MATCH (n:Company) DELETE n;

//set businessType property based on shortened value in CSV
LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
WITH row WHERE row.Id IS NOT NULL
WITH row,
(CASE row.BusinessType
 WHEN 'P' THEN 'Public'
 WHEN 'R' THEN 'Private'
 WHEN 'G' THEN 'Government'
 ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END
SET c.businessType = type
RETURN *

Optimizing LOAD CSV for performance

Often, there are ways to improve performance during data load, which are especially helpful when dealing with large amounts of data or complex loading.

To improve inserting or updating unique entities into your graph (using MERGE or MATCH with updates), you can create indexes and constraints declared for each of the labels and properties you plan to merge or match on.

For best performance, always MATCH and MERGE on a single label with the indexed primary-key property.

Suppose you use the preceding companies.csv file, and now you have a file that contains people and which companies they work for:

people.csv
employeeId,Name,Company
1,Bob Smith,1
2,Joe Jones,3
3,Susan Scott,2
4,Karen White,1

You should also separate node and relationship creation into separate processing. For instance, instead of the following:

MERGE (e:Employee {employeeId: row.employeeId})
MERGE (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)

You can write it like this:

// clear data
MATCH (n)
DETACH DELETE n;
// load Employee nodes
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId, name: row.Name})
RETURN count(e);
// load Company nodes
LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
WITH row WHERE row.Id IS NOT NULL
WITH row,
(CASE row.BusinessType
 WHEN 'P' THEN 'Public'
 WHEN 'R' THEN 'Private'
 WHEN 'G' THEN 'Government'
 ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END
SET c.businessType = type
RETURN count(c);
// create relationships
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.Company})
MERGE (e)-[:WORKS_FOR]->(c)
RETURN *;

This way, the load is only doing one piece of the import at a time and can move through large amounts of data quickly and efficiently, reducing heavy processing.

When the amount of data being loaded is too much to fit into memory, there are a couple of different approaches you can use to combat running out of memory during the data load.

  1. Batch the import into sections with CALL { …​ } IN TRANSACTIONS.

    This subquery can be added after the LOAD CSV clause to tell Cypher to only process so many rows of the file before clearing memory and the transaction state. For more information, see Cypher Manual → Subqueries.

    Example
    LOAD CSV FROM 'file:///people.csv' AS line
    CALL {
     WITH line
     MATCH (e:Employee {id: line[0]})
     CREATE (e)-[:REL {prop: line[1]}]->(e)
    } IN TRANSACTIONS OF 100000 ROWS;
  2. Avoid the Eager operator.

    Some statements pull in more rows than it is necessary, adding extra processing up front. To avoid this, you can run PROFILE on your queries to see if they use Eager loading and either modify queries or run multiple passes on the same file, so it does not do this. For more information about the Eager operator, see the Cypher manual → Execution plan operators in detail.

  3. Adjust configuration for the database on heap and memory to avoid page-faults.

    To help handle larger volumes of transactions, you can increase some configuration settings for the database and restart the instance for them to take effect. Usually, you can create or update 1 million records in a single transaction per 2 GB of heap. In neo4j.conf:

    • server.memory.heap.initial_size and server.memory.heap.max_size: set to at least 4G.

    • server.memory.pagecache.size: ideally, value large enough to keep the whole database in memory.

The neo4j-admin database import command

LOAD CSV is great for importing small- or medium-sized datasets (up to 10 million records). For datasets larger than this, you can use the neo4j-admin database import command. This allows you to import CSV data to an unused database by specifying node files and relationship files.

The neo4j-admin database import command can be used for the initial graph population only.

Suppose you want to import order data via neo4j-admin database import into a Neo4j instance. Notice that some of the following CSV files include headers and some have separate header files. If you want to perform the import, you place them in the import folder for your Neo4j instance.

customers.csv
customerId:ID(Customer), name
23, Delicatessen Inc
42, Delicious Bakery
products.csv
productId:ID(Product), name, price, :LABEL
11,Chocolate,10,Product;Food
orders_header.csv
orderId:ID(Order),date,total,customerId:IGNORE
customer_orders_header.csv
:END_ID(Order),date:IGNORE,total:IGNORE,:START_ID(Customer)
orders1.csv
1041,2020-05-10,130,23
orders2.csv
1042,2020-05-12,20,42
order_details.csv
:START_ID(Order),amount,price,:END_ID(Product)
1041,13,130,11
1042,2,20,11

The neo4j-admin database import command has two modes:

  • full — used to initially import data into a non-existent empty database.

  • incremental — used to incrementally import data into an existing database.

The tool is located in <neo4j-instance-location>/bin/neo4j-admin and you run the command in a terminal window where you have navigated to the import directory for your Neo4j instance.

Here is an example of importing the preceding CSV files in Neo4j 5.x. You must specify the name of the database. In this case we specify orders.

bin/neo4j-admin database import full
     --nodes=Customer=import/customers.csv
     --nodes=import/products.csv
     --nodes=Order=import/orders_header.csv, import/orders1.csv, import/orders2.csv
     --relationships=CONTAINS=import/order_details.csv
     --relationships=ORDERED=import/customer_orders_header.csv, import/orders1.csv, import/orders2.csv
     --trim-strings=true orders

You must specify the parameters to this script on a single line. Line feeds are shown here for readability.

When you run this command, it imports data and make it available for the database. The neo4j-admin database import command does not create a new database.

The repeated --nodes and --relationships parameters are groups of multiple (potentially split) CSV files of the same entity, i.e. with the same column structure.

All files per group are treated as if they could be concatenated as a single large file. A header row in the first file of the group or in a separate, single-line file is required. Placing the header in a separate file can make it easier to handle and edit than having it in a multi-gigabyte text file. Compressed files are also supported.

  • The --id-type=string indicates that all :ID columns contain alphanumeric values (there is an optimization for numeric-only IDs).

  • The customers.csv is imported directly as nodes with the :Customer label and the properties are taken directly from the file.

  • Product nodes follow the same pattern where the node labels are taken from the :LABEL column.

  • The Order nodes are taken from three files - one header and two content files.

  • Line item relationships typed :CONTAINS are created from order_details.csv, relating orders with the contained products via their IDs.

  • Orders are connected to customers by using the order CSV files again, but this time with a different header, which :IGNORE’s the non-relevant columns.

The column names are used for property-names of your nodes and relationships. There is specific markup on specific columns:

  • name:ID - global id column used to look up the node later reconnecting.

    • if the property name is left off, it will be not stored (temporary), which is what the --id-type refers to.

    • if you have repeated IDs across entities, you have to provide the entity (id-group) in parentheses like :ID(Order).

    • if your IDs are globally unique, you can leave that off.

  • :LABEL - label column for nodes. Multiple labels can be separated by delimiter.

  • :START_ID, :END_ID - relationship file columns referring to the node IDs. For id-groups, use :END_ID(Order).

  • :TYPE - column to specify relationship-type.

  • All other columns are treated as properties but skipped if empty or annotated with :IGNORE.

  • Type conversion is possible by suffixing the name with indicators like :INT, :BOOLEAN, etc.

For more details on this header format and the tool, see the section in the Neo4j Operations Manual → Neo4j Admin import and the accompanying tutorial.

If you use Neo4j 4.4, go to the Operations manual → Tutorials: Neo4j Admin import for instructions.

CSV data quality

Real-world data is messy. Any time you work with data, you will see some values that need cleaned up or transformed before you move it to another system. Small syntax errors, format descriptions, consistency, correct quoting, and even differing assumptions on data requirements or standards can easily cause hours of cleanup down the road.

We will highlight some of the data quality issues easily missed when loading data from other systems into Neo4j and try to help avoid problems with data import and cleanup.

Common pitfalls

  • Headers are inconsistent with data (missing, too many columns, different delimiter in header).

    Verify headers match the data in the file. Adjusting formatting, delimiters, columns, etc. at this stage will save a great deal of time later.

  • Extra or missing quotes throughout file.

    Standalone double or single quotes in the middle of non-quoted text or non-escaped quotes in quoted text can cause issues reading the file for loading. It is best to either escape or remove stray quotes. Documentation for proper escaping is in the Cypher style guide.

  • Special or Newline characters in file.

    When dealing with any special characters in a file, ensure they are quoted or remove them. For newline characters in quoted or unquoted fields, either add quotes for these or remove them.

  • Inconsistent line breaks.

    One thing that computers do not handle well is inconsistent data. Ensure line breaks are consistent throughout. We recommend choosing the Unix style for compatibility with Linux systems (common format for import tools).

  • Binary zeros, BOM byte order mark (2 UTF-8 bytes) at beginning of the file, or other non-text characters.

    Any unusual characters or tool-specific formatting are sometimes hidden in application tools, but become easily apparent in basic editors. If you come across these types of characters in your file, it is best to remove them entirely.

Tools

As mentioned above, certain applications have special formatting to make documents look nice, but this hidden extra code is not handled by regular file readers and scripts. Other times, it is hard to find small syntax changes or make broad adjustments for files with a lot of data.

For handling these types of situations or general data cleanup, there are a number of tools that help you check and validate your CSV data files.

Basic tools, such as hexdump, vi, emacs, UltraEdit, and Notepad++ work well for handling shortcut-based commands for editing and manipulating files. However, there are also other more efficient or user-friendly options available that assist in data cleanup and formatting.

  • Cypher - what Cypher sees is what will be imported, so you can use that to your advantage. Using LOAD CSV without creating graph structure just outputs samples, counts, or distributions to make it possible to detect incorrect header column counts, delimiters, quotes, escapes, or header name spellings.

  • CSVKit - a set of Python tools that provides statistics (csvstat), search (csvgrep), and more for your CSV files.

  • CSVLint - an online service to validate CSV files. You can upload the file or provide an URL to load it.

  • Papa Parse - a comprehensive Javascript library for CSV parsing that allows you to stream CSV data and provides good, human-readable error reporting on issues.

// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);

// check first 5 line-sample with header-mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line
RETURN line
LIMIT 5;