Developer Guides Getting Started Getting Started What is a Graph Database? Intro to Graph DBs Video Series Concepts: RDBMS to Graph Concepts: NoSQL to Graph Getting Started Resources Neo4j Graph Platform Graph Platform Overview Neo4j Desktop Intro Neo4j Browser Intro… Read more →

Developer Guides

Want to Speak? Get $ back.

Importing CSV Data into Neo4j

Goals
This article demonstrates different approaches to importing CSV data into Neo4j and solutions to potential issues that might arise during the process.
Prerequisites
Before importing data, you should be familiar with what a graph database is, how to construct a property graph data model, and some basics of the Cypher query language. All of these skills are a part of the data import process.
Intermediate

CSV is a file of comma-separated values, often viewed in Excel or some other spreadsheet tool. There can be other types of values as the delimiter, but the most standard is the comma. Many systems and processes today already convert their data into CSV format for file outputs to other systems, human-friendly reports, and other needs. It is a standard file format that humans and systems are already familiar with using and handling.

Giving Neo4j the ability to read and load a CSV file helps reduces the friction of getting data from various formats and systems into Neo4j.

Ways to Import CSV Files

There are a few different approaches to get CSV data into Neo4j, each with varying criteria and functionality. The option you choose will depend on the data set size, as well as your degree of comfort with various tools.

Let us see some of the ways Neo4j can read and import CSV files.

  1. LOAD CSV Cypher command: this command is a great starting point and handles small- to medium-sized data sets (up to 10 million records).
  2. neo4j-admin bulk import tool: command line tool useful for straightforward loading of large data sets.
  3. Kettle import tool: maps and executes steps for the data process flow and works well for very large data sets, especially if developers are already familiar with using this tool.

We will take a brief view of each one of these tools, how they operate, and how to get started with a general use case. More documentation and information for each will also be included for help on more complex scenarios. Data quality can also be an issue for any type of data import to any system, so we will cover a few of these potential difficulties and how to solve them.

LOAD CSV command with Cypher

The LOAD CSV clause is part of the Cypher query language. Our Cypher manual contains a page devoted to its usage, and a variety of Neo4j’s blogs, videos, solutions, and other material utilizes this command. It is simple to use and widely applicable. LOAD CSV is not just your basic data ingestion mechanism because it combines multiple aspects into a single operation.

  • Supports loading / ingesting CSV data from an URI
  • Directly maps input data into complex graph/domain structure
  • Handles data conversion
  • Supports complex computations
  • Creates or merges entities, relationships, and structure
For better control, you can run LOAD CSV commands with cypher-shell instead of in the browser. More information is in the manual page on Cypher shell.

Reading CSV Files

LOAD CSV can handle local and remote files, and there is some syntax associated with each. This can be an easy thing to miss and end up with an access error, so we will try to clarify the rules here.

Local files are referenced with a file:/// prefix before the file name. Neo4j security has a default setting that local files can only be read from the Neo4j import directory, which is different based on your operating system. File locations for each OS are listed in our Neo4j Operations Manual. We recommend putting files in Neo4j’s import directory, as it keeps the environment secure. However, if you need to access files in other locations, you can find out which setting to alter in our manual.

Examples:
//Example 1 - file directly placed in import directory (import/data.csv)
LOAD CSV FROM "file:///data.csv"

//Example 2 - file placed in subdirectory within import directory (import/northwind/customers.csv)
LOAD CSV FROM "file:///northwind/customers.csv"

Web-hosted files can be referenced directly with their URL, like https://host/path/data.csv. However, permissions must be set so that an external source can read the file. For more information about access related to online file imports, see this knowledge base article.

Examples:
//Example 1 - website
LOAD CSV FROM 'https://neo4j.com/docs/cypher-manual/3.5/csv/artists.csv'

//Example 2 - Google
LOAD CSV WITH HEADERS FROM 'https://docs.google.com/spreadsheets/d/<yourFilePath>'

Important Tips for LOAD CSV

There are a few things to keep in mind with LOAD CSV and a few helpful tips for handling the variety of data scenarios you are likely to encounter.

  • Newer versions of Neo4j will most likely be faster due to continued optimization.
  • All data from the CSV file is read as a string, so you need to use toInteger(), toFloat(), split() or similar functions to convert values.
  • Check your Cypher import statement for typos. Labels, property names, relationship-types, and variables are case-sensitive.
  • The cleaner the data, the easier the load. Try to handle complex cleanup/manipulation before load.

Converting Data Values with LOAD CSV

Cypher has some scrubbing and conversion capabilities to help with data cleanup. These are extremely useful for handling missing data or splitting a field into multiple values for the graph.

First, remember that Neo4j does not store null values. Null or empty fields in a CSV files can be skipped or replaced with default values in LOAD CSV.

Example
//skip null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row WHERE row.Company IS NOT NULL
MERGE (c:Company {companyId: row.Id})

//set default for null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})

//change empty strings to null values (not stored)
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END

Next, if you have a field in the CSV that is a list of items that you want to split, you can use the Cypher split() function to separate arrays in a cell.

Example
//split string of employee skills into separate nodes
LOAD CSV FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.Id})
UNWIND split(row.skills, ',') AS skill
MERGE (s:Skill {name: skill})
MERGE (e)-[r:HAS_EXPERIENCE]->(s);

Conditional conversions can be achieved with CASE. You saw one example of this when we were checking for null values or empty strings, but let us look at another example.

Example
//set businessType property based on shortened value in CSV
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row,
(CASE row.BusinessType
 WHEN 'P' THEN 'Public'
 WHEN 'R' THEN 'Private'
 WHEN 'G' THEN 'Government'
 ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.CompanyId})
SET c.businessType = type
RETURN *

Optimizing LOAD CSV for Performance

Often, there are ways to improve performance during data load, which are especially helpful when dealing with large amounts of data or complex loading.

To improve inserting or updating unique entities into your graph (using MERGE or MATCH with updates), you can create indexes and constraints declared for each of the labels and properties you plan to merge or match on.

For best performance, always MATCH and MERGE on a single label with the indexed primary-key property.

You should also separate node and relationship creation into separate statements. For instance, instead of the following:

MERGE (e:Employee {employeeId: row.employeeId})
MERGE (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)

You can write it like this:

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId})
RETURN count(e);

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.companyId})
RETURN count(c);

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)
RETURN count(*);

This way, the load is only doing one piece of the import at a time and can move through large amounts of data quickly and efficiently, reducing heavy processing.

When the amount of data being loaded is too much to fit into memory, there are a couple of different approaches you can use to combat running out of memory during the data load.

  1. Batch the import into sections with PERIODIC COMMIT. This clause can be added before the LOAD CSV clause to tell Cypher to only process so many rows of the file before clearing memory and transaction state. For more information, see the manual page on PERIODIC COMMIT.
Example
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (pet:Pet {petId: row.PetId})
MERGE (owner:Owner {ownerId: row.OwnerId})
 ON CREATE SET owner.name = row.OwnerName
MERGE (pet)-[r:OWNED_BY]->(owner)
  1. Avoid the EAGER operator. Some statements pull in more rows than what is necessary, adding extra processing up front. To avoid this, you can run PROFILE on your queries to see if they use EAGER loading and either modify queries or run multiple passes on the same file, so it does not do this. More information about EAGER loading and how to avoid can be found in Mark’s blog post.
  2. Adjust configuration for the database on heap and memory to avoid page-faults. To help handle larger volumes of transactions, you can increase some configuration settings for the database and restart the instance for them to take effect. Usually, you can create or update 1M records in a single transaction per 2 GB of heap. In neo4j.conf:
    • dbms.memory.heap.initial_size and dbms.memory.heap.max_size: set to at least 4G.
    • dbms.memory.pagecache.size: ideally, value large enough to keep the whole database in memory.

Bulk Importer For Large Datasets

LOAD CSV is great for importing small- or medium-sized data (up to 10M records). For data sets larger than this, we have access to a command line bulk importer. The neo4j-admin import tool allows you to import CSV data to an empty database by specifying node files and relationship files.

We want to use it to import order data into Neo4j: customers, orders, and ordered products.

The tool is located in <neo4j-home>/bin/neo4j-admin and is used as follows:

bin/neo4j-admin import --id-type=STRING \
                       --nodes:Customer=customers.csv --nodes=products.csv  \
                       --nodes="orders_header.csv,orders1.csv,orders2.csv" \
                       --relationships:CONTAINS=order_details.csv \
                       --relationships:ORDERED="customer_orders_header.csv,orders1.csv,orders2.csv"

The first few rows of data used for this import look like this:

Table 1. customers.csv
customerId:ID(Customer) name

23

Delicatessen Inc

42

Delicous Bakery

Table 2. products.csv
productId:ID(Product) name price :LABEL

11

Chocolate

10

Product;Food

Table 3. orders_header.csv,orders1.csv,orders2.csv
orderId:ID(Order) date total customerId:IGNORE

1041

2015-05-10

130

23

1042

2015-05-12

20

42

Table 4. order_details.csv
:START_ID(Order) amount price :END_ID(Product)

1041

13

130

11

1042

2

20

11

Table 5. customer_orders_header.csv,orders1.csv,orders2.csv
:END_ID(Order) date:IGNORE total:IGNORE :START_ID(Customer)

1041

2015-05-10

130

23

1042

2015-05-12

20

42

If you call the bin/neo4j-admin import without parameters it will list a comprehensive help page.

The repeated --nodes and --relationships parameters are groups of multiple (potentially split) CSV files of the same entity, i.e. with the same column structure.

All files per group are treated as if they could be concatenated as a single large file. A header row in the first file of the group or in a separate, single-line file is required. Placing the header in a separate file can make it easier to handle and edit than having it in a multi-gigabyte text file. Compressed files are also supported.

  • The --id-type=STRING indicates that all :ID columns contain alphanumeric values (there is an optimization for numeric-only IDs).
  • The customers.csv is imported directly as nodes with the :Customer label and the properties are taken directly from the file.
  • Product nodes follow the same pattern where the node-labels are taken from the :LABEL column.
  • The Order nodes are taken from 3 files – one header and two content files.
  • Line item relationships typed :CONTAINS are created from order_details.csv, relating orders with the contained products via their IDs.
  • Orders are connected to customers by using the order CSV files again, but this time with a different header, which :IGNORE’s the non-relevant columns.

The column names are used for property-names of your nodes and relationships. There is specific markup on specific columns, which we will explain.

  • name:ID – global id column used to look up the node later reconnecting.
    • if the property name is left off, it will be not stored (temporary), which is what the --id-type refers to.
    • if you have repeated IDs across entities, you have to provide the entity (id-group) in parentheses like :ID(Order).
    • if your IDs are globally unique, you can leave that off.
  • :LABEL – label column for nodes. Multiple labels can be separated by delimiter.
  • :START_ID, :END_ID – relationship file columns referring to the node ids. For id-groups, use :END_ID(Order).
  • :TYPE – column to specify relationship-type.
  • All other columns are treated as properties but skipped if empty or annotated with :IGNORE.
  • Type conversion is possible by suffixing the name with indicators like :INT, :BOOLEAN, etc.

For more details on this header format and the tool, see the documentation in the Neo4j Manual and the accompanying tutorial.

CSV Data Quality

Real-world data is messy. Any time you work with data, you will see some values that need cleaned up or transformed before you move it to another system. Small syntax errors, format descriptions, consistency or correct quoting, and even differing assumptions on data requirements or standards can easily cause hours of cleanup down the road.

We will highlight some of the data quality issues easily missed when loading data from other systems into Neo4j and try to help avoid problems with data import and cleanup.

Common Pitfalls

Headers are inconsistent with data (missing, too many columns, different delimiter in header) Verify headers match the data in the file. Adjusting formatting, delimiters, columns, etc. at this stage will save a great deal of time later.

Extra or missing quotes throughout file Standalone double or single quotes in the middle of non-quoted text or non-escaped quotes in quoted text can cause issues reading the file for loading. It is best to either escape or remove stray quotes. Documentation for proper escaping is in the Cypher style guide and a knowledgebase article.

Special or Newline characters in file When dealing with any special characters in a file, ensure they are quoted or remove them. For newline characters in quoted or unquoted fields, either add quotes for these or remove them.

Inconsistent line breaks One thing that computers do not handle well is inconsistent data. Ensure line breaks are consistent throughout. We recommend choosing the Unix style for compatibility with Linux systems (common format for import tools).

Binary zeros, BOM byte order mark (2 UTF-8 bytes) at beginning of file, or other non-text characters Any unusual characters or tool-specific formatting (Excel or Word) are sometimes hidden in application tools, but become easily apparent in basic editors. If you come across these types of characters in your file, it is best to remove them entirely.

Tools

As mentioned above, certain applications have special formatting to make documents look nice, but this hidden extra code is not handled by regular file readers and scripts. Other times, it is hard to find small syntax changes or make broad adjustments for files with a lot of data.

For handling these types of situations or general data cleanup, there are a number of tools that help you check and validate your CSV data files.

Basic tools, such as hexdump, vi, emacs, UltraEdit, and Notepad++ work well for handling shortcut-based commands for editing and manipulating files. However, there are also other more efficient or user-friendly options available that assist in data cleanup and formatting.

  • CSVKit – a set of Python tools that provides statistics (csvstat), search (csvgrep), and more for your CSV files.
  • CSVLint – an online service to validate CSV files. You can upload the file or provide an URL to load it.
  • Papa Parse – a comprehensive Javascript library for CSV parsing that allows you to stream CSV data and provides good, human-readable error reporting on issues.
  • Cypher – what Cypher sees is what will be imported, so you can use that to your advantage. Using LOAD CSV without creating graph structure will just output samples, counts, or distributions to make it possible to detect incorrect header column counts, delimiters, quotes, escapes, or header name spellings.
// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);

// check first 5 line-sample with header-mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line
RETURN line
LIMIT 5;