
Goals
This article demonstrates how to import CSV data into Neo4j and solutions to potential issues that might arise during this process.
Prerequisites
Before importing data, you should have modeled the first iteration of your domain in the property graph model. And to use LOAD CSV efficiently, it pays off to be comfortable with Cypher, our query language.


CSV Data Quality

Real World Data Considerations

Real-world data is messy. Don't assume that what someone told you is in a CSV file is actually in there; don't rely on format descriptions, consistency, or correct quoting. Only trust data validity that you have checked yourself.

Common Pitfalls

  • BOM byte order mark (3 bytes in UTF-8 files) at the beginning of a file ← remove it
  • Binary zeros or other non-text characters dispersed throughout the file ← remove them
  • Inconsistent line breaks – mixed Windows and Unix line breaks ← make sure they are consistent, best choose Unix style
  • Header inconsistent with data (missing, too many columns, different delimiter in header) ← fix headers
  • Special characters in non-quoted text ← make sure unusual text is always quoted
  • Unexpected newlines in quoted and unquoted text fields ← either quote text or remove newlines
  • Stray quotes – standalone double or single quotes in the middle of non-quoted text, or non-escaped quotes in quoted text ← escape or remove stray quotes

Tools

There are a number of tools that help you check and validate your CSV data files.

The most basic are hexdump and the hex mode of editors like vi, emacs, UltraEdit, and Notepad++.

But there are other tools available that are more end-user friendly.

CSVKit

CSVKit is a set of Python tools that provide statistics (csvstat), search (csvgrep), and more for your CSV files. csvstat in particular is very helpful for getting an overview of and statistics about your file; if you know your domain and data, you can immediately spot inaccuracies. Field length is important, as excess field length indicates stray quotes.
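
For example, checking a hypothetical customers.csv with CSVKit might look like this (file name and column are invented for illustration):

# summary statistics per column: inferred type, nulls, unique values, max length
csvstat customers.csv

# search the "name" column for rows matching a pattern
csvgrep -c name -m "Bakery" customers.csv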

CSVLint

CSVLint is an online service to validate CSV files. You can upload the file or provide a URL to load it from; there is also the option to provide additional schema information.

Papa Parse

Papa Parse is a comprehensive JavaScript library for CSV parsing that allows you to stream CSV data and files, even in your browser, and provides good, human-readable error reporting on issues. In the Demo area you can test your files yourself, or you can use the library directly. There is also a fork called BabyParse that runs on Node.js.

Cypher

What Cypher sees is what will be imported, so you can use that to your advantage. You can use LOAD CSV without creating graph structure and just output samples, counts, or distributions. This also makes it possible to detect incorrect header column counts, delimiters, quotes, escapes, or spelling of header names.

// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);

// check first few raw lines
LOAD CSV FROM "file-url" AS line
RETURN line
LIMIT 5;

// check a 5-line sample with header mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line
RETURN line
LIMIT 5;
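
To spot rows with the wrong number of columns, a symptom of stray quotes or delimiter problems, you can also return the distribution of field counts per row. A minimal sketch, using the same "file-url" placeholder as above:

// distribution of column counts per row
// (size() works on collections in recent versions; older releases use length())
LOAD CSV FROM "file-url" AS line
RETURN size(line) AS columns, count(*) AS rows
ORDER BY columns;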

LOAD CSV for Medium-Sized Datasets

The real secret of LOAD CSV is that it is not just a basic data-ingestion mechanism, but actually an ETL power tool. Why?

It combines multiple aspects in a single operation, as the sketch after this list illustrates:

  • supports loading / ingesting CSV data from a URI
  • direct mapping of input data into complex graph/domain structure
  • data conversion
  • supports complex computations
  • create or merge data, relationships and structure
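
A minimal sketch of these aspects combined in one statement, assuming a hypothetical orders.csv with columns orderId, date, total, and customerId:

// ingest from a URL, convert types, and merge nodes and relationships in one pass
LOAD CSV WITH HEADERS FROM "file:///orders.csv" AS row
MERGE (c:Customer {customerId: row.customerId})
MERGE (o:Order {orderId: row.orderId})
  ON CREATE SET o.date = row.date, o.total = toFloat(row.total)
MERGE (c)-[:ORDERED]->(o);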

Important Tips for LOAD CSV

  • Always use the latest version of Neo4j; it will most probably be faster than earlier ones.
Data Quality and Conversion
  • See the data quality section above
  • Empty fields have to be skipped or replaced with default values during LOAD CSV
  • All data from the CSV file is read as a string; use toInt, toFloat, split, or similar functions to convert it
  • Split arrays in a cell by delimiter using split (combine with extract for conversions)
  • Check your Cypher import statement for typos: labels, property names, and relationship types are case-sensitive
  • Conditional conversions can be achieved with CASE (see the sketch below)
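
A minimal sketch combining these conversion tips; the file URL and column names are hypothetical:

LOAD CSV WITH HEADERS FROM "file:///products.csv" AS row
// skip rows with an empty primary key (missing fields are read as null)
WITH row WHERE row.productId IS NOT NULL
CREATE (p:Product {productId: row.productId})
SET p.price   = toFloat(row.price),
    p.tags    = split(row.tags, ";"),
    // conditional conversion with CASE, providing a default value
    p.inStock = CASE row.inStock WHEN "true" THEN true ELSE false END;
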
Indexing and Performance
  • Make sure to have indexes and constraints declared and ONLINE for entities you want to MATCH or MERGE on (see the example below)
  • Always MATCH and MERGE on a single label and the indexed primary-key property
  • Prefix your load statements with USING PERIODIC COMMIT 10000
  • If possible, separate node creation from relationship creation into different statements
    If your import is slow or runs into memory issues, see Mark’s blog post on Eager loading.
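
For instance, with a hypothetical customers.csv keyed on customerId, declare the constraint first and batch the import:

// declare the constraint (which is backed by an index) before loading
CREATE CONSTRAINT ON (c:Customer) ASSERT c.customerId IS UNIQUE;

// batch into transactions of 10,000 rows, merging on the single indexed key
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///customers.csv" AS row
MERGE (c:Customer {customerId: row.customerId});
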
Memory Config
  • Make sure to have enough memory (at least 4G heap in neo4j-wrapper.conf) reserved for your Neo4j-Server or Neo4j-Shell (export JAVA_OPTS="-Xmx4G")
  • Configure the memory-mapping settings according to your expected file sizes
    (for Neo4j 2.1, in neo4j.properties, keep these ratios: nodestore=100M, relationshipstore=2G, propertystore=500M, stringstore=500M; see the sketch below)
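
A sketch of the corresponding neo4j.properties entries for Neo4j 2.1, using the ratios above (tune the values to your actual store sizes):

# memory-mapped I/O settings in conf/neo4j.properties (Neo4j 2.1)
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
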
File-URLs and Neo4j-Shell
  • Make sure to use the right URLs, especially file URLs. On OS X and Unix use file:///path/to/data.csv; on Windows, use file:c:/path/to/data.csv
  • Use bin/neo4j-shell instead of the browser for better error messages and control.
    By default it connects to a running server, but you can also use bin/neo4j-shell -path import.db -config conf/neo4j.properties for direct database-directory access (when no server is running with that database).

Step by Step Example for LOAD CSV

In our guide on ETL import from a relational database, we explain how to import CSV data step by step, from data modeling and creating indexes to writing the individual LOAD CSV statements.

Webinar “LOAD CSV in the Real World”

In this very hands-on webinar, Nicole White, Neo Technology's Data Scientist, shows how to use LOAD CSV to import a real-world dataset (consumer complaints from consumerfinance.gov) into Neo4j.

After a quick modeling discussion she walks through the steps of preparing indexes and constraints and then imports one part of the dataset at a time into Neo4j.

You can find all scripts and data on her GitHub account.

Super Fast Batch Importer For Huge Datasets

With Neo4j 2.2, a new, massively parallel and scalable CSV importer was released as part of the product.

We want to use it to import similar order data into Neo4j: customers, orders and contained products.

The tool is located in path/to/neo4j/bin/neo4j-import and is used as follows:

bin/neo4j-import --into retail.db --id-type string \
                 --nodes:Customer customers.csv --nodes products.csv  \
                 --nodes orders_header.csv,orders1.csv,orders2.csv \
                 --relationships:CONTAINS order_details.csv \
                 --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv

The first few rows of data used for this import look like this:

Table 1. customers.csv

customerId:ID(Customer)   name
23                        Delicatessen Inc
42                        Delicious Bakery

Table 2. products.csv

productId:ID(Product)   name        price   :LABEL
11                      Chocolate   10      Product;Food

Table 3. orders_header.csv,orders1.csv,orders2.csv

orderId:ID(Order)   date         total   customerId:IGNORE
1041                2015-05-10   130     23
1042                2015-05-12   20      42

Table 4. order_details.csv

:START_ID(Order)   amount   price   :END_ID(Product)
1041               13       130     11
1042               2        20      11

Table 5. customer_orders_header.csv,orders1.csv,orders2.csv

:END_ID(Order)   date:IGNORE   total:IGNORE   :START_ID(Customer)
1041             2015-05-10    130            23
1042             2015-05-12    20             42

If you call the neo4j-import script without parameters, it will list a comprehensive help page.

The --into retail.db parameter is obviously the target database, which must not contain an existing database.

The repeated --nodes and --relationships parameters are groups of multiple (potentially split) CSV files of the same entity, i.e. with the same column structure.

All files in a group are treated as if they could just be concatenated to form a single large file. A header row in the first file of the group is required; it might even be put into a single-line file, which can be easier to handle and edit than a multi-gigabyte text file. Compressed files are supported too.

  1. The customers.csv is imported directly as nodes with the :Customer label; the properties are taken directly from the file.
  2. The same goes for the products, where the node labels are taken from the :LABEL column.
  3. The order nodes are taken from 3 files: one header and two content files.
  4. Line-item relationships typed :CONTAINS are created from order_details.csv, relating orders with the contained products via their IDs.
  5. Orders are connected to customers by using the orders CSV files again, but this time with a different header, which :IGNOREs the non-relevant columns.

The --id-type string indicates that all :ID columns contain alphanumeric values (there is an optimization for numeric-only IDs).

The column names are used as property names for your nodes and relationships; there is some extra markup for specific columns:

  • name:ID – global ID column by which the node is looked up for later reconnecting
    • if the property name is left off, the value will not be stored (it is temporary); this is what --id-type refers to
    • if you have repeated IDs across entities, you have to provide the entity (ID group) in parentheses, like :ID(Order)
    • if your IDs are globally unique, you can leave that off
  • :LABEL – label column for nodes; multiple labels can be separated by delimiter
  • :START_ID, :END_ID – relationship-file columns referring to the node IDs; for ID groups use :END_ID(Order)
  • :TYPE – relationship-type column
  • all other columns are treated as properties, but skipped if empty or annotated with :IGNORE
  • type conversion is possible by suffixing the name, e.g. with :INT, :BOOLEAN, etc.
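
As an illustration, here is a hypothetical variant of the products.csv header from Table 2 that adds a type conversion for the price column:

productId:ID(Product),name,price:FLOAT,:LABEL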

For more details on this header format and the tool, see the extensive documentation in the Neo4j Manual and the accompanying tutorial.

And the resulting output would look like this:

[INPUT------|ENCODER--------|WRITER: W:24.59 MB/s--------------------------]  4M
Calculate dense nodes
[INPU|CALCULATOR-----------------------------------------------------------]  3M
Relationships
[INPU|ENCODER-----------------------------------------------------------|WR]  4M
Node first rel

Relationship back link
[LINKER--------------------------------------------------------------------]  3M
Node counts

Relationship counts

IMPORT DONE. Took: 58s 124ms