Working with CSV files

This page presents some points to consider to optimize your CSV files before importing them to Neo4j.

Structure of a CSV file

There are several elements in a CSV file that can help you understand the data that you are about to import and eventually model.

Data format

Neo4j reads all data from the CSV file as a string. For other data types, you need to use toInteger(), toFloat(), toBoolean(), or similar functions to convert data to the appropriate type. Remember also that labels, property names, relationship types, and variables are case-sensitive.

See Converting data values for more information on how to manipulate data formats.

Field terminator

Also known as a delimiter, the field terminator is the character used to separate fields in a CSV file. A comma (,) is the most common choice, but other characters, such as a tab (\t) or a pipe (|), also work. Whichever character you choose, use it consistently throughout the file. In this example, the terminator is a comma:

personId,name,birthYear
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939

You can copy and paste the example into any text editor (e.g. Notepad or TextEdit) or a spreadsheet application (e.g. Excel or Google Sheets) and then save it as a CSV file.
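
If a file uses a terminator other than the default comma, you can tell LOAD CSV which character to expect with the FIELDTERMINATOR option. A minimal sketch, assuming a hypothetical pipe-delimited persons.csv in the import directory:

LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row FIELDTERMINATOR '|'
RETURN row.personId, row.name, row.birthYear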

Header

The header is typically the first line in the CSV file. Headers are not mandatory, but adding them is good practice. In the previous example, the header is:

personId,name,birthYear

If your CSV files have no header row, you need to know the order of the columns and refer to them using indexes instead. This can make working with the data more complicated than it needs to be.
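
For example, a minimal sketch of reading a headerless variant of the file above (the file name is hypothetical); without WITH HEADERS, each row is a list and columns are referenced by zero-based index:

LOAD CSV FROM 'file:///persons-no-header.csv' AS row
RETURN row[0] AS personId, row[1] AS name, row[2] AS birthYear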

Quotes

Quotation marks (") define what text should be stored as a single value. For example:

personId,name,birthYear
23945,"Pires, Gerard",1942
553509,"Reddy, Helen",1941
113934,"Flannery, Susan",1939

This CSV file uses a comma as the field terminator, and the name entries, such as Pires, Gerard, also contain a comma. If Pires, Gerard weren’t enclosed in quotation marks, it would count as two separate fields (i.e. one for Pires and another for Gerard).

Normalized data

If the source data is normalized (e.g. when exported from a relational data model), there are typically multiple CSV files. Each CSV file represents a table in the relational data model, and the files are related to each other by unique IDs.

In this normalized data example, there are three files for people, movies, and roles:

person.csv
personId,name,birthYear
23945,Gerard Pires,1942
553509,Helen Reddy,1941
113934,Susan Flannery,1939
movies.csv
movieId,title,avgVote,releaseYear,genres
189,Sin City,8.000000,2005,Crime|Thriller
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi
11969,Tombstone,7.800000,1993,Action|Romance|Western
roles.csv
personId,movieId,character
2295,189,Marv
56731,189,Nancy
16851,189,Dwight

Note that the person.csv file has a unique ID for every person, and the movies.csv file has a unique ID for every movie. The roles.csv file relates a person to a movie and provides the characters. It corresponds to the relationships in the graph as it contains the links needed to bind nodes together.
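
A minimal sketch of importing these three files, assuming they are in the import directory and that the Person and Movie labels and the ACTED_IN relationship type are your chosen model (these names are illustrative):

// Create one Person node per row of person.csv
LOAD CSV WITH HEADERS FROM 'file:///person.csv' AS row
MERGE (p:Person {personId: row.personId, name: row.name});

// Create one Movie node per row of movies.csv
LOAD CSV WITH HEADERS FROM 'file:///movies.csv' AS row
MERGE (m:Movie {movieId: row.movieId, title: row.title});

// Use the IDs in roles.csv to connect the nodes created above
LOAD CSV WITH HEADERS FROM 'file:///roles.csv' AS row
MATCH (p:Person {personId: row.personId})
MATCH (m:Movie {movieId: row.movieId})
MERGE (p)-[:ACTED_IN {character: row.character}]->(m);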

De-normalized data

De-normalized data represents data from multiple tables combined. If the source data is de-normalized, there is typically a single CSV file containing all the data, with values duplicated wherever entities are related. For example:

movies-n.csv
movieId,title,avgVote,releaseYear,genres,personType,name,birthYear,character
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Bruce Willis,1955,Korben Dallas
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Gary Oldman,1958,Jean-Baptiste Emanuel Zorg
2300,The Fifth Element,7.700000,1997,Action|Adventure|Sci-Fi,ACTOR,Ian Holm,1931,Father Vito Cornelius
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Kurt Russell,1951,Wyatt Earp
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Val Kilmer,1959,Doc Holliday
11969,Tombstone,7.800000,1993,Action|Romance|Western,ACTOR,Sam Elliott,1944,Virgil Earp

Here, the movie and person data (including the IDs) is repeated in a new row for every actor’s role. This sort of duplication can compromise the graph data structure, so it is advisable to prepare your file before importing.
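
If you do load such a file directly, using MERGE instead of CREATE at least prevents the repeated movie and person values from becoming duplicate nodes. A sketch, not a substitute for preparing the file (movies-n.csv has no person ID, so Person nodes are merged on name and birth year):

LOAD CSV WITH HEADERS FROM 'file:///movies-n.csv' AS row
// Repeated rows match the existing Movie node instead of creating a new one
MERGE (m:Movie {movieId: row.movieId, title: row.title})
MERGE (p:Person {name: row.name, birthYear: toInteger(row.birthYear)})
MERGE (p)-[:ACTED_IN {character: row.character}]->(m)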

File location

When using the LOAD CSV command to load your CSV data, the CSV file is accessed via URL, either over the internet:

LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/people.csv' AS row
RETURN row

Or from a local folder, if you use an on-premise deployment. In this case, you need to add a file:/// prefix before the file name:

LOAD CSV WITH HEADERS
FROM 'file:///people.csv' AS row
RETURN row

For security reasons, local files can, by default, only be read from the Neo4j import directory, whose location depends on your operating system. See Operations → File locations for more information.

If you want to read your CSV file from another location, you need to change the server.directories.import setting.
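
For example, in neo4j.conf (the path is just a placeholder):

server.directories.import=/path/to/your/csv/files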

Security

It is strongly recommended to permit resource loading only over secure protocols such as HTTPS instead of insecure protocols like HTTP. This can be done by limiting the load privileges to only trusted sources that use secure protocols.

If allowing an insecure protocol is unavoidable, you need to add the JVM argument -Dsun.net.http.allowRestrictedHeaders=true to the configuration setting server.jvm.additional to avoid automatic blocking from Neo4j’s built-in security checks.

File preparation

Before you import CSV data, you should consider the source of the data. It can come from:

  • Relational databases

  • Web APIs

  • Public data directories

  • BI tools

  • Spreadsheets (e.g. Excel or Google Sheets)

Most data systems have an option for exporting data as CSV files, as it is a common format for data exchange. However, real-world data is often messy, which means some values need to be cleaned up or transformed before being imported into another system.

These are some common issues you may encounter:

  1. The source file contains more data than you need

    For example, if you are interested in only one director and the movies they were involved in, the Movies dataset contains a lot of data which is irrelevant to you. To make the import process more efficient, you need to remove the unnecessary data before you import the CSV files.

  2. Inconsistency between headers and data

    Headers can be inconsistent with the data: they may be missing, or there may be more or fewer columns than header fields.

    To avoid this problem:

    • Check if headers match the data in the file.

    • Adjust formatting, columns, etc. before you import to ensure a smooth process.

  3. Extra or missing quotes

    Standalone double (") or single (') quotes in the middle of unquoted text, or unescaped quotes inside quoted text, can cause issues when the file is read for loading. It is best to either escape or remove stray quotes. See the Cypher® style guide for documentation on proper escaping.

  4. Special or newline characters

    When dealing with special characters in a file, ensure they are quoted, or remove them. For newline characters within fields, either quote the fields that contain them or remove the newlines.

  5. Inconsistent line breaks

    Ensure line breaks are consistent throughout the file. For Linux users, the recommendation is to use the Unix style for compatibility.

  6. Binary zeros, a byte order mark (BOM; three bytes, EF BB BF, in UTF-8), or other non-text characters

    Unusual characters or tool-specific formatting are sometimes hidden in files produced by application tools. You can use a plain-text editor to detect and remove these characters from your files, or escape them with backticks.

Data types

Since Neo4j reads all imported values as strings, you need to convert any values that are not strings. You can do this using Cypher functions:

  • toInteger(): converts a value to an integer.

  • toFloat(): converts a value to a float (e.g. for monetary amounts).

  • datetime(): converts a value to a DateTime.

Depending on what sort of data you have in your CSV file, you need to convert values according to their type. See Cypher → Values and types for more information on what values and types are available in Cypher.
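
For example, a minimal sketch that types the person.csv columns while creating nodes (the property names are illustrative):

LOAD CSV WITH HEADERS FROM 'file:///person.csv' AS row
MERGE (p:Person {personId: row.personId})
// toInteger() turns the string "1942" into the integer 1942
SET p.name = row.name,
    p.birthYear = toInteger(row.birthYear)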

Cleaning up

Some issues in the CSV files need to be addressed before you load them, but others can be addressed while you load them by adding additional clauses to the LOAD CSV command.

Null values

Neo4j does not store null values, but you can skip or replace them with default values by adding clauses or functions to the LOAD CSV command. Suppose you have this CSV file:

companies.csv
Id,Name,Location,Email,BusinessType
1,Neo4j,San Mateo,contact@neo4j.com,P
2,AAA,,info@aaa.com,
3,BBB,Chicago,,G

The third and fourth lines have no entry for some of the columns, which means those fields are read as null values. You can use a WHERE clause to skip rows with null values:

LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
WITH row WHERE row.Id IS NOT NULL
MERGE (c:Company {companyId: row.Id});

Or you can set a default value (e.g. "Unknown") for them using the coalesce() function:

LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})

You can also convert empty strings to null values, which will not be stored, by using the SET clause:

LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
MERGE (c:Company {companyId: row.Id})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END

Conditional conversions

Conditional conversions can be achieved with CASE. The previous example checked for null values or empty strings, but you can also set a property in this cleaning stage based on a value in the CSV file.

For example, you can set the businessType property based on an abbreviated value in the CSV file:

LOAD CSV WITH HEADERS FROM 'file:///companies.csv' AS row
WITH row WHERE row.Id IS NOT NULL
WITH row,
(CASE row.BusinessType
 WHEN 'P' THEN 'Public'
 WHEN 'R' THEN 'Private'
 WHEN 'G' THEN 'Government'
 ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END
SET c.businessType = type
RETURN *

Lists as entries

If a field in the CSV file contains a list of items that you want to split into separate values, you can use the Cypher split() function to break up the delimited string in that cell. For example:

employees.csv
Id,Name,Skills,Email
1,Joe Smith,Cypher:Java:JavaScript,joe@neo4j.com
2,Mary Jones,Java,mary@neo4j.com
3,Trevor Scott,Java:JavaScript,trevor@neo4j.com

Both Joe and Trevor have multiple skills listed in this file. You can split them using the split() function with the UNWIND clause like this:

LOAD CSV WITH HEADERS FROM 'file:///employees.csv' AS row
MERGE (e:Employee {employeeId: row.Id, email: row.Email})
WITH e, row
UNWIND split(row.Skills, ':') AS skill
MERGE (s:Skill {name: skill})
MERGE (e)-[r:HAS_EXPERIENCE]->(s)

Clean-up tools

You can use the following third-party tools to make sure your CSV file is in good shape to allow you to import data efficiently:

  • CSVKit, a set of Python tools that provides statistics (csvstat), search (csvgrep), and more.

  • CSVLint, an online service to validate CSV files. You can upload the file or provide a URL to load it.

  • Papa Parse, a comprehensive JavaScript library for CSV parsing that allows you to stream CSV data and provides good, human-readable error reporting on issues.

File size

You can use most of Neo4j’s import methods for small and medium-sized datasets (up to 10 million records). If you want to import larger datasets, it is recommended to use neo4j-admin database import. See the tutorial for Neo4j-admin import to learn more.
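
For illustration only, this is the general shape of the command (the file names are placeholders, and the import tool requires its own CSV header conventions, such as :ID, :LABEL, :START_ID, and :END_ID columns):

bin/neo4j-admin database import full \
    --nodes=import/persons.csv \
    --relationships=import/roles.csv \
    neo4j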

Optimization

Performance can become a problem when working with large amounts of data or complex loads. However, some strategies can improve the processing of large imports.

For example, if you want to create a graph using the preceding companies.csv file and the following one:

people.csv
employeeId,Name,companyId
1,Bob Smith,1
2,Joe Jones,3
3,Susan Scott,2
4,Karen White,1

In this case, you should split node creation and relationship creation into separate steps of the processing. For example, instead of the following:

LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId})
MERGE (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)

You can write it like this:

Load the Employee nodes
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId, name: row.Name})
RETURN count(e);
Load the Company nodes
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (c:Company {companyId: row.companyId})
RETURN count(c);
Create relationships
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId, name: row.Name})
MATCH (c:Company {companyId: row.companyId})
MERGE (e)-[:WORKS_FOR]->(c)
RETURN *

This way, each load does only one piece of the import at a time and can move through large amounts of data quickly and efficiently, reducing heavy processing.

The result is a graph in which each Employee node is connected to the Company node it works for by a WORKS_FOR relationship.
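
For very large files, a related technique (not shown above) is to batch the load into smaller transactions with CALL { … } IN TRANSACTIONS, which must be run as an implicit (auto-commit) transaction. The batch size here is an arbitrary example:

LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
CALL {
  WITH row
  MERGE (e:Employee {employeeId: row.employeeId, name: row.Name})
} IN TRANSACTIONS OF 10000 ROWS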