Goals You will learn how to import data from CSV files to a Neo4j graph database using the Neo4j Desktop application. This tutorial is designed for you to follow along and step through the process. Prerequisites This guide builds upon… Read more →
You will learn how to import data from CSV files to a Neo4j graph database using the Neo4j Desktop application. This tutorial is designed for you to follow along and step through the process.
This guide builds upon the basic concepts discussed in earlier guides. You should be familiar with the following:
You should also have downloaded Neo4j Desktop and be familiar with navigating the application, as well as creating and starting a database.
Current screenshots and filepaths are for Neo4j Desktop on Mac OS. Related screenshots and filepaths for Windows are coming soon!
Neo4j Desktop Import
Neo4j Desktop provides a user-friendly interface for starting and creating Neo4j databases, adding or removing plugins, changing configurations, and other functionality. It also includes some shortcuts and easy access for importing files (such as CSVs) into Neo4j.
In this guide, we will work with a zipped folder containing three CSV files and import the data to a graph with Neo4j Desktop. The CSV files contain data for products, orders, and order line items. We will look at the data in the files in an upcoming paragraph.
1. Creating and starting a database instance
If you already know how to create a database instance and find the
Manage page for the instance, you can skip to Step 2.
To begin, we need to open the Neo4j Desktop application on your computer. When you start the application, you should see something like the screen below.
You may already have database instances from other projects or exercises, but let us create a separate database for this import.
To do that, click on the section that says
Add Graph in the image above.
You should see two buttons appear for creating a local graph or connecting to a remote graph.
Create a Local Graph and fill in details on the entry form for your database.
For the dropdown field, you can leave whatever version is defaulted.
Once that is complete, click
It may take a minute or two to create the instance, but once it finishes, you should see something like the screenshot below.
You can start your database by clicking the
Start button on it.
When it starts, the button should change to
Stop so you see the screen below.
The other button is to manage the database.
Let’s click on
Manage to show a screen like this one.
The top panel in the image above shows some basic functions you can execute on the database.
The next step will show us how to use this panel to put our files in the
2. Putting CSV files in the expected folder
First, download this zip file containing 3 CSV files for products, orders, and order details. We will need to put these files in a folder that Neo4j expects for imports.
On the Mac, open a Finder window and navigate to where the zip file was downloaded. We will come back to this location in a minute.
In Neo4j Desktop’s
Manage pane, you should see the button for
Open Folder in the upper panel.
Right next to that button is a dropdown arrow button.
If you click that, you will see a menu of options for different folders.
This will open up another Finder window on the Mac.
Now you can click on the other Finder window where the zip file downloaded and move the entire .zip file to the folder Neo4j Desktop opened. You should see something like this when you are finished.
Double-click on the zip file, and it should unzip and create a
If you click on that folder, you should see 3 .csv files – products.csv, orders.csv, and order-details.csv.
You can open each of the files to see what the data looks like.
Now that our files are in the import folder, we can look at how to put the data into our database.
We will need to take the current table and column format in the CSV files and translate it into nodes and relationships.
This can be done a few different ways, but we will use Cypher’s
LOAD CSV command in this guide.
LOAD CSV is a built-in command in Cypher that allows us to read CSV files and append regular Cypher statements to create or update the data as a graph.
You can also use
LOAD CSV without creating the graph to output samples, counts, or distributions.
This helps us detect incorrect header column counts, delimiters, quotes, escapes, or spelling of header names before the data is written and stored.
In the previous step, we downloaded the .zip file, copied it to our Neo4j import folder, and unzipped the file into a directory containing 3 CSVs.
Before we insert anything into our graph database, we should inspect the data in the files a bit.
To do this, we can use the
LOAD CSV command.
If you opened the files previously, you may have noticed that two of the files have headers and one does not (products.csv).
Let us see how to inspect each type of file.
First, we can check how many lines are in the csv files to ensure they didn’t get corrupted or cut off from a potential export process.
For the files with headers, you simply add the
WITH HEADERS clause after LOAD CSV, so that it excludes the header row in the count and only counts the rows of data.
Running these commands should return the following counts:
- 77 rows for products.csv
- 830 rows for orders.csv
- 2155 rows for order-details.csv
Next, we can take a look at what the data looks like in the CSV files and how LOAD CSV sees it.
The only line we need to change from our Cypher above is the
Since all of these files have a decent numbers of rows, we will use
LIMIT to only get a sample.
Notice that the orders.csv and the order-details.csv return in a different format from the products.csv.
This is because those files have headers, so the column names are returned with the values for those rows.
Since the products.csv does not have column names, then
LOAD CSV just returns the plain data row from the file.
LOAD CSV command reads all values as a string.
No matter how the value appears in a file, it will be loaded as a string with
So, before we import, we want to ensure we convert any values that are non-string.
There are a variety of conversion functions in Cypher. The ones we will use for this exercise are as follows:
- toInteger(): converts a value to an integer.
- toFloat(): converts a value to a float (in this case, for monetary amounts).
- datetime(): converts a value to a datetime.
We will look at the values in each CSV file to determine what needs to be converted.
The values in the products.csv files are for product ID, product name, and unit cost.
Product ID looks like an integer value that increases with each row, so we can convert this to an integer using the
toInteger() function in Cypher.
Product name can remain a string since it consists of characters.
The final column is the product unit cost.
Though the sample values from our inspection are all whole numbers, we know that monetary amounts often have decimal place values, so we will convert these values to floats using the
We can see the Cypher to handle all of these conversions below; however, we are still not loading the values into Neo4j yet. We are just viewing the CSV files with converted values.
Note that we are using collection positions (row, row, row) to refer to the columns in the row and improve readability by using aliases to reference them in the return. In a file that has no headers, this is how to reference values in each position.
The values in the orders.csv (per the column names) are for orderID, orderDate, and shipCountry. Again, we can evaluate the values and determine any conversions to apply.
OrderID looks like an integer, so we can convert that using the
The orderDate column is certainly in a date format and will require us to format it using the
Finally, the shipCountry values are characters, so we can leave that as a string.
If you are using a version of Neo4j prior to 3.4, the
Just as we did with the last CSV file, let us look at the results of these conversions without importing the data.
There was one tricky thing with this CSV in the
Neo4j’s datetime uses the ISO 8601 format which uses the delimiter
T between the date and time values.
The CSV file does not have the ‘T’ joining the date and time values but has a space between them instead.
We used the
replace() function to change the space to the character ‘T’ and get the string into the expected format.
Then, we wrapped the
datetime() function around that to convert the changed string to a datetime value.
The values in the order-details.csv (from column names) are for productID, orderID, and quantity. Let us look at which ones need to be converted.
Our product ID is also from our products.csv file, where we converted that value to an integer.
We will do the same here to ensure we match formats.
The order ID field contains values from our orders.csv file, so we will match our previous conversion and translate this field to an integer, as well.
The quantity field in this file is a numeric value.
We can convert this to an integer with the
toInteger() function we have been using.
The results of these conversions are in the code below. Remember that we still are not loading any data yet.
Loading the data!
Now that we have determined that the CSV file data looks ok, and we have verified how
LOAD CSV sees the data and converted any non-string values, we are almost ready to create the data in our graph database!
To do that, we use Cypher statements alongside the
LOAD CSV commands we used above.
LOAD CSV will read the files, and the Cypher statements will create the data in our database.
Graph data model
An important step we need before writing Cypher statements, though, is to determine what the graph structure should look like once we import our file data. After all, importing the data in the existing table and column data will not provide the value we want to achieve from a graph. To utilize the graph database fully, we need a graph data model.
Though there are a variety ways to organize the products and orders in our files, we will save that for another guide and use the below version of the model for this exercise.
We have two nodes – one for a product and one for an order.
Each of those nodes have properties from our CSV files.
Product, we have ID, name, and unit cost.
Order, we have ID, date/time, and country where it is going.
The order-details.csv file defines the relationship between those two nodes.
That file has the product ID, the order ID it belongs to, and the quantity of the product on the order.
So, in our data model, this becomes our
CONTAINS relationship between
We also include a property of
quantityOrdered on the relationship because the product quantity value only exists when a product is related to an order.
Now that we know the types of nodes and relationships we will have and the properties involved, we can construct the Cypher statements to create the data for this model.
Avoiding duplicates and increasing performance
One final thing we need to think about before we create data in the graph is ensuring values are unique and performance is efficient. To handle this, we can use constraints. Just as with other databases, constraints ensure data integrity criteria is not violated, while simultaneously indexing the property with the constraint for faster query performance.
There are cases for applying indexes to a database before any data and with existing data.
In this exercise, we will add two constraints before we create any data – one for
productId and one for
This will ensure that, when we create a new node of each of those types or a relationship to connect them, we know the entities are unique and indexed.
Below is the Cypher for adding indexes.
Now we are ready to write the Cypher for creating the data in the graph!
We could use
CREATE statements where we are sure that we will not have duplicate rows in our CSV file and use
MATCH to find existing data for updates.
However, since it is hard to completely scrub all data and import perfectly clean data from any source, we will use
MERGE statements to check if the data exists before inserting.
If the node or relationship exists, Cypher will match and return (without any writes), but if it does not exist, Cypher will insert it.
MERGE can have some performance overhead, but often it is the better approach to maintain high data integrity.
Why both constraints and MERGE?
Using constraints is different from using MERGE.
Statements that create data in violation of the constraint will error, while statements that use
To start, let us load the products into the graph.
We start with our
LOAD CSV statement from above, then we add the Cypher to create the data from the CSV into our graph model.
We will use the
MERGE statement to check if the
Product already exists before we create it.
The properties will be set to the converted values we handled earlier in this guide.
If you run that statement, it will return the number of product nodes (
count(p)) that were created in the database.
You can cross-check that number with the number of rows in the CSV file from earlier (77 rows in products.csv).
You can also run a validation query to return a sample of nodes and review that the properties look accurate.
Next, we can load the orders.
Again, since we want to verify we do not create duplicate
Order nodes, we can use the
Just as with products, we start with the
LOAD CSV command, then add Cypher statements and include our data conversions.
We can also run a validation query, as before, to verify the graph data looks correct.
Last, but not least, we create the relationship between the products and the orders.
Since we expect all of our products and all of our orders to already exist in the graph (that data should have been loaded with the last two files), then we start with
MATCH to find the existing
MERGE statement will add the new relationship or match an existing one.
As we found when we ran a count on the order-details file above, there are 2,155 rows in the CSV.
While this is not a huge number for file imports, we decide to have Cypher periodically commit the data to the database to reduce the memory overhead of the transaction state.
For this, we can add the
USING PERIODIC COMMIT clause before the
LOAD CSV command.
The default value for periodic commit is 1,000, but for this exercise, we will ask Cypher to commit every 500 rows.
You could decrease this number if you have a lot of memory already allocated to other tasks, or if it is limited.
Just as we did above, we can validate the data with the query below.
Congratulations! You have successfully loaded 3 CSV files into a Neo4j graph database using Neo4j Desktop!
LOAD CSV functionality, coupled with Cypher, is exceptionally useful for getting data from files into a graph structure.
The best way to advance your skills in this area is to load a variety of files for various data sets and models.
Practice makes perfect!
If you work through this exercise again at a later time, feel free to increase the challenge by coming up with your own data model for these files or try to load some other CSV files to a graph.
If you have any questions or need assistance using
LOAD CSV, reach out to us on the Community Site!
To learn more about
LOAD CSV and its uses, check out our guide on LOAD CSV.