Importing Data into the Graph

About this module

At the end of this module, you will be able to:

Write Cypher code to import CSV data into the graph.
Confirm that the data has been loaded.

Because the code examples in this lesson modify the database, we recommend that you do not execute them against your database now. You will run them in the hands-on exercises instead.

Options for importing data into the graph

You have many options for importing data into Neo4j. Which option you choose depends on:

How much data you have.
What tools you are comfortable using.
How much time you have to perform the import.

In this training, we will use Cypher to import data into the graph. Using Cypher enables us to control and customize how data is created and refactored as our model changes.

Prepare for the import

Before you import data into the graph, you must have an idea of the target graph data model you want to achieve. Work with the data architects for your application so that everybody agrees upon:

Names of entities (node labels).
Names of relationships.
Names of properties for nodes and relationships.
Constraints to be defined.
Indexes required.
The most important queries.

Review: Steps for loading CSV data with Cypher

CSV import is commonly used to import data into a graph. If you want to import data from CSV, you will need to first develop a model that describes how data from your CSV maps to data in your graph.

Assuming that you have an agreed-upon data model, here are the basic steps you follow for importing using Cypher and CSV files:

Determine how the CSV file will be structured.
Determine whether the data is normalized or denormalized.
Ensure IDs to be used in the data are unique.
Ensure data in CSV files is "clean".
Execute Cypher code to inspect the data.
Determine if data needs to be transformed.
If required, ensure constraints are created in the graph.
Determine the size of the data to be loaded.
Execute Cypher code to load the data.
Add indexes to the graph.

These steps are covered in the course Importing Data with Neo4j 4.x: Using LOAD CSV for Import.
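As a rough illustration of the "clean" data and inspection steps, the following sketch counts rows that are missing an origin or destination airport code. The URL and the Origin and Dest column names are borrowed from the flights example later in this lesson. Treat it as one possible check under those assumptions, not a complete validation.

```cypher
// Count rows whose Origin or Dest field is missing or blank.
// coalesce() maps a null field to '' so that one comparison covers both cases.
LOAD CSV WITH HEADERS FROM 'https://r.neo4j.com/flights_2019_1k' AS row
WITH row
WHERE trim(coalesce(row.Origin, '')) = ''
   OR trim(coalesce(row.Dest, '')) = ''
RETURN count(*) AS rowsMissingAirport
```

A result of 0 suggests that every row can be mapped to two Airport nodes. Any other value means the file needs cleaning or the load statement needs to skip those rows.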

Review: Using `LOAD CSV`

Here is the simplified syntax for using LOAD CSV:

LOAD CSV     // load csv data
WITH HEADERS // optionally use first header row as keys in "row" map
FROM "url"   // file:/// file relative to $NEO4J_HOME/import or http://
AS row       // return each row of the CSV as list of strings or map
// ... rest of the Cypher statement ...

You can use LOAD CSV for CSV files that contain fewer than 100k lines.
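To illustrate what omitting WITH HEADERS means in the syntax above, here is a minimal sketch that reads the flights file used later in this lesson without header handling. Each row is then a list of strings addressed by position, and the header line comes back as the first row.

```cypher
// Without WITH HEADERS, "row" is a list of strings, not a map.
// The first row returned is the header line itself.
LOAD CSV FROM 'https://r.neo4j.com/flights_2019_1k' AS row
RETURN row[0] AS firstField, row[1] AS secondField
LIMIT 3
```

With WITH HEADERS, the same fields are addressed by name (for example row.Origin), which is usually easier to read and less sensitive to column order.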

Example: Inspecting data from a CSV file on the network
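A minimal sketch of such an inspection, assuming the flights file used later in this lesson is reachable over https. It returns a few rows without writing anything to the graph.

```cypher
// Read the CSV over https and return the first 5 rows as maps
// keyed by the header names. Nothing is written to the graph.
LOAD CSV WITH HEADERS FROM 'https://r.neo4j.com/flights_2019_1k' AS row
RETURN row
LIMIT 5
```

Looking at a handful of rows this way shows the header names and value formats you will need when you write the statement that loads the data.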

Example: Inspecting data from a CSV file in the import folder
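A minimal sketch for a local file, assuming a copy of the data has been placed in the import folder under the hypothetical name flights_2019_1k.csv. A file:/// URL is resolved relative to that folder.

```cypher
// flights_2019_1k.csv is a hypothetical file name, assumed to be
// located in the import folder of the database instance.
// Counting the rows gives the size of the data before you load it.
LOAD CSV WITH HEADERS FROM 'file:///flights_2019_1k.csv' AS row
RETURN count(*) AS rowCount
```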

Example: Creating nodes and relationships

LOAD CSV reads the CSV file one row at a time, and you use each row to create nodes and relationships, for example:

LOAD CSV WITH HEADERS FROM 'https://r.neo4j.com/flights_2019_1k' AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)-[connection:CONNECTED_TO {
  airline: row.UniqueCarrier,
  flightNumber: row.FlightNum,
  date: toInteger(row.Year) + '-' + toInteger(row.Month) + '-' + toInteger(row.DayofMonth)}]->(destination)
ON CREATE SET connection.departure = toInteger(row.CRSDepTime), connection.arrival = toInteger(row.CRSArrTime)

As each row is read from the file, Airport nodes are merged with code property values of row.Origin and row.Dest. From the row values, we then create the CONNECTED_TO relationship between the two nodes, using row.UniqueCarrier for the airline property, row.FlightNum for the flightNumber property, and row.Year, row.Month, and row.DayofMonth joined with hyphens for the date property. We use MERGE so that a node or relationship with the same property values is not created twice. Only when the relationship is newly created (ON CREATE) do we set the additional properties, departure and arrival.
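One way to confirm that the data has been loaded is to count what the statement above should have produced. This is a sketch that assumes only the Airport label and the CONNECTED_TO relationship type used in the example.

```cypher
// Count Airport nodes, then count CONNECTED_TO relationships.
// OPTIONAL MATCH keeps the result row even if no relationships exist,
// in which case connections is reported as 0.
MATCH (a:Airport)
WITH count(a) AS airports
OPTIONAL MATCH (:Airport)-[c:CONNECTED_TO]->(:Airport)
RETURN airports, count(c) AS connections
```

Because the load uses MERGE, running it a second time should leave both counts unchanged.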

For large datasets, you must ensure that a uniqueness constraint (which is backed by an index) is created on the Airport code property before you load the data. This will dramatically improve the performance of the load because MERGE uses the index to look up existing nodes. This dataset is small, so load performance is not an issue at this point.
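A sketch of such a constraint in Neo4j 4.x syntax. The constraint name Airport_code_unique is an illustrative choice, not something defined by this course.

```cypher
// Neo4j 4.x syntax. Airport_code_unique is a hypothetical constraint name.
// Creating the constraint also creates the index that MERGE uses for lookups.
CREATE CONSTRAINT Airport_code_unique ON (a:Airport) ASSERT a.code IS UNIQUE
```

Later Neo4j versions express the same constraint with FOR ... REQUIRE in place of ON ... ASSERT, so check the syntax for the version you are running.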

Exercise 2: Loading airport data

Your first import of airline data will use a CSV file with 1K lines so you will use the standard LOAD CSV statement. This CSV file has already been cleaned up and is in a normalized format.

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-neo4j-modeling-exercises

and follow the instructions for Exercise 2.

This exercise has 9 steps. Estimated time to complete: 30 minutes.

Check your understanding

Question 1

What Cypher statement do you use to import data from a CSV file?

Select the correct answer.

LOAD DATA
IMPORT DATA
LOAD CSV
IMPORT CSV

Question 2

Up to how many lines of data can you import using LOAD CSV?

Select the correct answer.

1K
10K
100K
1M

Question 3

When you import data using LOAD CSV, where can the CSV data come from?

Select the correct answers.

File that has been placed in the import folder relative to the database instance.
File that has been placed in the Neo4j Desktop project.
File at a network location accessible via http/https.
A JDBC connection that is open.

Summary

You can now:

Write Cypher code to import CSV data into the graph.
Confirm that the data has been loaded.
