Loading CSV files

Follow along with a notebook in Colab Google Colab

A CSV file can be loaded into an AuraDS instance using the LOAD CSV Cypher clause. For security reasons it is not possible to load local CSV files, which must be instead publicly accessible on HTTP or HTTPS servers such as GitHub, Google Drive, and Dropbox. Another way to make CSV files available is to upload them to a cloud bucket storage (such as Google Cloud Storage or Amazon S3) and configure the bucket as a static website.

In this example we will load three CSV files:

  • movies.csv: a list of movies with their title, release year and a short description

  • people.csv: a list of actors with their year of birth

  • actors.csv: a list of acting roles, where actors are matched with the movies they had a role in

Warning: The LOAD CSV command is built to handle small to medium-sized data sets, such as anything up to 10 million nodes and relationships. You should avoid using this command for any data sets exceeding this limit.

Setup

For more information on how to get started using Python, refer to the Connecting with Python tutorial.

pip install graphdatascience
# Import the client
from graphdatascience import GraphDataScience

# Replace with the actual URI, username, and password
AURA_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = ""

# Configure the client with AuraDS-recommended settings
gds = GraphDataScience(
    AURA_CONNECTION_URI,
    auth=(AURA_USERNAME, AURA_PASSWORD),
    aura_ds=True
)

In the following code examples we use the print function to print Pandas DataFrame and Series objects. You can try different ways to print a Pandas object, for instance via the to_string and to_json methods; if you use a JSON representation, in some cases you may need to include a default handler to handle Neo4j DateTime objects. Check the Python connection section for some examples.

For more information on how to get started using the Cypher Shell, refer to the Neo4j Cypher Shell tutorial.

Run the following commands from the directory where the Cypher shell is installed.
export AURA_CONNECTION_URI="neo4j+s://xxxxxxxx.databases.neo4j.io"
export AURA_USERNAME="neo4j"
export AURA_PASSWORD=""

./cypher-shell -a $AURA_CONNECTION_URI -u $AURA_USERNAME -p $AURA_PASSWORD

For more information on how to get started using Python, refer to the Connecting with Python tutorial.

pip install neo4j
# Import the driver
from neo4j import GraphDatabase

# Replace with the actual URI, username, and password
AURA_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = ""

# Instantiate the driver
driver = GraphDatabase.driver(
    AURA_CONNECTION_URI,
    auth=(AURA_USERNAME, AURA_PASSWORD)
)
# Import to prettify results
import json

# Import for the JSON helper function
from neo4j.time import DateTime

# Helper function for serializing Neo4j DateTime in JSON dumps
def default(o):
    if isinstance(o, (DateTime)):
        return o.isoformat()

Create constraints

Adding constraints before loading any data usally improves data loading performance. In fact, besides adding an integrity check, a unique constraint adds an index on a property at the same time, so that MATCH and MERGE operations during loading are faster.

Warning: For best performance when using MERGE or MATCH with LOAD CSV, make sure an index or a unique constraint has been created on the property used for merging. For more information, follow the Import data tutorial in the Cypher documentation.

In this example we add uniqueness constraints on both movie titles and actors' names.

# Make movie titles unique
gds.run_cypher("""
    CREATE CONSTRAINT FOR (movie:Movie) REQUIRE movie.title IS UNIQUE
""")

# Make person names unique
gds.run_cypher("""
    CREATE CONSTRAINT FOR (person:Person) REQUIRE person.name IS UNIQUE
""")
CREATE CONSTRAINT FOR (movie:Movie) REQUIRE movie.title IS UNIQUE;
CREATE CONSTRAINT FOR (person:Person) REQUIRE person.name IS UNIQUE;
movie_title_constraint = """
    CREATE CONSTRAINT FOR (movie:Movie) REQUIRE movie.title IS UNIQUE
"""

person_name_constraint = """
    CREATE CONSTRAINT FOR (person:Person) REQUIRE person.name IS UNIQUE
"""

# Create the driver session
with driver.session() as session:
    # Make movie titles unique
    session.run(movie_title_constraint).data()
    # Make person names unique
    session.run(person_name_constraint).data()

Add nodes from CSV files

We are now ready to load the CSV files from their URIs and create nodes from the data they contain. In the following examples, LOAD CSV is used with WITH HEADERS to access row fields by their corresponding column name. Furthermore:

  • MERGE is used with the indexed properties to take advantage of the constraints created in the Create constraints section.

  • ON CREATE SET is used to set the value of a node property when a new one is created.

  • RETURN count(*) is used to show the number of processed rows.

Note that the CSV files in this example are curated, so some assumptions are made for simplicity. In a real-world scenario, for example, a CSV file could contain multiple rows that would attempt to assign different property values to the same node; in this case, an ON MATCH SET clause must be added to ensure this case is dealt with appropriately.

gds.run_cypher("""
    LOAD CSV
      WITH HEADERS
      FROM 'https://data.neo4j.com/intro/movies/movies.csv' AS row
    MERGE (m:Movie {title: row.title})
      ON CREATE SET m.released = toInteger(row.released), m.tagline = row.tagline
    RETURN count(*)
""")
gds.run_cypher("""
    LOAD CSV
      WITH HEADERS
      FROM 'https://data.neo4j.com/intro/movies/people.csv' AS row
    MERGE (p:Person {name: row.name})
      ON CREATE SET p.born = toInteger(row.born)
    RETURN count(*)
""")
LOAD CSV
  WITH HEADERS
  FROM 'https://data.neo4j.com/intro/movies/movies.csv' AS row
MERGE (m:Movie {title: row.title})
  ON CREATE SET m.released = toInteger(row.released), m.tagline = row.tagline
RETURN count(*);
LOAD CSV
  WITH HEADERS
  FROM 'https://data.neo4j.com/intro/movies/people.csv' AS row
MERGE (p:Person {name: row.name})
  ON CREATE SET p.born = toInteger(row.born)
RETURN count(*);
load_movies_csv = """
    LOAD CSV
      WITH HEADERS
      FROM 'https://data.neo4j.com/intro/movies/movies.csv' AS row
    MERGE (m:Movie {title: row.title})
      ON CREATE SET m.released = toInteger(row.released), m.tagline = row.tagline
    RETURN count(*)
"""

# Create the driver session
with driver.session() as session:
    # Load the CSV file
    session.run(load_movies_csv).data()
load_people_csv = """
    LOAD CSV
      WITH HEADERS
      FROM 'https://data.neo4j.com/intro/movies/people.csv' AS row
    MERGE (p:Person {name: row.name})
      ON CREATE SET p.born = toInteger(row.born)
    RETURN count(*)
"""

# Create the driver session
with driver.session() as session:
    # Load the CSV file
    session.run(load_people_csv).data()

Add relationships from CSV files

Similarly to what we have done for nodes, we now create relationships from the actors.csv file. In the following examples, LOAD CSV is used with:

  • WITH HEADERS to access row fields by their corresponding column name;

  • USING PERIODIC COMMIT is used to control memory usage by creating a commit every n processed rows rather than after processing the whole file. This is especially useful when loading large files;

  • FIELDTERMINATOR is used to explicitly set the field terminator character (although the comma is the default).

Furthermore:

  • MATCH and MERGE are used to find nodes (taking advantage of the constraints created in the Create constraints section) and create a relationship between them.

  • ON CREATE SET is used to set the value of a relationship property when a new one is created.

  • RETURN count(*) is used to show the number of processed rows.

gds.run_cypher("""
    USING PERIODIC COMMIT 50
    LOAD CSV
      WITH HEADERS
      FROM 'https://data.neo4j.com/intro/movies/actors.csv' AS row
      FIELDTERMINATOR ','
    MATCH (p:Person {name: row.person})
    MATCH (m:Movie {title: row.movie})
    MERGE (p)-[actedIn:ACTED_IN]->(m)
      ON CREATE SET actedIn.roles = split(row.roles, ';')
    RETURN count(*)
""")
USING PERIODIC COMMIT 50
LOAD CSV
  WITH HEADERS
  FROM 'https://data.neo4j.com/intro/movies/actors.csv' AS row
  FIELDTERMINATOR ','
MATCH (p:Person {name: row.person})
MATCH (m:Movie {title: row.movie})
MERGE (p)-[actedIn:ACTED_IN]->(m)
  ON CREATE SET actedIn.roles = split(row.roles, ';')
RETURN count(*)
load_actors_csv = """
    USING PERIODIC COMMIT 50
    LOAD CSV
      WITH HEADERS
      FROM 'https://data.neo4j.com/intro/movies/actors.csv' AS row
      FIELDTERMINATOR ','
    MATCH (p:Person {name: row.person})
    MATCH (m:Movie {title: row.movie})
    MERGE (p)-[actedIn:ACTED_IN]->(m)
      ON CREATE SET actedIn.roles = split(row.roles, ';')
    RETURN count(*)
"""

# Create the driver session
with driver.session() as session:
    # Load the CSV file
    session.run(load_actors_csv).data()

Run a Cypher query

Once all the nodes and relationships have been created, we can run a query to check that the data have been inserted correctly. The following query looks for movies with Keanu Reeves, orders them by release date and groups their titles.

gds.run_cypher("""
    MATCH (person:Person {name: "Keanu Reeves"})-[:ACTED_IN]->(movie)
    RETURN movie.released, COLLECT(movie.title) AS movies
    ORDER BY movie.released
""")
MATCH (person:Person {name: "Keanu Reeves"})-[:ACTED_IN]->(movie)
RETURN movie.released, COLLECT(movie.title) AS movies
ORDER BY movie.released
query = """
    MATCH (person:Person {name: "Keanu Reeves"})-[:ACTED_IN]->(movie)
    RETURN movie.released, COLLECT(movie.title) AS movies
    ORDER BY movie.released
"""

# Create the driver session
with driver.session() as session:
    # Run the Cypher query
    session.run(query).data()

Cleanup

When the data are no longer useful, the database can be cleaned up.

# Delete data
gds.run_cypher("""
    MATCH (n)
    DETACH DELETE n
""")
MATCH (n)
DETACH DELETE n
delete_data = """
    MATCH (n)
    DETACH DELETE n
"""

# Create the driver session
with driver.session() as session:
    # Delete the data
    session.run(delete_data).data()