You have learned how to set up your development environment for accessing a Neo4j graph and how to write basic Cypher statements for querying and modifying the graph.
At the end of this module, you should be able to:
In a deployed application, you should not hard-code values in your Cypher statements. You use a variety of values when you are testing your Cypher statements, but you don’t want to change the Cypher statement every time you test. In addition, you typically include Cypher statements in an application where parameters are passed in to the Cypher statement before it executes. For these scenarios, you should parameterize values in your Cypher statements.
In your Cypher statements, a parameter name begins with the $ symbol.
Here is an example where we have parameterized the query:
At runtime, if the parameter $actorName has a value, it will be used in the Cypher statement when it runs in the graph engine.
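A parameterized query of the kind described above might look like the following sketch. It assumes the Movie graph used in this training, with Person and Movie nodes connected by ACTED_IN relationships; the returned properties and the ordering are chosen only for illustration.

```cypher
// $actorName is supplied at runtime instead of being hard-coded
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = $actorName
RETURN m.released, m.title
ORDER BY m.released DESC
```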
In Neo4j Browser, you can set values for Cypher parameters that will be in effect during your session.
You can set the value of a single parameter in the query editor pane with the :param command, for example to set the value Tom Hanks for the parameter actorName.
You can even specify a Cypher expression to the right of =>.
Here is the result of executing the :param command:
Notice here that :param is a client-side browser command. It takes a name and an expression and stores the value of that expression for the name in the session.
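As a sketch, setting the actorName parameter in Neo4j Browser could look like this. The // line is an annotation for the reader; only the :param line is typed into the query editor pane.

```cypher
// Client-side browser command: store 'Tom Hanks' under the name actorName
:param actorName => 'Tom Hanks'
```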
After the actorName parameter is set, you can run the query that uses the parameter:
Subsequently, you need only change the value of the parameter and not the Cypher statement to test with different values.
After we have changed the actorName parameter to ‘Tom Cruise’, we get a different result with the same Cypher query:
You can also use the JSON-style syntax to set all of the parameters in your Neo4j Browser session. The values you can specify in this object are numbers, strings, and booleans. In this example we set two parameters for our session:
With the result:
If you want to remove an existing parameter from your session, you do so by using the JSON-style syntax and excluding the parameter for your session.
If you want to view the current parameters and their values, simply type :params.
If you want to clear all parameters, you can simply type:
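The session-wide parameter commands described above can be sketched as follows. Each command is entered on its own in the query editor pane, the // lines are annotations only, and the movieName parameter and its value are hypothetical.

```cypher
// Set (replace) all session parameters using the JSON-style syntax
:params {actorName: 'Tom Cruise', movieName: 'Top Gun'}

// Remove movieName by setting the parameters again without it
:params {actorName: 'Tom Cruise'}

// View the current parameters and their values
:params

// Clear all parameters
:params {}
```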
In the query edit pane of Neo4j Browser, execute the browser command: :play intro-neo4j-exercises and follow the instructions for Exercise 12.
The Movie graph that you have been using during training is a very small graph. As you start working with large datasets, it will be important to not only add appropriate indexes to your graph, but also write Cypher statements that execute as efficiently as possible.
There are two Cypher keywords you can prefix a Cypher statement with to analyze a query:
EXPLAIN provides estimates of the graph engine processing that will occur, but does not execute the Cypher statement.
PROFILE provides real profiling information for what has occurred in the graph engine during the query and executes the Cypher statement.
The EXPLAIN option provides the Cypher query plan. You can compare different Cypher statements to understand the stages of processing that will occur when the Cypher executes.
Here is an example where we have set the actorName and year parameters for our session and we execute this Cypher statement:
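A statement of this shape, sketched under the assumption that actorName holds a person's name and year holds an integer release year, could be:

```cypher
// Assumes $actorName and $year were already set for the session
// EXPLAIN only produces the plan; the statement itself does not run
EXPLAIN
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = $actorName AND m.released < $year
RETURN p.name, m.title, m.released
```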
Here is the query plan returned:
You can expand each phase of the Cypher execution to examine what code is expected to run. Each phase of the query presents you with an estimate of the number of rows expected to be returned. With EXPLAIN, the query does not run; the graph engine simply produces the query plan.
For a better metric for analyzing how the Cypher statement will run, you use the PROFILE keyword, which runs the Cypher statement and gives you run-time performance metrics.
Here is the result returned using PROFILE for this Cypher statement:
Here we see that for each phase of the graph engine processing, we can view the cache hits and most importantly the number of times the graph engine accessed the database (db hits). This is an important metric that will affect the performance of the Cypher statement at run-time.
For example, if we were to change the Cypher statement so that the node labels are not specified, we see these metrics when we profile:
Here we see more db hits, which makes sense because all nodes need to be scanned to perform this query.
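To reproduce this comparison, the two variants can be profiled one after the other. This is a sketch that assumes the same actorName and year parameters; the exact db hit counts depend on your graph.

```cypher
// With labels: the search is restricted to Person and Movie nodes
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = $actorName AND m.released < $year
RETURN p.name, m.title, m.released;

// Without labels: every node is a candidate, so expect more db hits
PROFILE
MATCH (p)-[:ACTED_IN]->(m)
WHERE p.name = $actorName AND m.released < $year
RETURN p.name, m.title, m.released
```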
If you are testing an application and have run several queries against the graph, there may be times when your Neo4j Browser session hangs with what seems to be a very long-running query. There are two reasons why a Cypher query may take a long time:
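The query returns a lot of data. It completes execution in the graph engine, but it takes a long time to return the results to the client. For example: MATCH (a)--(b)--(c)--(d)--(e)--(f) RETURN a
The query takes a long time to execute in the graph engine. For example: MATCH (a), (b), (c), (d), (e) RETURN count(id(a))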
If the query executes and then returns a lot of data, there is no way to monitor it or kill the query. All that you can do is close your Neo4j Browser session and start a new one. If the server has many of these rogue queries running, it will slow down considerably so you should aim to limit these types of queries. If you are running Neo4j Desktop, you can simply restart the database to clear things up, but if you are using a Neo4j Sandbox, you cannot do so. The database server is always running and you cannot restart it. Your only option is to shut down the Neo4j Sandbox and create a new Neo4j Sandbox, but then you lose any data you have worked with.
If, however, the query is a long-running query, you can monitor it by using the :queries command. Here is a screenshot where we are monitoring a long-running query in another Neo4j Browser session:
The :queries command calls dbms.listQueries under the hood, which is why we see two queries here. We have turned on AUTO-REFRESH so we can monitor the number of ms used by the graph engine thus far. You can kill the running query by double-clicking the icon in the Kill column. Alternatively, you can call the dbms.killQuery procedure with the ID of the query.
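Monitoring and killing a query with these procedures might look like the sketch below. It assumes the Neo4j 3.5-era procedures this course uses (later releases replace them), and the query ID shown is a placeholder.

```cypher
// List the queries currently running on the server
CALL dbms.listQueries()
YIELD queryId, query, elapsedTimeMillis
RETURN queryId, query, elapsedTimeMillis;

// Kill one of them; 'query-123' is a placeholder, use an ID returned above
CALL dbms.killQuery('query-123')
```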
Here is what happens in the Neo4j Browser session where the long-running query was run:
In the query edit pane of Neo4j Browser, execute the browser command: :play intro-neo4j-exercises and follow the instructions for Exercise 13.
You have seen that you can accidentally create duplicate nodes in the graph if you’re not protected. In most graphs, you will want to prevent duplication of data. Unfortunately, you cannot prevent duplication by checking the existence of the exact node (with properties), because this type of test is not cluster or multi-thread safe, since no locks are used. This is one reason why MERGE is preferred over CREATE, because MERGE does use locks.
In addition, you have learned that a node or relationship need not have a particular property. What if you want to ensure that all nodes or relationships of a specific type (label) must set values for certain properties?
A third scenario with graph data is where you want to ensure that a set of property values for nodes of the same type is unique. This is the same thing as a primary key in a relational database.
All of these scenarios are common in many graphs. In Neo4j, you can use Cypher to:
Constraints and node keys that enforce uniqueness are related to indexes which you will learn about later in this module.
Note: Existence constraints and node keys are only available in the Enterprise Edition of Neo4j.
You add a uniqueness constraint to the graph by creating a constraint that asserts that a particular node property is unique in the graph for a particular type of node.
Here is an example for ensuring that the title for a node of type Movie is unique:
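A minimal sketch of such a constraint, written in the Neo4j 3.5-era syntax this course uses (newer releases use a different constraint syntax):

```cypher
// Assert that no two Movie nodes share the same title
CREATE CONSTRAINT ON (m:Movie) ASSERT m.title IS UNIQUE
```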
This Cypher statement will fail if the graph already has multiple Movie nodes with the same value for the title property. Note that you can create a uniqueness constraint, even if some Movie nodes do not have a title property.
Here is the result of running this Cypher statement on the Movie graph:
And if we attempt to create a Movie with the title, The Matrix, the Cypher statement will fail because the graph already has a movie with that title:
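Such an attempt could be as simple as the following sketch, which assumes the uniqueness constraint on the title property is already in place:

```cypher
// Expected to fail: a Movie node titled 'The Matrix' already exists
CREATE (:Movie {title: 'The Matrix'})
```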
Here is the result of running this Cypher statement on the Movie graph:
In addition, if you attempt to modify the value of a property where the uniqueness assertion fails, the property will not be updated.
Having uniqueness for a property value is only useful in the graph if the property exists. In most cases, you will want your graph to also enforce the existence of properties, not only for those node properties that require uniqueness, but for other nodes and relationships where you require a property to be set. Uniqueness constraints can only be created for nodes, but existence constraints can be created for node or relationship properties.
You add an existence constraint to the graph by creating a constraint that asserts that a particular type of node or relationship property must exist in the graph when a node or relationship of that type is created or updated.
Recall that in the Movie graph, the movie, Something’s Gotta Give has no tagline property:
Here is an example for adding the existence constraint to the tagline property of all Movie nodes in the graph:
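In the Neo4j 3.5-era syntax assumed throughout this module, an existence constraint on the tagline property could be sketched as:

```cypher
// Require every Movie node to have a tagline property (Enterprise Edition)
CREATE CONSTRAINT ON (m:Movie) ASSERT exists(m.tagline)
```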
Here is the result of running this Cypher statement:
The constraint cannot be added to the graph because a node has been detected that violates the constraint.
We know that in the Movie graph, all :REVIEWED relationships currently have a property, rating. We can create an existence constraint on that property as follows:
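A sketch of this relationship constraint, again assuming Neo4j 3.5-era syntax; the variable name rel is arbitrary:

```cypher
// Require every REVIEWED relationship to have a rating property
CREATE CONSTRAINT ON ()-[rel:REVIEWED]-() ASSERT exists(rel.rating)
```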
Notice that when you create the constraint on a relationship, you need not specify the direction of the relationship. With the result:
So after creating this constraint, if we attempt to create a :REVIEWED relationship without setting the rating property:
We see this error:
You will also see this error if you attempt to remove a property from a node or relationship where the existence constraint has been created in the graph.
You can run the browser command :schema to view existing indexes and constraints defined for the graph.
Just as you have used other db related methods to query the schema of the graph, you can query for the set of constraints defined in the graph as follows:
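Assuming the db.constraints procedure from the Neo4j 3.5 era (later releases replace it with other commands), the call is simply:

```cypher
// Return a description of each constraint defined in the graph
CALL db.constraints()
```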
And here is what is returned from the graph:
Note: Using the method notation for the CALL statement enables you to use the call for returning results that may be used later in the Cypher statement.
You use similar syntax to drop an existence or uniqueness constraint, except that you use the DROP keyword rather than CREATE.
Here we drop the existence constraint for the rating property for all REVIEWED relationships in the graph:
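Mirroring the statement that created it, the drop could be sketched as follows (Neo4j 3.5-era syntax assumed):

```cypher
// Remove the existence constraint on the rating property of REVIEWED
DROP CONSTRAINT ON ()-[rel:REVIEWED]-() ASSERT exists(rel.rating)
```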
With the result:
A node key is used to define the uniqueness constraint for multiple properties of a node of a certain type. A node key is also used as a composite index in the graph.
Suppose that in our Movie graph, we will not allow a Person node to be created where both the name and born properties are the same. We can create a constraint that will be a node key to ensure that this uniqueness for the set of properties is asserted.
Here is an example to create this node key:
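A node key over the name and born properties might be declared like this, assuming Neo4j 3.5-era syntax and Enterprise Edition:

```cypher
// The combination (name, born) must exist and be unique for every Person
CREATE CONSTRAINT ON (p:Person) ASSERT (p.name, p.born) IS NODE KEY
```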
Here is the result of running this Cypher statement on our Movie graph:
This attempt to create the constraint failed because there are Person nodes in the graph that do not have the born property defined.
If we set these properties for all nodes in the graph that do not have born properties with:
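One way to fill in the missing values is sketched below. The value 0 is only an assumed placeholder; a real graph would need a more meaningful default.

```cypher
// Give every Person node without a born property a placeholder value
MATCH (p:Person)
WHERE NOT exists(p.born)
SET p.born = 0
```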
Then the creation of the node key succeeds:
Any subsequent attempt to create or modify an existing Person node with name or born values that violate the uniqueness constraint as a node key will fail.
For example, executing this Cypher statement will fail:
Here is the result:
In the query edit pane of Neo4j Browser, execute the browser command: :play intro-neo4j-exercises and follow the instructions for Exercise 14.
The uniqueness and node key constraints that you add to a graph are essentially single-property and composite indexes respectively. Indexes are used to improve initial node lookup performance, but they require additional storage in the graph to maintain and also add to the cost of creating or modifying property values that are indexed. Indexes store redundant data that points to nodes with the specific property value or values. Unlike SQL, there is no such thing as a primary key in Neo4j. You can have multiple properties on nodes that must be unique.
Here is a brief summary of when single-property indexes are used:
Composite indexes are used only for equality checks and list membership.
In this module, we introduce the basics of Neo4j indexes, but you should consult the Neo4j Operations Manual for more details about creating and maintaining indexes.
Because index maintenance incurs additional overhead when nodes are created, we recommend that for large graphs, indexes are created after the data has been loaded into the graph. You can view the progress of the creation of an index when you use the :schema command.
When you add an index for a property of a node, it can greatly reduce the number of nodes the graph engine needs to visit in order to satisfy a query.
In this query we are testing the value of the released property of a Movie node using ranges:
The graph engine, using an index, will find the pointers to all nodes that satisfy the query without having to visit all of the nodes:
You create an index to improve graph engine performance. A unique constraint on a property is an index so you need not create an index for any properties you have created uniqueness constraints for. An index on its own does not guarantee uniqueness.
Here is an example of how we would create a single-property index on the released property of all nodes of type Movie:
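In the Neo4j 3.5-era syntax assumed in this module (newer releases use a different form), the index could be created with:

```cypher
// Single-property index on the released property of Movie nodes
CREATE INDEX ON :Movie(released)
```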
With the result:
If a set of properties for a node must be unique for every node, then you should create a constraint as a node key, rather than an index.
If, however, there can be duplication for a set of property values, but you want faster access to them, then you can create a composite index. A composite index is based upon multiple properties for a node.
Suppose we added the property, videoFormat to every Movie node and set its value, based upon the released date of the movie as follows:
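A sketch of two statements that would do this follows. The cutoff year and the format values 'DVD' and 'VHS' are assumptions made for illustration.

```cypher
// Movies released in 2000 or later get one format...
MATCH (m:Movie)
WHERE m.released >= 2000
SET m.videoFormat = 'DVD';

// ...and earlier movies get another
MATCH (m:Movie)
WHERE m.released < 2000
SET m.videoFormat = 'VHS'
```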
With the result:
Notice that in the above Cypher statements we use the semi-colon (;) to separate the statements.
Now that the graph has Movie nodes with both the properties, released and videoFormat, we can create a composite index on these properties as follows:
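Assuming the same Neo4j 3.5-era syntax, a composite index lists both properties:

```cypher
// Composite index on the released and videoFormat properties of Movie nodes
CREATE INDEX ON :Movie(released, videoFormat)
```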
With the result:
Just as you can retrieve the constraints defined for the graph using CALL db.constraints(), you can retrieve the indexes:
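Assuming the db.indexes procedure from the Neo4j 3.5 era (later releases replace it with other commands), the call is:

```cypher
// Return the indexes defined in the graph, including those backing constraints
CALL db.indexes()
```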
With the result:
Notice that the unique constraints and node keys are also shown as indexes in the graph.
You can drop an existing index that you created with CREATE INDEX by using DROP INDEX.
Here is an example of dropping the composite index that we just created:
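Mirroring the creation statement, the drop could be sketched as (Neo4j 3.5-era syntax assumed):

```cypher
// Remove the composite index on released and videoFormat
DROP INDEX ON :Movie(released, videoFormat)
```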
Here is the result:
In the query edit pane of Neo4j Browser, execute the browser command: :play intro-neo4j-exercises and follow the instructions for Exercise 15.
In this video, you will learn how developers use Neo4j for implementing all or part of their relational models.
In many applications, it is the case that the data that you want to populate your graph with comes from data that was written to .csv files or files of other types. There are many nuances and best practices for loading data into a graph from files. In this module, you will be introduced to some simple steps for loading CSV data into your graph with Cypher. If you are interested in direct loading of data from a relational DBMS into a graph, you should read about the Neo4j Extract Transform Load (ETL) tool at http://neo4j.com/developer/neo4j-etl/, as well as many of the useful pre-written procedures that are available for your use in the APOC library.
In Cypher, you can:
CSV import is commonly used to import data into a graph. If you want to import data from CSV, you will need to first develop a model that describes how data from your CSV maps to data in your graph.
Cypher provides an elegant built-in way to import tabular CSV data into graph structures.
The LOAD CSV clause parses a local file in the import directory of your Neo4j installation or a remote file into a stream of rows which represent maps (with headers) or lists.
Then you can use whichever Cypher operations you want to either create nodes or relationships or to merge with the existing graph.
Here is the simplified syntax for using LOAD CSV: LOAD CSV WITH HEADERS FROM url-value AS row
The first line of the file must contain a comma-separated list of column names. The url-value can be a resource or a file on your system. Each line contains data that is interpreted as values for each column name. When each line is read from the file, you can perform the necessary processing to create or merge data into the graph.
As CSV files usually represent either node or relationship lists, you will run multiple passes to create nodes and relationships separately.
The movies_to_load.csv file (sample below) contains the data that will add Movie nodes:
Before you load data from CSV files into your graph, you should first confirm that the data retrieved looks OK. Rather than creating nodes or relationships, you can simply return information about the data to be loaded.
For example you can execute this Cypher statement to get a count of the data to be loaded from the movies_to_load.csv file so you have an idea of how much data will be loaded:
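A counting statement might look like this sketch. It assumes the file has been placed in the import directory of your Neo4j installation; a remote URL works the same way.

```cypher
// Read the file without creating anything and count its data rows
LOAD CSV WITH HEADERS FROM 'file:///movies_to_load.csv' AS line
RETURN count(*)
```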
Here is the count result for this particular file:
You might even want to visually inspect the data before you load it to see if it is what you were expecting:
Here is the result of running the Cypher statement to visually inspect the data:
Notice here that the summary column’s data has an extra space before the data in the file. To ensure that tagline values in our graph do not have an extra space, we will trim the value before assigning it to the tagline property. Once we are sure we want to load the data into our graph, we do so by assigning values from each row read in to a new node.
You may want to format the data before it is loaded to confirm it matches what you want in your graph:
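Such a dry run could be sketched as follows. The column names id, year, and summary come from the description of this file; the title column and the local file location are assumptions.

```cypher
// Preview the converted values without writing to the graph
LOAD CSV WITH HEADERS FROM 'file:///movies_to_load.csv' AS line
RETURN line.id, line.title, toInteger(line.year), trim(line.summary)
```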
Here we see how the data will be formatted before it is loaded:
The following query creates the Movie nodes using some of the data from movies_to_load.csv as properties:
We assign a value to movieId from the id data in the CSV file. In addition, we assign the data from summary to the tagline property, with a trim. We also convert the data read from year to an integer using the built-in function toInteger() before assigning it to the released property.
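The load described above can be sketched as follows, assuming a title column and a file in the import directory:

```cypher
// Create one Movie node per CSV row
LOAD CSV WITH HEADERS FROM 'file:///movies_to_load.csv' AS line
CREATE (movie:Movie {movieId: line.id,
                     title: line.title,
                     released: toInteger(line.year),
                     tagline: trim(line.summary)})
```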
Here is the result of loading the movies_to_load.csv data into the graph:
The persons_to_load.csv file (sample below) holds the data that will populate the Person nodes.
In case you already have people in your database, you will want to avoid creating duplicates.
That’s why, instead of just creating them, we use MERGE to ensure unique entries after the import.
We use the ON CREATE clause to set the values for name and born.
There are a couple of things to note here. The name of the column is case-sensitive. In addition, notice that the data for the birthyear column has an extra space before the data. To allow this data to be converted to an integer, we must first trim the whitespace using the trim() built-in function.
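A sketch of this load follows. The birthyear column comes from the description above; the Id and name columns, the personId property, and the file location are assumptions.

```cypher
// MERGE avoids duplicating people that already exist in the graph
LOAD CSV WITH HEADERS FROM 'file:///persons_to_load.csv' AS line
MERGE (actor:Person {personId: line.Id})
// Properties are set only when the node is newly created
ON CREATE SET actor.name = line.name,
              actor.born = toInteger(trim(line.birthyear))
```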
Here is the result of loading the persons_to_load.csv data into the graph:
The roles_to_load.csv file (sample below) holds the data that will populate the relationships between the nodes.
The query below matches the entries of line.personId and line.movieId to their respective Movie and Person nodes, and creates an ACTED_IN relationship between the person and the movie. This model includes a relationship property of role, which is passed via line.role.
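Such a query might be sketched like this. It assumes the nodes carry the movieId and personId properties set during the earlier loads, and it stores the role as a single-element list in a roles property.

```cypher
LOAD CSV WITH HEADERS FROM 'file:///roles_to_load.csv' AS line
// Look up the two endpoints by their IDs
MATCH (movie:Movie {movieId: line.movieId})
MATCH (person:Person {personId: line.personId})
// Connect them, carrying the role from the CSV row
CREATE (person)-[:ACTED_IN {roles: [line.role]}]->(movie)
```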
Here is the result of loading the roles_to_load.csv data into the graph:
If your file contains denormalized data, you can run the same file with multiple passes and simple operations as shown above. Alternatively, you might have to use MERGE to create nodes and relationships uniquely.
For our use case, we can import the data using a CSV structure like this:
Here are the Cypher statements to load this data:
Notice a couple of things in this Cypher statement. This file uses a semi-colon as a field terminator, rather than the default comma. In addition, the built-in method split() is used to create the list for the roles property.
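A load of this kind could be sketched as follows. All column names (title, released, summary, actor, birthyear, characters) are hypothetical, as is the assumption that roles are comma-separated within the characters column.

```cypher
LOAD CSV WITH HEADERS FROM 'file:///movie_actor_roles_to_load.csv' AS line
FIELDTERMINATOR ';'
// MERGE creates each movie and actor only once across all rows
MERGE (movie:Movie {title: line.title})
ON CREATE SET movie.released = toInteger(line.released),
              movie.tagline = line.summary
MERGE (actor:Person {name: line.actor})
ON CREATE SET actor.born = toInteger(line.birthyear)
MERGE (actor)-[r:ACTED_IN]->(movie)
// split() turns the delimited string into a list of roles
ON CREATE SET r.roles = split(line.characters, ',')
```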
Here is the result of loading the movie_actor_roles_to_load.csv data into the graph:
For large denormalized files, it may still make sense to create nodes and relationships separately in multiple passes. That would depend on the complexity of the operations and the experienced performance.
If you import a larger amount of data (more than 10,000 rows), it is recommended to prefix your LOAD CSV clause with a PERIODIC COMMIT hint. This allows the database to regularly commit the import transactions to avoid memory churn for large transaction-states.
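In the Neo4j 3.5-era Cypher assumed in this module, the hint might be used as in the sketch below. The batch size of 500 and the title column are illustrative assumptions, and newer Neo4j releases replace this hint with a different mechanism.

```cypher
// Commit after every 500 rows instead of building one large transaction
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///movies_to_load.csv' AS line
CREATE (movie:Movie {movieId: line.id, title: line.title})
```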
In the query edit pane of Neo4j Browser, execute the browser command: :play intro-neo4j-exercises and follow the instructions for Exercise 16.
There are many ways that you can learn more about Neo4j. A good starting point for learning about the resources available to you is the Neo4j Learning Resources page at https://neo4j.com/developer/resources/.
What Cypher keyword can you use to prefix any Cypher statement to examine how many db hits occurred when the statement executed?
Select the correct answer.
What types of constraints can you define for a graph that are asserted when a node or relationship is created or updated?
Select the correct answers.
In general, what is the maximum number of nodes or relationships that you can easily create using LOAD CSV?
Select the correct answer.
You should now be able to: