Common Confusions on Cypher (and How to Avoid Them)


Learn How to Avoid the Most Common Confusions around the Cypher Query LanguageI spend a lot of my free time answering Stack Overflow questions about Cypher, Neo4j’s graph query language.

Cypher is an intuitive, ASCII-art-driven language that allows you to query your property graph by specifying patterns of nodes and relationships. While Cypher is the reason many developers choose Neo4j, it is not immune to common gotchas.

In this post I want to go over a few recurring questions and mistakes I see on Stack Overflow or at Neo4j trainings. All of my examples will use the movie graph that ships with the Neo4j Browser, so you can follow along by executing :play movies in the browser query bar.

LIMIT x vs. collect()[..x]

In many cases it’s clear when to use LIMIT and when to use collect().

LIMIT

An obvious use of LIMIT might be: “Find the top five actors ordered by how many movies they’ve acted in.”

MATCH (actor:Person)
RETURN actor.name, size((actor)-[:ACTED_IN]->(:Movie)) AS movies
ORDER BY movies DESC
LIMIT 5;

actor.name     | movies
---------------+-------
Tom Hanks      |     12
Keanu Reeves   |      7
Hugo Weaving   |      5
Jack Nicholson |      5
Meg Ryan       |      5

collect()

An obvious use of collect() might be: Find the Matrix movies and their directors as a collection.

MATCH (director:Person)-[:DIRECTED]->(movie:Movie)
WHERE movie.title STARTS WITH "The Matrix"
RETURN movie.title, collect(director.name) AS directors;

movie.title            | directors                           
-----------------------+-------------------------------------
The Matrix Revolutions | ['Andy Wachowski', 'Lana Wachowski']
The Matrix             | ['Lana Wachowski', 'Andy Wachowski']
The Matrix Reloaded    | ['Lana Wachowski', 'Andy Wachowski']

LIMIT and collect()

But once you need the top x records of an entity grouped by some aggregate it gets a little trickier. Below are some examples of people struggling with this:

Let’s say we want to find the two oldest people in the graph and the three most recent movies they’ve acted in. This requires a combination of LIMIT and collect() along with some ORDER BYs.

// Get the two oldest people.
MATCH (actor:Person)
WITH actor
ORDER BY actor.born
LIMIT 2

// Get their three most recent movies.
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
WITH actor, movie
ORDER BY movie.released DESC
RETURN actor.name, 
       2016 - actor.born AS age,
       collect(movie.title)[..3] AS movies;

actor.name    | age | movies                                            
--------------+-----+---------------------------------------------------
Max von Sydow |  87 | ['Snow Falling on Cedars', 'What Dreams May Come']
Gene Hackman  |  86 | ['The Replacements', 'The Birdcage', 'Unforgiven']

The general pattern is to use ORDER BY to order your data as needed and then use collect()[..x] for collecting the top x records for each row.

I know what you’re thinking: What if I want to return a Cartesian product of these actors and movies? That’s where UNWIND comes in and it’s also useful if you want to continue with the query and write more MATCH clauses with the actors and movies you’ve already matched on.

// Get the two oldest people.
MATCH (actor:Person)
WITH actor
ORDER BY actor.born
LIMIT 2

// Get their three most recent movies.
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
WITH actor, movie
ORDER BY movie.released DESC
WITH actor, collect(movie)[..3] AS m

// Unwind the collection into rows.
UNWIND m AS movie
RETURN actor.name, 
       2016 - actor.born AS age,
       movie.title;

actor.name    | age | movie.title           
--------------+-----+-----------------------
Gene Hackman  |  86 | The Replacements      
Gene Hackman  |  86 | The Birdcage          
Gene Hackman  |  86 | Unforgiven            
Max von Sydow |  87 | Snow Falling on Cedars
Max von Sydow |  87 | What Dreams May Come 

MERGE

This is easily the most misunderstood Cypher keyword. It works as documented yet the actual behavior of MERGE often differs from users’ expectations. A few examples of users struggling with this are below.

The following is an exercise I do at every Fundamentals of Neo4j training to clear up the confusion. Assume we have a uniqueness constraint on :Person nodes by the name property.

CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;

We know there’s already a :Person node in the graph with the property name:"Tom Hanks". Let’s say we want to find or create this node if it doesn’t exist while also adding a new property, oscar_winner. At this point I show the following query and ask what will happen:

MERGE (p:Person {name:"Tom Hanks", oscar_winner: true})
RETURN p.name, p.oscar_winner;

Almost everyone says: It will find the :Person node where name:"Tom Hanks" and then add the oscar_winner:true property. Wrong!

Node 23478 already exists with label Person and property "name"=[Tom Hanks]

MERGE matches on the entire pattern you specify within a single clause. When Neo4j determines that a node with the :Person label and properties name:"Tom Hanks" and oscar_winner:true doesn’t exist, Neo4j attempts to create the node specified. This is when the uniqueness constraint is violated, because a :Person node with property name:"Tom Hanks" already exists.

The solution is to MERGE on the unique property and then use SET to update additional properties.

MERGE (p:Person {name:"Tom Hanks"})
SET p.oscar_winner = true
RETURN p.name, p.oscar_winner;

p.name    | p.oscar_winner
----------+---------------
Tom Hanks | True  

There are also ON MATCH SET and ON CREATE SET for additional control over what happens depending on if the node was found or created, respectively. Read more about those here.

This concept also applies to relationships in MERGE clauses. What do you think will happen with the following? Note that a :Person node with name:"Tom Hanks" already exists whereas a :Person node with name:"Nicole White" does not.

MERGE (tom:Person {name:"Tom Hanks"})-[:KNOWS]-(nicole:Person {name:"Nicole White"})
RETURN tom.name, nicole.name;

At this point everyone’s learned from their previous mistake and correctly guesses that this will throw an error for similar reasons.

Node 23478 already exists with label Person and property "name"=[Tom Hanks]

Once Neo4j has determined that the entire pattern specified in the MERGE clause doesn’t exist, it tries to create everything in the pattern, including:

    1. a :Person node with name:"Tom Hanks"
    2. a :Person node with name:"Nicole White"
    3. a :KNOWS relationship between them

Numbers 2 and 3 would be okay, but the query throws an error because number 1 violates the uniqueness constraint. The solution is to follow the best practice of using MERGE on the individual parts of a pattern that may or may not exist:

MERGE (tom:Person {name:"Tom Hanks"})
MERGE (nicole:Person {name:"Nicole White"})
MERGE (tom)-[:KNOWS]-(nicole)
RETURN tom.name, nicole.name;

tom.name  | nicole.name 
----------+-------------
Tom Hanks | Nicole White

WITH

The WITH clause trips people up because it both changes the scope of variables and, if any aggregates are present, automatically groups by the other variables in the clause. Below are misunderstandings of each of these concepts respectively.

Unbound Variables

Leaving a variable out of the WITH clause causes it to become unbound. It’s easy to catch this error if you attempt to use such a variable in a subsequent WITH or WHERE clause.

For example, let’s say we want to find movies where all of the movie’s directors also wrote the movie. The following attempt will expose an obvious unbound variable problem:

MATCH (p:Person)-[:DIRECTED]->(m:Movie)
WITH m, collect(p) AS directors
WHERE (p)-[:WROTE]->(m)
RETURN m.title, [x IN directors | x.name];

p not defined (line 3, column 8 (offset: 79))
"WHERE (p)-[:WROTE]->(m)"
        ^

But if you try to use that variable in a subsequent MATCH clause, no error will be thrown because the MATCH clause thinks you’re binding to a new variable.

MATCH (p:Person)-[:DIRECTED]->(m:Movie)
WITH m, collect(p) AS directors
MATCH (p)-[:WROTE]->(m)
RETURN m.title, [x IN directors | x.name];

m.title                | [x IN directors | x.name]           
-----------------------+-------------------------------------
A Few Good Men         | ['Rob Reiner']                      
Something's Gotta Give | ['Nancy Meyers']                    
Speed Racer            | ['Andy Wachowski', 'Lana Wachowski']
Speed Racer            | ['Andy Wachowski', 'Lana Wachowski']
Jerry Maguire          | ['Cameron Crowe']                   
Top Gun                | ['Tony Scott']                      
V for Vendetta         | ['James Marshall']
V for Vendetta         | ['James Marshall']                  
When Harry Met Sally   | ['Rob Reiner'] 

Some might look at the above query and say that it’s matching on movies where all its directors are also among its writers, but that is not correct. This query is actually showing you the directors of all movies that have a writer. The variable p was unbound at line 2 of this query and the MATCH clause at line 3 is binding nodes of any label that have an outgoing :WROTE relationship to a new variable p. The accurate version of this query would be:

MATCH (p:Person)-[:DIRECTED]->(m:Movie)
WHERE (p)-[:WROTE]->(m)
WITH m, collect(p) AS directors
RETURN m.title, [x IN directors | x.name];

m.title                | [x IN directors | x.name]        
-----------------------+-------------------------------------
Something's Gotta Give | ['Nancy Meyers']              
Speed Racer            | ['Andy Wachowski', 'Lana Wachowski']
Jerry Maguire          | ['Cameron Crowe'] 

Automatic Grouping

In addition to carrying over variables, the WITH clause also automatically groups by variables if any aggregates are present. A common use case for WITH is to determine an aggregate that you want to filter on. For example, we can find all movies where the average age of its actors is greater than 70.

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, avg(2016 - p.born) AS avg_age
WHERE avg_age > 70
RETURN m.title, avg_age
ORDER BY avg_age DESC;

m.title                         |  avg_age
--------------------------------+---------
Unforgiven                      |    86.00
One Flew Over the Cuckoo's Nest |    76.50
The Birdcage                    |    70.33

At the second line, WITH m, avg(...), we included the variable m which is bound to movies and an aggregate function avg() such that the WITH clause automatically grouped the data by m. If at this point you asked someone to also include the names of the actors in these movies, they might make the following mistake:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, p, avg(2016 - p.born) AS avg_age
WHERE avg_age > 70
RETURN m.title, avg_age, collect(p.name) AS actors
ORDER BY avg_age DESC;

m.title                         | avg_age | actors                                              
--------------------------------+---------+-----------------------------------------------------
Snow Falling on Cedars          |    87.0 | ['Max von Sydow']                                   
What Dreams May Come            |    87.0 | ['Max von Sydow']                                   
The Birdcage                    |    86.0 | ['Gene Hackman']                                    
The Replacements                |    86.0 | ['Gene Hackman']                                    
Unforgiven                      |    86.0 | ['Gene Hackman', 'Richard Harris', 'Clint Eastwood']
Top Gun                         |    83.0 | ['Tom Skerritt']                                    
Hoffa                           |    79.0 | ['Jack Nicholson']                                  
A Few Good Men                  |    79.0 | ['Jack Nicholson']                                  
As Good as It Gets              |    79.0 | ['Jack Nicholson']                                  
Something's Gotta Give          |    79.0 | ['Jack Nicholson']                                  
One Flew Over the Cuckoo's Nest |    79.0 | ['Jack Nicholson']                                  
Frost/Nixon                     |    78.0 | ['Frank Langella']                                  
The Da Vinci Code               |    77.0 | ['Ian McKellen']                                    
V for Vendetta                  |    76.0 | ['John Hurt']                                       
The Green Mile                  |    76.0 | ['James Cromwell']                                  
The Devil's Advocate            |    76.0 | ['Al Pacino']                                       
Snow Falling on Cedars          |    76.0 | ['James Cromwell']                                  
RescueDawn                      |    74.0 | ['Marshall Bell']                                   
Stand By Me                     |    74.0 | ['Marshall Bell']                                   
What Dreams May Come            |    74.0 | ['Werner Herzog']                                   
Hoffa                           |    73.0 | ['J.T. Walsh']                                      
A Few Good Men                  |    73.0 | ['J.T. Walsh']                                      
One Flew Over the Cuckoo's Nest |    72.0 | ['Danny DeVito']                                    
Hoffa                           |    72.0 | ['Danny DeVito'] 

By including p in the WITH clause, the data has also been grouped by p and the avg_age is grouped by both movie and person, leading to nonsensical results: we’re calculating the average age of one person. The correct way to do this is to collect the actors by movie in the WITH clause:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, avg(2016 - p.born) AS avg_age, collect(p.name) AS actors
WHERE avg_age > 70
RETURN m.title, avg_age, actors
ORDER BY avg_age DESC;

m.title                         |  avg_age | actors                                              
--------------------------------+----------+-----------------------------------------------------
Unforgiven                      |    86.00 | ['Gene Hackman', 'Clint Eastwood', 'Richard Harris']
One Flew Over the Cuckoo's Nest |    76.50 | ['Jack Nicholson', 'Danny DeVito']                  
The Birdcage                    |    70.33 | ['Gene Hackman', 'Nathan Lane', 'Robin Williams']   

Next Steps

The best way to improve your Cypher is, of course, to write more Cypher!

It’s also very helpful to read other peoples’ queries to pick up new tips and tricks; I probably learn a new Cypher trick once a week by reading Michael Hunger‘s Cypher queries. You can get more practice by signing up for the online training, reading more on our Cypher developer page, and by keeping the Cypher refcard handy at all times.

Want to sharpen your Cypher even more? Click below to get your free copy of the Learning Neo4j ebook and level up your graph database skills in no time.