Cardinality issues are the most frequent culprit in slow or incorrect Cypher queries. Because of this, understanding cardinality, and using this understanding to manage cardinality issues, is a critical component in Cypher query tuning, and query correctness in general.
We will use the built in Movies graph for examples (use
:play movies in the Neo4j browser to create the dataset).
Note that the query planner can optimize some operations. It may change the order of some operations, changing the order of expansion, or which nodes we use as the starting nodes, or more. It’s still best to be mindful of cardinality when tuning your queries, even if the plan isn’t as expected, or circumvents the issue for you in some cases.
Cypher operations execute per row, and generate rows of results
To understand cardinality in Cypher, two important aspects of Cypher execution need to be understood first:
- Cypher operations execute per record/row* of the input stream to the operation
- Cypher operations generate streams of result records/rows*
* While “record” is more technically correct (Neo4j doesn’t use tables, so it doesn’t really have rows), “row” is often more familiar, and is the term used in query plan output. We’ll be using “row” from here on.
These are two aspects of the same principle: streams of rows are both the input and output for cypher operations.
You can see how streams of rows flow between Cypher operations in a PROFILE query plan, increasing from match expansions and unwinds, and diminishing through filtering, aggregations, and limits.
The more rows in the stream, the more work is performed by the next operation, contributing to db hits and query execution time.
As an example, take this simple query on the movies graph to find all actors in The Matrix:
If we decided to return the data at this point (
RETURN m.title as title, actor.name as name) we would get the following results (in no particular order):
These are 5 rows in the result stream.
If instead of returning we decide to do something more from those match results, these rows will be the input for the next operation in the query.
That operation will execute for each these rows.
What is cardinality in Cypher and why does it matter?
Cardinality in general refers to how many times the same entry occurs for a variable in the stream of rows.
A cardinality of 1 means that each distinct entry for a variable only occurs in 1 row in the stream.
Anything else means that distinct value occurs in multiple rows of the stream.
Remember, operations execute per row.
Managing cardinality is all about making sure that when you perform operations on values, that you shrink the cardinality of that value first.
Why does that matter?
- Because we want the query to be quick; we want to do the least amount of work needed, as opposed to redundantly performing the same operation for the same value multiple times.
- Because we want the query to be correct; we don’t want to see unwanted duplicate result rows, or end up creating duplicate graph elements.
Cardinality issues can lead to redundant and wasteful operations
Note that in the results of our Matrix query above, cardinality is different between each variable.
In the query results above, The Matrix movie has a cardinality of 5: the same node is present on all rows. The actors however have a cardinality of 1: each distinct actor only occurs in a single row of the stream.
If we performed a match from
actor (the variable for actors in the match), that match would only be performed once per distinct actor.
However if we performed a match from
m (the variable for movies in the match), that match would be performed 5 times redundantly for the same Matrix node.
Cardinality issues can result in duplicated data
Remember that operations execute per row. This includes CREATE and MERGE operations.
Consider if we wanted to create a new relationship :WORKED_ON between actors and directors to movies they worked on.
Just looking at the Matrix movie, a wrong approach may look like this:
If we viewed the result, we would see two :WORKED_ON relationships between each actor and The Matrix, and 5 :WORKED_ON relationships between each director and The Matrix.
Why? Because the first two matches above resulted in a cross product of the 5 actors and the 2 directors, 10 rows total.
Each distinct director would appear on 5 of those rows (once for each actor) and each distinct actor would appear on 2 of those rows (once for each director). The CREATE operations would execute on each of those 10 rows, leading to the redundant relationships.
While we could solve this by using MERGE instead of CREATE, which would only create the expected number of relationships, we’re still performing redundant operations in the process.
How do we manage cardinality?
We manage cardinality mostly through aggregations and reordering operations in the query, and sometimes through the use of LIMIT (when it makes sense to do so).
The important part about aggregations is that the combination of non-aggregation variables become distinct. We shrink their cardinality to 1.
Let’s take the above query and use an aggregation to reduce the cardinality
In the second line, we perform a collect() aggregation. The only non-aggregation variable,
m, becomes the distinct grouping key. Its cardinality drops to 1.
Because of this, the subsequent expand operation from the next MATCH will only execute once for The Matrix node, instead of 5 times as before.
But what if we want to perform additional matches from
In that case we can UNWIND our collection after our match:
Pattern comprehension can help
Pattern comprehension is a way to populate a list with the results of an expansion. If your desired result include collections of connected nodes, it’s a good way to keep cardinality low and make the query a little less verbose.
Reorder the query to aggregate earlier
Newcomers to Cypher (especially those from SQL backgrounds) often try to perform many operations (limit, aggregation, etc) in the RETURN statement.
In Cypher, we encourage performing these operations early, where it makes sense to do so, as it can keep cardinality low and prevent wasteful operations.
Here’s an example of performing aggregations late, though we get the correct answer through usage of
In Neo4j 3.3.5, the PROFILE for this has 621 db hits.
We do get the right answer in the end, but the more back-to-back matches or optional matches we perform, the more cardinality issues have a chance to snowball multiplicatively.
If we reorder the query to COLLECT() after each OPTIONAL MATCH instead, or use pattern comprehension, we cut down on unnecessary work, as our expand operations occur on each movie keeping cardinality at 1.
In Neo4j 3.3.5, the PROFILE for this has 331 db hits.
Of course, on queries on small graphs like this with small result sets, and few operations, the difference is negligible when we look at timing differences.
However as graph data grows, as well as the complexity of graph queries and results, keeping cardinality low and avoiding multiplicative db hits becomes the difference between a quick streamlined query and one that may exceed acceptable execution time.
Use LIMIT or DISTINCT or an aggregation to reset cardinality
Sometimes in the course of a query we want to expand from some node, perform operations on the nodes we expand to, then expand on a different set of nodes from the original. If we’re not careful, we can encounter cardinality issues.
Consider the attempt to create :WORKED_ON relationships from earlier:
This query resulted in duplicated relationships, even if we used MERGE we would still be doing more work than needed.
One solution here is to do all the processing on one set of nodes first, then do the processing on the next set. The first step toward such a solution might look like this:
Even though we get 1 :WORKED_ON relationship for each actor, we are still seeing 5 :WORKED_ON relationship per director.
Why? Because cardinality does not reset automatically. Even though we have
WITH m in the middle, we still have 5 rows, one per actor, with The Matrix as
m for each of them.
To fix this, we need to either use LIMIT, DISTINCT, or an aggregation to reset the cardinality of
m to 1.
This makes sense when we’re only working with a single movie. But if we were attempting this query for many movies instead of just The Matrix, we wouldn’t be able to limit here, since LIMIT limits all rows, not just per distinct value.
This will reset the cardinality back to 1 for each distinct movie.
The following query will also work just fine, since when we aggregate the non-aggregation variables become distinct:
Use LIMIT early if possible
While not directly related to cardinality, if you’re using LIMIT in your query it’s advantageous to LIMIT early, if possible, instead of at the end.
Consider the differences here:
In Neo4j 3.3.5 the PROFILE for this has 331 db hits.
In Neo4j 3.3.5 the PROFILE for this has 11 db hits.
We avoid doing work that will only be thrown out when we finally perform the LIMIT.