Reducing cardinality

About this module

As you examine execution plans for your queries, you will see that rows are processed and passed through to the next step of the execution plan. Even though much of the data resides in the Page Cache, you want to reduce the number of rows that the query needs to process. In this module, you will learn how to modify queries to reduce the amount of duplicate information that may be passed from the steps of the execution plan.

At the end of this module, you will be able to:

  • Describe cardinality in an execution plan.

  • Tune and rewrite queries to:

    • Aggregate early.

    • Use pattern comprehension.

    • Anchor node through selectivity by label.

    • Use indexes.

    • Use full-text schema indexes.

    • Use query hints for index usage.

    • Use LIMIT early in the query.

    • Use DISTINCT early in the query.

    • Use UNWIND to your advantage.

    • Avoid cartesian products.

Because some of the code examples in this lesson modify the database, it is recommended that you do not execute them against your database as you will be doing so in the hands-on exercises.

Understanding row cardinality

Here is a query that we will be looking to improve. The value of $movieTitle is "The Matrix":

PROFILE
MATCH (m:Movie {title: $movieTitle})
MATCH (dir:Person)-[:DIRECTED]->(m)
MATCH (actor:Person)-[:ACTED_IN]->(m)
RETURN m.title AS Title,
collect(DISTINCT actor.name) AS Actors,
collect(DISTINCT dir.name) AS Directors

Here is the resulting execution plan:

theMatrixStarter

Query execution is not what you might expect

Here is how you might expect the query to execute:

  1. We find all Movies with the title "The Matrix", where one row is returned.

  2. We then find all directors associated with that one movie. There are two rows returned.

  3. We also find all actors associated with that one movie. There are 13 rows returned.

  4. Then we return the title of the movie, the list of 2 unique directors, and the list of 13 unique actors.

theMatrixNot

Actual query explained

But this is NOT how the steps in the execution plan work. This is what really happens in the query.

  1. We find all Movies with the title "The Matrix", where one row is returned.

  2. We then find all directors associated with that one movie. There are two rows returned.

  3. We then find all actors associated with that one movie and a director. There are 26 rows returned, for each director/actor combination.

  4. Then we return the title of the movie, the list of 2 unique directors, and the list of 13 unique actors.

theMatrixActual

In reality, we are processing twice as many rows as we need to when matching the actors. There is duplicate work that we need to eliminate. We do see the correct result because we are specifying DISTINCT for the names of the actors and directors.

Making the query worse

Our query could be even worse if we matched the actors first and then the directors:

theMatrixWorse

Even though this query returns the same number of rows, the match of the directors is performed 13 times. This is more work than we need to do.

Combining paths does not help

Here is a revised query:

PROFILE
MATCH (dir:Person)-[:DIRECTED]->(m:Movie {title: $movieTitle})<-[:ACTED_IN]-(actor:Person)
RETURN m.title AS Title,
collect(DISTINCT actor.name) AS Actors,
collect(DISTINCT dir.name) AS Directors

And here we see the same execution plan:

CombineMatchAttempt

What increases cardinality?

Here are some things to keep in mind that typically increase the cardinality of your queries:

  • Multiple MATCH and OPTIONAL MATCH statements that are back-to back (even with a WITH) in between, especially when:

    • The nodes have a degree > 1 and you need to expand.

    • Index selectivity is > 1

  • Overuse of UNWIND operations because each element of the list becomes a row

  • Procedure results (when they YIELD something)

  • Lack of selectivity for the anchor nodes

How to reduce cardinality?

Here are some tips:

  • Aggregate earlier where the grouping key will become distinct.

  • Use pattern comprehension.

  • Use subqueries (new in Neo4j 4.1)

  • Use labels or indexes to select anchor nodes.

  • Take advantage of indexes.

  • WITH DISTINCT applies to the entire row, not just a single variable.

  • LIMIT reduces all rows, not results per row.

WITH on its own does not shrink cardinality.

Cardinality is a problem

Here is our original query:

PROFILE
MATCH (m:Movie {title: $movieTitle})
MATCH (dir:Person)-[:DIRECTED]->(m)
MATCH (actor:Person)-[:ACTED_IN]->(m)
RETURN m.title AS Title,
collect(DISTINCT actor.name) AS Actors,
collect(DISTINCT dir.name) AS Directors
MatrixQuery

We see that the problems are that we have back-to-back MATCH statements and we aggregate too late in the query. We are passing 26 rows through the query.

Reducing cardinality by aggregating earlier

We can improve this query buy moving the aggregation up:

PROFILE
MATCH (m:Movie {title: $movieTitle})
MATCH (dir:Person)-[:DIRECTED]->(m)
WITH m, collect(dir.name) AS Directors
MATCH (actor:Person)-[:ACTED_IN]->(m)
WITH m, Directors, collect(actor.name) AS Actors
RETURN m.title AS Title, Directors, Actors

With this revised query, as soon as we match the directors, we will collect the names which will be unique. Then when we execute the final MATCH. We are not passing two director rows to be processed, but simply the single row with the movie and list of directors.

Here is the execution plan:

ImprovedMatrixQuery

Here we see that the number of rows has been reduced and subsequently we also see that the number of db hits has been reduced.

Reducing cardinality with pattern comprehension

Pattern comprehension is a very powerful way to reduce cardinality. It behaves like an OPTIONAL MATCH combined with collect(). It behaves line an inline subquery.

Here is a rewrite of the original query using pattern comprehension:

PROFILE
MATCH (m:Movie {title: $movieTitle})
RETURN m.title,
[(dir:Person)-[:DIRECTED]->(m) | dir.name] AS Directors,
[(actor:Person)-[:ACTED_IN]->(m) | actor.name] AS Actors

In the RETURN statement, we are returning two lists, but they are created using pattern comprehension. A match pattern is specified that creates the lists by performing an implicit match and in this case, extracts the name property from the nodes retrieved.

Pattern comprehension does introduce new identifiers, but they are very useful especially if you want to do some filtering with WHERE and computing an expression as the result.

For example: [pattern WHERE <predicate> | <expression>]

Here is the execution plan for this query:

PatternComprehension

Here we see that the query retrieves the movie row and finds 2 rows for directors. With pattern comprehension, these 2 rows are collected and 1 row is then passed to the next pattern comprehension specified for actors. The 13 rows are collected into 1 row so that the final number of rows returned is 1. The use of pattern comprehension is slightly better and reduces the number of db hits.

Exercise 3: Reducing cardinality with aggregation

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-query-tuning-exercises

and follow the instructions for Exercise 3.

This exercise has 3 steps. Estimated time to complete: 20 minutes.

Reducing cardinality with anchor node selectivity

In your MATCH statement patterns, strive to create execution plans that either use an index or label (which is also an index).

In your execution plans, you may see these operators at the leaf steps:

  • NodeByLabelScan

  • Operators that use an index:

    • NodeIndexSeek

    • NodeUniqueIndexSeek

    • MultiNodeIndexSeek

    • NodeIndexSeekByRange

    • NodeUniqueIndexSeekByRange

    • NodeIndexContainsScan

    • NodeIndexEndsWithScan

    • NodeIndexScan

You never want to see AllNodesScan in an execution plan.

Reducing cardinality with labels

You want to see NodeByLabelScan in your execution plans if an index cannot be used. You must be familiar with how labels are used. Ideally you want the greatest selectivity for the anchor nodes.

For example, here is a query that will use NodeByLabelScan:

PROFILE
MATCH (p:Person)
RETURN p.name

It returns 18,726 rows.

PersonNodes

Use more specific labels

But if you are only really interested in directors, anchor your query with this node label:

PROFILE
MATCH (p:Director)
RETURN p.name
DirectorNodes

Reducing cardinality with indexes

A really big win for reducing cardinality is to ensure that indexes can be used for your queries, especially if they represent unique constraints for a value. If a query is performed frequently by the application, you must add an index for the property that is used for the query. The type of index-related step in the execution plan will depend upon the type of filtering your query requires.

Another type of index you can create in the database is the full-text schema index which provides additional indexing capabilities that you do not get from regular indexes:

  • multiple labels

  • properties of relationships

  • support for case-insensitive lookup

  • wildcard lookup

  • custom lucene analyzers

Example: A query that needs improvement

Here is an example where a full-text schema index helps. We want to query the roles in the ACTED_IN relationships. For this example, the value of $testString is "CABBIE".

PROFILE
MATCH (a:Actor)-[r:ACTED_IN]->(m:Movie)
WHERE ANY (role IN r.roles WHERE toUpper(role) CONTAINS $testString)
return m.title, r.roles, a.name

Here is the execution plan for this query:

CabbieExample

We see that to execute this query, we need many rows (6231,6231,56914,7). This spike in rows needed is something you never want to see in an execution plan. This query requires 185,771 db hits!

If this query is one that the application uses frequently, you will want to modify things so that it performs better.

Example: Refactoring the graph

We know that full-text schema indexes allow you to create indexes on relationship properties. This is what we want to do to improve the query.

The caveat, however is that the roles property of the ACTED_IN relationship contains a list of roles and we cannot create a full-text schema index on a list of strings. To solve this problem, we will refactor the graph to have 2 properties for the ACTED_IN relationship:

  • primaryRole

  • secondaryRole

We refactor the graph as follows, keeping the roles property as is:

MATCH (a:Actor)-[r:ACTED_IN]->(m:Movie)
SET r.primaryRole = r.roles[0], r.secondaryRole = r.roles[1]

As you learn about graph data modeling and implementing graphs, you will find that sometimes you will need to refactor the graph to improve query performance.

It is also possible to create an index on a comma-separated list with apoc.text.join(r.roles,",").

Example: Rewritten query with the refactored graph?

So the previous query with this change is:

PROFILE
MATCH (a:Actor)-[r:ACTED_IN]->(m:Movie)
WHERE toUpper(r.primaryRole) CONTAINS $testString OR
toUpper(r.secondaryRole) CONTAINS $testString
RETURN m.title, r.roles, a.name

And we see an execution plan that is still not performing well:

CabbieExample2

It has even more db hits, 407,041 because the properties are stored in different physical locations and require greater db access.

Solution: Adding a full-text schema index

Now that we have separated out the values for the roles, we can add a full-text schema index for these properties:

CALL db.index.fulltext.createRelationshipIndex(
      'ActedInRoleIndex',['ACTED_IN'], ['primaryRole','secondaryRole'])

Example: Using the full-text schema index

After adding this type of index, we can query the graph, but the query will change. Because it is a full-text schema index, it must be called differently and the query changes to something like this:

PROFILE
CALL db.index.fulltext.queryRelationships(
     'ActedInRoleIndex', $testString) YIELD relationship
WITH relationship AS rel
RETURN startNode(rel).name, endNode(rel).title, rel.roles

Here is the execution plan for this query:

CabbieExample3

We can’t really compare db hits here because we are calling a procedures for the full-text schema search, but we do see fewer rows in the execution plan. In the previous query (before adding the full-text schema index), the execution plan performed a NodeByLabelScan which produced a lot of rows. We have already determined from the call to queryRelationships which particular relationships are associated with the index. The problem, however is that the execution plan scans for all relationships between Actors and Movies. This is a problem.

Example: Improved query using full-text schema index

The solution to this is the remove the labels from the MATCH statement so that only the found relationships will be used to retrieve the appropriate Actor and Movie nodes.

Here is the improved query:

PROFILE
CALL db.index.fulltext.queryRelationships(
     'ActedInRoleIndex', $testString) YIELD relationship
WITH relationship AS rel
MATCH (a)-[rel]->(m)
RETURN a.name, m.title, rel.roles

In this special case, we do not want the NodeByLabel scan to occur. Here is the execution plan:

CabbieExample4

This is much better. We see far fewer rows, no NodeByLabelScan, much fewer db hits, and a smaller elapsed time.

When you call a procedure in Cypher, the execution plan shows zero db hits. When calling procedures, you will mainly rely on rows and elapsed time.

When you use CALL for a subquery, however, db hits are measured.

Full-text schema indexes can be used in these special cases where you want to optimize access to a property of a relationship. They are also good for optimizing case-insensitive searches on any node or relationship property string.

Reducing cardinality with query hints

If the database has indexes, strive to ensure that execution plans use them. Ideally, you want indexes that have values with the lowest selectivity. The query planner will always choose to use indexes with low selectivity values.

By default, the execution plan will use a single index.

Here is a query that uses an index $actor1 is "Tom Cruise" and $actor2 is "Kevin Bacon".

PROFILE
MATCH p = (p1:Person)-[:ACTED_IN*4]-(p2:Person)
WHERE p1.name = $actor1
  AND p2.name = $actor2
RETURN [n IN nodes(p) | coalesce(n.title, n.name)]

It finds all paths that represent 4 hops between two actors where $actor1 is "Tom Cruise" and $actor2 is "Kevin Bacon". Then it returns a list of names or titles for the nodes in the paths found.

UsingOneIndex

The index on Person.name is used for the MATCH for the p1 side of the query path, but then you see that there are 47,721 rows retrieved and then 34 rows filtered to return the data required. A total of 283,320 db hits.

In this example, we are interested in all possible paths of this length. If you only need one, use shortestPath() for significantly better performance.

Using query hints

You can force the use of more than one index so that an index is used to find p1 nodes and p2 nodes:

PROFILE
MATCH p = (p1:Person)-[:ACTED_IN*4]-(p2:Person)
USING INDEX p1:Person(name)
USING INDEX p2:Person(name)
WHERE p1.name = $actor1
  AND p2.name = $actor2
RETURN [n IN nodes(p) | coalesce(n.title, n.name)]
UsingTwoIndexes

Here we see fewer rows and fewer db hits, as well as a reduced elapsed time.

What you will find is that the performance of this type of query when the indexes are unique will out-perform indexes that are non-unique because the runtime stops fetching from them after the first result. In this database, the Person.name index is not unique. But for this particular database, there is only one actor named Tom Cruise and one actor named Kevin Bacon. If the database had multiple actors with these names, you would see a greater performance degradation (and cardinality) with using multiple indexes.

Use query hints with caution

Use caution, however when you are explicitly specifying the use of indexes. Here are some things to consider:

  • The planner will take the selectivity of an index into account when evaluating equality.

  • Forcing a plan means that planner cannot adapt when the underlying data changes.

  • Your plan may be more efficient specifically while being less efficient generally.

  • Hints can inform the planner about the structure of your data in ways the planner cannot infer itself.

  • If you do use hints, use them to force the plan around aspects of the data model that will remain consistent.

Index selectivity

You can use the APOC procedure to view the selectivity of your indexes.

CALL apoc.schema.nodes()
IndexSelectivity

USING SCAN

Just like you can explicitly specify if/when an index will be used or assume the index will be automatically used, you can also specify not to use an index. You would specify not to use an index if one of your node labels represented the data you are interested in retrieving. For example, you can set a flag or status label to a set of nodes that you know will be queried. That way, you need not access any properties.

In our graph, there is an index on the Genre nodes. By default, any query that is based upon the name property will use the index. If you want to scan the nodes by their label and not use the index, then you can specify:

PROFILE
MATCH (g:Genre)
USING SCAN g:Genre
WHERE g.name CONTAINS $text
RETURN g.name

Whether queries that rely simply on labels, rather than the index, will depend on your data model.

Using JOIN to reduce cardinality

JOIN is useful when you are performing queries on patterns that are focused on a particular node. This is particularly useful for dense nodes.

Here is our starting query:

PROFILE
MATCH (a)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d)
RETURN collect(a.name), m.title, collect(d.name)
WithoutJOIN

Adding JOIN

We can add JOIN to this query:

PROFILE
MATCH (a)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d)
USING JOIN ON m
RETURN collect(a.name), m.title, collect(d.name)
WithJOIN

What happens here is that it does two expands to follow the path to m from a and from d. Then it compares the m’s from each side with each other in a hash-join. There are fewer rows in the execution plan, as well as db hits, and a lower execution time.

Exercise 4: Reducing Cardinality with Anchor Selectivity

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-query-tuning-exercises

and follow the instructions for Exercise 4.

This exercise has 3 steps. Estimated time to complete: 20 minutes.

LIMIT is too far down in the query

If the query is written so that a limited number of results are returned, it is best to move the LIMIT up in the query.

Here is an example:

PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a)
WITH m, collect(a) AS Actors
RETURN m.title as Title, Actors LIMIT 10
LimitLate

Here you can see that after the initial query, many rows are passed through the steps of the execution plan.

Solution: Moving LIMIT up in the query

Here is a better way to do it:

PROFILE
MATCH (m:Movie)
WITH m LIMIT 10
MATCH (m)<-[:ACTED_IN]-(a)
WITH m, collect(a) AS Actors
RETURN m.title AS Title, Actors

We know ahead of time that we want 10 rows, one for each movie so expanding after we have retrieved the 10 rows is better.

LimitEarly

DISTINCT is too far down in the query

DISTINCT is another way that you can reduce row cardinality in the execution plan. Just like you just saw how to move LIMIT earlier in the query, you can move DISTINCT up to reduce rows required.

Here is another example where we have set $titleMatch to "Matrix"

PROFILE
MATCH (p:Person)-[:ACTED_IN| DIRECTED]->(m)
WHERE m.title CONTAINS $titleMatch
WITH p
MATCH (p)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
RETURN DISTINCT p.name, p2.name

This query finds all people who acted in or directed a movie with $titleMatch in the title. This query will return duplicate Person nodes because some people both acted in and directed movies. Then we have a subsequent query where we use the returned people to find other people who have acted in a movie with the first actor, p. We then use DISTINCT to ensure that we have distinct rows in our return.

DistinctLate

Solution: Moving DISTINCT further up in the query

We can make a slight improvement by moving DISTINCT earlier in the query:

PROFILE
MATCH (p:Person)-[:ACTED_IN| DIRECTED]->(m)
WHERE m.title contains $titleMatch
WITH DISTINCT p
MATCH (p)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
RETURN p.name, p2.name
DistinctEarly

It has slightly better execution time and we definitely see fewer rows in the execution plan.

Use UNWIND judiciously

UNWIND creates rows so if you use it, be aware that you are introducing more rows. Sometimes UNWIND is useful, especially if you are creating relationships or refactoring nodes in the graph.

Here is an example where we do a query to find all Movie nodes that are between two Person nodes by at most 4 hops. In this example $actor1 is "Tom Cruise" and $actor2 is "Kevin Bacon". We use UNWIND to create the rows for all nodes visited:

PROFILE
MATCH path = (p1:Person)-[:ACTED_IN*4]-(p2:Person)
USING INDEX p1:Person(name)
USING INDEX p2:Person(name)
WHERE p1.name = $actor1
  AND p2.name = $actor2
UNWIND (nodes(path)) as visitedNode
WITH DISTINCT visitedNode
WHERE visitedNode:Movie
RETURN visitedNode.title
UNWIND1

The UNWIND adds 170 rows to the query. This isn’t that bad considering the total number of rows passed between the steps of the execution plan.

Another UNWIND example

Here is another example where we use UNWIND to combine lists to return rows to process:

PROFILE
MATCH (p1:Person)-[:ACTED_IN]-(m1)-[:ACTED_IN]-(p)-[:ACTED_IN]-(m2)-[:ACTED_IN]-(p2:Person)
USING INDEX p1:Person(name)
USING INDEX p2:Person(name)
WHERE p1.name = $actor1
  AND p2.name = $actor2
WITH collect(m1) as movies1, collect(m2) as movies2
UNWIND (movies1 + movies2) as VisitedNode
WITH DISTINCT VisitedNode
RETURN VisitedNode.title
UNWIND2

This version of the query produces fewer rows from the UNWIND and had better performance.

Cartesian products are expensive

Cartesian products are useful as hash-joins when you are creating relationships between nodes. But for read-only queries, aim to eliminate cartesian products in your queries.

Here is an example where we have set $year to 1990:

PROFILE MATCH (a:Actor), (m:Movie)
WHERE m.releaseYear = $year AND a.born > $year
RETURN collect(DISTINCT a) AS actors, collect(DISTINCT m) AS movies
Cartesian1

Solution 1: Eliminate cartesian products

Here is a better way to do the same query using subqueries:

PROFILE
CALL {
MATCH (m:Movie) WHERE m.releaseYear = $year RETURN collect(m) AS movies
}
CALL {
MATCH (a:Actor) WHERE a.born > $year RETURN collect(DISTINCT a) AS actors }
RETURN movies, actors
Cartesian2

It is about the same number of db hits, but it performs much faster.

Solution 2: Eliminate cartesian products

Here is an even better way to do the same query using UNION:

PROFILE MATCH (m:Movie) WHERE m.releaseYear = $year
RETURN {type:"movies", movies: collect(m)} as data
union all
MATCH (a:Actor) WHERE a.born > $year
RETURN { type:"actors", count:collect(DISTINCT a)} AS data
Cartesian3

It is about the same number of db hits, but it performs slightly better than the use of subqueries.

Exercise 5: Reducing Cardinality with LIMIT, DISTINCT

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-query-tuning-exercises

and follow the instructions for Exercise 5.

This exercise has 3 steps. Estimated time to complete: 20 minutes.

Check your understanding

Question 1

Which of the following factors will impact the cardinality in the steps of an execution plan?

Select the correct answers.

  • Lack of selectivity for anchoring the query.

  • The number of nodes and relationships in the database.

  • Overuse of UNWIND clauses

  • Multiple back-to-back MATCH statements that return more than one row

Question 2

What are some things you can do to reduce cardinality in your execution plans?

Select the correct answers.

  • Aggregate early

  • Limit early

  • Make sure indexes are used

  • Use WITH frequently between MATCH clauses

Question 3

Which Cypher clause cannot provide db hit metrics in the execution plan?

Select the correct answer.

  • CALL for a procedure

  • WITH

  • FOREACH

  • USING INDEX

Summary

You can now:

  • Describe cardinality in an execution plan.

  • Tune and rewrite queries to:

    • Aggregate early.

    • Use pattern comprehension.

    • Anchor node through selectivity by label.

    • Use indexes.

    • Use full-text schema indexes.

    • Use query hints for index usage.

    • Use LIMIT early in the query.

    • Use DISTINCT early in the query.

    • Use UNWIND to your advantage.

    • Avoid cartesian products.