Optimizing Property Access

About this module

You have learned some of the ways that you can tune queries by reducing the cardinality of the rows that are passed between the steps of an execution plan. In this module, you will learn how to rewrite queries so that property access is optimized.

At the end of this module, you will be able to:

  • Describe when properties are accessed in a query.

  • Optimize queries by minimizing property access.

When are properties accessed?

During a query, a node or relationship property is accessed when it is used to filter a query and to return the properties as. Here are some examples.

In the code below, the title property is accessed for the query and the releaseYear property is accessed when returning the query results:

MATCH (m:Movie {title: $movieTitle})
RETURN m.title, m.releaseYear

This query is the same:

MATCH (m:Movie)
WHERE m.title = $movieTitle
RETURN m.title, m.releaseYear
The title property need not be accessed in the RETURN clause because it is cached from the query. If the title property is indexed, then the node property need not be accessed from storage as the value is in the index.

Example: Property access

In the code below, the primaryRole and secondaryRole properties are accessed for the query and the name, title, and roles properties are accessed when returning the query results. You cannot define a "regular" index on a relationship property so for this query, every ACTED_IN relationship’s roles property values need to be accessed. This can be very expensive which is why full-text schema indexes are useful.

MATCH (actor:Person)-[r:ACTED_IN]->(movie:Movie)
WHERE r.primaryRole CONTAINS $characterString OR r.secondaryRole CONTAINS $characterString
RETURN actor.name, movie.title, r.roles

Here is how the previous query could use the full-text schema index that you learned how to create in the previous lesson on reducing cardinality.

In order to perform the query, only the index is used, and then because so few rows are found, the access to the name, title, and roles for a small set of rows is not that expensive.

CALL db.index.fulltext.queryRelationships(
     'ActedInRoleIndex', $testString) YIELD relationship
WITH relationship AS rel
RETURN startNode(rel).name, endNode(rel).title, rel.roles

Your goal for indexing and writing queries will be to minimize the number of properties that need to be accessed during the query. In addition, you will aim to only return property values that are truly needed by the application.

Only return what you need

During query processing you may aggregate data and perform multiple MATCH clauses with the ultimate goal of returning something to the client. Always aim to return the minimum amount of data to the client.

This query will return all properties of all nodes and relationships retrieved for the query:

PROFILE
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
RETURN actor, movie

Here is the resulting execution plan:

returnAll

How much data is returned to the client?

If the client is accessing the database over a network, the elapsed time for the query will be longer.

If we view the table returned in Neo4j Browser, we see more data:

returnAll2

Notice that the query really only took 1 ms, and the total time was 16ms. From this we can infer that the streaming time was 15 ms. If the client were on a different system, this streaming time would be greater.

Reducing rows to limit data returned

In addition, strive to limit the number of rows returned as accessing fewer rows to obtain property values will be faster.

Here we limit the number of rows returned by only returning movies that have greater than $maxActors actors. This will be a subset of all movies.

PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a)
WITH  m, collect(a) as actors
WHERE size(actors) > $maxActors
RETURN  m.title, m.releaseYear, actors

Another reduction of data returned

Here is another example where we want to only return 10 values.

PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a)
WITH  m, collect(a.name) as actors LIMIT 10
RETURN  m.title, m.releaseYear, actors

Index-based property access

If an index was used for the query, then that property value for that node will be cached and you need not go to the database to retrieve it. Furthermore, if you use ORDER BY for a property that was retrieved with the index, it will be very fast because the index already provides the ordering.

Since Neo4j is schema-free, you must specify to the planner how retrieve the data.

Your queries specify how to compare the data used by the index, for example:

  • For strings: n.name STARTS WITH ""

  • For numbers: n.born > 0

  • For dates: n.releaseYear > date()

The type of the data used in the query specification will determine how the data is ordered in the results.

Example: index-based property access

With this query, the results are returned in ascending order:

PROFILE
MATCH (p:Person)
WHERE p.name STARTS WITH $firstName
RETURN p.name ORDER BY p.name
retrieveTomsAsPerson

It is optional whether you specify ORDER BY, but a best practice is to specify it.

Example: No index to use

Contrast this to a query where an index cannot be used.

With this query:

PROFILE
MATCH (p:Actor)
WHERE p.name STARTS WITH $firstName
RETURN p.name

No index is used for the query and we see that it takes longer to retrieve the nodes.

retrieveTomsAsActor

Sorting takes longer with no index

And if we want the results sorted, there is an additional step:

PROFILE
MATCH (p:Actor)
WHERE p.name STARTS WITH $firstName
RETURN p.name ORDER BY p.name
retrieveTomsAsActorSorted

Aggregation when no index

You learned earlier that eager operators can be expensive. You can perform operations such as min() and max() without needing to aggregate, provided there is a supporting index.

Here is an example where we do not have an index on the Movie.releaseYear property.

PROFILE
MATCH (m:Movie)
WHERE m.releaseYear < $year
RETURN min(m.releaseYear) AS Year
min1

In the execution plan, we see that we need to aggregate all Movie nodes to perform the min() operation.

Aggregation is automatic with index

If we had a query that used an index, the aggregation would be avoided because we can get the data we need to find the alphabetically earliest movie title because it is in the index without needing the collect all of the titles in the graph.

PROFILE
MATCH (m:Movie)
WHERE m.title STARTS WITH $titleString
RETURN min(m.title) AS Title
min2

In the execution plan, we see no aggregation to find the minimum value.

Tuning "top ten" types of queries

In a query where a property value is used to order and provide a top or bottom of the result set, strive to move the ORDER BY and LIMIT earlier in the query.

We want to return the 100 most recent movies and their actors, ordered by the release year:

PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(actor)
WITH m, collect(actor) AS actors
RETURN m.title as Title, m.releaseYear as Released, actors ORDER BY m.releaseYear DESC LIMIT 100
TopHundred

Improving the "top ten" query

Here is the revised query. We move ORDER BY and LIMIT to earlier in the query. Even though we are accessing the releaseYear property earlier, it helps us to limit the number of rows processed.

PROFILE
MATCH (m:Movie)
WITH m ORDER BY m.releaseYear DESC LIMIT 100
MATCH (m)<-[:ACTED_IN]-(actor)
WITH m, collect(actor) AS actors
RETURN m.title as Title, m.releaseYear as Released, actors
TopHundredRevised

Of course, the best optimization would be to add an index on the releaseYear property of the Movie nodes. It would not be a unique index, but it would definitely reduce the number of properties that need to be accessed for this type of query.

Accessing property values after expansion

If you are doing a query that requires the same value to be repeated for multiple rows in the result, it is sometimes better to project that value out when the node is distinct, then do the expansion.

Here is an example of this:

PROFILE
MATCH (m:Movie)
WITH m
MATCH (m)<-[:ACTED_IN]-(actor)
RETURN m.title AS Title, actor.name AS Actor

With this query, we know that we want to return rows containing the movie title and an actor name. There will be multiple rows that contain the same movie title.

Here is the execution plan:

ProjectAfterExpansion

You will notice, however in this execution plan that the title property is actually cached so that it need not be accessed in the RETURN clause. This is an improvement in the query planner for 4.x.

Project property values before expansion

If you were using an earlier release of Neo4j, a slightly better way to do this would be to project the movie title when you know that node is unique, then do the expansion to retrieve the Actor nodes:

PROFILE
MATCH (m:Movie)
WITH m, m.title AS Title
MATCH (m)<-[:ACTED_IN]-(actor)
RETURN Title, actor.name AS Actor

Here is the execution plan:

ProjectBeforeExpansion
As stated earlier in this course, with every release of Neo4j, you must measure your query performance. This is because the Cypher query planner could change and perhaps make your queries perform faster (or slower).

Aggregate by node property

Here is an example where we aggregate by a node property:

PROFILE
MATCH (p:Person)
WITH p
MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH p.name as Actor, collect(m) as Movies
RETURN Actor, Movies
AggregateByProperty

Better: Aggregate by node

Depending on the graph, it is better to aggregate by node, rather than by property. When you aggregate by node, you do not have to read any properties or node details, it simply get the id for the node.

PROFILE
MATCH (p:Person)
WITH p
MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH p, collect(m) as Movies
WITH p.name as Actor, Movies
RETURN Actor, Movies

Here is the execution plan which is slightly better for this graph:

AggregateByNode

Accessing node properties multiple times

Delaying the minimum number of properties accessed by reducing the number of nodes is a good thing. Even better is accessing a distinct node.

This is a variation of getting at the property value before expansion. Here is an example:

PROFILE
MATCH (m:Movie)-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year
RETURN  m.title, m.releaseYear

This query returns 30,146 rows.

NonDistinctPropertyAccess

Better: Delay access until node is distinct

Ideally, you want to access the node property once, so making it distinct helps:

PROFILE
MATCH (m:Movie)-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year
WITH DISTINCT m
RETURN  m.title, m.releaseYear

Although the elapsed time is greater, the number of rows returned is much smaller which means less work to get the property values.

DistinctPropertyAccess

Property access before aggregation

Another example is where aggregation is performed for some of the nodes and the property access is done before the aggregation:

PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year AND m.releaseYear > $year
WITH  m, collect(a) as actors
RETURN  m.title, m.releaseYear, actors
PropertyAccessBeforeAggregation

Better: Aggregate and then access properties

When we aggregate the actor nodes, the movie nodes are distinct so it is much more efficient to rewrite this query to:

PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year
WITH  m, collect(a) as actors
WHERE m.releaseYear > $year
RETURN  m.title, m.releaseYear, [a IN actors | a.name] as actorNames
PropertyAccessAfterAggregation
This query could benefit from indexes on the born and releaseYear properties.

Exercise 6: More Query Tuning

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-query-tuning-exercises

and follow the instructions for Exercise 6.

This exercise has 6 steps. Estimated time to complete: 15 minutes.

Check your understanding

Question 1

Given this query:

MATCH (m:Movie)
WHERE m.title = $movieTitle
RETURN m.title, m.releaseYear

When are the title and releaseYear properties accessed?

Select the correct answers.

  • If there are no indexes on Movie.title or Movie.releaseYear, the Movie.title property is accessed during the query and the Movie.releaseYear property is accessed in the RETURN clause.

  • If there is an index only on Movie.title, the Movie.releaseYear property is accessed in the RETURN clause.

  • If there is an index only on Movie.releaseYear, the Movie.title property is accessed during the query and the Movie.releaseYear property is accessed in the RETURN clause.

  • If there are indexes on both Movie.title and Movie.releaseYear, the properties are not accessed, but gotten only from the indexes.

Question 2

For a given query, what metrics can you observe related to time?

Select the correct answers.

  • total elapsed ms for the query

  • ms for the query execution, that does not include results returned

  • ms for the index lookup

  • ms for the streaming of the results to the client

Question 3

Suppose you have an index on the Movie.releaseYear property. With this code:

MATCH (m:Movie)
WHERE m.releaseYear > $year
RETURN m.title, m.releaseYear

Which statements are true about this code?

Select the correct answers.

  • The title property is accessed for the return.

  • The releaseYear property is accessed for the return.

  • The records returned are sorted by releaseYear.

  • The records returned are in random order.

Summary

You can now:

  • Describe when properties are accessed in a query.

  • Optimize queries by minimizing property access.