Optimizing Property Access
About this module
You have learned some of the ways that you can tune queries by reducing the cardinality of the rows that are passed between the steps of an execution plan. In this module, you will learn how to rewrite queries so that property access is optimized.
At the end of this module, you will be able to:
-
Describe when properties are accessed in a query.
-
Optimize queries by minimizing property access.
When are properties accessed?
During a query, a node or relationship property is accessed when it is used to filter a query and to return the properties as. Here are some examples.
In the code below, the title property is accessed for the query and the releaseYear property is accessed when returning the query results:
MATCH (m:Movie {title: $movieTitle})
RETURN m.title, m.releaseYear
This query is the same:
MATCH (m:Movie)
WHERE m.title = $movieTitle
RETURN m.title, m.releaseYear
The title property need not be accessed in the RETURN clause because it is cached from the query.
If the title property is indexed, then the node property need not be accessed from storage as the value is in the index.
|
Example: Property access
In the code below, the primaryRole and secondaryRole properties are accessed for the query and the name, title, and roles properties are accessed when returning the query results. You cannot define a "regular" index on a relationship property so for this query, every ACTED_IN relationship’s roles property values need to be accessed. This can be very expensive which is why full-text schema indexes are useful.
MATCH (actor:Person)-[r:ACTED_IN]->(movie:Movie)
WHERE r.primaryRole CONTAINS $characterString OR r.secondaryRole CONTAINS $characterString
RETURN actor.name, movie.title, r.roles
Here is how the previous query could use the full-text schema index that you learned how to create in the previous lesson on reducing cardinality.
In order to perform the query, only the index is used, and then because so few rows are found, the access to the name, title, and roles for a small set of rows is not that expensive.
CALL db.index.fulltext.queryRelationships(
'ActedInRoleIndex', $testString) YIELD relationship
WITH relationship AS rel
RETURN startNode(rel).name, endNode(rel).title, rel.roles
Your goal for indexing and writing queries will be to minimize the number of properties that need to be accessed during the query. In addition, you will aim to only return property values that are truly needed by the application.
Only return what you need
During query processing you may aggregate data and perform multiple MATCH
clauses with the ultimate goal of returning something to the client.
Always aim to return the minimum amount of data to the client.
This query will return all properties of all nodes and relationships retrieved for the query:
PROFILE
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
RETURN actor, movie
Here is the resulting execution plan:
How much data is returned to the client?
If the client is accessing the database over a network, the elapsed time for the query will be longer.
If we view the table returned in Neo4j Browser, we see more data:
Notice that the query really only took 1 ms, and the total time was 16ms. From this we can infer that the streaming time was 15 ms. If the client were on a different system, this streaming time would be greater.
Returning less data to the client
Contrast the previous query with this one:
PROFILE
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
RETURN actor.name, movie.title
The number of db hits is smaller and the elapsed time is smaller. You can imagine that over a network, you would see greater differences in performance.
And in the table view, we see that the query took the same amount of time and the streaming of the results was faster because there is less data to stream.
Reducing rows to limit data returned
In addition, strive to limit the number of rows returned as accessing fewer rows to obtain property values will be faster.
Here we limit the number of rows returned by only returning movies that have greater than $maxActors actors. This will be a subset of all movies.
PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a)
WITH m, collect(a) as actors
WHERE size(actors) > $maxActors
RETURN m.title, m.releaseYear, actors
Index-based property access
If an index was used for the query, then that property value for that node will be cached and you need not go to the database to retrieve it.
Furthermore, if you use ORDER BY
for a property that was retrieved with the index, it will be very fast because the index already provides the ordering.
Since Neo4j is schema-free, you must specify to the planner how retrieve the data.
Your queries specify how to compare the data used by the index, for example:
-
For strings: n.name STARTS WITH ""
-
For numbers: n.born > 0
-
For dates: n.releaseYear > date()
The type of the data used in the query specification will determine how the data is ordered in the results.
Example: index-based property access
With this query, the results are returned in ascending order:
PROFILE
MATCH (p:Person)
WHERE p.name STARTS WITH $firstName
RETURN p.name ORDER BY p.name
It is optional whether you specify ORDER BY
, but a best practice is to specify it.
Example: No index to use
Contrast this to a query where an index cannot be used.
With this query:
PROFILE
MATCH (p:Actor)
WHERE p.name STARTS WITH $firstName
RETURN p.name
No index is used for the query and we see that it takes longer to retrieve the nodes.
Sorting takes longer with no index
And if we want the results sorted, there is an additional step:
PROFILE
MATCH (p:Actor)
WHERE p.name STARTS WITH $firstName
RETURN p.name ORDER BY p.name
Aggregation when no index
You learned earlier that eager operators can be expensive.
You can perform operations such as min()
and max()
without needing to aggregate, provided there is a supporting index.
Here is an example where we do not have an index on the Movie.releaseYear property.
PROFILE
MATCH (m:Movie)
WHERE m.releaseYear < $year
RETURN min(m.releaseYear) AS Year
In the execution plan, we see that we need to aggregate all Movie nodes to perform the min()
operation.
Aggregation is automatic with index
If we had a query that used an index, the aggregation would be avoided because we can get the data we need to find the alphabetically earliest movie title because it is in the index without needing the collect all of the titles in the graph.
PROFILE
MATCH (m:Movie)
WHERE m.title STARTS WITH $titleString
RETURN min(m.title) AS Title
In the execution plan, we see no aggregation to find the minimum value.
Tuning "top ten" types of queries
In a query where a property value is used to order and provide a top or bottom of the result set, strive to move the ORDER BY
and LIMIT
earlier in the query.
We want to return the 100 most recent movies and their actors, ordered by the release year:
PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(actor)
WITH m, collect(actor) AS actors
RETURN m.title as Title, m.releaseYear as Released, actors ORDER BY m.releaseYear DESC LIMIT 100
Improving the "top ten" query
Here is the revised query. We move ORDER BY
and LIMIT
to earlier in the query.
Even though we are accessing the releaseYear property earlier, it helps us to limit the number of rows processed.
PROFILE
MATCH (m:Movie)
WITH m ORDER BY m.releaseYear DESC LIMIT 100
MATCH (m)<-[:ACTED_IN]-(actor)
WITH m, collect(actor) AS actors
RETURN m.title as Title, m.releaseYear as Released, actors
Of course, the best optimization would be to add an index on the releaseYear property of the Movie nodes. It would not be a unique index, but it would definitely reduce the number of properties that need to be accessed for this type of query.
Accessing property values after expansion
If you are doing a query that requires the same value to be repeated for multiple rows in the result, it is sometimes better to project that value out when the node is distinct, then do the expansion.
Here is an example of this:
PROFILE
MATCH (m:Movie)
WITH m
MATCH (m)<-[:ACTED_IN]-(actor)
RETURN m.title AS Title, actor.name AS Actor
With this query, we know that we want to return rows containing the movie title and an actor name. There will be multiple rows that contain the same movie title.
Here is the execution plan:
You will notice, however in this execution plan that the title property is actually cached so that it need not be accessed in the RETURN
clause.
This is an improvement in the query planner for 4.x.
Project property values before expansion
If you were using an earlier release of Neo4j, a slightly better way to do this would be to project the movie title when you know that node is unique, then do the expansion to retrieve the Actor nodes:
PROFILE
MATCH (m:Movie)
WITH m, m.title AS Title
MATCH (m)<-[:ACTED_IN]-(actor)
RETURN Title, actor.name AS Actor
Here is the execution plan:
As stated earlier in this course, with every release of Neo4j, you must measure your query performance. This is because the Cypher query planner could change and perhaps make your queries perform faster (or slower). |
Aggregate by node property
Here is an example where we aggregate by a node property:
PROFILE
MATCH (p:Person)
WITH p
MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH p.name as Actor, collect(m) as Movies
RETURN Actor, Movies
Better: Aggregate by node
Depending on the graph, it is better to aggregate by node, rather than by property. When you aggregate by node, you do not have to read any properties or node details, it simply get the id for the node.
PROFILE
MATCH (p:Person)
WITH p
MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH p, collect(m) as Movies
WITH p.name as Actor, Movies
RETURN Actor, Movies
Here is the execution plan which is slightly better for this graph:
Accessing node properties multiple times
Delaying the minimum number of properties accessed by reducing the number of nodes is a good thing. Even better is accessing a distinct node.
This is a variation of getting at the property value before expansion. Here is an example:
PROFILE
MATCH (m:Movie)-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year
RETURN m.title, m.releaseYear
This query returns 30,146 rows.
Better: Delay access until node is distinct
Ideally, you want to access the node property once, so making it distinct helps:
PROFILE
MATCH (m:Movie)-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year
WITH DISTINCT m
RETURN m.title, m.releaseYear
Although the elapsed time is greater, the number of rows returned is much smaller which means less work to get the property values.
Property access before aggregation
Another example is where aggregation is performed for some of the nodes and the property access is done before the aggregation:
PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year AND m.releaseYear > $year
WITH m, collect(a) as actors
RETURN m.title, m.releaseYear, actors
Better: Aggregate and then access properties
When we aggregate the actor nodes, the movie nodes are distinct so it is much more efficient to rewrite this query to:
PROFILE
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WHERE a.born > $year
WITH m, collect(a) as actors
WHERE m.releaseYear > $year
RETURN m.title, m.releaseYear, [a IN actors | a.name] as actorNames
This query could benefit from indexes on the born and releaseYear properties. |
Exercise 6: More Query Tuning
In the query edit pane of Neo4j Browser, execute the browser command:
:play 4.0-query-tuning-exercises
and follow the instructions for Exercise 6.
This exercise has 6 steps. Estimated time to complete: 15 minutes. |
Check your understanding
Question 1
Given this query:
MATCH (m:Movie)
WHERE m.title = $movieTitle
RETURN m.title, m.releaseYear
When are the title and releaseYear properties accessed?
Select the correct answers.
-
If there are no indexes on Movie.title or Movie.releaseYear, the Movie.title property is accessed during the query and the Movie.releaseYear property is accessed in the
RETURN
clause. -
If there is an index only on Movie.title, the Movie.releaseYear property is accessed in the
RETURN
clause. -
If there is an index only on Movie.releaseYear, the Movie.title property is accessed during the query and the Movie.releaseYear property is accessed in the
RETURN
clause. -
If there are indexes on both Movie.title and Movie.releaseYear, the properties are not accessed, but gotten only from the indexes.
Question 2
For a given query, what metrics can you observe related to time?
Select the correct answers.
-
total elapsed ms for the query
-
ms for the query execution, that does not include results returned
-
ms for the index lookup
-
ms for the streaming of the results to the client
Question 3
Suppose you have an index on the Movie.releaseYear property. With this code:
MATCH (m:Movie)
WHERE m.releaseYear > $year
RETURN m.title, m.releaseYear
Which statements are true about this code?
Select the correct answers.
-
The title property is accessed for the return.
-
The releaseYear property is accessed for the return.
-
The records returned are sorted by releaseYear.
-
The records returned are in random order.
Summary
You can now:
-
Describe when properties are accessed in a query.
-
Optimize queries by minimizing property access.
Need help? Ask in the Neo4j Community