Advanced query tuning example
Introduction
One of the most important and useful ways of optimizing Cypher® queries involves creating appropriate indexes.
This is described in more detail in Indexes for search performance, and demonstrated in Basic query tuning example.
In summary, an index will be based on the combination of a Label
and a property
.
Any Cypher query that searches for nodes with a specific label and some predicate on the property (equality, range or existence) will be planned to use
the index if the cost planner deems that to be the most efficient solution.
In order to benefit from enhancements provided by native indexes, it is useful to understand when index-backed property lookup and index-backed order by will come into play. In Neo4j 3.4 and earlier, the fact that the index contains the property value, and the results are returned in a specific order, was not used improve the performance of any later part of the query that might depend on the property value or result order.
Let’s explain how to use these features with a more advanced query tuning example.
If you are upgrading an existing store to 4.0.12, it may be necessary to drop and re-create existing indexes. For information on native index support and upgrade considerations regarding indexes, see Operations Manual → Indexes. |
The data set
In this example we will demonstrate the impact native indexes can have on query performance under certain conditions. We’ll use a movies dataset to illustrate this more advanced query tuning.
LOAD CSV WITH HEADERS FROM 'file:///query-tuning/movies.csv' AS line
MERGE (m:Movie { title: line.title })
ON CREATE SET m.released = toInteger(line.released), m.tagline = line.tagline
LOAD CSV WITH HEADERS FROM 'file:///query-tuning/actors.csv' AS line
MATCH (m:Movie { title: line.title })
MERGE (p:Person { name: line.name })
ON CREATE SET p.born = toInteger(line.born)
MERGE (p)-[:ACTED_IN { roles:split(line.roles, ';')}]->(m)
LOAD CSV WITH HEADERS FROM 'file:///query-tuning/directors.csv' AS line
MATCH (m:Movie { title: line.title })
MERGE (p:Person { name: line.name })
ON CREATE SET p.born = toInteger(line.born)
MERGE (p)-[:DIRECTED]->(m)
CREATE INDEX FOR (p:Person)
ON (p.name)
CALL db.awaitIndexes
CALL db.indexes
+---------------------------------------------------------------------------------------------------------------------------------------------+
| id | name | state | populationPercent | uniqueness | type | entityType | labelsOrTypes | properties | provider |
+---------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | "index_5c0607ad" | "ONLINE" | 100.0 | "NONUNIQUE" | "BTREE" | "NODE" | ["Person"] | ["name"] | "native-btree-1.0" |
+---------------------------------------------------------------------------------------------------------------------------------------------+
1 row
Index-backed property-lookup
Let’s say we want to write a query to find persons with the name 'Tom' that acted in a movie.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name STARTS WITH 'Tom'
RETURN p.name, count(m)
+---------------------------+
| p.name | count(m) |
+---------------------------+
| "Tom Cruise" | 3 |
| "Tom Hanks" | 12 |
| "Tom Skerritt" | 1 |
+---------------------------+
3 rows
We have asked the database to return all the actors with the first name 'Tom'.
There are three of them: 'Tom Cruise', 'Tom Skerritt' and 'Tom Hanks'.
In previous versions of Neo4j, the final clause RETURN p.name
would cause the database to take the node p
and look up its properties and return the value of the property name
.
With native indexes, however, we can leverage the fact that indexes store the property values.
In this case, it means that the names can be looked up directly from the index.
This allows Cypher to avoid the second call to the database to find the property, which can save time on very large queries.
If we profile the above query, we see that the NodeIndexScan
in the Variables
column contains cache[p.name]
,
which means that p.name
is retrieved from the index.
We can also see that the Projection
has no DB Hits
, which means it does not have to access the database again.
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name STARTS WITH 'Tom'
RETURN p.name, count(m)
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Order | Variables | Other |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +ProduceResults | 1 | 3 | 0 | 0 | 0 | 0.0000 | p.name ASC | count(m), p.name | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +OrderedAggregation | 1 | 3 | 0 | 0 | 0 | 0.0000 | p.name ASC | count(m), p.name | p.name |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +Filter | 1 | 16 | 16 | 0 | 0 | 0.0000 | p.name ASC | anon[17], m, p | m:Movie |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +Expand(All) | 1 | 16 | 20 | 0 | 0 | 0.0000 | p.name ASC | anon[17], m -- p | (p)-[:ACTED_IN]->(m) |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +NodeIndexSeekByRange | 1 | 4 | 5 | 0 | 0 | 0.0000 | p.name ASC | p | :Person(name STARTS WITH $` AUTOSTRING0`), cache[p.name] |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
Total database accesses: 41
If we change the query, such that it can no longer use an index, we will see that there will be no cache[p.name]
in the Variables, and that the
Projection
now has DB Hits
, since it accesses the database again to retrieve the name.
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name, count(m)
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Variables | Other |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
| +ProduceResults | 13 | 102 | 0 | 0 | 0 | 0.0000 | count(m), p.name | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
| +EagerAggregation | 13 | 102 | 172 | 0 | 0 | 0.0000 | count(m), p.name | p.name |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
| +Filter | 172 | 172 | 172 | 0 | 0 | 0.0000 | anon[17], m, p | p:Person |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
| +Expand(All) | 172 | 172 | 210 | 0 | 0 | 0.0000 | anon[17], p -- m | (m)<-[:ACTED_IN]-(p) |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
| +NodeByLabelScan | 38 | 38 | 39 | 0 | 0 | 0.0000 | m | :Movie |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------------+----------------------+
Total database accesses: 593
It is important to note that not all property types are supported, because not all are supported by native indexes.
Additionally, some property types such as the spatial type Point
, are indexed in an index that is designed to be approximate and cannot return the values.
For non-native indexes and the spatial type Point
, there will still be a second database access to retrieve those values.
In indexes with mixed values, only those values that cannot be looked up from the index will trigger another database access action.
Predicates that can be used to enable this optimization are:
-
Existence (
WHERE exists(n.name)
) -
Equality (e.g.
WHERE n.name = 'Tom Hanks'
) -
Range (eg.
WHERE n.uid > 1000 AND n.uid < 2000
) -
Prefix (eg.
WHERE n.name STARTS WITH 'Tom'
) -
Suffix (eg.
WHERE n.name ENDS WITH 'Hanks'
) -
Substring (eg.
WHERE n.name CONTAINS 'a'
) -
Several predicates of the above types combined using
OR
, given that all of them are on the same property (eg.WHERE n.prop < 10 OR n.prop = 'infinity'
)
If there is an existence constraint on the property, no predicate is required to trigger the optimization. For example, CREATE CONSTRAINT constraint_name ON (p:Person) ASSERT exists(p.name)
|
Aggregating functions
For all built-in aggregating functions in Cypher, the index-backed property-lookup optimization can be used even without a predicate. Consider this query which returns the number of distinct names of people in the movies dataset:
PROFILE
MATCH (p:Person)
RETURN count(DISTINCT p.name) AS numberOfNames
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+---------------+------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Variables | Other |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+---------------+------------------------------+
| +ProduceResults | 1 | 1 | 0 | 0 | 0 | 0.0000 | numberOfNames | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+---------------+------------------------------+
| +EagerAggregation | 1 | 1 | 0 | 0 | 0 | 0.0000 | numberOfNames | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+---------------+------------------------------+
| +NodeIndexScan | 125 | 125 | 126 | 0 | 0 | 0.0000 | p | :Person(name), cache[p.name] |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+---------------+------------------------------+
Total database accesses: 126
Note that the NodeIndexScan
in the Variables
column contains cache[p.name]
and that the EagerAggregation
has no DB Hits
.
In this case, the semantics of aggregating functions works like an implicit existence constraint because Person
nodes without the property name
will not affect the result of an aggregation.
Index-backed order by
Now consider the following refinement to the query:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name STARTS WITH 'Tom'
RETURN p.name, count(m)
ORDER BY p.name
+---------------------------+
| p.name | count(m) |
+---------------------------+
| "Tom Cruise" | 3 |
| "Tom Hanks" | 12 |
| "Tom Skerritt" | 1 |
+---------------------------+
3 rows
We are asking for the results in ascending alphabetical order.
The native index happens to store String properties in ascending alphabetical order, and Cypher knows this.
In Neo4j 3.4 and earlier, Cypher would plan a Sort
operation to sort the results, which means building a collection in memory and running a sort algorithm on it.
For large result sets this can be expensive in terms of both memory and time.
If you are using the native index, Cypher will recognise that the index already returns data in the correct order, and skip the Sort
operation.
Indexes storing values of the spatial type Point
, and non-native indexes, cannot be relied on to return the values in the correct order.
This means that for Cypher to enable this optimization, the query needs a predicate that limits the type of the property to some type that is guaranteed to be in the right order.
To demonstrate this effect, let’s remove the String prefix predicate so that Cypher no longer knows the type of the property, and replace it with an existence predicate.
Now the database can no longer guarantee the order.
If we profile the query we will see the Sort
operation:
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
USING INDEX p:Person(name)
WHERE EXISTS (p.name)
RETURN p.name, count(m)
ORDER BY p.name
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Order | Variables | Other |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| +ProduceResults | 13 | 102 | 0 | 0 | 0 | 0.0000 | p.name ASC | count(m), p.name | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| +Sort | 13 | 102 | 0 | 0 | 0 | 0.0000 | p.name ASC | count(m), p.name | p.name |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| +EagerAggregation | 13 | 102 | 0 | 0 | 0 | 0.0000 | | count(m), p.name | p.name |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| +Filter | 172 | 172 | 172 | 0 | 0 | 0.0000 | | anon[17], m, p | m:Movie |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| +Expand(All) | 172 | 172 | 297 | 0 | 0 | 0.0000 | | anon[17], m -- p | (p)-[:ACTED_IN]->(m) |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
| +NodeIndexScan | 125 | 125 | 126 | 0 | 0 | 0.0000 | | p | :Person(name), cache[p.name] |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+------------------------------+
Total database accesses: 595
The Order
column describes the order of rows after each operator.
We see that the order is undefined until the Sort
operator.
Now if we add back the predicate that gives us the property type information,
we will see the Sort
operation is no longer there:
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name STARTS WITH 'Tom'
RETURN p.name, count(m)
ORDER BY p.name
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Order | Variables | Other |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +ProduceResults | 1 | 3 | 0 | 0 | 0 | 0.0000 | p.name ASC | count(m), p.name | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +OrderedAggregation | 1 | 3 | 0 | 0 | 0 | 0.0000 | p.name ASC | count(m), p.name | p.name |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +Filter | 1 | 16 | 16 | 0 | 0 | 0.0000 | p.name ASC | anon[17], m, p | m:Movie |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +Expand(All) | 1 | 16 | 20 | 0 | 0 | 0.0000 | p.name ASC | anon[17], m -- p | (p)-[:ACTED_IN]->(m) |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
| +NodeIndexSeekByRange | 1 | 4 | 5 | 0 | 0 | 0.0000 | p.name ASC | p | :Person(name STARTS WITH $` AUTOSTRING0`), cache[p.name] |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------+-----------------------------------------------------------+
Total database accesses: 41
We also see that the Order
column contains p.name ASC
from the index seek operation, meaning that the rows are ordered by p.name
in ascending order.
Index-backed order by can also be used for queries that expect their results is descending order, but with slightly lower performance.
In cases where the Cypher planner is unable to remove the Sort operator, like in the first example, the planner can utilize knowledge of required order after each operator to plan the Sort operator at a point in the plan with optimal cardinality.
|
min()
and max()
For the min
and max
functions, the index-backed order by optimization can be used to avoid aggregation and instead utilize the fact that the minimum/maximum value is the first/last one in a sorted index.
Consider the following query which returns the fist actor in alphabetical order:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name STARTS WITH ''
RETURN min(p.name) AS name
+----------------+
| name |
+----------------+
| "Aaron Sorkin" |
+----------------+
1 row
To demonstrate the effect of index-backed order by, let’s remove the String prefix predicate so that Cypher no longer knows the type of the property, and replace it with an existence predicate.
Now the database can no longer guarantee the order. If we profile the query we will see the EagerAggregation
operation:
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
USING INDEX p:Person(name)
WHERE EXISTS (p.name)
RETURN min(p.name) AS name
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Variables | Other |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
| +ProduceResults | 1 | 1 | 0 | 0 | 0 | 0.0000 | name | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
| +EagerAggregation | 1 | 1 | 0 | 0 | 0 | 0.0000 | name | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
| +Filter | 172 | 172 | 172 | 0 | 0 | 0.0000 | anon[17], m, p | m:Movie |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
| +Expand(All) | 172 | 172 | 297 | 0 | 0 | 0.0000 | anon[17], m -- p | (p)-[:ACTED_IN]->(m) |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
| +NodeIndexScan | 125 | 125 | 126 | 0 | 0 | 0.0000 | p | :Person(name), cache[p.name] |
+-------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------------+------------------------------+
Total database accesses: 595
Now if we add back the predicate that gives us the property type information, we will see that the EagerAggregation
operation gets replaced by Projection
followed by Limit
followed by Optional
:
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name STARTS WITH ''
RETURN min(p.name) AS name
Compiler CYPHER 4.0
Planner COST
Runtime INTERPRETED
Runtime version 4.0
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Order | Variables | Other |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +ProduceResults | 1 | 1 | 0 | 0 | 0 | 0.0000 | name ASC | anon[17], m, name, p | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +Optional | 1 | 1 | 0 | 0 | 0 | 0.0000 | name ASC | anon[17], m, name, p | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +Limit | 1 | 1 | 0 | 0 | 0 | 0.0000 | name ASC | anon[17], m, name, p | 1 |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +Projection | 1 | 1 | 0 | 0 | 0 | 0.0000 | name ASC | name -- anon[17], m, p | {name : cache[p.name]} |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +Filter | 1 | 1 | 1 | 0 | 0 | 0.0000 | p.name ASC | anon[17], m, p | m:Movie |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +Expand(All) | 1 | 1 | 2 | 0 | 0 | 0.0000 | p.name ASC | anon[17], m -- p | (p)-[:ACTED_IN]->(m) |
| | +----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
| +NodeIndexSeekByRange | 1 | 1 | 2 | 0 | 0 | 0.0000 | p.name ASC | p | :Person(name STARTS WITH $` AUTOSTRING0`), cache[p.name] |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+------------+------------------------+-----------------------------------------------------------+
Total database accesses: 5
In the first case, all nodes in the index are scanned to find the name that is first in alphabetic order.
In the second case, we will simply pick the first value from the index.
This is reflected in the fact that the total database access
is lower, indicating a faster query.
For large datasets, this can improve performance dramatically.
Index-backed order by can also be used for corresponding queries with the max
function, but with slightly lower performance.
Restrictions
The optimization can only work on native indexes, and only if we query for a specific type, in order to rule out the spatial type Point
.
Predicates that can be used to enable this optimization are:
-
Equality (e.g.
WHERE n.name = 'Tom Hanks'
) -
Range (eg.
WHERE n.uid > 1000 AND n.uid < 2000
) -
Prefix (eg.
WHERE n.name STARTS WITH 'Tom'
) -
Suffix (eg.
WHERE n.name ENDS WITH 'Hanks'
) -
Substring (eg.
WHERE n.name CONTAINS 'a'
)
Predicates that will not work:
-
Several predicates combined using
OR
because the property type might differ between the predicates -
Existence (eg.
WHERE exists(n.email)
) because no property type information is given
If the predicate uses a parameter, such as An existence constraint does not include any type information either and will thus not be enough to trigger index-backed order by. |