GraphGists

Basic Query Tuning Example

We’ll start with a basic example to help you get the hang of profiling queries. The following examples will use a movies data set.

Let’s start by importing the data:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j/neo4j/2.3/manual/cypher/cypher-docs/src/docs/graphgists/query-tuning/movies.csv" AS line
MERGE (m:Movie {title:line.title})
ON CREATE SET m.released = toInteger(line.released), m.tagline = line.tagline
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/neo4j/neo4j/2.3/manual/cypher/cypher-docs/src/docs/graphgists/query-tuning/actors.csv' AS line

MATCH (m:Movie {title:line.title})
MERGE (p:Person {name:line.name})
ON CREATE SET p.born = toInteger(line.born)

MERGE (p)-[:ACTED_IN {roles:split(line.roles,";")}]->(m)
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/neo4j/neo4j/2.3/manual/cypher/cypher-docs/src/docs/graphgists/query-tuning/directors.csv' AS line
MATCH (m:Movie {title:line.title})
MERGE (p:Person {name:line.name})
ON CREATE SET p.born = toInteger(line.born)

MERGE (p)-[:DIRECTED]->(m)

Find Tom Hanks: Naive

Let’s say we want to write a query to find Tom Hanks. The naive way of doing this would be to write the following:

MATCH (p {name:"Tom Hanks"})
RETURN p

This query will find the Tom Hanks node but as the number of nodes in the database increase it will become slower and slower.

Profile

We can profile the query to find out why this query is so slow.

PROFILE MATCH (p {name:"Tom Hanks"})
RETURN p

The first thing to keep in mind when reading execution plans is that you need to read from the bottom up.

In that vein, starting from the last row, the first thing we notice is that the value in the Rows column seems high given there is only one node with the name property Tom Hanks in the database. If we look across to the Operator column we’ll see that AllNodesScan has been used which means that the query planner scanned through all the nodes in the database.

Moving up to the previous row we see the Filter operator which will check the name property on each of the nodes passed through by AllNodesScan.

This seems like an inefficient way of finding Tom Hanks given that we are looking at many nodes that aren’t even people and therefore aren’t what we’re looking for.

Find Tom Hanks: Second Try

Whenever we’re looking for a node we should specify a label to help the query planner narrow down the search space. For this query we’d need to add a Person label.

MATCH (p:Person {name:"Tom Hanks"})
RETURN p

This query will be faster than the first one but as the number of people in our database increase we again notice that the query slows down.

Profile

Again we can profile the query to work out why our query is still a bit slow:

PROFILE MATCH (p:Person {name:"Tom Hanks"})
RETURN p

This time the Rows value on the last row has reduced so we’re not scanning some nodes that we were before which is a good start. The NodeByLabelScan operator indicates that we achieved this by first doing a linear scan of all the Person nodes in the database.

Once we’ve done that we again scan through all those nodes using the Filter operator, comparing the name property of each one.

This might be acceptable in some cases but if we’re going to be looking up people by name frequently then we’ll see better performance if we create an index on the name property for the Person label:

CREATE INDEX ON :Person(name)

Find Tom Hanks: With Index on Person

Now if we run the query again it will run more quickly:

MATCH (p:Person {name:"Tom Hanks"})
RETURN p

Provile

Let’s profile the query to see why this version is so much faster:

PROFILE MATCH (p:Person {name:"Tom Hanks"})
RETURN p

Our execution plan is down to a single row and uses the Node Index Seek operator which does a schema index seek to find the appropriate node.