By Petra Selmer & Mark Needham, Engineers, Neo Technology | February 10, 2016
Editor’s Note: Last October at GraphConnect San Francisco, Petra Selmer and Mark Needham – Engineers at Neo Technology – delivered this presentation on how to develop effective queries with Cypher.
For more videos from GraphConnect SF and to register for GraphConnect Europe, check out graphconnect.com..
Petra Selmer: I’m going to cover a little about query planners, a brief overview of how Cypher actually executes queries, leading into query plans and some definitions and finally getting to the good stuff which is optimization tips, which hopefully you’ll find useful.
The inception of Cypher included a rule planner consisting of rules that used indexes to produce query execution plans. With the inception of Neo4j 2.2, we introduced a new cost-based planner that is based on database knowledge principles relational databases have been using for decades.
Using a statistics service, we store information such as the number of movie labels or certain relationship types, and using those statistics we assign costs to various execution plans, which allows us to pick the cheapest one. All read-only queries use this by default because it generally performs much better than rule-based planners.
However, if you need to use the rule-based planner, you can prepend a query with
CYPHER planner = ruleon a query-by-query basis and set
If you do find that the cost-based planner doesn’t work in one patch release, retest it with a new patch release. Additionally, if the speed of queries run with a rule are much faster than than those run with the cost-based planner, please let us know so that we can benchmark it. The Cypher team tries to benchmark as many queries (and types of queries) as we can with as many datasets as we can. We do this because we always want the cost-based planner to perform better than the rule.
Cypher Query Executions
Below is a basic bird’s-eye view of how a query is executed in Cypher:
After we get the input query, we tokenize it and then abstract the syntax text tree to do semantic checking, which is basic error checking. Next, we optimize and normalize the abstract syntax tree and perform functions such as moving all labels and types from the
MATCHclause into the
WHEREclause, or converting equality statements such as
name=Anninto an In.
From the abstract syntax tree we create a query graph, which is much easier to operate, and then create a series of logical plans. Using the data from our statistics store, we get the label and the index selectivity. To keep a query running quickly, you want to ensure high selectivity because it will return fewer rows. We also estimate our cardinality, which is the number of rows returned after each operation, and which we want to keep low for the same reason.
Imagine now we’ve got a plethora of logical plans that we’ve built over different costs. Next we employ an algorithm to select the cheapest one, which is then chosen and optimized. We then create an execution plan based on that logical plan, and query execution follows from that.
More resources on Cypher query executions can be found here and here.
In 2.2, we were using a greedy algorithm which guaranteed that although we wouldn’t choose the cheapest logical plan, it would not be the most expensive logical plan. In version 2.3 we’re using iterative dynamic programming (IDP), which is exhaustive and provides a cheaper logical plan and takes as much or less time than the greedy algorithm.
Interacting with a Query Plan
Mark Needham: How do you actually interact with that plan to make any changes that will make it faster?
Let’s imagine you’re in your Neo4j Browser with your query, and you’d like to pull up one of the plans explained by Petra. You can include two keywords that prefix your queries:
EXPLAINwon’t run the query, but will instead provide you with the predicted costs. If you want to run the query and see how it performs, you prefix it with
Below is the longest Neo4j query plan we’ve ever seen — approximately 230 lines of Cypher — posted in Stack Overflow:
On each line there’s an operator name, and if you were to expand it, it will show you how much work it was doing.
The goal of query profiling is to keep the output of the queries exactly the same and reduce the number of database hits. A database hit is an abstract unit of storage engine work and could correspond to looking up a node, a property or a relationship. If you have one query with ten database hits and another query that performs the same function but only produces five database hits, the one that returns five is going to be faster.
Query Execution Plan Operators
Now we’re going to examine queries, review the query plans, compare them to each other and apply some fixes we’ve accumulated from helping people speed up their queries on Stack Overflow.
The all nodes scan is the equivalent of “select star from star” – i.e. look at every single row in every single table – which is every node in the database. This will take a long time and come back with a production-sized dataset.
A label scan is the equivalent of “select a star from a table,” an example of which is “find all user nodes,” which reduces the cardinality of the number of nodes being queried. The best execution plan operators are “node index seek” and “node index scan.” With “node index seek” you have placed an index on some of your nodes that allow you to start a query at a specific node. The “node index scan” is similar and is used for range queries.
You can review execution plan operators more in depth here.
Using Node Labels
We’re going to have a look at some queries using a dataset from Cineasts, an extended version of the
:play moviesthat comes in the Neo4j Browser and has more movies than IMDB.
Below is our first example, in which we have a
RETURNclause that will give us the movie The Matrix:
The following, on the right-hand side of the slide, is what the database returns:
AllNodesScanlooks up all 65,000 nodes in our database, applies a
filterand returns one row. With the combined database hits from both functions, this search has performed 130,000 pieces of work, which amounts to an expensive search. This leads us to our first tip: If your node is labeled, include it in your query.
We address this by adding the syntax
:movie, which tells our search to only examine nodes with the
Rather than performing an
AllNodeScan, this search performs a
NodeByLabelScanfollowed by a
Filter, which gives us 25,000 database hits instead of 120,000.
Below is how the two queries compare:
In both of these queries, we want to reduce the amount of red, which corresponds to the number of database hits. But we can do even better.
Using Indexes and Constraints
Petra: To tighten up your searches, you can use both indexes and constraints:
With the first
create index on, we are setting an index on the title of a
:movie; with the second, we are setting an index on the name of a
:person, both of which allow us to create unique indexes. In graph databases, indexes are only used to find the starting point for queries, while in relational databases indexes are used to return a set of rows that are then used for JOINS.
If we apply
NodeIndexSeekto The Matrix example, we only get two database hits instead of 120,000:
Mark: In the next example, we want to find movies in which two actors — specifically Tom Hanks and Meg Ryan — have both appeared:
For both Tom Hanks and Meg Ryan, we locate each person and then expand out the
:acts_inrelationship and then filter to make sure both people are different, which gives us around 500 database hits.
However, if you apply an index to both sides of this query — to both Tom Hanks and Meg Ryan — you’ll end up with a much better result. The syntax is
:using indexfollowed by the label and property combination on which you want to apply the indexes:
Again, an index is used to find the starting point. We have now directed our query to zoom in on Tom Hanks and Meg Ryan and find the connections between them. This gives the query plan a very different shape:
NodeIndexSeekon both the Tom Hanks and Meg Ryan sides, JOIN the two datasets, and then apply
Filterto make sure we don’t have any duplicates. This returns 60 database hits, compared to the previous 500.
Below is a side-by-side of the two query plans:
In order to be able to use the most effective query, you have to combine your domain knowledge with the planner’s knowledge of the query language.
MATCH Clauses to Reduce Cardinality
In the next query, we want to find Tom Hanks’ colleagues’ movies:
To answer the question “For actors who have acted in movies with Tom Hanks, what movies did they appear in without Tom Hanks?” we start by finding Tom Hanks and the movies he’s acted in; the movies his co-actors have acted in; the movies they acted in together; and then the other movies the co-actor has appeared in.
We start with
node index seek, expand out the
:acts_in, apply a
filterand end up with 2,000 rows. If we examine the structure of the dataset, we can see that we’re doing extra work because we aren’t eliminating the multiple times a single actor might come up if they’ve worked with Tom Hanks on multiple movies.
To address this, we make the following addition with the blue text below:
We’ve split our query into two
MATCHstatements and run it
WITH DISTINCTover the co-actors, which stops the database from performing multiple searches involving the same actor. This reduces the work in progress, and has brought us from around 10,000 database hits to 8,000.
Again, here are the two queries compared side by side:
Using Size with Relationships
Petra: In our next example, we want to find the number of movies each actor has appeared in. Our first attempt might look something like this, which returns 94,700 rows:
To return a more manageable dataset, you can include
The number of rows remains constant because we have a
get degreeoperator. For each actor that’s found, it pulls the number of relationships of
:ACTS INand parses that through, which keeps us at 45,000 rows.
We strive for our cost-based planner to pick the cheapest logical model and hints to come up with the optimal execution plan, which is sometimes very challenging. And while it is our goal to reduce the number of instances in which you have to apply hints, sometimes they are necessary.
If you notice a query running slowly even if you’ve already applied an index, you may have to force the index:
Another hint you can use is something called
In this particular case, you may already know that you have 65,000 total movie nodes but maybe only 12,000 nodes with
comedylabels. Because you know that the comedy label is quite small – i.e., it’s got a very low cardinality – for badly performing queries, you might need to force a label scan on comedy.
Pictured below is how you would do that with the
USING SCANkeyword in Cypher:
Additional Query Hints
Mark: To perform the best possible query, use parameters whenever possible, especially when you’re building queries in your application. Rather than simply include labels and properties in your Cypher query, define the parameter name in the curly brackets and then feed that in when you call the API from your application.
Cypher will try to parameterize the top query and try to detect that same pattern in any subsequent queries. But by informing Cypher that a query is parameterized, you keep it from having to make that assumption. In the end, parameterizing your Cypher queries is always going to be more effective than Cypher trying to work them out for you.
Avoid Cartesian Products
Petra: It can be an unpleasant surprise if you you accidentally write a query that returns a Cartesian product. It’s important to ensure that you only use disconnected patterns very intentionally — if you really need to pull a Cartesian product.
In the first example below,
count(m)will return the total number of rows, i.e. the Cartesian product of
m, which was not the intention. You can instead use the second example, which performs much faster:
In the latest version of Neo4j we included browser warnings — such as the Cartesian product warning — which are important to monitor.
Only Return What You Need
Mark: In an ideal query, the shape of the number of rows is consistent all the way down and has a small number of database hits in each step. To optimize your queries, you can examine where the most hits are happening in your query and try to systematically reduce them in each step to prevent a bottleneck.
We’ve seen people create nodes with hundreds of properties. The first query below will return every single property of the actor node, whereas the second query will only return the three properties specified:
The final tip is try and keep the query short. It’s often better to do a number of small queries rather than one large query. It’s also important to keep read and write queries separate; if not, you can only use the
Inspired by Mark and Petra’s talk? Register for GraphConnect Europe on April 26, 2016 at for more talks and workshops on Cypher and the evolving world of graph database technology.
About the Author
Petra Selmer & Mark Needham, Engineers, Neo Technology
Petra Selmer is a member of the Neo Technology engineering team. For her doctoral program, she is working on a language that allows for slight imprecisions, so users can explore complex domains without having to get everything exactly right in their queries. For many years, she worked as a consultant and developer in a variety of different settings and roles. Petra is South African and is working in Neo’s London office.
Mark Needham is a Support Engineer for Neo Technology.
From the CEO
Have a Graph Question?
Reach out and connect with the Neo4j staff.Stackoverflow
Share your Graph Story?
Email us: email@example.com