Tuning Your Cypher: Tips & Tricks for More Effective Queries


Editor’s Note: Last October at GraphConnect San Francisco, Petra Selmer and Mark Needham – Engineers at Neo Technology – delivered this presentation on how to develop effective queries with Cypher.

For more videos from GraphConnect SF and to register for GraphConnect Europe, check out graphconnect.com.
.



Petra Selmer: I’m going to cover a little about query planners, a brief overview of how Cypher actually executes queries, leading into query plans and some definitions and finally getting to the good stuff which is optimization tips, which hopefully you’ll find useful.

Cypher Planners


The inception of Cypher included a rule planner consisting of rules that used indexes to produce query execution plans. With the inception of Neo4j 2.2, we introduced a new cost-based planner that is based on database knowledge principles relational databases have been using for decades.

Using a statistics service, we store information such as the number of movie labels or certain relationship types, and using those statistics we assign costs to various execution plans, which allows us to pick the cheapest one. All read-only queries use this by default because it generally performs much better than rule-based planners.

However, if you need to use the rule-based planner, you can prepend a query with CYPHER planner = rule on a query-by-query basis and set dbms.cypher.planner to RULE.

If you do find that the cost-based planner doesn’t work in one patch release, retest it with a new patch release. Additionally, if the speed of queries run with a rule are much faster than than those run with the cost-based planner, please let us know so that we can benchmark it. The Cypher team tries to benchmark as many queries (and types of queries) as we can with as many datasets as we can. We do this because we always want the cost-based planner to perform better than the rule.

Cypher Query Executions


Below is a basic bird’s-eye view of how a query is executed in Cypher:

Watch Mark Needham and Petra Selmer’s Presentation on How To Develop Effective Queries with Cypher


After we get the input query, we tokenize it and then abstract the syntax text tree to do semantic checking, which is basic error checking. Next, we optimize and normalize the abstract syntax tree and perform functions such as moving all labels and types from the MATCH clause into the WHERE clause, or converting equality statements such as name=Ann into an In.

From the abstract syntax tree we create a query graph, which is much easier to operate, and then create a series of logical plans. Using the data from our statistics store, we get the label and the index selectivity. To keep a query running quickly, you want to ensure high selectivity because it will return fewer rows. We also estimate our cardinality, which is the number of rows returned after each operation, and which we want to keep low for the same reason.

Imagine now we’ve got a plethora of logical plans that we’ve built over different costs. Next we employ an algorithm to select the cheapest one, which is then chosen and optimized. We then create an execution plan based on that logical plan, and query execution follows from that.

More resources on Cypher query executions can be found here and here.

In 2.2, we were using a greedy algorithm which guaranteed that although we wouldn’t choose the cheapest logical plan, it would not be the most expensive logical plan. In version 2.3 we’re using iterative dynamic programming (IDP), which is exhaustive and provides a cheaper logical plan and takes as much or less time than the greedy algorithm.

Interacting with a Query Plan


Mark Needham: How do you actually interact with that plan to make any changes that will make it faster?

Let’s imagine you’re in your Neo4j Browser with your query, and you’d like to pull up one of the plans explained by Petra. You can include two keywords that prefix your queries: EXPLAIN and PROFILE. EXPLAIN won’t run the query, but will instead provide you with the predicted costs. If you want to run the query and see how it performs, you prefix it with PROFILE.

Below is the longest Neo4j query plan we’ve ever seen — approximately 230 lines of Cypher — posted in Stack Overflow:

Neo4j's Longest Cypher Query Plan, Part 1Neo4j's Longest Cypher Query Plan, Part 2Neo4j's Longest Cypher Query Plan, Part 3


On each line there’s an operator name, and if you were to expand it, it will show you how much work it was doing.

The goal of query profiling is to keep the output of the queries exactly the same and reduce the number of database hits. A database hit is an abstract unit of storage engine work and could correspond to looking up a node, a property or a relationship. If you have one query with ten database hits and another query that performs the same function but only produces five database hits, the one that returns five is going to be faster.

Query Execution Plan Operators


Now we’re going to examine queries, review the query plans, compare them to each other and apply some fixes we’ve accumulated from helping people speed up their queries on Stack Overflow.

Execution Plan Operators in Cypher


The all nodes scan is the equivalent of “select star from star” – i.e. look at every single row in every single table – which is every node in the database. This will take a long time and come back with a production-sized dataset.

A label scan is the equivalent of “select a star from a table,” an example of which is “find all user nodes,” which reduces the cardinality of the number of nodes being queried. The best execution plan operators are “node index seek” and “node index scan.” With “node index seek” you have placed an index on some of your nodes that allow you to start a query at a specific node. The “node index scan” is similar and is used for range queries.

You can review execution plan operators more in depth here.

Using Node Labels


We’re going to have a look at some queries using a dataset from Cineasts, an extended version of the :play movies that comes in the Neo4j Browser and has more movies than IMDB.

Below is our first example, in which we have a MATCH and RETURN clause that will give us the movie The Matrix:

A Cypher Query Using an All Nodes Scan


The following, on the right-hand side of the slide, is what the database returns:

The Cypher Profile Results of an All Nodes Scan


The AllNodesScan looks up all 65,000 nodes in our database, applies a filter and returns one row. With the combined database hits from both functions, this search has performed 130,000 pieces of work, which amounts to an expensive search. This leads us to our first tip: If your node is labeled, include it in your query.

We address this by adding the syntax :movie, which tells our search to only examine nodes with the movie label:

A Cypher Query Using Node by Label Scan


Rather than performing an AllNodeScan, this search performs a NodeByLabelScan followed by a Filter, which gives us 25,000 database hits instead of 120,000.

Below is how the two queries compare:

The Cypher Profile Results of a Node by Label Scan


In both of these queries, we want to reduce the amount of red, which corresponds to the number of database hits. But we can do even better.

Using Indexes and Constraints


Petra: To tighten up your searches, you can use both indexes and constraints:

A Cypher Query using Node Index Seek


With the first create index on, we are setting an index on the title of a :movie; with the second, we are setting an index on the name of a :person, both of which allow us to create unique indexes. In graph databases, indexes are only used to find the starting point for queries, while in relational databases indexes are used to return a set of rows that are then used for JOINS.

If we apply NodeIndexSeek to The Matrix example, we only get two database hits instead of 120,000:

Cypher Profile Results of a Node Index Seek


Mark: In the next example, we want to find movies in which two actors — specifically Tom Hanks and Meg Ryan — have both appeared:

A Cypher Query with No Index Usage


For both Tom Hanks and Meg Ryan, we locate each person and then expand out the :acts_in relationship and then filter to make sure both people are different, which gives us around 500 database hits.

However, if you apply an index to both sides of this query — to both Tom Hanks and Meg Ryan — you’ll end up with a much better result. The syntax is :using index followed by the label and property combination on which you want to apply the indexes:

A Cypher Query using Enforced Index Usage


Again, an index is used to find the starting point. We have now directed our query to zoom in on Tom Hanks and Meg Ryan and find the connections between them. This gives the query plan a very different shape:

Cypher Profile Results of Enforced Index Usage


We run NodeIndexSeek on both the Tom Hanks and Meg Ryan sides, JOIN the two datasets, and then apply Filter to make sure we don’t have any duplicates. This returns 60 database hits, compared to the previous 500.

Below is a side-by-side of the two query plans:

Cypher Profile Results Comparing No Index Usage Vs. Enforced Index Usage


In order to be able to use the most effective query, you have to combine your domain knowledge with the planner’s knowledge of the query language.

Split MATCH Clauses to Reduce Cardinality


In the next query, we want to find Tom Hanks’ colleagues’ movies:

A Cypher Query with High Cardinality


To answer the question “For actors who have acted in movies with Tom Hanks, what movies did they appear in without Tom Hanks?” we start by finding Tom Hanks and the movies he’s acted in; the movies his co-actors have acted in; the movies they acted in together; and then the other movies the co-actor has appeared in.

We start with node index seek, expand out the :acts_in, apply a filter and end up with 2,000 rows. If we examine the structure of the dataset, we can see that we’re doing extra work because we aren’t eliminating the multiple times a single actor might come up if they’ve worked with Tom Hanks on multiple movies.

To address this, we make the following addition with the blue text below:

A Cypher Query with Reduced Cardinality Due to Split MATCH Clauses


We’ve split our query into two MATCH statements and run it WITH DISTINCT over the co-actors, which stops the database from performing multiple searches involving the same actor. This reduces the work in progress, and has brought us from around 10,000 database hits to 8,000.

Again, here are the two queries compared side by side:

Cypher Profile Results of Reduced Cardinality Due to Split MATCH Clauses


Using Size with Relationships


Petra: In our next example, we want to find the number of movies each actor has appeared in. Our first attempt might look something like this, which returns 94,700 rows:

A Cypher Query on a Large Dataset


To return a more manageable dataset, you can include SIZE with the :ACTS IN relationship:

A Cypher Query using SIZE with Relationships


The number of rows remains constant because we have a get degree operator. For each actor that’s found, it pulls the number of relationships of :ACTS IN and parses that through, which keeps us at 45,000 rows.

Forcing Indexes


We strive for our cost-based planner to pick the cheapest logical model and hints to come up with the optimal execution plan, which is sometimes very challenging. And while it is our goal to reduce the number of instances in which you have to apply hints, sometimes they are necessary.

If you notice a query running slowly even if you’ve already applied an index, you may have to force the index:

A Cypher Query with the USING INDEX Keyword


USING SCAN


Another hint you can use is something called USING SCAN.

In this particular case, you may already know that you have 65,000 total movie nodes but maybe only 12,000 nodes with comedy labels. Because you know that the comedy label is quite small – i.e., it’s got a very low cardinality – for badly performing queries, you might need to force a label scan on comedy.

Pictured below is how you would do that with the USING SCAN keyword in Cypher:

A Cypher Query with the USING SCAN Keyword


Additional Query Hints


Use Parameters

Mark: To perform the best possible query, use parameters whenever possible, especially when you’re building queries in your application. Rather than simply include labels and properties in your Cypher query, define the parameter name in the curly brackets and then feed that in when you call the API from your application.

Cypher will try to parameterize the top query and try to detect that same pattern in any subsequent queries. But by informing Cypher that a query is parameterized, you keep it from having to make that assumption. In the end, parameterizing your Cypher queries is always going to be more effective than Cypher trying to work them out for you.

Use Parameters in Your Cypher Query for Faster Results


Avoid Cartesian Products

Petra: It can be an unpleasant surprise if you you accidentally write a query that returns a Cartesian product. It’s important to ensure that you only use disconnected patterns very intentionally — if you really need to pull a Cartesian product.

In the first example below, count(a) and count(m) will return the total number of rows, i.e. the Cartesian product of a times m, which was not the intention. You can instead use the second example, which performs much faster:

How to Avoid Cartesian Products in Your Cypher Queries


In the latest version of Neo4j we included browser warnings — such as the Cartesian product warning — which are important to monitor.

Only Return What You Need

Mark: In an ideal query, the shape of the number of rows is consistent all the way down and has a small number of database hits in each step. To optimize your queries, you can examine where the most hits are happening in your query and try to systematically reduce them in each step to prevent a bottleneck.

We’ve seen people create nodes with hundreds of properties. The first query below will return every single property of the actor node, whereas the second query will only return the three properties specified:

A Cypher Query with RETURN Only for Node Properties


Final Tips

The final tip is try and keep the query short. It’s often better to do a number of small queries rather than one large query. It’s also important to keep read and write queries separate; if not, you can only use the RULE planner.

The Takeaway


TL;DR Review of Cypher Query Tuning Tips



Inspired by Mark and Petra’s talk? Register for GraphConnect Europe on April 26, 2016 at for more talks and workshops on Cypher and the evolving world of graph database technology.