Introducing the new Cypher Query Optimizer


Written by Petra Selmer, Neo4j Developer.

Introducing the new Cypher Query Optimizer

You’ll have seen the excitement around the Neo4j 2.2 Release, and might have seen a few items about the Cypher query language. We in the Cypher team are delighted that our new Cypher Query Optimizer, codenamed ‘Ronja’, has been a big part of the release, and we’d love to share some of the details with you.

Ronja is a new cost-based query compiler whose goal is to produce an optimized query plan, leading to faster execution times for Cypher queries. For this release, Ronja is used for read-only queries with the following structure:

[MATCH WHERE]
[OPTIONAL MATCH WHERE]
[WITH [ORDER BY] [SKIP] [LIMIT]]
RETURN [ORDER BY] [SKIP] [LIMIT]

In this post, we will provide an overview of how Cypher queries are executed to show how Ronja helps.

Cypher query execution occurs in the following steps, where Ronja handles steps 2 – 5:
    1. Convert the input query string into an abstract syntax tree (AST)
    2. Optimize and normalize the AST
    3. Create a query graph from the normalized AST
    4. Create a logical plan from X
    5. Rewrite the logical plan
    6. Create an execution plan from the logical plan
    7. Execute the query using the execution plan
Let’s take a look at each of these in more detail. It’s a little geeky, but quite fun if you like languages.

Convert the input query string into an abstract syntax tree (AST)

The input query string is first tokenized and then parsed into an AST.

Using this AST, we perform semantic checking of the variable types and scoping of variables within the tree. Any errors regarding basic typing information – such as attempting to divide two strings – are caught and returned to the user.

At this point, the AST is marked up with semantic information and we keep a reference to the original input query string.

Optimize and normalize the AST

Ronja starts off by performing simple optimizations and normalizations on the AST; some of these include:
    • Moving all labels and types from the
       MATCH
      clause to
      WHERE
    • Suppressing redundant
       WITHs
    • Expanding aliases; e.g.
      RETURN * => RETURN x as x, y as y
    • Folding of constants; e.g.
      1 + 2 * 4 => 9
    • Naming anonymous pattern nodes; e.g.
       MATCH () => MATCH (n)
    • Converting the equality operator to an
       ‘IN’: MATCH (n) WHERE id(n) = 12 => MATCH n WHERE id(n) IN [12]
    • Other normalizations

Create a query graph from the normalized AST

We use the normalized AST to create a query graph, which is a more abstract, high level representation of the query. Using the query graph instead of operating directly on the AST allows us to compute costs and perform optimizations far more effectively, as we detail in the step below.

Create a logical plan from the query graph

A logical plan is a tree with up to 2 children, consisting of operators similar to those used by logical plans of relational databases.

A logical plan is produced for each query graph (depending on the query, a query graph may consist of sub query graphs). This is done in a step-by-step fashion following a bottom-up approach.

At each step, we firstly obtain data such as index and label selectivity from our new statistics store. This data is then used to estimate the cardinality – this is the number of matching rows – using information from the query graph. With this we can estimate a cost, which is used to build a candidate logical plan.

The cost of a logical plan is an estimate of the amount of work the database will have to do in order to execute the logical plan, which is dominated by I/O reads from the store and indices, and in-memory computational work such as expanding the graph by traversing more relationships and hence gathering more nodes.

Thus at each step, multiple candidate logical plans for the query graph are produced. Ronja uses a greedy search strategy to pick the cheapest logical plan from the multiple candidates as the process percolates up the query graph. Owing to the greedy search, the selected logical plan may not be the most optimal one, but it is guaranteed not to be the costliest.

Rewrite the logical plan

The logical plan is now rewritten using un-nesting, merging and simplification of various components; e.g. any equality operator transformed into ‘IN’s in Step 2 are transformed back into equality operators. At the end of this stage, Ronja has completed its goal.

Create an execution plan from the logical plan

An execution plan is created from the logical plan by choosing a physical implementation for logical operators, and subsequently cached. This is a tree composed of operators, where each non-leaf feeds from one or two children; more information on the operators may be found in the manual. We show below a visual representation of an execution plan for the query:

MATCH (a1:Artist {name: ‘Artist-1’}), (a2:Artist),

p = shortestPath((a1)-[*..3]-(a2))

RETURN p LIMIT 5


Document3

Execute the query using the execution plan

After the execution plan has been converted to nested iterators, the result is obtained in a pipelined manner by pulling from the top iterator, which pulls from the next iterator and so on.

The following graphic illustrates the entire process; Ronja comprises part of the ‘Semantic Analysis’ step through to the end of the ‘Logical Plan’ step.

Document2

Ronja in practice

So, for which types of queries does Ronja bring about the best improvements? We have conducted extensive benchmarking tests, and, empirically, queries containing multiple nodes that can be reached via indices and/or pattern predicates show the most improvement.

The following examples illustrate these types of queries. For the first query, an inexpensive index seek on a1 (finding the node(s) having ‘Artist-1’ in the name property) is undertaken twice, effectively resulting in two branches of execution. Each time, the branch’s result is expanded by traversing the outgoing relationships: the first expansion follows all PERFORMED_AT relationships to gather all matching nodes labelled with Concert; the second follows all SIGNED_WITH relationships, obtaining matching Company nodes. Further expansions on each branch follow, after which the results of the two branches are merged. The ability to construct plans such as this did not exist prior to Ronja.

MATCH (a1:Artist {name: ‘Artist-1’})-[:PERFORMED_AT]->

     (:Concert)<-[:PERFORMED_AT]-(a2:Artist)

MATCH (a1)-[:SIGNED_WITH]->(corp:Company)<-[:SIGNED_WITH]-(a2)

RETURN a2, COUNT(*)

ORDER BY COUNT(*) DESC

For the query below, prior to Ronja, the WHERE predicate (b)-[:SOLD_AT]->(s:Store) AND s.SalesRank > 3 would have just been an expression. Now, the predicate is an actual sub-query, which means that Ronja can take costs for it into account, resulting in the construction of a much more accurate logical plan.

MATCH (a:Author),

(a)-[:WROTE]->(b:Book),

(s:Store)

WHERE (b)-[:SOLD_AT]->(s:Store) AND s.SalesRank > 3 RETURN a.FullName, b.Title, s.StoreName

Future work

We will continue to improve Ronja, with our goal to be even faster than hand-crafted Java code. We shall also be integrating Ronja into the read-write and write-only queries so that they execute much more rapidly in the future. Besides query planning we also work on improving query execution by developing a new runtime that compiles logical plans down to executable “machine” code.


Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today.

Download My Ebook