Presentation Summary
In this presentation, Christophe Willemsen covers a variety of do-and-don’t tips to help your Cypher queries run faster than ever in Neo4j.
First, always use the official, up-to-date Bolt drivers. Next, leave out object mappers, as they add too much overhead and are not made for batch imports.
Then, Willemsen advises you to use query parameters, since parameters allow Neo4j to cache the query plan and reuse it the next time. Also, you should always reuse identifiers across your generated queries, because incremental identifiers prevent the query plan from being cached: Cypher will think it's a new query every time.
Willemsen’s next tip is to split long Cypher queries into smaller, more optimized queries for ease of profiling and debugging. In addition, he advises you to check your schema indexes. By creating a constraint in your Cypher query, you will automatically create a schema index in the database.
The final two tips are to batch your writes using Cypher's UNWIND feature for better performance, and finally, to beware of query replanning, which can plague more seasoned Cypher users: constantly changing statistics can slow down queries and introduce higher rates of garbage collection.

Full Presentation: Cypher: Write Fast and Furious
What we’re going to be talking about today is how to make the most out of the Cypher graph query language:
We will go over a few things not to do and will talk about ways to improve the performance of your Cypher queries.
Use Up-to-Date, Official Neo4j Drivers
The first thing to keep in mind is that you need to use an up-to-date, Neo4j-official Bolt driver.
The four official Neo4j drivers are for Python, Java, JavaScript and .NET. At GraphAware, we also maintain the PHP driver, which is compliant with the Neo4j Technology Compliance Kit (TCK).
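As a minimal sketch, connecting with the official Python driver looks like this (the URI and credentials are placeholders for your own deployment, and verify_connectivity is available in recent driver versions):

from neo4j import GraphDatabase

# Placeholder connection details for a local instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
driver.verify_connectivity()  # raises an error if the Bolt endpoint is unreachable
driver.close()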
Forget Object Mappers
The next thing to do is completely forget object mappers.
You can find object mappers such as Neo4j-OGM for Java, Python, etc., but when you want to write fast and you need to write queries tailored to your writes and your domain, an Object-Graph Mapper (OGM) adds a lot of overhead, is not made for batch imports and keeps you from going fast.
So if you want to write 100,000 nodes as fast as possible, it doesn’t make sense to use object mappers.
Use Query Parameters
It's always important to use query parameters. Take the following queries as an example:
MERGE (p:Person {name: "Robert"})
MERGE (p:Person {name: "Chris"})
MERGE (p:Person {name: "Michael"})
Each of these statements will merge one of the three people, but the values are hardcoded, so every statement is a different query string. Cypher can cache query plans, and using parameters allows Neo4j to reuse the cached plan the next time, which increases query speed.
So you would change it to look like this, running the same query once per person and passing the name as a parameter with the driver:
MERGE (p:Person {name: {name} })
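Here is a minimal sketch of the parameterized version with the official Python driver (connection details are placeholders; recent Neo4j versions write parameters as $name, while {name} above is the older syntax used in the talk):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for name in ["Robert", "Chris", "Michael"]:
        # The query string never changes, so Neo4j caches its plan once
        # and reuses it for every execution; only the parameter varies.
        session.run("MERGE (p:Person {name: $name})", name=name)

driver.close()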
Reuse Identifiers
When generating Cypher queries at the application level, I see a lot of people building incremental identifiers:
MERGE (p1:Person {name: "Robert"})
MERGE (p2:Person {name: "Chris"})
MERGE (p3:Person {name: "Michael"})
Using p1, p2, p3 and so on completely prevents the query plan from being cached: every generated query is a new string, so Cypher will treat it as a new query every time, meaning it has to compute statistics, build a plan and cache it all over again.
Let me show you the difference in the demo below:
Split Long Queries
Avoid long Cypher queries (30-40 lines) when possible by splitting your queries into smaller, separate queries.
You can then run all of these smaller, optimized queries in one transaction, which means you still don't have to worry about transactionality and ACID compliance. A query of two lines is much easier to maintain than one with 20 lines. Smaller queries are also easier to PROFILE, because you can quickly identify any bottlenecks in your query plan.

Just remember: a number of small, optimized queries will always run faster than one long, unoptimized query. It adds a bit of overhead in your code, but in the end you will really benefit from that overhead.
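As a sketch with the official Python driver, you might group several small queries into one explicit transaction (the names, labels and the WORKS_AT relationship type are made up for illustration):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    tx = session.begin_transaction()
    # Each statement stays short and easy to PROFILE on its own,
    # but they all commit (or roll back) together.
    tx.run("MERGE (p:Person {name: $name})", name="Robert")
    tx.run("MERGE (c:Company {name: $name})", name="GraphAware")
    tx.run("MATCH (p:Person {name: $p}), (c:Company {name: $c}) "
           "MERGE (p)-[:WORKS_AT]->(c)", p="Robert", c="GraphAware")
    tx.commit()

driver.close()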
Check Schema Indexes
Another thing is to check your schema indexes. In the Cypher query plan below, we create a range from zero to 10,000 and merge a new person node whose id is the current increment of the range:
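The talk shows the plan itself; the query behind it would look roughly like this sketch, reconstructed from the description above:

UNWIND range(0, 10000) AS i
MERGE (p:Person {id: i})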
So you can see in the query plan that it is doing a NodeByLabelScan. If I had 1,000 people, it would scan all 1,000 of them, checking for each one whether the value for the MERGE is the same; if not, it creates a new node. But whether it's 1,000, 1,000,000 or 10,000,000 nodes, the db hits of your query will keep growing, so it won't be as fast as you want it to be.
However, you can address this by creating a constraint, which will automatically create a schema index in the database and turn the lookup into an O(1) operation. Consider the Cypher query below:
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE
If you have a constraint on the person id, then the next time you do a MERGE (which is a MATCH or a CREATE), the MATCH will be an O(1) operation, so it will run very fast. The new query plan operator is NodeUniqueIndexSeek, which is really an O(1) operation.
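You can verify this by prefixing the query with PROFILE (reusing the range query sketched above) and checking that the plan now shows NodeUniqueIndexSeek instead of NodeByLabelScan:

PROFILE
UNWIND range(0, 10000) AS i
MERGE (p:Person {id: i})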
Batch Your Writes
In our earlier examples, we were issuing a new query to create each single node. Instead, you can defer your writes at the application level, for example by keeping an array of 1,000 operations, and then use UNWIND, which is a very powerful feature of Neo4j.

Below we are creating an array at the application level, which we pass as the first parameter:
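A sketch of what that could look like with the official Python driver (the batch shape and the KNOWS relationship type are assumptions for illustration):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Deferred writes collected at the application level.
batch = [
    {"id": 1, "name": "Robert",  "friends": [2, 3]},
    {"id": 2, "name": "Chris",   "friends": [3]},
    {"id": 3, "name": "Michael", "friends": []},
]

query = """
UNWIND $batch AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name
WITH p, row
UNWIND row.friends AS friendId
MERGE (f:Person {id: friendId})
MERGE (p)-[:KNOWS]->(f)
"""

# One round trip writes the whole batch instead of 1,000 single queries.
with driver.session() as session:
    session.run(query, batch=batch)

driver.close()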
The query iterates over this array and, for each entry, creates a person and sets its properties. In this array each person also has to be connected, so we create the person nodes and the relationships to the other people.
Below is a demo showing performance differences with and without schema indexes:
Beware of Query Replanning
The following relates to a problem that typically faces more experienced Cypher users in production scenarios: query replanning.
When you are creating a lot of nodes and relationships, the statistics are continually evolving, so Cypher may detect a plan as stale. However, you can disable this replanning during batch imports.
Consider the following holiday house recommendations use case: Every house node has 800 relationships to other top-k similar houses based on click sessions, search features and content-based recommendations.
The problem we encountered was that we were constantly recomputing the similarities in the background, deleting every relationship and recreating new ones to the new top-k of 800 similar houses. If you looked in the Neo4j logs, you would see a query detected as stale, then replanned, then detected as stale again, then replanned, and so on.
Cypher automatically replans queries because of continuously changing statistics, which can slow down queries and introduce higher rates of garbage collection. But there is a configuration in Neo4j that you can use to disable this replanning from the start.
The parameters for disabling replanning are:
cypher.min_replan_interval
and
cypher.statistics_divergence_threshold
The first defines the minimum lifetime of a query plan before the query is considered for replanning. The second is the threshold for when a plan is considered stale: if any of the underlying statistics used to create the plan have changed by more than this value, the plan is considered stale and will be replanned. A value of 0 means always replan, while a value of 1 means never replan.
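In neo4j.conf, that could look like the following sketch (the values are illustrative; per the above, a threshold of 1 means never replan, so only keep it there for the duration of a batch import):

# neo4j.conf (illustrative values)
cypher.min_replan_interval=60s
cypher.statistics_divergence_threshold=1.0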
I discussed this with the Cypher authors yesterday, and they are considering adding this option at the query level, because these configurations impact all of your other queries as well.
So this is something you can use to make your writes faster during an initial batch import. It is better than restarting Neo4j, but keep in mind that all of your MATCH queries and your user-facing queries will be impacted by it as well.