5 Secrets to More Effective Neo4j 2.2 Query Tuning

Learn Five Secrets to More Effective Query Tuning with Neo4j 2.2.x

Even in Neo4j with its high performance graph traversals, there are always queries that could and should be run faster – especially if your data is highly connected and if global pattern matches make even a single query account for many millions or billions of paths.

For this article, we’re using the larger movie dataset, which is also listed on the example datasets page.

The domain model that interests us here, is pretty straightforward:

(:Person {name}) -[:ACTS_IN|:DIRECTED]-> (:Movie {title})
(:Movie {title}) -[:GENRE]-> (:Genre {name})

HARDWARE

I presume you use a sensible machine, with a SSD (or enough IOPS) and decent amount of RAM. For a highly concurrent load, there should be also enough CPU cores to handle it.

Other questions to consider: Did you monitor io-waits, top, for CPU and memory usage? Any bottlenecks that turn up?

If so, you should address those issues first.

On Linux, configure your disk scheduler to noop or deadline and mount the database volume with noatime. See this blog post for more information.

CONFIG

For best results, use the latest stable version of Neo4j (i.e., Neo4j Enterprise 2.2.5). There is always an Enterprise trial version available to give you a high-watermark baseline, so compare it to Neo4j Community on your machine as needed.

Set dbms.pagecache.memory=4G in conf/neo4j.properties or the size of the store-files (nodes, relationships, properties, string-properties) combined.

ls -lt data/graph.db/neostore.*.db
3802094 16 Jul 14:31 data/graph.db/neostore.propertystore.db
 456960 16 Jul 14:31 data/graph.db/neostore.relationshipstore.db
 442260 16 Jul 14:31 data/graph.db/neostore.nodestore.db
   8192 16 Jul 14:31 data/graph.db/neostore.schemastore.db
   8190 16 Jul 14:31 data/graph.db/neostore.labeltokenstore.db
   8190 16 Jul 14:31 data/graph.db/neostore.relationshiptypestore.db
   8175 16 Jul 14:31 data/graph.db/neostore.relationshipgroupstore.db

Set heap from 8 to 16G, depending on the RAM size of the machine. Also configure the young generation in conf/neo4j-wrapper.conf.

wrapper.java.initmemory=8000
wrapper.java.maxmemory=8000
wrapper.java.additional=-Xmn2G

That’s mostly it, config-wise. If you are concurrency heavy, you could also set the webserver threads in conf/neo4j-server.properties.

# cpu * 2
org.neo4j.server.webserver.maxthreads=24

QUERY TUNING

If these previous factors are taken care of, it’s now time to dig into query tuning. A lot of query tuning is simply prefixing your statements with EXPLAIN to see what Cypher would do and using PROFILE to retrieve the real execution data as well:

For example, let’s look at this query, which has the PROFILE prefix:

PROFILE
MATCH(g:Genre {name:"Action"})<-[:GENRE]-(m:Movie)<-[:ACTS_IN]-(a)
WHERE a.name =~ "A.*"
RETURN distinct a.name;

The result of this query is shown below in the visual query plan tool available in the Neo4j browser.

A Screenshot of a Query Plan for More Effective Query Tuning

While the visual query plan is nice in the Neo4j browser, the one in Neo4j-shell is easier to compare and it also has more raw numbers.

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 72310
Distinct	2048	860	2636	a.name	a.name
Filter(0)	2155	1318	41532	anon[32], anon[52], a, g, m	a.name ~= /{ AUTOSTRING1}/
Expand(All)(0)	2874	20766	23224	anon[32], anon[52], a, g, m	(m)←[:ACTS_IN]-(a)
Filter(1)	390	2458	2458	anon[32], g, m	m:Movie
Expand(All)(1)	390	2458	2459	anon[32], g, m	(g)←[:GENRE]-(m)
NodeUniqueIndexSeek	1	1	1	g	:Genre(name)

Query Tuning Tip #1: Use Indexes and Constraints for Nodes You Look Up by Properties

Check – with either schema or :schema – that there is an index in place for non-unique values and a constraint for unique values, and make sure – with EXPLAIN – that the index is used in your query.

CREATE INDEX ON :Movie(title);
CREATE INDEX ON :Person(name);
CREATE CONSTRAINT ON (g:Genre) ASSERT g.name IS UNIQUE;

Even for range queries (pre-Neo4j 2.3), it might be better to turn them into an IN query to leverage an index.

// if :Movie(released) is indexed, this query for the nineties will *not use* an index:
MATCH (m:Movie) WHERE m.released >= 1990 and m.released < 2000
RETURN count(*);

CREATE INDEX ON :Movie(released);

// but this will
MATCH (m:Movie) WHERE m.released IN range(1990,1999)  RETURN count(*);

// same for OR queries
MATCH (m:Movie) WHERE m.released = 1990 OR m.released = 1991 OR ...

Query Tuning Tip #2: Patterns with Bound Nodes are Optimized

If you have a pattern (node)-[:REL]→(node) where both nodes on either side are already bound, Cypher will optimize the match by taking the node-degree (number of relationships) into account when checking for the connection, starting on the smaller side and also caching internally.

So, for example, (actor)-[:ACTS_IN]->(movie) – if both actor and movie are known – turns into that described Expand(Into) operation.

If one side is not known, then it is a normal Expand(All) operation.

Query Tuning Tip #3: Enforce Index Lookups for Both Sides of a Path

Make sure that if nodes on both sides of a longer path can be found in an index, and are only a few hits of a larger total count, to add USING INDEX for both sides. In many cases, that makes a big difference. It doesn't help if the path explodes in the middle and a simple left-to-right traversal with property checks would touch fewer paths.

PROFILE
MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"})
RETURN count(*);

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 765
EagerAggregation	0	1	0	count(*)
Filter	0	3	531	anon[36], anon[49], anon[51], a, b	a:Person AND a.name == { AUTOSTRING0}) AND NOT(anon[36] == anon[51]
Expand(All)(0)	3	177	204	anon[36], anon[49], anon[51], a, b	()←[:ACTS_IN]-(a)
Expand(All)(1)	2	27	28	anon[49], anon[51], b	(b)-[:ACTS_IN]→()
NodeIndexSeek	1	1	2	b	:Person(name)

If we add the second index-hint, we get 10x fewer database hits.

PROFILE
MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"})
USING INDEX a:Person(name) USING INDEX b:Person(name)
RETURN count(*);

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 68
EagerAggregation	0	1	0	count(*)
Filter	0	3	0	anon[36], anon[49], anon[51], a, b	NOT(anon[36] == anon[51])
NodeHashJoin	0	3	0	anon[36], anon[49], anon[51], a, b	anon[49]
Expand(All)(0)	2	27	28	anon[49], anon[51], b	(b)-[:ACTS_IN]→()
NodeIndexSeek(0)	1	1	2	b	:Person(name)
Expand(All)(1)	2	35	36	anon[36], anon[49], a	(a)-[:ACTS_IN]→()
NodeIndexSeek(1)	1	1	2	a	:Person(name)

Query Tuning Tip #4: Defer Property Access

Make sure to access properties only as the last operation – if possible – and on the smallest set of nodes and relationships. Massive property loading is more expensive than following relationships.

For example, this query:

PROFILE
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
RETURN p.name, count(*) as c
ORDER BY c DESC limit 10;

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 404525
Projection(0)	308	10	0	anon[48], anon[54], c, p.name	anon[48]; anon[54]
Top	308	10	0	anon[48], anon[54]	{ AUTOINT0};
EagerAggregation	308	44689	0	anon[48], anon[54]	anon[48]
Projection(1)	94700	94700	189400	anon[48], anon[17], m, p	p.name
Filter	94700	94700	94700	anon[17], m, p	p:Person
Expand(All)	94700	94700	107562	anon[17], m, p	(m)←[:ACTS_IN]-(p)
NodeByLabelScan	12862	12862	12863	m	:Movie

This query shown above accesses p.name for all people, totaling 400,000 database hits. Instead, you should aggregate on the node first, then order and paginate, and only in the very end should you access and return the property.

PROFILE
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
WITH p, count(*) as c
ORDER BY c DESC LIMIT 10
RETURN p.name, c;

This second query above only accesses p.name for the top ten actors, and before that, it groups them directly by the nodes, saving us about 200,000 database hits.

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 215145
Projection	308	10	20	c, p, p.name	p.name; c
Top	308	10	0	c, p	{ AUTOINT0}; c
EagerAggregation	308	44943	0	c, p	p
Filter	94700	94700	94700	anon[17], m, p	p:Person
Expand(All)	94700	94700	107562	anon[17], m, p	(m)←[:ACTS_IN]-(p)
NodeByLabelScan	12862	12862	12863	m	:Movie

But that query could even be optimized more, with....

Query Tuning Tip #5: Fast Relationship Counting

There is an optimal implementation for single path-expressions, by directly reading the degree of a node. Personally, I always prefer this method over optional matches, exists or general where conditions: size((s)-[:REL]->()) ← uses get-degree which is a constant time operation (similarly without rel-type or direction).

PROFILE
MATCH (n:Person) WHERE EXISTS((n)-[:DIRECTED]->())
RETURN count(*);

Here the plan doesn’t count the nested db-hits in the expression, which it should. That’s why I included the runtime:

1 row 197 ms

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 106396
EagerAggregation	194	1	56216	count(*)
Filter	37634	6037	0	n	NestedPipeExpression(ExpandAllPipe(….))
NodeByLabelScan	50179	50179	50180	n

versus

PROFILE
MATCH (n:Person) WHERE size((n)-[:DIRECTED]->()) <> 0
RETURN count(*);

1 row 90 ms

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 150538
EagerAggregation	213	1	0	count(*)
Filter	45161	6037	100358	n	NOT(GetDegree(n,Some(DIRECTED),OUTGOING) == { AUTOINT0})
NodeByLabelScan	50179	50179	50180	n	:Person

You can also use that technique nicely for overview pages or inline aggregations:

PROFILE
MATCH (m:Movie)
RETURN m.title, size((m)<-[:ACTS_IN]-()) as actors, size((m)<-[:DIRECTED]-()) as directors
LIMIT 10;

+-------------------------------------------------------------+
| m.title                                | actors | directors |
+-------------------------------------------------------------+
| "Indiana Jones and the Temple of Doom" | 13     | 1         |
| "King Kong"                            | 1      | 1         |
| "Stolen Kisses"                        | 21     | 1         |
| "One Flew Over The Cuckoo's Nest"      | 24     | 1         |
| "Ziemia obiecana"                      | 17     | 1         |
| "Scoop"                                | 21     | 1         |
| "Fire"                                 | 0      | 1         |
| "Dial M For Murder"                    | 5      | 1         |
| "Ed Wood"                              | 21     | 1         |
| "Requiem"                              | 11     | 1         |
+-------------------------------------------------------------+
10 rows
13 ms

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 71
Projection	12862	10	60	actors, directors,	m.title; GetDegree(m,Some(ACTS_IN),INCOMING);
				m, m.title	GetDegree(m,Some(DIRECTED),INCOMING)
Limit	12862	10	0	m	{ AUTOINT0}
NodeByLabelScan	12862	10	11	m	:Movie

Our query from the previous section would look like this:

PROFILE
MATCH (p:Person)
WITH p, sum(size((p)-[:ACTS_IN]->())) as c
ORDER BY c DESC LIMIT 10
RETURN p.name, c;

This query shaves off another 50,000 database hits. Not bad.

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 150558
Projection	224	10	20	c, p, p.name	p.name; c
Top	224	10	0	c, p	{ AUTOINT0}; c
EagerAggregation	224	50179	100358	c, p	p
NodeByLabelScan	50179	50179	50180	p	:Person

Note to self: Optimized Cypher looks more like Lisp.

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed many multiple times.

This reduction of the cardinality can be done using either WITH DISTINCT or WITH aggregation (which automatically de-duplicates).

So, for instance, for this query of "Movies that Tom Hanks' colleagues acted in":

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)-[:ACTS_IN]->(m2)
RETURN distinct m2.title;

This query has 10,272 db-hits and touches 3,020 total paths.

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 10272
Distinct	4	2021	6040	m2.title	m2.title
Filter(0)	4	3020	0	anon[36], anon[53], anon[75], coActor, m1, m2, p	(NOT(anon[53] == anon[75]) AND NOT(anon[36] == anon[75]))
Expand(All)(0)	4	3388	3756	anon[36], anon[53], anon[75], coActor, m1, m2, p	(coActor)-[:ACTS_IN]→(m2)
Filter(1)	3	368	0	anon[36], anon[53], coActor, m1, p	NOT(anon[36] == anon[53])
Expand(All)(1)	3	403	438	anon[36], anon[53], coActor, m1, p	(m1)←[:ACTS_IN]-(coActor)
Expand(All)(2)	2	35	36	anon[36], m1, p	(p)-[:ACTS_IN]→(m1)
NodeIndexSeek	1	1	2	p	:Person(name)

The first-degree neighborhood is unique, since in this dataset there is only at most one :ACTS_IN relationship between an actor and a movie. So, the first duplicated nodes appear at the second degree, which we can eliminate like this:

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH distinct coActor
MATCH (coActor)-[:ACTS_IN]->(m2)
RETURN distinct m2.title;

This query tuning technique reduces the number of paths to match for the last step to 2,906. In other use cases with more duplicates, the impact is much bigger.

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 9529
Distinct(0)	4	2031	5812	m2.title	m2.title
Expand(All)(0)	4	2906	3241	anon[113], coActor, m2	(coActor)-[:ACTS_IN]→(m2)
Distinct(1)	3	335	0	coActor	coActor
Filter	3	368	0	anon[36], anon[53], coActor, m1, p	NOT(anon[36] == anon[53])
Expand(All)(1)	3	403	438	anon[36], anon[53], coActor, m1, p	(m1)←[:ACTS_IN]-(coActor)
Expand(All)(2)	2	35	36	anon[36], m1, p	(p)-[:ACTS_IN]→(m1)
NodeIndexSeek	1	1	2	p	:Person(name)

Of course we would apply our Minimize Property Access tip here too:

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH distinct coActor
MATCH (coActor)-[:ACTS_IN]->(m2)
WITH distinct m2
RETURN m2.title;

Operator	Est.Rows	Rows	DbHits	Identifiers	Other
Total database accesses: 7791
Projection	4	2037	4074	m2, m2.title	m2.title
Distinct(0)	4	2037	0	m2	m2
Expand(All)(0)	4	2906	3241	anon[113], coActor, m2	(coActor)-[:ACTS_IN]→(m2)
Distinct(1)	3	335	0	coActor	coActor
Filter	3	368	0	anon[36], anon[53], coActor, m1, p	NOT(anon[36] == anon[53])
Expand(All)(1)	3	403	438	anon[36], anon[53], coActor, m1, p	(m1)←[:ACTS_IN]-(coActor)
Expand(All)(2)	2	35	36	anon[36], m1, p	(p)-[:ACTS_IN]→(m1)
NodeIndexSeek	1	1	2	p	:Person(name)

We still need the distinct m2 at the end, as the co-actors can have played in the same movies, and we don’t want duplicate results.

This query has 7,791 db-hits and touches 2,906 paths in total.

If you are also interested in the frequency (e.g., for scoring), you can compute them along with an aggregation instead of distinctly. In the end, You just multiply the path count per co-actor with the number of occurrences per movie.

MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH coActor, count(*) as freq
MATCH (coActor)-[:ACTS_IN]->(m2)
RETURN m2.title, freq * count(*) as occurrence;

Conclusion

The best way to start with query tuning is to take the slowest queries, PROFILE them and optimize them using these tips.

If you need help, you can always reach out to us on Stack Overflow, our Google Group or our public Discord channel.

If you are part of a project that is adopting Neo4j or putting it into production, make sure to get some expert help to ensure you’re successful. Note: If you do ask for help, please provide enough information for others to be able to help you. Explain your graph model, share your queries, their profile output and – best of all – a dataset to run them on.

Need more tips on how to effectively use Neo4j? Register for our online training class, Neo4j in Production, and learn how to master the world’s leading graph database.

Sign Me Up

5 Secrets to More Effective Neo4j 2.2 Query Tuning

Michael Hunger, Head of Product Innovation & Developer Strategy, Neo4j Sep 25, 2015 2 mins read

HARDWARE

CONFIG

QUERY TUNING

Query Tuning Tip #1: Use Indexes and Constraints for Nodes You Look Up by Properties

Query Tuning Tip #2: Patterns with Bound Nodes are Optimized

Query Tuning Tip #3: Enforce Index Lookups for Both Sides of a Path

Query Tuning Tip #4: Defer Property Access

Query Tuning Tip #5: Fast Relationship Counting

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

Conclusion

GraphAcademy

Developer Blog

Author

Michael Hunger, Head of Product Innovation & Developer Strategy, Neo4j