5 Secrets to More Effective Neo4j 2.2 Query Tuning


Learn Five Secrets to More Effective Query Tuning with Neo4j 2.2.x

Even in Neo4j with its high performance graph traversals, there are always queries that could and should be run faster – especially if your data is highly connected and if global pattern matches make even a single query account for many millions or billions of paths.

For this article, we’re using the larger movie dataset, which is also listed on the example datasets page.

The domain model that interests us here, is pretty straightforward:

(:Person {name}) -[:ACTS_IN|:DIRECTED]-> (:Movie {title})
(:Movie {title}) -[:GENRE]-> (:Genre {name})


HARDWARE


I presume you use a sensible machine, with a SSD (or enough IOPS) and decent amount of RAM. For a highly concurrent load, there should be also enough CPU cores to handle it.

Other questions to consider: Did you monitor io-waits, top, for CPU and memory usage? Any bottlenecks that turn up?

If so, you should address those issues first.

On Linux, configure your disk scheduler to noop or deadline and mount the database volume with noatime. See this blog post for more information.

CONFIG


For best results, use the latest stable version of Neo4j (i.e., Neo4j Enterprise 2.2.5). There is always an Enterprise trial version available to give you a high-watermark baseline, so compare it to Neo4j Community on your machine as needed.

Set dbms.pagecache.memory=4G in conf/neo4j.properties or the size of the store-files (nodes, relationships, properties, string-properties) combined.

ls -lt data/graph.db/neostore.*.db
3802094 16 Jul 14:31 data/graph.db/neostore.propertystore.db
 456960 16 Jul 14:31 data/graph.db/neostore.relationshipstore.db
 442260 16 Jul 14:31 data/graph.db/neostore.nodestore.db
   8192 16 Jul 14:31 data/graph.db/neostore.schemastore.db
   8190 16 Jul 14:31 data/graph.db/neostore.labeltokenstore.db
   8190 16 Jul 14:31 data/graph.db/neostore.relationshiptypestore.db
   8175 16 Jul 14:31 data/graph.db/neostore.relationshipgroupstore.db

Set heap from 8 to 16G, depending on the RAM size of the machine. Also configure the young generation in conf/neo4j-wrapper.conf.

wrapper.java.initmemory=8000
wrapper.java.maxmemory=8000
wrapper.java.additional=-Xmn2G


That’s mostly it, config-wise. If you are concurrency heavy, you could also set the webserver threads in conf/neo4j-server.properties.

# cpu * 2
org.neo4j.server.webserver.maxthreads=24

QUERY TUNING


If these previous factors are taken care of, it’s now time to dig into query tuning. A lot of query tuning is simply prefixing your statements with EXPLAIN to see what Cypher would do and using PROFILE to retrieve the real execution data as well:

For example, let’s look at this query, which has the PROFILE prefix:

PROFILE
MATCH(g:Genre {name:"Action"})<-[:GENRE]-(m:Movie)<-[:ACTS_IN]-(a)
WHERE a.name =~ "A.*"
RETURN distinct a.name;


The result of this query is shown below in the visual query plan tool available in the Neo4j browser.

A Screenshot of a Query Plan for More Effective Query Tuning


While the visual query plan is nice in the Neo4j browser, the one in Neo4j-shell is easier to compare and it also has more raw numbers.

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 72310

Distinct

2048

860

2636

a.name

a.name

Filter(0)

2155

1318

41532

anon[32], anon[52], a, g, m

a.name ~= /{ AUTOSTRING1}/

Expand(All)(0)

2874

20766

23224

anon[32], anon[52], a, g, m

(m)←[:ACTS_IN]-(a)

Filter(1)

390

2458

2458

anon[32], g, m

m:Movie

Expand(All)(1)

390

2458

2459

anon[32], g, m

(g)←[:GENRE]-(m)

NodeUniqueIndexSeek

1

1

1

g

:Genre(name)



Query Tuning Tip #1: Use Indexes and Constraints for Nodes You Look Up by Properties

Check – with either schema or :schema – that there is an index in place for non-unique values and a constraint for unique values, and make sure – with EXPLAIN – that the index is used in your query.

CREATE INDEX ON :Movie(title);
CREATE INDEX ON :Person(name);
CREATE CONSTRAINT ON (g:Genre) ASSERT g.name IS UNIQUE;


Even for range queries (pre-Neo4j 2.3), it might be better to turn them into an IN query to leverage an index.

// if :Movie(released) is indexed, this query for the nineties will *not use* an index:
MATCH (m:Movie) WHERE m.released >= 1990 and m.released < 2000
RETURN count(*);

CREATE INDEX ON :Movie(released);

// but this will
MATCH (m:Movie) WHERE m.released IN range(1990,1999)  RETURN count(*);

// same for OR queries
MATCH (m:Movie) WHERE m.released = 1990 OR m.released = 1991 OR ...


Query Tuning Tip #2: Patterns with Bound Nodes are Optimized

If you have a pattern (node)-[:REL]→(node) where both nodes on either side are already bound, Cypher will optimize the match by taking the node-degree (number of relationships) into account when checking for the connection, starting on the smaller side and also caching internally.

So, for example, (actor)-[:ACTS_IN]->(movie) – if both actor and movie are known – turns into that described Expand(Into) operation.

If one side is not known, then it is a normal Expand(All) operation.

Query Tuning Tip #3: Enforce Index Lookups for Both Sides of a Path

Make sure that if nodes on both sides of a longer path can be found in an index, and are only a few hits of a larger total count, to add USING INDEX for both sides. In many cases, that makes a big difference. It doesn't help if the path explodes in the middle and a simple left-to-right traversal with property checks would touch fewer paths.

PROFILE
MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"})
RETURN count(*);

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 765

EagerAggregation

0

1

0

count(*)

Filter

0

3

531

anon[36], anon[49], anon[51], a, b

a:Person AND a.name == { AUTOSTRING0}) AND NOT(anon[36] == anon[51]

Expand(All)(0)

3

177

204

anon[36], anon[49], anon[51], a, b

()←[:ACTS_IN]-(a)

Expand(All)(1)

2

27

28

anon[49], anon[51], b

(b)-[:ACTS_IN]→()

NodeIndexSeek

1

1

2

b

:Person(name)


If we add the second index-hint, we get 10x fewer database hits.

PROFILE
MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"})
USING INDEX a:Person(name) USING INDEX b:Person(name)
RETURN count(*);

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 68

EagerAggregation

0

1

0

count(*)

Filter

0

3

0

anon[36], anon[49], anon[51], a, b

NOT(anon[36] == anon[51])

NodeHashJoin

0

3

0

anon[36], anon[49], anon[51], a, b

anon[49]

Expand(All)(0)

2

27

28

anon[49], anon[51], b

(b)-[:ACTS_IN]→()

NodeIndexSeek(0)

1

1

2

b

:Person(name)

Expand(All)(1)

2

35

36

anon[36], anon[49], a

(a)-[:ACTS_IN]→()

NodeIndexSeek(1)

1

1

2

a

:Person(name)


Query Tuning Tip #4: Defer Property Access

Make sure to access properties only as the last operation – if possible – and on the smallest set of nodes and relationships. Massive property loading is more expensive than following relationships.

For example, this query:

PROFILE
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
RETURN p.name, count(*) as c
ORDER BY c DESC limit 10;

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 404525

Projection(0)

308

10

0

anon[48], anon[54], c, p.name

anon[48]; anon[54]

Top

308

10

0

anon[48], anon[54]

{ AUTOINT0};

EagerAggregation

308

44689

0

anon[48], anon[54]

anon[48]

Projection(1)

94700

94700

189400

anon[48], anon[17], m, p

p.name

Filter

94700

94700

94700

anon[17], m, p

p:Person

Expand(All)

94700

94700

107562

anon[17], m, p

(m)←[:ACTS_IN]-(p)

NodeByLabelScan

12862

12862

12863

m

:Movie


This query shown above accesses p.name for all people, totaling 400,000 database hits. Instead, you should aggregate on the node first, then order and paginate, and only in the very end should you access and return the property.

PROFILE
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
WITH p, count(*) as c
ORDER BY c DESC LIMIT 10
RETURN p.name, c;

This second query above only accesses p.name for the top ten actors, and before that, it groups them directly by the nodes, saving us about 200,000 database hits.

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 215145

Projection

308

10

20

c, p, p.name

p.name; c

Top

308

10

0

c, p

{ AUTOINT0}; c

EagerAggregation

308

44943

0

c, p

p

Filter

94700

94700

94700

anon[17], m, p

p:Person

Expand(All)

94700

94700

107562

anon[17], m, p

(m)←[:ACTS_IN]-(p)

NodeByLabelScan

12862

12862

12863

m

:Movie

But that query could even be optimized more, with....​

Query Tuning Tip #5: Fast Relationship Counting

There is an optimal implementation for single path-expressions, by directly reading the degree of a node. Personally, I always prefer this method over optional matches, exists or general where conditions: size((s)-[:REL]->()) ← uses get-degree which is a constant time operation (similarly without rel-type or direction).

PROFILE
MATCH (n:Person) WHERE EXISTS((n)-[:DIRECTED]->())
RETURN count(*);

Here the plan doesn’t count the nested db-hits in the expression, which it should. That’s why I included the runtime:

1 row 197 ms

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 106396

EagerAggregation

194

1

56216

count(*)

Filter

37634

6037

0

n

NestedPipeExpression(ExpandAllPipe(…​.))

NodeByLabelScan

50179

50179

50180

n


versus

PROFILE
MATCH (n:Person) WHERE size((n)-[:DIRECTED]->()) <> 0
RETURN count(*);

1 row 90 ms

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 150538

EagerAggregation

213

1

0

count(*)

Filter

45161

6037

100358

n

NOT(GetDegree(n,Some(DIRECTED),OUTGOING) == { AUTOINT0})

NodeByLabelScan

50179

50179

50180

n

:Person


You can also use that technique nicely for overview pages or inline aggregations:

PROFILE
MATCH (m:Movie)
RETURN m.title, size((m)<-[:ACTS_IN]-()) as actors, size((m)<-[:DIRECTED]-()) as directors
LIMIT 10;

+-------------------------------------------------------------+
| m.title                                | actors | directors |
+-------------------------------------------------------------+
| "Indiana Jones and the Temple of Doom" | 13     | 1         |
| "King Kong"                            | 1      | 1         |
| "Stolen Kisses"                        | 21     | 1         |
| "One Flew Over The Cuckoo's Nest"      | 24     | 1         |
| "Ziemia obiecana"                      | 17     | 1         |
| "Scoop"                                | 21     | 1         |
| "Fire"                                 | 0      | 1         |
| "Dial M For Murder"                    | 5      | 1         |
| "Ed Wood"                              | 21     | 1         |
| "Requiem"                              | 11     | 1         |
+-------------------------------------------------------------+
10 rows
13 ms

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 71

Projection

12862

10

60

actors, directors,

m.title; GetDegree(m,Some(ACTS_IN),INCOMING);

m, m.title

GetDegree(m,Some(DIRECTED),INCOMING)

Limit

12862

10

0

m

{ AUTOINT0}

NodeByLabelScan

12862

10

11

m

:Movie


Our query from the previous section would look like this:

PROFILE
MATCH (p:Person)
WITH p, sum(size((p)-[:ACTS_IN]->())) as c
ORDER BY c DESC LIMIT 10
RETURN p.name, c;

This query shaves off another 50,000 database hits. Not bad.

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 150558

Projection

224

10

20

c, p, p.name

p.name; c

Top

224

10

0

c, p

{ AUTOINT0}; c

EagerAggregation

224

50179

100358

c, p

p

NodeByLabelScan

50179

50179

50180

p

:Person


Note to self: Optimized Cypher looks more like Lisp.

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed many multiple times.

This reduction of the cardinality can be done using either WITH DISTINCT or WITH aggregation (which automatically de-duplicates).

So, for instance, for this query of "Movies that Tom Hanks' colleagues acted in":

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)-[:ACTS_IN]->(m2)
RETURN distinct m2.title;

This query has 10,272 db-hits and touches 3,020 total paths.

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 10272

Distinct

4

2021

6040

m2.title

m2.title

Filter(0)

4

3020

0

anon[36], anon[53], anon[75], coActor, m1, m2, p

(NOT(anon[53] == anon[75]) AND NOT(anon[36] == anon[75]))

Expand(All)(0)

4

3388

3756

anon[36], anon[53], anon[75], coActor, m1, m2, p

(coActor)-[:ACTS_IN]→(m2)

Filter(1)

3

368

0

anon[36], anon[53], coActor, m1, p

NOT(anon[36] == anon[53])

Expand(All)(1)

3

403

438

anon[36], anon[53], coActor, m1, p

(m1)←[:ACTS_IN]-(coActor)

Expand(All)(2)

2

35

36

anon[36], m1, p

(p)-[:ACTS_IN]→(m1)

NodeIndexSeek

1

1

2

p

:Person(name)


The first-degree neighborhood is unique, since in this dataset there is only at most one :ACTS_IN relationship between an actor and a movie. So, the first duplicated nodes appear at the second degree, which we can eliminate like this:

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH distinct coActor
MATCH (coActor)-[:ACTS_IN]->(m2)
RETURN distinct m2.title;

This query tuning technique reduces the number of paths to match for the last step to 2,906. In other use cases with more duplicates, the impact is much bigger.

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 9529

Distinct(0)

4

2031

5812

m2.title

m2.title

Expand(All)(0)

4

2906

3241

anon[113], coActor, m2

(coActor)-[:ACTS_IN]→(m2)

Distinct(1)

3

335

0

coActor

coActor

Filter

3

368

0

anon[36], anon[53], coActor, m1, p

NOT(anon[36] == anon[53])

Expand(All)(1)

3

403

438

anon[36], anon[53], coActor, m1, p

(m1)←[:ACTS_IN]-(coActor)

Expand(All)(2)

2

35

36

anon[36], m1, p

(p)-[:ACTS_IN]→(m1)

NodeIndexSeek

1

1

2

p

:Person(name)


Of course we would apply our Minimize Property Access tip here too:

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH distinct coActor
MATCH (coActor)-[:ACTS_IN]->(m2)
WITH distinct m2
RETURN m2.title;

Operator Est.Rows Rows DbHits Identifiers Other

Total database accesses: 7791

Projection

4

2037

4074

m2, m2.title

m2.title

Distinct(0)

4

2037

0

m2

m2

Expand(All)(0)

4

2906

3241

anon[113], coActor, m2

(coActor)-[:ACTS_IN]→(m2)

Distinct(1)

3

335

0

coActor

coActor

Filter

3

368

0

anon[36], anon[53], coActor, m1, p

NOT(anon[36] == anon[53])

Expand(All)(1)

3

403

438

anon[36], anon[53], coActor, m1, p

(m1)←[:ACTS_IN]-(coActor)

Expand(All)(2)

2

35

36

anon[36], m1, p

(p)-[:ACTS_IN]→(m1)

NodeIndexSeek

1

1

2

p

:Person(name)


We still need the distinct m2 at the end, as the co-actors can have played in the same movies, and we don’t want duplicate results.

This query has 7,791 db-hits and touches 2,906 paths in total.

If you are also interested in the frequency (e.g., for scoring), you can compute them along with an aggregation instead of distinctly. In the end, You just multiply the path count per co-actor with the number of occurrences per movie.

MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH coActor, count(*) as freq
MATCH (coActor)-[:ACTS_IN]->(m2)
RETURN m2.title, freq * count(*) as occurrence;

Conclusion


The best way to start with query tuning is to take the slowest queries, PROFILE them and optimize them using these tips.

If you need help, you can always reach out to us on Stack Overflow, our Google Group or our public Discord channel.

If you are part of a project that is adopting Neo4j or putting it into production, make sure to get some expert help to ensure you’re successful. Note: If you do ask for help, please provide enough information for others to be able to help you. Explain your graph model, share your queries, their profile output and – best of all – a dataset to run them on.


Need more tips on how to effectively use Neo4j? Register for our online training class, Neo4j in Production, and learn how to master the world’s leading graph database.