Architecture guidance

Overview

Whether you are sending data from Spark to Neo4j or pulling data out of Neo4j, the two systems are organized around fundamentally different data models:

  • Spark is oriented around tabular DataFrames.

  • Neo4j is oriented around graphs (nodes, relationships, paths).

The use of this connector can be thought of as a "Transform" step and a "Load" step, regardless of which direction the data is moving. The "Transform" step is concerned with how to move data between graphs and tables; the "Load" step is concerned with moving data in bulk.

Tables for labels

To move data back and forth between Neo4j and Spark, the connector uses an approach called Tables for Labels. Consider the following path:

(:Customer)-[:BOUGHT]->(:Product)

You can decompose it into:

  • two nodes: Customer and Product.

  • one relationship between them: BOUGHT.

So in total, you have three graph entities, and each one is managed as a simple table.

So given the following Cypher® query:

CREATE (c1:Customer{id:1, username: 'foo', city: 'Venezia'}),
(p1:Product{id: 100, name: 'My Awesome Product'}),
(c2:Customer{id:2, username: 'bar', city: 'Pescara'}),
(p2:Product{id: 101, name: 'My Awesome Product 2'}),
(c1)-[:BOUGHT{quantity: 10}]->(p1),
(c2)-[:BOUGHT{quantity: 13}]->(p1),
(c2)-[:BOUGHT{quantity: 4}]->(p2);

The graph structure is decomposed into three tables.

Table 1. Customer table

id | username | city
1  | foo      | Venezia
2  | bar      | Pescara

Table 2. Product table

id  | name
100 | My Awesome Product
101 | My Awesome Product 2

Table 3. BOUGHT table

source_id | target_id | quantity
1         | 100       | 10
2         | 100       | 13
2         | 101       | 4
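
To make the mapping concrete, the Customer table can be read back into Spark as a regular DataFrame. The following is a minimal sketch; the URL and connection settings are placeholders for your own environment.

// Minimal sketch: the Customer "table" surfaces as a DataFrame with one row
// per node and one column per property. The url value is a placeholder.
val customerDf = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://localhost:7687")
  .option("labels", "Customer")
  .load()

customerDf.show()
// Typically includes the property columns (id, username, city) plus
// connector-managed columns such as <id> and <labels>.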

To obtain this decomposition, you can use two strategies that allow you to extract metadata from your graph: the APOC-based approach and automatic sampling, both described in the Schema considerations section below.

Graph transforms

General principles

  • Wherever possible, perform data quality fixes before loading into Neo4j; this includes dropping missing records, changing data types of properties, and so on.

  • Since Spark excels at parallel computation, any heavy non-graph computation should be done in the Spark layer rather than in Cypher on Neo4j.

  • Size your Neo4j instance appropriately before using aggressive parallelism or large batch sizes.

  • Experiment with larger batch sizes, ensuring that batches stay within Neo4j’s configured heap memory. In general, the larger the batches, the faster the overall throughput to Neo4j.

Converting data from DataFrames to graphs

When taking any complex set of DataFrames and preparing it for load into Neo4j, you have two options:

  • Normalized loading

  • Cypher destructuring

This section describes both and provides information on the benefits and potential drawbacks from a performance and complexity perspective.

Where possible, use the normalized loading approach for best performance and maintainability.

Normalized loading

Suppose you want to load a single DataFrame called purchases into Neo4j with the following content:

product_id,product,customer_id,customer,quantity
1,Socks,10,David,2
2,Pants,11,Andrea,1

This data represents a simple (:Customer)-[:BOUGHT]->(:Product) graph model.

The normalized loading approach requires that you create a number of different DataFrames: one for each node label and relationship type in your desired graph. For example, in this case, you might create three DataFrames:

  • val products = spark.sql("SELECT product_id, product FROM purchases")

  • val customers = spark.sql("SELECT customer_id, customer FROM purchases")

  • val bought = spark.sql("SELECT product_id, customer_id, quantity FROM purchases")

Once these simple DataFrames represent a normalized view of 'tables for labels' (that is, one DataFrame/table per node label or relationship type), the existing utilities provided by the connector for writing nodes and relationships can be used with no additional Cypher needed. Additionally, if these DataFrames are made unique by identifier, the data is already prepared for maximum parallelism. (See the parallelism notes in the sections below.)
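
As a minimal sketch of what those writes might look like, the node and relationship DataFrames can be pushed to Neo4j with the connector's node and relationship write options. The URL, save modes, and exact option values here are placeholders; check option names such as labels, node.keys, relationship.save.strategy, and the relationship source/target keys against your connector version.

import org.apache.spark.sql.SaveMode

// Sketch: write the node DataFrames, merging on their identifier columns.
products.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "neo4j://localhost:7687")
  .option("labels", ":Product")
  .option("node.keys", "product_id")
  .save()

// customers is written the same way, with ":Customer" and "customer_id".

// Sketch: write the relationship DataFrame, keyed on the node identifiers.
bought.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "neo4j://localhost:7687")
  .option("relationship", "BOUGHT")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Customer")
  .option("relationship.source.node.keys", "customer_id")
  .option("relationship.target.labels", ":Product")
  .option("relationship.target.node.keys", "product_id")
  .save()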

Advantages
  • The normalized loading approach shifts most of the data transform work (splitting, deduplicating, and partitioning the data) to Spark itself. Any data transform/cleanup work should be done in Spark where possible.

  • This approach makes for easy-to-follow code; ultimately, writing each DataFrame to Neo4j is quite simple and requires little more than a label and a key.

  • This allows for parallelism (discussed in sections below).

Disadvantages
  • You need to do more SQL work before data is loaded into Neo4j.

  • This approach requires identifying graph schema before beginning, as opposed to loading data into Neo4j and using Cypher to manipulate it afterwards.

Cypher destructuring

Cypher destructuring is the process of using a single Cypher statement to process a complex record into a finished graph pattern. Let’s look again at the data example:

product_id,product,customer_id,customer,quantity
1,Socks,10,David,2
2,Pants,11,Andrea,1

To store this in Neo4j, you might use a Cypher query like this:

MERGE (p:Product { id: event.product_id })
  ON CREATE SET p.name = event.product
WITH p
MERGE (c:Customer { id: event.customer_id })
  ON CREATE SET c.name = event.customer
MERGE (c)-[:BOUGHT { quantity: event.quantity }]->(p);

In this case, the entire job can be done with a single Cypher statement. However, as DataFrames get more complex, the corresponding Cypher statements can become quite complicated as well.
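
If you take this route, the statement is typically handed to the connector through its query write option. The following is a rough sketch with placeholder connection settings; the connector exposes each DataFrame row to the query as event, and save-mode handling may vary by connector version.

import org.apache.spark.sql.SaveMode

// Sketch: push the destructuring Cypher through the "query" write option.
purchases.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Append)
  .option("url", "neo4j://localhost:7687")
  .option("query",
    """MERGE (p:Product { id: event.product_id })
      |  ON CREATE SET p.name = event.product
      |WITH p
      |MERGE (c:Customer { id: event.customer_id })
      |  ON CREATE SET c.name = event.customer
      |MERGE (c)-[:BOUGHT { quantity: event.quantity }]->(p)""".stripMargin)
  .save()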

Advantages
  • Extremely flexible: you can do anything that Cypher provides for.

  • If you are a Neo4j expert, it is easy for you to get started with.

Disadvantages
  • The Cypher destructuring approach tends to shift transform work to Neo4j, which is not a good idea because Neo4j does not have the same infrastructure for that work as Spark.

  • This approach tends to create heavy locking behavior, which limits parallelism and can hurt performance.

  • It encourages you to embed schema information in a Cypher query rather than use Spark utilities.

Converting data from graphs back to DataFrames

In general, always have an explicit RETURN statement and destructure your results.

A common pattern is to write a complex Cypher statement, perhaps one that traverses many relationships, to return a dataset to Spark. Since Spark does not understand graph primitives, there are not many useful ways to represent a raw node, relationship, or a path in Spark. As a result, it is highly recommended not to return those types from Cypher to Spark, focusing instead on concrete property values and function results which you can represent as simple types in Spark.

For example, the following Cypher query results in an awkward DataFrame that would be hard to manipulate:

MATCH path=(p:Person { name: "Andrea" })-[r:KNOWS*]->(o:Person)
RETURN path;

A better Cypher query which results in a cleaner DataFrame is as follows:

MATCH path=(p:Person { name: "Andrea" })-[r:KNOWS*]->(o:Person)
RETURN length(path) as pathLength, p.name as p1Name, o.name as p2Name
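
As a rough sketch (again with placeholder connection settings), such a query can be read straight into a DataFrame via the connector's query read option, with each returned column becoming a DataFrame column:

// Sketch: read the destructured result back as a tabular DataFrame.
val knows = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://localhost:7687")
  .option("query",
    """MATCH path=(p:Person { name: "Andrea" })-[r:KNOWS*]->(o:Person)
      |RETURN length(path) AS pathLength, p.name AS p1Name, o.name AS p2Name""".stripMargin)
  .load()

knows.show()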

Improving performance

To get the best possible performance reading from (and particularly writing to) Neo4j, make sure you’ve gone through this checklist:

  1. Tune your batch size.

  2. Tune your Neo4j memory configuration.

  3. Have the correct indexes.

  4. Tune your parallelism.

Each of the following sections describes these in detail.

Tune your batch size

Writing data to Neo4j happens transactionally in batches; if you want to write 1 million nodes, you might break that into 40 batches of 25,000. The batch size of the connector is controlled by the batch.size option and is set to a fairly low, conservative level. This is likely too low for many applications and can be improved with better knowledge of your data.

Batch size trade-off is as follows:

  • The bigger the batch size, the better the overall ingest performance, because it means fewer transactions, and less overall transactional overhead.

  • When batch sizes become so large that Neo4j’s heap memory cannot accommodate them, they can cause out-of-memory errors on the server and lead to failed writes.

Best write throughput comes when you use the largest batch size you can, while staying in the range of memory available on the server.

It’s impossible to pick a single batch size that works for everyone, because how much memory your transactions take up depends on the number of properties and relationships, among other factors. A good, fairly aggressive value to try is around 20,000, but you can increase this number if your data is small or if you have a lot of memory on the server. Lower the number if it’s a small database server or if the data you’re pushing has many large properties.
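
For example, a write can raise the batch size explicitly. This is a sketch reusing the customers DataFrame from the normalized loading example, with placeholder connection settings; batch.size is the connector option discussed above.

import org.apache.spark.sql.SaveMode

// Sketch: raise the connector's batch size for a node write.
customers.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "neo4j://localhost:7687")
  .option("labels", ":Customer")
  .option("node.keys", "customer_id")
  .option("batch.size", "20000")
  .save()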

Tune your Neo4j memory configuration

In the Neo4j Operations Manual, important advice is given on how to size the heap and page cache of the server. What’s important for Spark is this:

  • Heap affects how big transactions can get. The bigger the heap, the larger the batch size you can use.

  • Page cache affects how much of your database stays resident in RAM at any given time. Page caches which are much smaller than your database cause performance to suffer.

Have the correct indexes

At the Neo4j Cypher level, it’s very common to use the Spark connector in a way that generates MERGE queries. In Neo4j, this looks up a node by some 'key' and then creates it only if it does not already exist.

It is strongly recommended to assert indexes or constraints on any graph property that you use as part of node.keys, relationship.source.node.keys, relationship.target.node.keys or other similar key options.

A common source of poor performance is to write Spark code that generates MERGE Cypher, or otherwise tries to look data up in Neo4j without the appropriate database indexes. In this case, the Neo4j server ends up looking through much more data than necessary to satisfy the query, and performance suffers.
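
As a minimal sketch of preparing such an index, assuming the official Neo4j Java driver on the classpath, placeholder connection details, and Neo4j 4.4+ constraint syntax, you can create a uniqueness constraint (which is backed by an index) on the key property before the Spark job writes:

import org.neo4j.driver.{AuthTokens, GraphDatabase}

// Sketch: ensure a uniqueness constraint exists on the property used as the
// node key (here Customer.id). URL, credentials, and names are placeholders.
val driver = GraphDatabase.driver("neo4j://localhost:7687",
  AuthTokens.basic("neo4j", "password"))
val session = driver.session()
try {
  session.run(
    "CREATE CONSTRAINT customer_id_unique IF NOT EXISTS " +
    "FOR (c:Customer) REQUIRE c.id IS UNIQUE")
} finally {
  session.close()
  driver.close()
}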

Tune your parallelism

Spark is fundamentally about partitioning and parallelism; the go-to technique is to split a batch of data into partitions for each machine to work on in parallel. In Neo4j, parallelism works very differently, as this section describes.

Write parallelism in Neo4j

For most writes to Neo4j, it is strongly recommended to repartition your DataFrame to one partition only.

When writing nodes and relationships in Neo4j:

  • writing a relationship locks both nodes.

  • writing a node locks the node.

Additionally, in the Neo4j Causal Cluster model, only the cluster leader may write data. Since writes scale vertically in Neo4j, the practical parallelism is limited to the number of cores on the leader.

The reason a single partition for writes is recommended is that it eliminates lock contention between writes. Suppose one partition is writing:

(:Person { name: "Michael" })-[:KNOWS]->(:Person { name: "Andrea" })

While another partition is writing:

(:Person { name: "Andrea" })-[:KNOWS]->(:Person { name: "Davide" })

Both relationship writes lock the "Andrea" node, so these writes cannot proceed in parallel in any case. As a result, you may not gain performance by parallelizing more if threads have to wait for each other’s locks. In extreme cases with too much parallelism, Neo4j may reject the writes with lock contention errors.
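
A small sketch of the single-partition write, reusing the placeholder relationship options from the normalized loading example above:

import org.apache.spark.sql.SaveMode

// Sketch: collapse to one partition so relationship writes do not contend
// for node locks.
bought
  .repartition(1)
  .write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "neo4j://localhost:7687")
  .option("relationship", "BOUGHT")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Customer")
  .option("relationship.source.node.keys", "customer_id")
  .option("relationship.target.labels", ":Product")
  .option("relationship.target.node.keys", "product_id")
  .save()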

Dataset partitioning

You can use as many partitions as there are cores in the Neo4j server, if you have properly partitioned your data to avoid Neo4j locks.

There is an exception to the "one partition" rule above: if your data writes are partitioned ahead of time to avoid locks, you can generally do as many write threads to Neo4j as there are cores in the server. Suppose we want to write a long list of :Person nodes, and we know they are distinct by the person id. We might stream those into Neo4j in four different partitions, as there will not be any lock contention.
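
A rough sketch of that case, assuming a people DataFrame with a person_id column that uniquely identifies each person (all names and connection settings here are placeholders):

import org.apache.spark.sql.SaveMode

// Sketch: nodes that are unique by key can be written from several partitions
// without lock contention.
people
  .dropDuplicates("person_id")
  .repartition(4)
  .write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "neo4j://localhost:7687")
  .option("labels", ":Person")
  .option("node.keys", "person_id")
  .save()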

Schema considerations

Neo4j does not have a fixed schema; individual properties can contain multiple differently typed values. Spark, on the other hand, tends to expect a fixed schema. For this reason, the connector contains a number of schema inference techniques to ease this mapping. Paying close attention to how these features work helps explain the schemas you get in different scenarios.

The two core techniques are:

APOC

If your Neo4j installation has APOC installed, this approach is used by default. The stored procedures within APOC allow inspection of the metadata in your graph and provide information such as the type of relationship properties and the universe of possible properties attached to a given node label.

If you wish, you can try these calls yourself on your Neo4j database; simply execute:

CALL apoc.meta.nodeTypeProperties();
CALL apoc.meta.relTypeProperties();

Inspect the results: they show the metadata that the Neo4j Connector for Apache Spark uses to represent nodes and relationships read into DataFrames.

This approach uses a configurable sampling technique that looks through many (but not all) instances in the database to build a profile of the valid values that exist within properties. If the schema that is produced is not what is expected, take care to inspect the underlying data to ensure it has a consistent property set across all nodes of a label, or investigate tuning the sampling approach.

Tune parameters

You can tune the configuration parameters of the two APOC procedures via the option method, as follows:

// Requires: import org.neo4j.spark.DataSource
// The URL is a placeholder for your own connection settings.
spark.read
      .format(classOf[DataSource].getName)
      .option("url", "neo4j://localhost:7687")
      .option("labels", "Product")
      .option("apoc.meta.nodeTypeProperties", """{"sample": 10}""")
      .load()

or

spark.read
      .format(classOf[DataSource].getName)
      .option("url", "neo4j://localhost:7687")
      .option("relationship", "BOUGHT")
      .option("relationship.source.labels", "Customer")
      .option("relationship.target.labels", "Product")
      .option("apoc.meta.relTypeProperties", """{"sample": 10}""")
      .load()

For both procedures you can pass all the supported parameters except for:

  • includeLabels for apoc.meta.nodeTypeProperties, because the connector uses the labels defined in the labels option.

  • includeRels for apoc.meta.relTypeProperties, because the connector uses the relationship type defined in the relationship option.

Fine tuning

Because these two procedures sample the graph to extract the metadata needed to build the Tables for Labels, in most real-world scenarios it is crucial to tune the sampling parameters properly: sampling can be expensive and can impact the performance of your extraction job.

Automatic sampling

In some installations and environments, the key APOC calls above are not available. In these cases, the connector automatically samples the first few records and infers the correct data type from the examples that it sees.

Automatic sampling may be error prone and may produce incorrect results, particularly in cases where a single Neo4j property exists with several different data types. Consistent typing of properties is strongly recommended.