Neo4j Connector for Apache Spark FAQ

How can I speed up writes to Neo4j?

The Spark connector fundamentally writes data to Neo4j in batches. Neo4j is a transactional database, so all modifications are made within a transaction, and those transactions carry overhead.

The two simplest ways to increase write performance are:

* Increase the batch size (option batch.size). The larger the batch, the fewer transactions are executed to write all of your data, and the less transactional overhead is incurred.
* Ensure that your Neo4j instance has ample free heap and a properly sized page cache. A small heap prevents you from committing large batches, which in turn slows the overall import.
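
For example, a larger batch size can be set directly on the write. This is a minimal sketch assuming a Scala job; the connection URL, credentials, label, and key column are placeholders:

  // Writing a DataFrame with a larger batch size (URL, credentials,
  // label, and key column below are placeholders).
  import org.apache.spark.sql.SaveMode

  df.write
    .format("org.neo4j.spark.DataSource")
    .mode(SaveMode.Overwrite)
    .option("url", "neo4j://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", ":Person")
    .option("node.keys", "id")
    .option("batch.size", "10000")
    .save()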

For best performance, make sure you are familiar with the material in the Neo4j Performance Tuning Guide.

It is important to keep in mind that Neo4j scales writes vertically and reads horizontally. In the Causal Cluster model, only the cluster leader (a single machine) may accept writes. For this reason, focus on getting the best hardware and performance on your cluster leader to maximize write throughput.

Where can I get help?

The Neo4j Community site is a great place to ask questions, talk with other users of the connector, and get help from Neo4j pros.

What is the license for this connector?

The source code is offered under the terms of the Apache 2.0 open source license. You are free to download, modify, and redistribute the connector; however, Neo4j support applies only to official builds provided by Neo4j.

Is this software connected to Morpheus or Cypher for Apache Spark (CAPS)?

No. The two projects share no code and take very different approaches. Cypher for Apache Spark/Morpheus provided an interpreter that could execute Cypher queries within the Spark environment, along with a native graph representation for Spark. By contrast, this connector does not provide that functionality; it focuses on reads and writes back and forth between Neo4j and Spark. With this connector, all Cypher code is executed strictly within Neo4j. The Spark environment operates in terms of DataFrames as it always did, and this connector does not provide graph API primitives for Spark.

Can this connector be used for pre-processing of data and loading into Neo4j?

Yes. This connector enables Spark to be used as a good method of loading data directly into Neo4j. See the architecture section for a detailed discussion of "Normalized Loading" vs. "Cypher Destructuring" and guidance on different approaches to performant data loads into Neo4j.

My writes are failing due to Deadlock Exceptions

In some cases, Neo4j will reject write transactions due to a deadlock exception that you may see in the stacktrace.

This Neo4j Knowledge Base entry describes the issue.

Typically this is caused by too much parallelism in writing to Neo4j. For example, when you write a relationship (:A)-[:REL]->(:B), the database takes a lock on both nodes. If another thread simultaneously attempts to write to those same nodes too often, deadlock exceptions can result and the transaction will fail.

In general, the solution is to repartition the DataFrame prior to writing it to Neo4j, so that multiple partitioned writes do not lock the same nodes and relationships.
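
A minimal sketch of that approach, assuming a Scala job (the relationship type, labels, and key columns are hypothetical):

  // Reduce write parallelism before saving relationships, so fewer
  // concurrent transactions contend for the same node locks.
  import org.apache.spark.sql.SaveMode

  df.repartition(1)
    .write
    .format("org.neo4j.spark.DataSource")
    .mode(SaveMode.Append)
    .option("url", "neo4j://localhost:7687")
    .option("relationship", "REL")
    .option("relationship.save.strategy", "keys")
    .option("relationship.source.labels", ":A")
    .option("relationship.source.node.keys", "source_id:id")
    .option("relationship.target.labels", ":B")
    .option("relationship.target.node.keys", "target_id:id")
    .save()

A single partition avoids lock contention from this job entirely; in practice you can keep a small number of partitions and tune it against your write throughput.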

I’m getting a cast error like UTF8String cannot be cast to Long. How do I solve it?

You might be getting an error like:

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long

or similar, with different types.

This is typically due to a field having different types across nodes with the same label. You can solve it by adding APOC to your Neo4j installation; this removes the error, but all the values for that field are cast to String. This is because Spark is not schema-free and needs each column to always have the same type.

You can read more here.

The returned columns are not in the same order as I specified in the query

Unfortunately this is a known issue affecting Neo4j 3.* and Neo4j 4.0. With Neo4j 4.1+ you get the same order as specified in the RETURN statement.

TableProvider implementation org.neo4j.spark.DataSource cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.

If you are getting this error while trying to write to Neo4j, be aware that the current version of the connector doesn't support SaveMode.ErrorIfExists on Spark 3, even though that is the default save mode. Please change the save mode to either SaveMode.Append or SaveMode.Overwrite.

We are working to fully support all save modes on Spark 3.
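
For instance, a minimal sketch that sets a supported save mode explicitly (the connection options and label are placeholders):

  // Explicitly selecting a supported save mode on Spark 3.
  import org.apache.spark.sql.SaveMode

  df.write
    .format("org.neo4j.spark.DataSource")
    .mode(SaveMode.Append)
    .option("url", "neo4j://localhost:7687")
    .option("labels", ":Person")
    .save()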

I’m getting a NoClassDefFoundError or ClassNotFoundException.

If you see this type of error:

NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
Caused by: ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport

This means that your Spark version doesn't match the Spark version the connector was built for. Please refer to this page to find out which version you need.

Getting "Failed to invoke procedure gds.graph.create.cypher: Caused by: java.lang.IllegalArgumentException: A graph with name [name] already exists."

This might happen when creating a new graph using the GDS library. The issue is that the query is run a first time to extract the DataFrame schema and then run again to retrieve the data, so the second execution finds that the graph already exists.

To avoid this issue, we suggest using the user-defined schema approach.
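
As a sketch of that approach, supplying the schema up front means the connector does not need an extra run of the query to infer it. The column names, types, and GDS procedure call below are assumptions and should be adapted to your actual query:

  // Supply a user-defined schema so the GDS query is not executed an
  // extra time for schema inference (columns and Cypher are placeholders).
  import org.apache.spark.sql.types._

  val gdsSchema = StructType(Seq(
    StructField("graphName", StringType),
    StructField("nodeCount", LongType),
    StructField("relationshipCount", LongType)
  ))

  val df = spark.read
    .format("org.neo4j.spark.DataSource")
    .schema(gdsSchema)
    .option("url", "neo4j://localhost:7687")
    .option("query",
      """CALL gds.graph.create.cypher(
        |  'myGraph',
        |  'MATCH (n) RETURN id(n) AS id',
        |  'MATCH (a)-->(b) RETURN id(a) AS source, id(b) AS target')
        |YIELD graphName, nodeCount, relationshipCount
        |RETURN graphName, nodeCount, relationshipCount""".stripMargin)
    .load()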