Neo4j Connector for Apache Spark FAQ

What is the license for this connector?

The source code is offered under the terms of the Apache 2.0 open source license. You are free to download, modify, and redistribute the connector; however, Neo4j support applies only to official builds provided by Neo4j.

Is this software connected to Morpheus or Cypher for Apache Spark (CAPS)?

No. There is no shared code or approach between the two, and they take very different approaches. Cypher for Apache Spark/Morpheus took the approach of providing an interpreter that could execute Cypher queries within the Spark environment, and provided a native graph representation for Spark. By contrast, this connector does not provide that functionality and focuses on doing reads and writes back and forth between Neo4j and Spark. Via this connector, all Cypher code is executed strictly within Neo4j. The Spark environment operates in terms of DataFrames as it always did, and this connector does not provide graph API primitives for Spark.

Can this connector be used for pre-processing of data and loading it into Neo4j?

Yes. This connector enables Spark to be used as a good method of loading data directly into Neo4j. See the architecture section for a detailed discussion of Normalized loading vs. Cypher destructuring and guidance on different approaches for how to do performant data loads into Neo4j.

How can I speed up writes to Neo4j?

The Spark connector fundamentally writes data to Neo4j in batches. Neo4j is a transactional database, and all modifications are made within a transaction. Those transactions in turn have overhead.

The two simplest ways of increasing write performance are the following:

  • You can increase the batch size (option batch.size). The larger the batch, the fewer transactions are executed to write all of your data, and the less transactional overhead is incurred.

  • Ensure that your Neo4j instance has ample free heap and properly sized page cache. Small heaps make you unable to commit large batches, which in turn slows overall import.

For best performance, make sure you are familiar with the material in the Operations manual.

It is important to keep in mind that Neo4j scales writes vertically and reads horizontally. In the Causal Cluster Model, only the cluster leader (1 machine) may accept writes. For this reason, focus on getting the best hardware and performance on your cluster leader to maximize write throughput.

My writes are failing due to Deadlock Exceptions. What should I do?

In some cases, Neo4j rejects write transactions due to a deadlock exception that you may see in the stacktrace.

This Neo4j Knowledge Base article describes the issue.

Typically this is caused by too much parallelism in writing to Neo4j. For example, when you write a relationship (:A)-[:REL]→(:B), this creates a 'lock' in the database on both nodes. If some simultaneous other thread is attempting to write to those nodes too often, deadlock exceptions can result and a transaction fails.

In general, the solution is to repartition the DataFrame before writing it to Neo4j, to avoid multiple partitioned writes from locking the same nodes and relationships.

I am getting a cast error like UTF8String cannot be cast to Long. What is the solution?

You might be getting an error like the following one or similar, with different types:

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long

This is typically due to a field having different types on the same nodes label. You can solve it by installing APOC: this removes the error, but all the values for that field are cast to String. This happens because Spark is not schema-free and needs each column always to have its own type.

You can read more here.

Why are the returned columns not in the same order as I specified in the query?

Unfortunately, this is a known issue and is there for Neo4j 3.x and Neo4j 4.0. From Neo4j 4.1 onwards, you get the same order as it’s specified in the return statement.

I am getting an error TableProvider implementation org.neo4j.spark.DataSource cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead. What should I do?

If you are getting this error while trying to write to Neo4j, be aware that the current version of the connector doesn’t support SaveMode.ErrorIfExists on Spark 3, and that is the default save mode. So please, change the save mode to one of SaveMode.Append or SaveMode.Overwrite.

We are working to fully support all the save modes on Spark 3.

I am getting errors NoClassDefFoundError or ClassNotFoundException. What should I do?

You may get this type of error:

NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
Caused by: ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport

This means that your Spark version doesn’t match the Spark version on the connector. Refer to this page to know which version you need.

I am getting an error Failed to invoke procedure gds.graph.create.cypher: Caused by: java.lang.IllegalArgumentException: A graph with name [name] already exists. What should I do?

This might happen when creating a new graph using the GDS library. The issue here is that the query is run the first time to extract the DataFrame schema and then is run again to get the data.

To avoid this issue you can use the user defined schema approach.

Where can I get help?

The Neo4j Community site is a great place to ask questions, talk with other users who use the connector, and get help from Neo4j experts