Neo4j Connector for Apache Spark FAQ
The Spark connector fundamentally writes data to Neo4j in batches. Neo4j is a transactional database, and so all modifications are made within a transaction. Those transactions in turn have overhead.
The two simplest ways of increasing write performance are:
* Increase the batch size (option batch.size). The larger the batch, the fewer transactions are executed to write all of your data, and the less transactional overhead is incurred.
* Ensure that your Neo4j instance has ample free heap and a properly sized page cache. A small heap will prevent you from committing large batches, which in turn slows the overall import.
For best performance, make sure you are familiar with the material in the Neo4j Performance Tuning Guide.
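The first option above can be sketched in PySpark as follows. This is a minimal sketch, not a tuned configuration: the helper names, the URL, the label, and the batch size of 20000 are illustrative assumptions, and authentication options are omitted. Pick a batch size that your Neo4j heap can actually commit.

```python
NEO4J_FORMAT = "org.neo4j.spark.DataSource"


def batched_write_options(url, label, batch_size=20000):
    """Build connector write options with an explicit batch size.

    The default of 20000 is an arbitrary illustrative value, not a recommendation.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return {
        "url": url,                     # e.g. "bolt://localhost:7687" (assumed)
        "labels": label,                # node label to write, e.g. "Person" (assumed)
        "batch.size": str(batch_size),  # rows sent to Neo4j per transaction
    }


def write_nodes(df, url, label, batch_size=20000):
    """Append the rows of a Spark DataFrame to Neo4j as nodes."""
    options = batched_write_options(url, label, batch_size)
    (df.write
       .format(NEO4J_FORMAT)
       .mode("Append")
       .options(**options)
       .save())
```

Calling `write_nodes(df, "bolt://localhost:7687", "Person", batch_size=50000)` would then send up to 50000 rows per transaction; if commits start failing, a smaller batch or a larger heap is the first thing to try.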
It is important to keep in mind that Neo4j scales writes vertically and reads horizontally. In the Causal Cluster Model, only the cluster leader (1 machine) may accept writes. For this reason, focus on getting the best hardware & performance on your cluster leader to maximize write throughput.
The Neo4j Community site is a great place to ask questions, talk with other users of the connector, and get help from Neo4j pros.
The source code is offered under the terms of the Apache 2.0 open source license. You are free to download, modify, and redistribute the connector; however Neo4j support will apply only to official builds provided by Neo4j.
No. This connector shares no code with Cypher for Apache Spark/Morpheus, and the two take very different approaches. Cypher for Apache Spark/Morpheus provided an interpreter that could execute Cypher queries within the Spark environment, along with a native graph representation for Spark. By contrast, this connector does not provide that functionality; it focuses on reads and writes between Neo4j and Spark. With this connector, all Cypher code is executed strictly within Neo4j. The Spark environment operates in terms of DataFrames as it always did, and this connector does not provide graph API primitives for Spark.
Yes. This connector makes Spark a good way to load data directly into Neo4j. See the architecture section for a detailed discussion of "Normalized Loading" vs. "Cypher Destructuring" and for guidance on performant data loads into Neo4j.
In some cases, Neo4j will reject write transactions due to a deadlock exception that you may see in the stacktrace.
This Neo4j Knowledge Base entry describes the issue.
Typically this is caused by too much parallelism in writing to Neo4j. For example, when you write a relationship (:A)-[:REL]->(:B), this creates a lock in the database on both nodes. If another thread attempts to write to those same nodes at the same time, deadlock exceptions can result and a transaction will fail.
In general, the solution is to repartition the dataframe prior to writing it to Neo4j, to avoid multiple partitioned writes from locking the same nodes & relationships.
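One conservative way to apply this is to collapse the DataFrame to a single partition before writing, so that only one writer touches Neo4j at a time. The sketch below assumes that trade-off is acceptable (it gives up write parallelism). The helper name is hypothetical, and the option values in the example dictionary (URL, labels, column-to-property mappings) are placeholders for your own data; check the option names against the connector documentation for your version.

```python
NEO4J_FORMAT = "org.neo4j.spark.DataSource"

# Example relationship-write options; every value here is an assumption about
# your data and must be adapted.
EXAMPLE_REL_OPTIONS = {
    "url": "bolt://localhost:7687",
    "relationship": "REL",
    "relationship.save.strategy": "keys",
    "relationship.source.labels": ":A",
    "relationship.source.node.keys": "a_id:id",  # DataFrame column a_id -> node property id
    "relationship.target.labels": ":B",
    "relationship.target.node.keys": "b_id:id",  # DataFrame column b_id -> node property id
}


def write_with_limited_parallelism(df, options, num_partitions=1):
    """Repartition before writing so fewer concurrent transactions compete for locks.

    With num_partitions=1 the write is fully serial, which avoids lock
    contention between partitions at the cost of throughput.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    (df.repartition(num_partitions)
       .write
       .format(NEO4J_FORMAT)
       .mode("Append")
       .options(**options)
       .save())
```

A call such as `write_with_limited_parallelism(rel_df, EXAMPLE_REL_OPTIONS)` writes serially. If that is too slow, a higher `num_partitions` only helps when the partitions do not share nodes, which depends on how the data is distributed.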
You might get an error like:
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
or similar, with different types.
This is typically due to a field having different types across nodes with the same label. You can solve it by adding APOC to your Neo4j installation; this removes the error, but all the values for that field will be cast to String. This is because Spark is not schema-free and needs each column to always have the same type.
You can read more here.
Unfortunately, this is a known issue that affects Neo4j 3.* and Neo4j 4.0. With Neo4j 4.1+, you will get the same order as specified in the RETURN statement.
TableProvider implementation org.neo4j.spark.DataSource cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.
If you get this error while trying to write to Neo4j, be aware that the current version of the connector doesn’t support SaveMode.ErrorIfExists on Spark 3, and that is the default save mode.
As the error message says, change the save mode to one of:
* SaveMode.Append
* SaveMode.Overwrite
We are working to fully support all save modes on Spark 3.
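In PySpark, setting the save mode explicitly looks roughly like the sketch below. The helper name and the small guard around the mode are illustrative assumptions, not part of the connector. As far as the connector documentation describes it, Overwrite for node writes also needs a node.keys option so the connector knows which properties identify a node; treat that as something to verify for your version.

```python
NEO4J_FORMAT = "org.neo4j.spark.DataSource"

# The two modes named in the error message above.
SUPPORTED_MODES = ("Append", "Overwrite")


def write_with_explicit_mode(df, options, mode="Append"):
    """Write to Neo4j with an explicit save mode instead of Spark's default."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(
            "mode must be one of %s, got %r" % (", ".join(SUPPORTED_MODES), mode)
        )
    (df.write
       .format(NEO4J_FORMAT)
       .mode(mode)  # without this call Spark falls back to ErrorIfExists
       .options(**options)
       .save())
```

For example, `write_with_explicit_mode(df, {"url": "bolt://localhost:7687", "labels": "Person"})` appends nodes, where the URL and label are placeholders.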
If you see this type of error:
NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport Caused by: ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport
This means that your Spark version doesn’t match the Spark version the connector was built for. Please refer to this page to find out which version you need.
Getting "Failed to invoke procedure gds.graph.create.cypher: Caused by: java.lang.IllegalArgumentException: A graph with name [name] already exists."
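This might happen when creating a new graph using the GDS library. The issue here is that the query is run twice: the first time to extract the DataFrame schema, and then again to get the data. The first run already creates the graph, so the second run fails because a graph with that name already exists.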
To avoid this issue we suggest using the user defined schema approach.
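A minimal PySpark sketch of that approach follows. Supplying the schema up front means the connector does not need to run the query just to infer column types. The helper name, URL, graph name, Cypher query, and column list are all hypothetical, and the YIELD columns assume a GDS version that provides gds.graph.create.cypher with these result fields; adapt them to what your procedure actually returns.

```python
NEO4J_FORMAT = "org.neo4j.spark.DataSource"

# Hypothetical query: projects a graph and returns one summary row.
GDS_QUERY = """
CALL gds.graph.create.cypher(
  'people-graph',
  'MATCH (p:Person) RETURN id(p) AS id',
  'MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN id(a) AS source, id(b) AS target'
)
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
"""

# Schema as a Spark DDL string; it must match the RETURN columns of the query.
GDS_SCHEMA = "graphName STRING, nodeCount LONG, relationshipCount LONG"


def read_with_user_schema(spark, url, query, schema):
    """Read the result of a Cypher query using a caller-supplied schema."""
    return (spark.read
            .format(NEO4J_FORMAT)
            .schema(schema)  # user-defined schema, so no inference pass is needed
            .option("url", url)
            .option("query", query)
            .load())
```

With an active session, `read_with_user_schema(spark, "bolt://localhost:7687", GDS_QUERY, GDS_SCHEMA)` returns a DataFrame with the three declared columns.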