Using with PySpark / Python

This connector uses the DataSource V2 API in Spark.

With a properly configured PySpark interpreter, you can use Python to call the connector and do all of your Spark work.

The following examples show the same API calls in Scala and in Python, to help you adapt any existing code examples and get started quickly.

This first listing is a simple Scala program that reads all nodes carrying the Person, Customer, and Confirmed labels from a Neo4j instance into a DataFrame.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.read.format("org.neo4j.spark.DataSource")
  .option("url", "bolt://localhost:7687")
  .option("labels", "Person:Customer:Confirmed")
  .load()

Here is the same program in Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", "bolt://localhost:7687") \
  .option("labels", "Person:Customer:Confirmed") \
  .load()
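
Because load() returns an ordinary DataFrame, you can inspect the result with the usual PySpark methods. A minimal sketch, assuming the same Neo4j instance and labels as above:

df = spark.read.format("org.neo4j.spark.DataSource") \
  .option("url", "bolt://localhost:7687") \
  .option("labels", "Person:Customer:Confirmed") \
  .load()

df.printSchema()  # node properties become columns
df.show(5)        # preview the first five rows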

For the most part, the API is the same; you mostly need to adapt the syntax for Python by adding backslashes for line continuation, so that the chained calls do not run afoul of Python's indentation rules.

API differences

Some common API constants, such as SaveMode, must be passed as strings in the PySpark API. Compare the following two examples in Scala and Python, focusing on the mode.

import org.apache.spark.sql.{SaveMode, SparkSession}

df.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.ErrorIfExists)
  .option("url", "bolt://localhost:7687")
  .option("labels", ":Person")
  .save()

The same program in Python is very similar. Note the language syntax differences, and in particular that the mode is given as a string:

df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("ErrorIfExists") \
  .option("url", "bolt://localhost:7687") \
  .option("labels", ":Person") \
  .save()
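
The string passed to mode() in Python corresponds to a Scala SaveMode constant and is matched case-insensitively by Spark. The standard mappings are shown below; note that which modes a particular data source actually supports can vary, so check the connector documentation before relying on one:

df.write.mode("append")         # SaveMode.Append
df.write.mode("overwrite")      # SaveMode.Overwrite
df.write.mode("errorifexists")  # SaveMode.ErrorIfExists ("error" also works)
df.write.mode("ignore")         # SaveMode.Ignore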

To avoid needing a backslash at the end of every line, you can instead wrap the whole chained expression in parentheses:

(df.write
  .format("org.neo4j.spark.DataSource")
  .mode("ErrorIfExists")
  .option("url", "bolt://localhost:7687")
  .option("labels", ":Person")
  .save())
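
Since save() triggers the write and returns None, there is nothing useful to assign from it; assignment makes sense for reads, where load() returns a DataFrame. The same parenthesized style works there too, and it is the continuation style PEP 8 recommends over backslashes:

df = (spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "bolt://localhost:7687")
  .option("labels", "Person:Customer:Confirmed")
  .load())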