Using with Graph Data Science

This chapter provides an information on using the Neo4j Connector for Apache Spark with Neo4j’s Graph Data Science Library.

Neo4j’s Graph Data Science (GDS) Library lets data scientists benefit from powerful graph algorithms. It provides unsupervised machine learning methods and heuristics that learn and describe the topology of your graph. The GDS Library includes hardened graph algorithms with enterprise features, like deterministic seeding for consistent results and reproducible machine learning workflows.

GDS Algorithms are bucketed into 5 "families":

  • Community detection which detects group clusters and partition options

  • Centrality which helps compute the importance of a node in a graph

  • Heuristic Link Prediction which estimates the liklihood of nodes forming a relationship

  • Similarity which evaluates how alike 2 nodes are

  • Pathfinding & Search which finds optimal paths, evalutes route availability, and so on.

GDS Operates via Cypher

All of the functionality of GDS is used by issuing cypher queries. As such, it is easily accessible via Spark, because the Neo4j Connector for Apache Spark can issue Cypher queries and read their results back. This combination means that you can use Neo4j & GDS as a graph co-processor in an existing ML workflow that you may implement in Apache Spark.

Example

In the sample Zeppelin Notebook repository, there is a GDS example that can be run against a Neo4j Sandbox, showing how to use the two together.

Create a Virtual Graph in GDS Using Spark

This is very simple, straightforward code; it just constructs the right Cypher statement to create a virtual graph in GDS, and returns the results.

%pyspark
query = """
    CALL gds.graph.create('got-interactions', 'Person', {
      INTERACTS: {
        orientation: 'UNDIRECTED'
      }
    })
    YIELD graphName, nodeCount, relationshipCount, createMillis
    RETURN graphName, nodeCount, relationshipCount, createMillis
"""

df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", host) \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", user) \
    .option("authentication.basic.password", password) \
    .option("query", query) \
    .option("partitions", "1") \
    .load()
If you get a "Graph Already Exists" error, take a look at this FAQ.
Ensure partitions is set to 1. You do not want to execute this query in parallel, only once.
When you use stored procedures, you must include a RETURN clause

Run a GDS Analysis and Stream the Results Back

To run an analysis, the result is just another Cypher query, executed as a spark read from Neo4j.

%pyspark

query = """
    CALL gds.pageRank.stream('got-interactions')
    YIELD nodeId, score
    RETURN gds.util.asNode(nodeId).name AS name, score
"""

df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", host) \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", user) \
    .option("authentication.basic.password", password) \
    .option("query", query) \
    .option("partitions", "1") \
    .load()

df.show()
Ensure partitions is set to 1. The algorithm should only be executed once.

Streaming versus Persisting GDS Results

When running GDS algorithms the library gives you the choice of either streaming the results of the algorithm back the caller, or mutating the underlying graph. Using GDS together with spark provides an additional option of transforming or otherwise using a GDS result. Ultimately, either modality will work with the Neo4j Connector for Apache Spark, and it is left up to your option what’s best for your use case.

If you have an architecture where the GDS algorithm is being run on a read replica or separate stand-alone instance, it may be convenient to stream the results back (as you cannot write them to a read replica), and then use the connector’s write functionality to take that stream of results and write them back to a different Neo4j connection, i.e. to a regular causal cluster.