Using with Neo4j Graph Data Science

Neo4j Graph Data Science (GDS) lets data scientists benefit from powerful graph algorithms. It provides unsupervised machine learning (ML) methods and heuristics that learn and describe the topology of your graph. GDS includes hardened graph algorithms with enterprise features, like deterministic seeding for consistent results and reproducible ML workflows.

GDS algorithms are bucketed into five groups:

  • Community detection which detects group clusters and partition options.

  • Centrality which helps to compute the importance of a node in a graph.

  • Topological link prediction which estimates the likelihood of nodes forming a relationship.

  • Similarity which evaluates the similarity of pairs of nodes.

  • Path finding & search which finds optimal paths, evalutes route availability, and so on.

GDS operates via Cypher

All of the functionality of GDS is used by issuing Cypher queries. As such, it is easily accessible via Spark, because the Neo4j Connector for Apache Spark can issue Cypher queries and read their results back. This combination means that you can use Neo4j and GDS as a graph co-processor in an existing ML workflow that you may implement in Apache Spark.

Example

In the sample Zeppelin Notebook repository, there is a GDS example that can be run against a Neo4j Sandbox, showing how to use the two together.

Create a virtual graph in GDS using Spark

This is very simple, straightforward code; it constructs the right Cypher statement to create a virtual graph in GDS and returns the results.

%pyspark
query = """
    CALL gds.graph.create('got-interactions', 'Person', {
      INTERACTS: {
        orientation: 'UNDIRECTED'
      }
    })
    YIELD graphName, nodeCount, relationshipCount, createMillis
    RETURN graphName, nodeCount, relationshipCount, createMillis
"""

df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", host) \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", user) \
    .option("authentication.basic.password", password) \
    .option("query", query) \
    .option("partitions", "1") \
    .load()
If you get a A graph with name [name] already exists error, take a look at this FAQ.

Ensure that option partitions is set to 1. You do not want to execute this query in parallel, it should be executed only once.

When you use stored procedures, you must include a RETURN clause.

Run a GDS analysis and stream the results back

The following example shows how to run an analysis and get the result as just another Cypher query, executed as a Spark read from Neo4j.

%pyspark

query = """
    CALL gds.pageRank.stream('got-interactions')
    YIELD nodeId, score
    RETURN gds.util.asNode(nodeId).name AS name, score
"""

df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", host) \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", user) \
    .option("authentication.basic.password", password) \
    .option("query", query) \
    .option("partitions", "1") \
    .load()

df.show()
Ensure that option partitions is set to 1. The algorithm should be executed only once.

Streaming versus persisting GDS results

When running GDS algorithms, the library gives you the choice of either streaming the algorithm results back to the caller, or mutating the underlying graph. Using GDS together with Spark provides an additional option of transforming or otherwise using a GDS result. Ultimately, either modality works with the Neo4j Connector for Apache Spark, and you choose what’s best for your use case.

If you have an architecture where the GDS algorithm is being run on a Read Replica or a separate standalone instance, it may be convenient to stream the results back (as you cannot write them to a Read Replica), and then use the connector’s write functionality to take that stream of results, and write them back to a different Neo4j connection, i.e., to a regular Causal Cluster.