Using GDS and Fabric

This section describes how the Neo4j Graph Data Science library can be used in a Neo4j Fabric deployment.

Neo4j Fabric is a way to store and retrieve data in multiple databases, whether they are on the same Neo4j DBMS or on several DBMSs, using a single Cypher query. For more information about Fabric itself, see the Neo4j Fabric documentation.

A typical Neo4j Fabric setup consists of two components: one or more shards that hold the data, and one or more Fabric proxies that coordinate the distributed queries. Currently, the only supported way to run the Neo4j Graph Data Science library in a Fabric deployment is to run GDS on the shards. Executing GDS on a Fabric proxy is not supported.

1. Running GDS on the Shards

In this mode, the GDS operations are executed on the shards: graph projections and algorithms run on each shard individually, and the results can be combined via the Fabric proxy. This scenario is useful if the graph is partitioned into disjoint subgraphs across shards, i.e. there are no logical relationships between nodes on different shards. Another use case is to replicate the graph’s topology across multiple shards, where some shards act as operational databases and others as analytical databases.
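As a rough sketch of the first scenario, the query below runs the same algorithm on every shard and combines the rows on the proxy. It assumes, purely for illustration, that each shard holds a disjoint subgraph of Person nodes connected by KNOWS relationships, that each Person has an id property, and that GDS is installed on every shard. FABRIC_DB_NAME is a placeholder for the name of the Fabric database. The anonymous projection syntax follows the GDS 1.x style and may need adjusting for other GDS versions.

```cypher
// Visit every shard known to the Fabric database (placeholder name).
UNWIND FABRIC_DB_NAME.graphIds() AS graphId
CALL {
  // Route this subquery to the current shard.
  USE FABRIC_DB_NAME.graph(graphId)
  // PageRank sees only the subgraph stored on this shard.
  CALL gds.pageRank.stream({nodeProjection: 'Person', relationshipProjection: 'KNOWS'})
  YIELD nodeId, score
  RETURN gds.util.asNode(nodeId).id AS personId, score AS pageRank
}
// Rows from all shards are combined on the proxy.
RETURN graphId, personId, pageRank
ORDER BY graphId, pageRank DESC
```

Because each shard computes its scores independently, the scores are only comparable within a single shard. The query fails if any shard in the Fabric database does not have GDS installed.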

1.1. Setup

In this scenario we need to set up the shards to run the Neo4j Graph Data Science library.

Every shard that will run the Graph Data Science library should be configured just as a standalone GDS database would be. For more information, see Installation.
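One way to check that a shard is ready is to connect to it directly, bypassing the Fabric proxy, and ask GDS for its version and available procedures. This is only a quick sanity check and does not replace the steps described in Installation.

```cypher
// Run directly against each shard database, not through the Fabric proxy.
// Returns the installed GDS version; fails if the plugin is not loaded.
RETURN gds.version() AS gdsVersion;

// Counts the GDS procedures and functions available on this shard.
CALL gds.list() YIELD name
RETURN count(name) AS availableGdsOperations;
```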

The Fabric proxy nodes do not require any special configuration, i.e., the GDS library plugin does not need to be installed. However, the proxy nodes should be configured to handle the amount of data received from the shards.

1.2. Examples

Let’s assume we have a Fabric setup with two shards. One shard functions as the operational database and holds a graph with the schema (:Person)-[:KNOWS]->(:Person). Every Person node also stores an identifying property id, the person’s name, and possibly other properties.

The other shard, the analytical database, stores a graph with the same nodes and relationships, except that the only node property is the unique identifier id.

Using Fabric, we can now calculate the PageRank score for each Person and join the results with the name of that Person.

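// Step 1: compute PageRank on the analytical shard.
CALL {
  USE FABRIC_DB_NAME.ANALYTICS_DB
  CALL gds.pageRank.stream({nodeProjection: 'Person', relationshipProjection: 'KNOWS'})
  YIELD nodeId, score AS pageRank
  // Map the internal node ID to the shared id property.
  RETURN gds.util.asNode(nodeId).id AS personId, pageRank
}
// Step 2: look up each person's name on the operational shard.
CALL {
  USE FABRIC_DB_NAME.OPERATIONAL_DB
  WITH personId
  MATCH (p:Person {id: personId})
  RETURN p.name AS name
}
RETURN name, personId, pageRank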

The query first connects to the analytical database, where the PageRank algorithm computes the rank of each node in an anonymous graph. The algorithm results are streamed to the proxy, together with the id property of each node. For every row returned by the first subquery, the operational database is then queried for the person’s name, again using the id property to identify the Person node across the shards.
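Since every row produced on the analytical shard is streamed to the proxy and then triggers a lookup on the operational shard, it can help to reduce the result set before it leaves the shard. The variant below is a sketch under the same assumptions as the example query. It keeps only the ten highest-ranked persons, where the limit of 10 is an arbitrary value chosen for illustration.

```cypher
CALL {
  USE FABRIC_DB_NAME.ANALYTICS_DB
  CALL gds.pageRank.stream({nodeProjection: 'Person', relationshipProjection: 'KNOWS'})
  YIELD nodeId, score
  RETURN gds.util.asNode(nodeId).id AS personId, score AS pageRank
  // Sort and truncate on the shard so only ten rows reach the proxy.
  ORDER BY pageRank DESC
  LIMIT 10
}
CALL {
  USE FABRIC_DB_NAME.OPERATIONAL_DB
  WITH personId
  MATCH (p:Person {id: personId})
  RETURN p.name AS name
}
RETURN name, personId, pageRank
ORDER BY pageRank DESC
```

The second subquery is unchanged, but it now runs for ten rows instead of once per Person node.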

1.3. Limitations

  • It is not possible to run algorithms across shards. Each graph projection and algorithm only sees the data stored on the shard where it is executed.