Running algorithms

In the examples below we assume that we have an instantiated GraphDataScience object called gds. Read more about this in Getting started.

1. Introduction

Running most algorithms with the Python client is structurally similar to using the Cypher API:

Syntax composition:
result = gds[.<tier>].<algorithm>.<execution-mode>[.<estimate>](
  G: Graph,
  **configuration: dict[str, any]
)

Here we can note a few key differences:

  • Instead of a graph name string as first argument, we have a graph object as first positional argument.

  • Instead of a configuration map, we have named keyword arguments.

The result of running a procedure is returned as either a pandas DataFrame or a pandas Series depending on the execution mode.

For example, to run the WCC and FastRP algorithms on a graph G, we could do the following:

G, _ = gds.graph.project(...)

wcc_res = gds.wcc.mutate(
    G,                          #  Graph object
    threshold=0.8,              #  Configuration parameters
    mutateProperty="wcc"
)
assert wcc_res["componentCount"] > 0

fastrp_res = gds.fastRP.write(
    G,                          #  Graph object
    featureProperties=["wcc"],  #  Configuration parameters
    embeddingDimension=256,
    propertyRatio=0.3,
    writeProperty="embedding"
)
assert fastrp_res["nodePropertiesWritten"] == G.node_count()

Some algorithms deviate from the standard syntactic structure. We describe how to use them in the Python client in the sections below.

2. Execution modes

Algorithms return results in a format that is controlled by its execution mode. These modes are explained in some detail in Running algorithms. In the Python client, the stats, mutate and write modes return a pandas Series containing the summary result of running the algorithm. The same applies to estimate procedures.

2.1. Stream

The stream mode is a bit different as this mode does not retain the result in any form on the server side. Instead, the result is streamed back to the Python client, as a pandas DataFrame. The result is materialized on the client side immediately once the computation is finished. Streaming results back in this way can be resource-intensive, as the result can be large. Typically, the result size will be in the same order of magnitude as the graph. Some algorithms produce particularly sizeable results, for example node embeddings.

2.2. Train

The train mode is used for algorithms that produce a machine learning model into the GDS Model Catalog. The Python client has special support for working with such models, which we describe in The model object.

3. Algorithms that require node matching

Some algorithms take (database) node ids as inputs. These node ids must be matched directly from the Neo4j database. This is straight-forward when working in Cypher. In the Python client there is a convenience method gds.find_node_id to retrieve a node id based on node labels and property key-value pairs.

For example, to find a source and target node of a graph G with cities to run Dijkstra Source-Target Shortest Path on, we could do the following:

source_id = gds.find_node_id(["City"], {"name": "New York"})
target_id = gds.find_node_id(["City"], {"name": "Philadelphia"})

res = gds.shortestPath.dijkstra.stream(G, sourceNode=source_id, targetNode=target_id)
assert res["totalCost"][0] == 100

gds.find_node_id takes a list of node labels and a dictionary of node property key-value pairs. The nodes found are those that have all labels specified and fully match all property key-value pairs given. Note that exactly one node per method call must be matched, otherwise an error will be raised.

3.1. Cypher mapping

The Python call:

gds.find_node_id(["A", "B"], {"p1": 1, "p2": "foo"})

is exactly equivalent to the Cypher statement:

MATCH (n:A:B {p1: 1, p2: 'foo'})
RETURN id(n) AS id

To do more advanced matching beyond the capabilities of find_node_id() we recommend using Cypher’s MATCH via gds.run_cypher.

The methods for doing Topological link prediction are a bit different. Just like in the GDS procedure API they do not take a graph as an argument, but rather two node references as positional arguments. And they simply return the similarity score of the prediction just made as a float - not any kind of pandas data structure.

For example, to run the Adamic Adar algorithm, we can use the following:

node1 = gds.find_node_id(["User"], {"name": "Mats"})
node2 = gds.find_node_id(["User"], {"name": "Adam"})

score = gds.alpha.linkprediction.adamicAdar(node1, node2)
assert score >= 0