Machine learning pipelines

The Python client has special support for link prediction pipelines and node property prediction pipelines. GDS pipelines are represented as pipeline objects.

The Python method calls to create the pipelines match their Cypher counterparts exactly. However, the rest of the pipeline functionality is deferred to methods on the pipeline objects themselves. Once created, a TrainingPipeline object can be passed as an argument to methods in the Python client, such as the pipeline catalog operations. Additionally, the TrainingPipeline has convenience methods for inspecting the pipeline it represents without explicitly involving the pipeline catalog.

In the examples below we assume that we have an instantiated GraphDataScience object called gds. Read more about this in Getting started.
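For instance, a GraphDataScience object could be instantiated as in the following sketch, where the connection URI and credentials are placeholders for your own setup:

from graphdatascience import GraphDataScience

# Connect to a Neo4j DBMS where the GDS library is installed.
gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))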

1. Node classification

This section outlines how to use the Python client to build, configure and train a node classification pipeline, as well as how to use the model that training produces for predictions.

1.1. Pipeline

The creation of the node classification pipeline is very similar to how it’s done in Cypher. To create a new node classification pipeline one would make the following call:

pipe, res = gds.beta.pipeline.nodeClassification.create("my-pipe")

where pipe is a pipeline object, and res is a pandas Series containing metadata from the underlying procedure call.

To then go on to build, configure and train the pipeline we would call methods directly on the node classification pipeline object. Below is a description of the methods on such objects:

Table 1. Node classification pipeline methods

| Name | Arguments | Return type | Description |
| --- | --- | --- | --- |
| addNodeProperty | procedure_name: str, config: **kwargs | Series | Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration. |
| selectFeatures | node_properties: Union[str, list[str]] | Series | Select node properties to be used as features. |
| configureSplit | config: **kwargs | Series | Configure the train-test dataset split. |
| addLogisticRegression | parameter_space: dict[str, any] | Series | Add a logistic regression model configuration to train as a candidate in the model selection phase. [1] |
| addRandomForest | parameter_space: dict[str, any] | Series | Add a random forest model configuration to train as a candidate in the model selection phase. [1] |
| addMLP | parameter_space: dict[str, any] | Series | Add an MLP model configuration to train as a candidate in the model selection phase. [1] |
| configureAutoTuning | config: **kwargs | Series | Configure the auto-tuning. |
| train | G: Graph, config: **kwargs | NCModel, Series | Train the pipeline on the given input graph using the given keyword arguments. |
| train_estimate | G: Graph, config: **kwargs | Series | Estimate training the pipeline on the given input graph using the given keyword arguments. |
| node_property_steps | - | DataFrame | Returns the node property steps of the pipeline. |
| feature_properties | - | Series | Returns a list of the selected feature properties for the pipeline. |
| split_config | - | Series | Returns the configuration set up for train-test splitting of the dataset. |
| parameter_space | - | Series | Returns the model parameter space set up for model selection when training. |
| auto_tuning_config | - | Series | Returns the configuration set up for auto-tuning. |
| name | - | str | The name of the pipeline as it appears in the pipeline catalog. |
| type | - | str | The type of pipeline. |
| creation_time | - | neo4j.time.DateTime | Time when the pipeline was created. |
| drop | failIfMissing: Optional[bool] | Series | Removes the pipeline from the GDS Pipeline Catalog. |
| exists | - | bool | True if the pipeline exists in the GDS Pipeline Catalog, False otherwise. |

1. Ranges can also be given as length-two tuples, i.e. (x, y) is the same as {range: [x, y]}.
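As a minimal sketch of the two equivalent ways to specify a tuning range (the penalty key and values are just an illustration, assuming the pipeline pipe from above):

# The tuple form is shorthand for an explicit range specification;
# the two calls below describe the same candidate parameter space.
pipe.addLogisticRegression(penalty=(0.1, 1.0))
pipe.addLogisticRegression(penalty={"range": [0.1, 1.0]})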

There are two main differences between the methods above and the Cypher procedures they map to:

  • As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.

  • Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the Python train method takes a graph object instead of a graph name, and returns an NCModel object that we can run predictions with, as well as a pandas Series containing metadata from the training.
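As a small sketch of how a Cypher configuration map translates to Python keyword arguments (using configureSplit and the testFraction option as an illustration):

# Cypher:
#   CALL gds.beta.pipeline.nodeClassification.configureSplit('my-pipe', {testFraction: 0.3})
# Python equivalent: no pipeline name, and configuration keys become keyword arguments.
pipe.configureSplit(testFraction=0.3)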

Please consult the node classification Cypher documentation for information about what kind of input the methods expect.

1.1.1. Example

Below is a small example of how one could configure and train a very basic node classification pipeline. Note that we don’t configure splits explicitly, but rather use the default.

To exemplify this, we introduce a small person graph:

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", fraudster: 0}),
    (b:Person {name: "Alice", fraudster: 0}),
    (c:Person {name: "Eve", fraudster: 1}),
    (d:Person {name: "Chad", fraudster: 1}),
    (e:Person {name: "Dan", fraudster: 0}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["fraudster"]}}, "KNOWS")

assert G.node_labels() == ["Person"]
pipe, _ = gds.beta.pipeline.nodeClassification.create("my-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over logistic regression
pipe.addLogisticRegression(tolerance=(0.01, 0.1))
pipe.addLogisticRegression(penalty=1.0)

# Train the pipeline targeting node property "fraudster" as label and "ACCURACY" as the only metric
fraud_model, train_result = pipe.train(
    G,
    modelName="fraud-model",
    targetProperty="fraudster",
    metrics=["ACCURACY"],
    randomSeed=111
)
assert train_result["trainMillis"] >= 0

A model referred to as "fraud-model" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

1.2. Model

As we saw in the previous section, node classification models are created when training a node classification pipeline. In addition to inheriting the methods common to all model objects, node classification models have the following methods:

Table 2. Node classification model methods

| Name | Arguments | Return type | Description |
| --- | --- | --- | --- |
| predict_mutate | G: Graph, config: **kwargs | Series | Predict classes for nodes of the input graph and mutate graph with predictions. |
| predict_mutate_estimate | G: Graph, config: **kwargs | Series | Estimate predicting classes for nodes of the input graph and mutating graph with predictions. |
| predict_stream | G: Graph, config: **kwargs | DataFrame | Predict classes for nodes of the input graph and stream the results. |
| predict_stream_estimate | G: Graph, config: **kwargs | Series | Estimate predicting classes for nodes of the input graph and streaming the results. |
| predict_write | G: Graph, config: **kwargs | Series | Predict classes for nodes of the input graph and write results back to the database. |
| predict_write_estimate | G: Graph, config: **kwargs | Series | Estimate predicting classes for nodes of the input graph and writing the results back to the database. |
| metrics | - | Series | Returns values for the metrics specified when training, for this particular model. |
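Since the model object also inherits the catalog methods common to all model objects, one can, as a small sketch, inspect the model trained in the example above directly:

# Inherited catalog methods on the model object (names as trained above).
assert fraud_model.name() == "fraud-model"
assert fraud_model.exists()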

One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are that:

  • They take a graph object instead of a graph name.

  • They have Python keyword arguments representing the keys of the configuration map.

  • One does not have to provide a "modelName" since the model object itself has this information.

1.2.1. Example (continued)

We now continue the example above using the node classification model fraud_model we trained there.

# Make sure we indeed obtained an accuracy score
metrics = fraud_model.metrics()
assert "ACCURACY" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = fraud_model.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()
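Instead of streaming, one could also write the predictions back to the in-memory graph; a minimal sketch (the property name predictedFraudster is just an illustration):

# Mutate `H` with a new node property containing the predicted classes.
mutate_result = fraud_model.predict_mutate(H, mutateProperty="predictedFraudster")
assert mutate_result["nodePropertiesWritten"] == H.node_count()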

2. Link prediction

This section outlines how to use the Python client to build, configure and train a link prediction pipeline, as well as how to use the model that training produces for predictions.

2.1. Pipeline

The creation of the link prediction pipeline is very similar to how it’s done in Cypher. To create a new link prediction pipeline one would make the following call:

pipe, res = gds.beta.pipeline.linkPrediction.create("my-pipe")

where pipe is a pipeline object, and res is a pandas Series containing metadata from the underlying procedure call.

To then go on to build, configure and train the pipeline we would call methods directly on the link prediction pipeline object. Below is a description of the methods on such objects:

Table 3. Link prediction pipeline methods

| Name | Arguments | Return type | Description |
| --- | --- | --- | --- |
| addNodeProperty | procedure_name: str, config: **kwargs | Series | Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration. |
| addFeature | feature_type: str, config: **kwargs | Series | Add a link feature for model training based on node properties and a feature combiner. |
| configureSplit | config: **kwargs | Series | Configure the feature-train-test dataset split. |
| addLogisticRegression | parameter_space: dict[str, any] | Series | Add a logistic regression model configuration to train as a candidate in the model selection phase. [2] |
| addRandomForest | parameter_space: dict[str, any] | Series | Add a random forest model configuration to train as a candidate in the model selection phase. [2] |
| addMLP | parameter_space: dict[str, any] | Series | Add an MLP model configuration to train as a candidate in the model selection phase. [2] |
| configureAutoTuning | config: **kwargs | Series | Configure the auto-tuning. |
| train | G: Graph, config: **kwargs | LPModel, Series | Train the pipeline on the given input graph using the given keyword arguments. |
| train_estimate | G: Graph, config: **kwargs | Series | Estimate training the pipeline on the given input graph using the given keyword arguments. |
| node_property_steps | - | DataFrame | Returns the node property steps of the pipeline. |
| feature_steps | - | DataFrame | Returns a list of the selected feature steps for the pipeline. |
| split_config | - | Series | Returns the configuration set up for feature-train-test splitting of the dataset. |
| parameter_space | - | Series | Returns the model parameter space set up for model selection when training. |
| auto_tuning_config | - | Series | Returns the configuration set up for auto-tuning. |
| name | - | str | The name of the pipeline as it appears in the pipeline catalog. |
| type | - | str | The type of pipeline. |
| creation_time | - | neo4j.time.DateTime | Time when the pipeline was created. |
| drop | failIfMissing: Optional[bool] | Series | Removes the pipeline from the GDS Pipeline Catalog. |
| exists | - | bool | True if the pipeline exists in the GDS Pipeline Catalog, False otherwise. |

2. Ranges can also be given as length-two tuples, i.e. (x, y) is the same as {range: [x, y]}.

There are two main differences between the methods above and the Cypher procedures they map to:

  • As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.

  • Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the Python train method takes a graph object instead of a graph name, and returns an LPModel object that we can run predictions with, as well as a pandas Series containing metadata from the training.

Please consult the link prediction Cypher documentation for information about what kind of input the methods expect.

2.1.1. Example

Below is a small example of how one could configure and train a very basic link prediction pipeline. Note that we don’t configure training parameters explicitly, but rather use the default.

To exemplify this, we introduce a small person graph:

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob"}),
    (b:Person {name: "Alice"}),
    (c:Person {name: "Eve"}),
    (d:Person {name: "Chad"}),
    (e:Person {name: "Dan"}),
    (f:Person {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", "Person", {"KNOWS": {"orientation":"UNDIRECTED"}})

assert G.relationship_types() == ["KNOWS"]
pipe, _ = gds.beta.pipeline.linkPrediction.create("lp-pipe")

# Add FastRP as a property step producing "embedding" node properties
pipe.addNodeProperty("fastRP", embeddingDimension=128, mutateProperty="embedding", randomSeed=1337)

# Combine our "embedding" node properties with Hadamard to create link features for training
pipe.addFeature("hadamard", nodeProperties=["embedding"])

# Verify that the features to be used in model training are what we expect
steps = pipe.feature_steps()
assert len(steps) == 1
assert steps["name"][0] == "HADAMARD"

# Specify the fractions we want for our dataset split
pipe.configureSplit(trainFraction=0.2, testFraction=0.2, validationFolds=2)

# Add a random forest model with tuning over `maxDepth`
pipe.addRandomForest(maxDepth=(2, 20))

# Train the pipeline and produce a model named "friend-recommender"
friend_recommender, train_result = pipe.train(
    G,
    modelName="friend-recommender",
    targetRelationshipType="KNOWS",
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

A model referred to as "friend-recommender" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

2.2. Model

As we saw in the previous section, link prediction models are created when training a link prediction pipeline. In addition to inheriting the methods common to all model objects, link prediction models have the following methods:

Table 4. Link prediction model methods

| Name | Arguments | Return type | Description |
| --- | --- | --- | --- |
| predict_mutate | G: Graph, config: **kwargs | Series | Predict links between non-neighboring nodes of the input graph and mutate graph with predictions. |
| predict_mutate_estimate | G: Graph, config: **kwargs | Series | Estimate predicting links between non-neighboring nodes of the input graph and mutating graph with predictions. |
| predict_stream | G: Graph, config: **kwargs | DataFrame | Predict links between non-neighboring nodes of the input graph and stream the results. |
| predict_stream_estimate | G: Graph, config: **kwargs | Series | Estimate predicting links between non-neighboring nodes of the input graph and streaming the results. |
| metrics | - | Series | Returns values for the metrics specified when training, for this particular model. |

One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are that:

  • They take a graph object instead of a graph name.

  • They have Python keyword arguments representing the keys of the configuration map.

  • One does not have to provide a "modelName" since the model object itself has this information.

2.2.1. Example (continued)

We now continue the example above using the link prediction model friend_recommender we trained there.

# Make sure we indeed obtained an AUCPR score
metrics = friend_recommender.metrics()
assert "AUCPR" in metrics

# Predict on `G` and mutate it with the relationship predictions
mutate_result = friend_recommender.predict_mutate(G, topN=5, mutateRelationshipType="PRED_REL")
assert mutate_result["relationshipsWritten"] == 5 * 2  # Undirected relationships
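One could also stream the top predictions instead of mutating the graph; a minimal sketch (the topN value is arbitrary):

# Stream the three highest-scoring predicted links as a pandas DataFrame.
predictions = friend_recommender.predict_stream(G, topN=3)
assert len(predictions) == 3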

3. Node regression

This section outlines how to use the Python client to build, configure and train a node regression pipeline, as well as how to use the model that training produces for predictions.

3.1. Pipeline

The creation of the node regression pipeline is very similar to how it’s done in Cypher. To create a new node regression pipeline one would make the following call:

pipe, res = gds.alpha.pipeline.nodeRegression.create("my-pipe")

where pipe is a pipeline object, and res is a pandas Series containing metadata from the underlying procedure call.

To then go on to build, configure and train the pipeline we would call methods directly on the node regression pipeline object. Below is a description of the methods on such objects:

Table 5. Node regression pipeline methods

| Name | Arguments | Return type | Description |
| --- | --- | --- | --- |
| addNodeProperty | procedure_name: str, config: **kwargs | Series | Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration. |
| selectFeatures | node_properties: Union[str, list[str]] | Series | Select node properties to be used as features. |
| configureSplit | config: **kwargs | Series | Configure the train-test dataset split. |
| addLinearRegression | parameter_space: dict[str, any] | Series | Add a linear regression model configuration to train as a candidate in the model selection phase. [3] |
| addRandomForest | parameter_space: dict[str, any] | Series | Add a random forest model configuration to train as a candidate in the model selection phase. [3] |
| configureAutoTuning | config: **kwargs | Series | Configure the auto-tuning. |
| train | G: Graph, config: **kwargs | NRModel, Series | Train the pipeline on the given input graph using the given keyword arguments. |
| node_property_steps | - | DataFrame | Returns the node property steps of the pipeline. |
| feature_properties | - | Series | Returns a list of the selected feature properties for the pipeline. |
| split_config | - | Series | Returns the configuration set up for train-test splitting of the dataset. |
| parameter_space | - | Series | Returns the model parameter space set up for model selection when training. |
| auto_tuning_config | - | Series | Returns the configuration set up for auto-tuning. |
| name | - | str | The name of the pipeline as it appears in the pipeline catalog. |
| type | - | str | The type of pipeline. |
| creation_time | - | neo4j.time.DateTime | Time when the pipeline was created. |
| drop | failIfMissing: Optional[bool] | Series | Removes the pipeline from the GDS Pipeline Catalog. |
| exists | - | bool | True if the pipeline exists in the GDS Pipeline Catalog, False otherwise. |

3. Ranges can also be given as length-two tuples, i.e. (x, y) is the same as {range: [x, y]}.

There are two main differences between the methods above and the Cypher procedures they map to:

  • As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.

  • Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the Python train method takes a graph object instead of a graph name, and returns an NRModel object that we can run predictions with, as well as a pandas Series containing metadata from the training.

Please consult the node regression Cypher documentation for information about what kind of input the methods expect.

3.1.1. Example

Below is a small example of how one could configure and train a very basic node regression pipeline. Note that we don’t configure splits explicitly, but rather use the default.

To exemplify this, we introduce a small person graph:

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", age: 22}),
    (b:Person {name: "Alice", age: 5}),
    (c:Person {name: "Eve", age: 53}),
    (d:Person {name: "Chad", age: 44}),
    (e:Person {name: "Dan", age: 60}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["age"]}}, "KNOWS")

assert G.relationship_types() == ["KNOWS"]
pipe, _ = gds.alpha.pipeline.nodeRegression.create("nr-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over linear regression
pipe.addLinearRegression(tolerance=(0.01, 0.1))
pipe.addLinearRegression(penalty=1.0)

# Train the pipeline targeting node property "age" as label and "MEAN_SQUARED_ERROR" as the only metric
age_predictor, train_result = pipe.train(
    G,
    modelName="age-predictor",
    targetProperty="age",
    metrics=["MEAN_SQUARED_ERROR"],
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

A model referred to as "age-predictor" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

3.2. Model

As we saw in the previous section, node regression models are created when training a node regression pipeline. In addition to inheriting the methods common to all model objects, node regression models have the following methods:

Table 6. Node regression model methods

| Name | Arguments | Return type | Description |
| --- | --- | --- | --- |
| predict_mutate | G: Graph, config: **kwargs | Series | Predict property values for nodes of the input graph and mutate graph with predictions. |
| predict_stream | G: Graph, config: **kwargs | DataFrame | Predict property values for nodes of the input graph and stream the results. |
| metrics | - | Series | Returns values for the metrics specified when training, for this particular model. |

One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are that:

  • They take a graph object instead of a graph name.

  • They have Python keyword arguments representing the keys of the configuration map.

  • One does not have to provide a "modelName" since the model object itself has this information.

3.2.1. Example (continued)

We now continue the example above using the node regression model age_predictor we trained there. Suppose that we have a new graph H that we want to run predictions on.

# Make sure we indeed obtained a MEAN_SQUARED_ERROR score
metrics = age_predictor.metrics()
assert "MEAN_SQUARED_ERROR" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = age_predictor.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()

4. The pipeline catalog

The primary way to use pipeline objects is for training models. Additionally, pipeline objects can be used as input to GDS Pipeline Catalog operations. For instance, supposing we have a pipeline object pipe, we could:

exists_result = gds.beta.pipeline.exists(pipe.name())

if exists_result["exists"]:
    gds.beta.pipeline.drop(pipe)  # same as pipe.drop()

A pipeline object that has already been created and is present in the pipeline catalog can be retrieved by calling the get method with its name. For example, we can retrieve a pipeline object representing our node classification pipeline named "my-pipe" from the example above:

pipe = gds.pipeline.get("my-pipe")
assert pipe.name() == "my-pipe"

The get method does not use any tier prefix because it is not associated with any tier. It only exists in the client and does not have a corresponding Cypher procedure.
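To inspect everything currently in the pipeline catalog, one can also list it; a minimal sketch (assuming the pipeline from above still exists):

# List all pipelines in the catalog as a pandas DataFrame.
all_pipelines = gds.beta.pipeline.list()
assert "my-pipe" in all_pipelines["pipelineName"].values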