Machine learning pipelines
The Python client has special support for link prediction pipelines and for node property prediction pipelines. GDS pipelines are represented as pipeline objects.
The Python method calls to create the pipelines match their Cypher counterparts exactly.
However, the rest of the pipeline functionality is deferred to methods on the pipeline objects themselves.
Once created, a TrainingPipeline can be passed as an argument to methods in the Python client, such as the pipeline catalog operations.
Additionally, the TrainingPipeline has convenience methods for inspecting the pipeline it represents without explicitly involving the pipeline catalog.
In the examples below we assume that we have an instantiated GraphDataScience object called gds.
Read more about this in Getting started.
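As a minimal sketch of that assumption, the following shows one way such a gds object could be obtained. The endpoint and credentials are placeholders rather than values from this page, and the sketch assumes that the graphdatascience package is installed and that a Neo4j instance with GDS is reachable.

```python
def connect(uri: str, user: str, password: str):
    """Return a GraphDataScience client for the given Neo4j instance."""
    # Imported lazily so that this helper can be defined even when the
    # client package is not installed.
    from graphdatascience import GraphDataScience

    return GraphDataScience(uri, auth=(user, password))


# Usage, with placeholder endpoint and credentials:
# gds = connect("bolt://localhost:7687", "neo4j", "my-password")
```

The resulting object plays the role of gds in all examples on this page.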
1. Node classification
This section outlines how to use the Python client to build, configure and train a node classification pipeline, as well as how to use the model that training produces for predictions.
1.1. Pipeline
The creation of the node classification pipeline is very similar to how it’s done in Cypher. To create a new node classification pipeline one would make the following call:
pipe, res = gds.beta.pipeline.nodeClassification.create("my-pipe")
where pipe is a pipeline object, and res is a pandas Series containing metadata from the underlying procedure call.
To then go on to build, configure and train the pipeline we would call methods directly on the node classification pipeline object. Below is a description of the methods on such objects:
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Add a logistic regression model configuration to train as a candidate in the model selection phase. [1] |
|
|
|
Add a random forest model configuration to train as a candidate in the model selection phase. [1] |
|
|
|
Add an MLP model configuration to train as a candidate in the model selection phase. [1] |
|
|
|
|
|
|
|
Train the pipeline on the given input graph using given keyword arguments. |
|
|
|
Estimate training the pipeline on the given input graph using given keyword arguments. |
|
|
|
Returns the node property steps of the pipeline. |
|
|
|
Returns a list of the selected feature properties for the pipeline. |
|
|
|
Returns the configuration set up for train-test splitting of the dataset. |
|
|
|
Returns the model parameter space set up for model selection when training. |
|
|
|
Returns the configuration set up for auto-tuning. |
|
|
|
The name of the pipeline as it appears in the pipeline catalog. |
|
|
|
The type of pipeline. |
|
|
|
Time when the pipeline was created. |
|
|
|
Removes the pipeline from the GDS Pipeline Catalog. |
|
|
|
|
1. Ranges can also be given as length-two tuples, as in tolerance=(0.01, 0.1) in the example below.
There are two main differences between the methods above and the Cypher procedures they map to:
- As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
- Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.
Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NCModel model object that we can run predictions with, as well as a pandas Series with the metadata from the training.
Please consult the node classification Cypher documentation for information about what kind of input the methods expect.
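To make the mapping concrete, here is a small sketch that puts the Cypher form and the Python form side by side. The helper name add_degree_feature is hypothetical, and the Cypher calls in the docstring assume a pipeline named 'my-pipe'.

```python
def add_degree_feature(pipe, property_name: str = "rank"):
    """Add a degree centrality step and select its output as a feature.

    Cypher equivalent (pipeline name passed explicitly, config as a map):
      CALL gds.beta.pipeline.nodeClassification.addNodeProperty(
        'my-pipe', 'degree', {mutateProperty: 'rank'})
      CALL gds.beta.pipeline.nodeClassification.selectFeatures(
        'my-pipe', 'rank')
    """
    # No pipeline name is needed: the method is called on the pipeline object.
    # The keys of the Cypher configuration map become keyword arguments.
    pipe.addNodeProperty("degree", mutateProperty=property_name)
    pipe.selectFeatures(property_name)
    return pipe
```

Calling add_degree_feature(pipe) on a node classification pipeline object performs the same two steps as the example in the next section.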
1.1.1. Example
Below is a small example of how one could configure and train a very basic node classification pipeline. Note that we don’t configure splits explicitly, but rather use the default.
To exemplify this, we introduce a small person graph:
gds.run_cypher(
"""
CREATE
(a:Person {name: "Bob", fraudster: 0}),
(b:Person {name: "Alice", fraudster: 0}),
(c:Person {name: "Eve", fraudster: 1}),
(d:Person {name: "Chad", fraudster: 1}),
(e:Person {name: "Dan", fraudster: 0}),
(f:UnknownPerson {name: "Judy"}),
(a)-[:KNOWS]->(b),
(a)-[:KNOWS]->(c),
(a)-[:KNOWS]->(d),
(b)-[:KNOWS]->(d),
(c)-[:KNOWS]->(d),
(c)-[:KNOWS]->(e),
(d)-[:KNOWS]->(e),
(d)-[:KNOWS]->(f),
(e)-[:KNOWS]->(f)
"""
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["fraudster"]}}, "KNOWS")
assert G.node_labels() == ["Person"]
pipe, _ = gds.beta.pipeline.nodeClassification.create("my-pipe")
# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")
# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")
# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"
# Configure the model training to do cross-validation over logistic regression
pipe.addLogisticRegression(tolerance=(0.01, 0.1))
pipe.addLogisticRegression(penalty=1.0)
# Train the pipeline targeting node property "fraudster" as label and "ACCURACY" as only metric
fraud_model, train_result = pipe.train(
    G,
    modelName="fraud-model",
    targetProperty="fraudster",
    metrics=["ACCURACY"],
    randomSeed=111
)
assert train_result["trainMillis"] >= 0
A model referred to as "fraud-model" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.
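The first addLogisticRegression call above passes tolerance as a length-two tuple, which stands for a range to search over during model selection, while penalty=1.0 is a fixed value. The helper below only illustrates that correspondence. Its name, to_cypher_value, is hypothetical and not part of the client, and it assumes that the Cypher-side form of a range is a map with a range key.

```python
def to_cypher_value(value):
    """Illustrate how a length-two tuple corresponds to a range map."""
    if isinstance(value, tuple) and len(value) == 2:
        low, high = value
        return {"range": [low, high]}
    # Fixed hyperparameter values are passed through unchanged.
    return value


print(to_cypher_value((0.01, 0.1)))  # {'range': [0.01, 0.1]}
print(to_cypher_value(1.0))          # 1.0
```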
1.2. Model
As we saw in the previous section, node classification models are created when training a node classification pipeline. In addition to inheriting the methods common to all model objects, node classification models have the following methods:
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
Predict classes for nodes of the input graph and mutate graph with predictions. |
|
|
|
Estimate predicting classes for nodes of the input graph and mutating graph with predictions. |
|
|
|
Predict classes for nodes of the input graph and stream the results. |
|
|
|
Estimate predicting classes for nodes of the input graph and streaming the results. |
|
|
|
Predict classes for nodes of the input graph and write results back to the database. |
|
|
|
|
|
|
|
Returns values for the metrics specified when training, for this particular model. |
Note that the predict methods are very similar to their Cypher counterparts. The three main differences are that:
- They take a graph object instead of a graph name.
- They have Python keyword arguments representing the keys of the configuration map.
- One does not have to provide a "modelName", since the model object itself has this information.
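The three differences can be seen together in a short sketch of a mutate-mode prediction. The helper name and the property name predicted_fraudster are hypothetical, and the Cypher call in the docstring assumes the graph and model names used in the example above.

```python
def mutate_with_predictions(model, graph, property_name="predicted_fraudster"):
    """Store predicted classes as a node property of the in-memory graph.

    Cypher equivalent:
      CALL gds.beta.pipeline.nodeClassification.predict.mutate(
        'person_graph',
        {modelName: 'fraud-model', mutateProperty: 'predicted_fraudster'})
    """
    # The graph is passed as a graph object rather than a name, the
    # configuration key is a keyword argument, and modelName is implied
    # by the model object itself.
    return model.predict_mutate(graph, mutateProperty=property_name)
```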
1.2.1. Example (continued)
We now continue the example above using the node classification model fraud_model we trained there.
# Make sure we indeed obtained an accuracy score
metrics = fraud_model.metrics()
assert "ACCURACY" in metrics
H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")
# Predict on `H` and stream the results with a specific concurrency of 2
predictions = fraud_model.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()
2. Link prediction
This section outlines how to use the Python client to build, configure and train a link prediction pipeline, as well as how to use the model that training produces for predictions.
2.1. Pipeline
The creation of the link prediction pipeline is very similar to how it’s done in Cypher. To create a new link prediction pipeline one would make the following call:
pipe, res = gds.beta.pipeline.linkPrediction.create("my-pipe")
where pipe is a pipeline object, and res is a pandas Series containing metadata from the underlying procedure call.
To then go on to build, configure and train the pipeline we would call methods directly on the link prediction pipeline object. Below is a description of the methods on such objects:
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
|
|
|
|
Add a link feature for model training based on node properties and a feature combiner. |
|
|
|
|
|
|
|
Add a logistic regression model configuration to train as a candidate in the model selection phase. [2] |
|
|
|
Add a random forest model configuration to train as a candidate in the model selection phase. [2] |
|
|
|
Add an MLP model configuration to train as a candidate in the model selection phase. [2] |
|
|
|
|
|
|
|
Train the model on the given input graph using given keyword arguments. |
|
|
|
Estimate training the pipeline on the given input graph using given keyword arguments. |
|
|
|
Returns the node property steps of the pipeline. |
|
|
|
Returns a list of the selected feature steps for the pipeline. |
|
|
|
Returns the configuration set up for feature-train-test splitting of the dataset. |
|
|
|
Returns the model parameter space set up for model selection when training. |
|
|
|
Returns the configuration set up for auto-tuning. |
|
|
|
The name of the pipeline as it appears in the pipeline catalog. |
|
|
|
The type of pipeline. |
|
|
|
Time when the pipeline was created. |
|
|
|
Removes the pipeline from the GDS Pipeline Catalog. |
|
|
|
|
2. Ranges can also be given as length-two tuples, as in maxDepth=(2, 20) in the example below.
There are two main differences between the methods above and the Cypher procedures they map to:
- As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
- Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.
Another difference is that the train Python call takes a graph object instead of a graph name, and returns an LPModel model object that we can run predictions with, as well as a pandas Series with the metadata from the training.
Please consult the link prediction Cypher documentation for information about what kind of input the methods expect.
2.1.1. Example
Below is a small example of how one could configure and train a very basic link prediction pipeline. Note that we don’t configure training parameters explicitly, but rather use the default.
To exemplify this, we introduce a small person graph:
gds.run_cypher(
"""
CREATE
(a:Person {name: "Bob"}),
(b:Person {name: "Alice"}),
(c:Person {name: "Eve"}),
(d:Person {name: "Chad"}),
(e:Person {name: "Dan"}),
(f:Person {name: "Judy"}),
(a)-[:KNOWS]->(b),
(a)-[:KNOWS]->(c),
(a)-[:KNOWS]->(d),
(b)-[:KNOWS]->(d),
(c)-[:KNOWS]->(d),
(c)-[:KNOWS]->(e),
(d)-[:KNOWS]->(e),
(d)-[:KNOWS]->(f),
(e)-[:KNOWS]->(f)
"""
)
G, project_result = gds.graph.project("person_graph", "Person", {"KNOWS": {"orientation":"UNDIRECTED"}})
assert G.relationship_types() == ["KNOWS"]
pipe, _ = gds.beta.pipeline.linkPrediction.create("lp-pipe")
# Add FastRP as a property step producing "embedding" node properties
pipe.addNodeProperty("fastRP", embeddingDimension=128, mutateProperty="embedding", randomSeed=1337)
# Combine our "embedding" node properties with Hadamard to create link features for training
pipe.addFeature("hadamard", nodeProperties=["embedding"])
# Verify that the features to be used in model training are what we expect
steps = pipe.feature_steps()
assert len(steps) == 1
assert steps["name"][0] == "HADAMARD"
# Specify the fractions we want for our dataset split
pipe.configureSplit(trainFraction=0.2, testFraction=0.2, validationFolds=2)
# Add a random forest model with tuning over `maxDepth`
pipe.addRandomForest(maxDepth=(2, 20))
# Train the pipeline and produce a model named "friend-recommender"
friend_recommender, train_result = pipe.train(
    G,
    modelName="friend-recommender",
    targetRelationshipType="KNOWS",
    randomSeed=42
)
assert train_result["trainMillis"] >= 0
A model referred to as "friend-recommender" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.
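The hadamard combiner used in the addFeature call above builds a link feature from the element-wise product of the property vectors of the two endpoint nodes. The plain-Python sketch below illustrates that operation on its own, outside GDS, using made-up three-dimensional vectors instead of the 128-dimensional embeddings from the example.

```python
def hadamard(source_vec, target_vec):
    """Element-wise product of two equally long vectors."""
    if len(source_vec) != len(target_vec):
        raise ValueError("vectors must have the same length")
    return [s * t for s, t in zip(source_vec, target_vec)]


# Made-up vectors standing in for the embeddings of two nodes.
print(hadamard([1.0, 2.0, 3.0], [0.5, 0.0, -1.0]))  # [0.5, 0.0, -3.0]
```

The result has the same dimension as the inputs and does not depend on the order of the two nodes, which suits undirected relationships.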
2.2. Model
As we saw in the previous section, link prediction models are created when training a link prediction pipeline. In addition to inheriting the methods common to all model objects, link prediction models have the following methods:
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
Predict links between non-neighboring nodes of the input graph and mutate graph with predictions. |
|
|
|
|
|
|
|
Predict links between non-neighboring nodes of the input graph and stream the results. |
|
|
|
Estimate predicting links between non-neighboring nodes of the input graph and streaming the results. |
|
|
|
Returns values for the metrics used when training, for this particular model. |
Note that the predict methods are very similar to their Cypher counterparts. The three main differences are that:
- They take a graph object instead of a graph name.
- They have Python keyword arguments representing the keys of the configuration map.
- One does not have to provide a "modelName", since the model object itself has this information.
2.2.1. Example (continued)
We now continue the example above using the link prediction model friend_recommender we trained there.
# Make sure we indeed obtained an AUCPR score
metrics = friend_recommender.metrics()
assert "AUCPR" in metrics
# Predict on `G` and mutate it with the relationship predictions
mutate_result = friend_recommender.predict_mutate(G, topN=5, mutateRelationshipType="PRED_REL")
assert mutate_result["relationshipsWritten"] == 5 * 2 # Undirected relationships
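The expected count of 5 * 2 in the last assertion can be reasoned about with a little arithmetic. The sketch below uses a hypothetical helper and assumes that every non-adjacent node pair is a prediction candidate, that all candidates pass the prediction threshold, and that each predicted undirected link is stored once per direction.

```python
def expected_mutated_relationships(node_count, existing_pairs, top_n):
    """Estimate how many relationships a topN mutate call adds."""
    all_pairs = node_count * (node_count - 1) // 2
    # Only pairs that are not already connected can be predicted.
    candidates = all_pairs - existing_pairs
    predicted = min(top_n, candidates)
    # Each undirected relationship counts once per direction.
    return 2 * predicted


# The example graph has 6 Person nodes and 9 distinct KNOWS pairs.
print(expected_mutated_relationships(6, 9, 5))  # 10
```

With 15 possible pairs and 9 already connected, 6 candidate pairs remain, so topN=5 is not limited by the graph size here.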
3. Node regression
This section outlines how to use the Python client to build, configure and train a node regression pipeline, as well as how to use the model that training produces for predictions.
3.1. Pipeline
The creation of the node regression pipeline is very similar to how it’s done in Cypher. To create a new node regression pipeline one would make the following call:
pipe, res = gds.alpha.pipeline.nodeRegression.create("my-pipe")
where pipe is a pipeline object, and res is a pandas Series containing metadata from the underlying procedure call.
To then go on to build, configure and train the pipeline we would call methods directly on the node regression pipeline object. Below is a description of the methods on such objects:
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Add a linear regression model configuration to train as a candidate in the model selection phase. [3] |
|
|
|
Add a random forest model configuration to train as a candidate in the model selection phase. [3] |
|
|
|
|
|
|
|
Train the pipeline on the given input graph using given keyword arguments. |
|
|
|
Returns the node property steps of the pipeline. |
|
|
|
Returns a list of the selected feature properties for the pipeline. |
|
|
|
Returns the configuration set up for train-test splitting of the dataset. |
|
|
|
Returns the model parameter space set up for model selection when training. |
|
|
|
Returns the configuration set up for auto-tuning. |
|
|
|
The name of the pipeline as it appears in the pipeline catalog. |
|
|
|
The type of pipeline. |
|
|
|
Time when the pipeline was created. |
|
|
|
Removes the pipeline from the GDS Pipeline Catalog. |
|
|
|
|
3. Ranges can also be given as length-two tuples, as in tolerance=(0.01, 0.1) in the example below.
There are two main differences between the methods above and the Cypher procedures they map to:
- As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
- Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.
Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NRModel model object that we can run predictions with, as well as a pandas Series with the metadata from the training.
Please consult the node regression Cypher documentation for information about what kind of input the methods expect.
3.1.1. Example
Below is a small example of how one could configure and train a very basic node regression pipeline. Note that we don’t configure splits explicitly, but rather use the default.
To exemplify this, we introduce a small person graph:
gds.run_cypher(
"""
CREATE
(a:Person {name: "Bob", age: 22}),
(b:Person {name: "Alice", age: 5}),
(c:Person {name: "Eve", age: 53}),
(d:Person {name: "Chad", age: 44}),
(e:Person {name: "Dan", age: 60}),
(f:UnknownPerson {name: "Judy"}),
(a)-[:KNOWS]->(b),
(a)-[:KNOWS]->(c),
(a)-[:KNOWS]->(d),
(b)-[:KNOWS]->(d),
(c)-[:KNOWS]->(d),
(c)-[:KNOWS]->(e),
(d)-[:KNOWS]->(e),
(d)-[:KNOWS]->(f),
(e)-[:KNOWS]->(f)
"""
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["age"]}}, "KNOWS")
assert G.relationship_types() == ["KNOWS"]
pipe, _ = gds.alpha.pipeline.nodeRegression.create("nr-pipe")
# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")
# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")
# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"
# Configure the model training to do cross-validation over linear regression
pipe.addLinearRegression(tolerance=(0.01, 0.1))
pipe.addLinearRegression(penalty=1.0)
# Train the pipeline targeting node property "age" as label and "MEAN_SQUARED_ERROR" as only metric
age_predictor, train_result = pipe.train(
    G,
    modelName="age-predictor",
    targetProperty="age",
    metrics=["MEAN_SQUARED_ERROR"],
    randomSeed=42
)
assert train_result["trainMillis"] >= 0
A model referred to as "age-predictor" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.
3.2. Model
As we saw in the previous section, node regression models are created when training a node regression pipeline. In addition to inheriting the methods common to all model objects, node regression models have the following methods:
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
Predict property values for nodes of the input graph and mutate graph with predictions. |
|
|
|
Predict property values for nodes of the input graph and stream the results. |
|
|
|
Returns values for the metrics specified when training, for this particular model. |
Note that the predict methods are very similar to their Cypher counterparts. The three main differences are that:
- They take a graph object instead of a graph name.
- They have Python keyword arguments representing the keys of the configuration map.
- One does not have to provide a "modelName", since the model object itself has this information.
3.2.1. Example (continued)
We now continue the example above using the node regression model age_predictor we trained there.
Suppose that we have a new graph H that we want to run predictions on.
# Make sure we indeed obtained a MEAN_SQUARED_ERROR score
metrics = age_predictor.metrics()
assert "MEAN_SQUARED_ERROR" in metrics
H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")
# Predict on `H` and stream the results with a specific concurrency of 2
predictions = age_predictor.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()
4. The pipeline catalog
The primary way to use pipeline objects is for training models.
Additionally, pipeline objects can be used as input to GDS Pipeline Catalog operations.
For instance, supposing we have a pipeline object pipe, we could:
exists_result = gds.beta.pipeline.exists(pipe.name())
if exists_result["exists"]:
    gds.beta.pipeline.drop(pipe)  # same as pipe.drop()
A pipeline object that has already been created and is present in the pipeline catalog can be retrieved by calling the get method with its name.
For example, we can retrieve a pipeline object representing our node classification pipeline named "my-pipe" from the example above:
pipe = gds.pipeline.get("my-pipe")
assert pipe.name() == "my-pipe"
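A retrieved pipeline object can then be inspected without further catalog calls. The following sketch gathers a few of its properties into a dictionary. The helper name describe_pipeline is hypothetical, and type() and creation_time() are assumed to be the accessors behind the "type of pipeline" and "time when the pipeline was created" rows in the method tables above.

```python
def describe_pipeline(gds, name: str) -> dict:
    """Fetch a pipeline from the catalog by name and summarise it."""
    pipe = gds.pipeline.get(name)
    return {
        "name": pipe.name(),
        # Assumed accessor names, see the note above.
        "type": pipe.type(),
        "created": pipe.creation_time(),
    }


# Usage, given a connected client:
# info = describe_pipeline(gds, "my-pipe")
```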