Configuring the pipeline
This page explains how to create and configure a link prediction pipeline.
1. Creating a pipeline
The first step of building a new pipeline is to create one using gds.beta.pipeline.linkPrediction.create. This stores a trainable pipeline object in the pipeline catalog of type Link prediction training pipeline. This represents a configurable pipeline that can later be invoked for training, which in turn creates a trained pipeline. The latter is also a model which is stored in the catalog with type LinkPrediction.
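After creating it, you can verify that the pipeline is stored in the pipeline catalog. The following is a minimal sketch, assuming your GDS version provides the pipeline catalog procedure gds.beta.pipeline.list; consult the pipeline catalog documentation for the procedures available in your version.

// Sketch: list the pipelines currently stored in the pipeline catalog.
// Assumes gds.beta.pipeline.list is available in your GDS version.
CALL gds.beta.pipeline.list()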
1.1. Syntax
CALL gds.beta.pipeline.linkPrediction.create(
pipelineName: String
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: List of Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the created pipeline. |

Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
1.2. Example
CALL gds.beta.pipeline.linkPrediction.create('pipe')
name | nodePropertySteps | featureSteps | splitConfig | autoTuningConfig | parameterSpace |
---|---|---|---|---|---|
"pipe" | [] | [] | {negativeSamplingRatio=1.0, testFraction=0.1, validationFolds=3, trainFraction=0.1} | {maxTrials=10} | {RandomForest=[], LogisticRegression=[]} |
This shows that the newly created pipeline does not contain any steps yet, and has defaults for the split and train parameters.
2. Adding node properties
A link prediction pipeline can execute one or several GDS algorithms in mutate mode that create node properties in the projected graph. Such node property steps can be chained one after another, and the created properties can also be used as inputs for link features. Moreover, the node property steps that are added to the pipeline are executed both when training a pipeline and when the trained model is applied for prediction.
The name of the procedure that should be added can be a fully qualified GDS procedure name ending with .mutate. The ending .mutate may be omitted, and one may also use shorthand forms such as node2vec instead of gds.beta.node2vec.mutate.
For example, pre-processing algorithms can be used as node property steps.
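For instance, the following two calls add the same kind of step, first using the fully qualified procedure name and then the shorthand form. This is only an illustration (the mutateProperty name is arbitrary); run just one of them, since each call appends a separate step to the pipeline.

// Fully qualified procedure name:
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'gds.beta.node2vec.mutate', {
  mutateProperty: 'node2vecEmbedding'
})

// Equivalent shorthand form, with the gds prefix and the .mutate ending omitted:
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'node2vec', {
  mutateProperty: 'node2vecEmbedding'
})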
2.1. Syntax
CALL gds.beta.pipeline.linkPrediction.addNodeProperty(
pipelineName: String,
procedureName: String,
procedureConfiguration: Map
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: List of Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the pipeline. |
procedureName | String | The name of the procedure to be added to the pipeline. |
procedureConfiguration | Map | The configuration of the procedure, excluding graphName, nodeLabels and relationshipTypes. |

Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
2.2. Example
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
mutateProperty: 'embedding',
embeddingDimension: 256,
randomSeed: 42
})
name | nodePropertySteps | featureSteps | splitConfig | autoTuningConfig | parameterSpace |
---|---|---|---|---|---|
"pipe" | [{name=gds.fastRP.mutate, config={randomSeed=42, embeddingDimension=256, mutateProperty=embedding}}] | [] | {negativeSamplingRatio=1.0, testFraction=0.1, validationFolds=3, trainFraction=0.1} | {maxTrials=10} | {RandomForest=[], LogisticRegression=[]} |
The pipeline will now execute the fastRP algorithm in mutate mode both before training a model, and when the trained model is applied for prediction.
This ensures the embedding property can be used as an input for link features.
3. Adding link features
A Link Prediction pipeline executes a sequence of steps to compute the features used by a machine learning model. A feature step computes a vector of features for given node pairs. For each node pair, the results are concatenated into a single link feature vector. The order of the features in the link feature vector follows the order of the feature steps. Like with node property steps, the feature steps are also executed both at training and prediction time. The supported methods for obtaining features are described below.
3.1. Syntax
CALL gds.beta.pipeline.linkPrediction.addFeature(
pipelineName: String,
featureType: String,
configuration: Map
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: List of Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the pipeline. |
featureType | String | The featureType determines the method used for computing the link feature. See supported types. |
configuration | Map | Configuration for computing the link feature. |

Name | Type | Default | Optional | Description |
---|---|---|---|---|
nodeProperties | List of String |  | no | The names of the node properties that should be used as input. |

Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
3.2. Supported feature types
A feature step can use node properties that exist in the input graph or are added by the pipeline.
For each node in each potential link, the values of nodeProperties are concatenated, in the configured order, into a vector. That is, for each potential link the feature vector of the source node, s = (s_1, s_2, ..., s_d), is combined with the one of the target node, t = (t_1, t_2, ..., t_d), into a single link feature vector f.
The supported types of features can then be described as follows:
Feature Type | Formula / Description |
---|---|
L2 | f = ((s_1 - t_1)^2, (s_2 - t_2)^2, ..., (s_d - t_d)^2), the element-wise squared difference. |
HADAMARD | f = (s_1 * t_1, s_2 * t_2, ..., s_d * t_d), the element-wise product. |
COSINE | f = sum_i(s_i * t_i) / (sqrt(sum_i(s_i^2)) * sqrt(sum_i(t_i^2))), the cosine similarity of s and t, a single scalar feature. |
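To make these formulas concrete, here is a small plain-Cypher sketch (not a GDS procedure, just arithmetic) that computes the three feature types for two hypothetical 3-dimensional property vectors s and t:

// Plain Cypher illustration of the three feature types; s and t are
// hypothetical node property vectors, not values produced by GDS.
WITH [0.5, 1.0, 0.2] AS s, [0.4, 0.8, 0.3] AS t
RETURN
  [i IN range(0, size(s) - 1) | (s[i] - t[i]) ^ 2] AS l2Feature,       // element-wise squared difference
  [i IN range(0, size(s) - 1) | s[i] * t[i]]       AS hadamardFeature, // element-wise product
  reduce(dot = 0.0, i IN range(0, size(s) - 1) | dot + s[i] * t[i]) /
    (sqrt(reduce(acc = 0.0, x IN s | acc + x * x)) *
     sqrt(reduce(acc = 0.0, x IN t | acc + x * x))) AS cosineFeature   // cosine similarity (single scalar)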
3.3. Example
CALL gds.beta.pipeline.linkPrediction.addFeature('pipe', 'hadamard', {
nodeProperties: ['embedding', 'numberOfPosts']
}) YIELD featureSteps
featureSteps |
---|
[{name=HADAMARD, config={nodeProperties=[embedding, numberOfPosts]}}] |
When executing the pipeline, the nodeProperties must be either present in the input graph, or created by a previous node property step. For example, the embedding property could be created by the previous example, and we expect numberOfPosts to already be present in the in-memory graph used as input, at train and predict time.
4. Configuring the relationship splits
Link Prediction training pipelines manage splitting the relationships into several sets and add sampled negative relationships to some of these sets. Configuring the splitting is optional, and if omitted, splitting will be done using default settings.
The splitting configuration of a pipeline can be inspected by using gds.beta.model.list and possibly only yielding splitConfig.
The splitting of relationships proceeds internally in the following steps:
- The graph is filtered according to specified nodeLabels and relationshipTypes, which are configured at train time.
- The relationships remaining after filtering we call positive, and they are split into a test set and remaining relationships which we refer to as the test-complement set.
  - The test set contains a testFraction fraction of the positive relationships.
  - Random negative relationships are added to the test set. The number of negative relationships is the number of positive ones multiplied by the negativeSamplingRatio.
  - The negative relationships do not coincide with positive relationships.
- The relationships in the test-complement set are split into a train set and a feature-input set.
  - The train set contains a trainFraction fraction of the test-complement set.
  - The feature-input set contains a (1-trainFraction) fraction of the test-complement set.
  - Random negative relationships are added to the train set. The number of negative relationships is the number of positive ones multiplied by the negativeSamplingRatio.
  - The negative relationships do not coincide with positive relationships, nor with test relationships.
- The sampled positive and negative relationships are given relationship weights of 1.0 and 0.0 respectively so that they can be distinguished.
The feature-input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties.
The train and test relationship sets are used for:

- determining the label (positive or negative) for each training or test example
- identifying the node pair for which link features are to be computed
However, they are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature.
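As a rough, hypothetical illustration of the proportions (plain Cypher arithmetic, not a GDS call; the real split additionally selects the relationships at random): assume 1000 positive relationships after filtering, testFraction 0.25, trainFraction 0.6 and negativeSamplingRatio 1.0.

// Hypothetical arithmetic only: 1000 positives, testFraction 0.25,
// trainFraction 0.6, negativeSamplingRatio 1.0.
WITH 1000 AS positives, 0.25 AS testFraction, 0.6 AS trainFraction, 1.0 AS negativeSamplingRatio
WITH trainFraction, negativeSamplingRatio,
     toInteger(positives * testFraction) AS testPositives,
     positives - toInteger(positives * testFraction) AS testComplement
RETURN testPositives,                                                                         // 250 positive test relationships
       toInteger(testPositives * negativeSamplingRatio) AS testNegatives,                     // 250 sampled negatives added to the test set
       toInteger(testComplement * trainFraction) AS trainPositives,                           // 450 positive train relationships
       toInteger(testComplement * trainFraction * negativeSamplingRatio) AS trainNegatives,   // 450 sampled negatives added to the train set
       testComplement - toInteger(testComplement * trainFraction) AS featureInputRelationships // 300 relationships in the feature-input set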
4.1. Syntax
CALL gds.beta.pipeline.linkPrediction.configureSplit(
pipelineName: String,
configuration: Map
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: List of Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the pipeline. |
configuration | Map | Configuration for splitting the relationships. |

Name | Type | Default | Description |
---|---|---|---|
validationFolds | Integer | 3 | Number of divisions of the training graph used during model selection. |
testFraction | Double | 0.1 | Fraction of the graph reserved for testing. Must be in the range (0, 1). |
trainFraction | Double | 0.1 | Fraction of the test-complement set reserved for training. Must be in the range (0, 1). |
negativeSamplingRatio | Double | 1.0 | The desired ratio of negative to positive samples in the test and train set. More details here. |

Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
4.2. Example
CALL gds.beta.pipeline.linkPrediction.configureSplit('pipe', {
testFraction: 0.25,
trainFraction: 0.6,
validationFolds: 3
})
YIELD splitConfig
splitConfig |
---|
{negativeSamplingRatio=1.0, testFraction=0.25, validationFolds=3, trainFraction=0.6} |
We have now reconfigured the splitting of the pipeline, which will be applied during training.
5. Adding model candidates
A pipeline contains a collection of configurations for model candidates which is initially empty. This collection is called the parameter space. Each model candidate configuration contains either fixed values or ranges for training parameters. When a range is present, values from the range are determined automatically by an auto-tuning algorithm, see Auto-tuning. One or more model configurations must be added to the parameter space of the training pipeline, using one of the following procedures:
- gds.beta.pipeline.linkPrediction.addLogisticRegression
- gds.alpha.pipeline.linkPrediction.addRandomForest
For information about the available training methods in GDS, logistic regression and random forest, see Training methods.
In Training the pipeline, we explain further how the configured model candidates are trained, evaluated and compared.
The parameter space of a pipeline can be inspected using gds.beta.model.list and optionally yielding only parameterSpace.
At least one model candidate must be added to the pipeline before training it.
5.1. Syntax
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression(
pipelineName: String,
config: Map
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the pipeline. |
config | Map | The logistic regression config for a model candidate. The allowed parameters for a model are defined in the next table. |

Name | Type | Default | Optional | Description |
---|---|---|---|---|
penalty [1] | Float or Map [2] | 0.0 | yes | Penalty used for the logistic regression. By default, no penalty is applied. |
batchSize | Integer or Map [2] | 100 | yes | Number of nodes per batch. |
minEpochs | Integer or Map [2] | 1 | yes | Minimum number of training epochs. |
maxEpochs | Integer or Map [2] | 100 | yes | Maximum number of training epochs. |
learningRate [1] | Float or Map [2] | 0.001 | yes | The learning rate determines the step size at each epoch while moving in the direction dictated by the Adam optimizer for minimizing the loss. |
patience | Integer or Map [2] | 1 | yes | Maximum number of unproductive consecutive epochs. |
tolerance [1] | Float or Map [2] | 0.001 | yes | The minimal improvement of the loss to be considered productive. |

1. Ranges for this parameter are auto-tuned on a logarithmic scale.
2. A map should be of the form {range: [minValue, maxValue]}, describing a range for the parameter that is used during auto-tuning.
Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
CALL gds.alpha.pipeline.linkPrediction.addRandomForest(
pipelineName: String,
config: Map
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the pipeline. |
config | Map | The random forest config for a model candidate. The allowed parameters for a model are defined in the next table. |

Name | Type | Default | Optional | Description |
---|---|---|---|---|
maxFeaturesRatio | Float or Map [3] |  | yes | The ratio of features to consider when looking for the best split. |
numberOfSamplesRatio | Float or Map [3] | 1.0 | yes | The ratio of samples to consider per decision tree. We use sampling with replacement. A value of 0 means no sampling, so that every decision tree is trained on all training examples. |
numberOfDecisionTrees | Integer or Map [3] | 100 | yes | The number of decision trees. |
maxDepth | Integer or Map [3] | 2147483647 | yes | The maximum depth of a decision tree. |
minLeafSize | Integer or Map [3] | 1 | yes | The minimum number of samples for a leaf node in a decision tree. Must be strictly smaller than minSplitSize. |
minSplitSize | Integer or Map [3] | 2 | yes | The minimum number of samples required to split an internal node in a decision tree. Must be strictly larger than minLeafSize. |
criterion | String | GINI | yes | The impurity criterion used to evaluate potential node splits during decision tree training. Valid options are GINI and ENTROPY. |

3. A map should be of the form {range: [minValue, maxValue]}, describing a range for the parameter that is used during auto-tuning.
Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
5.2. Example
We can add multiple model candidates to our pipeline.
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe')
YIELD parameterSpace
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('pipe', {numberOfDecisionTrees: 10})
YIELD parameterSpace
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe', {maxEpochs: 500, penalty: {range: [1e-4, 1e2]}})
YIELD parameterSpace
RETURN parameterSpace.RandomForest AS randomForestSpace, parameterSpace.LogisticRegression AS logisticRegressionSpace
randomForestSpace | logisticRegressionSpace |
---|---|
[{maxDepth=2147483647, minLeafSize=1, criterion=GINI, minSplitSize=2, numberOfDecisionTrees=10, methodName=RandomForest, numberOfSamplesRatio=1.0}] | [{maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001, learningRate=0.001}, {maxEpochs=500, minEpochs=1, penalty={range=[1.0E-4, 100.0]}, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001, learningRate=0.001}] |
The parameterSpace in the pipeline now contains the three different model candidates, expanded with the default values.
Each specified model candidate will be tried out during the model selection in training.
These are somewhat naive examples of how to add and configure model candidates. Please see Training methods for more information on how to tune the configuration parameters of each method.
6. Configuring Auto-tuning
In order to find good models, the pipeline supports automatically tuning the parameters of the training algorithm. Optionally, the procedure described below can be used to configure the auto-tuning behavior. Otherwise, the default auto-tuning configuration is used. Currently, it is only possible to configure the maximum number of trials of hyper-parameter settings that are evaluated.
6.1. Syntax
CALL gds.alpha.pipeline.linkPrediction.configureAutoTuning(
pipelineName: String,
configuration: Map
)
YIELD
name: String,
nodePropertySteps: List of Map,
featureSteps: List of Map,
splitConfig: Map,
autoTuningConfig: Map,
parameterSpace: List of Map
Name | Type | Description |
---|---|---|
pipelineName | String | The name of the pipeline. |
configuration | Map | The configuration for auto-tuning. |

Name | Type | Default | Description |
---|---|---|---|
maxTrials | Integer | 10 | The maximum number of hyper-parameter settings (trials) that are evaluated during model selection. |
Name | Type | Description |
---|---|---|
name | String | Name of the pipeline. |
nodePropertySteps | List of Map | List of configurations for node property steps. |
featureSteps | List of Map | List of configurations for feature steps. |
splitConfig | Map | Configuration to define the split before the model training. |
autoTuningConfig | Map | Configuration to define the behavior of auto-tuning. |
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection. |
6.2. Example
CALL gds.alpha.pipeline.linkPrediction.configureAutoTuning('pipe', {
maxTrials: 5
})
YIELD autoTuningConfig
autoTuningConfig |
---|
{maxTrials=5} |
We have now reconfigured the auto-tuning to try out at most 5 model candidates during training.