Link prediction pipelines

This section describes Link prediction pipelines in the Neo4j Graph Data Science library.

1. Introduction

Link prediction is a common machine learning task applied to graphs: training a model to predict, for pairs of nodes in a graph, whether a relationship should exist between them. More precisely, the input to the machine learning model is a set of examples of node pairs. During training, the node pairs are labeled as adjacent or not adjacent.

In GDS, we have Link prediction pipelines which offer an end-to-end workflow, from feature extraction to link prediction. The training pipelines reside in the pipeline catalog. When a training pipeline is executed, a prediction model is created and stored in the model catalog.

A training pipeline is a sequence of three phases:

  1. From the graph three sets of node pairs are derived: feature set, training set, test set. The latter two are labeled.

  2. The nodes in the graph are augmented with new properties by running a series of steps on the graph with only relationships from the feature set.

  3. The train and test sets are used for training a link prediction model. Link features are derived by combining node properties of node pairs.

For the training and test sets, positive examples are selected from the relationships in the graph. The negative examples are sampled from non-adjacent nodes.

One can configure which steps should be included above. The steps execute GDS algorithms that create new node properties. After configuring the node property steps, one can define how to combine node properties of node pairs into link features. The training phase (phase 3 above) trains multiple model candidates using cross-validation, selects the best one, and reports relevant performance metrics.

After training the pipeline, a prediction model is created. This model includes the node property steps and link feature steps from the training pipeline and uses them to generate the relevant features for predicting new relationships. The prediction model can be applied to infer the probability of the existence of a relationship between two non-adjacent nodes.

Prediction can only be done with a prediction model (not with a training pipeline).

2. Creating a pipeline

The first step of building a new pipeline is to create one using gds.beta.pipeline.linkPrediction.create. This stores a trainable pipeline object in the pipeline catalog of type Link prediction training pipeline. This represents a configurable pipeline that can later be invoked for training, which in turn creates a trained pipeline. The latter is also a model which is stored in the catalog with type LinkPrediction.

2.1. Syntax

Create pipeline syntax
CALL gds.beta.pipeline.linkPrediction.create(
  pipelineName: String
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 1. Parameters
Name Type Description

pipelineName

String

The name of the created pipeline.

Table 2. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureSteps

List of Map

List of configurations for feature steps.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

2.2. Example

The following will create a pipeline:
CALL gds.beta.pipeline.linkPrediction.create('pipe')
Table 3. Results
name nodePropertySteps featureSteps splitConfig parameterSpace

"pipe"

[]

[]

{negativeSamplingRatio=1.0, testFraction=0.1, validationFolds=3, trainFraction=0.1}

{RandomForest=[], LogisticRegression=[]}

This shows that the newly created pipeline does not contain any steps yet, and has defaults for the split and train parameters.

3. Adding node properties

A link prediction pipeline can execute one or several GDS algorithms in mutate mode that create node properties in the projected graph. Such steps producing node properties can be chained one after another, and the created properties can also be used as inputs for features. Moreover, the node property steps that are added to the pipeline will be executed both when training a pipeline and when the trained model is applied for prediction.

The name of the procedure that should be added can be a fully qualified GDS procedure name ending with .mutate. The ending .mutate may be omitted and one may also use shorthand forms such as node2vec instead of gds.beta.node2vec.mutate.
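The name resolution described above can be illustrated with a small sketch. Note that the helper function and the shorthand table below are hypothetical, written only to illustrate the rule; they are not part of the GDS API:

```python
# Hypothetical sketch of how a procedure name shorthand could be resolved
# to a fully qualified GDS mutate procedure. The shorthand table below is
# illustrative, not the actual GDS registry.
SHORTHANDS = {
    "node2vec": "gds.beta.node2vec.mutate",
    "fastRP": "gds.fastRP.mutate",
}

def resolve_procedure_name(name: str) -> str:
    """Expand known shorthands and ensure the '.mutate' suffix."""
    if name in SHORTHANDS:
        return SHORTHANDS[name]
    if not name.endswith(".mutate"):
        name += ".mutate"
    return name

print(resolve_procedure_name("node2vec"))   # gds.beta.node2vec.mutate
print(resolve_procedure_name("gds.fastRP")) # gds.fastRP.mutate
```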

For example, pre-processing algorithms can be used as node property steps.

3.1. Syntax

Add node property syntax
CALL gds.beta.pipeline.linkPrediction.addNodeProperty(
  pipelineName: String,
  procedureName: String,
  procedureConfiguration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 4. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

procedureName

String

The name of the procedure to be added to the pipeline.

procedureConfiguration

Map

The configuration of the procedure, excluding graphName, nodeLabels and relationshipTypes.

Table 5. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureSteps

List of Map

List of configurations for feature steps.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

3.2. Example

The following will add a node property step to the pipeline:
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42
})
Table 6. Results
name nodePropertySteps featureSteps splitConfig parameterSpace

"pipe"

[{name=gds.fastRP.mutate, config={randomSeed=42, embeddingDimension=256, mutateProperty=embedding}}]

[]

{negativeSamplingRatio=1.0, testFraction=0.1, validationFolds=3, trainFraction=0.1}

{RandomForest=[], LogisticRegression=[]}

The pipeline will now execute the fastRP algorithm in mutate mode both before training a model, and when the trained model is applied for prediction. This ensures the embedding property can be used as an input for link features.

4. Adding link features

A Link Prediction pipeline executes a sequence of steps to compute the features used by a machine learning model. A feature step computes a vector of features for given node pairs. For each node pair, the results are concatenated into a single link feature vector. The order of the features in the link feature vector follows the order of the feature steps. Like with node property steps, the feature steps are also executed both at training and prediction time. The supported methods for obtaining features are described below.

4.1. Syntax

Adding a link feature to a pipeline syntax
CALL gds.beta.pipeline.linkPrediction.addFeature(
  pipelineName: String,
  featureType: String,
  configuration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 7. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

featureType

String

The featureType determines the method used for computing the link feature. See supported types.

configuration

Map

Configuration for the link feature, see the table below.

Table 8. Configuration
Name Type Optional Description

nodeProperties

List of String

no

The names of the node properties that should be used as input.

Table 9. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureSteps

List of Map

List of configurations for feature steps.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

4.2. Supported feature types

A feature step can use node properties that exist in the input graph or are added by the pipeline. For each node in each potential link, the values of nodeProperties are concatenated, in the configured order, into a vector. That is, for each potential link the feature vector for the source node, s = (s_1, s_2, ..., s_d), is combined with the one for the target node, t = (t_1, t_2, ..., t_d), into a single feature vector f.

The supported types of features can then be described as follows:

Table 10. Supported feature types
Feature Type Formula / Description

L2

f = [(s_1 - t_1)^2, (s_2 - t_2)^2, ..., (s_d - t_d)^2]

HADAMARD

f = [s_1 * t_1, s_2 * t_2, ..., s_d * t_d]

COSINE

f = [ (sum_{i=1}^{d} s_i * t_i) / ( sqrt(sum_{i=1}^{d} s_i^2) * sqrt(sum_{i=1}^{d} t_i^2) ) ]
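As a sketch, the three feature types can be computed over plain Python lists. This is illustrative only; GDS computes these features internally:

```python
import math

# Illustrative computation of the supported link feature types for a
# single node pair, given the per-node property vectors s and t.
def l2(s, t):
    return [(si - ti) ** 2 for si, ti in zip(s, t)]

def hadamard(s, t):
    return [si * ti for si, ti in zip(s, t)]

def cosine(s, t):
    dot = sum(si * ti for si, ti in zip(s, t))
    norm_s = math.sqrt(sum(si * si for si in s))
    norm_t = math.sqrt(sum(ti * ti for ti in t))
    return [dot / (norm_s * norm_t)]  # a single-element feature vector

s, t = [1.0, 2.0], [3.0, 4.0]
print(l2(s, t))        # [4.0, 4.0]
print(hadamard(s, t))  # [3.0, 8.0]
```

Note that L2 and HADAMARD produce one feature per property dimension, while COSINE collapses the pair into a single scalar feature.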

4.3. Example

The following will add a feature step to the pipeline:
CALL gds.beta.pipeline.linkPrediction.addFeature('pipe', 'hadamard', {
  nodeProperties: ['embedding', 'numberOfPosts']
}) YIELD featureSteps
Table 11. Results
featureSteps

[{name=HADAMARD, config={nodeProperties=[embedding, numberOfPosts]}}]

When executing the pipeline, the nodeProperties must be either present in the input graph, or created by a previous node property step. For example, the embedding property could be created by the previous example, and we expect numberOfPosts to already be present in the in-memory graph used as input, at train and predict time.

5. Configuring the relationship splits

Link Prediction training pipelines manage splitting the relationships into several sets and add sampled negative relationships to some of these sets. Configuring the splitting is optional, and if omitted, splitting will be done using default settings.

The splitting configuration of a pipeline can be inspected by using gds.beta.pipeline.list and possibly only yielding splitConfig.

The splitting of relationships proceeds internally in the following steps:

  1. The graph is filtered according to specified nodeLabels and relationshipTypes, which are configured at train time.

  2. The relationships remaining after filtering are called positive. They are split into a test set and a set of remaining relationships, which we refer to as the test-complement set.

    • The test set contains a testFraction fraction of the positive relationships.

    • Random negative relationships are added to the test set. The number of negative relationships is the number of positive ones multiplied by the negativeSamplingRatio.

    • The negative relationships do not coincide with positive relationships.

  3. The relationships in the test-complement set are split into a train set and a feature-input set.

    • The train set contains a trainFraction fraction of the test-complement set.

    • The feature-input set contains (1-trainFraction) fraction of the test-complement set.

    • Random negative relationships are added to the train set. The number of negative relationships is the number of positive ones multiplied by the negativeSamplingRatio.

    • The negative relationships do not coincide with positive relationships, nor with test relationships.

The sampled positive and negative relationships are given relationship weights of 1.0 and 0.0 respectively so that they can be distinguished.
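To make the arithmetic of the split concrete, the sketch below computes approximate sizes of the derived sets from a split configuration. The rounding behavior here is an assumption; the actual GDS implementation may round differently:

```python
def split_sizes(n_positive, test_fraction, train_fraction, negative_sampling_ratio):
    """Approximate sizes of the relationship sets derived by the split.

    Rounding is illustrative; GDS may round differently.
    """
    test_pos = round(n_positive * test_fraction)
    test_complement = n_positive - test_pos
    train_pos = round(test_complement * train_fraction)
    feature_input = test_complement - train_pos
    return {
        # test and train sets contain positives plus sampled negatives
        "test": test_pos + round(test_pos * negative_sampling_ratio),
        "train": train_pos + round(train_pos * negative_sampling_ratio),
        # the feature-input set contains only positive relationships
        "featureInput": feature_input,
    }

# 1000 positive relationships with the default split configuration:
print(split_sizes(1000, test_fraction=0.1, train_fraction=0.1,
                  negative_sampling_ratio=1.0))
# {'test': 200, 'train': 180, 'featureInput': 810}
```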

The feature-input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties.

The train and test relationship sets are used for:

  • determining the label (positive or negative) for each training or test example

  • identifying the node pair for which link features are to be computed

However, they are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature.

5.1. Syntax

Configure the relationship split syntax
CALL gds.beta.pipeline.linkPrediction.configureSplit(
  pipelineName: String,
  configuration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 12. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

configuration

Map

Configuration for splitting the relationships.

Table 13. Configuration
Name Type Default Description

validationFolds

Integer

3

Number of divisions of the training graph used during model selection.

testFraction

Double

0.1

Fraction of the graph reserved for testing. Must be in the range (0, 1).

trainFraction

Double

0.1

Fraction of the test-complement set reserved for training. Must be in the range (0, 1).

negativeSamplingRatio

Double

1.0

The desired ratio of negative to positive samples in the test and train sets.

Table 14. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureSteps

List of Map

List of configurations for feature steps.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

5.2. Example

The following will configure the splitting of the pipeline:
CALL gds.beta.pipeline.linkPrediction.configureSplit('pipe', {
  testFraction: 0.25,
  trainFraction: 0.6,
  validationFolds: 3
})
YIELD splitConfig
Table 15. Results
splitConfig

{negativeSamplingRatio=1.0, testFraction=0.25, validationFolds=3, trainFraction=0.6}

We now reconfigured the splitting of the pipeline, which will be applied during training.

6. Adding model candidates

A pipeline contains a collection of configurations for model candidates which is initially empty. This collection is called the parameter space. One or more model configurations must be added to the parameter space of the training pipeline, using one of the following procedures:

  • gds.beta.pipeline.linkPrediction.addLogisticRegression

  • gds.alpha.pipeline.linkPrediction.addRandomForest

For information about the available training methods in GDS, logistic regression and random forest, see Training methods.

In Training the pipeline, we explain further how the configured model candidates are trained, evaluated and compared.

The parameter space of a pipeline can be inspected using gds.beta.pipeline.list and optionally yielding only parameterSpace.

At least one model candidate must be added to the pipeline before training it.

6.1. Syntax

Configure the train parameters syntax
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression(
  pipelineName: String,
  config: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: Map
Table 16. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

config

Map

The logistic regression config for a model candidate. The allowed parameters for a model are defined in the next table.

Table 17. Logistic regression configuration
Name Type Default Optional Description

penalty

Float

0.0

yes

Penalty used for the logistic regression. By default, no penalty is applied.

batchSize

Integer

100

yes

Number of nodes per batch.

minEpochs

Integer

1

yes

Minimum number of training epochs.

maxEpochs

Integer

100

yes

Maximum number of training epochs.

patience

Integer

1

yes

Maximum number of unproductive consecutive epochs.

tolerance

Float

0.001

yes

The minimal improvement of the loss to be considered productive.
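A rough sketch of how minEpochs, maxEpochs, patience, and tolerance could interact in an early-stopping training loop. The stopping rule below is an assumption based on the parameter descriptions above; the exact GDS behavior may differ:

```python
def train_with_early_stopping(losses, min_epochs=1, max_epochs=100,
                              patience=1, tolerance=0.001):
    """Return the number of epochs run, given a precomputed sequence of
    per-epoch losses. Illustrative only: an epoch is 'unproductive' when
    the loss improves by less than `tolerance`, and training stops after
    more than `patience` consecutive unproductive epochs (but never
    before `min_epochs`).
    """
    best = float("inf")
    unproductive = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if best - loss < tolerance:
            unproductive += 1
        else:
            unproductive = 0
        best = min(best, loss)
        if epoch >= min_epochs and unproductive > patience:
            return epoch
    return min(len(losses), max_epochs)

# The loss stalls after the third epoch; with patience=1 training stops early:
print(train_with_early_stopping([1.0, 0.5, 0.25, 0.2499, 0.2499, 0.2498]))  # 5
```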

Table 18. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureSteps

List of Map

List of configurations for feature steps.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

Configure the train parameters syntax
CALL gds.alpha.pipeline.linkPrediction.addRandomForest(
  pipelineName: String,
  config: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: Map
Table 19. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

config

Map

The random forest config for a model candidate. The allowed parameters for a model are defined in the next table.

Table 20. Random Forest configuration
Name Type Default Optional Description

maxFeaturesRatio

Float

1 / sqrt(|features|)

yes

The ratio of features to consider when looking for the best split

numberOfSamplesRatio

Float

1.0

yes

The ratio of samples to consider per decision tree. We use sampling with replacement. A value of 0 indicates using every training example (no sampling).

numberOfDecisionTrees

Integer

100

yes

The number of decision trees.

maxDepth

Integer

No max depth

yes

The maximum depth of a decision tree.

minSplitSize

Integer

2

yes

The minimum number of samples required to split an internal node.

Table 21. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureSteps

List of Map

List of configurations for feature steps.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

6.2. Example

We can add multiple model candidates to our pipeline.

The following will add a logistic regression model:
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe', {penalty: 0.0625})
YIELD parameterSpace
The following will add a random forest model:
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('pipe', {numberOfDecisionTrees: 5})
YIELD parameterSpace
The following will add another logistic regression model:
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe', {maxEpochs: 500})
YIELD parameterSpace
RETURN parameterSpace.RandomForest AS randomForestSpace, parameterSpace.LogisticRegression AS logisticRegressionSpace
Table 22. Results
randomForestSpace logisticRegressionSpace

[{maxDepth=2147483647, minSplitSize=2, numberOfDecisionTrees=5, methodName=RandomForest, numberOfSamplesRatio=1.0}]

[{maxEpochs=100, minEpochs=1, penalty=0.0625, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001}, {maxEpochs=500, minEpochs=1, penalty=0.0, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001}]

The parameterSpace in the pipeline now contains the three different model candidates, expanded with the default values. Each specified model candidate will be tried out during the model selection in training.

These are somewhat naive examples of how to add and configure model candidates. Please see Training methods for more information on how to tune the configuration parameters of each method.

7. Training the pipeline

The train mode, gds.beta.pipeline.linkPrediction.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use. Running this mode results in a prediction model of type LinkPrediction being stored in the model catalog along with metrics collected during training. The model can be applied to a possibly different graph, producing a new relationship type of predicted links, each of which has a predicted probability stored as a property.

More precisely, the procedure will in order:

  1. Apply the nodeLabels and relationshipTypes filters to the graph. All subsequent graphs have the same node set.

  2. Create a relationship split of the graph into test, train and feature-input sets as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.

  3. Apply the node property steps, added according to Adding node properties, on the feature-input graph.

  4. Apply the feature steps, added according to Adding link features, to the train graph, which yields for each train relationship an instance, that is, a feature vector and a binary label.

  5. Split the training instances using stratified k-fold cross-validation. The number of folds k can be configured using validationFolds in gds.beta.pipeline.linkPrediction.configureSplit.

  6. Train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The evaluation uses the AUCPR metric.

  7. Declare as winner the model with the highest average metric across the folds.

  8. Re-train the winning model on the whole training set and evaluate it on both the train and test sets. In order to evaluate on the test set, the feature pipeline is first applied again as for the train set.

  9. Register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by previously added steps.
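Steps 5-7 above (model selection via cross-validation) can be sketched as follows. This is a simplification: the `evaluate` callback stands in for training a candidate on a fold's training portion and scoring it on the validation portion with AUCPR, and the candidate names and scores are fabricated for illustration:

```python
# Simplified sketch of k-fold model selection (steps 5-7 above).
def select_best_model(candidates, folds, evaluate):
    """Pick the candidate with the highest average validation score."""
    best_candidate, best_avg = None, float("-inf")
    for candidate in candidates:
        scores = [evaluate(candidate, train, val) for train, val in folds]
        avg = sum(scores) / len(scores)
        if avg > best_avg:
            best_candidate, best_avg = candidate, avg
    return best_candidate, best_avg

# Toy example with fabricated per-fold validation scores:
folds = [(None, None)] * 3
fake_scores = {"logReg": [0.7, 0.8, 0.75], "randomForest": [0.9, 0.85, 0.95]}
calls = {"logReg": 0, "randomForest": 0}

def evaluate(candidate, train, val):
    score = fake_scores[candidate][calls[candidate]]
    calls[candidate] += 1
    return score

winner, avg = select_best_model(["logReg", "randomForest"], folds, evaluate)
print(winner, round(avg, 3))  # randomForest 0.9
```

The winning candidate is then retrained on the whole training set, as described in step 8.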

7.1. Syntax

Run Link Prediction in train mode on a named graph:
CALL gds.beta.pipeline.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  modelSelectionStats: Map,
  configuration: Map
Table 23. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 24. Configuration
Name Type Default Optional Description

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

pipeline

String

n/a

no

The name of the pipeline to execute.

negativeClassWeight

Float

1.0

yes

Weight of negative examples in model evaluation. Positive examples have weight 1.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 25. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

modelSelectionStats

Map

Statistics about evaluated metrics for all model candidates.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Table 26. Model info fields
Name Type Description

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below.

trainingPipeline

Map

The pipeline used for the training.

The structure of modelInfo is:

{
    bestParameters: Map,        (1)
    trainingPipeline: Map       (2)
    metrics: {                  (3)
        AUCPR: {
            test: Float,        (4)
            outerTrain: Float,  (5)
            train: {           (6)
                avg: Float,
                max: Float,
                min: Float,
            },
            validation: {      (7)
                avg: Float,
                max: Float,
                min: Float
            }
        }
    }
}
1 The best scoring model candidate configuration.
2 The pipeline used for the training.
3 The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric.
4 Numeric value for the evaluation of the best model on the test set.
5 Numeric value for the evaluation of the best model on the outer train set.
6 The train entry summarizes the metric results over the train set.
7 The validation entry summarizes the metric results over the validation set.

7.2. Example

In this example we will create a small graph and use the training pipeline we have built up thus far. The graph consists of a handful of nodes connected in a particular pattern. The example graph looks like this:

Visualization of the example graph
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (alice:Person {name: 'Alice', numberOfPosts: 38}),
  (michael:Person {name: 'Michael', numberOfPosts: 67}),
  (karin:Person {name: 'Karin', numberOfPosts: 30}),
  (chris:Person {name: 'Chris', numberOfPosts: 132}),
  (will:Person {name: 'Will', numberOfPosts: 6}),
  (mark:Person {name: 'Mark', numberOfPosts: 32}),
  (greg:Person {name: 'Greg', numberOfPosts: 29}),
  (veselin:Person {name: 'Veselin', numberOfPosts: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin);

With the graph in Neo4j we can now project it into the graph catalog. We do this using a native projection targeting the Person nodes and the KNOWS relationships. We will also project the numberOfPosts property, so it can be used when creating link features. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction pipelines are defined only for undirected graphs.

The following statement will project a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.project(
  'myGraph',
  {
    Person: {
      properties: ['numberOfPosts']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

7.2.1. Memory Estimation

First off, we will estimate the cost of training the pipeline by using the estimate procedure. Estimation is useful to understand the memory impact that training the pipeline on your graph will have. When actually training the pipeline the system will perform an estimation and prohibit the execution if the estimation shows there is a very high probability of the execution running out of memory. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for training the pipeline:
CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model'
})
YIELD requiredMemory
Table 27. Results
requiredMemory

"[58 KiB ... 1807 KiB]"

7.2.2. Training

The following will train a model using a pipeline:
CALL gds.beta.pipeline.linkPrediction.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  randomSeed: 42
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore
Table 28. Results
winningModel avgTrainScore outerTrainScore testScore

{maxDepth=2147483647, minSplitSize=2, numberOfDecisionTrees=5, methodName=RandomForest, numberOfSamplesRatio=1.0}

0.904365079365079

0.971428571428572

0.583333333333333

We can see that the RandomForest model configuration with numberOfDecisionTrees = 5 (and defaults filled in for the remaining parameters) was selected, and scored 0.58 on the test set. The score is computed using the AUCPR metric, which lies in the range [0, 1]. A model that assigns a higher score to every actual link than to every non-link achieves a score of 1.0, while a model that assigns random scores will on average score 0.5.
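This interpretation of AUCPR can be checked with a small hand-rolled sketch using step-wise interpolation of the precision-recall curve; GDS's exact computation may differ in interpolation details:

```python
def aucpr(scores, labels):
    """Area under the precision-recall curve via step interpolation.
    Illustrative; interpolation details may differ from GDS."""
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    for score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        # accumulate area only where recall increases
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

# A model ranking every actual link above every non-link scores 1.0:
print(aucpr([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```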

8. Applying a trained model for prediction

In the previous sections we have seen how to build up a Link Prediction training pipeline and train it to produce a predictive model. After training, the runnable model is of type LinkPrediction and resides in the model catalog.

The trained model can then be applied to a graph in the graph catalog to create a new relationship type containing the predicted links. The relationships also have a property which stores the predicted probability of the link, which can be seen as a relative measure of the model’s prediction confidence.

Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.

When using the model for prediction, the relationships of the input graph are used in two ways. First, the input graph is fed into the feature pipeline and therefore influences predictions if there is at least one step in the pipeline which uses the input relationships (typically any node property step does). Second, predictions are carried out on each node pair that is not connected in the input graph.

The predicted links are sorted by score before the ones having score below the configured threshold are discarded. Finally, the configured topN predictions are stored back to the projected graph.
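In the exhaustive case, the threshold/topN post-processing can be sketched as below. The `score` callback is a stand-in for the trained model's predicted probability, and the example graph and probabilities are fabricated for illustration:

```python
from itertools import combinations

def predict_links(nodes, existing, score, threshold=0.0, top_n=10):
    """Score every non-adjacent node pair, drop predictions below the
    threshold, and keep the top_n highest-scoring ones. Illustrative
    sketch of the exhaustive search post-processing."""
    candidates = [
        (a, b) for a, b in combinations(nodes, 2)
        if (a, b) not in existing and (b, a) not in existing
    ]
    scored = [(pair, score(*pair)) for pair in candidates]
    scored = [(pair, p) for pair, p in scored if p >= threshold]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Toy graph: 4 nodes, one existing relationship, a fabricated scorer.
probs = {("a", "c"): 0.9, ("a", "d"): 0.2, ("b", "c"): 0.6,
         ("b", "d"): 0.4, ("c", "d"): 0.1}
result = predict_links(
    nodes=["a", "b", "c", "d"],
    existing={("a", "b")},
    score=lambda x, y: probs[(x, y)],
    threshold=0.3,
    top_n=2,
)
print(result)  # [(('a', 'c'), 0.9), (('b', 'c'), 0.6)]
```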

It is necessary that the predict graph contains the properties that the pipeline requires and that the used array properties have the same dimensions as in the train graph. If the predict and train graphs are distinct, it is also beneficial that they have similar origins and semantics, so that the model is able to generalize well.

8.1. Syntax

Link Prediction syntax per mode
Run Link Prediction in mutate mode on a named graph:
CALL gds.beta.pipeline.linkPrediction.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  probabilityDistribution: Integer,
  samplingStats: Map,
  configuration: Map
Table 29. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 30. Configuration
Name Type Default Optional Description

modelName

String

n/a

no

The name of a Link Prediction model in the model catalog.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateRelationshipType

String

n/a

no

The relationship type used for the new relationships written to the projected graph.

mutateProperty

String

'probability'

yes

The relationship property in the GDS graph to which the result is written.

Table 31. Algorithm specific configuration
Name Type Default Optional Description

sampleRate

Float

n/a

no

Sample rate to determine how many links are considered for each node. If set to 1, all possible links are considered, i.e., exhaustive search. Otherwise, a kNN-based approximate search will be used. Value must be between 0 (exclusive) and 1 (inclusive).

topN [1]

Integer

n/a

no

Limit on predicted relationships to output.

threshold [1]

Float

0.0

yes

Minimum predicted probability on relationships to output.

topK [2]

Integer

10

yes

Limit on number of predicted relationships to output for each node. This value cannot be lower than 1.

deltaThreshold [2]

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations [2]

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins [2]

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

initialSampler [2]

String

"uniform"

yes

The method used to sample the first k random neighbors for each node. "uniform" and "randomWalk", both case-insensitive, are valid inputs.

randomSeed [2]

Integer

n/a

yes

The seed value to control the randomness of the algorithm. Note that concurrency must be set to 1 when setting this parameter.

1. Only applicable in the exhaustive search.

2. Only applicable in the approximate strategy. For more details, see the syntax section of kNN.

Table 32. Results
Name Type Description

preProcessingMillis

Integer

Milliseconds for preprocessing the graph.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing the global metrics.

mutateMillis

Integer

Milliseconds for adding properties to the projected graph.

relationshipsWritten

Integer

Number of relationships created.

probabilityDistribution

Map

Description of distribution of predicted probabilities.

samplingStats

Map

Description of how predictions were sampled.

configuration

Map

Configuration used for running the algorithm.

Run Link Prediction in stream mode on a named graph:
CALL gds.beta.pipeline.linkPrediction.predict.stream(
  graphName: String,
  configuration: Map
)
YIELD
  node1: Integer,
  node2: Integer,
  probability: Float
Table 33. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 34. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a Link Prediction model in the model catalog.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 35. Algorithm specific configuration
Name Type Default Optional Description

sampleRate

Float

n/a

no

Sample rate to determine how many links are considered for each node. If set to 1, all possible links are considered, i.e., exhaustive search. Otherwise, a kNN-based approximate search will be used. Value must be between 0 (exclusive) and 1 (inclusive).

topN [3]

Integer

n/a

no

Limit on predicted relationships to output.

threshold [3]

Float

0.0

yes

Minimum predicted probability on relationships to output.

topK [4]

Integer

10

yes

Limit on number of predicted relationships to output for each node. This value cannot be lower than 1.

deltaThreshold [4]

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations [4]

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins [4]

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

initialSampler [4]

String

"uniform"

yes

The method used to sample the first k random neighbors for each node. "uniform" and "randomWalk", both case-insensitive, are valid inputs.

randomSeed [4]

Integer

n/a

yes

The seed value to control the randomness of the algorithm. Note that concurrency must be set to 1 when setting this parameter.

3. Only applicable in the exhaustive search.

4. Only applicable in the approximate strategy. For more details, see the syntax section of kNN.

Table 36. Results
Name Type Description

node1

Integer

Node ID of the first node.

node2

Integer

Node ID of the second node.

probability

Float

Predicted probability of a link between the nodes.

In this example we will show how to use a trained model to predict new relationships in your projected graph. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model named lp-pipeline-model, which we trained in the train example. The algorithm excludes predictions for existing relationships in the graph as well as self-loops.

There are two different strategies for choosing which node pairs to consider when predicting new links, exhaustive search and approximate search. Whereas the former considers all possible new links, the latter will use a randomized strategy that only considers a subset of them in order to run faster. We will explain each individually with examples in the mutate examples below.

The relationships that are produced by the write and mutate procedures are undirected, just like the input. However, no parallel relationships are produced. For example, if, when doing approximate search, a — b is among the top predictions for a, and b — a is among the top predictions for b, then still only one undirected relationship a — b is produced. The stream procedure will yield a node pair only once.
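This deduplication of undirected predictions can be sketched in Python; the node pairs below are hypothetical:

```python
# Sketch: deduplicating undirected predictions, where the same pair
# may appear in both directions across per-node top-K lists.
def dedup_undirected(predictions):
    """predictions: iterable of (node_a, node_b, probability) tuples."""
    seen = set()
    result = []
    for a, b, p in predictions:
        key = (min(a, b), max(a, b))  # canonical, order-independent form
        if key not in seen:
            seen.add(key)
            result.append((key[0], key[1], p))
    return result

pairs = [(0, 3, 0.8), (3, 0, 0.8), (1, 2, 0.6)]
print(dedup_undirected(pairs))  # [(0, 3, 0.8), (1, 2, 0.6)]
```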

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the stream mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for applying the model:
CALL gds.beta.pipeline.linkPrediction.predict.stream.estimate('myGraph', {
  modelName: 'lp-pipeline-model',
  topN: 5,
  threshold: 0.45
})
YIELD requiredMemory
Table 37. Results
requiredMemory

"24 KiB"

The following will run the algorithm and stream the results:
CALL gds.beta.pipeline.linkPrediction.predict.stream('myGraph', {
  modelName: 'lp-pipeline-model',
  topN: 5,
  threshold: 0.45
})
 YIELD node1, node2, probability
 RETURN gds.util.asNode(node1).name AS person1, gds.util.asNode(node2).name AS person2, probability
 ORDER BY probability DESC, person1

We specified threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships.
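The combined effect of threshold and topN can be sketched as follows, using hypothetical prediction tuples and assuming the threshold is inclusive:

```python
# Sketch: keep only predictions at or above the threshold, then
# return the top-N of those by probability (hypothetical data).
def filter_predictions(preds, threshold, top_n):
    kept = [(a, b, p) for a, b, p in preds if p >= threshold]
    kept.sort(key=lambda t: t[2], reverse=True)
    return kept[:top_n]

preds = [("Alice", "Mark", 0.6), ("Michael", "Veselin", 0.8),
         ("Greg", "Bob", 0.3)]
print(filter_predictions(preds, 0.45, 5))
```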

Table 38. Results
person1 person2 probability

"Michael"

"Veselin"

0.8

"Alice"

"Mark"

0.6

"Alice"

"Will"

0.6

"Greg"

"Veselin"

0.6

"Karin"

"Greg"

0.6

We can see that our model predicts the most likely link to be between Michael and Veselin.

In these examples we will show how to write the predictions to your projected graph. We will use the model lp-pipeline-model, that we trained in the train example.

The exhaustive search will simply run through all possible new links, that is, check all node pairs that are not already connected by a relationship. For each such node pair, the model we trained will be used to predict whether the nodes should be connected by a link or not.

CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
  modelName: 'lp-pipeline-model',
  relationshipTypes: ['KNOWS'],
  mutateRelationshipType: 'KNOWS_EXHAUSTIVE_PREDICTED',
  topN: 5,
  threshold: 0.45
}) YIELD relationshipsWritten, samplingStats

We specify threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships. Note that we omit setting the sampleRate in our configuration, as it defaults to 1, implying that the exhaustive search strategy is used. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.

Table 39. Results
relationshipsWritten samplingStats

10

{linksConsidered=16, strategy=exhaustive}

As we can see in the samplingStats, we use the exhaustive search strategy and check 16 possible links during the prediction. Indeed, since there are a total of 8 * (8 - 1) / 2 = 28 possible links in the graph and we already have 12, that means we check all possible new links. Although 16 links were considered, we only mutate the best five (since topN = 5) that are above our threshold.
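The counting argument can be spelled out in a few lines:

```python
# The arithmetic above: 8 nodes and 12 existing undirected
# relationships, as in the example graph.
n, existing = 8, 12
possible = n * (n - 1) // 2      # all undirected node pairs: 28
new_links = possible - existing  # non-adjacent pairs to score: 16
print(possible, new_links)       # 28 16
```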

If our graph is very large, there may be a great many possible new links, and considering all of them (due to the inherent quadratic complexity over node count) may take a very long time. A more viable option can therefore be to use an approximate search strategy that only looks at a subset of all possible new links.

The approximate search strategy lets us leverage the K-Nearest Neighbors algorithm, with our model's prediction function as its similarity measure, to trade accuracy for lower runtime. Accuracy in this context refers to how close we come to finding the very best new links according to our model's predictions, i.e. the best predictions that would be made by the exhaustive search.

The initial set of considered links for each node is picked at random and then refined in multiple iterations based on previously predicted links. The number of iterations is limited by the configuration parameter maxIterations, and we also limit the number of random links considered between kNN iterations using randomJoins. The algorithm may stop earlier if the link predictions per node only change by a small amount, which can be controlled by the configuration parameter deltaThreshold. See the K-Nearest Neighbors documentation for more details on how the search works.
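A heavily simplified sketch of such a refinement loop is shown below. The scorer, node set, and parameter handling are hypothetical stand-ins, and the convergence check uses a plain "no updates" test rather than deltaThreshold; the real GDS kNN implementation is considerably more elaborate:

```python
import random

# Simplified sketch of the kNN-style refinement loop described above.
def approximate_search(nodes, predict, top_k, max_iterations, random_joins,
                       seed=42):
    rng = random.Random(seed)
    # Start from random candidate neighbors for each node.
    best = {v: set(rng.sample([u for u in nodes if u != v], top_k))
            for v in nodes}
    for _ in range(max_iterations):
        updates = 0
        for v in nodes:
            # Candidates: current neighbors, their neighbors, and a few
            # random joins to escape local optima.
            candidates = set(best[v])
            for u in best[v]:
                candidates |= best[u]
            candidates |= set(rng.sample(nodes, random_joins))
            candidates.discard(v)
            # Keep the top-k candidates by predicted link probability.
            top = set(sorted(candidates, key=lambda u: predict(v, u),
                             reverse=True)[:top_k])
            if top != best[v]:
                best[v] = top
                updates += 1
        if updates == 0:  # predictions stopped changing: stop early
            break
    return best

# Toy scorer: prefer pairs of numerically close node IDs.
result = approximate_search(list(range(8)), lambda a, b: -abs(a - b),
                            top_k=1, max_iterations=3, random_joins=2)
```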

CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
  modelName: 'lp-pipeline-model',
  relationshipTypes: ['KNOWS'],
  mutateRelationshipType: 'KNOWS_APPROX_PREDICTED',
  sampleRate: 0.5,
  topK: 1,
  randomJoins: 2,
  maxIterations: 3,
  // necessary for deterministic results
  concurrency: 1,
  randomSeed: 42
})
 YIELD relationshipsWritten, samplingStats

In order to use the approximate strategy we make sure to set the sampleRate explicitly to a value < 1.0. In this small example we set topK = 1 to only get one link predicted for each node. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.

Table 40. Results
relationshipsWritten samplingStats

16

{linksConsidered=49, didConverge=true, strategy=approximate, ranIterations=3}

As we can see in the samplingStats, we use the approximate search strategy and check 49 possible links during the prediction. Though in this small example we actually consider more links than in the exhaustive case, this will typically not be the case for larger graphs. Since the relationships we write are undirected, the reported relationshipsWritten is 16 when we search for the best (topK = 1) prediction for each node.

9. Appendix

This section details some theoretical concepts related to how link prediction is performed in GDS. It’s not strictly required reading but can be helpful in improving understanding.

9.1. Metrics

The Link Prediction pipeline in the Neo4j GDS library supports only the Area Under the Precision-Recall Curve metric, abbreviated as AUCPR. In order to compute precision and recall, we require a set of examples, each of which has a positive or negative label. For each example we also have a predicted label. Given the true and predicted labels, we can compute precision and recall (for reference, see e.g. Wikipedia).

Then, to compute the AUCPR, we construct the precision-recall curve, as follows:

  • Each prediction is associated with a prediction strength. We sort the examples in descending order of prediction strength.

  • For all prediction strengths that occur, we use that strength as a threshold and consider all examples of that strength or higher to be positively labeled.

  • We now compute precision p and recall r and consider the tuple (r, p) as a point on a curve, the precision-recall curve.

  • Finally, the curve is linearly interpolated and the area is computed as a union of trapezoids with corners on the points.
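The steps above can be sketched in Python. The example data is hypothetical, and for simplicity this sketch scores one example at a time rather than grouping tied prediction strengths:

```python
# Sketch: AUCPR from labeled predictions via the trapezoid rule.
def aucpr(examples):
    """examples: list of (prediction_strength, is_positive) pairs."""
    # Sort in descending order of prediction strength.
    examples = sorted(examples, key=lambda e: e[0], reverse=True)
    total_positives = sum(1 for _, positive in examples if positive)
    points = []  # (recall, precision) points on the curve
    tp = fp = 0
    for _, positive in examples:
        if positive:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_positives, tp / (tp + fp)))
    # Area as a union of trapezoids between consecutive curve points.
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

examples = [(0.9, True), (0.8, False), (0.7, True), (0.3, False)]
print(round(aucpr(examples), 4))  # 0.7917
```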

The curve will have a shape that looks something like this:

precision-recall curve with trapezoid

Note here the blue area which shows one trapezoid under the curve.

The area under the Precision-Recall curve can also be interpreted as an average precision where the average is over different classification thresholds.

9.2. Class imbalance

Most graphs have far more non-adjacent node pairs than adjacent ones (e.g. sparse graphs). Thus, typically we have an issue with class imbalance. There are multiple strategies to account for imbalanced data. In the pipeline training procedure, the AUCPR metric is used. It is considered more suitable than the commonly used AUROC (Area Under the Receiver Operating Characteristic) metric for imbalanced data. For the metric to appropriately reflect both positive (adjacent node pairs) and negative (non-adjacent node pairs) examples, we provide the ability to both control the ratio of sampling between the classes, and to control the relative weight of classes via negativeClassWeight. The former is configured by the configuration parameter negativeSamplingRatio in configureSplits when using that procedure to generate the train and test sets. Tuning the negativeClassWeight, which is explained below, means weighting up or down the false positives when computing precision.
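Based on the description above, the effect of negativeClassWeight on precision can be sketched as follows; the exact formula used internally by GDS is an assumption here:

```python
# Sketch: negativeClassWeight scales the false positives when
# computing precision (assumed formula, per the description above).
def weighted_precision(tp, fp, negative_class_weight):
    return tp / (tp + negative_class_weight * fp)

# With weight 1.0 this is ordinary precision; a weight below 1.0
# down-weights false positives, raising the reported precision.
print(weighted_precision(8, 4, 1.0))   # 8 / 12
print(weighted_precision(8, 4, 0.25))  # 8 / 9
```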

The recommended value for negativeSamplingRatio is the true class ratio of the graph, in other words, not applying undersampling. However, the higher the value, the bigger the test set and thus the time to evaluate. The ratio of total probability mass of negative versus positive examples in the test set is approximately negativeSamplingRatio * negativeClassWeight. Thus, both of these parameters can be adjusted in tandem to trade off evaluation accuracy with speed.

The true class ratio is computed as (q - r) / r, where q = n(n-1)/2 is the number of possible undirected relationships, and r is the number of actual undirected relationships. Please note that the relationshipCount reported by the graph list procedure is the directed count of relationships summed over all existing relationship types. Thus, we recommend using Cypher to obtain r on the source Neo4j graph. For example, this query will count the number of relationships of type T or R:

MATCH (a)-[rel:T|R]-(b)
WHERE id(a) < id(b)
RETURN count(rel) AS r
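As a worked example of the formula, with hypothetical counts:

```python
# The true class ratio (q - r) / r from above, for a hypothetical
# graph with 8 nodes and 12 undirected relationships.
n, r = 8, 12
q = n * (n - 1) // 2         # possible undirected relationships: 28
true_class_ratio = (q - r) / r
print(true_class_ratio)      # 16 / 12, i.e. about 1.33
```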

When choosing a value for negativeClassWeight, two factors should be considered. First, the desired ratio of total probability mass of negative versus positive examples in the test set. Second, what the ratio of sampled negative examples to positive examples was in the test set. To be consistent with traditional evaluation, one should choose parameters so that negativeSamplingRatio * negativeClassWeight = 1.0, for example by setting the values to the true class ratio and its reciprocal, or both values to 1.0.

Alternatively, one can aim for the ratio of total probability weight between the classes to be close to the true class ratio. That is, making sure negativeSamplingRatio * negativeClassWeight is close to the true class ratio. The reported metric (AUCPR) then better reflects the expected precision on unseen highly imbalanced data. With this type of evaluation one has to adjust expectations as the metric value then becomes much smaller.