Link Prediction Pipelines

This section describes Link Prediction Pipelines in the Neo4j Graph Data Science library.

1. Introduction

Link prediction is a common machine learning task applied to graphs: training a model to learn, between pairs of nodes in a graph, where relationships should exist. More precisely, the input to the machine learning model consists of examples of node pairs which are labeled as connected or not connected. The GDS library provides Link prediction, which is documented separately. Here we describe an additional method that provides an end-to-end Link prediction experience. In addition to managing a predictive model, it also manages:

  • splitting relationships into subsets for test, train and feature input

  • a pipeline of processing steps that supply custom features for the model

The motivations for using pipelines are:

  • easier to get splits right and prevent data leakage

  • ensuring that the same feature creation steps are applied at predict and train time

  • applying the trained model with a single procedure call

  • persisting the pipeline as a whole

2. Creating a pipeline

The first step of building a new pipeline is to create one using gds.alpha.ml.pipeline.linkPrediction.create. This stores a trainable model object of type Link prediction training pipeline in the model catalog. This object represents a configurable pipeline that can later be invoked for training, which in turn creates a trained pipeline. The trained pipeline is also a model, stored in the catalog with type Link prediction pipeline.

2.1. Syntax

Create pipeline syntax
CALL gds.alpha.ml.pipeline.linkPrediction.create(
  pipelineName: String
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 1. Parameters
Name | Type | Description
pipelineName | String | The name of the created pipeline.

Table 2. Results
Name | Type | Description
name | String | Name of the pipeline.
nodePropertySteps | List of Map | List of configurations for node property steps.
featureSteps | List of Map | List of configurations for feature steps.
splitConfig | Map | Configuration to define the split before the model training.
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection.

2.2. Example

The following will create a pipeline:
CALL gds.alpha.ml.pipeline.linkPrediction.create('pipe')
Table 3. Results
name | nodePropertySteps | featureSteps | splitConfig | parameterSpace
"pipe" | [] | [] | {negativeSamplingRatio=1.0, testFraction=0.1, validationFolds=3, trainFraction=0.1} | [{useBiasFeature=true, maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001, concurrency=4}]

This shows that the newly created pipeline does not contain any steps yet, and has defaults for the split and train parameters.

3. Adding node properties

A link prediction pipeline can execute one or several GDS algorithms in mutate mode that create node properties in the in-memory graph. Such steps can be chained one after another, and the properties they create can be used as input for link features. Moreover, the node property steps that are added to the pipeline will be executed both when training a model and when the trained model is applied for prediction.

The name of the procedure that should be added can be a fully qualified GDS procedure name ending with .mutate. The ending .mutate may be omitted and one may also use shorthand forms such as node2vec instead of gds.beta.node2vec.mutate.

For example, pre-processing algorithms can be used as node property steps.

3.1. Syntax

Add node property syntax
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty(
  pipelineName: String,
  procedureName: String,
  procedureConfiguration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 4. Parameters
Name | Type | Description
pipelineName | String | The name of the pipeline.
procedureName | String | The name of the procedure to be added to the pipeline.
procedureConfiguration | Map | The configuration of the procedure, excluding graphName, nodeLabels and relationshipTypes.

Table 5. Results
Name | Type | Description
name | String | Name of the pipeline.
nodePropertySteps | List of Map | List of configurations for node property steps.
featureSteps | List of Map | List of configurations for feature steps.
splitConfig | Map | Configuration to define the split before the model training.
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection.

3.2. Example

The following will add a node property step to the pipeline:
CALL gds.alpha.ml.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42
})
Table 6. Results
name | nodePropertySteps | featureSteps | splitConfig | parameterSpace
"pipe" | [{name=gds.fastRP.mutate, config={randomSeed=42, embeddingDimension=256, mutateProperty=embedding}}] | [] | {negativeSamplingRatio=1.0, testFraction=0.1, validationFolds=3, trainFraction=0.1} | [{useBiasFeature=true, maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001, concurrency=4}]

The pipeline will now execute the fastRP algorithm in mutate mode both before training a model, and when the trained model is applied for prediction. This ensures the embedding property can be used as an input for link features.

4. Adding link features

A Link Prediction pipeline executes a sequence of steps to compute the features used by a machine learning model. A feature step computes a vector of features for given node pairs. For each node pair, the results are concatenated into a single link feature vector. The order of the features in the link feature vector follows the order of the feature steps. As with node property steps, the feature steps are executed both at training and at prediction time. The supported methods for obtaining features are described below.

4.1. Syntax

Adding a link feature to a pipeline syntax
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature(
  pipelineName: String,
  featureType: String,
  configuration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 7. Parameters
Name | Type | Description
pipelineName | String | The name of the pipeline.
featureType | String | The featureType determines the method used for computing the link feature. See supported types.
configuration | Map | Configuration for adding the link feature.

Table 8. Configuration
Name | Type | Optional | Description
nodeProperties | List of String | no | The names of the node properties that should be used as input.

Table 9. Results
Name | Type | Description
name | String | Name of the pipeline.
nodePropertySteps | List of Map | List of configurations for node property steps.
featureSteps | List of Map | List of configurations for feature steps.
splitConfig | Map | Configuration to define the split before the model training.
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection.

4.2. Supported feature types

A feature step can use node properties that exist in the input graph or are added by the pipeline. For each node in a node pair of interest, the values of nodeProperties are concatenated, in the configured order, into a vector. We denote the entries of the vectors of a pair by a[i] and b[i], and we take f[i] to be the i-th entry of the output of a feature step.

The supported types of features can then be described as follows:

Table 10. Supported feature types
Feature Type | Formula / Description
L2 | f[i] = (a[i] - b[i])^2
HADAMARD | f[i] = a[i] * b[i]
COSINE | f[0] = cosine similarity of vectors a and b
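
As a rough illustration of the formulas in the table above, the following Python sketch computes the three feature types for a single node pair. It is not GDS code: the function names are invented for this example, and returning 0.0 for the cosine of a zero vector is an assumption rather than documented behavior.

```python
import math

def l2_feature(a, b):
    """Element-wise squared difference: f[i] = (a[i] - b[i])^2."""
    return [(x - y) ** 2 for x, y in zip(a, b)]

def hadamard_feature(a, b):
    """Element-wise product: f[i] = a[i] * b[i]."""
    return [x * y for x, y in zip(a, b)]

def cosine_feature(a, b):
    """Single entry f[0]: cosine similarity of the vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: fall back to 0.0 when either vector is all zeros.
    return [dot / norm if norm else 0.0]

# Made-up node vectors for one node pair.
a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 3.0]

l2 = l2_feature(a, b)
hadamard = hadamard_feature(a, b)
cosine = cosine_feature(a, b)

# The outputs of several feature steps are concatenated in step order.
link_features = l2 + hadamard + cosine
```

L2 and HADAMARD produce one entry per input dimension, whereas COSINE contributes a single entry, so the link feature vector here has seven entries.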

4.3. Example

The following will add a feature step to the pipeline:
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature('pipe', 'hadamard', {
  nodeProperties: ['embedding', 'numberOfPosts']
}) YIELD featureSteps
Table 11. Results
featureSteps

[{name=HADAMARD, config={nodeProperties=[embedding, numberOfPosts]}}]

When executing the pipeline, the nodeProperties must be either present in the input graph, or created by a previous node property step. For example, the embedding property could be created by the previous example, and we expect numberOfPosts to already be present in the in-memory graph used as input, at train and predict time.
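
The way the configured nodeProperties become a per-node vector can be sketched in Python as follows. This is only an illustration under stated assumptions: node_vector is a hypothetical helper, the property values are made up, and the embedding is shortened to three entries instead of the 256 configured in the fastRP step.

```python
def node_vector(properties, node_properties):
    """Concatenate the configured properties of one node into a flat vector."""
    vector = []
    for name in node_properties:
        if name not in properties:
            # Mirrors the requirement that every property must exist in the
            # input graph or be created by an earlier node property step.
            raise KeyError(f"missing node property: {name}")
        value = properties[name]
        if isinstance(value, (list, tuple)):
            vector.extend(float(v) for v in value)  # array property
        else:
            vector.append(float(value))             # scalar property
    return vector

# Made-up values; a real fastRP embedding would have 256 entries here.
alice = {"embedding": [0.1, -0.2, 0.3], "numberOfPosts": 38}
michael = {"embedding": [0.0, 0.5, -0.1], "numberOfPosts": 67}

configured = ["embedding", "numberOfPosts"]
a = node_vector(alice, configured)
b = node_vector(michael, configured)

# HADAMARD feature step: element-wise product of the two node vectors.
hadamard = [x * y for x, y in zip(a, b)]
```

The order of the entries follows the order of nodeProperties, so swapping the two names would move numberOfPosts to the front of both vectors.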

5. Configuring the relationship splits

Link Prediction pipelines manage splitting the relationships into several sets and add sampled negative relationships to some of these sets. Configuring the splitting is optional, and if omitted, splitting will be done using default settings.

The splitting configuration of a pipeline can be inspected by using gds.beta.model.list and optionally yielding only splitConfig.

The splitting of relationships proceeds internally in the following steps:

  1. The graph is filtered according to specified nodeLabels and relationshipTypes, which are configured at train time.

  2. The relationships remaining after filtering we call positive, and they are split into a test set and remaining relationships.

    • The test set contains a testFraction fraction of the positive relationships.

    • Random negative relationships are added to the test set. The number of negative relationships is the number of positive ones multiplied by the negativeSamplingRatio.

    • The negative relationships do not coincide with positive relationships.

  3. The remaining positive relationships are split into a train set and a feature input set.

    • The train set contains a trainFraction fraction of all the positive relationships.

      • Therefore we require trainFraction + testFraction < 1.0.

      • The feature input set contains the remaining 1.0 - (trainFraction + testFraction) fraction of the positive relationships.

    • Random negative relationships are added to the train set. The number of negative relationships is the number of positive ones multiplied by the negativeSamplingRatio.

    • The negative relationships do not coincide with positive relationships, nor with test relationships.

The sampled positive and negative relationships are given relationship weights of 1.0 and 0.0 respectively so that they can be distinguished.

The feature input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties.

The train and test relationship sets are used for:

  • determining the label (positive or negative) for each training or test example

  • identifying the node pair for which link features are to be computed

However, they are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature.
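
The sizes of the resulting sets follow directly from the fractions described above. The Python sketch below works through that arithmetic; split_sizes is a hypothetical helper, and rounding to the nearest integer is an assumption, since the exact rounding used by GDS is not stated here.

```python
def split_sizes(num_positive, test_fraction=0.1, train_fraction=0.1,
                negative_sampling_ratio=1.0):
    """Approximate sizes of the test, train and feature input sets."""
    if not (0.0 < test_fraction < 1.0 and 0.0 < train_fraction < 1.0):
        raise ValueError("fractions must be in the range (0, 1)")
    if train_fraction + test_fraction >= 1.0:
        raise ValueError("trainFraction + testFraction must be < 1.0")

    # Assumption: sizes are rounded to the nearest integer.
    test_pos = round(num_positive * test_fraction)
    train_pos = round(num_positive * train_fraction)
    feature_input = num_positive - test_pos - train_pos
    return {
        "test": {"positive": test_pos,
                 "negative": round(test_pos * negative_sampling_ratio)},
        "train": {"positive": train_pos,
                  "negative": round(train_pos * negative_sampling_ratio)},
        "featureInput": {"positive": feature_input},
    }

# 1000 positive relationships with testFraction = trainFraction = 0.3.
sizes = split_sizes(1000, test_fraction=0.3, train_fraction=0.3)
```

With these values the test and train sets each hold 300 positive and 300 negative relationships, and the remaining 400 positive relationships form the feature input set.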

5.1. Syntax

Configure the relationship split syntax
CALL gds.alpha.ml.pipeline.linkPrediction.configureSplit(
  pipelineName: String,
  configuration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 12. Parameters
Name | Type | Description
pipelineName | String | The name of the pipeline.
configuration | Map | Configuration for splitting the relationships.

Table 13. Configuration
Name | Type | Default | Description
validationFolds | Integer | 3 | Number of divisions of the training graph used during model selection.
testFraction | Double | 0.1 | Portion of the graph reserved for testing. Must be in the range (0, 1).
trainFraction | Double | 0.1 | Portion of the graph reserved for training. Must be in the range (0, 1).
negativeSamplingRatio | Double | 1.0 | The desired ratio of negative to positive samples in the test and train set.

Table 14. Results
Name | Type | Description
name | String | Name of the pipeline.
nodePropertySteps | List of Map | List of configurations for node property steps.
featureSteps | List of Map | List of configurations for feature steps.
splitConfig | Map | Configuration to define the split before the model training.
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection.

5.2. Example

The following will configure the splitting of the pipeline:
CALL gds.alpha.ml.pipeline.linkPrediction.configureSplit('pipe', {
  testFraction: 0.3,
  trainFraction: 0.3,
  validationFolds: 7
})
YIELD splitConfig
Table 15. Results
splitConfig

{negativeSamplingRatio=1.0, testFraction=0.3, validationFolds=7, trainFraction=0.3}

We have now reconfigured the splitting of the pipeline, which will be applied during training.

6. Configuring the model parameters

The gds.alpha.ml.pipeline.linkPrediction.configureParams mode is used to set up the train mode with a list of configurations of logistic regression models. The set of model configurations is called the parameter space which parametrizes a set of model candidates. The parameter space can be configured by passing this procedure a list of maps, where each map configures the training of one logistic regression model. In Training the pipeline, we explain further how the configured model candidates are trained, evaluated and compared.

The allowed model parameters are listed in the table Model configuration.

If configureParams is not used, then a single model with defaults for all the model parameters is used. The parameter space of a pipeline can be inspected using gds.beta.model.list and optionally yielding only parameterSpace.

6.1. Syntax

Configure the train parameters syntax
CALL gds.alpha.ml.pipeline.linkPrediction.configureParams(
  pipelineName: String,
  parameterSpace: List of Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureSteps: List of Map,
  splitConfig: Map,
  parameterSpace: List of Map
Table 16. Parameters
Name | Type | Description
pipelineName | String | The name of the pipeline.
parameterSpace | List of Map | The parameter space from which the best model is selected. Each Map corresponds to a potential model. The allowed parameters for a model are defined in the next table.

Table 17. Model configuration
Name | Type | Default | Optional | Description
penalty | Float | 0.0 | yes | Penalty used for the logistic regression. By default, no penalty is applied.
batchSize | Integer | 100 | yes | Number of nodes per batch.
minEpochs | Integer | 1 | yes | Minimum number of training epochs.
maxEpochs | Integer | 100 | yes | Maximum number of training epochs.
patience | Integer | 1 | yes | Maximum number of unproductive consecutive epochs.
tolerance | Float | 0.001 | yes | The minimal improvement of the loss to be considered productive.
useBiasFeature | Boolean | true | yes | Whether the logistic regression model uses a bias feature.
concurrency | Integer | see description | yes | Concurrency for training the model candidate. By default, the value of concurrency defined at training is used.

Table 18. Results
Name | Type | Description
name | String | Name of the pipeline.
nodePropertySteps | List of Map | List of configurations for node property steps.
featureSteps | List of Map | List of configurations for feature steps.
splitConfig | Map | Configuration to define the split before the model training.
parameterSpace | List of Map | List of parameter configurations for models which the train mode uses for model selection.

6.2. Example

The following will configure the parameter space of the pipeline:
CALL gds.alpha.ml.pipeline.linkPrediction.configureParams('pipe',
  [{tolerance: 0.001}, {tolerance: 0.01}, {maxEpochs: 500}]
) YIELD parameterSpace
Table 19. Results
parameterSpace

[{useBiasFeature=true, maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001}, {useBiasFeature=true, maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.01}, {useBiasFeature=true, maxEpochs=500, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001}]

The parameterSpace in the pipeline now contains the three model configurations, each expanded with the default values for the parameters that were not specified. Each of these configurations will be tried out during model selection in training.
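
The expansion with defaults can be pictured as a dictionary merge. The following Python sketch reproduces the result shown above under that reading; expand_parameter_space is a hypothetical helper, and the defaults are copied from the Model configuration table.

```python
# Defaults taken from the Model configuration table. concurrency is left
# out because it falls back to the value given at train time.
DEFAULTS = {
    "penalty": 0.0,
    "batchSize": 100,
    "minEpochs": 1,
    "maxEpochs": 100,
    "patience": 1,
    "tolerance": 0.001,
    "useBiasFeature": True,
}

def expand_parameter_space(parameter_space):
    """Fill every candidate configuration with defaults for unset keys."""
    expanded = []
    for candidate in parameter_space:
        unknown = set(candidate) - set(DEFAULTS) - {"concurrency"}
        if unknown:
            raise ValueError(f"unknown model parameters: {sorted(unknown)}")
        # Values given by the user override the defaults.
        expanded.append({**DEFAULTS, **candidate})
    return expanded

candidates = expand_parameter_space(
    [{"tolerance": 0.001}, {"tolerance": 0.01}, {"maxEpochs": 500}]
)
```

Note that the first candidate is identical to the all-defaults model, because 0.001 is already the default tolerance.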

7. Training the pipeline

The train mode, gds.alpha.ml.pipeline.linkPrediction.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use. Running this mode results in a Link prediction pipeline model being stored in the model catalog along with metrics collected during training. The trained pipeline can be applied to a possibly different graph which produces a relationship type of predicted links, each having a predicted probability stored as a property.

More precisely, the procedure will in order:

  1. apply nodeLabels and relationshipTypes filters to the graph. All subsequent graphs have the same node set.

  2. create a relationship split of the graph into test, train and feature input sets as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.

  3. apply the node property steps, added according to Adding node properties, on the feature input graph.

  4. apply the feature steps, added according to Adding link features, to the train graph, which yields for each train relationship an instance, that is, a feature vector and a binary label.

  5. split the training instances using stratified k-fold cross-validation. The number of folds k can be configured using validationFolds in gds.alpha.ml.pipeline.linkPrediction.configureSplit.

  6. train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The training process uses a logistic regression algorithm, and the evaluation uses the AUCPR metric.

  7. declare as winner the model with the highest average metric across the folds.

  8. re-train the winning model on the whole training set and evaluate it on both the train and test sets. In order to evaluate on the test set, the feature pipeline is first applied again as for the train set.

  9. register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by steps that were added before it.
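
Steps 6 and 7 of the list above (evaluating each candidate on every fold and declaring the best average the winner) can be sketched in Python as follows. The fold scores are made up, select_winner is a hypothetical helper, and breaking ties in favor of the earlier candidate is an assumption.

```python
from statistics import mean

def select_winner(fold_scores):
    """Pick the candidate with the highest average validation score.

    fold_scores maps a candidate index to its per-fold metric values.
    """
    summary = [
        {"candidate": idx, "avg": mean(scores),
         "min": min(scores), "max": max(scores)}
        for idx, scores in fold_scores.items()
    ]
    # max() returns the first entry on ties (an assumption about tie-breaking).
    winner = max(summary, key=lambda entry: entry["avg"])
    return winner, summary

# Made-up AUCPR values for three candidates and validationFolds = 3.
fold_scores = {
    0: [0.70, 0.80, 0.75],
    1: [0.60, 0.90, 0.65],
    2: [0.72, 0.74, 0.73],
}
winner, summary = select_winner(fold_scores)
```

Candidate 1 reaches the highest single-fold score, but candidate 0 wins because selection is based on the average across folds.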

7.1. Syntax

Run Link Prediction in train mode on a named graph:
CALL gds.alpha.ml.pipeline.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  configuration: Map
Table 20. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 21. Configuration
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of the model to train, must not exist in the Model Catalog.
pipeline | String | n/a | no | The name of the pipeline to execute.
negativeClassWeight | Float | 1.0 | yes | Weight of negative examples in model evaluation. Positive examples have weight 1.
randomSeed | Integer | n/a | yes | Seed for the random number generator used during training.
nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.

Table 22. Results
Name | Type | Description
trainMillis | Integer | Milliseconds used for training.
modelInfo | Map | Information about the training and the winning model.
configuration | Map | Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Table 23. Model info fields
Name | Type | Description
bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric.
metrics | Map | Map from metric description to evaluated metrics for various models and subsets of the data, see below.
trainingPipeline | Map | The pipeline used for the training.

The structure of modelInfo is:

{
    bestParameters: Map,        (1)
    trainingPipeline: Map,      (2)
    metrics: {                  (3)
        AUCPR: {
            test: Float,        (4)
            outerTrain: Float,  (5)
            train: [{           (6)
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...
            ],
            validation: [{      (7)
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...
            ]
        }
    }
}
1 The best scoring model candidate configuration.
2 The pipeline used for the training.
3 The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric.
4 Numeric value for the evaluation of the best model on the test set.
5 Numeric value for the evaluation of the best model on the outer train set.
6 The train entry lists the scores over the train set for all candidate models (i.e., params). Each such result is in turn also a map with keys params, avg, min and max.
7 The validation entry lists the scores over the validation set for all candidate models (i.e., params). Each such result is in turn also a map with keys params, avg, min and max.

7.2. Example

In this example we will create a small graph and train the pipeline we have built up thus far. The graph consists of a handful of nodes connected in a particular pattern. The example graph looks like this:

[Figure: Visualization of the example graph]
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (alice:Person {name: 'Alice', numberOfPosts: 38}),
  (michael:Person {name: 'Michael', numberOfPosts: 67}),
  (karin:Person {name: 'Karin', numberOfPosts: 30}),
  (chris:Person {name: 'Chris', numberOfPosts: 132}),
  (will:Person {name: 'Will', numberOfPosts: 6}),
  (mark:Person {name: 'Mark', numberOfPosts: 32}),
  (greg:Person {name: 'Greg', numberOfPosts: 29}),
  (veselin:Person {name: 'Veselin', numberOfPosts: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin);

With the graph in Neo4j we can now project it into the graph catalog. We do this using a native projection targeting the Person nodes and the KNOWS relationships. We will also project the numberOfPosts property, so it can be used when creating link features. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction pipelines are defined only for undirected graphs.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.create(
  'myGraph',
  {
    Person: {
      properties: ['numberOfPosts']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.
The following will train a model using a pipeline:
CALL gds.alpha.ml.pipeline.linkPrediction.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  randomSeed: 42
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
  modelInfo.metrics.AUCPR.test AS testGraphScore
Table 24. Results
winningModel | trainGraphScore | testGraphScore
{useBiasFeature=true, maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001, concurrency=4} | 0.41666666666666663 | 0.7638888888888888

We can see that the model configuration with tolerance = 0.001 (and defaults filled in for the remaining parameters) was selected, and has a score of 0.76 on the test set. The score is computed as the AUCPR metric, which is in the range [0, 1]. A model which gives a higher score to all links than to all non-links will have a score of 1.0, and a model that assigns random scores will on average have a score of 0.5 when positive and negative examples are balanced, as they are with a negativeSamplingRatio of 1.0.
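
To make the metric more concrete, the Python sketch below computes average precision, a common estimate of the area under the precision-recall curve. It is a stand-in under stated assumptions: the exact AUCPR computation in GDS may differ, negativeClassWeight is not modeled, and the labels and scores are made up.

```python
def average_precision(labels, scores):
    """Average precision: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("at least one positive example is required")
    return sum(precisions) / len(precisions)

# Every link scores higher than every non-link: the score is 1.0.
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])

# One non-link is ranked above a link: the score drops below 1.0.
imperfect = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In the second case the positives sit at ranks 1 and 3, giving precisions of 1/1 and 2/3 and an average of about 0.83.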

8. Applying a trained model for prediction

In the previous sections we have seen how to build up a Link Prediction training pipeline and train it to produce a predictive model. After training, the runnable model is of type Link prediction pipeline and resides in the model catalog.

The trained model can then be applied to a graph in the graph catalog to create a new relationship type containing the predicted links. The relationships also have a property which stores the predicted probability of the link, which can be seen as a relative measure of the model’s prediction confidence.

Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.

When using the model for prediction, the relationships of the input graph are used in two ways. First, the input graph is fed into the feature pipeline and therefore influences predictions if there is at least one step in the pipeline which uses the input relationships (typically any node property step does). Second, predictions are carried out on each node pair that is not connected in the input graph.

The predicted links are sorted by score before the ones having score below the configured threshold are discarded. Finally, the configured topN predictions are stored back to the in-memory graph.
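
The sort, threshold and topN logic described above can be sketched in Python as follows. This is only an illustration of the selection order: select_predictions is a hypothetical helper and the probabilities are made up.

```python
def select_predictions(scored_pairs, top_n, threshold=0.0):
    """Sort by probability, drop pairs below threshold, keep the top_n best."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    # Pairs with a score below the threshold are discarded.
    kept = [pair for pair in ranked if pair[2] >= threshold]
    return kept[:top_n]

# Made-up probabilities for node pairs not connected in the input graph.
scored_pairs = [
    ("Alice", "Chris", 0.54),
    ("Alice", "Will", 0.40),
    ("Chris", "Mark", 0.53),
    ("Greg", "Karin", 0.46),
]
predictions = select_predictions(scored_pairs, top_n=2, threshold=0.45)
```

Here the pair scored 0.40 is removed by the threshold, and the pair scored 0.46 passes the threshold but is cut off by topN.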

It is necessary that the predict graph contains the properties that the pipeline requires and that the used array properties have the same dimensions as in the train graph. If the predict and train graphs are distinct, it is also beneficial that they have similar origins and semantics, so that the model is able to generalize well.

8.1. Syntax

Link Prediction syntax per mode
Run Link Prediction in mutate mode on a named graph:
CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map
Table 25. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 26. Configuration
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of a Link Prediction model in the model catalog.
nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.
mutateRelationshipType | String | n/a | no | The relationship type used for the new relationships written to the in-memory graph.
mutateProperty | String | 'probability' | yes | The relationship property in the GDS graph to which the result is written.

Table 27. Algorithm specific configuration
Name | Type | Default | Optional | Description
topN | Integer | n/a | no | Limit on predicted relationships to output.
threshold | Float | 0.0 | yes | Minimum predicted probability on relationships to output.

Table 28. Results
Name | Type | Description
createMillis | Integer | Milliseconds for creating the graph.
computeMillis | Integer | Milliseconds for running the algorithm.
postProcessingMillis | Integer | Milliseconds for computing the global metrics.
mutateMillis | Integer | Milliseconds for adding properties to the in-memory graph.
relationshipsWritten | Integer | Number of relationships created.
configuration | Map | Configuration used for running the algorithm.

8.2. Example

In this example we will show how to use a trained model to predict new relationships in your in-memory graph. To do this, we must first have a trained model registered in the Model Catalog. We will use the model that we trained in the train example and named lp-pipeline-model. The algorithm excludes predictions for existing relationships in the graph.

CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate('myGraph', {
  modelName: 'lp-pipeline-model',
  mutateRelationshipType: 'KNOWS_PREDICTED',
  topN: 5,
  threshold: 0.45
}) YIELD relationshipsWritten
Table 29. Results
relationshipsWritten

10

We specified threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.
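
The doubling of the relationship count can be illustrated with a short Python sketch. It assumes, as a simplification, that an undirected prediction is stored as one relationship per direction; to_undirected is a hypothetical helper and the node ids and probabilities are made up.

```python
def to_undirected(pairs):
    """Store each predicted pair in both directions."""
    relationships = []
    for source, target, probability in pairs:
        relationships.append((source, target, probability))
        relationships.append((target, source, probability))
    return relationships

# Five made-up predictions, matching topN = 5.
top_pairs = [(0, 3, 0.54), (0, 6, 0.51), (0, 2, 0.51), (0, 5, 0.51), (3, 5, 0.54)]
written = to_undirected(top_pairs)

# Keeping only source < target recovers one row per predicted pair.
unique = [rel for rel in written if rel[0] < rel[1]]
```

Five predicted pairs therefore give ten stored relationships, and a filter of the form source < target reduces the output back to five rows.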

In the following, we will inspect the predicted relationships:

Stream the predicted relationships:
CALL gds.graph.streamRelationshipProperty('myGraph', 'probability', ['KNOWS_PREDICTED'])
YIELD
  sourceNodeId, targetNodeId, propertyValue
WHERE sourceNodeId < targetNodeId
RETURN
  gds.util.asNode(sourceNodeId).name as source, gds.util.asNode(targetNodeId).name as target, propertyValue AS probability
ORDER BY source ASC, target ASC
Table 30. Results
source | target | probability
"Alice" | "Chris" | 0.5422350772807373
"Alice" | "Greg" | 0.51204718863418
"Alice" | "Karin" | 0.5123040606165334
"Alice" | "Mark" | 0.5130009448848327
"Chris" | "Mark" | 0.5364414066546659

We can see that our model predicts that the most likely link is between Alice and Chris.