Link Prediction

This section describes the Link Prediction Model in the Neo4j Graph Data Science library.

1. Introduction

Link prediction is a common machine learning task applied to graphs: training a model to learn, between pairs of nodes in a graph, where relationships should exist. The predicted links are undirected. You can think of this as building a model to predict missing relationships in your dataset or relationships that are likely to form in the future. Neo4j GDS trains supervised machine learning models based on the relationships and node properties in your graph to predict the existence - and probability - of relationships.

Link Prediction can be used favorably together with pre-processing algorithms.

The basic work flow of Link Prediction contains the following parts which are described below:

1.1. Training, Model Selection and Evaluation

When building a model, it is possible to specify multiple model configurations and a model selection metric. The train mode, gds.alpha.ml.linkPrediction.train, is responsible for training and evaluating the models, selecting the best model, and storing it in the model catalog.

The train mode takes as input two relationship types representing the training graph and test graph respectively. The relationship types must have an integer property, with values being either 0 or 1. If the value is 0 the relationship represents a negative example, meaning a node pair which is not connected in the original graph. If the value is 1 the relationship represents a positive example, meaning a relationship which does exist in the original graph.

To obtain the feature vector for an example, the algorithm first forms node feature vectors for the source and target nodes of the example in question. This is done by concatenating the property values of the specified feature properties in order into a node feature vector. Thereafter, the algorithm uses a link feature combiner to combine the two node feature vectors into the feature vector for the training example. There are three supported link feature combiners:

  • L2

    • which gives a feature vector as [(s_0 - t_0)^2 , (s_1 - t_1)^2, …​, (s_d - t_d)^2], where [s_i] and [t_i] are the source and target node feature vectors.

  • HADAMARD

    • which gives a feature vector as [s_0 * t_0, s_1 * t_1, …​, s_d * t_d].

  • COSINE

    • which gives a single scalar feature, using the the Cosine similarity of source and target node feature vectors.

The precise steps of the train mode are:

  1. The relationships of the training graph are divided into a number of folds, consisting of a training part and a validation part.

  2. Each model candidate is trained on each train part and evaluated on the respective validation part. The training process uses a logistic regression algorithm, and the evaluation uses the AUCPR metric.

  3. The model with the highest average score according to the metric will win the training.

  4. The winning model will then be re-trained on the whole training graph and evaluated on the training graph as well as on the test graph.

  5. The winning model will be registered in the Model Catalog.

Trained models may then be used to predict the probability of a relationship between two nodes.

A previously trained model can be applied by invoking the gds.alpha.ml.linkPrediction.predict.mutate mode. This will retrieve the model by name from the model catalog. The model will thereby be used to predict the probability of relationships between all node pairs in the graph that are not connected. There are two mandatory configuration parameters, which limit the size of the output:

  • topN retains the most probable predictions.

  • threshold retains predictions whose probability is above the threshold.

1.3. Train/Test Splitting

In order to train a Link Prediction model, one needs training and test graphs as described above. The recommended way to obtain these is by using gds.alpha.ml.splitRelationships() procedure, once to produce the test graph, and another time for the training graph. By invoking this procedure the first time, one obtains two new relationship types that represent the test graph and a 'remaining' graph. One can then, invoke the procedure again, on the just created 'remaining' graph, which then creates a training graph and an even smaller 'remaining' graph. Below, is an example usage of how the splitRelationships procedure can be used to prepare the required datasets for training.

The 'remaining' graph after the second split can optionally be used to create node embeddings without data leakage from test or validation sets.

Note that, after the first invocation, we cannot use the 'remaining' graph as the training graph, because it not guaranteed to have 0/1 value relationship labels nor negative link examples.

1.4. Metrics

The Link Prediction model in the Neo4j GDS library supports only the Area Under the Precision-Recall Curve metric, abbreviated as AUCPR. In order to compute precision and recall we require a set of examples, each of which has a positive or negative target. For each example we have also a predicted target. Given the true and predicted targets, we can compute precision and recall (for reference, see f.e. Wikipedia).

Then, to compute the AUCPR, we construct the precision-recall curve, as follows:

  • Each predicted target is associated with a prediction strength, for example the predicted probability of a positive target. We sort the examples in descending order of prediction strength.

  • For all prediction strengths that occur, we use that strength as a threshold and for all examples of that strength or higher predict that these examples have positive targets.

  • We now compute precision p and recall r and consider the tuple (r, p) as a point on a curve, the precision-recall curve.

  • Finally, the curve is linearly interpolated and the area is computed as a union of trapezoids with corners on the points.

The curve will have a shape that looks something like this:

precision-recall curve with trapezoid

Note here the blue area which shows one trapezoid under the curve.

The area under the Precision-Recall curve can also be interpreted as an average precision where the average is over different classification thresholds.

1.4.1. Class imbalance

Most graphs have far more non-connected node pairs than connected ones (e.g. sparse graphs). Thus, typically we have an issue with class imbalance. There are multiple strategies to account for imbalanced data. In our procedure, the AUCPR metric is used which is considered more suitable than the commonly used AUROC (Area Under the Receiver Operating Characteristic) metric for imbalanced data. For the metric to appropriately reflect both positive (connected node pairs) and negative (non-connected pairs) examples, we provide the ability to both control the ratio of sampling between the classes, and to control the relative weight of classes via negativeClassWeight. The former is configured by the configuration parameter negativeSamplingRatio in splitRelationships when using that procedure to generate the test set. Tuning the negativeClassWeight, which is explained below, means weighting up or down the false positives when computing precision.

The recommended value for negativeSamplingRatio is the true class ratio of the graph, in other words, not applying undersampling. However, the higher the value, the bigger the test set and thus the time to evaluate. The ratio of total probability mass of negative versus positive examples in the test set is approximately negativeSamplingRatio * negativeClassWeight. Thus, both of these parameters can be adjusted in tandem to trade off evaluation accuracy with speed.

The true class ratio is computed as (q - r) / r, where q = n(n-1)/2 is the number of possible undirected relationships, and r is the number of actual undirected relationships. Please note that the relationshipCount reported by the graph list procedure is the directed count of relationships summed over all existing relationship types. Thus, we recommend using Cypher to obtain r on the source Neo4j graph. For example, this query will count the number of relationships of type T or R:

MATCH (a)-[rel:T | R]-(b)
WHERE a < b
RETURN count(rel) AS r

When choosing a value for negativeClassWeight, two factors should be considered. First, the desired ratio of total probability mass of negative versus positive examples in the test set. Second, what the ratio of sampled negative examples to positive examples was in the test set. To be consistent with traditional evaluation, one should choose parameters so that negativeSamplingRatio * negativeClassWeight = 1.0, for example by setting the values to the true class ratio and its reciprocal, or both values to 1.0.

Alternatively, one can aim for the ratio of total probability weight between the classes to be close to the true class ratio. That is, making sure negativeSamplingRatio * negativeClassWeight is close to the true class ratio. The reported metric (AUCPR) then better reflects the expected precision on unseen highly imbalanced data. With this type of evaluation one has to adjust expectations as the metric value then becomes much smaller.

2. Syntax

This section covers the syntax used to execute the Link Prediction algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

The named graphs must be projected in the UNDIRECTED orientation for the Link Prediction model.
Example 1. Link Prediction syntax per mode
Run Link Prediction in train mode on a named graph:
CALL gds.alpha.ml.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  configuration: Map
Table 1. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

featureProperties

List<String>

[]

yes

The names of the node properties that should be used as input features. All property names must exist in the in-memory graph and be of type Float or List<Float>.

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name Type Default Optional Description

trainRelationshipType

String

n/a

no

Relationship type to use during model training.

testRelationshipType

String

n/a

no

Relationship type to use during model evaluation.

validationFolds

Integer

n/a

no

Number of divisions of the training graph used during model selection.

negativeClassWeight

Float

n/a

no

Weight of negative examples in model evaluation. Positive examples have weight 1.

params

List<Map>

n/a

no

List of model configurations to be trained and compared. See next table for details.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

Table 4. Model configuration
Name Type Default Optional Description

penalty

Float

0.0

yes

Penalty used for the logistic regression. By default, no penalty is applied.

linkFeatureCombiner

String

"L2"

yes

Link feature combiner is used to combine two node feature vectors into the feature vector for the training. Available combiners are L2, HADAMARD and COSINE.

batchSize

Integer

100

yes

Number of nodes per batch.

minEpochs

Integer

1

yes

Minimum number of training epochs.

maxEpochs

Integer

100

yes

Maximum number of training epochs.

patience

Integer

1

yes

Maximum number of unproductive consecutive epochs.

tolerance

Float

0.001

yes

The minimal improvement of the loss to be considered productive.

concurrency

Integer

see description

yes

Concurrency for training the model candidate. By default, the value of the top level concurrency parameter is used.

For hyperparameter tuning ideas, look here.

Table 5. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Name Type Description

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for various models and subsets of the data, see below.

The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric. Each such result map contains the keys train, validation, outerTrain and test. The latter two correspond to numeric values for the evaluation of the best model on the test set and its complement — the outer train set. The train and validation results contain lists of results for all candidate models (i.e. params). Each such result is in turn also a map with keys params, avg, min and max. These correspond to the hyper-parameters of the model candidate, and statistics for the result of that model over cross-validation folds. The structure of modelInfo is:

{
    bestParameters: Map,     // one of the maps specified in `params`
    metrics: {
        AUCPR: {
            test: Float,
            outerTrain: Float,
            train: [{
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...              // more results per model candidate
            ],
            validation: [{
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...              // more results per model candidate
            ]
        }
    }
}
Run Link Prediction in stream mode on a named graph:
CALL gds.alpha.ml.linkPrediction.predict.stream(
  graphName: String,
  configuration: Map
)
YIELD
  node1: Integer,
  node2: Integer,
  probability: Float
Table 6. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 7. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a Link Prediction model in the model catalog.

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 8. Algorithm specific configuration
Name Type Default Optional Description

topN

Integer

n/a

no

Limit on predicted relationships to output.

threshold

Float

n/a

no

Minimum predicted probability on relationships to output.

Table 9. Results
Name Type Description

node1

Integer

Node ID of the first node.

node2

Integer

Node ID of the second node.

probability

Float

Predicted probability of a link between the nodes.

Run Link Prediction in mutate mode on a named graph:
CALL gds.alpha.ml.linkPrediction.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map
Table 10. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 11. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a Link Prediction model in the model catalog.

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateRelationshipType

String

n/a

no

The relationship type used for the new relationships written to the in-memory graph.

mutateProperty

String

'probability'

yes

The relationship property in the GDS graph to which the result is written.

Table 12. Algorithm specific configuration
Name Type Default Optional Description

topN

Integer

n/a

no

Limit on predicted relationships to output.

threshold

Float

n/a

no

Minimum predicted probability on relationships to output.

Table 13. Results
Name Type Description

createMillis

Integer

Milliseconds for creating the graph.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing the global metrics.

mutateMillis

Integer

Milliseconds for adding properties to the in-memory graph.

relationshipsWritten

Integer

Number of relationships created.

configuration

Map

Configuration used for running the algorithm.

Run Link Prediction in write mode on a named graph:
CALL gds.alpha.ml.linkPrediction.predict.write(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  writeMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map
Table 14. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 15. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a Link Prediction model in the model catalog.

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

writeRelationshipType

String

n/a

no

The relationship type used to persist the computed relationships in the Neo4j database.

writeProperty

String

n/a

no

The relationship property in the Neo4j database to which the result is written.

Table 16. Algorithm specific configuration
Name Type Default Optional Description

topN

Integer

n/a

no

Limit on predicted relationships to output.

threshold

Float

n/a

no

Minimum predicted probability on relationships to output.

Table 17. Results
Name Type Description

createMillis

Integer

Milliseconds for creating the graph.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing the global metrics.

writeMillis

Integer

Milliseconds for writing result data back to Neo4j.

relationshipsWritten

Integer

Number of relationships created.

configuration

Map

Configuration used for running the algorithm.

3. Examples

In this section we will show examples of running the Link Prediction algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small social network graph of a handful nodes connected in a particular pattern. The example graph looks like this:

Visualization of the example graph
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (alice:Person {name: 'Alice', numberOfPosts: 38}),
  (michael:Person {name: 'Michael', numberOfPosts: 67}),
  (karin:Person {name: 'Karin', numberOfPosts: 30}),
  (chris:Person {name: 'Chris', numberOfPosts: 132}),
  (will:Person {name: 'Will', numberOfPosts: 6}),
  (mark:Person {name: 'Mark', numberOfPosts: 32}),
  (greg:Person {name: 'Greg', numberOfPosts: 29}),
  (veselin:Person {name: 'Veselin', numberOfPosts: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin);

With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution. We do this using a native projection targeting the Person nodes and the KNOWS relationships. We will also project the numberOfPosts property, so we can use it as a model feature. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction model is defined only for undirected graphs.

In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.create(
  'myGraph',
  {
    Person: {
      properties: ['numberOfPosts']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

In the following examples we will demonstrate using the Link Prediction model on this graph.

3.1. Train

First, we must do the test/train splits. For this we will make use of the gds.alpha.ml.splitRelationships procedure. We will do one split to generate the test graph. We note that in the example graph there are eight nodes and twelve directed relationships. Recall that we compute the class ratio as (q - r) / q, where we then have q = 8(8-1)/2 and r = 12 which gives us class ratio of (28 - 12) / 12 ~= 1.33. We use this to configure negativeSampleRatio to achieve a sampling proportional to the class ratio.

CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
  relationshipTypes: ['KNOWS'],
  remainingRelationshipType: 'KNOWS_REMAINING',
  holdoutRelationshipType: 'KNOWS_TESTGRAPH',
  holdoutFraction: 0.2,
  negativeSamplingRatio: 1.33,
  randomSeed: 1984
}) YIELD relationshipsWritten
Table 18. Results
relationshipsWritten

25

We will create copied relationships for each existing relationship, into either the KNOWS_REMAINING or the KNOWS_TESTGRAPH relationship types. All relationships in KNOWS_TESTGRAPH will have a label property. Additionally, a number of non-existing relationships will be created into the KNOWS_TESTGRAPH relationship type to be used as negative examples, with a label of 0.

Next, we will create the train graph.

CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
  relationshipTypes: ['KNOWS_REMAINING'],
  remainingRelationshipType: 'KNOWS_IGNORED_FOR_TRAINING',
  holdoutRelationshipType: 'KNOWS_TRAINGRAPH',
  holdoutFraction: 0.2,
  negativeSamplingRatio: 1.33,
  randomSeed: 1984
}) YIELD relationshipsWritten
Table 19. Results
relationshipsWritten

20

With both training and test graphs, we are ready to train models. We will use 5 validation folds, meaning we will split the train graph into 5 pairs, using one part of each pair for training and one for validation. Since we set the negativeSamplingRatio to 1.33 (the class ratio of the graph) above, we’ll set the negativeClassWeight during training to 1 / 1.33 to assign equal weight to both classes.

Train a Link Prediction model:
CALL gds.alpha.ml.linkPrediction.train('myGraph', {
  trainRelationshipType: 'KNOWS_TRAINGRAPH',
  testRelationshipType: 'KNOWS_TESTGRAPH',
  modelName: 'lp-numberOfPosts-model',
  featureProperties: ['numberOfPosts'],
  validationFolds: 5,
  negativeClassWeight: 1.0 / 1.33,
  randomSeed: 2,
  concurrency: 1,
  params: [
    {penalty: 0.5, maxEpochs: 1000},
    {penalty: 1.0, maxEpochs: 1000},
    {penalty: 0.0, maxEpochs: 1000}
  ]
}) YIELD modelInfo
RETURN
  { maxEpochs: modelInfo.bestParameters.maxEpochs, penalty: modelInfo.bestParameters.penalty } AS winningModel,
  modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
  modelInfo.metrics.AUCPR.test AS testGraphScore
Table 20. Results
winningModel trainGraphScore testGraphScore

{maxEpochs=1000, penalty=0.5}

0.38525757517173825

0.46710171439292664

Here we can observe that the model candidate with penalty 0.5 performed the best in the training phase, with a score of about 71% over the train graph. On the test graph, the model scored much lower at about 35%. This indicates that the model reacted fairly well to the train graph, but did not generalise very well to unseen data. In order to achieve a higher test score, we may need to use better features, a larger graph, or different model configuration.

3.2. Stream

In the stream execution mode, the algorithm returns the top predicted relationships. This allows us to inspect the results directly or post-process them in Cypher without any side effects.

For more details on the stream mode in general, see Stream.

CALL gds.alpha.ml.linkPrediction.predict.stream('myGraph', {
  relationshipTypes: ['KNOWS'],
  modelName: 'lp-numberOfPosts-model',
  topN: 5,
  threshold: 0.45
}) YIELD node1, node2, probability
MATCH (n), (m)
WHERE id(n) = node1 AND id(m) = node2
RETURN
  n.name AS name1,
  m.name AS name2,
  probability
Table 21. Results
name1 name2 probability

"Karin"

"Greg"

0.4991363247445545

"Karin"

"Mark"

0.49896977670628373

"Mark"

"Greg"

0.49869219716877955

"Will"

"Veselin"

0.49869219716877955

"Alice"

"Mark"

0.49719328593255546

We specified threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships. Note that the predicted link between the Karin and Greg nodes does not reflect any particular direction between them.

3.3. Mutate

In this example we will show how to use a trained model to predict new relationships in your in-memory graph. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model which we trained in the train example which we gave the name 'lp-numberOfPosts-model'.

We must also make sure that we do not include any of the relationships from the train or test graphs, which we do by specifying a relationship filter for the original relationship type 'KNOWS'.

CALL gds.alpha.ml.linkPrediction.predict.mutate('myGraph', {
  relationshipTypes: ['KNOWS'],
  modelName: 'lp-numberOfPosts-model',
  mutateRelationshipType: 'KNOWS_PREDICTED',
  topN: 5,
  threshold: 0.45
}) YIELD relationshipsWritten
Table 22. Results
relationshipsWritten

10

We specified threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.

3.4. Write

In this example we will show how to use a trained model to predict new relationships in your in-memory graph, and write the predictions back to Neo4j. We will again use the model 'lp-numberOfPosts-model', as in the mutate example.

CALL gds.alpha.ml.linkPrediction.predict.write('myGraph', {
  relationshipTypes: ['KNOWS'],
  modelName: 'lp-numberOfPosts-model',
  writeRelationshipType: 'KNOWS_PREDICTED',
  topN: 5,
  threshold: 0.45
}) YIELD relationshipsWritten
Table 23. Results
relationshipsWritten

10

The end result looks like this:

link prediction mutate

In yellow we highlight the predicted relationships.