Link Prediction

This section describes the Link Prediction Model in the Neo4j Graph Data Science library.

1. Introduction

Link prediction is a common machine learning task applied to graphs: training a model to learn, between pairs of nodes in a graph, where relationships should exist. You can think of this as building a model to predict missing relationships in your dataset or relationships that are likely to form in the future. Neo4j GDS trains supervised machine learning models based on the relationships and node properties in your graph to predict the existence - and probability - of relationships.

The basic work flow of Link Prediction contains the following parts which are described below:

1.1. Training, Model Selection and Evaluation

When building a model, it is possible to specify multiple model configurations and a model selection metric. The train mode, gds.alpha.ml.linkPrediction.train, is responsible for training and evaluating the models, selecting the best model, and storing it in the model catalog.

The train mode takes as input two relationship types representing the training graph and test graph respectively. The relationship types must have an integer property, with values being either 0 or 1. If the value is 0 the relationship represents a negative example, meaning a node pair which is not connected in the original graph. If the value is 1 the relationship represents a positive example, meaning a relationship which does exist in the original graph.

To obtain the feature vector for an example, the algorithm first forms node feature vectors for the source and target nodes of the example in question. This is done by concatenating the property values of the specified feature properties in order into a node feature vector. Thereafter, the algorithm uses a link feature combiner to combine the two node feature vectors into the feature vector for the training example. There are three supported link feature combiners:

  • L2

    • which gives a feature vector as [(s_0 - t_0)^2 , (s_1 - t_1)^2, …​, (s_d - t_d)^2], where [s_i] and [t_i] are the source and target node feature vectors.

  • HADAMARD

    • which gives a feature vector as [s_0 * t_0, s_1 * t_1, …​, s_d * t_d].

  • COSINE

    • which gives a single scalar feature, using the the Cosine similarity of source and target node feature vectors.

The precise steps of the train mode are:

  1. The relationships of the training graph are divided into a number of folds, consisting of a training part and a validation part.

  2. Each model candidate is trained on each train part and evaluated on the respective validation part. The training process uses a logistic regression algorithm, and the evaluation uses the AUCPR metric.

  3. The model with the highest average score according to the metric will win the training.

  4. The winning model will then be re-trained on the whole training graph and evaluated on the training graph as well as on the test graph.

  5. The winning model will be registered in the Model Catalog.

Trained models may then be used to predict the probability of a relationship between two nodes.

A previously trained model can be applied by invoking the gds.alpha.ml.linkPrediction.predict.mutate mode. This will retrieve the model by name from the model catalog. The model will thereby be used to predict the probability of relationships between all node pairs in the graph that are not connected. There are two mandatory configuration parameters, which limit the size of the output:

  • topN retains the most probable predictions.

  • threshold retains predictions whose probability is above the threshold.

1.3. Train/Test Splitting

In order to train a Link Prediction model, one needs training and test graphs as described above. The recommended way to obtain these is by using gds.alpha.ml.splitRelationships() procedure, once to produce the test graph, and another time for the training graph. By invoking this procedure the first time, one obtains two new relationship types that represent the test graph and a 'remaining' graph. One can then, invoke the procedure again, on the just created 'remaining' graph, which then creates a training graph and an even smaller 'remaining' graph. Below, is an example usage of how the splitRelationships procedure can be used to prepare the required datasets for training.

The 'remaining' graph after the second split can optionally be used to create node embeddings without data leakage from test or validation sets.

Note that, after the first invocation, we cannot use the 'remaining' graph as the training graph, because it not guaranteed to have 0/1 value relationship labels nor negative link examples.

1.4. Metrics

The Link Prediction model in the Neo4j GDS library supports only the Area Under the Precision-Recall Curve metric, abbreviated as AUCPR. However, most graphs have far more non-connected node pairs than connected ones. Thus, typically we have an issue with class imbalance.

We will denote by classRatio the ratio of non-connected node pairs to connected node pairs.

Evaluating the performance of a model candidate naively using graphs created by the splitRelationships procedure can give highly skewed results due to the equal sampling of positive and negative examples. In order to compensate for the skewed class sampling, one can provide a value for the classRatio parameter to the train mode. This will then weigh the models' false positives higher by a factor of classRatio.

2. Syntax

This section covers the syntax used to execute the Link Prediction algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

The named graphs must be projected in the UNDIRECTED orientation for the Link Prediction model.
Example 1. Link Prediction syntax per mode
Run Link Prediction in train mode on a named graph:
CALL gds.alpha.ml.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  configuration: Map
Table 1. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

featureProperties

List<String>

[]

yes

The names of the node properties that should be used as input features. All property names must exist in the in-memory graph and be of type Float or List<Float>.

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name Type Default Optional Description

trainRelationshipType

String

n/a

no

Relationship type to use during model training.

testRelationshipType

String

n/a

no

Relationship type to use during model evaluation.

validationFolds

Integer

n/a

no

Number of divisions of the training graph used during model selection.

classRatio

Float

n/a

no

Ratio of non-connected node pairs to connected node pairs in the original graph.

params

List<Map>

n/a

no

List of model configurations to be trained and compared. See next table for details.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

Table 4. Model configuration
Name Type Default Optional Description

batchSize

Integer

100

yes

Number of nodes per batch.

minIterations

Integer

1

yes

Minimum number of training epochs.

maxIterations

Integer

100

yes

Maximum number of training epochs.

maxStreakCount

Integer

1

yes

Maximum number of unproductive consecutive epochs.

tolerance

Float

0.001

yes

Epochs that do not improve the loss by a factor of 1 - tolerance are considered unproductive.

sharedUpdater

Boolean

false

yes

Whether to use a shared ADAM optimizer across training threads.

Table 5. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Name Type Description

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for various models and subsets of the data, see below.

The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric. Each such result map contains the keys train, validation, outerTrain and test. The latter two correspond to numeric values for the evaluation of the best model on the test set and its complement — the outer train set. The train and validation results contain lists of results for all candidate models (i.e. params). Each such result is in turn also a map with keys params, avg, min and max. These correspond to the hyper-parameters of the model candidate, and statistics for the result of that model over cross-validation folds. The structure of modelInfo is:

{
    bestParameters: Map,     // one of the maps specified in `params`
    metrics: {
        AUCPR: {
            test: Float,
            outerTrain: Float,
            train: [{
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...              // more results per model candidate
            ],
            validation: [{
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...              // more results per model candidate
            ]
        }
    }
}
Run Link Prediction in mutate mode on a named graph:
CALL gds.alpha.ml.linkPrediction.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map
Table 6. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 7. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateRelationshipType

String

n/a

no

The relationship type used for the new relationships written to the in-memory graph.

mutateProperty

String

n/a

no

The relationship property in the GDS graph to which the relationships is written.

Table 8. Algorithm specific configuration
Name Type Default Optional Description

topN

Integer

n/a

no

Limit on predicted relationships to output.

threshold

Float

n/a

no

Minimum predicted probability on relationships to output.

Table 9. Results
Name Type Description

createMillis

Integer

Milliseconds for creating the graph.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing the global metrics.

mutateMillis

Integer

Milliseconds for adding properties to the in-memory graph.

relationshipsWritten

Integer

Number of relationships created.

configuration

Map

Configuration used for running the algorithm.

3. Examples

In this section we will show examples of running the Link Prediction algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small social network graph of a handful nodes connected in a particular pattern. The example graph looks like this:

link prediction
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (alice:Person {name: 'Alice', numberOfPosts: 38}),
  (michael:Person {name: 'Michael', numberOfPosts: 67}),
  (karin:Person {name: 'Karin', numberOfPosts: 30}),
  (chris:Person {name: 'Chris', numberOfPosts: 132}),
  (will:Person {name: 'Will', numberOfPosts: 6}),
  (mark:Person {name: 'Mark', numberOfPosts: 32}),
  (greg:Person {name: 'Greg', numberOfPosts: 29}),
  (veselin:Person {name: 'Veselin', numberOfPosts: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin);

With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution. We do this using a native projection targeting the Person nodes and the KNOWS relationships. We will also project the numberOfPosts property, so we can use it as a model feature. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction model is defined only for undirected graphs.

In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.create(
  'myGraph',
  {
    Person: {
      properties: ['numberOfPosts']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

In the following examples we will demonstrate using the Link Prediction model on this graph.

3.1. Train

First, we must do the test/train splits. For this we will make use of the gds.alpha.ml.splitRelationships procedure. We will do one split to generate the test graph.

CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
  relationshipTypes: ['KNOWS'],
  remainingRelationshipType: 'KNOWS_REMAINING',
  holdoutRelationshipType: 'KNOWS_TESTGRAPH',
  holdoutFraction: 0.2
}) YIELD relationshipsWritten
Table 10. Results
relationshipsWritten

24

We will create copied relationships for each existing relationship, into either the KNOWS_REMAINING or the KNOWS_TESTGRAPH relationship types. All relationships in KNOWS_TESTGRAPH will have a label property. Additionally, a number of non-existing relationships will be created into the KNOWS_TESTGRAPH relationship type to be used as negative examples, with a label of 0.

Next, we will create the train graph.

CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
  relationshipTypes: ['KNOWS_REMAINING'],
  remainingRelationshipType: 'KNOWS_IGNORED_FOR_TRAINING',
  holdoutRelationshipType: 'KNOWS_TRAINGRAPH',
  holdoutFraction: 0.2
}) YIELD relationshipsWritten
Table 11. Results
relationshipsWritten

20

With both training and test graphs, we are ready to train models. We will use 5 validation folds, meaning we will split the train graph into 5 pairs, using one part of each pair for training and one for validation. We set the class ratio to 1.33, which is calculated as the number of non-existing relationships over the number of existing relationships. In our graph we have 8 nodes, which gives us a maximum of 8 * 7 / 2 = 28 undirected relationships. We have 12 actual relationships, giving us a class ratio of (28 - 12) / 12 = 1.33.

Train a Link Prediction model:
CALL gds.alpha.ml.linkPrediction.train('myGraph', {
  trainRelationshipType: 'KNOWS_TRAINGRAPH',
  testRelationshipType: 'KNOWS_TESTGRAPH',
  modelName: 'lp-numberOfPosts-model',
  featureProperties: ['numberOfPosts'],
  validationFolds: 5,
  classRatio: 1.33,
  randomSeed: 2,
  params: [
    {penalty: 0.5, maxIterations: 1000},
    {penalty: 1.0, maxIterations: 1000},
    {penalty: 0.0, maxIterations: 1000}
  ]
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
  modelInfo.metrics.AUCPR.test AS testGraphScore
Table 12. Results
winningModel trainGraphScore testGraphScore

{maxIterations: 1000, penalty: 0.5}

0.7145922746781116

0.3512042965360351

Here we can observe that the model candidate with penalty 0.5 performed the best in the training phase, with a score of about 71% over the train graph. On the test graph, the model scored much lower at about 35%. This indicates that the model reacted fairly well to the train graph, but did not generalise very well to unseen data. In order to achieve a higher test score, we may need to use better features, a larger graph, or different model configuration.

3.2. Mutate

In this example we will show how to use a trained model to predict new relationships in your in-memory graph. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model which we trained in the train example which we gave the name 'lp-numberOfPosts-model'.

We must also make sure that we do not include any of the relationships from the train or test graphs, which we do by specifying a relationship filter for the original relationship type 'KNOWS'.

CALL gds.alpha.ml.linkPrediction.predict.mutate('myGraph', {
  relationshipTypes: ['KNOWS'],
  modelName: 'lp-numberOfPosts-model',
  mutateRelationshipType: 'KNOWS_PREDICTED',
  topN: 5,
  threshold: 0.45
}) YIELD relationshipsWritten
Table 13. Results
relationshipsWritten

10

We specified threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.

In order to analyze our predicted relationships we will write them back to Neo4j:

CALL gds.graph.writeRelationship('myGraph', 'KNOWS_PREDICTED', 'probability')
YIELD relationshipsWritten, propertiesWritten
Table 14. Results
relationshipsWritten propertiesWritten

10

10

The end result looks like this:

link prediction mutate

In yellow we highlight the predicted relationships.