Link Prediction
This section describes the Link Prediction Model in the Neo4j Graph Data Science library.
1. Introduction
Link prediction is a common machine learning task applied to graphs: training a model to learn, between pairs of nodes in a graph, where relationships should exist. The predicted links are undirected. You can think of this as building a model to predict missing relationships in your dataset or relationships that are likely to form in the future. Neo4j GDS trains supervised machine learning models based on the relationships and node properties in your graph to predict the existence  and probability  of relationships.
Link Prediction can be used favorably together with preprocessing algorithms.
The basic work flow of Link Prediction contains the following parts which are described below:
1.1. Training, Model Selection and Evaluation
When building a model, it is possible to specify multiple model configurations and a model selection metric.
The train mode, gds.alpha.ml.linkPrediction.train
, is responsible for training and evaluating the models, selecting the best model, and storing it in the model catalog.
The train mode takes as input two relationship types representing the training graph and test graph respectively.
The relationship types must have an integer property, with values being either 0
or 1
.
If the value is 0
the relationship represents a negative example, meaning a node pair which is not connected in the original graph.
If the value is 1
the relationship represents a positive example, meaning a relationship which does exist in the original graph.
To obtain the feature vector for an example, the algorithm first forms node feature vectors for the source and target nodes of the example in question. This is done by concatenating the property values of the specified feature properties in order into a node feature vector. Thereafter, the algorithm uses a link feature combiner to combine the two node feature vectors into the feature vector for the training example. There are three supported link feature combiners:

L2

which gives a feature vector as
[(s_0  t_0)^2 , (s_1  t_1)^2, …, (s_d  t_d)^2]
, where[s_i]
and[t_i]
are the source and target node feature vectors.


HADAMARD

which gives a feature vector as
[s_0 * t_0, s_1 * t_1, …, s_d * t_d]
.


COSINE

which gives a single scalar feature, using the the Cosine similarity of source and target node feature vectors.

The precise steps of the train mode are:

The relationships of the training graph are divided into a number of folds, consisting of a training part and a validation part.

Each model candidate is trained on each train part and evaluated on the respective validation part. The training process uses a logistic regression algorithm, and the evaluation uses the AUCPR metric.

The model with the highest average score according to the metric will win the training.

The winning model will then be retrained on the whole training graph and evaluated on the training graph as well as on the test graph.

The winning model will be registered in the Model Catalog.
Trained models may then be used to predict the probability of a relationship between two nodes.
1.2. Applying a Link Prediction model
A previously trained model can be applied by invoking the gds.alpha.ml.linkPrediction.predict.mutate
mode.
This will retrieve the model by name from the model catalog.
The model will thereby be used to predict the probability of relationships between all node pairs in the graph that are not connected.
There are two mandatory configuration parameters, which limit the size of the output:

topN
retains the most probable predictions. 
threshold
retains predictions whose probability is above the threshold.
1.3. Train/Test Splitting
In order to train a Link Prediction model, one needs training and test graphs as described above.
The recommended way to obtain these is by using gds.alpha.ml.splitRelationships()
procedure, once to produce the test graph, and another time for the training graph.
By invoking this procedure the first time, one obtains two new relationship types that represent the test graph and a 'remaining' graph.
One can then, invoke the procedure again, on the just created 'remaining' graph, which then creates a training graph and an even smaller 'remaining' graph.
Below, is an example usage of how the splitRelationships
procedure can be used to prepare the required datasets for training.
The 'remaining' graph after the second split can optionally be used to create node embeddings without data leakage from test or validation sets.
Note that, after the first invocation, we cannot use the 'remaining' graph as the training graph, because it not guaranteed to have 0/1
value relationship labels nor negative link examples.
1.4. Metrics
The Link Prediction model in the Neo4j GDS library supports only the Area Under the PrecisionRecall Curve metric, abbreviated as AUCPR. In order to compute precision and recall we require a set of examples, each of which has a positive or negative target. For each example we have also a predicted target. Given the true and predicted targets, we can compute precision and recall (for reference, see f.e. Wikipedia).
Then, to compute the AUCPR, we construct the precisionrecall curve, as follows:

Each predicted target is associated with a prediction strength, for example the predicted probability of a positive target. We sort the examples in descending order of prediction strength.

For all prediction strengths that occur, we use that strength as a threshold and for all examples of that strength or higher predict that these examples have positive targets.

We now compute precision
p
and recallr
and consider the tuple(r, p)
as a point on a curve, the precisionrecall curve. 
Finally, the curve is linearly interpolated and the area is computed as a union of trapezoids with corners on the points.
The curve will have a shape that looks something like this:
Note here the blue area which shows one trapezoid under the curve.
The area under the PrecisionRecall curve can also be interpreted as an average precision where the average is over different classification thresholds.
1.4.1. Class imbalance
Most graphs have far more nonconnected node pairs than connected ones (e.g. sparse graphs).
Thus, typically we have an issue with class imbalance.
There are multiple strategies to account for imbalanced data.
In our procedure, the AUCPR metric is used which is considered more suitable than the commonly used AUROC (Area Under the Receiver Operating Characteristic) metric for imbalanced data.
For the metric to appropriately reflect both positive (connected node pairs) and negative (nonconnected pairs) examples, we provide the ability to both control the ratio of sampling between the classes, and to control the relative weight of classes via negativeClassWeight
.
The former is configured by the configuration parameter negativeSamplingRatio
in splitRelationships when using that procedure to generate the test set.
Tuning the negativeClassWeight
, which is explained below, means weighting up or down the false positives when computing precision.
The recommended value for negativeSamplingRatio
is the true class ratio of the graph, in other words, not applying undersampling.
However, the higher the value, the bigger the test set and thus the time to evaluate.
The ratio of total probability mass of negative versus positive examples in the test set is approximately negativeSamplingRatio * negativeClassWeight
.
Thus, both of these parameters can be adjusted in tandem to trade off evaluation accuracy with speed.
The true class ratio is computed as (q  r) / r
, where q = n(n1)/2
is the number of possible undirected relationships, and r
is the number of actual undirected relationships.
Please note that the relationshipCount
reported by the graph list procedure is the directed count of relationships summed over all existing relationship types.
Thus, we recommend using Cypher to obtain r
on the source Neo4j graph.
For example, this query will count the number of relationships of type T
or R
:
MATCH (a)[rel:T  R](b)
WHERE a < b
RETURN count(rel) AS r
When choosing a value for negativeClassWeight
, two factors should be considered.
First, the desired ratio of total probability mass of negative versus positive examples in the test set.
Second, what the ratio of sampled negative examples to positive examples was in the test set.
To be consistent with traditional evaluation, one should choose parameters so that negativeSamplingRatio * negativeClassWeight = 1.0
, for example by setting the values to the true class ratio and its reciprocal, or both values to 1.0
.
Alternatively, one can aim for the ratio of total probability weight between the classes to be close to the true class ratio.
That is, making sure negativeSamplingRatio * negativeClassWeight
is close to the true class ratio.
The reported metric (AUCPR) then better reflects the expected precision on unseen highly imbalanced data.
With this type of evaluation one has to adjust expectations as the metric value then becomes much smaller.
2. Syntax
This section covers the syntax used to execute the Link Prediction algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.
The named graphs must be projected in the UNDIRECTED orientation for the Link Prediction model.

CALL gds.alpha.ml.linkPrediction.train(
graphName: String,
configuration: Map
) YIELD
trainMillis: Integer,
modelInfo: Map,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

modelName 
String 

no 
The name of the model to train, must not exist in the Model Catalog. 
featureProperties 
List<String> 

yes 
The names of the node properties that should be used as input features. All property names must exist in the inmemory graph and be of type Float or List<Float>. 
String[] 

yes 
Filter the named graph using the given node labels. 

String[] 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 
Name  Type  Default  Optional  Description 

trainRelationshipType 
String 

no 
Relationship type to use during model training. 
testRelationshipType 
String 

no 
Relationship type to use during model evaluation. 
validationFolds 
Integer 

no 
Number of divisions of the training graph used during model selection. 
negativeClassWeight 
Float 

no 
Weight of negative examples in model evaluation. Positive examples have weight 1. 
params 
List<Map> 

no 
List of model configurations to be trained and compared. See next table for details. 
randomSeed 
Integer 

yes 
Seed for the random number generator used during training. 
Name  Type  Default  Optional  Description 

batchSize 
Integer 

yes 
Number of nodes per batch. 
minEpochs 
Integer 

yes 
Minimum number of training epochs. 
maxEpochs 
Integer 

yes 
Maximum number of training epochs. 
patience 
Integer 

yes 
Maximum number of unproductive consecutive epochs. 
tolerance 
Float 

yes 
Epochs that do not improve the loss by a factor of 1  
sharedUpdater 
Boolean 

yes 
Whether to use a shared ADAM optimizer across training threads. 
concurrency 
Integer 

yes 
Concurrency for training the model candidate. By default the value of the top level 
For hyperparameter tuning ideas, look here.
Name  Type  Description 

trainMillis 
Integer 
Milliseconds used for training. 
modelInfo 
Map 
Information about the training and the winning model. 
configuration 
Map 
Configuration used for the train procedure. 
The modelInfo
can also be retrieved at a later time by using the Model List Procedure.
The modelInfo
return field has the following algorithmspecific subfields:
Name  Type  Description 

bestParameters 
Map 
The model parameters which performed best on average on validation folds according to the primary metric. 
metrics 
Map 
Map from metric description to evaluated metrics for various models and subsets of the data, see below. 
The metrics
map contains an entry for each metric description (currently only AUCPR
) and the corresponding results for that metric.
Each such result map contains the keys train
, validation
, outerTrain
and test
.
The latter two correspond to numeric values for the evaluation of the best model on the test set and its complement — the outer train set.
The train
and validation
results contain lists of results for all candidate models (i.e. params
). Each such result is in turn also a map with keys params
, avg
, min
and max
.
These correspond to the hyperparameters of the model candidate, and statistics for the result of that model over crossvalidation folds.
The structure of modelInfo
is:
{
bestParameters: Map, // one of the maps specified in `params`
metrics: {
AUCPR: {
test: Float,
outerTrain: Float,
train: [{
avg: Float,
max: Float,
min: Float,
params: Map
},
{
avg: Float,
max: Float,
min: Float,
params: Map
},
... // more results per model candidate
],
validation: [{
avg: Float,
max: Float,
min: Float,
params: Map
},
{
avg: Float,
max: Float,
min: Float,
params: Map
},
... // more results per model candidate
]
}
}
}
CALL gds.alpha.ml.linkPrediction.predict.stream(
graphName: String,
configuration: Map
)
YIELD
node1: Integer,
node2: Integer,
probability: Float
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

modelName 
String 

no 
The name of a Link Prediction model in the model catalog. 
String[] 

yes 
Filter the named graph using the given node labels. 

String[] 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 
Name  Type  Default  Optional  Description 

topN 
Integer 

no 
Limit on predicted relationships to output. 
threshold 
Float 

no 
Minimum predicted probability on relationships to output. 
Name  Type  Description 

node1 
Integer 
Node ID of the first node. 
node2 
Integer 
Node ID of the second node. 
probability 
Float 
Predicted probability of a link between the nodes. 
CALL gds.alpha.ml.linkPrediction.predict.mutate(
graphName: String,
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
postProcessingMillis: Integer,
mutateMillis: Integer,
relationshipsWritten: Integer,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

modelName 
String 

no 
The name of a Link Prediction model in the model catalog. 
String[] 

yes 
Filter the named graph using the given node labels. 

String[] 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 

mutateRelationshipType 
String 

no 
The relationship type used for the new relationships written to the inmemory graph. 
mutateProperty 
String 

yes 
The relationship property in the GDS graph to which the result is written. 
Name  Type  Default  Optional  Description 

topN 
Integer 

no 
Limit on predicted relationships to output. 
threshold 
Float 

no 
Minimum predicted probability on relationships to output. 
Name  Type  Description 

createMillis 
Integer 
Milliseconds for creating the graph. 
computeMillis 
Integer 
Milliseconds for running the algorithm. 
postProcessingMillis 
Integer 
Milliseconds for computing the global metrics. 
mutateMillis 
Integer 
Milliseconds for adding properties to the inmemory graph. 
relationshipsWritten 
Integer 
Number of relationships created. 
configuration 
Map 
Configuration used for running the algorithm. 
CALL gds.alpha.ml.linkPrediction.predict.write(
graphName: String,
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
postProcessingMillis: Integer,
writeMillis: Integer,
relationshipsWritten: Integer,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

modelName 
String 

no 
The name of a Link Prediction model in the model catalog. 
String[] 

yes 
Filter the named graph using the given node labels. 

String[] 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 

writeRelationshipType 
String 

no 
The relationship type used to persist the computed relationships in the Neo4j database. 
String 

no 
The relationship property in the Neo4j database to which the result is written. 
Name  Type  Default  Optional  Description 

topN 
Integer 

no 
Limit on predicted relationships to output. 
threshold 
Float 

no 
Minimum predicted probability on relationships to output. 
Name  Type  Description 

createMillis 
Integer 
Milliseconds for creating the graph. 
computeMillis 
Integer 
Milliseconds for running the algorithm. 
postProcessingMillis 
Integer 
Milliseconds for computing the global metrics. 
writeMillis 
Integer 
Milliseconds for writing result data back to Neo4j. 
relationshipsWritten 
Integer 
Number of relationships created. 
configuration 
Map 
Configuration used for running the algorithm. 
3. Examples
In this section we will show examples of running the Link Prediction algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small social network graph of a handful nodes connected in a particular pattern. The example graph looks like this:
CREATE
(alice:Person {name: 'Alice', numberOfPosts: 38}),
(michael:Person {name: 'Michael', numberOfPosts: 67}),
(karin:Person {name: 'Karin', numberOfPosts: 30}),
(chris:Person {name: 'Chris', numberOfPosts: 132}),
(will:Person {name: 'Will', numberOfPosts: 6}),
(mark:Person {name: 'Mark', numberOfPosts: 32}),
(greg:Person {name: 'Greg', numberOfPosts: 29}),
(veselin:Person {name: 'Veselin', numberOfPosts: 3}),
(alice)[:KNOWS]>(michael),
(michael)[:KNOWS]>(karin),
(michael)[:KNOWS]>(chris),
(michael)[:KNOWS]>(greg),
(will)[:KNOWS]>(michael),
(will)[:KNOWS]>(chris),
(mark)[:KNOWS]>(michael),
(mark)[:KNOWS]>(will),
(greg)[:KNOWS]>(chris),
(veselin)[:KNOWS]>(chris),
(karin)[:KNOWS]>(veselin),
(chris)[:KNOWS]>(karin);
With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution.
We do this using a native projection targeting the Person
nodes and the KNOWS
relationships.
We will also project the numberOfPosts
property, so we can use it as a model feature.
For the relationships we must use the UNDIRECTED
orientation.
This is because the Link Prediction model is defined only for undirected graphs.
In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used. 
CALL gds.graph.create(
'myGraph',
{
Person: {
properties: ['numberOfPosts']
}
},
{
KNOWS: {
orientation: 'UNDIRECTED'
}
}
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

In the following examples we will demonstrate using the Link Prediction model on this graph.
3.1. Train
First, we must do the test/train splits.
For this we will make use of the gds.alpha.ml.splitRelationships
procedure.
We will do one split to generate the test graph.
We note that in the example graph there are eight nodes and twelve directed relationships.
Recall that we compute the class ratio as (q  r) / q
, where we then have q = 8(81)/2
and r = 12
which gives us class ratio of (28  12) / 12 ~= 1.33
.
We use this to configure negativeSampleRatio
to achieve a sampling proportional to the class ratio.
CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
relationshipTypes: ['KNOWS'],
remainingRelationshipType: 'KNOWS_REMAINING',
holdoutRelationshipType: 'KNOWS_TESTGRAPH',
holdoutFraction: 0.2,
negativeSamplingRatio: 1.33,
randomSeed: 1984
}) YIELD relationshipsWritten
relationshipsWritten 

25 
We will create copied relationships for each existing relationship, into either the KNOWS_REMAINING
or the KNOWS_TESTGRAPH
relationship types.
All relationships in KNOWS_TESTGRAPH
will have a label
property.
Additionally, a number of nonexisting relationships will be created into the KNOWS_TESTGRAPH
relationship type to be used as negative examples, with a label
of 0
.
Next, we will create the train graph.
CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
relationshipTypes: ['KNOWS_REMAINING'],
remainingRelationshipType: 'KNOWS_IGNORED_FOR_TRAINING',
holdoutRelationshipType: 'KNOWS_TRAINGRAPH',
holdoutFraction: 0.2,
negativeSamplingRatio: 1.33,
randomSeed: 1984
}) YIELD relationshipsWritten
relationshipsWritten 

20 
With both training and test graphs, we are ready to train models.
We will use 5 validation folds, meaning we will split the train graph into 5 pairs, using one part of each pair for training and one for validation.
Since we set the negativeSamplingRatio
to 1.33 (the class ratio of the graph) above, we’ll set the negativeClassWeight
during training to 1 / 1.33
to assign equal weight to both classes.
CALL gds.alpha.ml.linkPrediction.train('myGraph', {
trainRelationshipType: 'KNOWS_TRAINGRAPH',
testRelationshipType: 'KNOWS_TESTGRAPH',
modelName: 'lpnumberOfPostsmodel',
featureProperties: ['numberOfPosts'],
validationFolds: 5,
negativeClassWeight: 1.0 / 1.33,
randomSeed: 2,
concurrency: 1,
params: [
{penalty: 0.5, maxEpochs: 1000},
{penalty: 1.0, maxEpochs: 1000},
{penalty: 0.0, maxEpochs: 1000}
]
}) YIELD modelInfo
RETURN
{ maxEpochs: modelInfo.bestParameters.maxEpochs, penalty: modelInfo.bestParameters.penalty } AS winningModel,
modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
modelInfo.metrics.AUCPR.test AS testGraphScore
winningModel  trainGraphScore  testGraphScore 

{maxEpochs=1000, penalty=0.5} 
0.38525757517173825 
0.46710171439292664 
Here we can observe that the model candidate with penalty 0.5 performed the best in the training phase, with a score of about 71% over the train graph. On the test graph, the model scored much lower at about 35%. This indicates that the model reacted fairly well to the train graph, but did not generalise very well to unseen data. In order to achieve a higher test score, we may need to use better features, a larger graph, or different model configuration.
3.2. Stream
In the stream
execution mode, the algorithm returns the top predicted relationships.
This allows us to inspect the results directly or postprocess them in Cypher without any side effects.
For more details on the stream
mode in general, see Stream.
CALL gds.alpha.ml.linkPrediction.predict.stream('myGraph', {
relationshipTypes: ['KNOWS'],
modelName: 'lpnumberOfPostsmodel',
topN: 5,
threshold: 0.45
}) YIELD node1, node2, probability
MATCH (n), (m)
WHERE id(n) = node1 AND id(m) = node2
RETURN
n.name AS name1,
m.name AS name2,
probability
name1  name2  probability 

"Karin" 
"Greg" 
0.4991363247445545 
"Karin" 
"Mark" 
0.49896977670628373 
"Mark" 
"Greg" 
0.49869219716877955 
"Will" 
"Veselin" 
0.49869219716877955 
"Alice" 
"Mark" 
0.49719328593255546 
We specified threshold
to filter out predictions with probability less than 45%, and topN
to further limit output to the top 5 relationships.
Note that the predicted link between the Karin and Greg nodes does not reflect any particular direction between them.
3.3. Mutate
In this example we will show how to use a trained model to predict new relationships in your inmemory graph.
In order to do this, we must first have an already trained model registered in the Model Catalog.
We will use the model which we trained in the train example which we gave the name 'lpnumberOfPostsmodel'
.
We must also make sure that we do not include any of the relationships from the train or test graphs, which we do by specifying a relationship filter for the original relationship type 'KNOWS'
.
CALL gds.alpha.ml.linkPrediction.predict.mutate('myGraph', {
relationshipTypes: ['KNOWS'],
modelName: 'lpnumberOfPostsmodel',
mutateRelationshipType: 'KNOWS_PREDICTED',
topN: 5,
threshold: 0.45
}) YIELD relationshipsWritten
relationshipsWritten 

10 
We specified threshold
to filter out predictions with probability less than 45%, and topN
to further limit output to the top 5 relationships.
Because we are using the UNDIRECTED
orientation, we will write twice as many relationships to the inmemory graph.
3.4. Write
In this example we will show how to use a trained model to predict new relationships in your inmemory graph, and write the predictions back to Neo4j.
We will again use the model 'lpnumberOfPostsmodel'
, as in the mutate example.
CALL gds.alpha.ml.linkPrediction.predict.write('myGraph', {
relationshipTypes: ['KNOWS'],
modelName: 'lpnumberOfPostsmodel',
writeRelationshipType: 'KNOWS_PREDICTED',
topN: 5,
threshold: 0.45
}) YIELD relationshipsWritten
relationshipsWritten 

10 
The end result looks like this:
In yellow we highlight the predicted relationships.
Was this page helpful?