Link Prediction
This section describes the Link Prediction Model in the Neo4j Graph Data Science library.
1. Introduction
Link prediction is a common machine learning task applied to graphs: training a model to learn, between pairs of nodes in a graph, where relationships should exist. You can think of this as building a model to predict missing relationships in your dataset or relationships that are likely to form in the future. Neo4j GDS trains supervised machine learning models based on the relationships and node properties in your graph to predict the existence  and probability  of relationships.
The basic work flow of Link Prediction contains the following parts which are described below:
1.1. Training, Model Selection and Evaluation
When building a model, it is possible to specify multiple model configurations and a model selection metric.
The train mode, gds.alpha.ml.linkPrediction.train
, is responsible for training and evaluating the models, selecting the best model, and storing it in the model catalog.
The train mode takes as input two relationship types representing the training graph and test graph respectively.
The relationship types must have an integer property, with values being either 0
or 1
.
If the value is 0
the relationship represents a negative example, meaning a node pair which is not connected in the original graph.
If the value is 1
the relationship represents a positive example, meaning a relationship which does exist in the original graph.
To obtain the feature vector for an example, the algorithm first forms node feature vectors for the source and target nodes of the example in question. This is done by concatenating the property values of the specified feature properties in order into a node feature vector. Thereafter, the algorithm uses a link feature combiner to combine the two node feature vectors into the feature vector for the training example. There are three supported link feature combiners:

L2

which gives a feature vector as
[(s_0  t_0)^2 , (s_1  t_1)^2, …, (s_d  t_d)^2]
, where[s_i]
and[t_i]
are the source and target node feature vectors.


HADAMARD

which gives a feature vector as
[s_0 * t_0, s_1 * t_1, …, s_d * t_d]
.


COSINE

which gives a single scalar feature, using the the Cosine similarity of source and target node feature vectors.

The precise steps of the train mode are:

The relationships of the training graph are divided into a number of folds, consisting of a training part and a validation part.

Each model candidate is trained on each train part and evaluated on the respective validation part. The training process uses a logistic regression algorithm, and the evaluation uses the AUCPR metric.

The model with the highest average score according to the metric will win the training.

The winning model will then be retrained on the whole training graph and evaluated on the training graph as well as on the test graph.

The winning model will be registered in the Model Catalog.
Trained models may then be used to predict the probability of a relationship between two nodes.
1.2. Applying a Link Prediction model
A previously trained model can be applied by invoking the gds.alpha.ml.linkPrediction.predict.mutate
mode.
This will retrieve the model by name from the model catalog.
The model will thereby be used to predict the probability of relationships between all node pairs in the graph that are not connected.
There are two mandatory configuration parameters, which limit the size of the output:

topN
retains the most probable predictions. 
threshold
retains predictions whose probability is above the threshold.
1.3. Train/Test Splitting
In order to train a Link Prediction model, one needs training and test graphs as described above.
The recommended way to obtain these is by using gds.alpha.ml.splitRelationships()
procedure, once to produce the test graph, and another time for the training graph.
By invoking this procedure the first time, one obtains two new relationship types that represent the test graph and a 'remaining' graph.
One can then, invoke the procedure again, on the just created 'remaining' graph, which then creates a training graph and an even smaller 'remaining' graph.
Below, is an example usage of how the splitRelationships
procedure can be used to prepare the required datasets for training.
The 'remaining' graph after the second split can optionally be used to create node embeddings without data leakage from test or validation sets.
Note that, after the first invocation, we cannot use the 'remaining' graph as the training graph, because it not guaranteed to have 0/1
value relationship labels nor negative link examples.
1.4. Metrics
The Link Prediction model in the Neo4j GDS library supports only the Area Under the PrecisionRecall Curve metric, abbreviated as AUCPR. However, most graphs have far more nonconnected node pairs than connected ones. Thus, typically we have an issue with class imbalance.
We will denote by classRatio
the ratio of nonconnected node pairs to connected node pairs.
Evaluating the performance of a model candidate naively using graphs created by the splitRelationships
procedure can give highly skewed results due to the equal sampling of positive and negative examples.
In order to compensate for the skewed class sampling, one can provide a value for the classRatio
parameter to the train
mode.
This will then weigh the models' false positives higher by a factor of classRatio
.
2. Syntax
This section covers the syntax used to execute the Link Prediction algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.
The named graphs must be projected in the UNDIRECTED orientation for the Link Prediction model.

CALL gds.alpha.ml.linkPrediction.train(
graphName: String,
configuration: Map
) YIELD
trainMillis: Integer,
modelInfo: Map,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

modelName 
String 

no 
The name of the model to train, must not exist in the Model Catalog. 
featureProperties 
List<String> 

yes 
The names of the node properties that should be used as input features. All property names must exist in the inmemory graph and be of type Float or List<Float>. 
nodeLabels 
String[] 

yes 
Filter the named graph using the given node labels. 
String[] 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 
Name  Type  Default  Optional  Description 

trainRelationshipType 
String 

no 
Relationship type to use during model training. 
testRelationshipType 
String 

no 
Relationship type to use during model evaluation. 
validationFolds 
Integer 

no 
Number of divisions of the training graph used during model selection. 
classRatio 
Float 

no 
Ratio of nonconnected node pairs to connected node pairs in the original graph. 
params 
List<Map> 

no 
List of model configurations to be trained and compared. See next table for details. 
randomSeed 
Integer 

yes 
Seed for the random number generator used during training. 
Name  Type  Default  Optional  Description 

batchSize 
Integer 

yes 
Number of nodes per batch. 
minIterations 
Integer 

yes 
Minimum number of training epochs. 
maxIterations 
Integer 

yes 
Maximum number of training epochs. 
maxStreakCount 
Integer 

yes 
Maximum number of unproductive consecutive epochs. 
tolerance 
Float 

yes 
Epochs that do not improve the loss by a factor of 1  
sharedUpdater 
Boolean 

yes 
Whether to use a shared ADAM optimizer across training threads. 
Name  Type  Description 

trainMillis 
Integer 
Milliseconds used for training. 
modelInfo 
Map 
Information about the training and the winning model. 
configuration 
Map 
Configuration used for the train procedure. 
The modelInfo
can also be retrieved at a later time by using the Model List Procedure.
The modelInfo
return field has the following algorithmspecific subfields:
Name  Type  Description 

bestParameters 
Map 
The model parameters which performed best on average on validation folds according to the primary metric. 
metrics 
Map 
Map from metric description to evaluated metrics for various models and subsets of the data, see below. 
The metrics
map contains an entry for each metric description (currently only AUCPR
) and the corresponding results for that metric.
Each such result map contains the keys train
, validation
, outerTrain
and test
.
The latter two correspond to numeric values for the evaluation of the best model on the test set and its complement — the outer train set.
The train
and validation
results contain lists of results for all candidate models (i.e. params
). Each such result is in turn also a map with keys params
, avg
, min
and max
.
These correspond to the hyperparameters of the model candidate, and statistics for the result of that model over crossvalidation folds.
The structure of modelInfo
is:
{
bestParameters: Map, // one of the maps specified in `params`
metrics: {
AUCPR: {
test: Float,
outerTrain: Float,
train: [{
avg: Float,
max: Float,
min: Float,
params: Map
},
{
avg: Float,
max: Float,
min: Float,
params: Map
},
... // more results per model candidate
],
validation: [{
avg: Float,
max: Float,
min: Float,
params: Map
},
{
avg: Float,
max: Float,
min: Float,
params: Map
},
... // more results per model candidate
]
}
}
}
CALL gds.alpha.ml.linkPrediction.predict.mutate(
graphName: String,
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
postProcessingMillis: Integer,
mutateMillis: Integer,
relationshipsWritten: Integer,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

nodeLabels 
String[] 

yes 
Filter the named graph using the given node labels. 
String[] 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 

mutateRelationshipType 
String 

no 
The relationship type used for the new relationships written to the inmemory graph. 
mutateProperty 
String 

no 
The relationship property in the GDS graph to which the relationships is written. 
Name  Type  Default  Optional  Description 

topN 
Integer 

no 
Limit on predicted relationships to output. 
threshold 
Float 

no 
Minimum predicted probability on relationships to output. 
Name  Type  Description 

createMillis 
Integer 
Milliseconds for creating the graph. 
computeMillis 
Integer 
Milliseconds for running the algorithm. 
postProcessingMillis 
Integer 
Milliseconds for computing the global metrics. 
mutateMillis 
Integer 
Milliseconds for adding properties to the inmemory graph. 
relationshipsWritten 
Integer 
Number of relationships created. 
configuration 
Map 
Configuration used for running the algorithm. 
3. Examples
In this section we will show examples of running the Link Prediction algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small social network graph of a handful nodes connected in a particular pattern. The example graph looks like this:
CREATE
(alice:Person {name: 'Alice', numberOfPosts: 38}),
(michael:Person {name: 'Michael', numberOfPosts: 67}),
(karin:Person {name: 'Karin', numberOfPosts: 30}),
(chris:Person {name: 'Chris', numberOfPosts: 132}),
(will:Person {name: 'Will', numberOfPosts: 6}),
(mark:Person {name: 'Mark', numberOfPosts: 32}),
(greg:Person {name: 'Greg', numberOfPosts: 29}),
(veselin:Person {name: 'Veselin', numberOfPosts: 3}),
(alice)[:KNOWS]>(michael),
(michael)[:KNOWS]>(karin),
(michael)[:KNOWS]>(chris),
(michael)[:KNOWS]>(greg),
(will)[:KNOWS]>(michael),
(will)[:KNOWS]>(chris),
(mark)[:KNOWS]>(michael),
(mark)[:KNOWS]>(will),
(greg)[:KNOWS]>(chris),
(veselin)[:KNOWS]>(chris),
(karin)[:KNOWS]>(veselin),
(chris)[:KNOWS]>(karin);
With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution.
We do this using a native projection targeting the Person
nodes and the KNOWS
relationships.
We will also project the numberOfPosts
property, so we can use it as a model feature.
For the relationships we must use the UNDIRECTED
orientation.
This is because the Link Prediction model is defined only for undirected graphs.
In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used. 
CALL gds.graph.create(
'myGraph',
{
Person: {
properties: ['numberOfPosts']
}
},
{
KNOWS: {
orientation: 'UNDIRECTED'
}
}
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

In the following examples we will demonstrate using the Link Prediction model on this graph.
3.1. Train
First, we must do the test/train splits.
For this we will make use of the gds.alpha.ml.splitRelationships
procedure.
We will do one split to generate the test graph.
CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
relationshipTypes: ['KNOWS'],
remainingRelationshipType: 'KNOWS_REMAINING',
holdoutRelationshipType: 'KNOWS_TESTGRAPH',
holdoutFraction: 0.2
}) YIELD relationshipsWritten
relationshipsWritten 

24 
We will create copied relationships for each existing relationship, into either the KNOWS_REMAINING
or the KNOWS_TESTGRAPH
relationship types.
All relationships in KNOWS_TESTGRAPH
will have a label
property.
Additionally, a number of nonexisting relationships will be created into the KNOWS_TESTGRAPH
relationship type to be used as negative examples, with a label
of 0
.
Next, we will create the train graph.
CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
relationshipTypes: ['KNOWS_REMAINING'],
remainingRelationshipType: 'KNOWS_IGNORED_FOR_TRAINING',
holdoutRelationshipType: 'KNOWS_TRAINGRAPH',
holdoutFraction: 0.2
}) YIELD relationshipsWritten
relationshipsWritten 

20 
With both training and test graphs, we are ready to train models.
We will use 5 validation folds, meaning we will split the train graph into 5 pairs, using one part of each pair for training and one for validation.
We set the class ratio to 1.33, which is calculated as the number of nonexisting relationships over the number of existing relationships.
In our graph we have 8 nodes, which gives us a maximum of 8 * 7 / 2 = 28
undirected relationships.
We have 12 actual relationships, giving us a class ratio of (28  12) / 12 = 1.33
.
CALL gds.alpha.ml.linkPrediction.train('myGraph', {
trainRelationshipType: 'KNOWS_TRAINGRAPH',
testRelationshipType: 'KNOWS_TESTGRAPH',
modelName: 'lpnumberOfPostsmodel',
featureProperties: ['numberOfPosts'],
validationFolds: 5,
classRatio: 1.33,
randomSeed: 2,
params: [
{penalty: 0.5, maxIterations: 1000},
{penalty: 1.0, maxIterations: 1000},
{penalty: 0.0, maxIterations: 1000}
]
}) YIELD modelInfo
RETURN
modelInfo.bestParameters AS winningModel,
modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
modelInfo.metrics.AUCPR.test AS testGraphScore
winningModel  trainGraphScore  testGraphScore 

{maxIterations: 1000, penalty: 0.5} 
0.7145922746781116 
0.3512042965360351 
Here we can observe that the model candidate with penalty 0.5 performed the best in the training phase, with a score of about 71% over the train graph. On the test graph, the model scored much lower at about 35%. This indicates that the model reacted fairly well to the train graph, but did not generalise very well to unseen data. In order to achieve a higher test score, we may need to use better features, a larger graph, or different model configuration.
3.2. Mutate
In this example we will show how to use a trained model to predict new relationships in your inmemory graph.
In order to do this, we must first have an already trained model registered in the Model Catalog.
We will use the model which we trained in the train example which we gave the name 'lpnumberOfPostsmodel'
.
We must also make sure that we do not include any of the relationships from the train or test graphs, which we do by specifying a relationship filter for the original relationship type 'KNOWS'
.
CALL gds.alpha.ml.linkPrediction.predict.mutate('myGraph', {
relationshipTypes: ['KNOWS'],
modelName: 'lpnumberOfPostsmodel',
mutateRelationshipType: 'KNOWS_PREDICTED',
topN: 5,
threshold: 0.45
}) YIELD relationshipsWritten
relationshipsWritten 

10 
We specified threshold
to filter out predictions with probability less than 45%, and topN
to further limit output to the top 5 relationships.
Because we are using the UNDIRECTED
orientation, we will write twice as many relationships to the inmemory graph.
In order to analyze our predicted relationships we will write them back to Neo4j:
CALL gds.graph.writeRelationship('myGraph', 'KNOWS_PREDICTED', 'probability')
YIELD relationshipsWritten, propertiesWritten
relationshipsWritten  propertiesWritten 

10 
10 
The end result looks like this:
In yellow we highlight the predicted relationships.
Was this page helpful?