Training the pipeline
The train mode, `gds.beta.pipeline.linkPrediction.train`, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use.
Running this mode results in a prediction model of type `LinkPrediction` being stored in the model catalog along with metrics collected during training.
The model can be applied to a possibly different graph, producing a relationship type of predicted links, each having a predicted probability stored as a property.
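As a preview of that last point, a trained model could later be applied with the pipeline's predict mutate mode. The following is a minimal sketch, assuming a trained model stored as 'lp-pipeline-model' and a projected graph 'myGraph' (both names are placeholders; prediction has its own dedicated section and is only previewed here):

CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
  modelName: 'lp-pipeline-model',
  mutateRelationshipType: 'KNOWS_PREDICTED', // type given to the predicted relationships
  topN: 5,                                   // keep only the 5 highest-probability predictions
  threshold: 0.5                             // discard predictions below this probability
}) YIELD relationshipsWritten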
More precisely, the procedure will in order:

1. Apply the `nodeLabels` and `relationshipTypes` filters to the graph. All subsequent graphs have the same node set.
2. Create a relationship split of the graph into `test`, `train` and `feature-input` sets as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.
3. Apply the node property steps, added according to Adding node properties, on the `feature-input` graph.
4. Apply the feature steps, added according to Adding link features, to the `train` graph, which yields for each `train` relationship an instance, that is, a feature vector and a binary label.
5. Split the training instances using stratified k-fold cross-validation. The number of folds `k` can be configured using `validationFolds` in `gds.beta.pipeline.linkPrediction.configureSplit` (see the sketch after this list).
6. Train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The evaluation uses the specified metric.
7. Declare as winner the model with the highest average metric across the folds.
8. Re-train the winning model on the whole training set and evaluate it on both the `train` and `test` sets. In order to evaluate on the `test` set, the feature pipeline is first applied again as for the `train` set.
9. Register the winning model in the Model Catalog.
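For example, the relationship split and the number of validation folds can be configured on the pipeline before training. A minimal sketch, assuming the pipeline from the previous sections is named 'pipe':

CALL gds.beta.pipeline.linkPrediction.configureSplit('pipe', {
  testFraction: 0.25,  // fraction of relationships reserved for the test set
  trainFraction: 0.6,  // fraction of the remaining graph used for the train set
  validationFolds: 5   // number of folds k for cross-validation
})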
The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.

A step can only use node properties that are already present in the input graph or produced by previously added steps.
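To make the ordering constraint concrete, here is a minimal sketch, again assuming a pipeline named 'pipe': the fastRP node property step is added first, and the hadamard feature step added after it consumes the property that fastRP mutates.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
  mutateProperty: 'embedding', // node property produced by this step
  embeddingDimension: 64,
  randomSeed: 42
});
CALL gds.beta.pipeline.linkPrediction.addFeature('pipe', 'hadamard', {
  nodeProperties: ['embedding'] // consumes the property produced above
});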
1. Syntax
CALL gds.beta.pipeline.linkPrediction.train(
graphName: String,
configuration: Map
) YIELD
trainMillis: Integer,
modelInfo: Map,
modelSelectionStats: Map,
configuration: Map
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| graphName | String | | no | The name of a graph stored in the catalog. |
| configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| modelName | String | | no | The name of the model to train, must not exist in the Model Catalog. |
| pipeline | String | | no | The name of the pipeline to execute. |
| negativeClassWeight | Float | | yes | Weight of negative examples in model evaluation. Positive examples have weight 1. More details here. |
| metrics | List of String | | no | Metrics used to evaluate the models. |
| randomSeed | Integer | | yes | Seed for the random number generator used during training. |
| nodeLabels | List of String | | yes | Filter the named graph using the given node labels. |
| relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. |
| concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
| jobId | String | | yes | An ID that can be provided to more easily track the training's progress. |
| Name | Type | Description |
|---|---|---|
| trainMillis | Integer | Milliseconds used for training. |
| modelInfo | Map | Information about the training and the winning model. |
| modelSelectionStats | Map | Statistics about evaluated metrics for all model candidates. |
| configuration | Map | Configuration used for the train procedure. |
The `modelSelectionStats` return field has the following algorithm-specific subfields:
| Name | Type | Description |
|---|---|---|
| bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric. |
| modelCandidates | List | List of maps, where each map contains information about one model candidate. This information includes the candidate's parameters, training statistics and validation statistics. |
| bestTrial | Integer | The trial that produced the best model. The first trial has number 1. |
The `modelInfo` can also be retrieved at a later time by using the Model List Procedure. The `modelInfo` return field has the following algorithm-specific subfields:

| Name | Type | Description |
|---|---|---|
| modelName | String | The name of the trained model. |
| modelType | String | The type of the trained model. |
| bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric. |
| metrics | Map | Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below. |
| trainingPipeline | Map | The pipeline used for the training. |
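For example, a minimal sketch of retrieving the stored information later, assuming the model name 'lp-pipeline-model' used in the training example below:

CALL gds.beta.model.list('lp-pipeline-model')
YIELD modelInfo
RETURN modelInfo.bestParameters AS bestParameters, modelInfo.metrics AS metrics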
The structure of `modelInfo` is:

{
    bestParameters: Map,        (1)
    trainingPipeline: Map,      (2)
    metrics: {                  (3)
        AUCPR: {
            test: Float,        (4)
            outerTrain: Float,  (5)
            train: {            (6)
                avg: Float,
                max: Float,
                min: Float
            },
            validation: {       (7)
                avg: Float,
                max: Float,
                min: Float
            }
        }
    }
}
1. The best scoring model candidate configuration.
2. The pipeline used for the training.
3. The metrics map contains an entry for each metric description (currently only `AUCPR`) and the corresponding results for that metric.
4. Numeric value for the evaluation of the best model on the test set.
5. Numeric value for the evaluation of the best model on the outer train set.
6. The `train` entry summarizes the metric results over the train set.
7. The `validation` entry summarizes the metric results over the validation set.
In (4)-(6), if the metric is
In addition to the data the procedure yields, a fair amount of information about the training is sent to the Neo4j database's logs as the procedure progresses. For example, how well each model candidate performs is logged at `info` level. Some information is only logged at `debug` level.
2. Example
In this example we will create a small graph and use the training pipeline we have built up thus far. The graph consists of a handful of nodes connected in a particular pattern. The example graph looks like this:
CREATE
(alice:Person {name: 'Alice', numberOfPosts: 38}),
(michael:Person {name: 'Michael', numberOfPosts: 67}),
(karin:Person {name: 'Karin', numberOfPosts: 30}),
(chris:Person {name: 'Chris', numberOfPosts: 132}),
(will:Person {name: 'Will', numberOfPosts: 6}),
(mark:Person {name: 'Mark', numberOfPosts: 32}),
(greg:Person {name: 'Greg', numberOfPosts: 29}),
(veselin:Person {name: 'Veselin', numberOfPosts: 3}),
(alice)-[:KNOWS]->(michael),
(michael)-[:KNOWS]->(karin),
(michael)-[:KNOWS]->(chris),
(michael)-[:KNOWS]->(greg),
(will)-[:KNOWS]->(michael),
(will)-[:KNOWS]->(chris),
(mark)-[:KNOWS]->(michael),
(mark)-[:KNOWS]->(will),
(greg)-[:KNOWS]->(chris),
(veselin)-[:KNOWS]->(chris),
(karin)-[:KNOWS]->(veselin),
(chris)-[:KNOWS]->(karin);
With the graph in Neo4j we can now project it into the graph catalog.
We do this using a native projection targeting the `Person` nodes and the `KNOWS` relationships.
We will also project the `numberOfPosts` property, so it can be used when creating link features.
For the relationships we must use the `UNDIRECTED` orientation.
This is because the Link Prediction pipelines are defined only for undirected graphs.
CALL gds.graph.project(
'myGraph',
{
Person: {
properties: ['numberOfPosts']
}
},
{
KNOWS: {
orientation: 'UNDIRECTED'
}
}
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.
2.1. Memory Estimation
First off, we will estimate the cost of training the pipeline by using the `estimate` procedure.
Estimation is useful to understand the memory impact that training the pipeline on your graph will have.
When actually training the pipeline, the system will perform an estimation and prohibit the execution if the estimation shows a very high probability of the execution running out of memory.
To read more about this, see Automatic estimation and execution blocking.
For more details on `estimate` in general, see Memory Estimation.
CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
pipeline: 'pipe',
modelName: 'lp-pipeline-model'
})
YIELD requiredMemory
| requiredMemory |
|---|
| "[24 KiB ... 522 KiB]" |
2.2. Training
CALL gds.beta.pipeline.linkPrediction.train('myGraph', {
pipeline: 'pipe',
modelName: 'lp-pipeline-model',
metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
randomSeed: 42
}) YIELD modelInfo, modelSelectionStats
RETURN
modelInfo.bestParameters AS winningModel,
modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
modelInfo.metrics.AUCPR.test AS testScore,
[cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores
| winningModel | avgTrainScore | outerTrainScore | testScore | validationScores |
|---|---|---|---|---|
| {maxDepth=2147483647, minLeafSize=1, criterion=GINI, minSplitSize=2, numberOfDecisionTrees=10, methodName=RandomForest, numberOfSamplesRatio=1.0} | 0.779365079365079 | 0.788888888888889 | 0.766666666666667 | [0.3333333333333333, 0.6388888888888888, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333] |
We can see that the RandomForest model configuration with `numberOfDecisionTrees = 10` (and defaults filled in for the remaining parameters) was selected, and has a score of about 0.77 on the test set.
The score is computed using the AUCPR metric, which is in the range [0, 1].
A model which gives higher scores to all links than to non-links will have a score of 1.0, and a model that assigns random scores will on average have a score of 0.5.