Training the pipeline

The train mode, gds.beta.pipeline.linkPrediction.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use. Running this mode results in a prediction model of type LinkPrediction being stored in the model catalog along with metrics collected during training. The model can then be applied to a possibly different graph, which produces a relationship type of predicted links, each having a predicted probability stored as a property.

Visualization of Link Prediction pipeline data flow

More precisely, the procedure performs the following steps, in order:

1. Apply node filtering using sourceNodeLabel and targetNodeLabel, and relationship filtering using targetRelationshipType. The resulting graph is used as input to splitting.
2. Create a relationship split of the graph into test, train and feature-input graphs as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.
3. Apply the node property steps, added according to Adding node properties. The graph filter on each step consists of contextNodeLabels + targetNodeLabel + sourceNodeLabel and contextRelationships + feature-input relationships.
4. Apply the feature steps, added according to Adding link features, to the train graph, which yields for each train relationship an instance, that is, a feature vector and a binary label.
5. Split the training instances using stratified k-fold cross-validation. The number of folds k can be configured using validationFolds in gds.beta.pipeline.linkPrediction.configureSplit.
6. Train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The evaluation uses the specified metric.
7. Declare as winner the model with the highest average metric across the folds.
8. Re-train the winning model on the whole training set and evaluate it on both the train and test sets. In order to evaluate on the test set, the feature pipeline is first applied again as for the train set.
9. Register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
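The cross-validation and model selection steps can be illustrated with a small Python sketch. This is not the GDS implementation: the function names (`stratified_folds`, `select_model`), the toy threshold "model", and the use of accuracy instead of AUCPR are all assumptions made to keep the example short and self-contained.

```python
import random
from statistics import mean


def stratified_folds(labels, k, seed):
    """Assign instance indices to k folds, keeping the class ratio similar in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def select_model(candidates, features, labels, k, train_fn, metric_fn, seed=0):
    """Return the candidate with the highest average validation metric, plus all averages."""
    folds = stratified_folds(labels, k, seed)
    avg_scores = []
    for params in candidates:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(labels)) if i not in held]
            model = train_fn(
                params,
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
            )
            predictions = [model(features[i]) for i in held_out]
            fold_scores.append(metric_fn(predictions, [labels[i] for i in held_out]))
        avg_scores.append(mean(fold_scores))
    best = max(range(len(candidates)), key=lambda c: avg_scores[c])
    return candidates[best], avg_scores


# Hypothetical stand-ins for a trainable model and an evaluation metric.
def train_threshold(params, xs, ys):
    threshold = params["threshold"]
    return lambda x: 1 if x[0] >= threshold else 0


def accuracy(predictions, ys):
    return sum(p == y for p, y in zip(predictions, ys)) / len(ys)


features = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = [{"threshold": 0.5}, {"threshold": 0.85}]

winner, avg_scores = select_model(
    candidates, features, labels, k=2,
    train_fn=train_threshold, metric_fn=accuracy, seed=42,
)
# Re-train the winner on the whole training set, as in the final training step.
final_model = train_threshold(winner, features, labels)
```

In this toy setup the candidate with threshold 0.5 classifies every validation instance correctly and is therefore selected; `avg_scores` holds one averaged validation score per candidate, analogous to the validation averages reported in modelSelectionStats.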

A step can only use node properties that are already present in the input graph or produced by steps that were added earlier.
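The ordering constraint can be pictured as a simple availability check. The sketch below is only an illustration: `check_step_order` and the `inputs` key are hypothetical names introduced here, while `mutateProperty` mirrors the configuration key used when adding node property steps.

```python
def check_step_order(graph_properties, steps):
    """Walk the steps in order and verify each one only reads available properties.

    `steps` is a list of dicts. The hypothetical key 'inputs' lists the node
    properties a step reads; 'mutateProperty' names the property it produces.
    """
    available = set(graph_properties)
    for position, step in enumerate(steps):
        missing = [p for p in step.get("inputs", []) if p not in available]
        if missing:
            raise ValueError(
                f"step {position} needs unavailable properties: {missing}"
            )
        if "mutateProperty" in step:
            available.add(step["mutateProperty"])
    return available


# 'age' is projected with the graph; 'embedding' is produced by the first step.
steps = [
    {"mutateProperty": "embedding"},
    {"inputs": ["embedding", "age"]},
]
available = check_step_order({"age"}, steps)
```

Reversing the two steps would raise a `ValueError`, because the step that reads `embedding` would run before the step that produces it.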

Parallel executions of the same pipeline on the same graph are not supported.

Syntax

Run Link Prediction in train mode on a named graph:

CALL gds.beta.pipeline.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  modelSelectionStats: Map,
  configuration: Map

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
modelName	String	`n/a`	no	The name of the model to train. It must not already exist in the Model Catalog.
pipeline	String	`n/a`	no	The name of the pipeline to execute.
targetRelationshipType	String	`n/a`	no	The name of the relationship type to train the model on. The relationship type must be undirected.
sourceNodeLabel	String	`'*'`	yes	The name of the node label relationships in the training and test sets should start from ^[1].
targetNodeLabel	String	`'*'`	yes	The name of the node label relationships in the training and test sets should end at ^[1].
negativeClassWeight	Float	`1.0`	yes	Weight of negative examples in model evaluation. Positive examples have weight 1.
metrics	List of String	`[AUCPR]`	yes	Metrics used to evaluate the models.
randomSeed	Integer	`n/a`	yes	Seed for the random number generator used during training.
concurrency	Integer	`4 ^[2]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
storeModelToDisk	Boolean	`false`	yes	Automatically store model to disk after training.
1. This helps to train the model to predict links with a certain label combination.
2. In a GDS Session the default is the number of available processors.

Table 3. Results
Name	Type	Description
trainMillis	Integer	Milliseconds used for training.
modelInfo	Map	Information about the training and the winning model.
modelSelectionStats	Map	Statistics about evaluated metrics for all model candidates.
configuration	Map	Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelSelectionStats and modelInfo return fields have the following subfields:

Table 4. Fields of modelSelectionStats
Name	Type	Description
bestParameters	Map	The model parameters which performed best on average on validation folds according to the primary metric.
modelCandidates	List	List of maps, where each map contains information about one model candidate. This information includes the candidate's parameters, training statistics and validation statistics.
bestTrial	Integer	The trial that produced the best model. The first trial has number 1.

Table 5. Fields of modelInfo
Name	Type	Description
modelName	String	The name of the trained model.
modelType	String	The type of the trained model.
bestParameters	Map	The model parameters which performed best on average on validation folds according to the primary metric.
metrics	Map	Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below.
nodePropertySteps	List of Map	Algorithms that produce node properties within the pipeline.
linkFeatures	List of Map	Feature steps that combine node properties from endpoint nodes to produce features for relationships (links) as input to the pipeline model.

The structure of modelInfo is:

{
    bestParameters: Map,              (1)
    nodePropertySteps: List of Map,
    linkFeatures: List of Map,
    metrics: {                        (2)
        AUCPR: {
            test: Float,              (3)
            outerTrain: Float,        (4)
            train: {                  (5)
                avg: Float,
                max: Float,
                min: Float
            },
            validation: {             (6)
                avg: Float,
                max: Float,
                min: Float
            }
        }
    }
}

1	The best scoring model candidate configuration.
2	The `metrics` map contains an entry for each metric description (for example `AUCPR`) and the corresponding results for that metric.
3	Numeric value for the evaluation of the best model on the test set.
4	Numeric value for the evaluation of the best model on the outer train set.
5	The `train` entry summarizes the metric results over the `train` set.
6	The `validation` entry summarizes the metric results over the `validation` set.

If the metric is OUT_OF_BAG_ERROR, the statistics in (3)-(5) are not reported. OUT_OF_BAG_ERROR is reported only in (6), as a validation metric, and only if the model is a RandomForest.

In addition to the data the procedure yields, a fair amount of information about the training is sent to the Neo4j database’s logs as the procedure progresses.

For example, how well each model candidate performs is logged at the info log level and thus ends up in the neo4j.log file of the database.

Some information is only logged at the debug log level, and thus ends up in the debug.log file of the database. An example of this is training-method-specific metadata, such as per-epoch loss for logistic regression, during model candidate training (in the model selection phase). Please note that this particular data is not yielded by the procedure call.

Example

In this example we will create a small graph and use the training pipeline we have built up thus far. The graph is a small social network of people and cities, including some information about where people live, were born, and what other people they know. We will attempt to train a model to predict which additional people might know each other. The example graph looks like this:

The following Cypher statement will create the example graph in the Neo4j database:

CREATE
  (alice:Person {name: 'Alice', age: 38}),
  (michael:Person {name: 'Michael', age: 67}),
  (karin:Person {name: 'Karin', age: 30}),
  (chris:Person {name: 'Chris', age: 52}),
  (will:Person {name: 'Will', age: 6}),
  (mark:Person {name: 'Mark', age: 32}),
  (greg:Person {name: 'Greg', age: 29}),
  (veselin:Person {name: 'Veselin', age: 3}),

  (london:City {name: 'London'}),
  (malmo:City {name: 'Malmo'}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin),

  (alice)-[:LIVES]->(london),
  (michael)-[:LIVES]->(london),
  (karin)-[:LIVES]->(london),
  (chris)-[:LIVES]->(malmo),
  (will)-[:LIVES]->(malmo),

  (alice)-[:BORN]->(london),
  (michael)-[:BORN]->(london),
  (karin)-[:BORN]->(malmo),
  (chris)-[:BORN]->(london),
  (will)-[:BORN]->(malmo),
  (greg)-[:BORN]->(london),
  (veselin)-[:BORN]->(malmo)

With the graph in Neo4j we can now project it into the graph catalog. We do this using a Cypher projection targeting the Person nodes and the KNOWS relationships. We will also project the age property, so it can be used when creating link features. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction pipelines are defined only for undirected graphs. We ignore the additional nodes and relationship types, in order for our projection to be homogeneous. We will illustrate how to make use of the larger graph in a subsequent example.

The following statement will project a graph using a Cypher projection and store it in the graph catalog under the name 'myGraph'.

MATCH (source:Person)-[r:KNOWS]->(target:Person)
RETURN gds.graph.project(
  'myGraph',
  source,
  target,
  {
    sourceNodeProperties: source { .age },
    targetNodeProperties: target { .age },
    relationshipType: 'KNOWS'
  },
  { undirectedRelationshipTypes: ['KNOWS'] }
)

The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

Memory Estimation

First off, we will estimate the cost of training the pipeline by using the estimate procedure. Estimation is useful to understand the memory impact that training the pipeline on your graph will have. When actually training the pipeline the system will perform an estimation and prohibit the execution if the estimation shows there is a very high probability of the execution running out of memory. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for training the pipeline:

CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  targetRelationshipType: 'KNOWS'
})
YIELD requiredMemory

Table 6. Results
requiredMemory
"[24 KiB ... 522 KiB]"

Training

Now we are ready to actually train a LinkPrediction model. We must make sure to specify the targetRelationshipType to instruct the model to train only using that type. With the graph myGraph there are actually no other relationship types projected, but that is not always the case.

The following will train a model using a pipeline:

CALL gds.beta.pipeline.linkPrediction.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
  targetRelationshipType: 'KNOWS',
  randomSeed: 18
}) YIELD modelInfo, modelSelectionStats
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore,
  [cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores

Table 7. Results
winningModel	avgTrainScore	outerTrainScore	testScore	validationScores
{batchSize=100, classWeights=[0.55, 0.45], focusWeight=0.070341817, hiddenLayerSizes=[4, 2], learningRate=0.001, maxEpochs=100, methodName="MultilayerPerceptron", minEpochs=1, patience=2, penalty=0.5, tolerance=0.001}	0.7579365079	0.7	0.6666666667	[0.4305555556, 0.5833333333, 0.4305555556, 0.75]

We can see that the MLP model configuration won, with a score of 0.67 on the test set. The score is computed as the AUCPR metric, which is in the range [0, 1]. A model which gives a higher score to all links than to all non-links will have a score of 1.0, and a model that assigns random scores will on average have a score of 0.5.
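To make the metric more concrete, the following Python sketch computes an area under the precision-recall curve from predicted scores and binary labels. It uses the step-wise (average precision) approximation, which may differ from the exact computation in GDS. The function name `aucpr` is hypothetical, and the way `negative_class_weight` scales false positives is an assumption based on the description of negativeClassWeight as the weight of negative examples.

```python
def aucpr(scores, labels, negative_class_weight=1.0):
    """Step-wise area under the precision-recall curve.

    labels are 1 for links and 0 for non-links. False positives are
    multiplied by negative_class_weight (an assumption, see lead-in).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")

    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    true_pos = 0
    false_pos = 0
    area = 0.0
    previous_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit all examples that share this score (ties) at once.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        recall = true_pos / total_positives
        weighted_total = true_pos + negative_class_weight * false_pos
        precision = true_pos / weighted_total if weighted_total > 0 else 1.0
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# All links are scored above all non-links.
perfect = aucpr([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# One non-link is ranked above a link.
mixed = aucpr([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
# Same ranking, but negative examples count only half.
mixed_weighted = aucpr([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], negative_class_weight=0.5)
```

Here `perfect` evaluates to 1.0, matching the statement that a model ranking all links above all non-links gets the maximum score, while `mixed` is lower because one non-link outranks a link.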

Training with context filters

In the above example we projected a Person-KNOWS-Person subgraph and used it for training and testing. Much information in the original graph is not used. We might want to utilize more node and relationship types to generate node properties (and link features) and investigate whether it improves link prediction. We can do that by passing in contextNodeLabels and contextRelationshipTypes. We explicitly pass in sourceNodeLabel and targetNodeLabel to specify a narrower set of nodes to be used for training and testing.

The following statement will project the full graph using a Cypher projection and store it in the graph catalog under the name 'fullGraph'.

MATCH (source:Person)-[r:KNOWS|LIVES|BORN]->(target:Person|City)
RETURN gds.graph.project(
  'fullGraph',
  source,
  target,
  {
    sourceNodeLabels: labels(source),
    targetNodeLabels: labels(target),
    sourceNodeProperties: source { age: coalesce(source.age, 1) },
    targetNodeProperties: target { age: coalesce(target.age, 1) },
    relationshipType: type(r)
  },
  { undirectedRelationshipTypes: ['KNOWS'] }
)

The full graph contains 2 node labels and 3 relationship types. We still train a Person-KNOWS-Person model, but use context information Person-LIVES-City, Person-BORN-City to generate node properties that the model uses in training. Note that we do not require the UNDIRECTED orientation for the context relationship types, as these are excluded from the LinkPrediction training.

First we’ll create a new pipeline.

CALL gds.beta.pipeline.linkPrediction.create('pipe-with-context')

Next we add the nodePropertyStep with context configurations.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe-with-context', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42,
  contextNodeLabels: ['City'],
  contextRelationshipTypes: ['LIVES', 'BORN']
})

Then we add the link feature.

CALL gds.beta.pipeline.linkPrediction.addFeature('pipe-with-context', 'hadamard', {
  nodeProperties: ['embedding', 'age']
})
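As an illustration of what a hadamard link feature computes, the Python sketch below concatenates the listed node properties of each endpoint into a vector and multiplies the two vectors element-wise. The function name `hadamard_feature` and the two-dimensional embedding values are hypothetical (the pipeline above uses 256 dimensions); only the ages are taken from the example graph.

```python
def hadamard_feature(source_props, target_props, property_names):
    """Element-wise product of the concatenated property vectors of two nodes."""

    def as_vector(props):
        vector = []
        for name in property_names:
            value = props[name]
            # Array properties contribute all their entries, scalars a single entry.
            if isinstance(value, (list, tuple)):
                vector.extend(value)
            else:
                vector.append(value)
        return [float(v) for v in vector]

    source_vector = as_vector(source_props)
    target_vector = as_vector(target_props)
    if len(source_vector) != len(target_vector):
        raise ValueError("both endpoints must yield vectors of equal length")
    return [s * t for s, t in zip(source_vector, target_vector)]


# Embedding values are made up for illustration; ages come from the example graph.
alice = {"embedding": [0.5, -1.0], "age": 38}
michael = {"embedding": [0.2, 0.5], "age": 67}

link_feature = hadamard_feature(alice, michael, ["embedding", "age"])
```

The resulting vector has one entry per embedding dimension plus one for age, and because the product is symmetric it does not depend on which endpoint is treated as source, which fits the undirected setting of the pipeline.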

And then similarly configure the data splits.

CALL gds.beta.pipeline.linkPrediction.configureSplit('pipe-with-context', {
  testFraction: 0.25,
  trainFraction: 0.6,
  validationFolds: 3
})

Then we add an MLP model candidate.

CALL gds.alpha.pipeline.linkPrediction.addMLP('pipe-with-context',
{hiddenLayerSizes: [4, 2], penalty: 1, patience: 2})

The following will train another model using the pipeline, with additional context information used in the node property step:

CALL gds.beta.pipeline.linkPrediction.train('fullGraph', {
  pipeline: 'pipe-with-context',
  modelName: 'lp-pipeline-model-filtered',
  metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
  sourceNodeLabel: 'Person',
  targetNodeLabel: 'Person',
  targetRelationshipType: 'KNOWS',
  randomSeed: 12
}) YIELD modelInfo, modelSelectionStats
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore,
  [cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores

Table 8. Results
winningModel	avgTrainScore	outerTrainScore	testScore	validationScores
{batchSize=100, classWeights=[], focusWeight=0.0, hiddenLayerSizes=[4, 2], learningRate=0.001, maxEpochs=100, methodName="MultilayerPerceptron", minEpochs=1, patience=2, penalty=1.0, tolerance=0.001}	0.832010582	0.6666666667	0.8611111111	[0.75]

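As we can see, the scores differ from those of the first model: the test score is 0.86 here compared with 0.67 before, while the outer train score is similar (0.67 versus 0.7). On a toy graph this small, such differences should be interpreted with caution. It is likely that the contextual information will have a greater impact for larger datasets.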