Training the pipeline

The train mode, gds.beta.pipeline.linkPrediction.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use. Running this mode results in a prediction model of type LinkPrediction being stored in the model catalog along with metrics collected during training. The model can be applied to a possibly different graph which produces a relationship type of predicted links, each having a predicted probability stored as a property.

Visualization of Link Prediction pipeline data flow

More precisely, the procedure will in order:

  1. Apply node filtering using sourceNodeLabel and targetNodeLabel, and relationship filtering using targetRelationshipType. The resulting graph is used as input to splitting.

  2. Create a relationship split of the graph into test, train and feature-input graphs as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.

  3. Apply the node property steps, added according to Adding node properties. The graph filter on each step consists of contextNodeLabels + targetNodeLabel + sourceNodeLabel and contextRelationships + feature-input relationships.

  4. Apply the feature steps, added according to Adding link features, to the train graph, which yields for each train relationship an instance, that is, a feature vector and a binary label.

  5. Split the training instances using stratified k-fold cross-validation. The number of folds k can be configured using validationFolds in gds.beta.pipeline.linkPrediction.configureSplit.

  6. Train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The evaluation uses the specified metric.

  7. Declare as winner the model with the highest average metric across the folds.

  8. Re-train the winning model on the whole training set and evaluate it on both the train and test sets. In order to evaluate on the test set, the feature pipeline is first applied again as for the train set.

  9. Register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by steps, which were added before.
Parallel executions of the same pipeline on the same graph is not supported.

Syntax

Run Link Prediction in train mode on a named graph:
CALL gds.beta.pipeline.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  modelSelectionStats: Map,
  configuration: Map
Table 1. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name Type Default Optional Description

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

pipeline

String

n/a

no

The name of the pipeline to execute.

targetRelationshipType

String

n/a

no

The name of the relationship type to train the model on. The relationship type must be undirected.

sourceNodeLabel

String

'*'

yes

The name of the node label relationships in the training and test sets should start from [1].

targetNodeLabel

String

'*'

yes

The name of the node label relationships in the training and test sets should end at [1].

negativeClassWeight

Float

1.0

yes

Weight of negative examples in model evaluation. Positive examples have weight 1. More details here.

metrics

List of String

[AUCPR]

no

Metrics used to evaluate the models.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

jobId

String

Generated internally

yes

An ID that can be provided to more easily track the training’s progress.

storeModelToDisk

Boolean

false

yes

Automatically store model to disk after training.

1. This helps to train the model to predict links with a certain label combination.

Table 3. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

modelSelectionStats

Map

Statistics about evaluated metrics for all model candidates.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Table 4. Fields of modelSelectionStats
Name Type Description

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

modelCandidates

List

List of maps, where each map contains information about one model candidate. This information includes the candidates parameters, training statistics and validation statistics.

bestTrial

Integer

The trial that produced the best model. The first trial has number 1.

Table 5. Fields of modelInfo
Name Type Description

modelName

String

The name of the trained model.

modelType

String

The type of the trained model.

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below.

nodePropertySteps

List of Map

Algorithms that produce node properties within the pipeline.

linkFeatures

List of Map

Feature steps that combine node properties from endpoint nodes to produce features for relationships (links) as input to the pipeline model.

The structure of modelInfo is:

{
    bestParameters: Map,              (1)
    nodePropertySteps: List of Map,
    linkFeatures: List of Map,
    metrics: {                        (2)
        AUCPR: {
            test: Float,              (3)
            outerTrain: Float,        (4)
            train: {                  (5)
                avg: Float,
                max: Float,
                min: Float,
            },
            validation: {             (6)
                avg: Float,
                max: Float,
                min: Float
            }
        }
    }
}
1 The best scoring model candidate configuration.
2 The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric.
3 Numeric value for the evaluation of the best model on the test set.
4 Numeric value for the evaluation of the best model on the outer train set.
5 The train entry summarizes the metric results over the train set.
6 The validation entry summarizes the metric results over the validation set.

In (3)-(5), if the metric is OUT_OF_BAG_ERROR, these statistics are not reported. The OUT_OF_BAG_ERROR is only reported in (6) as validation metric and only if the model is RandomForest.

In addition to the data the procedure yields, there’s a fair amount of information about the training that’s being sent to the Neo4j database’s logs as the procedure progresses.

For example, how well each model candidates perform is logged with info log level and thus end up the neo4j.log file of the database.

Some information is only logged with debug log level, and thus end up in the debug.log file of the database. An example of this is training method specific metadata - such as per epoch loss for logistic regression - during model candidate training (in the model selection phase). Please note that this particular data is not yielded by the procedure call.

Example

In this example we will create a small graph and use the training pipeline we have built up thus far. The graph is a small social network of people and cities, including some information about where people live, were born, and what other people they know. We will attempt to train a model to predict which additional people might know each other. The example graph looks like this:

Visualization of the example graph
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (alice:Person {name: 'Alice', age: 38}),
  (michael:Person {name: 'Michael', age: 67}),
  (karin:Person {name: 'Karin', age: 30}),
  (chris:Person {name: 'Chris', age: 52}),
  (will:Person {name: 'Will', age: 6}),
  (mark:Person {name: 'Mark', age: 32}),
  (greg:Person {name: 'Greg', age: 29}),
  (veselin:Person {name: 'Veselin', age: 3}),

  (london:City {name: 'London'}),
  (malmo:City {name: 'Malmo'}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin),

  (alice)-[:LIVES]->(london),
  (michael)-[:LIVES]->(london),
  (karin)-[:LIVES]->(london),
  (chris)-[:LIVES]->(malmo),
  (will)-[:LIVES]->(malmo),

  (alice)-[:BORN]->(london),
  (michael)-[:BORN]->(london),
  (karin)-[:BORN]->(malmo),
  (chris)-[:BORN]->(london),
  (will)-[:BORN]->(malmo),
  (greg)-[:BORN]->(london),
  (veselin)-[:BORN]->(malmo)

With the graph in Neo4j we can now project it into the graph catalog. We do this using a native projection targeting the Person nodes and the KNOWS relationships. We will also project the age property, so it can be used when creating link features. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction pipelines are defined only for undirected graphs. We ignore the additional nodes and relationship types, in order for our projection to be homogeneous. We will illustrate how to make use of the larger graph in a subsequent example.

The following statement will project a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.project(
  'myGraph',
  {
    Person: {
      properties: ['age']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

Memory Estimation

First off, we will estimate the cost of training the pipeline by using the estimate procedure. Estimation is useful to understand the memory impact that training the pipeline on your graph will have. When actually training the pipeline the system will perform an estimation and prohibit the execution if the estimation shows there is a very high probability of the execution running out of memory. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for training the pipeline:
CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  targetRelationshipType: 'KNOWS'
})
YIELD requiredMemory
Table 6. Results
requiredMemory

"[24 KiB ... 522 KiB]"

Training

Now we are ready to actually train a LinkPrediction model. We must make sure to specify the targetRelationshipType to instruct the model to train only using that type. With the graph myGraph there are actually no other relationship types projected, but that is not always the case.

The following will train a model using a pipeline:
CALL gds.beta.pipeline.linkPrediction.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
  targetRelationshipType: 'KNOWS',
  randomSeed: 12
}) YIELD modelInfo, modelSelectionStats
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore,
  [cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores
Table 7. Results
winningModel avgTrainScore outerTrainScore testScore validationScores

{batchSize=100, classWeights=[0.55, 0.45], focusWeight=0.0904839039446324, hiddenLayerSizes=[4, 2], learningRate=0.001, maxEpochs=100, methodName=MultilayerPerceptron, minEpochs=1, patience=2, penalty=0.5, tolerance=0.001}

0.858465608465608

0.82

0.677777777777778

[0.5555555555555555, 0.4444444444444444, 0.4305555555555555, 0.875]

We can see the MLP model configuration won, and has a score of 0.67 on the test set. The score computed as the AUCPR metric, which is in the range [0, 1]. A model which gives higher score to all links than non-links will have a score of 1.0, and a model that assigns random scores will on average have a score of 0.5.

Training with context filters

In the above example we projected a Person-KNOWS-Person subgraph and used it for training and testing. Much information in the original graph is not used. We might want to utilize more node and relationship types to generate node properties (and link features) and investigate whether it improves link prediction. We can do that by passing in contextNodeLabels and contextRelationshipTypes. We explicitly pass in sourceNodeLabel and targetNodeLabel to specify a narrower set of nodes to be used for training and testing.

The following statement will project the full graph using a native projection and store it in the graph catalog under the name 'fullGraph'.

CALL gds.graph.project(
  'fullGraph',
  {
    Person: {
      properties: {age: {defaultValue: 1}}
    },
    City: {
      properties: {age: {defaultValue: 1}}
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    },
    LIVES: {},
    BORN: {}
  }
)

The full graph contains 2 node labels and 3 relationship types. We still train a Person-KNOWS-Person model, but use context information Person-LIVES-City, Person-BORN-City to generate node properties that the model uses in training. Note that we do not require the UNDIRECTED orientation for the context relationship types, as these are excluded from the LinkPrediction training.

First we’ll create a new pipeline.

CALL gds.beta.pipeline.linkPrediction.create('pipe-with-context')

Next we add the nodePropertyStep with context configurations.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe-with-context', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42,
  contextNodeLabels: ['City'],
  contextRelationshipTypes: ['LIVES', 'BORN']
})

Then we add the link feature.

CALL gds.beta.pipeline.linkPrediction.addFeature('pipe-with-context', 'hadamard', {
  nodeProperties: ['embedding', 'age']
})

And then similarly configure the data splits.

CALL gds.beta.pipeline.linkPrediction.configureSplit('pipe-with-context', {
  testFraction: 0.25,
  trainFraction: 0.6,
  validationFolds: 3
})

Then we add an MLP model candidate.

CALL gds.alpha.pipeline.linkPrediction.addMLP('pipe-with-context',
{hiddenLayerSizes: [4, 2], penalty: 1, patience: 2})
The following will train another model using the pipeline with additional context information used in node property step:
CALL gds.beta.pipeline.linkPrediction.train('fullGraph', {
  pipeline: 'pipe-with-context',
  modelName: 'lp-pipeline-model-filtered',
  metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
  sourceNodeLabel: 'Person',
  targetNodeLabel: 'Person',
  targetRelationshipType: 'KNOWS',
  randomSeed: 12
}) YIELD modelInfo, modelSelectionStats
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore,
  [cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores
Table 8. Results
winningModel avgTrainScore outerTrainScore testScore validationScores

{batchSize=100, classWeights=[], focusWeight=0.0, hiddenLayerSizes=[4, 2], learningRate=0.001, maxEpochs=100, methodName=MultilayerPerceptron, minEpochs=1, patience=2, penalty=1.0, tolerance=0.001}

0.858465608465608

0.82

0.677777777777778

[0.875]

As we can see, the results are effectively identical. While the train and test score stays the same in this toy example, it is likely that the contextual information will have a greater impact for larger datasets.