Training the pipeline

The train mode, gds.beta.pipeline.linkPrediction.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use. Running this mode results in a prediction model of type LinkPrediction being stored in the model catalog along with metrics collected during training. The model can be applied to a possibly different graph which produces a relationship type of predicted links, each having a predicted probability stored as a property.

More precisely, the procedure will in order:

  1. Apply nodeLabels and relationshipType filters to the graph. All subsequent graphs have the same node set.

  2. Create a relationship split of the graph into test, train and feature-input sets as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.

  3. Apply the node property steps, added according to Adding node properties, on the feature-input graph.

  4. Apply the feature steps, added according to Adding link features, to the train graph, which yields for each train relationship an instance, that is, a feature vector and a binary label.

  5. Split the training instances using stratified k-fold cross-validation. The number of folds k can be configured using validationFolds in gds.beta.pipeline.linkPrediction.configureSplit.

  6. Train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The evaluation uses the specified metric.

  7. Declare as winner the model with the highest average metric across the folds.

  8. Re-train the winning model on the whole training set and evaluate it on both the train and test sets. In order to evaluate on the test set, the feature pipeline is first applied again as for the train set.

  9. Register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by steps, which were added before.

1. Syntax

Run Link Prediction in train mode on a named graph:
CALL gds.beta.pipeline.linkPrediction.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  modelSelectionStats: Map,
  configuration: Map
Table 1. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name Type Default Optional Description

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

pipeline

String

n/a

no

The name of the pipeline to execute.

negativeClassWeight

Float

1.0

yes

Weight of negative examples in model evaluation. Positive examples have weight 1. More details here.

metrics

List of String

[AUCPR]

no

Metrics used to evaluate the models.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

jobId

String

Generated internally

yes

An ID that can be provided to more easily track the training’s progress.

Table 3. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

modelSelectionStats

Map

Statistics about evaluated metrics for all model candidates.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Table 4. Fields of modelSelectionStats
Name Type Description

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

modelCandidates

List

List of maps, where each map contains information about one model candidate. This information includes the candidates parameters, training statistics and validation statistics.

bestTrial

Integer

The trial that produced the best model. The first trial has number 1.

Table 5. Fields of modelInfo
Name Type Description

modelName

String

The name of the trained model.

modelType

String

The type of the trained model.

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below.

trainingPipeline

Map

The pipeline used for the training.

The structure of modelInfo is:

{
    bestParameters: Map,        (1)
    trainingPipeline: Map       (2)
    metrics: {                  (3)
        AUCPR: {
            test: Float,        (4)
            outerTrain: Float,  (5)
            train: {           (6)
                avg: Float,
                max: Float,
                min: Float,
            },
            validation: {      (7)
                avg: Float,
                max: Float,
                min: Float
            }
        }
    }
}
1 The best scoring model candidate configuration.
2 The pipeline used for the training.
3 The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric.
4 Numeric value for the evaluation of the best model on the test set.
5 Numeric value for the evaluation of the best model on the outer train set.
6 The train entry summarizes the metric results over the train set.
7 The validation entry summarizes the metric results over the validation set.

In (4)-(6), if the metric is OUT_OF_BAG_ERROR, these statistics are not reported. The OUT_OF_BAG_ERROR is only reported in (7) as validation metric and only if the model is RandomForest.

In addition to the data the procedure yields, there’s a fair amount of information about the training that’s being sent to the Neo4j database’s logs as the procedure progresses.

For example, how well each model candidates perform is logged with info log level and thus end up the neo4j.log file of the database.

Some information is only logged with debug log level, and thus end up in the debug.log file of the database. An example of this is training method specific metadata - such as per epoch loss for logistic regression - during model candidate training (in the model selection phase). Please note that this particular data is not yielded by the procedure call.

2. Example

In this example we will create a small graph and use the training pipeline we have built up thus far. The graph consists of a handful nodes connected in a particular pattern. The example graph looks like this:

Visualization of the example graph
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (alice:Person {name: 'Alice', numberOfPosts: 38}),
  (michael:Person {name: 'Michael', numberOfPosts: 67}),
  (karin:Person {name: 'Karin', numberOfPosts: 30}),
  (chris:Person {name: 'Chris', numberOfPosts: 132}),
  (will:Person {name: 'Will', numberOfPosts: 6}),
  (mark:Person {name: 'Mark', numberOfPosts: 32}),
  (greg:Person {name: 'Greg', numberOfPosts: 29}),
  (veselin:Person {name: 'Veselin', numberOfPosts: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin);

With the graph in Neo4j we can now project it into the graph catalog. We do this using a native projection targeting the Person nodes and the KNOWS relationships. We will also project the numberOfPosts property, so it can be used when creating link features. For the relationships we must use the UNDIRECTED orientation. This is because the Link Prediction pipelines are defined only for undirected graphs.

The following statement will project a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.project(
  'myGraph',
  {
    Person: {
      properties: ['numberOfPosts']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.

2.1. Memory Estimation

First off, we will estimate the cost of training the pipeline by using the estimate procedure. Estimation is useful to understand the memory impact that training the pipeline on your graph will have. When actually training the pipeline the system will perform an estimation and prohibit the execution if the estimation shows there is a very high probability of the execution running out of memory. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for training the pipeline:
CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model'
})
YIELD requiredMemory
Table 6. Results
requiredMemory

"[24 KiB ... 522 KiB]"

2.2. Training

The following will train a model using a pipeline:
CALL gds.beta.pipeline.linkPrediction.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
  randomSeed: 42
}) YIELD modelInfo, modelSelectionStats
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore,
  [cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores
Table 7. Results
winningModel avgTrainScore outerTrainScore testScore validationScores

{maxDepth=2147483647, minLeafSize=1, criterion=GINI, minSplitSize=2, numberOfDecisionTrees=10, methodName=RandomForest, numberOfSamplesRatio=1.0}

0.779365079365079

0.788888888888889

0.766666666666667

[0.3333333333333333, 0.6388888888888888, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333]

We can see the RandomForest model configuration with numberOfDecisionTrees = 5 (and defaults filled for remaining parameters) was selected, and has a score of 0.58 on the test set. The score computed as the AUCPR metric, which is in the range [0, 1]. A model which gives higher score to all links than non-links will have a score of 1.0, and a model that assigns random scores will on average have a score of 0.5.