Node classification pipelines

This section describes Node classification pipelines in the Neo4j Graph Data Science library.

1. Introduction

Node Classification is a common machine learning task applied to graphs: training models to classify nodes. Concretely, Node Classification models are used to predict the class of unlabeled nodes, written as a node property, based on other node properties. During training, the property representing the class of the node is referred to as the target property. GDS supports both binary and multi-class node classification.

In GDS, we have Node Classification pipelines which offer an end-to-end workflow, from feature extraction to node classification. The training pipelines reside in the pipeline catalog. When a training pipeline is executed, a classification model is created and stored in the model catalog.

A training pipeline is a sequence of two phases:

  1. The graph is augmented with new node properties in a series of steps.

  2. The augmented graph is used for training a node classification model.

One can configure which steps should be included in the pipeline. The steps execute GDS algorithms that create new node properties. After configuring the node property steps, one can select a subset of node properties to be used as features. The training phase (2) trains multiple model candidates using cross-validation, selects the best one, and reports relevant performance metrics.

After training the pipeline, a classification model is created. This model includes the node property steps and feature configuration from the training pipeline and uses them to generate the relevant features for classifying unlabeled nodes. The classification model can be applied to predict the class of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.

Classification can only be done with a classification model (not with a training pipeline).

The rest of this page is divided as follows:

  • Creating a pipeline

  • Adding node properties

  • Adding features

  • Configuring the node splits

  • Adding model candidates

  • Training the pipeline

  • Applying a trained model for prediction

2. Creating a pipeline

The first step of building a new pipeline is to create one using gds.beta.pipeline.nodeClassification.create. This stores a trainable pipeline object of type Node classification training pipeline in the pipeline catalog. It represents a configurable pipeline that can later be invoked for training, which in turn creates a classification model. The latter is stored in the model catalog with type NodeClassification.

2.1. Syntax

Create pipeline syntax
CALL gds.beta.pipeline.nodeClassification.create(
  pipelineName: String
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 1. Parameters
Name Type Description

pipelineName

String

The name of the created pipeline.

Table 2. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

2.2. Example

The following will create a pipeline:
CALL gds.beta.pipeline.nodeClassification.create('pipe')
Table 3. Results
name nodePropertySteps featureProperties splitConfig parameterSpace

"pipe"

[]

[]

{testFraction=0.3, validationFolds=3}

{RandomForest=[], LogisticRegression=[]}

This shows that the newly created pipeline does not contain any steps yet, and has defaults for the split and train parameters.

3. Adding node properties

A node classification pipeline can execute one or several GDS algorithms in mutate mode that create node properties in the in-memory graph. Such steps producing node properties can be chained one after another, and the created properties can later be used as features. Moreover, the node property steps that are added to the training pipeline will be executed both when training a model and when the classification model is applied for classification.

The name of the procedure that should be added can be a fully qualified GDS procedure name ending with .mutate. The ending .mutate may be omitted and one may also use shorthand forms such as node2vec instead of gds.beta.node2vec.mutate.

For example, pre-processing algorithms can be used as node property steps.
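As an illustration of the shorthand naming, the following sketch adds a node2vec step, where the shorthand node2vec resolves to gds.beta.node2vec.mutate (the embeddingDimension and mutateProperty values here are only illustrative):
CALL gds.beta.pipeline.nodeClassification.addNodeProperty('pipe', 'node2vec', {
  embeddingDimension: 32,
  mutateProperty: 'embedding'
})
YIELD name, nodePropertySteps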

3.1. Syntax

Add node property syntax
CALL gds.beta.pipeline.nodeClassification.addNodeProperty(
  pipelineName: String,
  procedureName: String,
  procedureConfiguration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 4. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

procedureName

String

The name of the procedure to be added to the pipeline.

procedureConfiguration

Map

The configuration of the procedure, excluding graphName, nodeLabels and relationshipTypes.

Table 5. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

3.2. Example

The following will add a node property step to the pipeline. Here we assume that the input graph contains a property sizePerStory.
CALL gds.beta.pipeline.nodeClassification.addNodeProperty('pipe', 'alpha.scaleProperties', {
  nodeProperties: 'sizePerStory',
  scaler: 'L1Norm',
  mutateProperty:'scaledSizes'
})
YIELD name, nodePropertySteps
Table 6. Results
name nodePropertySteps

"pipe"

[{name=gds.alpha.scaleProperties.mutate, config={scaler=L1Norm, mutateProperty=scaledSizes, nodeProperties=sizePerStory}}]

The scaledSizes property can be later used as a feature.

4. Adding features

A Node Classification Pipeline allows you to select a subset of the available node properties to be used as features for the machine learning model. When executing the pipeline, the selected nodeProperties must either be present in the input graph or created by a previous node property step. For example, the scaledSizes property could be created by the previous example, and we expect sizePerStory to already be present in the in-memory graph used as input, at train and predict time.

4.1. Syntax

Adding a feature to a pipeline syntax
CALL gds.beta.pipeline.nodeClassification.selectFeatures(
  pipelineName: String,
  nodeProperties: List or String
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 7. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

nodeProperties

List or String

The names of the node properties to use as features. The properties must exist in the input graph or be created by a node property step.

Table 8. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

4.2. Example

The following will select features for the pipeline. Here we assume that the input graph contains a property sizePerStory and scaledSizes was created in a nodePropertyStep.
CALL gds.beta.pipeline.nodeClassification.selectFeatures('pipe', ['scaledSizes', 'sizePerStory'])
YIELD name, featureProperties
Table 9. Results
name featureProperties

"pipe"

[scaledSizes, sizePerStory]

5. Configuring the node splits

Node Classification Pipelines manage splitting the nodes into several sets for training, testing and validating the models defined in the parameter space. Configuring the splitting is optional; if omitted, default settings are used. The splitting configuration of a pipeline can be inspected by using gds.beta.pipeline.list and optionally yielding only splitConfig.

The node splits are used in the training process as follows:

  1. The input graph is split into two parts: the train graph and the test graph. See the example below.

  2. The train graph is further divided into a number of validation folds, each consisting of a train part and a validation part. See the animation below.

  3. Each model candidate is trained on each train part and evaluated on the respective validation part.

  4. The model with the highest average score according to the primary metric will win the training.

  5. The winning model will then be retrained on the entire train graph.

  6. The winning model is evaluated on the train graph as well as the test graph.

  7. The winning model is retrained on the entire original graph.

Below we illustrate an example for a graph with 12 nodes. First we use a testFraction of 0.25 to split into train and test subgraphs.

[Image: the 12-node graph split into a train subgraph and a test subgraph]

Then we carry out three validation folds: we first split the train subgraph into 3 disjoint subsets (s1, s2 and s3), and then alternate which subset is used for validation. For each fold, all candidate models are trained on the red nodes and validated on the green nodes.

[Image: the three validation folds, alternating which subset is used for validation]

5.1. Syntax

Configure the node split syntax
CALL gds.beta.pipeline.nodeClassification.configureSplit(
  pipelineName: String,
  configuration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 10. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

configuration

Map

Configuration for splitting the nodes.

Table 11. Configuration
Name Type Default Description

validationFolds

Integer

3

Number of divisions of the training graph used during model selection.

testFraction

Double

0.3

Fraction of the graph reserved for testing. Must be in the range (0, 1). The fraction used for the training is 1 - testFraction.

Table 12. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

5.2. Example

The following will configure the splitting of the pipeline:
CALL gds.beta.pipeline.nodeClassification.configureSplit('pipe', {
  testFraction: 0.2,
  validationFolds: 5
})
YIELD splitConfig
Table 13. Results
splitConfig

{testFraction=0.2, validationFolds=5}

We now reconfigured the splitting of the pipeline, which will be applied during training.

6. Adding model candidates

A pipeline contains a collection of configurations for model candidates which is initially empty. This collection is called the parameter space. One or more model configurations must be added to the parameter space of the training pipeline, using one of the following procedures:

  • gds.beta.pipeline.nodeClassification.addLogisticRegression

  • gds.alpha.pipeline.nodeClassification.addRandomForest

For information about the available training methods in GDS, logistic regression and random forest, see Training methods.

In Training the pipeline, we explain further how the configured model candidates are trained, evaluated and compared.

The parameter space of a pipeline can be inspected using gds.beta.pipeline.list and optionally yielding only parameterSpace.

At least one model candidate must be added to the pipeline before training it.

6.1. Syntax

Configure the train parameters syntax
CALL gds.beta.pipeline.nodeClassification.addLogisticRegression(
  pipelineName: String,
  config: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: Map
Table 14. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

config

Map

The logistic regression config for a potential model. The allowed parameters for a model are defined in the next table.

Table 15. Logistic regression configuration
Name Type Default Optional Description

penalty

Float

0.0

yes

Penalty used for the logistic regression. By default, no penalty is applied.

batchSize

Integer

100

yes

Number of nodes per batch.

minEpochs

Integer

1

yes

Minimum number of training epochs.

maxEpochs

Integer

100

yes

Maximum number of training epochs.

patience

Integer

1

yes

Maximum number of unproductive consecutive epochs.

tolerance

Float

0.001

yes

The minimal improvement of the loss to be considered productive.

Table 16. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

Configure the train parameters syntax
CALL gds.alpha.pipeline.nodeClassification.addRandomForest(
  pipelineName: String,
  config: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: Map
Table 17. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

config

Map

The random forest config for a potential model. The allowed parameters for a model are defined in the next table.

Table 18. Random Forest configuration
Name Type Default Optional Description

maxFeaturesRatio

Float

1 / sqrt(|features|)

yes

The ratio of features to consider when looking for the best split

numberOfSamplesRatio

Float

1.0

yes

The ratio of samples to consider per decision tree. We use sampling with replacement. A value of 0 indicates using every training example (no sampling).

numberOfDecisionTrees

Integer

100

yes

The number of decision trees.

maxDepth

Integer

No max depth

yes

The maximum depth of a decision tree.

minSplitSize

Integer

2

yes

The minimum number of samples required to split an internal node.

Table 19. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

6.2. Example

We can add multiple model candidates to our pipeline.

The following will add a logistic regression model:
CALL gds.beta.pipeline.nodeClassification.addLogisticRegression('pipe', {penalty: 0.0625})
YIELD parameterSpace
The following will add a random forest model:
CALL gds.alpha.pipeline.nodeClassification.addRandomForest('pipe', {numberOfDecisionTrees: 5})
YIELD parameterSpace
The following will add another logistic regression model:
CALL gds.beta.pipeline.nodeClassification.addLogisticRegression('pipe', {maxEpochs: 500})
YIELD parameterSpace
RETURN parameterSpace.RandomForest AS randomForestSpace, parameterSpace.LogisticRegression AS logisticRegressionSpace
Table 20. Results
randomForestSpace logisticRegressionSpace

[{maxDepth=2147483647, minSplitSize=2, numberOfDecisionTrees=5, methodName=RandomForest, numberOfSamplesRatio=1.0}]

[{maxEpochs=100, minEpochs=1, penalty=0.0625, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001}, {maxEpochs=500, minEpochs=1, penalty=0.0, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001}]

The parameterSpace in the pipeline now contains the three different model candidates, expanded with the default values. Each specified model candidate will be tried out during the model selection in training.

These are somewhat naive examples of how to add and configure model candidates. Please see Training methods for more information on how to tune the configuration parameters of each method.

7. Training the pipeline

The train mode, gds.beta.pipeline.nodeClassification.train, is responsible for splitting data, feature extraction, model selection, training, and storing a model for future use. Running this mode results in a classification model of type NodeClassification, which is stored in the model catalog. The classification model can then be applied to a possibly different graph to classify nodes.

More precisely, the training proceeds as follows:

  1. Apply nodeLabels and relationshipType filters to the graph.

  2. Apply the node property steps, added according to Adding node properties, on the whole graph.

  3. Select node properties to be used as features, as specified in Adding features.

  4. Split the input graph into two parts: the train graph and the test graph. This is described in Configuring the node splits. These graphs are internally managed and exist only for the duration of the training.

  5. Split the nodes in the train graph using stratified k-fold cross-validation. The number of folds k can be configured as described in Configuring the node splits.

  6. Each model candidate defined in the parameter space is trained on each train set and evaluated on the respective validation set for every fold. The evaluation uses the specified primary metric.

  7. Choose the best performing model according to the highest average score for the primary metric.

  8. Retrain the winning model on the entire train graph.

  9. Evaluate the performance of the winning model on the whole train graph as well as the test graph.

  10. Retrain the winning model on the entire original graph.

  11. Register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by previously added steps.

7.1. Metrics

The Node Classification model in the Neo4j GDS library supports the following evaluation metrics:

  • Global metrics

    • F1_WEIGHTED

    • F1_MACRO

    • ACCURACY

  • Per-class metrics

    • F1(class=<number>) or F1(class=*)

    • PRECISION(class=<number>) or PRECISION(class=*)

    • RECALL(class=<number>) or RECALL(class=*)

    • ACCURACY(class=<number>) or ACCURACY(class=*)

The * is syntactic sugar for reporting the metric for each class in the graph. When using a per-class metric, the reported metrics contain keys such as ACCURACY_class_1.

More than one metric can be specified during training, but only the first specified one, the primary metric, is used for model selection; the results of all specified metrics are present in the train results. The primary metric may not be a * expansion, due to the ambiguity of which of the expanded metrics should be the primary one.
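As a sketch, a train call could specify several metrics, where the first one acts as the primary metric for model selection (the model name here is a placeholder; see Training the pipeline for the full syntax):
CALL gds.beta.pipeline.nodeClassification.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'nc-model-example',
  targetProperty: 'class',
  metrics: ['F1_WEIGHTED', 'ACCURACY(class=*)']
})
Here F1_WEIGHTED is the primary metric, while ACCURACY(class=*) is expanded to one reported metric per class in the graph.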

7.2. Syntax

Run Node Classification in train mode on a named graph:
CALL gds.beta.pipeline.nodeClassification.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  modelSelectionStats: Map,
  configuration: Map
Table 21. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 22. Configuration
Name Type Default Optional Description

pipeline

String

n/a

no

The name of the pipeline to execute.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

targetProperty

String

n/a

no

The name of the node property that represents the class of the node. The property values must be of type Integer.

metrics

List of String

n/a

no

Metrics used to evaluate the models.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

Table 23. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

modelSelectionStats

Map

Statistics about evaluated metrics for all model candidates.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Table 24. Model info fields
Name Type Description

classes

List of Integer

Sorted list of class ids which are the distinct values of targetProperty over the entire graph.

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below.

trainingPipeline

Map

The pipeline used for the training.

The structure of modelInfo is:

{
    bestParameters: Map,        (1)
    trainingPipeline: Map       (2)
    classes: List of Integer,   (3)
    metrics: {                  (4)
        <METRIC_NAME>: {        (5)
            test: Float,        (6)
            outerTrain: Float,  (7)
            train: {           (8)
                avg: Float,
                max: Float,
                min: Float,
            },
            validation: {      (9)
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            }
        }
    }
}
1 The best scoring model candidate configuration.
2 The pipeline used for the training.
3 Sorted list of class ids which are the distinct values of targetProperty over the entire graph.
4 The metrics map contains an entry for each metric description, and the corresponding results for that metric.
5 A metric name specified in the configuration of the procedure, e.g., F1_MACRO or RECALL(class=4).
6 Numeric value for the evaluation of the winning model on the test set.
7 Numeric value for the evaluation of the winning model on the outer train set.
8 The train entry summarizes the metric results over the train set.
9 The validation entry summarizes the metric results over the validation set.

7.3. Example

In this section we will show examples of running a Node Classification training pipeline on a concrete graph. The intention is to illustrate what the results look like and to provide a guide for how to make use of the model in a real setting. We will do this on a small graph of a handful of nodes representing houses. This is an example of multi-class classification: the distinct values of the class node property determine the number of classes, in this case three (0, 1 and 2). The example graph looks like this:

node classification
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (:House {color: 'Gold', sizePerStory: [15.5, 23.6, 33.1], class: 0}),
  (:House {color: 'Red', sizePerStory: [15.5, 23.6, 100.0], class: 0}),
  (:House {color: 'Blue', sizePerStory: [11.3, 35.1, 22.0], class: 0}),
  (:House {color: 'Green', sizePerStory: [23.2, 55.1, 0.0], class: 1}),
  (:House {color: 'Gray', sizePerStory: [34.3, 24.0, 0.0],  class: 1}),
  (:House {color: 'Black', sizePerStory: [71.66, 55.0, 0.0], class: 1}),
  (:House {color: 'White', sizePerStory: [11.1, 111.0, 0.0], class: 1}),
  (:House {color: 'Teal', sizePerStory: [80.8, 0.0, 0.0], class: 2}),
  (:House {color: 'Beige', sizePerStory: [106.2, 0.0, 0.0], class: 2}),
  (:House {color: 'Magenta', sizePerStory: [99.9, 0.0, 0.0], class: 2}),
  (:House {color: 'Purple', sizePerStory: [56.5, 0.0, 0.0], class: 2}),
  (:UnknownHouse {color: 'Pink', sizePerStory: [23.2, 55.1, 56.1]}),
  (:UnknownHouse {color: 'Tan', sizePerStory: [22.32, 102.0, 0.0]}),
  (:UnknownHouse {color: 'Yellow', sizePerStory: [39.0, 0.0, 0.0]});

With the graph in Neo4j we can now project it into the graph catalog to prepare it for the pipeline execution. We do this using a native projection targeting the House and UnknownHouse labels. We will also project the sizePerStory property to use as a model feature, and the class property to use as the target property.

In the examples below we will use named graphs and native projections as the norm. However, Cypher projections can also be used.

The following statement will project a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.project('myGraph', {
    House: { properties: ['sizePerStory', 'class'] },
    UnknownHouse: { properties: 'sizePerStory' }
  },
  '*'
)

7.3.1. Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the train mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm in train mode:
CALL gds.beta.pipeline.nodeClassification.train.estimate('myGraph', {
  pipeline: 'pipe',
  nodeLabels: ['House'],
  modelName: 'nc-model',
  targetProperty: 'class',
  randomSeed: 2,
  metrics: [ 'F1_WEIGHTED' ]
})
YIELD bytesMin, bytesMax, requiredMemory
Table 25. Results
bytesMin bytesMax requiredMemory

66787480

66862560

"[63 MiB ... 63 MiB]"

If a node property step does not have an estimation implemented, the step will be ignored in the estimation.

7.3.2. Train

In the following examples we will demonstrate running the Node Classification training pipeline on this graph. We will train a model to predict the class in which a house belongs, based on its sizePerStory property.

The following will train a model using a pipeline:
CALL gds.beta.pipeline.nodeClassification.train('myGraph', {
  pipeline: 'pipe',
  nodeLabels: ['House'],
  modelName: 'nc-pipeline-model',
  targetProperty: 'class',
  randomSeed: 42,
  concurrency:1,
  metrics: ['F1_WEIGHTED']
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.F1_WEIGHTED.train.avg AS avgTrainScore,
  modelInfo.metrics.F1_WEIGHTED.outerTrain AS outerTrainScore,
  modelInfo.metrics.F1_WEIGHTED.test AS testScore
Table 26. Results
winningModel avgTrainScore outerTrainScore testScore

{maxEpochs=100, minEpochs=1, penalty=0.0625, patience=1, methodName=LogisticRegression, batchSize=100, tolerance=0.001}

0.999999989939394

0.9999999912121211

0.9999999850000002

Here we can observe that the model candidate with penalty 0.0625 performed the best in the training phase, with an F1_WEIGHTED score nearing 1 over the train graph as well as on the test graph. This indicates that the model fit the train graph very well and was also able to generalize fairly well to unseen data. Note that this is just a toy example on a very small graph. In order to achieve a higher test score, we may need to use better features, a larger graph, or a different model configuration.

8. Applying a trained model for prediction

In the previous sections we have seen how to build up a Node Classification training pipeline and train it to produce a classification model. After training, the runnable model is of type NodeClassification and resides in the model catalog.

The classification model can be executed with a graph in the graph catalog to predict the class of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.

Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.

The predict graph must contain the properties that the pipeline requires and the used array properties must have the same dimensions as in the train graph. If the predict and train graphs are distinct, it is also beneficial that they have similar origins and semantics, so that the model is able to generalize well.

8.1. Syntax

Node Classification syntax per mode
Run Node Classification in stream mode on a named graph:
CALL gds.beta.pipeline.nodeClassification.predict.stream(
  graphName: String,
  configuration: Map
)
YIELD
  nodeId: Integer,
  predictedClass: Integer,
  predictedProbabilities: List of Float
Table 27. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 28. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a NodeClassification model in the model catalog.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 29. Algorithm specific configuration
Name Type Default Optional Description

includePredictedProbabilities

Boolean

false

yes

Whether to return the probability for each class. If false then null is returned in predictedProbabilities. The order of the classes can be inspected in the modelInfo of the classification model (see listing models).

Table 30. Results
Name Type Description

nodeId

Integer

Node ID.

predictedClass

Integer

Predicted class for this node.

predictedProbabilities

List of Float

Probabilities for all classes, for this node.
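For example, assuming the model nc-pipeline-model trained in the previous section, a stream mode call predicting the class of the UnknownHouse nodes could look like the following sketch:
CALL gds.beta.pipeline.nodeClassification.predict.stream('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  nodeLabels: ['UnknownHouse']
})
YIELD nodeId, predictedClass, predictedProbabilities
Since includePredictedProbabilities is true, the probabilities for all classes are returned alongside the predicted class for each node.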

Run Node Classification in mutate mode on a named graph:
CALL gds.beta.pipeline.nodeClassification.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 31. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 32. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of a NodeClassification model in the model catalog.
nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.
mutateProperty | String | n/a | no | The node property in the GDS graph to which the predicted property is written.

Table 33. Algorithm specific configuration
Name | Type | Default | Optional | Description
predictedProbabilityProperty | String | n/a | yes | The node property in which the class probability list is stored. If omitted, the probability list is discarded. The order of the classes can be inspected in the modelInfo of the classification model (see listing models).

Table 34. Results
Name | Type | Description
preProcessingMillis | Integer | Milliseconds for preprocessing the graph.
computeMillis | Integer | Milliseconds for running the algorithm.
postProcessingMillis | Integer | Milliseconds for computing the global metrics.
mutateMillis | Integer | Milliseconds for adding properties to the in-memory graph.
nodePropertiesWritten | Integer | Number of node properties written.
configuration | Map | Configuration used for running the algorithm.

Run Node Classification in write mode on a named graph:
CALL gds.beta.pipeline.nodeClassification.predict.write(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  writeMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 35. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 36. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of a NodeClassification model in the model catalog.
nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.
writeConcurrency | Integer | value of 'concurrency' | yes | The number of concurrent threads used for writing the result to Neo4j.
writeProperty | String | n/a | no | The node property in the Neo4j database to which the predicted property is written.

Table 37. Algorithm specific configuration
Name | Type | Default | Optional | Description
predictedProbabilityProperty | String | n/a | yes | The node property in which the class probability list is stored. If omitted, the probability list is discarded. The order of the classes can be inspected in the modelInfo of the classification model (see listing models).

Table 38. Results
Name | Type | Description
preProcessingMillis | Integer | Milliseconds for preprocessing the graph.
computeMillis | Integer | Milliseconds for running the algorithm.
postProcessingMillis | Integer | Milliseconds for computing the global metrics.
writeMillis | Integer | Milliseconds for writing result back to Neo4j.
nodePropertiesWritten | Integer | Number of node properties written.
configuration | Map | Configuration used for running the algorithm.

8.2. Example

In the following examples we will show how to use a classification model to predict the classes of nodes in your in-memory graph. In addition to the predicted class, we will also produce the probability for each class in another node property. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model trained in the train example, which we named 'nc-pipeline-model'.

8.2.1. Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the stream mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm in stream mode:
CALL gds.beta.pipeline.nodeClassification.predict.stream.estimate('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  nodeLabels: ['UnknownHouse']
})
YIELD bytesMin, bytesMax, requiredMemory
Table 39. Results
bytesMin | bytesMax | requiredMemory
10200 | 10200 | "10200 Bytes"

If a node property step does not have an estimation implemented, the step will be ignored in the estimation.

8.2.2. Stream

The following will run the algorithm and stream the results:
CALL gds.beta.pipeline.nodeClassification.predict.stream('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  nodeLabels: ['UnknownHouse']
})
YIELD nodeId, predictedClass, predictedProbabilities
WITH gds.util.asNode(nodeId) AS houseNode, predictedClass, predictedProbabilities
RETURN
  houseNode.color AS classifiedHouse,
  predictedClass,
  floor(predictedProbabilities[predictedClass] * 100) AS confidence
ORDER BY classifiedHouse
Table 40. Results
classifiedHouse | predictedClass | confidence
"Pink" | 0 | 98.0
"Tan" | 1 | 98.0
"Yellow" | 2 | 79.0

As we can see, the model assigned the pink house to class 0, the tan house to class 1, and the yellow house to class 2. This makes sense: all houses in class 0 had three stories, those in class 1 two stories, and those in class 2 one story, and the same holds for the pink, tan, and yellow houses, respectively. Additionally, the model is confident in these predictions, with a confidence of at least 79% in all cases.

The indices in the predictedProbabilities correspond to the order of the classes in the classification model. To inspect the order of the classes, we can look at its modelInfo (see listing models).
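As a sketch of how to inspect that order, the model catalog can be queried for the model's metadata; the exact YIELD fields and the presence of a classes entry in modelInfo may vary between GDS versions:

```cypher
CALL gds.beta.model.list('nc-pipeline-model')
YIELD modelInfo
RETURN modelInfo.classes AS classes
```

The returned list gives the class values in the same order as the entries of predictedProbabilities.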

8.2.3. Mutate

The mutate execution mode updates the named graph with a new node property containing the predicted class for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row including information about timings and how many properties were written. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

CALL gds.beta.pipeline.nodeClassification.predict.mutate('myGraph', {
  nodeLabels: ['UnknownHouse'],
  modelName: 'nc-pipeline-model',
  mutateProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten
Table 41. Results
nodePropertiesWritten

6

Since we also specified predictedProbabilityProperty, two properties are written for each of the 3 UnknownHouse nodes.
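To verify the mutate step, the new in-memory properties can be streamed back from the named graph. A hedged sketch follows; the procedure is called gds.graph.streamNodeProperties in older GDS versions (newer versions expose it as gds.graph.nodeProperties.stream):

```cypher
CALL gds.graph.streamNodeProperties(
  'myGraph',
  ['predictedClass', 'predictedProbabilities'],
  ['UnknownHouse']
)
YIELD nodeId, nodeProperty, propertyValue
RETURN gds.util.asNode(nodeId).color AS house, nodeProperty, propertyValue
ORDER BY house, nodeProperty
```

Each UnknownHouse node should yield two rows, one per mutated property.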

8.2.4. Write

The write execution mode writes the predicted property for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row including information about timings and how many properties were written. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Write.

CALL gds.beta.pipeline.nodeClassification.predict.write('myGraph', {
  nodeLabels: ['UnknownHouse'],
  modelName: 'nc-pipeline-model',
  writeProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten
Table 42. Results
nodePropertiesWritten
6

Since we also specified predictedProbabilityProperty, two properties are written for each of the 3 UnknownHouse nodes.
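Because write mode persists the results to the database, they can be checked with a plain Cypher query, assuming the UnknownHouse label from the projected graph also exists on the database nodes:

```cypher
MATCH (house:UnknownHouse)
RETURN house.color AS house, house.predictedClass AS predictedClass
ORDER BY house
```

This returns one row per UnknownHouse node with its persisted class.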