Node classification pipelines

This section describes Node classification pipelines in the Neo4j Graph Data Science library.

1. Introduction

Node Classification is a common machine learning task applied to graphs: training models to classify nodes. The GDS library also provides a standalone version of Node Classification. Here we describe Node Classification Pipelines, which facilitate an end-to-end workflow, from feature extraction to node classification. There are two kinds of pipelines: training pipelines and classification pipelines, both of which reside in the model catalog. When a training pipeline is executed, a classification pipeline is created and stored in the model catalog.

A training pipeline is a sequence of two phases:

  1. The graph is augmented with new node properties in a series of steps.

  2. The augmented graph is used for training a node classification model.

One can configure which steps should be included above. The steps execute GDS algorithms that create new node properties. After configuring the node property steps, one can select a subset of node properties to be used as features. The training phase (2) proceeds in a manner akin to the standalone version of Node Classification: it can train multiple models, select the best one, and report relevant performance metrics.

After training the pipeline, a classification pipeline is created. This new pipeline inherits the node property steps and feature configuration from the training pipeline and uses them to generate the relevant features for classifying unlabeled nodes.

Classification can only be done with a trained classification pipeline (not with a training pipeline).

The motivation for using pipelines:

  • easier to get splits right and prevent data leakage

  • ensuring that the same feature creation steps are applied at classification and train time

  • applying the trained model with a single procedure call

  • persisting the pipeline as a whole

2. Creating a pipeline

The first step of building a new pipeline is to create one using gds.alpha.ml.pipeline.nodeClassification.create. This stores a trainable model object in the model catalog of type Node classification training pipeline. This represents a configurable pipeline that can later be invoked for training, which in turn creates a classification pipeline. The latter is also a model which is stored in the catalog with type Node classification pipeline.

2.1. Syntax

Create pipeline syntax
CALL gds.alpha.ml.pipeline.nodeClassification.create(
  pipelineName: String
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 1. Parameters
Name Type Description

pipelineName

String

The name of the created pipeline.

Table 2. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

2.2. Example

The following will create a pipeline:
CALL gds.alpha.ml.pipeline.nodeClassification.create('pipe')
Table 3. Results
name nodePropertySteps featureProperties splitConfig parameterSpace

"pipe"

[]

[]

{testFraction=0.3, validationFolds=3}

[{maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001}]

This shows that the newly created pipeline does not contain any steps yet, and has defaults for the split and train parameters.

3. Adding node properties

A node classification pipeline can execute one or several GDS algorithms in mutate mode that create node properties in the in-memory graph. Such node property steps can be chained one after another, and the created properties can later be used as features. Moreover, the node property steps that are added to the training pipeline are executed both when training a model and when the classification pipeline is applied for classification.

The name of the procedure that should be added can be a fully qualified GDS procedure name ending with .mutate. The ending .mutate may be omitted and one may also use shorthand forms such as node2vec instead of gds.beta.node2vec.mutate.
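
For instance, assuming the pipeline 'pipe' created above, the following two calls would be equivalent ways of adding a node2vec step (the mutateProperty name 'embedding' is purely illustrative):

```cypher
// Fully qualified procedure name:
CALL gds.alpha.ml.pipeline.nodeClassification.addNodeProperty('pipe', 'gds.beta.node2vec.mutate', {
  mutateProperty: 'embedding'
})

// Equivalent shorthand form:
CALL gds.alpha.ml.pipeline.nodeClassification.addNodeProperty('pipe', 'node2vec', {
  mutateProperty: 'embedding'
})
```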

For example, pre-processing algorithms can be used as node property steps.

3.1. Syntax

Add node property syntax
CALL gds.alpha.ml.pipeline.nodeClassification.addNodeProperty(
  pipelineName: String,
  procedureName: String,
  procedureConfiguration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 4. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

procedureName

String

The name of the procedure to be added to the pipeline.

procedureConfiguration

Map

The configuration of the procedure, excluding graphName, nodeLabels and relationshipTypes.

Table 5. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

3.2. Example

The following will add a node property step to the pipeline. Here we assume that the input graph contains a property sizePerStory.
CALL gds.alpha.ml.pipeline.nodeClassification.addNodeProperty('pipe', 'scaleProperties', {
  nodeProperties: 'sizePerStory',
  scaler: 'L1Norm',
  mutateProperty:'scaledSizes'
})
YIELD name, nodePropertySteps
Table 6. Results
name nodePropertySteps

"pipe"

[{name=gds.alpha.scaleProperties.mutate, config={scaler=L1Norm, mutateProperty=scaledSizes, nodeProperties=sizePerStory}}]

The scaledSizes property can be later used as a feature.

4. Adding features

A Node Classification Pipeline allows you to select a subset of the available node properties to be used as features for the machine learning model. When executing the pipeline, the selected nodeProperties must be either present in the input graph, or created by a previous node property step. For example, the scaledSizes property could be created by the previous example, and we expect sizePerStory to already be present in the in-memory graph used as input, at train and predict time.

4.1. Syntax

Adding a feature to a pipeline syntax
CALL gds.alpha.ml.pipeline.nodeClassification.selectFeatures(
  pipelineName: String,
  nodeProperties: List or String
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 7. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

nodeProperties

List or String

The names of the node properties to use as model features.

Table 8. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

4.2. Example

The following will select features for the pipeline. Here we assume that the input graph contains a property sizePerStory and scaledSizes was created in a nodePropertyStep.
CALL gds.alpha.ml.pipeline.nodeClassification.selectFeatures('pipe', ['scaledSizes', 'sizePerStory'])
YIELD name, featureProperties
Table 9. Results
name featureProperties

"pipe"

[scaledSizes, sizePerStory]

5. Configuring the node splits

Node Classification Pipelines manage splitting the nodes into several sets for training, testing and validating the models defined in the parameter space. Configuring the splitting is optional, and if omitted, splitting will be done using default settings. The splitting configuration of a pipeline can be inspected by using gds.beta.model.list and possibly only yielding splitConfig.
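
For example, assuming the pipeline 'pipe' from the examples on this page, its split configuration could be inspected along these lines (the exact shape of the returned modelInfo map may vary between GDS versions):

```cypher
// List the training pipeline from the model catalog and
// return only its split configuration.
CALL gds.beta.model.list('pipe')
YIELD modelInfo
RETURN modelInfo.splitConfig AS splitConfig
```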

The node splits are used in the training process as follows:

  1. The input graph is split into two parts: the train graph and the test graph. See the example below.

  2. The train graph is further divided into a number of validation folds, each consisting of a train part and a validation part. See the animation below.

  3. Each model candidate is trained on each train part and evaluated on the respective validation part.

  4. The model with the highest average score according to the primary metric will win the training.

  5. The winning model will then be retrained on the entire train graph.

  6. The winning model is evaluated on the train graph as well as the test graph.

  7. The winning model is retrained on the entire original graph.

Below we illustrate an example for a graph with 12 nodes. First we use a testFraction of 0.25 to split into train and test subgraphs.

train-test-image

Then we carry out three validation folds, where we first split the train subgraph into 3 disjoint subsets (s1, s2 and s3), and then alternate which subset is used for validation. For each fold, all candidate models are trained on the red nodes and validated on the green nodes.

validation-folds-image

5.1. Syntax

Configure the node split syntax
CALL gds.alpha.ml.pipeline.nodeClassification.configureSplit(
  pipelineName: String,
  configuration: Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 10. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

configuration

Map

Configuration for splitting the nodes.

Table 11. Configuration
Name Type Default Description

validationFolds

Integer

3

Number of divisions of the training graph used during model selection.

testFraction

Double

0.3

Fraction of the graph reserved for testing. Must be in the range (0, 1). The fraction used for the training is 1 - testFraction.

Table 12. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

5.2. Example

The following will configure the splitting of the pipeline:
CALL gds.alpha.ml.pipeline.nodeClassification.configureSplit('pipe', {
  testFraction: 0.2,
  validationFolds: 5
})
YIELD splitConfig
Table 13. Results
splitConfig

{testFraction=0.2, validationFolds=5}

We have now reconfigured the splitting of the pipeline, and the new configuration will be applied during training.

6. Configuring the model parameters

The gds.alpha.ml.pipeline.nodeClassification.configureParams procedure is used to set up the train mode with a list of configurations of logistic regression models. The set of model configurations is called the parameter space, which parametrizes a set of model candidates. The parameter space can be configured by passing this procedure a list of maps, where each map configures the training of one logistic regression model. In Training the pipeline, we explain further how the configured model candidates are trained, evaluated and compared.

The allowed model parameters are listed in the table Model configuration.

If configureParams is not used, then a single model with defaults for all the model parameters is used. The parameter space of a pipeline can be inspected using gds.beta.model.list and optionally yielding only parameterSpace.
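
For example, assuming the pipeline 'pipe' from the examples on this page, the parameter space could be inspected with a call along these lines (the exact shape of the returned modelInfo map may vary between GDS versions):

```cypher
// List the training pipeline from the model catalog and
// return only its parameter space.
CALL gds.beta.model.list('pipe')
YIELD modelInfo
RETURN modelInfo.parameterSpace AS parameterSpace
```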

6.1. Syntax

Configure the train parameters syntax
CALL gds.alpha.ml.pipeline.nodeClassification.configureParams(
  pipelineName: String,
  parameterSpace: List of Map
)
YIELD
  name: String,
  nodePropertySteps: List of Map,
  featureProperties: List of String,
  splitConfig: Map,
  parameterSpace: List of Map
Table 14. Parameters
Name Type Description

pipelineName

String

The name of the pipeline.

parameterSpace

List of Map

The parameter space used to select the best model from. Each Map corresponds to a potential model. The allowed parameters for a model are defined in the next table.

Table 15. Model configuration
Name Type Default Optional Description

penalty

Float

0.0

yes

Penalty used for the logistic regression. By default, no penalty is applied.

batchSize

Integer

100

yes

Number of nodes per batch.

minEpochs

Integer

1

yes

Minimum number of training epochs.

maxEpochs

Integer

100

yes

Maximum number of training epochs.

patience

Integer

1

yes

Maximum number of unproductive consecutive epochs.

tolerance

Float

0.001

yes

The minimal improvement of the loss to be considered productive.

Table 16. Results
Name Type Description

name

String

Name of the pipeline.

nodePropertySteps

List of Map

List of configurations for node property steps.

featureProperties

List of String

List of node properties to be used as features.

splitConfig

Map

Configuration to define the split before the model training.

parameterSpace

List of Map

List of parameter configurations for models which the train mode uses for model selection.

6.2. Example

The following will configure the parameter space of the pipeline:
CALL gds.alpha.ml.pipeline.nodeClassification.configureParams('pipe',
  [{penalty: 0.0625}, {tolerance: 0.01}, {maxEpochs: 500}]
) YIELD parameterSpace
Table 17. Results
parameterSpace

[{maxEpochs=100, minEpochs=1, penalty=0.0625, patience=1, batchSize=100, tolerance=0.001}, {maxEpochs=100, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.01}, {maxEpochs=500, minEpochs=1, penalty=0.0, patience=1, batchSize=100, tolerance=0.001}]

The parameterSpace in the pipeline now contains the three different model parameters, expanded with the default values. Each specified model configuration will be tried out during the model selection in training.

7. Training the pipeline

The train mode, gds.alpha.ml.pipeline.nodeClassification.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use. Running this mode results in a classification pipeline of type Node classification pipeline, which is then stored in the model catalog. The classification pipeline can then be applied to a possibly different graph in order to classify nodes.

More precisely, the training proceeds as follows:

  1. Apply nodeLabels and relationshipType filters to the graph.

  2. Apply the node property steps, added according to Adding node properties, on the whole graph.

  3. Select node properties to be used as features, as specified in Adding features.

  4. Split the input graph into two parts: the train graph and the test graph. This is described in Configuring the node splits. These graphs are internally managed and exist only for the duration of the training.

  5. Split the nodes in the train graph using stratified k-fold cross-validation. The number of folds k can be configured as described in Configuring the node splits.

  6. Each model candidate defined in the parameter space is trained on each train set and evaluated on the respective validation set for every fold. The training process uses a logistic regression algorithm, and the evaluation uses the specified metric.

  7. Choose the best performing model according to the highest average score for the primary metric.

  8. Retrain the winning model on the entire train graph.

  9. Evaluate the performance of the winning model on the whole train graph as well as the test graph.

  10. Retrain the winning model on the entire original graph.

  11. Register the winning model in the Model Catalog.

The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by steps that were added before it.

7.1. Metrics

The Node Classification model in the Neo4j GDS library supports the following evaluation metrics:

  • Global metrics

    • F1_WEIGHTED

    • F1_MACRO

    • ACCURACY

  • Per-class metrics

    • F1(class=<number>) or F1(class=*)

    • PRECISION(class=<number>) or PRECISION(class=*)

    • RECALL(class=<number>) or RECALL(class=*)

    • ACCURACY(class=<number>) or ACCURACY(class=*)

The * is syntactic sugar for reporting the metric for each class in the graph. When using a per-class metric, the reported metrics contain keys such as ACCURACY_class_1.

More than one metric can be specified during training, but only the first one specified (the primary metric) is used for evaluation; the results of all specified metrics are present in the train results. The primary metric may not be a * expansion, due to the ambiguity of which of the expanded metrics should be the primary one.
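
As a sketch, the train call below (with illustrative graph, pipeline and model names) would use F1_WEIGHTED as the primary metric for model selection, while ACCURACY and the expanded per-class RECALL metrics would also be reported in the train results:

```cypher
CALL gds.alpha.ml.pipeline.nodeClassification.train('myGraph', {
  pipeline: 'pipe',
  modelName: 'nc-metrics-example',  // illustrative model name
  targetProperty: 'class',
  // F1_WEIGHTED is primary; the others are reported but not used for selection.
  metrics: ['F1_WEIGHTED', 'ACCURACY', 'RECALL(class=*)']
}) YIELD modelInfo
RETURN modelInfo.metrics AS metrics
```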

7.2. Syntax

Run Node Classification in train mode on a named graph:
CALL gds.alpha.ml.pipeline.nodeClassification.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  configuration: Map
Table 18. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 19. Configuration
Name Type Default Optional Description

pipeline

String

n/a

no

The name of the pipeline to execute.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

targetProperty

String

n/a

no

The node property that holds the target class of each node. Must be of type Integer.

metrics

List of String

n/a

no

Metrics used to evaluate the models.

randomSeed

Integer

n/a

yes

Seed for the random number generator used during training.

modelName

String

n/a

no

The name of the model to train, must not exist in the Model Catalog.

Table 20. Results
Name Type Description

trainMillis

Integer

Milliseconds used for training.

modelInfo

Map

Information about the training and the winning model.

configuration

Map

Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Table 21. Model info fields
Name Type Description

classes

List of Integer

Sorted list of class ids which are the distinct values of targetProperty over the entire graph.

bestParameters

Map

The model parameters which performed best on average on validation folds according to the primary metric.

metrics

Map

Map from metric description to evaluated metrics for various models and subsets of the data, see below.

trainingPipeline

Map

The pipeline used for the training.

The structure of modelInfo is:

{
    bestParameters: Map,        (1)
    trainingPipeline: Map,      (2)
    classes: List of Integer,   (3)
    metrics: {                  (4)
        <METRIC_NAME>: {        (5)
            test: Float,        (6)
            outerTrain: Float,  (7)
            train: [{           (8)
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...
            ],
            validation: [{      (9)
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            {
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            },
            ...
            ]
        }
    }
}
1 The best scoring model candidate configuration.
2 The pipeline used for the training.
3 Sorted list of class ids which are the distinct values of targetProperty over the entire graph.
4 The metrics map contains an entry for each metric description, and the corresponding results for that metric.
5 A metric name specified in the configuration of the procedure, e.g., F1_MACRO or RECALL(class=4).
6 Numeric value for the evaluation of the winning model on the test set.
7 Numeric value for the evaluation of the winning model on the outer train set.
8 The train entry lists the scores over the train set for all candidate models (e.g., params). Each such result is in turn also a map with keys params, avg, min and max.
9 The validation entry lists the scores over the validation set for all candidate models (e.g., params). Each such result is in turn also a map with keys params, avg, min and max.

7.3. Example

In this section we will show examples of running a Node Classification training pipeline on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the model in a real setting. We will do this on a small graph of a handful of nodes representing houses. This is an example of multi-class classification: the distinct values of the class node property determine the number of classes, in this case three (0, 1 and 2). The example graph looks like this:

node classification
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (:House {color: 'Gold', sizePerStory: [15.5, 23.6, 33.1], class: 0}),
  (:House {color: 'Red', sizePerStory: [15.5, 23.6, 100.0], class: 0}),
  (:House {color: 'Blue', sizePerStory: [11.3, 35.1, 22.0], class: 0}),
  (:House {color: 'Green', sizePerStory: [23.2, 55.1, 0.0], class: 1}),
  (:House {color: 'Gray', sizePerStory: [34.3, 24.0, 0.0],  class: 1}),
  (:House {color: 'Black', sizePerStory: [71.66, 55.0, 0.0], class: 1}),
  (:House {color: 'White', sizePerStory: [11.1, 111.0, 0.0], class: 1}),
  (:House {color: 'Teal', sizePerStory: [80.8, 0.0, 0.0], class: 2}),
  (:House {color: 'Beige', sizePerStory: [106.2, 0.0, 0.0], class: 2}),
  (:House {color: 'Magenta', sizePerStory: [99.9, 0.0, 0.0], class: 2}),
  (:House {color: 'Purple', sizePerStory: [56.5, 0.0, 0.0], class: 2}),
  (:UnknownHouse {color: 'Pink', sizePerStory: [23.2, 55.1, 56.1]}),
  (:UnknownHouse {color: 'Tan', sizePerStory: [22.32, 102.0, 0.0]}),
  (:UnknownHouse {color: 'Yellow', sizePerStory: [39.0, 0.0, 0.0]});

With the graph in Neo4j we can now project it into the graph catalog to prepare it for the pipeline execution. We do this using a native projection targeting the House and UnknownHouse labels. We will also project the sizePerStory property to use as a model feature, and the class property to use as the target.

In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.create('myGraph', {
    House: { properties: ['sizePerStory', 'class'] },
    UnknownHouse: { properties: 'sizePerStory' }
  },
  '*'
)

In the following examples we will demonstrate running the Node Classification training pipeline on this graph. We will train a model to predict the class in which a house belongs, based on its sizePerStory property.

The following will train a model using a pipeline:
CALL gds.alpha.ml.pipeline.nodeClassification.train('myGraph', {
  pipeline: 'pipe',
  nodeLabels: ['House'],
  modelName: 'nc-pipeline-model',
  targetProperty: 'class',
  randomSeed: 42,
  concurrency:1,
  metrics: ['F1_WEIGHTED']
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.F1_WEIGHTED.outerTrain AS trainGraphScore,
  modelInfo.metrics.F1_WEIGHTED.test AS testGraphScore
Table 22. Results
winningModel trainGraphScore testGraphScore

{maxEpochs=100, minEpochs=1, penalty=0.0625, patience=1, batchSize=100, tolerance=0.001}

0.9999999912121211

0.9999999850000002

Here we can observe that the model candidate with penalty 0.0625 performed best in the training phase, with an F1_WEIGHTED score nearing 1 on the train graph as well as on the test graph. This indicates that the model fit the train graph very well, and was able to generalize fairly well to unseen data. Notice that this is just a toy example on a very small graph. In order to achieve a higher test score, we may need to use better features, a larger graph, or a different model configuration.

8. Applying a trained model for prediction

In the previous sections we have seen how to build up a Node Classification training pipeline and train it to produce a classification pipeline. After training, the runnable model is of type Node classification pipeline and resides in the model catalog.

The classification pipeline can be executed with a graph in the graph catalog to predict the value of the target property (class) of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.
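
For example, the registered class order can be retrieved from the model catalog in order to interpret the probability list (a sketch, assuming the trained model 'nc-pipeline-model' from the training example, and that modelInfo exposes the classes field described in Table 21):

```cypher
// Retrieve the sorted class ids registered in the trained model;
// predictedProbabilities follows this order.
CALL gds.beta.model.list('nc-pipeline-model')
YIELD modelInfo
RETURN modelInfo.classes AS classOrder
```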

Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.

The predict graph must contain the properties that the pipeline requires and the used array properties must have the same dimensions as in the train graph. If the predict and train graphs are distinct, it is also beneficial that they have similar origins and semantics, so that the model is able to generalize well.

8.1. Syntax

Node Classification syntax per mode
Run Node Classification in stream mode on a named graph:
CALL gds.alpha.ml.pipeline.nodeClassification.predict.stream(
  graphName: String,
  configuration: Map
)
YIELD
  nodeId: Integer,
  predictedClass: Integer,
  predictedProbabilities: List of Float
Table 23. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 24. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a model of type Node classification pipeline in the model catalog.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 25. Algorithm specific configuration
Name Type Default Optional Description

includePredictedProbabilities

Boolean

false

yes

Whether to return the probability for each class. If false then null is returned in predictedProbabilities.

batchSize

Integer

100

yes

Number of nodes per batch.

Table 26. Results
Name Type Description

nodeId

Integer

Node ID.

predictedClass

Integer

Predicted class for this node.

predictedProbabilities

List of Float

Probabilities for all classes, for this node.

Run Node Classification in mutate mode on a named graph:
CALL gds.alpha.ml.pipeline.nodeClassification.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 27. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 28. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a model of type Node classification pipeline in the model catalog.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateProperty

String

n/a

no

The node property in the GDS graph to which the predicted property is written.

Table 29. Algorithm specific configuration
Name Type Default Optional Description

predictedProbabilityProperty

String

n/a

yes

The node property in which the class probability list is stored. If omitted, the probability list is discarded.

batchSize

Integer

100

yes

Number of nodes per batch.

Table 30. Results
Name Type Description

createMillis

Integer

Milliseconds for creating the graph.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing the global metrics.

mutateMillis

Integer

Milliseconds for adding properties to the in-memory graph.

nodePropertiesWritten

Integer

Number of node properties written.

configuration

Map

Configuration used for running the algorithm.

Run Node Classification in write mode on a named graph:
CALL gds.alpha.ml.pipeline.nodeClassification.predict.write(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  writeMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 31. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 32. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

modelName

String

n/a

no

The name of a model of type Node classification pipeline in the model catalog.

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result to Neo4j.

writeProperty

String

n/a

no

The node property in the Neo4j database to which the predicted property is written.

Table 33. Algorithm specific configuration
Name Type Default Optional Description

predictedProbabilityProperty

String

n/a

yes

The node property in which the class probability list is stored. If omitted, the probability list is discarded.

batchSize

Integer

100

yes

Number of nodes per batch.

Table 34. Results
Name Type Description

createMillis

Integer

Milliseconds for creating the graph.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing the global metrics.

writeMillis

Integer

Milliseconds for writing result back to Neo4j.

nodePropertiesWritten

Integer

Number of node properties written.

configuration

Map

Configuration used for running the algorithm.

8.2. Example

In the following examples we will show how to use a classification pipeline to predict the class of a node in your in-memory graph. In addition to the predicted class, we will also produce the probability for each class in another node property. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model which we trained in the train example which we gave the name 'nc-pipeline-model'.

8.2.1. Stream

CALL gds.alpha.ml.pipeline.nodeClassification.predict.stream('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  nodeLabels: ['UnknownHouse']
})
YIELD nodeId, predictedClass, predictedProbabilities
WITH gds.util.asNode(nodeId) AS houseNode, predictedClass, predictedProbabilities
RETURN
  houseNode.color AS classifiedHouse,
  predictedClass,
  floor(predictedProbabilities[predictedClass] * 100) AS confidence
ORDER BY classifiedHouse
Table 35. Results
classifiedHouse predictedClass confidence

"Pink"

0

98.0

"Tan"

1

98.0

"Yellow"

2

79.0

As we can see, the model classified the pink house as class 0, the tan house as class 1, and the yellow house as class 2. This makes sense, as all houses in class 0 have three stories, those in class 1 two stories, and those in class 2 one story, and the same is true of the pink, tan and yellow houses, respectively. Additionally, we see that the model is confident in these predictions, as the confidence is at least 79% in all cases.

8.2.2. Mutate

The mutate execution mode updates the named graph with a new node property containing the predicted class for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row including information about timings and how many properties were written. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

CALL gds.alpha.ml.pipeline.nodeClassification.predict.mutate('myGraph', {
  nodeLabels: ['UnknownHouse'],
  modelName: 'nc-pipeline-model',
  mutateProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten
Table 36. Results
nodePropertiesWritten

6

Since we also specified predictedProbabilityProperty, two properties are written for each of the 3 UnknownHouse nodes.
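
To verify, the mutated predictedClass property could be streamed back from the in-memory graph, for example with gds.graph.streamNodeProperties (a sketch; the yielded fields may vary between GDS versions):

```cypher
// Stream the mutated property for the UnknownHouse nodes only.
CALL gds.graph.streamNodeProperties('myGraph', ['predictedClass'], ['UnknownHouse'])
YIELD nodeId, propertyValue
RETURN gds.util.asNode(nodeId).color AS classifiedHouse, propertyValue AS predictedClass
ORDER BY classifiedHouse
```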

8.2.3. Write

The write execution mode writes the predicted property for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row including information about timings and how many properties were written. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Write.

CALL gds.alpha.ml.pipeline.nodeClassification.predict.write('myGraph', {
  nodeLabels: ['UnknownHouse'],
  modelName: 'nc-pipeline-model',
  writeProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten
Table 37. Results
nodePropertiesWritten

6

Since we also specified predictedProbabilityProperty, two properties are written for each of the 3 UnknownHouse nodes.
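
Since write mode persists the results to the Neo4j database, the new properties can be inspected afterwards with plain Cypher, for example:

```cypher
// Read back the predicted classes written to the database.
MATCH (house:UnknownHouse)
RETURN house.color AS classifiedHouse, house.predictedClass AS predictedClass
ORDER BY classifiedHouse
```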