Node Classification

This section describes the Node Classification Model in the Neo4j Graph Data Science library.

1. Introduction

Node Classification is a common machine learning task applied to graphs: training a model to learn to which class a node belongs. Neo4j GDS trains supervised machine learning models based on node properties (features) in your graph to predict which class an unseen or future node would belong to. Node Classification can be used favorably together with pre-processing algorithms.

Concretely, Node Classification models are used to predict a non-existing node property based on other node properties. The non-existing node property represents the class, and is referred to as the target property. The specified node properties are used as input features. The Node Classification model does not rely on relationship information. However, a node embedding algorithm could embed the neighborhoods of nodes as a node property, to transfer this information into the Node Classification model (see Node embeddings).
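For instance, a node embedding algorithm run in mutate mode can add an embedding property to the in-memory graph, which can then be listed among the input features. The following is a minimal sketch only; it assumes a projected graph named 'myGraph' exists in the catalog and that the FastRP procedure is available as gds.fastRP.mutate in your GDS version (older versions expose it as gds.beta.fastRP.mutate):
CALL gds.fastRP.mutate('myGraph', {
  embeddingDimension: 64,        // length of the embedding vectors
  mutateProperty: 'embedding'    // new node property holding the embedding
})
YIELD nodePropertiesWritten
RETURN nodePropertiesWritten
The resulting 'embedding' property could then be passed via featureProperties when training the Node Classification model.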

Models are trained on parts of the input graph and evaluated using specified metrics. Splitting of the graph into a train and a test graph is performed internally by the algorithm, and the test graph is used to evaluate model performance.

The training process follows this outline:

  1. The input graph is split into two parts: the train graph and the test graph.

  2. The train graph is further divided into a number of validation folds, each consisting of a train part and a validation part.

  3. Each model candidate is trained on each train part and evaluated on the respective validation part.

  4. The training process uses a logistic regression algorithm, and the evaluation uses the specified metrics. The first metric is the primary metric.

  5. The model with the highest average score according to the primary metric will win the training.

  6. The winning model will then be retrained on the entire train graph.

  7. The winning model is evaluated on the train graph as well as the test graph.

  8. The winning model is retrained on the entire original graph.

  9. Finally, the winning model will be registered in the Model Catalog.

Trained models may then be used to predict the value of the target property (class) of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.

1.1. Metrics

The Node Classification model in the Neo4j GDS library supports the following evaluation metrics:

  • Global metrics

    • F1_WEIGHTED

    • F1_MACRO

    • ACCURACY

  • Per-class metrics

    • F1(class=<number>) or F1(class=*)

    • PRECISION(class=<number>) or PRECISION(class=*)

    • RECALL(class=<number>) or RECALL(class=*)

    • ACCURACY(class=<number>) or ACCURACY(class=*)

The * is syntactic sugar for reporting the metric for each class in the graph. When using a per-class metric, the reported metrics contain keys such as ACCURACY_class_1.

More than one metric can be specified during training, but only the first specified metric, the primary one, is used for model selection. The results of all specified metrics are present in the train results. The primary metric may not be a * expansion due to the ambiguity of which of the expanded metrics should be the primary one.
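To make these rules concrete, the following hypothetical train call specifies a global primary metric followed by per-class metrics; the graph name, model name, and remaining parameters are placeholders only and are explained in the Syntax section below:
CALL gds.alpha.ml.nodeClassification.train('myGraph', {
  modelName: 'nc-metrics-example',   // hypothetical model name
  featureProperties: ['sizePerStory'],
  targetProperty: 'class',
  holdoutFraction: 0.2,
  validationFolds: 5,
  metrics: ['F1_WEIGHTED', 'ACCURACY', 'RECALL(class=*)'],  // F1_WEIGHTED is the primary metric
  params: [{penalty: 1.0}]
})
YIELD modelInfo
RETURN keys(modelInfo.metrics) AS reportedMetrics
Because the primary metric must not be a * expansion, RECALL(class=*) is listed after F1_WEIGHTED here.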

2. Syntax

This section covers the syntax used to execute the Node Classification algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Example 1. Node Classification syntax per mode
Run Node Classification in train mode on a named graph:
CALL gds.alpha.ml.nodeClassification.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  configuration: Map
Table 1. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of the model to train. It must not exist in the Model Catalog.
featureProperties | List<String> | [] | yes | The names of the node properties that should be used as input features. All property names must exist in the in-memory graph and be of type Float or List<Float>.
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name | Type | Default | Optional | Description
targetProperty | String | n/a | no | The node property holding the class of the node. Must be of type Integer.
holdoutFraction | Float | n/a | no | Fraction of the graph reserved for testing. Must be in the range (0, 1).
validationFolds | Integer | n/a | no | Number of divisions of the train graph used for model selection.
metrics | List<String> | n/a | no | Metrics used to evaluate the models.
params | List<Map> | n/a | no | List of model configurations to be trained. See the next table for details.
randomSeed | Integer | n/a | yes | Seed for the random number generator used during training.

Table 4. Model configuration
Name | Type | Default | Optional | Description
penalty | Float | n/a | no | Penalty used for the logistic regression.
batchSize | Integer | 100 | yes | Number of nodes per batch.
minEpochs | Integer | 1 | yes | Minimum number of training epochs.
maxEpochs | Integer | 100 | yes | Maximum number of training epochs.
patience | Integer | 1 | yes | Maximum number of iterations that do not improve the loss before stopping.
tolerance | Float | 0.001 | yes | Minimum improvement of the loss required to continue training; smaller improvements are treated as convergence.
concurrency | Integer | see description | yes | Concurrency for training the model candidate. By default the value of the top level concurrency parameter is used.

For hyperparameter tuning ideas, look here.
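As an illustration of combining several hyperparameters per model candidate, the following hypothetical train call trains three candidates; the graph name, model name, and values are placeholders, not tuning recommendations:
CALL gds.alpha.ml.nodeClassification.train('myGraph', {
  modelName: 'nc-tuning-example',    // hypothetical model name
  featureProperties: ['sizePerStory'],
  targetProperty: 'class',
  holdoutFraction: 0.2,
  validationFolds: 5,
  metrics: ['F1_WEIGHTED'],
  params: [
    {penalty: 0.0625, maxEpochs: 500},
    {penalty: 0.5, tolerance: 0.0001},
    {penalty: 1.0, batchSize: 50, patience: 2}
  ]
})
YIELD modelInfo
RETURN modelInfo.bestParameters AS bestParameters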

Table 5. Results
Name | Type | Description
trainMillis | Integer | Milliseconds used for training.
modelInfo | Map | Information about the training and the winning model.
configuration | Map | Configuration used for the train procedure.

The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:

Name | Type | Description
classes | List<Integer> | Sorted list of class ids which are the distinct values of targetProperty over the entire graph.
bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric.
metrics | Map | Map from metric description to evaluated metrics for various models and subsets of the data, see below.

The metrics map contains for each metric description (such as F1_MACRO or RECALL(class=4)), the corresponding results for that metric. Each such result map contains the keys train, validation, outerTrain and test. The latter two correspond to numeric values for the evaluation of the best model on the test set and its complement — the outer train set. The train and validation results contain lists of results for all candidate models (i.e. params). Each such result is in turn also a map with keys params, avg, min and max. These correspond to the hyper-parameters of the model candidate, and statistics for the result of that model over cross-validation folds. The structure of modelInfo is:

{
    bestParameters: Map,     // one of the maps specified in `params`
    classes: List of Integer,
    metrics: {
        String: {            // a metric description
            "test": Float,
            "outerTrain": Float,
            "train": [{
                "avg": Float,
                "max": Float,
                "min": Float,
                "params": Map
            },
            {
                "avg": Float,
                "max": Float,
                "min": Float,
                "params": Map
            },
            ...              // more results per model candidate
            ],
            "validation": [{
                "avg": Float,
                "max": Float,
                "min": Float,
                "params": Map
            },
            {
                "avg": Float,
                "max": Float,
                "min": Float,
                "params": Map
            },
            ...              // more results per model candidate
            ],
        },
        String: Map,         // another metric
        ...                  // remaining metrics
    }
}
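As a sketch of navigating this structure, the nested metric values can also be read back from the Model Catalog once a model has been trained. This assumes that the gds.beta.model.list procedure is available in your GDS version and that a model named 'nc-model' was trained with F1_WEIGHTED among its metrics:
CALL gds.beta.model.list('nc-model')
YIELD modelInfo
RETURN
  modelInfo.bestParameters                 AS bestParameters,
  modelInfo.metrics.F1_WEIGHTED.test       AS testScore,
  modelInfo.metrics.F1_WEIGHTED.validation AS validationStatsPerCandidate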
Run Node Classification in stream mode on a named graph:
CALL gds.alpha.ml.nodeClassification.predict.stream(
  graphName: String,
  configuration: Map
) YIELD
  nodeId: Integer,
  predictedClass: Integer,
  predictedProbabilities: List[Float]
Table 6. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 7. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of a Node Classification model in the model catalog.
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.

Table 8. Algorithm specific configuration
Name | Type | Default | Optional | Description
includePredictedProbabilities | Boolean | false | yes | Whether to return the probability for each class. If false, null is returned in predictedProbabilities.
batchSize | Integer | 100 | yes | Number of nodes per batch.

Table 9. Results
Name | Type | Description
nodeId | Integer | Node ID.
predictedClass | Integer | Predicted class for this node.
predictedProbabilities | List[Float] | Probabilities for all classes, for this node.

Run Node Classification in mutate mode on a named graph:
CALL gds.alpha.ml.nodeClassification.predict.mutate(
  graphName: String,
  configuration: Map
) YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 10. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 11. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of a Node Classification model in the model catalog.
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.
mutateProperty | String | n/a | no | The node property in the GDS graph to which the predicted property is written.

Table 12. Algorithm specific configuration
Name | Type | Default | Optional | Description
predictedProbabilityProperty | String | n/a | yes | The node property in which the class probability list is stored. If omitted, the probability list is discarded.
batchSize | Integer | 100 | yes | Number of nodes per batch.

Table 13. Results
Name | Type | Description
createMillis | Integer | Milliseconds for creating the graph.
computeMillis | Integer | Milliseconds for running the algorithm.
postProcessingMillis | Integer | Milliseconds for computing the global metrics.
mutateMillis | Integer | Milliseconds for adding properties to the in-memory graph.
nodePropertiesWritten | Integer | Number of node properties written.
configuration | Map | Configuration used for running the algorithm.

Run Node Classification in write mode on a named graph:
CALL gds.alpha.ml.nodeClassification.predict.write(
  graphName: String,
  configuration: Map
) YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  writeMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 14. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 15. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of a Node Classification model in the model catalog.
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.
writeConcurrency | Integer | value of 'concurrency' | yes | The number of concurrent threads used for writing the result to Neo4j.
writeProperty | String | n/a | no | The node property in the Neo4j database to which the predicted property is written.

Table 16. Algorithm specific configuration
Name | Type | Default | Optional | Description
predictedProbabilityProperty | String | n/a | yes | The node property in which the class probability list is stored. If omitted, the probability list is discarded.
batchSize | Integer | 100 | yes | Number of nodes per batch.

Table 17. Results
Name | Type | Description
createMillis | Integer | Milliseconds for creating the graph.
computeMillis | Integer | Milliseconds for running the algorithm.
postProcessingMillis | Integer | Milliseconds for computing the global metrics.
writeMillis | Integer | Milliseconds for writing result back to Neo4j.
nodePropertiesWritten | Integer | Number of node properties written.
configuration | Map | Configuration used for running the algorithm.

3. Examples

In this section we will show examples of training a Node Classification Model on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the model in a real setting. We will do this on a small graph of a handful of nodes representing houses. The example graph looks like this:

(Figure: the node classification example graph of House and UnknownHouse nodes.)
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (:House {color: 'Gold', sizePerStory: [15.5, 23.6, 33.1], class: 0}),
  (:House {color: 'Red', sizePerStory: [15.5, 23.6, 100.0], class: 0}),
  (:House {color: 'Blue', sizePerStory: [11.3, 35.1, 22.0], class: 0}),
  (:House {color: 'Green', sizePerStory: [23.2, 55.1, 0.0], class: 1}),
  (:House {color: 'Gray', sizePerStory: [34.3, 24.0, 0.0],  class: 1}),
  (:House {color: 'Black', sizePerStory: [71.66, 55.0, 0.0], class: 1}),
  (:House {color: 'White', sizePerStory: [11.1, 111.0, 0.0], class: 1}),
  (:House {color: 'Teal', sizePerStory: [80.8, 0.0, 0.0], class: 2}),
  (:House {color: 'Beige', sizePerStory: [106.2, 0.0, 0.0], class: 2}),
  (:House {color: 'Magenta', sizePerStory: [99.9, 0.0, 0.0], class: 2}),
  (:House {color: 'Purple', sizePerStory: [56.5, 0.0, 0.0], class: 2}),
  (:UnknownHouse {color: 'Pink', sizePerStory: [23.2, 55.1, 56.1]}),
  (:UnknownHouse {color: 'Tan', sizePerStory: [22.32, 102.0, 0.0]}),
  (:UnknownHouse {color: 'Yellow', sizePerStory: [39.0, 0.0, 0.0]});

With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution. We do this using a native projection targeting the House and UnknownHouse labels. We will also project the sizePerStory property to use as a model feature, and the class property to use as the target property.

In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.create('myGraph', {
    House: { properties: ['sizePerStory', 'class'] },
    UnknownHouse: { properties: 'sizePerStory' }
  },
  '*'
)

In the following examples we will demonstrate using the Node Classification model on this graph.

3.1. Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the train mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later run the algorithm in one of the execution modes, the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm in train mode:
CALL gds.alpha.ml.nodeClassification.train.estimate('myGraph', {
  nodeLabels: ['House'],
  modelName: 'nc-model',
  featureProperties: ['sizePerStory'],
  targetProperty: 'class',
  randomSeed: 2,
  holdoutFraction: 0.2,
  validationFolds: 5,
  metrics: [ 'F1_WEIGHTED' ],
  params: [
    {penalty: 0.0625},
    {penalty: 0.5},
    {penalty: 1.0},
    {penalty: 4.0}
  ]
})
YIELD bytesMin, bytesMax, requiredMemory
Table 18. Results
bytesMin | bytesMax | requiredMemory
74874400 | 74906360 | "[71 MiB ... 71 MiB]"

3.2. Train

In this example we will train a model to predict the class in which a house belongs, based on its sizePerStory property.

Train a Node Classification model:
CALL gds.alpha.ml.nodeClassification.train('myGraph', {
  nodeLabels: ['House'],
  modelName: 'nc-model',
  featureProperties: ['sizePerStory'],
  targetProperty: 'class',
  randomSeed: 2,
  holdoutFraction: 0.2,
  validationFolds: 5,
  metrics: [ 'F1_WEIGHTED' ],
  params: [
    {penalty: 0.0625},
    {penalty: 0.5},
    {penalty: 1.0},
    {penalty: 4.0}
  ]
}) YIELD modelInfo
RETURN
  {penalty: modelInfo.bestParameters.penalty} AS winningModel,
  modelInfo.metrics.F1_WEIGHTED.outerTrain AS trainGraphScore,
  modelInfo.metrics.F1_WEIGHTED.test AS testGraphScore
Table 19. Results
winningModel | trainGraphScore | testGraphScore
{penalty=0.0625} | 0.999999990909091 | 0.6363636286363638

Here we can observe that the model candidate with penalty 0.0625 performed best in the training phase, with a score of almost 100% over the train graph. On the test graph, the model scores considerably lower, at about 64%. This indicates that the model fit the train graph very well, but did not generalize equally well to unseen data. In order to achieve a higher test score, we may need to use better features, a larger graph, or a different model configuration.
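To inspect how all model candidates, not only the winner, performed during cross-validation, the stored model can be read back from the Model Catalog and its validation statistics unwound. This is a sketch; it assumes the gds.beta.model.list procedure is available in your GDS version:
CALL gds.beta.model.list('nc-model')
YIELD modelInfo
UNWIND modelInfo.metrics.F1_WEIGHTED.validation AS candidate
RETURN
  candidate.params.penalty AS penalty,
  candidate.avg            AS avgValidationScore,
  candidate.min            AS minValidationScore,
  candidate.max            AS maxValidationScore
ORDER BY avgValidationScore DESC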

3.3. Stream

In the stream execution mode, the algorithm returns the predicted property for each node. This allows us to inspect the results directly or post-process them in Cypher without any side effects.

For more details on the stream mode in general, see Stream.

In this example we will show how to use a trained model to predict the class of a node in your in-memory graph. In addition to the predicted class, we will also return the predicted probability for each class. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model named 'nc-model', which we trained in the train example.

CALL gds.alpha.ml.nodeClassification.predict.stream('myGraph', {
  nodeLabels: ['House', 'UnknownHouse'],
  modelName: 'nc-model',
  includePredictedProbabilities: true
}) YIELD nodeId, predictedClass, predictedProbabilities
WITH gds.util.asNode(nodeId) AS houseNode, predictedClass, predictedProbabilities
WHERE houseNode:UnknownHouse
RETURN
  houseNode.color AS classifiedHouse,
  predictedClass,
  floor(predictedProbabilities[predictedClass] * 100) AS confidence
  ORDER BY classifiedHouse
Table 20. Results
classifiedHouse | predictedClass | confidence
"Pink" | 0 | 98.0
"Tan" | 1 | 98.0
"Yellow" | 2 | 79.0

As we can see, the model was able to predict the pink house into class 0, the tan house into class 1, and the yellow house into class 2. This makes sense, as all houses in class 0 have three stories, class 1 two stories, and class 2 one story, and the same is true of the pink, tan and yellow houses, respectively. Additionally, we see that the model is confident in these predictions, as the highest class probability is at least 79% in all cases.
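Because stream mode has no side effects, the predictions can be filtered or otherwise post-processed directly in Cypher. The following sketch keeps only predictions whose highest class probability exceeds a chosen threshold; the 0.9 cutoff is an arbitrary example value:
CALL gds.alpha.ml.nodeClassification.predict.stream('myGraph', {
  nodeLabels: ['House', 'UnknownHouse'],
  modelName: 'nc-model',
  includePredictedProbabilities: true
}) YIELD nodeId, predictedClass, predictedProbabilities
WITH gds.util.asNode(nodeId) AS houseNode, predictedClass,
     predictedProbabilities[predictedClass] AS confidence
WHERE houseNode:UnknownHouse AND confidence >= 0.9
RETURN houseNode.color AS classifiedHouse, predictedClass, confidence
  ORDER BY classifiedHouse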

3.4. Mutate

The mutate execution mode updates the named graph with a new node property containing the predicted class for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row including information about timings and how many properties were written. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

In this example we will show how to use a trained model to predict the class of a node in your in-memory graph. In addition to the predicted class, we will also store the probability for each class in another node property. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model named 'nc-model', which we trained in the train example.

CALL gds.alpha.ml.nodeClassification.predict.mutate('myGraph', {
  nodeLabels: ['House', 'UnknownHouse'],
  modelName: 'nc-model',
  mutateProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten
Table 21. Results
nodePropertiesWritten

28

Since we also specified the predictedProbabilityProperty, we write two properties for each of the 14 nodes. In order to analyse our predicted classes we stream the properties from the in-memory graph:

CALL gds.graph.streamNodeProperties(
  'myGraph', ['predictedProbabilities', 'predictedClass'], ['UnknownHouse']
) YIELD nodeId, nodeProperty, propertyValue
RETURN gds.util.asNode(nodeId).color AS classifiedHouse, nodeProperty, propertyValue
  ORDER BY classifiedHouse, nodeProperty
Table 22. Results
classifiedHouse | nodeProperty | propertyValue
"Pink" | "predictedClass" | 0
"Pink" | "predictedProbabilities" | [0.9866455686217779, 0.01311656378786989, 2.3786759035214687E-4]
"Tan" | "predictedClass" | 1
"Tan" | "predictedProbabilities" | [0.01749164563726576, 0.9824922482993587, 1.610606337562594E-5]
"Yellow" | "predictedClass" | 2
"Yellow" | "predictedProbabilities" | [0.0385634113659007, 0.16350471177895198, 0.7979318768551473]

As we can see, the model was able to predict the pink house into class 0, tan house into class 1, and yellow house into class 2. This makes sense, as all houses in class 0 had three stories, class 1 two stories and class 2 one story, and the same is true of the pink, tan and yellow houses, respectively. Additionally, we see that the model is confident in these predictions, as the highest class probability is >75% in all cases.
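If the mutated predictions should later be persisted, they can be written from the in-memory graph back to the Neo4j database. This is a minimal sketch, assuming the gds.graph.writeNodeProperties procedure and its propertiesWritten return field are available in your GDS version:
CALL gds.graph.writeNodeProperties('myGraph', ['predictedClass', 'predictedProbabilities'])
YIELD propertiesWritten
RETURN propertiesWritten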

3.5. Write

The write execution mode writes the predicted class for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row including information about timings and how many properties were written. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Write.

In this example we will show how to use a trained model to predict the class of a node in your in-memory graph. In addition to the predicted class, we will also write the probability for each class to another node property. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model named 'nc-model', which we trained in the train example.

CALL gds.alpha.ml.nodeClassification.predict.write('myGraph', {
  nodeLabels: ['House', 'UnknownHouse'],
  modelName: 'nc-model',
  writeProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten
Table 23. Results
nodePropertiesWritten

28

Since we also specified the predictedProbabilityProperty, we write two properties for each of the 14 nodes. In order to analyse our predicted classes we query the written properties from the Neo4j database:

MATCH (house:UnknownHouse)
RETURN house.color AS classifiedHouse, house.predictedClass AS predictedClass, house.predictedProbabilities AS predictedProbabilities
Table 24. Results
classifiedHouse | predictedClass | predictedProbabilities
"Pink" | 0 | [0.9866455686217779, 0.01311656378786989, 2.3786759035214687E-4]
"Tan" | 1 | [0.01749164563726576, 0.9824922482993587, 1.610606337562594E-5]
"Yellow" | 2 | [0.0385634113659007, 0.16350471177895198, 0.7979318768551473]

As we can see, the model was able to predict the pink house into class 0, tan house into class 1, and yellow house into class 2. This makes sense, as all houses in class 0 had three stories, class 1 two stories and class 2 one story, and the same is true of the pink, tan and yellow houses, respectively. Additionally, we see that the model is confident in these predictions, as the highest class probability is >75% in all cases.
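Finally, once the predictions have been written to Neo4j, the projected graph and the trained model can be removed from memory if they are no longer needed. This is a sketch, assuming the gds.graph.drop and gds.beta.model.drop procedures are available in your GDS version; run the statements separately:
CALL gds.graph.drop('myGraph')
CALL gds.beta.model.drop('nc-model')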