Node Classification

This section describes the Node Classification Model in the Neo4j Graph Data Science library.

1. Introduction

Node Classification is a common machine learning task applied to graphs: training a model to learn which class a node belongs to. Neo4j GDS trains supervised machine learning models based on node properties (features) in your graph to predict which class an unseen or future node would belong to.

Concretely, Node Classification models are used to predict a non-existing node property based on other node properties. The non-existing node property represents the class, and is referred to as the target property. The specified node properties are used as input features. The Node Classification model does not rely on relationship information. However, a node embedding algorithm could embed the neighborhoods of nodes as a node property, to transfer this information into the Node Classification model (see Node embeddings).

Models are trained on parts of the input graph and evaluated using specified metrics. Splitting of the graph into a train and a test graph is performed internally by the algorithm, and the test graph is used to evaluate model performance.

The training process follows this outline:

  1. The input graph is split into two parts: the train graph and the test graph.

  2. The train graph is further divided into a number of validation folds, each consisting of a train part and a validation part.

  3. Each model candidate is trained on each train part and evaluated on the respective validation part.

  4. The training process uses a logistic regression algorithm, and the evaluation uses the specified metrics. The first metric is the primary metric.

  5. The model with the highest average score according to the primary metric will win the training.

  6. The winning model will then be retrained on the entire train graph.

  7. The winning model is evaluated on the train graph as well as the test graph.

  8. The winning model is retrained on the entire original graph.

  9. Finally, the winning model will be registered in the Model Catalog.
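
The selection part of this outline (steps 1 to 5) can be sketched in a few lines of Python. This is only an illustration, not the library's implementation: the helper names (split_holdout, validation_folds, select_winner), the rounding of the holdout size, the way folds are formed, and the dummy scorer are all assumptions made for the example.

```python
import random
from statistics import mean


def split_holdout(node_ids, holdout_fraction, seed):
    """Shuffle the nodes and reserve a fraction of them as the test set."""
    shuffled = list(node_ids)
    random.Random(seed).shuffle(shuffled)
    # Rounding down is an assumption; the library may round differently.
    test_size = int(len(shuffled) * holdout_fraction)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)


def validation_folds(train_ids, k):
    """Yield k (train_part, validation_part) pairs covering the train set."""
    for i in range(k):
        validation_part = train_ids[i::k]
        held_out = set(validation_part)
        train_part = [n for n in train_ids if n not in held_out]
        yield train_part, validation_part


def select_winner(candidates, train_ids, k, fit_and_score):
    """Return the candidate with the highest mean validation score."""
    def average_score(candidate):
        return mean(
            fit_and_score(candidate, train_part, validation_part)
            for train_part, validation_part in validation_folds(train_ids, k)
        )
    return max(candidates, key=average_score)


def dummy_fit_and_score(candidate, train_part, validation_part):
    """Hypothetical stand-in for 'train on train_part, score on validation_part'."""
    return 1.0 / (1.0 + candidate["penalty"])


train_ids, test_ids = split_holdout(range(11), holdout_fraction=0.2, seed=2)
candidates = [{"penalty": p} for p in (0.0625, 0.5, 1.0, 4.0)]
winner = select_winner(candidates, train_ids, k=5, fit_and_score=dummy_fit_and_score)
```

Under the assumed rounding, 11 nodes with a holdout fraction of 0.2 give 9 train nodes and 2 test nodes. The dummy scorer only exercises the selection logic and says nothing about real model quality.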

Trained models may then be used to predict the value of the target property (class) of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.

1.1. Metrics

The Node Classification model in the Neo4j GDS library supports the following evaluation metrics:

  • F1_WEIGHTED

  • F1_MACRO

  • ACCURACY

More than one metric can be specified during training, but only the first (primary) metric is used to select the winning model. The results of all specified metrics are present in the train results.
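
To make the difference between the three metrics concrete, here is a minimal sketch using the standard textbook definitions. It is an assumption that these match the library's computation in every detail (for example, how a class with no predictions is handled); the function names and the sample labels are made up for the illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {class: F1} for every class that occurs in y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denominator = 2 * tp + fp + fn
        # A class that is never hit gets F1 = 0 here (an assumed convention).
        scores[c] = 2 * tp / denominator if denominator else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def f1_weighted(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by true class frequency."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Hypothetical labels: class 2 is rare and never predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1]
results = {
    "ACCURACY": accuracy(y_true, y_pred),
    "F1_MACRO": f1_macro(y_true, y_pred),
    "F1_WEIGHTED": f1_weighted(y_true, y_pred),
}
```

On this input the three values differ: the rare, always-missed class pulls F1_MACRO down the most, because every class counts equally there, while F1_WEIGHTED discounts it by its low frequency.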

2. Syntax

This section covers the syntax used to execute the Node Classification algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Example 1. Node Classification syntax per mode
Run Node Classification in train mode on a named graph:
CALL gds.alpha.ml.nodeClassification.train(
  graphName: String,
  configuration: Map
) YIELD
  trainMillis: Integer,
  modelInfo: Map,
  configuration: Map
Table 1. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
modelName | String | n/a | no | The name of the model to train, must not exist in the Model Catalog.
featureProperties | List<String> | [] | yes | The names of the node properties that should be used as input features. All property names must exist in the in-memory graph and be of type Float or List<Float>.
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name | Type | Default | Optional | Description
targetProperty | String | n/a | no | The class of the node. Must be of type Integer.
holdoutFraction | Float | n/a | no | Portion of the graph reserved for testing. Must be in the range (0, 1).
validationFolds | Integer | n/a | no | Number of divisions of the train graph used for model selection.
metrics | List<String> | n/a | no | Metrics used to evaluate the models.
params | List<Map> | n/a | no | List of model configurations to be trained. See next table for details.
randomSeed | Integer | n/a | yes | Seed for the random number generator used during training.

Table 4. Model configuration
Name | Type | Default | Optional | Description
penalty | Float | n/a | no | Penalty used for the logistic regression.
batchSize | Integer | 100 | yes | Number of nodes per batch.
minIterations | Integer | 1 | yes | Minimum number of training iterations.
maxIterations | Integer | 100 | yes | Maximum number of training iterations.
maxStreakCount | Integer | 1 | yes | Maximum number of iterations that do not improve the loss before stopping.
windowSize | Integer | 1 | yes | Number of the most recent iterations used for computing the loss.
tolerance | Float | 0.001 | yes | Minimum acceptable loss before stopping.
sharedUpdater | Boolean | false | yes | Whether to use the same instance of weight updater for all batches.
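
As a rough intuition for the penalty parameter, the following sketch computes a penalized loss for multiclass logistic regression. It assumes the penalty is an L2 coefficient that multiplies the sum of squared weights; the description above does not state the exact form, so treat both that assumption and the names used here (softmax, penalized_loss) as illustrative only.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_loss(weights, features, labels, penalty):
    """Mean cross-entropy plus penalty * sum of squared weights (assumed L2 form)."""
    cross_entropy = 0.0
    for x, y in zip(features, labels):
        # One row of weights per class; the score is a dot product with x.
        scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
        cross_entropy -= math.log(softmax(scores)[y])
    l2 = sum(w * w for row in weights for w in row)
    return cross_entropy / len(features) + penalty * l2


# Two feature vectors shaped like sizePerStory, with their classes.
features = [[15.5, 23.6, 33.1], [80.8, 0.0, 0.0]]
labels = [0, 2]
weights = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]  # hypothetical

loss_small_penalty = penalized_loss(weights, features, labels, penalty=0.0625)
loss_large_penalty = penalized_loss(weights, features, labels, penalty=4.0)
```

With the same weights, a larger penalty yields a larger loss, so training with a large penalty favors smaller weights. That is the trade-off explored by listing several penalty values in params.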

Table 5. Results
Name | Type | Description
trainMillis | Integer | Milliseconds used for training.
modelInfo | Map | Information about the training and the winning model.
configuration | Map | Configuration used for the train procedure.

Run Node Classification in mutate mode on a named graph:
CALL gds.alpha.ml.nodeClassification.predict.mutate(
  graphName: String,
  configuration: Map
) YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 6. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 7. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.
mutateProperty | String | n/a | no | The node property in the GDS graph to which the predicted class is written.

Table 8. Algorithm specific configuration
Name | Type | Default | Optional | Description
predictedProbabilityProperty | String | n/a | yes | The node property in the GDS graph in which the class probability list is stored. If omitted, the probability list is discarded.
batchSize | Integer | 100 | yes | Number of nodes per batch.

Table 9. Results
Name | Type | Description
createMillis | Integer | Milliseconds for creating the graph.
computeMillis | Integer | Milliseconds for running the algorithm.
postProcessingMillis | Integer | Milliseconds for computing the global metrics.
mutateMillis | Integer | Milliseconds for adding properties to the in-memory graph.
nodePropertiesWritten | Integer | Number of node properties written.
configuration | Map | Configuration used for running the algorithm.

3. Examples

In this section we will show examples of training a Node Classification Model on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the model in a real setting. We will do this on a small graph of a handful of nodes representing houses. The example graph looks like this:

[Figure: example graph for node classification]
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  (:House {color: 'Gold', sizePerStory: [15.5, 23.6, 33.1], class: 0}),
  (:House {color: 'Red', sizePerStory: [15.5, 23.6, 100.0], class: 0}),
  (:House {color: 'Blue', sizePerStory: [11.3, 35.1, 22.0], class: 0}),
  (:House {color: 'Green', sizePerStory: [23.2, 55.1, 0.0], class: 1}),
  (:House {color: 'Gray', sizePerStory: [34.3, 24.0, 0.0],  class: 1}),
  (:House {color: 'Black', sizePerStory: [71.66, 55.0, 0.0], class: 1}),
  (:House {color: 'White', sizePerStory: [11.1, 111.0, 0.0], class: 1}),
  (:House {color: 'Teal', sizePerStory: [80.8, 0.0, 0.0], class: 2}),
  (:House {color: 'Beige', sizePerStory: [106.2, 0.0, 0.0], class: 2}),
  (:House {color: 'Magenta', sizePerStory: [99.9, 0.0, 0.0], class: 2}),
  (:House {color: 'Purple', sizePerStory: [56.5, 0.0, 0.0], class: 2}),
  (:UnknownHouse {color: 'Pink', sizePerStory: [23.2, 55.1, 56.1]}),
  (:UnknownHouse {color: 'Tan', sizePerStory: [22.32, 102.0, 0.0]}),
  (:UnknownHouse {color: 'Yellow', sizePerStory: [39.0, 0.0, 0.0]});

With the graph in Neo4j we can now project it into the graph catalog to prepare it for algorithm execution. We do this using a native projection targeting the House and UnknownHouse labels. We will also project the sizePerStory property to use as a model feature, and the class property to use as the target property.

In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'.
CALL gds.graph.create('myGraph', {
    House: { properties: ['sizePerStory', 'class'] },
    UnknownHouse: { properties: 'sizePerStory' }
  },
  '*'
)

In the following examples we will demonstrate using the Node Classification model on this graph.

3.1. Train

In this example we will train a model to predict the class to which a house belongs, based on its sizePerStory property.

Train a Node Classification model:
CALL gds.alpha.ml.nodeClassification.train('myGraph', {
  nodeLabels: ['House'],
  modelName: 'nc-model',
  featureProperties: ['sizePerStory'],
  targetProperty: 'class',
  randomSeed: 2,
  holdoutFraction: 0.2,
  validationFolds: 5,
  metrics: [ 'F1_WEIGHTED' ],
  params: [
    {penalty: 0.0625},
    {penalty: 0.5},
    {penalty: 1.0},
    {penalty: 4.0}
  ]
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.F1_WEIGHTED.outerTrain AS trainGraphScore,
  modelInfo.metrics.F1_WEIGHTED.test AS testGraphScore
Table 10. Results
winningModel | trainGraphScore | testGraphScore
{penalty=0.0625} | 0.999999990909091 | 0.6363636286363638

Here we can observe that the model candidate with penalty 0.0625 performed best in the training phase, with a score of almost 100% on the train graph. On the test graph, the model scores lower, at about 64%. This indicates that the model fit the train graph very well, and was able to generalize fairly well to unseen data. To achieve a higher test score, we may need better features, a larger graph, or a different model configuration.

3.2. Mutate

In this example we will show how to use a trained model to predict the class of a node in your in-memory graph. In addition to the predicted class, we will also produce the probability for each class in another node property. To do this, we must first have a trained model registered in the Model Catalog. We will use the model trained in the train example, which we named 'nc-model'.

CALL gds.alpha.ml.nodeClassification.predict.mutate('myGraph', {
  nodeLabels: ['House', 'UnknownHouse'],
  modelName: 'nc-model',
  mutateProperty: 'predicted_class',
  predictedProbabilityProperty: 'predicted_probability'
}) YIELD nodePropertiesWritten
Table 11. Results
nodePropertiesWritten

28

Since we also specified the predictedProbabilityProperty, we are writing two properties for each of the 14 nodes. To analyse our predicted classes, we stream the properties from the in-memory graph:
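
The count of 28 follows directly from the example graph, as this small check shows. The variable names are only for illustration.

```python
# Node counts per label, taken from the CREATE statement of the example graph.
nodes_per_label = {"House": 11, "UnknownHouse": 3}

# One property for the predicted class, one for the probability list.
properties_per_node = ["predicted_class", "predicted_probability"]

node_properties_written = sum(nodes_per_label.values()) * len(properties_per_node)
```

If predictedProbabilityProperty were omitted, only one property per node would be written, which by the same arithmetic gives 14.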

CALL gds.graph.streamNodeProperties(
  'myGraph', ['predicted_probability', 'predicted_class'], ['UnknownHouse']
) YIELD nodeId, nodeProperty, propertyValue
RETURN gds.util.asNode(nodeId).color AS classifiedHouse, nodeProperty, propertyValue
  ORDER BY classifiedHouse, nodeProperty
Table 12. Results
classifiedHouse | nodeProperty | propertyValue
"Pink" | "predicted_class" | 0
"Pink" | "predicted_probability" | [0.9866455686217779, 0.01311656378786989, 2.3786759035214687E-4]
"Tan" | "predicted_class" | 1
"Tan" | "predicted_probability" | [0.01749164563726576, 0.9824922482993587, 1.610606337562594E-5]
"Yellow" | "predicted_class" | 2
"Yellow" | "predicted_probability" | [0.0385634113659007, 0.16350471177895198, 0.7979318768551473]

As we can see, the model predicted class 0 for the pink house, class 1 for the tan house, and class 2 for the yellow house. This makes sense, as all houses in class 0 have three stories, those in class 1 have two stories and those in class 2 have one story, and the same is true of the pink, tan and yellow houses, respectively. Additionally, we see that the model is confident in these predictions, as the highest class probability is above 75% in all cases.
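
The relationship between the probability list and the predicted class can be checked with a short sketch. It assumes the classes are registered in the order [0, 1, 2], so that each position in the list corresponds to the class with the same index; the helper name most_likely_class is made up for the illustration.

```python
classes = [0, 1, 2]  # assumed class order registered in the model

# Probability lists copied from the results table above.
predicted_probability = {
    "Pink": [0.9866455686217779, 0.01311656378786989, 2.3786759035214687E-4],
    "Tan": [0.01749164563726576, 0.9824922482993587, 1.610606337562594E-5],
    "Yellow": [0.0385634113659007, 0.16350471177895198, 0.7979318768551473],
}


def most_likely_class(classes, probabilities):
    """Return (class, probability) for the position with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]


predictions = {
    house: most_likely_class(classes, probabilities)
    for house, probabilities in predicted_probability.items()
}
```

For each house, the class found this way equals the value stored in predicted_class, and the accompanying probability is the confidence referred to in the paragraph above.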