Applying a trained model for prediction

This feature is in the beta tier. For more information on feature tiers, see API Tiers.

In the previous sections we have seen how to build a Node Classification training pipeline and train it to produce a classification model. After training, the runnable model is of type NodeClassification and resides in the model catalog.

The classification model can be executed with a graph in the graph catalog to predict the class of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.
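To make the index alignment concrete, here is a minimal sketch in plain Cypher. The class list and the probability list are hypothetical literal values chosen for illustration, not output from a real model. The query pairs each class with the probability at the same position.

```cypher
// Hypothetical values, for illustration only: the classes as registered in a
// model, and the probability list predicted for one node.
WITH [0, 1, 2] AS classes, [0.02, 0.97, 0.01] AS predictedProbabilities
// Walk both lists by position; index i of one list belongs to index i of the other.
UNWIND range(0, size(classes) - 1) AS i
RETURN classes[i] AS class, predictedProbabilities[i] AS probability
ORDER BY probability DESC
```

Pairing by position rather than by class value matters when the class values are not the consecutive integers 0, 1, 2, and so on, because the class value then cannot be used directly as a list index.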

Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.

The predict graph must contain the properties that the pipeline requires, and any array properties used must have the same dimensions as in the train graph. If the predict and train graphs are distinct, it also helps if they have similar origins and semantics, so that the model is able to generalize well.
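One way to check which node properties a projected graph carries before predicting is to look at its schema in the graph catalog. The sketch below assumes a graph named 'myGraph' (the name used in the examples later on this page) and assumes that the `schema` result of `gds.graph.list` contains a `nodes` entry keyed by node label; verify both against your GDS version.

```cypher
// Assumption: 'myGraph' is the projected graph that will be used for prediction.
CALL gds.graph.list('myGraph')
YIELD graphName, schema
// Assumption: schema.nodes maps each node label to its properties and their types.
RETURN graphName, schema.nodes AS nodePropertiesByLabel
```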

Syntax

Node Classification syntax per mode

Run Node Classification in stream mode on a named graph:

CALL gds.beta.pipeline.nodeClassification.predict.stream(
  graphName: String,
  configuration: Map
)
YIELD
  nodeId: Integer,
  predictedClass: Integer,
  predictedProbabilities: List of Float

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
modelName	String	`n/a`	no	The name of a NodeClassification model in the model catalog.
targetNodeLabels	List of String	`from trainConfig`	yes	Filter the named graph using the given targetNodeLabels.
relationshipTypes	List of String	`from trainConfig`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4 ^[1]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
includePredictedProbabilities	Boolean	`false`	yes	Whether to return the probability for each class. If `false` then `null` is returned in `predictedProbabilities`. The order of the classes can be inspected in the `modelInfo` of the classification model (see listing models).
1. In a GDS Session the default is the number of available processors

Table 3. Results
Name	Type	Description
nodeId	Integer	Node ID.
predictedClass	Integer	Predicted class for this node.
predictedProbabilities	List of Float	Probabilities for all classes, for this node.

Run Node Classification in mutate mode on a named graph:

CALL gds.beta.pipeline.nodeClassification.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map

Table 4. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 5. Configuration
Name	Type	Default	Optional	Description
modelName	String	`n/a`	no	The name of a NodeClassification model in the model catalog.
mutateProperty	String	`n/a`	no	The node property in the GDS graph to which the predicted property is written.
targetNodeLabels	List of String	`from trainConfig`	yes	Filter the named graph using the given targetNodeLabels.
relationshipTypes	List of String	`from trainConfig`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4 ^[2]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
predictedProbabilityProperty	String	`n/a`	yes	The node property in which the class probability list is stored. If omitted, the probability list is discarded. The order of the classes can be inspected in the `modelInfo` of the classification model (see listing models).
2. In a GDS Session the default is the number of available processors

Table 6. Results
Name	Type	Description
preProcessingMillis	Integer	Milliseconds for preprocessing the graph.
computeMillis	Integer	Milliseconds for running the algorithm.
postProcessingMillis	Integer	Milliseconds for computing the global metrics.
mutateMillis	Integer	Milliseconds for adding properties to the in-memory graph.
nodePropertiesWritten	Integer	Number of node properties written.
configuration	Map	Configuration used for running the algorithm.

Run Node Classification in write mode on a named graph:

CALL gds.beta.pipeline.nodeClassification.predict.write(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  writeMillis: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map

Table 7. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 8. Configuration
Name	Type	Default	Optional	Description
modelName	String	`n/a`	no	The name of a NodeClassification model in the model catalog.
targetNodeLabels	List of String	`from trainConfig`	yes	Filter the named graph using the given targetNodeLabels.
relationshipTypes	List of String	`from trainConfig`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4 ^[3]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
writeConcurrency	Integer	`value of 'concurrency'`	yes	The number of concurrent threads used for writing the result to Neo4j.
writeProperty	String	`n/a`	no	The node property in the Neo4j database to which the predicted property is written.
predictedProbabilityProperty	String	`n/a`	yes	The node property in which the class probability list is stored. If omitted, the probability list is discarded. The order of the classes can be inspected in the `modelInfo` of the classification model (see listing models).
3. In a GDS Session the default is the number of available processors

Table 9. Results
Name	Type	Description
preProcessingMillis	Integer	Milliseconds for preprocessing the graph.
computeMillis	Integer	Milliseconds for running the algorithm.
postProcessingMillis	Integer	Milliseconds for computing the global metrics.
writeMillis	Integer	Milliseconds for writing result back to Neo4j.
nodePropertiesWritten	Integer	Number of node properties written.
configuration	Map	Configuration used for running the algorithm.

Example

In the following examples we will show how to use a classification model to predict the class of a node in your in-memory graph. In addition to the predicted class, we will also produce the probability for each class in another node property. To do this, we must first have a trained model registered in the Model Catalog. We will use the model trained in the train example, which we named 'nc-pipeline-model'.

Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the stream mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm in stream mode:

CALL gds.beta.pipeline.nodeClassification.predict.stream.estimate('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  targetNodeLabels: ['UnknownHouse']
})
YIELD requiredMemory

Table 10. Results
requiredMemory
"792 Bytes"

If a node property step does not have an estimation implemented, the step will be ignored in the estimation.
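The estimate call can also report more than the formatted `requiredMemory` string. The following sketch repeats the estimation above and yields the additional columns that GDS estimate procedures generally expose (`nodeCount`, `relationshipCount`, `bytesMin`, `bytesMax`). Treat the exact column set as an assumption to check against your GDS version.

```cypher
CALL gds.beta.pipeline.nodeClassification.predict.stream.estimate('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  targetNodeLabels: ['UnknownHouse']
})
// Assumption: these are the standard result columns of GDS estimate procedures.
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
RETURN nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
```

The numeric `bytesMin` and `bytesMax` values are easier to compare against available heap than the formatted string.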

Stream

CALL gds.beta.pipeline.nodeClassification.predict.stream('myGraph', {
  modelName: 'nc-pipeline-model',
  includePredictedProbabilities: true,
  targetNodeLabels: ['UnknownHouse']
})
YIELD nodeId, predictedClass, predictedProbabilities
WITH gds.util.asNode(nodeId) AS houseNode, predictedClass, predictedProbabilities
RETURN
  houseNode.color AS classifiedHouse,
  predictedClass,
  floor(predictedProbabilities[predictedClass] * 100) AS confidence
ORDER BY classifiedHouse

Table 11. Results
classifiedHouse	predictedClass	confidence
`"Pink"`	`0`	`96.0`
`"Tan"`	`1`	`97.0`
`"Yellow"`	`2`	`75.0`

As we can see, the model assigned the pink house to class 0, the tan house to class 1, and the yellow house to class 2. This makes sense, as all houses in class 0 had three stories, class 1 two stories and class 2 one story, and the same is true of the pink, tan and yellow houses, respectively. Additionally, we see that the model is confident in these predictions, as the confidence is at least 75% in all cases.

The indices in the predictedProbabilities correspond to the order of the classes in the classification model. To inspect the order of the classes, we can look at its modelInfo (see listing models).
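As a sketch of that inspection, the query below lists the model and reads the class order from its `modelInfo`. It assumes the model catalog procedure is available as `gds.model.list` (older GDS releases expose it as `gds.beta.model.list`) and that `modelInfo` carries a `classes` entry for NodeClassification models.

```cypher
// Assumption: procedure name and the `classes` key may differ between GDS versions.
CALL gds.model.list('nc-pipeline-model')
YIELD modelName, modelType, modelInfo
RETURN modelName, modelType, modelInfo.classes AS classes
```

The position of a class in the returned list is the index of its probability in predictedProbabilities.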

Mutate

The mutate execution mode updates the named graph with a new node property containing the predicted class for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row including information about timings and how many properties were written. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

CALL gds.beta.pipeline.nodeClassification.predict.mutate('myGraph', {
  targetNodeLabels: ['UnknownHouse'],
  modelName: 'nc-pipeline-model',
  mutateProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten

Table 12. Results
nodePropertiesWritten
6

Since we also specified predictedProbabilityProperty, two properties are written for each of the 3 UnknownHouse nodes.
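The mutated properties live only in the in-memory graph, so they are not visible to a regular MATCH query. A possible way to inspect them is to stream them from the graph catalog, sketched below. It assumes the `gds.graph.nodeProperties.stream` procedure is available under that name in your GDS version.

```cypher
// Stream both mutated properties for the UnknownHouse nodes of 'myGraph'.
CALL gds.graph.nodeProperties.stream(
  'myGraph',
  ['predictedClass', 'predictedProbabilities'],
  ['UnknownHouse']
)
YIELD nodeId, nodeProperty, propertyValue
// One row is returned per node and property.
RETURN gds.util.asNode(nodeId).color AS house, nodeProperty, propertyValue
ORDER BY house, nodeProperty
```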

Write

The write execution mode writes the predicted property for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row including information about timings and how many properties were written. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Write.

CALL gds.beta.pipeline.nodeClassification.predict.write('myGraph', {
  targetNodeLabels: ['UnknownHouse'],
  modelName: 'nc-pipeline-model',
  writeProperty: 'predictedClass',
  predictedProbabilityProperty: 'predictedProbabilities'
}) YIELD nodePropertiesWritten

Table 13. Results
nodePropertiesWritten
6

Since we also specified predictedProbabilityProperty, two properties are written for each of the 3 UnknownHouse nodes.
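After the write, the predictions can be read back with an ordinary Cypher query. The sketch below assumes that the UnknownHouse label of the in-memory graph corresponds to a node label of the same name in the database, and that the nodes carry the `color` property used in the stream example.

```cypher
// Assumption: UnknownHouse is also the node label in the Neo4j database.
MATCH (house:UnknownHouse)
RETURN
  house.color AS color,
  house.predictedClass AS predictedClass,
  house.predictedProbabilities AS predictedProbabilities
ORDER BY color
```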