Applying a trained model for prediction

The following content is a preview of an upcoming release. For documentation covering current releases, see https://neo4j.com/docs.

In the previous sections we have seen how to build up a Link Prediction training pipeline and train it to produce a predictive model. After training, the runnable model is of type LinkPrediction and resides in the model catalog.

The trained model can then be applied to a graph in the graph catalog to create a new relationship type containing the predicted links. The relationships also have a property which stores the predicted probability of the link, which can be seen as a relative measure of the model’s prediction confidence.

Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.

When using the model for prediction, the relationships of the input graph are used in two ways. First, the input graph is fed into the feature pipeline and therefore influences predictions if there is at least one step in the pipeline which uses the input relationships (typically any node property step does). Second, predictions are carried out on each node pair that is not connected in the input graph.

The predicted links are sorted by score before the ones having score below the configured threshold are discarded. Finally, the configured topN predictions are stored back to the projected graph.

It is necessary that the predict graph contains the properties that the pipeline requires and that the used array properties have the same dimensions as in the train graph. If the predict and train graphs are distinct, it is also beneficial that they have similar origins and semantics, so that the model is able to generalize well.

1. Syntax

Link Prediction syntax per mode

Mutate mode
Stream mode

Run Link Prediction in mutate mode on a named graph:

CALL gds.beta.pipeline.linkPrediction.predict.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  probabilityDistribution: Integer,
  samplingStats: Map,
  configuration: Map

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
modelName	String	`n/a`	no	The name of a Link Prediction model in the model catalog.
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
mutateRelationshipType	String	`n/a`	no	The relationship type used for the new relationships written to the projected graph.
mutateProperty	String	`'probability'`	yes	The relationship property in the GDS graph to which the result is written.
sampleRate	Float	`n/a`	no	Sample rate to determine how many links are considered for each node. If set to 1, all possible links are considered, i.e., exhaustive search. Otherwise, a kNN-based approximate search will be used. Value must be between 0 (exclusive) and 1 (inclusive).
topN ^[1]	Integer	`n/a`	no	Limit on predicted relationships to output.
threshold ^[1]	Float	`0.0`	yes	Minimum predicted probability on relationships to output.
topK ^[2]	Integer	`10`	yes	Limit on number of predicted relationships to output for each node. This value cannot be lower than 1.
deltaThreshold ^[2]	Float	`0.001`	yes	Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).
maxIterations ^[2]	Integer	`100`	yes	Hard limit to stop the algorithm after that many iterations.
randomJoins ^[2]	Integer	`10`	yes	Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.
initialSampler ^[2]	String	`"uniform"`	yes	The method used to sample the first `k` random neighbors for each node. "uniform" and "randomWalk", both case-insensitive, are valid inputs.
randomSeed ^[2]	Integer	`n/a`	yes	The seed value to control the randomness of the algorithm. Note that `concurrency` must be set to 1 when setting this parameter.
1. Only applicable in the exhaustive search. 2. Only applicable in the approximate strategy. For more details look at the syntax section of kNN

Table 3. Results
Name	Type	Description
preProcessingMillis	Integer	Milliseconds for preprocessing the graph.
computeMillis	Integer	Milliseconds for running the algorithm.
postProcessingMillis	Integer	Milliseconds for computing the global metrics.
mutateMillis	Integer	Milliseconds for adding properties to the projected graph.
relationshipsWritten	Integer	Number of relationships created.
probabilityDistribution	Map	Description of distribution of predicted probabilities.
samplingStats	Map	Description of how predictions were sampled.
configuration	Map	Configuration used for running the algorithm.

Run Link Prediction in stream mode on a named graph:

CALL gds.beta.pipeline.linkPrediction.predict.stream(
  graphName: String,
  configuration: Map
)
YIELD
  node1: Integer,
  node2: Integer,
  probability: Float

Table 4. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 5. Configuration
Name	Type	Default	Optional	Description
modelName	String	`n/a`	no	The name of a LinkPrediction model in the model catalog.
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
sampleRate	Float	`n/a`	no	Sample rate to determine how many links are considered for each node. If set to 1, all possible links are considered, i.e., exhaustive search. Otherwise, a kNN-based approximate search will be used. Value must be between 0 (exclusive) and 1 (inclusive).
topN ^[3]	Integer	`n/a`	no	Limit on predicted relationships to output.
threshold ^[3]	Float	`0.0`	yes	Minimum predicted probability on relationships to output.
topK ^[4]	Integer	`10`	yes	Limit on number of predicted relationships to output for each node. This value cannot be lower than 1.
deltaThreshold ^[4]	Float	`0.001`	yes	Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).
maxIterations ^[4]	Integer	`100`	yes	Hard limit to stop the algorithm after that many iterations.
randomJoins ^[4]	Integer	`10`	yes	Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.
initialSampler ^[4]	String	`"uniform"`	yes	The method used to sample the first `k` random neighbors for each node. "uniform" and "randomWalk", both case-insensitive, are valid inputs.
randomSeed ^[4]	Integer	`n/a`	yes	The seed value to control the randomness of the algorithm. Note that `concurrency` must be set to 1 when setting this parameter.
3. Only applicable in the exhaustive search. 4. Only applicable in the approximate strategy. For more details look at the syntax section of kNN

Table 6. Results
Name	Type	Description
node1	Integer	Node ID of the first node.
node2	Integer	Node ID of the second node.
probability	Float	Predicted probability of a link between the nodes.

2. Example

In this example we will show how to use a trained model to predict new relationships in your projected graph. In order to do this, we must first have an already trained model registered in the Model Catalog. We will use the model which we trained in the train example which we gave the name lp-pipeline-model. The algorithm excludes predictions for existing relationships in the graph as well as self-loops.

There are two different strategies for choosing which node pairs to consider when predicting new links, exhaustive search and approximate search. Whereas the former considers all possible new links, the latter will use a randomized strategy that only considers a subset of them in order to run faster. We will explain each individually with examples in the mutate examples below.

The relationships that are produced by the write and mutate procedures are undirected, just like the input. However, no parallel relationships are produced. So for example if when doing approximate search, a — b are among the top predictions for a, and b — a are among the top predictions for b, then there will still only be one undirected relationship a — b produced. The stream procedure will yield a node pair only once.

2.1. Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the stream mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for applying the model:

CALL gds.beta.pipeline.linkPrediction.predict.stream.estimate('myGraph', {
  modelName: 'lp-pipeline-model',
  topN: 5,
  threshold: 0.45
})
YIELD requiredMemory

Table 7. Results
requiredMemory
"24 KiB"

2.2. Stream

CALL gds.beta.pipeline.linkPrediction.predict.stream('myGraph', {
  modelName: 'lp-pipeline-model',
  topN: 5,
  threshold: 0.45
})
 YIELD node1, node2, probability
 RETURN gds.util.asNode(node1).name AS person1, gds.util.asNode(node2).name AS person2, probability
 ORDER BY probability DESC, person1

We specified threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships.

Table 8. Results
person1	person2	probability
"Alice"	"Mark"	0.6
"Karin"	"Will"	0.6
"Karin"	"Greg"	0.6
"Michael"	"Veselin"	0.6
"Will"	"Greg"	0.6

We can see, that our model predicts the most likely link is between Michael and Veselin.

2.3. Mutate

In these examples we will show how to write the predictions to your projected graph. We will use the model lp-pipeline-model, that we trained in the train example.

2.3.1. Exhaustive search

The exhaustive search will simply run through all possible new links, that is, check all node pairs that are not already connected by a relationship. For each such node pair the model we trained will be used to predict whether there they should be connected be a link or not.

CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
  modelName: 'lp-pipeline-model',
  relationshipTypes: ['KNOWS'],
  mutateRelationshipType: 'KNOWS_EXHAUSTIVE_PREDICTED',
  topN: 5,
  threshold: 0.45
}) YIELD relationshipsWritten, samplingStats

We specify threshold to filter out predictions with probability less than 45%, and topN to further limit output to the top 5 relationships. Note that we omit setting the sampleRate in our configuration as it defaults to 1 implying that the exhaustive search strategy is used. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.

Table 9. Results
relationshipsWritten	samplingStats
10	{linksConsidered=16, strategy=exhaustive}

As we can see in the samplingStats, we use the exhaustive search strategy and check 16 possible links during the prediction. Indeed, since there are a total of 8 * (8 - 1) / 2 = 28 possible links in the graph and we already have 12, that means we check all possible new links. Although 16 links were considered, we only mutate the best five (since topN = 5) that are above our threshold.

If our graph is very large there may be a lot of possible new links. As such it may take a very long time to run the predictions. It may therefore be a more viable option to use a search strategy that only looks at a subset of all possible new links.

2.3.2. Approximate search

To avoid possibly having to run for a very long time considering all possible new links (due to the inherent quadratic complexity over node count) we can use an approximate search strategy.

The approximate search strategy lets us leverage the K-Nearest Neighbors algorithm with our model’s prediction function as its similarity measure to trade off lower runtime for accuracy. Accuracy in this context refers to how close we are in finding the very best actual new possible links according to our models predictions, i.e. the best predictions that would be made by exhaustive search.

The initial set of considered links for each node is picked at random and then refined in multiple iterations based of previously predicted links. The number of iterations is limited by the configuration parameter maxIterations, and we also limit the number of random links considered between kNN iterations using randomJoins. The algorithm may stop earlier if the link predictions per node only change by a small amount, which can be controlled by the configuration parameter deltaThreshold. See the K-Nearest Neighbors documentation for more details on how the search works.

CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
  modelName: 'lp-pipeline-model',
  relationshipTypes: ['KNOWS'],
  mutateRelationshipType: 'KNOWS_APPROX_PREDICTED',
  sampleRate: 0.5,
  topK: 1,
  randomJoins: 2,
  maxIterations: 3,
  // necessary for deterministic results
  concurrency: 1,
  randomSeed: 42
})
 YIELD relationshipsWritten, samplingStats

In order to use the approximate strategy we make sure to set the sampleRate explicitly to a value < 1.0. In this small example we set topK = 1 to only get one link predicted for each node. Because we are using the UNDIRECTED orientation, we will write twice as many relationships to the in-memory graph.

Table 10. Results
relationshipsWritten	samplingStats
16	{linksConsidered=44, didConverge=true, strategy=approximate, ranIterations=2}

As we can see in the samplingStats, we use the approximate search strategy and check 44 possible links during the prediction. Though in this small example we actually consider more links that in the exhaustive case, this will typically not be the case for larger graphs. Since the relationships we write are undirected, reported relationshipsWritten is 16 when we search for the best (topK = 1) prediction for each node.

Was this page helpful?