KNearest Neighbors
This section describes the KNearest Neighbors (KNN) algorithm in the Neo4j Graph Data Science library.
1. Introduction
The KNearest Neighbors algorithm computes a distance value for all node pairs in the graph and creates new relationships between each node and its k nearest neighbors. The distance is calculated based on node properties.
The input of this algorithm is a monopartite graph. The graph does not need to be connected, in fact, existing relationships between nodes will be ignored. New relationships are created between each node and its k nearest neighbors.
The KNearest Neighbors algorithm compares a given property of each node.
The k
nodes where this property is most similar are the knearest neighbors.
The initial set of neighbors is picked at random and verified and refined in multiple iterations.
The number of iterations is limited by the configuration parameter maxIterations
.
The algorithm may stop earlier if the neighbor lists only change by a small amount, which can be controlled by the configuration parameter deltaThreshold
.
The particular implementation is based on Efficient knearest neighbor graph construction for generic similarity measures by Wei Dong et al. Instead of comparing every node with every other node, the algorithm selects possible neighbors based on the assumption, that the neighborsofneighbors of a node are most likely already the nearest one. The algorithm scales quasilinear with respect to the node count, instead of being quadratic.
Furthermore, the algorithm only compares a sample of all possible neighbors on each iteration, assuming that eventually all possible neighbors will be seen.
This can be controlled with the configuration parameter sampleRate
:

A valid sample rate must be in between 0 (exclusive) and 1 (inclusive).

The default value is
0.5
. 
The parameter is used to control the tradeoff between accuracy and runtimeperformance.

A higher sample rate will increase the accuracy of the result.

The algorithm will also require more memory and will take longer to compute.


A lower sample rate will increase the runtimeperformance.

Some potential nodes may be missed in the comparison and may not be included in the result.

The output of the algorithm are new relationships between nodes and their knearest neighbors. Similarity scores are expressed via relationship properties.
For more information on this algorithm, see:
Running this algorithm requires sufficient available memory. Before running this algorithm, we recommend that you read Memory Estimation. 
1.1. Similarity measures
The similarity measure used in the KNN algorithm depends on the type of the configured node property. KNN supports both scalar numeric values as well as lists of numbers.
 Scalar numeric property

When the property is a scalar number, the similarity is computed as one divided by one plus the absolute difference between the values:
 List of integers

When the property is a list of integers, the similarity is computed as one divided by one plus the number of unequal numbers in the list:
 List of floatingpoint numbers

When the property is a list of floatingpoint numbers, the similarity is computed using the cosine similarity metric. See the Cosine Similarity algorithm for more details. If the cosine similarity is negative, we clip the value to
0
, i.e.,max(cosine(a, b), 0))
.
2. Syntax
This section covers the syntax used to execute the KNearest Neighbors algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.
CALL gds.beta.knn.stream(
graphName: String,
configuration: Map
) YIELD
node1: Integer,
node2: Integer,
similarity: Float
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

List of String 

yes 
Filter the named graph using the given node labels. 

List of String 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 
Name  Type  Default  Optional  Description 

String 

no 
The name of a node property that contains node weights which will be used for similarity computation. 

topK 
Integer 

yes 
The number of neighbors to find for each node. The Knearest neighbors are returned. This value cannot be lower than 1. 
sampleRate 
Float 

yes 
Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive). 
deltaThreshold 
Float 

yes 
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). 
Integer 

yes 
Hard limit to stop the algorithm after that many iterations. 

randomJoins 
Integer 

yes 
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. 
randomSeed 
Integer 

yes 
The seed value to control the randomness of the algorithm. The value 1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value. 
Name  Type  Description 


Integer 
Node ID of the first node. 

Integer 
Node ID of the second node. 

Float 
Similarity score for the two nodes. 
CALL gds.beta.knn.stats(
graphName: String,
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
postProcessingMillis: Integer,
nodesCompared: Integer,
similarityPairs: Integer,
similarityDistribution: Map,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

List of String 

yes 
Filter the named graph using the given node labels. 

List of String 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 
Name  Type  Default  Optional  Description 

String 

no 
The name of a node property that contains node weights which will be used for similarity computation. 

topK 
Integer 

yes 
The number of neighbors to find for each node. The Knearest neighbors are returned. This value cannot be lower than 1. 
sampleRate 
Float 

yes 
Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive). 
deltaThreshold 
Float 

yes 
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). 
Integer 

yes 
Hard limit to stop the algorithm after that many iterations. 

randomJoins 
Integer 

yes 
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. 
randomSeed 
Integer 

yes 
The seed value to control the randomness of the algorithm. The value 1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value. 
Name  Type  Description 

createMillis 
Integer 
Milliseconds for loading data. 
computeMillis 
Integer 
Milliseconds for running the algorithm. 
postProcessingMillis 
Integer 
Milliseconds for computing similarity value distribution statistics. 
nodesCompared 
Integer 
The number of nodes compared. 
similarityPairs 
Integer 
The number of pairs of similar nodes computed. 
similarityDistribution 
Map 
Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of the computed similarity results. 
configuration 
Map 
The configuration used for running the algorithm. 
CALL gds.beta.knn.mutate(
graphName: String,
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
mutateMillis: Integer,
postProcessingMillis: Integer,
relationshipsWritten: Integer,
nodesCompared: Integer,
similarityDistribution: Map,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

List of String 

yes 
Filter the named graph using the given node labels. 

List of String 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. 

mutateRelationshipType 
String 

no 
The relationship type used for the new relationships written to the inmemory graph. 
mutateProperty 
String 

no 
The relationship property in the GDS graph to which the similarity score is written. 
Name  Type  Default  Optional  Description 

String 

no 
The name of a node property that contains node weights which will be used for similarity computation. 

topK 
Integer 

yes 
The number of neighbors to find for each node. The Knearest neighbors are returned. This value cannot be lower than 1. 
sampleRate 
Float 

yes 
Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive). 
deltaThreshold 
Float 

yes 
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). 
Integer 

yes 
Hard limit to stop the algorithm after that many iterations. 

randomJoins 
Integer 

yes 
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. 
randomSeed 
Integer 

yes 
The seed value to control the randomness of the algorithm. The value 1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value. 
Name  Type  Description 

createMillis 
Integer 
Milliseconds for loading data. 
computeMillis 
Integer 
Milliseconds for running the algorithm. 
mutateMillis 
Integer 
Milliseconds for adding properties to the inmemory graph. 
postProcessingMillis 
Integer 
Milliseconds for computing similarity value distribution statistics. 
nodesCompared 
Integer 
The number of nodes compared. 
relationshipsWritten 
Integer 
The number of relationships created. 
similarityDistribution 
Map 
Map containing min, max, mean, stdDev and p1, p5, p10, p25, p75, p90, p95, p99, p100 percentile values of the computed similarity results. 
configuration 
Map 
The configuration used for running the algorithm. 
CALL gds.beta.knn.write(
graphName: String,
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
writeMillis: Integer,
postProcessingMillis: Integer,
nodesCompared: Integer,
relationshipsWritten: Integer,
similarityDistribution: Map,
configuration: Map
Name  Type  Default  Optional  Description 

graphName 
String 

no 
The name of a graph stored in the catalog. 
configuration 
Map 

yes 
Configuration for algorithmspecifics and/or graph filtering. 
Name  Type  Default  Optional  Description 

List of String 

yes 
Filter the named graph using the given node labels. 

List of String 

yes 
Filter the named graph using the given relationship types. 

Integer 

yes 
The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'. 

Integer 

yes 
The number of concurrent threads used for writing the result to Neo4j. 

writeRelationshipType 
String 

no 
The relationship type used to persist the computed relationships in the Neo4j database. 
String 

no 
The relationship property in the Neo4j database to which the similarity score is written. 
Name  Type  Default  Optional  Description 

String 

no 
The name of a node property that contains node weights which will be used for similarity computation. 

topK 
Integer 

yes 
The number of neighbors to find for each node. The Knearest neighbors are returned. This value cannot be lower than 1. 
sampleRate 
Float 

yes 
Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive). 
deltaThreshold 
Float 

yes 
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). 
Integer 

yes 
Hard limit to stop the algorithm after that many iterations. 

randomJoins 
Integer 

yes 
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. 
randomSeed 
Integer 

yes 
The seed value to control the randomness of the algorithm. The value 1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value. 
Name  Type  Description 

createMillis 
Integer 
Milliseconds for loading data. 
computeMillis 
Integer 
Milliseconds for running the algorithm. 
writeMillis 
Integer 
Milliseconds for writing result data back to Neo4j. 
postProcessingMillis 
Integer 
Milliseconds for computing similarity value distribution statistics. 
nodesCompared 
Integer 
The number of nodes compared. 
relationshipsWritten 
Integer 
The number of relationships created. 
similarityDistribution 
Map 
Map containing min, max, mean, stdDev and p1, p5, p10, p25, p75, p90, p95, p99, p100 percentile values of the computed similarity results. 
configuration 
Map 
The configuration used for running the algorithm. 
2.1. Anonymous graphs
It is also possible to execute the algorithm on a graph that is projected in conjunction with the algorithm execution.
In this case, the graph does not have a name, and we call it anonymous.
When executing over an anonymous graph the configuration map contains a graph projection configuration as well as an algorithm configuration.
All execution modes support execution on anonymous graphs, although we only show syntax and modespecific configuration for the write
mode for brevity.
For more information on syntax variants, see Syntax overview.
CALL gds.beta.knn.write(
configuration: Map
)
YIELD
createMillis: Integer,
computeMillis: Integer,
writeMillis: Integer,
postProcessingMillis: Integer,
nodesCompared: Integer,
relationshipsWritten: Integer,
similarityDistribution: Map,
configuration: Map
Name  Type  Default  Optional  Description 

nodeProjection 
String, List of String or Map 

yes 
The node projection used for anonymous graph creation via a Native projection. 
relationshipProjection 
String, List of String or Map 

yes 
The relationship projection used for anonymous graph creation a Native projection. 
nodeQuery 
String 

yes 
The Cypher query used to select the nodes for anonymous graph creation via a Cypher projection. 
relationshipQuery 
String 

yes 
The Cypher query used to select the relationships for anonymous graph creation via a Cypher projection. 
nodeProperties 
String, List of String or Map 

yes 
The node properties to project during anonymous graph creation. 
relationshipProperties 
String, List of String or Map 

yes 
The relationship properties to project during anonymous graph creation. 
Integer 

yes 
The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. 

readConcurrency 
Integer 

yes 
The number of concurrent threads used for creating the graph. 
Integer 

yes 
The number of concurrent threads used for writing the result to Neo4j. 

writeRelationshipType 
String 

no 
The relationship type used to persist the computed relationships in the Neo4j database. 
String 

no 
The relationship property in the Neo4j database to which the similarity score is written. 
The KNN algorithm does not read any relationships, but the values for 
Name  Type  Default  Optional  Description 

String 

no 
The name of a node property that contains node weights which will be used for similarity computation. 

topK 
Integer 

yes 
The number of neighbors to find for each node. The Knearest neighbors are returned. This value cannot be lower than 1. 
sampleRate 
Float 

yes 
Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive). 
deltaThreshold 
Float 

yes 
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). 
Integer 

yes 
Hard limit to stop the algorithm after that many iterations. 

randomJoins 
Integer 

yes 
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. 
randomSeed 
Integer 

yes 
The seed value to control the randomness of the algorithm. The value 1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value. 
The results are the same as running write mode on a named graph, see write mode syntax above.
3. Examples
Consider the graph created by the following Cypher statement:
CREATE (alice:Person {name: 'Alice', age: 24})
CREATE (bob:Person {name: 'Bob', age: 73})
CREATE (carol:Person {name: 'Carol', age: 24})
CREATE (dave:Person {name: 'Dave', age: 48})
CREATE (eve:Person {name: 'Eve', age: 67});
In the example, we want to use the KNearest Neighbors algorithm to compare people based on their age.
In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used. 
CALL gds.graph.create(
'myGraph',
{
Person: {
label: 'Person',
properties: 'age'
}
},
'*'
);
3.1. Memory Estimation
First off, we will estimate the cost of running the algorithm using the estimate
procedure.
This can be done with any execution mode.
We will use the write
mode in this example.
Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have.
When you later actually run the algorithm in one of the execution modes the system will perform an estimation.
If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited.
To read more about this, see Automatic estimation and execution blocking.
For more details on estimate
in general, see Memory Estimation.
CALL gds.beta.knn.write.estimate('myGraph', {
nodeWeightProperty: 'age',
writeRelationshipType: 'SIMILAR',
writeProperty: 'score',
topK: 1
})
YIELD nodeCount, bytesMin, bytesMax, requiredMemory
nodeCount  bytesMin  bytesMax  requiredMemory 

5 
1224 
2184 
"[1224 Bytes ... 2184 Bytes]" 
3.2. Stream
In the stream
execution mode, the algorithm returns the similarity score for each relationship.
This allows us to inspect the results directly or postprocess them in Cypher without any side effects.
For more details on the stream
mode in general, see Stream.
CALL gds.beta.knn.stream('myGraph', {
topK: 1,
nodeWeightProperty: 'age',
// The following parameters are set to produce a deterministic result
randomSeed: 42,
concurrency: 1,
sampleRate: 1.0,
deltaThreshold: 0.0
})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS Person1, gds.util.asNode(node2).name AS Person2, similarity
ORDER BY similarity DESCENDING, Person1, Person2
Person1  Person2  similarity 

"Alice" 
"Carol" 
1.0 
"Carol" 
"Alice" 
1.0 
"Bob" 
"Eve" 
0.14285714285714285 
"Eve" 
"Bob" 
0.14285714285714285 
"Dave" 
"Eve" 
0.05 
We use default values for the procedure configuration parameter for most parameters.
The randomSeed
is set to produce the same result on every invocation.
The topK
parameter is set to 1 to only return the single nearest neighbor for every node.
3.3. Stats
In the stats
execution mode, the algorithm returns a single row containing a summary of the algorithm result.
This execution mode does not have any side effects.
It can be useful for evaluating algorithm performance by inspecting the computeMillis
return item.
In the examples below we will omit returning the timings.
The full signature of the procedure can be found in the syntax section.
For more details on the stats
mode in general, see Stats.
CALL gds.beta.knn.stats('myGraph', {topK: 1, randomSeed: 42, nodeWeightProperty: 'age'})
YIELD nodesCompared, similarityPairs
nodesCompared  similarityPairs 

5 
5 
3.4. Mutate
The mutate
execution mode extends the stats
mode with an important side effect: updating the named graph with a new relationship property containing the similarity score for that relationship.
The name of the new property is specified using the mandatory configuration parameter mutateProperty
.
The result is a single summary row, similar to stats
, but with some additional metrics.
The mutate
mode is especially useful when multiple algorithms are used in conjunction.
For more details on the mutate
mode in general, see Mutate.
CALL gds.beta.knn.mutate('myGraph', {
mutateRelationshipType: 'SIMILAR',
mutateProperty: 'score',
topK: 1,
randomSeed: 42,
nodeWeightProperty: 'age'
})
YIELD nodesCompared, relationshipsWritten
nodesCompared  relationshipsWritten 

5 
5 
As we can see from the results, the number of created relationships is equal to the number of rows in the streaming example.
3.5. Write
The write
execution mode extends the stats
mode with an important side effect: for each pair of nodes we create a relationship with the similarity score as a property to the Neo4j database.
The type of the new relationship is specified using the mandatory configuration parameter writeRelationshipType
.
Each new relationship stores the similarity score between the two nodes it represents.
The relationship property key is set using the mandatory configuration parameter writeProperty
.
The result is a single summary row, similar to stats
, but with some additional metrics.
For more details on the write
mode in general, see Write.
CALL gds.beta.knn.write('myGraph', {
writeRelationshipType: 'SIMILAR',
writeProperty: 'score',
topK: 1,
randomSeed: 42,
nodeWeightProperty: 'age'
})
YIELD nodesCompared, relationshipsWritten
nodesCompared  relationshipsWritten 

5 
5 
As we can see from the results, the number of created relationships is equal to the number of rows in the streaming example.
Was this page helpful?