K-Nearest Neighbors

This section describes the K-Nearest Neighbors (KNN) algorithm in the Neo4j Graph Data Science library.

1. Introduction

The K-Nearest Neighbors algorithm computes a distance value for all node pairs in the graph and creates new relationships between each node and its k nearest neighbors. The distance is calculated based on node properties.

The input of this algorithm is a monopartite graph. The graph does not need to be connected, in fact, existing relationships between nodes will be ignored. New relationships are created between each node and its k nearest neighbors.

The K-Nearest Neighbors algorithm compares a given property of each node. The k nodes where this property is most similar are the k-nearest neighbors.

The initial set of neighbors is picked at random and verified and refined in multiple iterations. The number of iterations is limited by the configuration parameter maxIterations. The algorithm may stop earlier if the neighbor lists only change by a small amount, which can be controlled by the configuration parameter deltaThreshold.

The particular implementation is based on Efficient k-nearest neighbor graph construction for generic similarity measures by Wei Dong et al. Instead of comparing every node with every other node, the algorithm selects possible neighbors based on the assumption, that the neighbors-of-neighbors of a node are most likely already the nearest one. The algorithm scales quasi-linear with respect to the node count, instead of being quadratic.

Furthermore, the algorithm only compares a sample of all possible neighbors on each iteration, assuming that eventually all possible neighbors will be seen. This can be controlled with the configuration parameter sampleRate:

  • A valid sample rate must be in between 0 (exclusive) and 1 (inclusive).

  • The default value is 0.5.

  • The parameter is used to control the trade-off between accuracy and runtime-performance.

  • A higher sample rate will increase the accuracy of the result.

    • The algorithm will also require more memory and will take longer to compute.

  • A lower sample rate will increase the runtime-performance.

    • Some potential nodes may be missed in the comparison and may not be included in the result.

The output of the algorithm are new relationships between nodes and their k-nearest neighbors. Similarity scores are expressed via relationship properties.

For more information on this algorithm, see:

Running this algorithm requires sufficient available memory. Before running this algorithm, we recommend that you read Memory Estimation.

1.1. Similarity measures

The similarity measure used in the KNN algorithm depends on the type of the relationship properties. KNN supports both scalar numeric values as well as lists of numbers.

Scalar numeric property

When the property is a scalar number, the similarity is computed as one divided by one plus the absolute difference between the values:

knn scalar similarity
List of integers

When the property is a list of integers, the similarity is computed as one divided by one plus the number of unequal numbers in the list:

knn integer list similarity
List of floating-point numbers

When the property is a list of floating-point numbers, the similarity is computed using the cosine similarity metric. See the Cosine Similarity algorithm for more details.

2. Syntax

This section covers the syntax used to execute the K-Nearest Neighbors algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Example 1. K-Nearest Neighbors syntax per mode
Run K-Nearest Neighbors in stream mode on a named graph.
CALL gds.beta.knn.stream(
  graphName: String,
  configuration: Map
) YIELD
  node1: Integer,
  node2: Integer,
  similarity: Float
Table 1. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name Type Default Optional Description

nodeWeightProperty

String

n/a

no

The name of a node property that contains node weights which will be used for similarity computation.

topK

Integer

10

yes

The number of neighbors to find for each node. The K-nearest neighbors are returned. This value cannot be lower than 1.

sampleRate

Float

0.5

yes

Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive).

deltaThreshold

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

randomSeed

Integer

-1

yes

The seed value to control the randomness of the algorithm. The value -1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value.

Table 4. Results
Name Type Description

node1

Integer

Node ID of the first node.

node2

Integer

Node ID of the second node.

similarity

Float

Similarity score for the two nodes.

Run K-Nearest Neighbors in stats mode on a named graph.
CALL gds.beta.knn.stats(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  nodesCompared: Integer,
  similarityPairs: Integer,
  similarityDistribution: Map,
  configuration: Map
Table 5. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 6. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 7. Algorithm specific configuration
Name Type Default Optional Description

nodeWeightProperty

String

n/a

no

The name of a node property that contains node weights which will be used for similarity computation.

topK

Integer

10

yes

The number of neighbors to find for each node. The K-nearest neighbors are returned. This value cannot be lower than 1.

sampleRate

Float

0.5

yes

Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive).

deltaThreshold

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

randomSeed

Integer

-1

yes

The seed value to control the randomness of the algorithm. The value -1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value.

Table 8. Results
Name Type Description

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing similarity value distribution statistics.

similarityPairs

Integer

The number of pairs of similar nodes computed.

similarityDistribution

Map

Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of the computed similarity results.

configuration

Map

The configuration used for running the algorithm.

Run K-Nearest Neighbors in mutate mode on a graph stored in the catalog.
CALL gds.beta.knn.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  postProcessingMillis: Integer,
  relationshipsWritten: Integer,
  nodesCompared: Integer,
  similarityDistribution: Map,
  configuration: Map
Table 9. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 10. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateRelationshipType

String

n/a

no

The relationship type used for the new relationships written to the in-memory graph.

mutateProperty

String

n/a

no

The relationship property in the GDS graph to which the similarity score is written.

Table 11. Algorithm specific configuration
Name Type Default Optional Description

nodeWeightProperty

String

n/a

no

The name of a node property that contains node weights which will be used for similarity computation.

topK

Integer

10

yes

The number of neighbors to find for each node. The K-nearest neighbors are returned. This value cannot be lower than 1.

sampleRate

Float

0.5

yes

Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive).

deltaThreshold

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

randomSeed

Integer

-1

yes

The seed value to control the randomness of the algorithm. The value -1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value.

Table 12. Results
Name Type Description

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

mutateMillis

Integer

Milliseconds for adding properties to the in-memory graph.

postProcessingMillis

Integer

Milliseconds for computing similarity value distribution statistics.

nodesCompared

Integer

The number of nodes compared.

relationshipsWritten

Integer

The number of relationships created.

similarityDistribution

Map

Map containing min, max, mean, stdDev and p1, p5, p10, p25, p75, p90, p95, p99, p100 percentile values of the computed similarity results.

configuration

Map

The configuration used for running the algorithm.

Run K-Nearest Neighbors in write mode on a graph stored in the catalog.
CALL gds.beta.knn.write(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  nodesCompared: Integer,
  relationshipsWritten: Integer,
  similarityDistribution: Map,
  configuration: Map
Table 13. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 14. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result to Neo4j.

writeRelationshipType

String

n/a

no

The relationship type used to persist the computed relationships in the Neo4j database.

writeProperty

String

n/a

no

The relationship property in the Neo4j database to which the similarity score is written.

Table 15. Algorithm specific configuration
Name Type Default Optional Description

nodeWeightProperty

String

n/a

no

The name of a node property that contains node weights which will be used for similarity computation.

topK

Integer

10

yes

The number of neighbors to find for each node. The K-nearest neighbors are returned. This value cannot be lower than 1.

sampleRate

Float

0.5

yes

Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive).

deltaThreshold

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

randomSeed

Integer

-1

yes

The seed value to control the randomness of the algorithm. The value -1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value.

Table 16. Results
Name Type Description

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

writeMillis

Integer

Milliseconds for writing result data back to Neo4j.

postProcessingMillis

Integer

Milliseconds for computing similarity value distribution statistics.

nodesCompared

Integer

The number of nodes compared.

relationshipsWritten

Integer

The number of relationships created.

similarityDistribution

Map

Map containing min, max, mean, stdDev and p1, p5, p10, p25, p75, p90, p95, p99, p100 percentile values of the computed similarity results.

configuration

Map

The configuration used for running the algorithm.

2.1. Anonymous graphs

It is also possible to execute the algorithm on a graph that is projected in conjunction with the algorithm execution. In this case, the graph does not have a name, and we call it anonymous. When executing over an anonymous graph the configuration map contains a graph projection configuration as well as an algorithm configuration. All execution modes support execution on anonymous graphs, although we only show syntax and mode-specific configuration for the write mode for brevity.

For more information on syntax variants, see Syntax overview.

Run K-Nearest Neighbors in write mode on an anonymous graph.
CALL gds.beta.knn.write(
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  nodesCompared: Integer,
  relationshipsWritten: Integer,
  similarityDistribution: Map,
  configuration: Map
Table 17. General configuration for algorithm execution on an anonymous graph.
Name Type Default Optional Description

nodeProjection

String, String[] or Map

null

yes

The node projection used for anonymous graph creation via a Native projection.

relationshipProjection

String, String[] or Map

null

yes

The relationship projection used for anonymous graph creation a Native projection.

nodeQuery

String

null

yes

The Cypher query used to select the nodes for anonymous graph creation via a Cypher projection.

relationshipQuery

String

null

yes

The Cypher query used to select the relationships for anonymous graph creation via a Cypher projection.

nodeProperties

String, String[] or Map

null

yes

The node properties to project during anonymous graph creation.

relationshipProperties

String, String[] or Map

null

yes

The relationship properties to project during anonymous graph creation.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'.

readConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for creating the graph.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result to Neo4j.

writeRelationshipType

String

n/a

no

The relationship type used to persist the computed relationships in the Neo4j database.

writeProperty

String

n/a

no

The relationship property in the Neo4j database to which the similarity score is written.

The KNN algorithm does not read any relationships, but the values for relationshipProjection or relationshipQuery are still being used and respected for the graph loading.

Table 18. Algorithm specific configuration
Name Type Default Optional Description

nodeWeightProperty

String

n/a

no

The name of a node property that contains node weights which will be used for similarity computation.

topK

Integer

10

yes

The number of neighbors to find for each node. The K-nearest neighbors are returned. This value cannot be lower than 1.

sampleRate

Float

0.5

yes

Sample rate to limit the number of comparisons per node. Value must be between 0 (exclusive) and 1 (inclusive).

deltaThreshold

Float

0.001

yes

Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive).

maxIterations

Integer

100

yes

Hard limit to stop the algorithm after that many iterations.

randomJoins

Integer

10

yes

Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

randomSeed

Integer

-1

yes

The seed value to control the randomness of the algorithm. The value -1 means that a new seed is generated for every execution, all other values (including negative ones) are used as the seed value.

The results are the same as running write mode on a named graph, see write mode syntax above.

3. Examples

Consider the graph created by the following Cypher statement:

CREATE (alice:Person {name: 'Alice', age: 24})
CREATE (bob:Person {name: 'Bob', age: 73})
CREATE (carol:Person {name: 'Carol', age: 24})
CREATE (dave:Person {name: 'Dave', age: 48})
CREATE (eve:Person {name: 'Eve', age: 67});

In the example, we want to use the K-Nearest Neighbors algorithm to compare people based on their age.

In the examples below we will use named graphs and standard projections as the norm. However, Cypher projection and anonymous graphs could also be used.

The following statement will create the graph and store it in the graph catalog.
CALL gds.graph.create(
    'myGraph',
    {
        Person: {
            label: 'Person',
            properties: 'age'
        }
    },
    '*'
);

3.1. Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the write mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm:
CALL gds.beta.knn.write.estimate('myGraph', {
  nodeWeightProperty: 'age',
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score',
  topK: 1
})
YIELD nodeCount, bytesMin, bytesMax, requiredMemory
Table 19. Results
nodeCount bytesMin bytesMax requiredMemory

5

1224

2184

"[1224 Bytes ... 2184 Bytes]"

3.2. Stream

In the stream execution mode, the algorithm returns the similarity score for each relationship. This allows us to inspect the results directly or post-process them in Cypher without any side effects.

For more details on the stream mode in general, see Stream.

The following will run the algorithm, and stream results:
CALL gds.beta.knn.stream('myGraph', {
    topK: 1,
    nodeWeightProperty: 'age',
    // The following parameters are set to produce a deterministic result
    randomSeed: 42,
    concurrency: 1,
    sampleRate: 1.0,
    deltaThreshold: 0.0
})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS Person1, gds.util.asNode(node2).name AS Person2, similarity
ORDER BY similarity DESCENDING, Person1, Person2
Table 20. Results
Person1 Person2 similarity

"Alice"

"Carol"

1.0

"Carol"

"Alice"

1.0

"Bob"

"Eve"

0.14285714285714285

"Eve"

"Bob"

0.14285714285714285

"Dave"

"Eve"

0.05

5 rows

We use default values for the procedure configuration parameter for most parameters. The randomSeed is set to produce the same result on every invocation. The topK parameter is set to 1 to only return the single nearest neighbor for every node.

3.3. Stats

In the stats execution mode, the algorithm returns a single row containing a summary of the algorithm result. This execution mode does not have any side effects. It can be useful for evaluating algorithm performance by inspecting the computeMillis return item. In the examples below we will omit returning the timings. The full signature of the procedure can be found in the syntax section.

For more details on the stats mode in general, see Stats.

The following will run the algorithm and return the result in form of statistical and measurement values:
CALL gds.beta.knn.stats('myGraph', {topK: 1, randomSeed: 42, nodeWeightProperty: 'age'})
YIELD nodesCompared, similarityPairs
Table 21. Results
nodesCompared similarityPairs

5

5

3.4. Mutate

The mutate execution mode extends the stats mode with an important side effect: updating the named graph with a new relationship property containing the similarity score for that relationship. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row, similar to stats, but with some additional metrics. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

The following will run the algorithm, and write back results to the in-memory graph:
CALL gds.beta.knn.mutate('myGraph', {
    mutateRelationshipType: 'SIMILAR',
    mutateProperty: 'score',
    topK: 1,
    randomSeed: 42,
    nodeWeightProperty: 'age'
})
YIELD nodesCompared, relationshipsWritten
Table 22. Results
nodesCompared relationshipsWritten

5

5

As we can see from the results, the number of created relationships is equal to the number of rows in the streaming example.

3.5. Write

The write execution mode extends the stats mode with an important side effect: for each pair of nodes we create a relationship with the similarity score as a property to the Neo4j database. The type of the new relationship is specified using the mandatory configuration parameter writeRelationshipType. Each new relationship stores the similarity score between the two nodes it represents. The relationship property key is set using the mandatory configuration parameter writeProperty. The result is a single summary row, similar to stats, but with some additional metrics.

For more details on the write mode in general, see Write.

The following will run the algorithm, and write back results:
CALL gds.beta.knn.write('myGraph', {
    writeRelationshipType: 'SIMILAR',
    writeProperty: 'score',
    topK: 1,
    randomSeed: 42,
    nodeWeightProperty: 'age'
})
YIELD nodesCompared, relationshipsWritten
Table 23. Results
nodesCompared relationshipsWritten

5

5

As we can see from the results, the number of created relationships is equal to the number of rows in the streaming example.