5.4.7. Approximate Nearest Neighbors (ANN)

This section describes the Approximate Nearest Neighbors algorithm in the Neo4j Graph Data Science library.

This algorithm is in the alpha tier. For more information on algorithm tiers, see Chapter 5, Algorithms.

The Approximate Nearest Neighbors algorithm constructs a k-Nearest Neighbors Graph for a set of objects based on a provided similarity algorithm. The similarity of items is computed based on Jaccard Similarity, Cosine Similarity, Euclidean Distance, or Pearson Similarity.

The implementation in the library is based on Dong, Charikar, and Li’s paper Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures.

This section includes:

5.4.7.1. Syntax

The following will run the algorithm and write back results: 

CALL gds.alpha.ml.ann.write(configuration: Map)
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100

Table 5.342. Configuration
Name Type Default Optional Description

algorithm

String

null

no

The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean'.

data

List

null

no

If algorithm is jaccard, a list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, nodeId]}. Otherwise a list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

top

Integer

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

Integer

3

yes

The number of similar values to return per node.

randomSeed

Integer

1

yes

The random-seed used for neighbor-sampling.

sampling

Boolean

true

yes

Whether the potential neighbors should be sampled.

p

Float

0.5

yes

Influences the sample size: min(1.0, p) * |topK|.

similarityCutoff

Integer

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

Integer

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

writeBatchSize

Integer

10000

yes

The batch size to use when storing results.

writeRelationshipType

String

SIMILAR

yes

The relationship type to use when storing results.

writeProperty

String

score

yes

The property to use when storing results.

Table 5.343. Results
Name Type Description

nodes

Integer

The number of nodes passed in.

similarityPairs

Integer

The number of pairs of similar nodes computed.

writeRelationshipType

String

The relationship type used when storing results.

writeProperty

String

The property used when storing results.

min

Float

The minimum similarity score computed.

max

Float

The maximum similarity score computed.

mean

Float

The mean of similarities scores computed.

stdDev

Float

The standard deviation of similarities scores computed.

p25

Float

The 25 percentile of similarities scores computed.

p50

Float

The 50 percentile of similarities scores computed.

p75

Float

The 75 percentile of similarities scores computed.

p90

Float

The 90 percentile of similarities scores computed.

p95

Float

The 95 percentile of similarities scores computed.

p99

Float

The 99 percentile of similarities scores computed.

p999

Float

The 99.9 percentile of similarities scores computed.

p100

Float

The 25 percentile of similarities scores computed.

The following will run the algorithm and stream results: 

CALL gds.alpha.ml.ann.stream(configuration: Map)
YIELD item1, item2, count1, count2, intersection, similarity

Table 5.344. Configuration
Name Type Default Optional Description

algorithm

String

null

no

The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean'

data

List

null

no

If algorithm is 'jaccard', a list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, nodeId]}. Otherwise a list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

top

Integer

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

Integer

3

yes

The number of similar values to return per node.

randomSeed

Integer

1

yes

The random-seed used for neighbor-sampling.

sampling

Boolean

true

yes

Whether the potential neighbors should be sampled.

p

Float

0.5

yes

Influences the sample size: min(1.0, p) * |topK|

similarityCutoff

Integer

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

Integer

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 5.345. Results
Name Type Description

item1

Integer

The ID of one node in the similarity pair.

item2

Integer

The ID of other node in the similarity pair.

count1

Integer

The size of the targets list of one node.

count2

Integer

The size of the targets list of other node.

intersection

Integer

The number of intersecting values in the two nodes targets lists.

similarity

Integer

The similarity of the two nodes.

5.4.7.2. Use-cases - when to use the Approximate Nearest Neighbors algorithm

We can use the Approximate Nearest Neighbors algorithm to work out the approximate k most similar items to each other. The corresponding k-Nearest Neighbors Graph can then be used as part of recommendation queries.

5.4.7.3. Approximate Nearest Neighbors algorithm sample

The following will create a sample graph: 

 CREATE
  (french:Cuisine {name:'French'}),
  (italian:Cuisine {name:'Italian'}),
  (indian:Cuisine {name:'Indian'}),
  (lebanese:Cuisine {name:'Lebanese'}),
  (portuguese:Cuisine {name:'Portuguese'}),

  (zhen:Person {name: 'Zhen'}),
  (praveena:Person {name: 'Praveena'}),
  (michael:Person {name: 'Michael'}),
  (arya:Person {name: 'Arya'}),
  (karin:Person {name: 'Karin'}),

  (praveena)-[:LIKES]->(indian),
  (praveena)-[:LIKES]->(portuguese),

  (zhen)-[:LIKES]->(french),
  (zhen)-[:LIKES]->(indian),

  (michael)-[:LIKES]->(french),
  (michael)-[:LIKES]->(italian),
  (michael)-[:LIKES]->(indian),

  (arya)-[:LIKES]->(lebanese),
  (arya)-[:LIKES]->(italian),
  (arya)-[:LIKES]->(portuguese),

  (karin)-[:LIKES]->(lebanese),
  (karin)-[:LIKES]->(italian)

The following will return a stream of nodes, along with up to the 3 most similar nodes to them based on Jaccard Similarity: 

 MATCH (p:Person)-[:LIKES]->(cuisine)
 WITH {item:id(p), categories: collect(id(cuisine))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.ml.ann.stream({
   data: data,
   algorithm: 'jaccard',
   similarityCutoff: 0.1,
   concurrency: 1
 })
 YIELD item1, item2, similarity
 return gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
 ORDER BY from

Table 5.346. Results
from to similarity

Arya

Karin

0.6666666666666666

Arya

Praveena

0.25

Arya

Michael

0.2

Karin

Arya

0.6666666666666666

Karin

Michael

0.25

Michael

Karin

0.25

Michael

Praveena

0.25

Michael

Arya

0.2

Praveena

Arya

0.25

Praveena

Michael

0.25

Zhen

Michael

0.6666666666666666

Arya and Karin, and Zhen and Michael have the most similar food preferences, with two overlapping cuisines for a similarity of 0.66. We also have 3 pairs of users who are not similar at all. We’d probably want to filter those out, which we can do by passing in the similarityCutoff parameter.

The following will find up to 3 similar users for each user, and store a relationship between those users: 

 MATCH (p:Person)-[:LIKES]->(cuisine)
 WITH {item:id(p), categories: collect(id(cuisine))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.ml.ann.write({
  algorithm: 'jaccard',
  data: data,
  similarityCutoff: 0.1,
  showComputations: true,
  concurrency: 1
 })
 YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
 RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95

Table 5.347. Results
nodes similarityPairs writeRelationshipType writeProperty min max mean p95

5

13

"SIMILAR"

"score"

0.19999980926513672

0.6666669845581055

0.3512822664701022

0.6666669845581055

We then could write a query to find out what types of cuisine that other people similar to us might like.

The following will find the most similar user to Praveena, and return their favorite cuisines that Praveena doesn’t (yet!) like: 

 MATCH (p:Person {name: 'Praveena'})-[:SIMILAR]->(other),
       (other)-[:LIKES]->(cuisine)
 WHERE not((p)-[:LIKES]->(cuisine))
 RETURN cuisine.name AS cuisine, count(*) AS count
 ORDER BY cuisine DESC

Table 5.348. Results
cuisine count

"French"

1

"Italian"

2

"Lebanese"

1

Usage

When executing ApproximateNearestNeighbors in parallel, it is possible that results are flaky because of the asynchronous execution fashion of the algorithm.