7.5.6. The Approximate Nearest Neighbors (ANN) algorithm

This section describes the Approximate Nearest Neighbors algorithm in the Neo4j Labs Graph Algorithms library.

The Approximate Nearest Neighbors algorithm was developed by the Neo4j Labs team and is not officially supported.

The Approximate Nearest Neighbors algorithm constructs a k-Nearest Neighbors Graph for a set of objects based on a provided similarity algorithm. The similarity of items is computed based on Jaccard Similarity, Cosine Similarity, Euclidean Distance, or Pearson Similarity.

The implementation in the library is based on Dong, Charikar, and Li’s paper Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures.

This section includes:

7.5.6.1. Use-cases - when to use the Approximate Nearest Neighbors algorithm

We can use the Approximate Nearest Neighbors algorithm to work out the approximate k most similar items to each other. The corresponding k-Nearest Neighbors Graph can then be used as part of recommendation queries.

7.5.6.2. Approximate Nearest Neighbors algorithm sample

The following will create a sample graph: 

MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (shrimp:Recipe {title: "Shrimp Bolognese"})
MERGE (saltimbocca:Recipe {title: "Saltimbocca alla roman"})
MERGE (periperi:Recipe {title: "Peri Peri Naan"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)

MERGE (shrimp)-[:TYPE]->(italian)
MERGE (shrimp)-[:TYPE]->(indian)

MERGE (saltimbocca)-[:TYPE]->(italian)
MERGE (saltimbocca)-[:TYPE]->(french)

MERGE (periperi)-[:TYPE]->(portuguese)
MERGE (periperi)-[:TYPE]->(indian)

The following will return a stream of nodes, along with up to the 3 most similar nodes to them based on Jaccard Similarity: 

MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.labs.ml.ann.stream("jaccard", data, {similarityCutoff: 0.1})
YIELD item1, item2, similarity
return algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY from

Table 7.149. Results
From To Similarity

"Arya"

"Karin"

0.6666666666666666

"Arya"

"Praveena"

0.25

"Arya"

"Michael"

0.2

"Karin"

"Arya"

0.6666666666666666

"Karin"

"Michael"

0.25

"Michael"

"Karin"

0.25

"Michael"

"Praveena"

0.25

"Michael"

"Arya"

0.2

"Praveena"

"Zhen"

0.3333333333333333

"Praveena"

"Arya"

0.25

"Praveena"

"Michael"

0.25

"Zhen"

"Michael"

0.6666666666666666

"Zhen"

"Praveena"

0.3333333333333333

Arya and Karin, and Zhen and Michael have the most similar food preferences, with two overlapping cuisines for a similarity of 0.66. We also have 3 pairs of users who are not similar at all. We’d probably want to filter those out, which we can do by passing in the similarityCutoff parameter.

The following will find up to 3 similar users for each user, and store a relationship between those users: 

MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.labs.ml.ann("jaccard", data, {similarityCutoff: 0.1, showComputations: true, write: true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

Table 7.150. Results
nodes similarityPairs write writeRelationshipType writeProperty min max mean p95

5

13

true

SIMILAR

score

0.19999980926513672

0.6666669845581055

0.3512822664701022

0.6666669845581055

We then could write a query to find out what types of cuisine that other people similar to us might like.

The following will find the most similar user to Praveena, and return their favorite cuisines that Praveena doesn’t (yet!) like: 

MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
      (other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine, count(*) AS count
ORDER BY count DESC

Table 7.151. Results
cuisine count

Italian

2

French

2

Lebanese

1

7.5.6.3. Syntax

The following will run the algorithm and write back results: 

CALL algo.labs.ml.ann(userData:List<Map>, {
    topK: 1, similarityCutoff: 0.1, write:true, writeProperty: "score"
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100

Table 7.152. Parameters
Name Type Default Optional Description

'algorithm'

string

null

no

The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean'.

'data'

list

null

no

If algorithm is 'jaccard', a list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, nodeId]}. Otherwise a list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

'top'

int

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

'topK'

int

3

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

'similarityCutoff'

int

-1

yes

The threshold for similarity. Values below this will not be returned.

'degreeCutoff'

int

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

concurrency

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'.

'readConcurrency'

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

'writeConcurrency'

int

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

'write'

boolean

false

yes

Indicates whether results should be stored.

'writeBatchSize'

int

10000

yes

The batch size to use when storing results.

'writeRelationshipType'

string

SIMILAR

yes

The relationship type to use when storing results.

'writeProperty'

string

score

yes

The property to use when storing results.

Table 7.153. Results
Name Type Description

'nodes'

int

The number of nodes passed in.

'similarityPairs'

int

The number of pairs of similar nodes computed.

'write'

boolean

Indicates whether results were stored.

'writeRelationshipType'

string

The relationship type used when storing results.

'writeProperty'

string

The property used when storing results.

'min'

double

The minimum similarity score computed.

'max'

double

The maximum similarity score computed.

'mean'

double

The mean of similarities scores computed.

'stdDev'

double

The standard deviation of similarities scores computed.

'p25'

double

The 25 percentile of similarities scores computed.

'p50'

double

The 50 percentile of similarities scores computed.

'p75'

double

The 75 percentile of similarities scores computed.

'p90'

double

The 90 percentile of similarities scores computed.

'p95'

double

The 95 percentile of similarities scores computed.

'p99'

double

The 99 percentile of similarities scores computed.

'p999'

double

The 99.9 percentile of similarities scores computed.

'p100'

double

The 25 percentile of similarities scores computed.

The following will run the algorithm and stream results: 

CALL algo.labs.ml.ann.stream(userData:List<Map>, {
    degreeCutoff: 10, similarityCutoff: 0.1, concurrency:4
})
YIELD item1, item2, count1, count2, intersection, similarity

Table 7.154. Parameters
Name Type Default Optional Description

'algorithm'

string

null

no

The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean'

'data'

list

null

no

If algorithm is 'jaccard', a list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, nodeId]}. Otherwise a list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

'top'

int

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

'topK'

int

3

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

'similarityCutoff'

int

-1

yes

The threshold for similarity. Values below this will not be returned.

'degreeCutoff'

int

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

concurrency

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency'.

'readConcurrency'

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

Table 7.155. Results
Name Type Description

'item1'

int

The ID of one node in the similarity pair.

'item2'

int

The ID of other node in the similarity pair.

'count1'

int

The size of the targets list of one node.

'count2'

int

The size of the targets list of other node.

'intersection'

int

The number of intersecting values in the two nodes targets lists.

'similarity'

int

The similarity of the two nodes.