Approximate Nearest Neighbors (ANN)
This section describes the Approximate Nearest Neighbors algorithm in the Neo4j Graph Data Science library.
This algorithm is in the alpha tier. For more information on algorithm tiers, see Algorithms.
The Approximate Nearest Neighbors algorithm constructs a k-Nearest Neighbors Graph for a set of objects based on a provided similarity algorithm. The similarity of items is computed based on Jaccard Similarity, Cosine Similarity, Euclidean Distance, or Pearson Similarity.
The implementation in the library is based on Dong, Charikar, and Li’s paper Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures.
1. Syntax
CALL gds.alpha.ml.ann.write(configuration: Map)
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
Name | Type | Default | Optional | Description |
---|---|---|---|---|
algorithm |
String |
null |
no |
The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean'. |
data |
List |
null |
no |
If algorithm is |
top |
Integer |
0 |
yes |
The number of similar pairs to return. If |
topK |
Integer |
3 |
yes |
The number of similar values to return per node. |
randomSeed |
Integer |
1 |
yes |
The random-seed used for neighbor-sampling. |
sampling |
Boolean |
true |
yes |
Whether the potential neighbors should be sampled. |
p |
Float |
0.5 |
yes |
Influences the sample size: |
similarityCutoff |
Integer |
-1 |
yes |
The threshold for similarity. Values below this will not be returned. |
degreeCutoff |
Integer |
0 |
yes |
The threshold for the number of items in the |
concurrency |
Integer |
4 |
yes |
The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'. |
writeConcurrency |
Integer |
value of 'concurrency' |
yes |
The number of concurrent threads used for writing the result. |
writeBatchSize |
Integer |
10000 |
yes |
The batch size to use when storing results. |
writeRelationshipType |
String |
SIMILAR |
yes |
The relationship type to use when storing results. |
writeProperty |
String |
score |
yes |
The property to use when storing results. |
Name | Type | Description |
---|---|---|
nodes |
Integer |
The number of nodes passed in. |
similarityPairs |
Integer |
The number of pairs of similar nodes computed. |
writeRelationshipType |
String |
The relationship type used when storing results. |
writeProperty |
String |
The property used when storing results. |
min |
Float |
The minimum similarity score computed. |
max |
Float |
The maximum similarity score computed. |
mean |
Float |
The mean of similarities scores computed. |
stdDev |
Float |
The standard deviation of similarities scores computed. |
p25 |
Float |
The 25 percentile of similarities scores computed. |
p50 |
Float |
The 50 percentile of similarities scores computed. |
p75 |
Float |
The 75 percentile of similarities scores computed. |
p90 |
Float |
The 90 percentile of similarities scores computed. |
p95 |
Float |
The 95 percentile of similarities scores computed. |
p99 |
Float |
The 99 percentile of similarities scores computed. |
p999 |
Float |
The 99.9 percentile of similarities scores computed. |
p100 |
Float |
The 25 percentile of similarities scores computed. |
CALL gds.alpha.ml.ann.stream(configuration: Map)
YIELD item1, item2, count1, count2, intersection, similarity
Name | Type | Default | Optional | Description |
---|---|---|---|---|
algorithm |
String |
null |
no |
The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean' |
data |
List |
null |
no |
If algorithm is 'jaccard', a list of maps of the following structure: |
top |
Integer |
0 |
yes |
The number of similar pairs to return. If |
topK |
Integer |
3 |
yes |
The number of similar values to return per node. |
randomSeed |
Integer |
1 |
yes |
The random-seed used for neighbor-sampling. |
sampling |
Boolean |
true |
yes |
Whether the potential neighbors should be sampled. |
p |
Float |
0.5 |
yes |
Influences the sample size: |
similarityCutoff |
Integer |
-1 |
yes |
The threshold for similarity. Values below this will not be returned. |
degreeCutoff |
Integer |
0 |
yes |
The threshold for the number of items in the |
concurrency |
Integer |
4 |
yes |
The number of concurrent threads used for running the algorithm. |
Name | Type | Description |
---|---|---|
item1 |
Integer |
The ID of one node in the similarity pair. |
item2 |
Integer |
The ID of other node in the similarity pair. |
count1 |
Integer |
The size of the |
count2 |
Integer |
The size of the |
intersection |
Integer |
The number of intersecting values in the two nodes |
similarity |
Integer |
The similarity of the two nodes. |
2. Use-cases - when to use the Approximate Nearest Neighbors algorithm
We can use the Approximate Nearest Neighbors algorithm to work out the approximate k most similar items to each other. The corresponding k-Nearest Neighbors Graph can then be used as part of recommendation queries.
3. Approximate Nearest Neighbors algorithm sample
CREATE
(french:Cuisine {name:'French'}),
(italian:Cuisine {name:'Italian'}),
(indian:Cuisine {name:'Indian'}),
(lebanese:Cuisine {name:'Lebanese'}),
(portuguese:Cuisine {name:'Portuguese'}),
(zhen:Person {name: 'Zhen'}),
(praveena:Person {name: 'Praveena'}),
(michael:Person {name: 'Michael'}),
(arya:Person {name: 'Arya'}),
(karin:Person {name: 'Karin'}),
(praveena)-[:LIKES]->(indian),
(praveena)-[:LIKES]->(portuguese),
(zhen)-[:LIKES]->(french),
(zhen)-[:LIKES]->(indian),
(michael)-[:LIKES]->(french),
(michael)-[:LIKES]->(italian),
(michael)-[:LIKES]->(indian),
(arya)-[:LIKES]->(lebanese),
(arya)-[:LIKES]->(italian),
(arya)-[:LIKES]->(portuguese),
(karin)-[:LIKES]->(lebanese),
(karin)-[:LIKES]->(italian)
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.ml.ann.stream({
data: data,
algorithm: 'jaccard',
similarityCutoff: 0.1,
concurrency: 1
})
YIELD item1, item2, similarity
return gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY from
from | to | similarity |
---|---|---|
Arya |
Karin |
0.6666666666666666 |
Arya |
Praveena |
0.25 |
Arya |
Michael |
0.2 |
Karin |
Arya |
0.6666666666666666 |
Karin |
Michael |
0.25 |
Michael |
Karin |
0.25 |
Michael |
Praveena |
0.25 |
Michael |
Arya |
0.2 |
Praveena |
Arya |
0.25 |
Praveena |
Michael |
0.25 |
Zhen |
Michael |
0.6666666666666666 |
Arya and Karin, and Zhen and Michael have the most similar food preferences, with two overlapping cuisines for a similarity of 0.66.
We also have 3 pairs of users who are not similar at all.
We’d probably want to filter those out, which we can do by passing in the similarityCutoff
parameter.
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.ml.ann.write({
algorithm: 'jaccard',
data: data,
similarityCutoff: 0.1,
showComputations: true,
concurrency: 1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
nodes | similarityPairs | writeRelationshipType | writeProperty | min | max | mean | p95 |
---|---|---|---|---|---|---|---|
5 |
13 |
"SIMILAR" |
"score" |
0.19999980926513672 |
0.6666669845581055 |
0.3512822664701022 |
0.6666669845581055 |
We then could write a query to find out what types of cuisine that other people similar to us might like.
MATCH (p:Person {name: 'Praveena'})-[:SIMILAR]->(other),
(other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine, count(*) AS count
ORDER BY cuisine DESC
cuisine | count |
---|---|
"French" |
1 |
"Italian" |
2 |
"Lebanese" |
1 |
Was this page helpful?