This section describes the Approximate Nearest Neighbors algorithm in the Neo4j Graph Data Science library.

This algorithm is in the alpha tier.
For more information on algorithm tiers, see Chapter 5, *Algorithms*.

The Approximate Nearest Neighbors algorithm constructs a k-Nearest Neighbors Graph for a set of objects based on a provided similarity algorithm. The similarity of items is computed based on Jaccard Similarity, Cosine Similarity, Euclidean Distance, or Pearson Similarity.

The implementation in the library is based on Dong, Charikar, and Li’s paper Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures.

This section includes:

The following will run the algorithm and write back results:

```
CALL gds.alpha.ml.ann.write(configuration: Map)
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
```

Name | Type | Default | Optional | Description |
---|---|---|---|---|

algorithm |
String |
null |
no |
The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean'. |

data |
List |
null |
no |
If algorithm is |

top |
Integer |
0 |
yes |
The number of similar pairs to return. If |

topK |
Integer |
3 |
yes |
The number of similar values to return per node. |

randomSeed |
Integer |
1 |
yes |
The random-seed used for neighbor-sampling. |

sampling |
Boolean |
true |
yes |
Whether the potential neighbors should be sampled. |

p |
Float |
0.5 |
yes |
Influences the sample size: |

similarityCutoff |
Integer |
-1 |
yes |
The threshold for similarity. Values below this will not be returned. |

degreeCutoff |
Integer |
0 |
yes |
The threshold for the number of items in the |

concurrency |
Integer |
4 |
yes |
The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'. |

writeConcurrency |
Integer |
value of 'concurrency' |
yes |
The number of concurrent threads used for writing the result. |

writeBatchSize |
Integer |
10000 |
yes |
The batch size to use when storing results. |

writeRelationshipType |
String |
SIMILAR |
yes |
The relationship type to use when storing results. |

writeProperty |
String |
score |
yes |
The property to use when storing results. |

Name | Type | Description |
---|---|---|

nodes |
Integer |
The number of nodes passed in. |

similarityPairs |
Integer |
The number of pairs of similar nodes computed. |

writeRelationshipType |
String |
The relationship type used when storing results. |

writeProperty |
String |
The property used when storing results. |

min |
Float |
The minimum similarity score computed. |

max |
Float |
The maximum similarity score computed. |

mean |
Float |
The mean of similarities scores computed. |

stdDev |
Float |
The standard deviation of similarities scores computed. |

p25 |
Float |
The 25 percentile of similarities scores computed. |

p50 |
Float |
The 50 percentile of similarities scores computed. |

p75 |
Float |
The 75 percentile of similarities scores computed. |

p90 |
Float |
The 90 percentile of similarities scores computed. |

p95 |
Float |
The 95 percentile of similarities scores computed. |

p99 |
Float |
The 99 percentile of similarities scores computed. |

p999 |
Float |
The 99.9 percentile of similarities scores computed. |

p100 |
Float |
The 25 percentile of similarities scores computed. |

The following will run the algorithm and stream results:

```
CALL gds.alpha.ml.ann.stream(configuration: Map)
YIELD item1, item2, count1, count2, intersection, similarity
```

Name | Type | Default | Optional | Description |
---|---|---|---|---|

algorithm |
String |
null |
no |
The similarity algorithm to use. Valid values: jaccard', 'cosine', 'pearson', 'euclidean' |

data |
List |
null |
no |
If algorithm is 'jaccard', a list of maps of the following structure: |

top |
Integer |
0 |
yes |
The number of similar pairs to return. If |

topK |
Integer |
3 |
yes |
The number of similar values to return per node. |

randomSeed |
Integer |
1 |
yes |
The random-seed used for neighbor-sampling. |

sampling |
Boolean |
true |
yes |
Whether the potential neighbors should be sampled. |

p |
Float |
0.5 |
yes |
Influences the sample size: |

similarityCutoff |
Integer |
-1 |
yes |
The threshold for similarity. Values below this will not be returned. |

degreeCutoff |
Integer |
0 |
yes |
The threshold for the number of items in the |

concurrency |
Integer |
4 |
yes |
The number of concurrent threads used for running the algorithm. |

Name | Type | Description |
---|---|---|

item1 |
Integer |
The ID of one node in the similarity pair. |

item2 |
Integer |
The ID of other node in the similarity pair. |

count1 |
Integer |
The size of the |

count2 |
Integer |
The size of the |

intersection |
Integer |
The number of intersecting values in the two nodes |

similarity |
Integer |
The similarity of the two nodes. |

We can use the Approximate Nearest Neighbors algorithm to work out the approximate k most similar items to each other. The corresponding k-Nearest Neighbors Graph can then be used as part of recommendation queries.

The following will create a sample graph:

```
CREATE
(french:Cuisine {name:'French'}),
(italian:Cuisine {name:'Italian'}),
(indian:Cuisine {name:'Indian'}),
(lebanese:Cuisine {name:'Lebanese'}),
(portuguese:Cuisine {name:'Portuguese'}),
(zhen:Person {name: 'Zhen'}),
(praveena:Person {name: 'Praveena'}),
(michael:Person {name: 'Michael'}),
(arya:Person {name: 'Arya'}),
(karin:Person {name: 'Karin'}),
(praveena)-[:LIKES]->(indian),
(praveena)-[:LIKES]->(portuguese),
(zhen)-[:LIKES]->(french),
(zhen)-[:LIKES]->(indian),
(michael)-[:LIKES]->(french),
(michael)-[:LIKES]->(italian),
(michael)-[:LIKES]->(indian),
(arya)-[:LIKES]->(lebanese),
(arya)-[:LIKES]->(italian),
(arya)-[:LIKES]->(portuguese),
(karin)-[:LIKES]->(lebanese),
(karin)-[:LIKES]->(italian)
```

The following will return a stream of nodes, along with up to the 3 most similar nodes to them based on Jaccard Similarity:

```
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.ml.ann.stream({
data: data,
algorithm: 'jaccard',
similarityCutoff: 0.1,
concurrency: 1
})
YIELD item1, item2, similarity
return gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY from
```

from | to | similarity |
---|---|---|

Arya |
Karin |
0.6666666666666666 |

Arya |
Praveena |
0.25 |

Arya |
Michael |
0.2 |

Karin |
Arya |
0.6666666666666666 |

Karin |
Michael |
0.25 |

Michael |
Karin |
0.25 |

Michael |
Praveena |
0.25 |

Michael |
Arya |
0.2 |

Praveena |
Arya |
0.25 |

Praveena |
Michael |
0.25 |

Zhen |
Michael |
0.6666666666666666 |

Arya and Karin, and Zhen and Michael have the most similar food preferences, with two overlapping cuisines for a similarity
of 0.66.
We also have 3 pairs of users who are not similar at all.
We’d probably want to filter those out, which we can do by passing in the `similarityCutoff`

parameter.

The following will find up to 3 similar users for each user, and store a relationship between those users:

```
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.ml.ann.write({
algorithm: 'jaccard',
data: data,
similarityCutoff: 0.1,
showComputations: true,
concurrency: 1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
```

nodes | similarityPairs | writeRelationshipType | writeProperty | min | max | mean | p95 |
---|---|---|---|---|---|---|---|

5 |
13 |
"SIMILAR" |
"score" |
0.19999980926513672 |
0.6666669845581055 |
0.3512822664701022 |
0.6666669845581055 |

We then could write a query to find out what types of cuisine that other people similar to us might like.

The following will find the most similar user to Praveena, and return their favorite cuisines that Praveena doesn’t (yet!) like:

```
MATCH (p:Person {name: 'Praveena'})-[:SIMILAR]->(other),
(other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine, count(*) AS count
ORDER BY cuisine DESC
```

cuisine | count |
---|---|

"French" |
1 |

"Italian" |
2 |

"Lebanese" |
1 |

When executing ApproximateNearestNeighbors in parallel, it is possible that results are flaky because of the asynchronous execution fashion of the algorithm.