This section describes the Euclidean Distance algorithm in the Neo4j Labs Graph Algorithms library.
Euclidean distance measures the straight line distance between two points in n-dimensional space.
The Euclidean Distance algorithm was developed by the Neo4j Labs team and is not officially supported. |
This section includes:
Euclidean distance is computed using the following formula:
The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation and are therefore more appropriate for computing similarities on bigger datasets.
We can use the Euclidean Distance algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query. For example, to get movie recommendations based on the preferences of users who have given similar ratings to other movies that you’ve seen.
The Euclidean Distance function computes the similarity of two lists of numbers.
Euclidean Distance is only calculated over non-NULL dimensions. When calling the function, we should provide lists that contain the overlapping items. |
We can use it to compute the similarity of two hardcoded lists.
The following will return the euclidean similarity of two lists of numbers:
RETURN algo.similarity.euclideanDistance([3,8,7,5,2,9], [10,8,6,6,4,5]) AS similarity
similarity |
---|
8.426149773176359 |
These two lists of numbers have a euclidean distance of 8.42.
We can also use it to compute the similarity of nodes based on lists computed by a Cypher query.
The following will create a sample graph:
MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})
MERGE (british:Cuisine {name:'British'})
MERGE (mauritian:Cuisine {name:'Mauritian'})
MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})
MERGE (praveena)-[:LIKES {score: 9}]->(indian)
MERGE (praveena)-[:LIKES {score: 7}]->(portuguese)
MERGE (praveena)-[:LIKES {score: 8}]->(british)
MERGE (praveena)-[:LIKES {score: 1}]->(mauritian)
MERGE (zhen)-[:LIKES {score: 10}]->(french)
MERGE (zhen)-[:LIKES {score: 6}]->(indian)
MERGE (zhen)-[:LIKES {score: 2}]->(british)
MERGE (michael)-[:LIKES {score: 8}]->(french)
MERGE (michael)-[:LIKES {score: 7}]->(italian)
MERGE (michael)-[:LIKES {score: 9}]->(indian)
MERGE (michael)-[:LIKES {score: 3}]->(portuguese)
MERGE (arya)-[:LIKES {score: 10}]->(lebanese)
MERGE (arya)-[:LIKES {score: 10}]->(italian)
MERGE (arya)-[:LIKES {score: 7}]->(portuguese)
MERGE (arya)-[:LIKES {score: 9}]->(mauritian)
MERGE (karin)-[:LIKES {score: 9}]->(lebanese)
MERGE (karin)-[:LIKES {score: 7}]->(italian)
MERGE (karin)-[:LIKES {score: 10}]->(portuguese)
The following will return the Euclidean distance of Zhen and Praveena:
MATCH (p1:Person {name: 'Zhen'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person {name: "Praveena"})-[likes2:LIKES]->(cuisine)
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.euclideanDistance(collect(likes1.score), collect(likes2.score)) AS similarity
from |
to |
similarity |
---|---|---|
"Zhen" |
"Praveena" |
6.708203932499369 |
The following will return the Euclidean distance of Zhen and the other people that have a cuisine in common:
MATCH (p1:Person {name: 'Zhen'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person)-[likes2:LIKES]->(cuisine) WHERE p2 <> p1
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.euclideanDistance(collect(likes1.score), collect(likes2.score)) AS similarity
ORDER BY similarity DESC
from |
to |
similarity |
---|---|---|
"Zhen" |
"Praveena" |
6.708203932499369 |
"Zhen" |
"Michael" |
3.605551275463989 |
The Euclidean Distance procedure computes similarity between all pairs of items. It is a symmetrical algorithm, which means that the result from computing the similarity of Item A to Item B is the same as computing the similarity of Item B to Item A. We can therefore compute the score for each pair of nodes once. We don’t compute the similarity of items to themselves.
The number of computations is ((# items)^2 / 2) - # items
, which can be very computationally expensive if we have a lot of items.
Euclidean Distance is only calculated over non-NULL dimensions.
The procedures expect to receive the same length lists for all items, so we need to pad those lists with |
The following will create a sample graph:
MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})
MERGE (british:Cuisine {name:'British'})
MERGE (mauritian:Cuisine {name:'Mauritian'})
MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})
MERGE (praveena)-[:LIKES {score: 9}]->(indian)
MERGE (praveena)-[:LIKES {score: 7}]->(portuguese)
MERGE (praveena)-[:LIKES {score: 8}]->(british)
MERGE (praveena)-[:LIKES {score: 1}]->(mauritian)
MERGE (zhen)-[:LIKES {score: 10}]->(french)
MERGE (zhen)-[:LIKES {score: 6}]->(indian)
MERGE (zhen)-[:LIKES {score: 2}]->(british)
MERGE (michael)-[:LIKES {score: 8}]->(french)
MERGE (michael)-[:LIKES {score: 7}]->(italian)
MERGE (michael)-[:LIKES {score: 9}]->(indian)
MERGE (michael)-[:LIKES {score: 3}]->(portuguese)
MERGE (arya)-[:LIKES {score: 10}]->(lebanese)
MERGE (arya)-[:LIKES {score: 10}]->(italian)
MERGE (arya)-[:LIKES {score: 7}]->(portuguese)
MERGE (arya)-[:LIKES {score: 9}]->(mauritian)
MERGE (karin)-[:LIKES {score: 9}]->(lebanese)
MERGE (karin)-[:LIKES {score: 7}]->(italian)
MERGE (karin)-[:LIKES {score: 10}]->(portuguese)
The following will return a stream of node pairs, along with their intersection and euclidean similarities:
MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.euclidean.stream(data)
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity
from |
to |
similarity |
---|---|---|
"Praveena" |
"Karin" |
3.0 |
"Zhen" |
"Michael" |
3.605551275463989 |
"Praveena" |
"Michael" |
4.0 |
"Arya" |
"Karin" |
4.358898943540674 |
"Michael" |
"Arya" |
5.0 |
"Zhen" |
"Praveena" |
6.708203932499369 |
"Michael" |
"Karin" |
7.0 |
"Praveena" |
"Arya" |
8.0 |
"Zhen" |
"Arya" |
NaN |
"Zhen" |
"Karin" |
NaN |
Praveena and Karin have the most similar food preferences, with a euclidean distance of 3.0. Lower scores are better here; a score of 0 would indicate that users have exactly the same preferences.
We can also see at the bottom of the list that Zhen and Arya and Zhen and Karin have a similarity of NaN
.
We get this result because there is no overlap in their food preferences.
We can filter those results out using the algo.isFinite
function.
The following will return a stream of node pairs, along with their intersection and finite euclidean similarities:
MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.euclidean.stream(data)
YIELD item1, item2, count1, count2, similarity
WHERE algo.isFinite(similarity)
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity
from |
to |
similarity |
---|---|---|
"Praveena" |
"Karin" |
3.0 |
"Zhen" |
"Michael" |
3.605551275463989 |
"Praveena" |
"Michael" |
4.0 |
"Arya" |
"Karin" |
4.358898943540674 |
"Michael" |
"Arya" |
5.0 |
"Zhen" |
"Praveena" |
6.708203932499369 |
"Michael" |
"Karin" |
7.0 |
"Praveena" |
"Arya" |
8.0 |
We can see in these results that Zhen and Arya and Zhen and Karin have been removed.
We might decide that we don’t want to see users with a similarity above 4 returned in our results.
If so, we can filter those out by passing in the similarityCutoff
parameter.
The following will return a stream of node pairs that have a similarity of at most 17, along with their euclidean distance:
MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.euclidean.stream(data, {similarityCutoff: 4.0})
YIELD item1, item2, count1, count2, similarity
WHERE algo.isFinite(similarity)
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity
from |
to |
similarity |
---|---|---|
"Praveena" |
"Karin" |
3.0 |
"Zhen" |
"Michael" |
3.605551275463989 |
"Praveena" |
"Michael" |
4.0 |
We can see that those users with a high score have been filtered out.
If we’re implementing a k-Nearest Neighbors type query we might instead want to find the most similar k
users for a given user.
We can do that by passing in the topK
parameter.
The following will return a stream of users along with the most similar user to them (i.e. k=1
):
MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.euclidean.stream(data, {topK:1})
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY from
from |
to |
similarity |
---|---|---|
"Arya" |
"Karin" |
4.358898943540674 |
"Karin" |
"Praveena" |
3.0 |
"Michael" |
"Zhen" |
3.605551275463989 |
"Praveena" |
"Karin" |
3.0 |
"Zhen" |
"Michael" |
3.605551275463989 |
These results will not necessarily be symmetrical. For example, the person most similar to Arya is Karin, but the person most similar to Karin is Praveena.
The following will find the most similar user for each user, and store a relationship between those users:
MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.euclidean(data, {topK: 1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
nodes |
similarityPairs |
write |
writeRelationshipType |
writeProperty |
min |
max |
mean |
p95 |
---|---|---|---|---|---|---|---|---|
5 |
5 |
TRUE |
"SIMILAR" |
"score" |
3.0 |
4.3589019775390625 |
3.5139984130859374 |
4.3589019775390625 |
We then could write a query to find out what types of cuisine that other people similar to us might like.
The following will find the most similar user to Praveena, and return their favorite cuisines that Praveena doesn’t (yet!) like:
MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
(other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine
cuisine |
---|
Italian |
Lebanese |
Sometimes, we don’t want to compute all pairs similarity, but would rather specify subsets of items to compare to each other.
We do this using the sourceIds
and targetIds
keys in the config.
We could use this technique to compute the similarity of a subset of items to all other items.
The following will find the most similar person (i.e. k=1
) to Arya and Praveena:
MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), name: p.name, weights: collect(coalesce(likes.score, algo.NaN()))} as userData
WITH collect(userData) as personCuisines
// create sourceIds list containing ids for Praveena and Arya
WITH personCuisines,
[value in personCuisines WHERE value.name IN ["Praveena", "Arya"] | value.item ] AS sourceIds
CALL algo.similarity.euclidean.stream(personCuisines, {sourceIds: sourceIds, topK: 1})
YIELD item1, item2, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN from.name AS from, to.name AS to, similarity
ORDER BY similarity DESC
from |
to |
similarity |
---|---|---|
"Arya" |
"Karin" |
4.358898943540674 |
"Praveena" |
"Karin" |
3.0 |
By default the skipValue
parameter is algo.Nan()
.
The algorithm checks every value against the skipValue
to determine whether that value should be considered as part of the similarity result.
For cases where no values should be skipped, skipping can be disabled by setting skipValue
to null
.
The following will create a sample graph:
MERGE (french:Cuisine {name:'French'}) SET french.embedding = [0.71, 0.33, 0.81, 0.52, 0.41]
MERGE (italian:Cuisine {name:'Italian'}) SET italian.embedding = [0.31, 0.72, 0.58, 0.67, 0.31]
MERGE (indian:Cuisine {name:'Indian'}) SET indian.embedding = [0.43, 0.26, 0.98, 0.51, 0.76]
MERGE (lebanese:Cuisine {name:'Lebanese'}) SET lebanese.embedding = [0.12, 0.23, 0.35, 0.31, 0.39]
MERGE (portuguese:Cuisine {name:'Portuguese'}) SET portuguese.embedding = [0.47, 0.98, 0.81, 0.72, 0.89]
MERGE (british:Cuisine {name:'British'}) SET british.embedding = [0.94, 0.12, 0.23, 0.4, 0.71]
MERGE (mauritian:Cuisine {name:'Mauritian'}) SET mauritian.embedding = [0.31, 0.56, 0.98, 0.21, 0.62]
The following will find the similarity between cuisines based on the embedding
property:
MATCH (c:Cuisine)
WITH {item:id(c), weights: c.embedding} as userData
WITH collect(userData) as data
CALL algo.similarity.euclidean.stream(data, {skipValue: null})
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC
If the similarity lists are very large they can take up a lot of memory. For cases where those lists contain lots of values that should be skipped, you can use the less memory-intensive approach of using Cypher statements to project the graph instead.
The Cypher loader expects to receive 3 fields:
item
- should contain node ids, which we can return using the id
function.
category
- should contain node ids, which we can return using the id
function.
weight
- should contain a double value.
Set graph:'cypher'
in the config:
WITH "MATCH (person:Person)-[likes:LIKES]->(c)
RETURN id(person) AS item, id(c) AS category, likes.score AS weight" AS query
CALL algo.similarity.euclidean(query, {
graph: 'cypher', topK: 1, similarityCutoff: 4.0, write:true
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
The following will run the algorithm and write back results:
CALL algo.similarity.euclidean(userData:List<Map>> or String, {
topK: 1, similarityCutoff: 17.0, write:true, writeProperty: "euclideanSimilarity"
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
Name | Type | Default | Optional | Description |
---|---|---|---|---|
|
list or string |
null |
no |
A list of maps of the following structure: |
|
int |
0 |
yes |
The number of similar pairs to return. If |
|
int |
0 |
yes |
The number of similar values to return per node. If |
|
int |
-1 |
yes |
The threshold for Euclidean distance. Values above this will not be returned. |
|
int |
0 |
yes |
The threshold for the number of items in the |
|
double |
algo.NaN() |
yes |
Value to skip when executing similarity computation. A value of |
|
int |
available CPUs |
yes |
The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. |
|
int |
value of 'concurrency' |
yes |
The number of concurrent threads used for reading the graph. |
|
int |
value of 'concurrency' |
yes |
The number of concurrent threads used for writing the result. |
|
string |
dense |
yes |
The graph name ('dense' or 'cypher'). |
|
boolean |
false |
yes |
Indicates whether results should be stored. |
|
int |
10000 |
yes |
The batch size to use when storing results. |
|
string |
SIMILAR |
yes |
The relationship type to use when storing results. |
|
string |
score |
yes |
The property to use when storing results. |
|
long[] |
null |
yes |
The ids of items from which we need to compute similarities. Defaults to all the items provided in the |
|
long[] |
null |
yes |
The ids of items to which we need to compute similarities. Defaults to all the items provided in the |
Name | Type | Description |
---|---|---|
|
int |
The number of nodes passed in. |
|
int |
The number of pairs of similar nodes computed. |
|
boolean |
Indicates whether results were stored. |
|
string |
The relationship type used when storing results. |
|
string |
The property used when storing results. |
|
double |
The minimum similarity score computed. |
|
double |
The maximum similarity score computed. |
|
double |
The mean of similarities scores computed. |
|
double |
The standard deviation of similarities scores computed. |
|
double |
The 25 percentile of similarities scores computed. |
|
double |
The 50 percentile of similarities scores computed. |
|
double |
The 75 percentile of similarities scores computed. |
|
double |
The 90 percentile of similarities scores computed. |
|
double |
The 95 percentile of similarities scores computed. |
|
double |
The 99 percentile of similarities scores computed. |
|
double |
The 99.9 percentile of similarities scores computed. |
|
double |
The 100 percentile of similarities scores computed. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
|
list or string |
null |
no |
A list of maps of the following structure: |
|
int |
0 |
yes |
The number of similar pairs to return. If |
|
int |
0 |
yes |
The number of similar values to return per node. If |
|
int |
-1 |
yes |
The threshold for euclidean distance. Values above this will not be returned. |
|
int |
0 |
yes |
The threshold for the number of items in the |
|
double |
algo.NaN() |
yes |
Value to skip when executing similarity computation. A value of |
|
int |
available CPUs |
yes |
The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency'. |
|
int |
value of 'concurrency' |
yes |
The number of concurrent threads used for reading the graph. |
|
string |
dense |
yes |
The graph name ('dense' or 'cypher'). |
|
long[] |
null |
yes |
The ids of items from which we need to compute similarities. Defaults to all the items provided in the |
|
long[] |
null |
yes |
The ids of items to which we need to compute similarities. Defaults to all the items provided in the |
Name | Type | Description |
---|---|---|
|
int |
The ID of one node in the similarity pair. |
|
int |
The ID of other node in the similarity pair. |
|
int |
The size of the |
|
int |
The size of the |
|
int |
The number of intersecting values in the two nodes |
|
int |
The euclidean distance between the two nodes. |