This section describes the Pearson Similarity algorithm in the Neo4j Graph Algorithms library.
Pearson similarity is the covariance of the two ndimensional vectors divided by the product of their standard deviations.
This section includes:
Pearson similarity is computed using the following formula:
Values range between 1 and 1, where 1 is perfectly dissimilar and 1 is perfectly similar.
The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation and are therefore more appropriate for computing similarities on bigger datasets.
Pearson similarity is only calculated over nonNULL dimensions.
When calling the function, we should provide lists that contain the overlapping items.
The procedures expect to receive the same length lists for all items, so we need to pad those lists with algo.NaN()
where necessary.
We can use the Pearson Similarity algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query. For example, to get movie recommendations based on the preferences of users who have given similar ratings to other movies that you’ve seen.
The following will return the Pearson similarity of two lists of numbers:
RETURN algo.similarity.pearson([5,8,7,5,4,9], [7,8,6,6,4,5]) AS similarity
similarity 

0.28767798089123053 
The following will create a sample graph:
MERGE (home_alone:Movie {name:'Home Alone'})
MERGE (matrix:Movie {name:'The Matrix'})
MERGE (good_men:Movie {name:'A Few Good Men'})
MERGE (top_gun:Movie {name:'Top Gun'})
MERGE (jerry:Movie {name:'Jerry Maguire'})
MERGE (gruffalo:Movie {name:'The Gruffalo'})
MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})
MERGE (zhen)[:RATED {score: 2}]>(home_alone)
MERGE (zhen)[:RATED {score: 2}]>(good_men)
MERGE (zhen)[:RATED {score: 3}]>(matrix)
MERGE (zhen)[:RATED {score: 6}]>(jerry)
MERGE (praveena)[:RATED {score: 6}]>(home_alone)
MERGE (praveena)[:RATED {score: 7}]>(good_men)
MERGE (praveena)[:RATED {score: 8}]>(matrix)
MERGE (praveena)[:RATED {score: 9}]>(jerry)
MERGE (michael)[:RATED {score: 7}]>(home_alone)
MERGE (michael)[:RATED {score: 9}]>(good_men)
MERGE (michael)[:RATED {score: 3}]>(jerry)
MERGE (michael)[:RATED {score: 4}]>(top_gun)
MERGE (arya)[:RATED {score: 8}]>(top_gun)
MERGE (arya)[:RATED {score: 1}]>(matrix)
MERGE (arya)[:RATED {score: 10}]>(jerry)
MERGE (arya)[:RATED {score: 10}]>(gruffalo)
MERGE (karin)[:RATED {score: 9}]>(top_gun)
MERGE (karin)[:RATED {score: 7}]>(matrix)
MERGE (karin)[:RATED {score: 7}]>(home_alone)
MERGE (karin)[:RATED {score: 9}]>(gruffalo)
The following will return the Pearson similarity of Arya and Karin:
MATCH (p1:Person {name: 'Arya'})[rated:RATED]>(movie)
WITH p1, algo.similarity.asVector(movie, rated.score) AS p1Vector
MATCH (p2:Person {name: 'Karin'})[rated:RATED]>(movie)
WITH p1, p2, p1Vector, algo.similarity.asVector(movie, rated.score) AS p2Vector
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.pearson(p1Vector, p2Vector, {vectorType: "maps"}) AS similarity
from 
to 
similarity 

"Arya" 
"Karin" 
0.8194651785206903 
In this example, we pass in vectorType: "maps"
as an extra parameter, as well as using the algo.similarity.asVector
function to construct a vector of maps containing each movie and the corresponding rating.
We do this because the Pearson Similarity algorithm needs to compute the average of all the movies that a user has reviewed, not just the ones that they have in common with the user we’re comparing them to.
We can’t therefore just pass in collections of the ratings of movies that have been reviewed by both people.
The following will return the Pearson similarity of Arya and other people that have rated at least one movie:
MATCH (p1:Person {name: 'Arya'})[rated:RATED]>(movie)
WITH p1, algo.similarity.asVector(movie, rated.score) AS p1Vector
MATCH (p2:Person)[rated:RATED]>(movie) WHERE p2 <> p1
WITH p1, p2, p1Vector, algo.similarity.asVector(movie, rated.score) AS p2Vector
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.pearson(p1Vector, p2Vector, {vectorType: "maps"}) AS similarity
ORDER BY similarity DESC
from 
to 
similarity 

"Arya" 
"Karin" 
0.8194651785206903 
"Arya" 
"Zhen" 
0.4839533792540704 
"Arya" 
"Praveena" 
0.09262336892949784 
"Arya" 
"Michael" 
0.9551953674747637 
The following will return a stream of node pairs along with their Pearson similarities:
MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)[rated:RATED]>(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data)
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
from 
to 
similarity 

"Zhen" 
"Praveena" 
0.8865926413116155 
"Zhen" 
"Karin" 
0.8320502943378437 
"Arya" 
"Karin" 
0.8194651785206903 
"Zhen" 
"Arya" 
0.4839533792540704 
"Praveena" 
"Karin" 
0.4472135954999579 
"Praveena" 
"Arya" 
0.09262336892949784 
"Praveena" 
"Michael" 
0.788492846568306 
"Zhen" 
"Michael" 
0.9091365607973364 
"Michael" 
"Arya" 
0.9551953674747637 
"Michael" 
"Karin" 
0.9863939238321437 
Zhen and Praveena are the most similar with a score of 0.88.
The maximum score is 1.0
We also have 4 pairs of users who are not similar at all.
We’d probably want to filter those out, which we can do by passing in the similarityCutoff
parameter.
The following will return a stream of node pairs that have a similarity of at least 0.1, along with their Pearson similarities:
MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)[rated:RATED]>(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {similarityCutoff: 0.1})
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
from 
to 
similarity 

"Zhen" 
"Praveena" 
0.8865926413116155 
"Zhen" 
"Karin" 
0.8320502943378437 
"Arya" 
"Karin" 
0.8194651785206903 
"Zhen" 
"Arya" 
0.4839533792540704 
"Praveena" 
"Karin" 
0.4472135954999579 
We can see that those users with no similarity have been filtered out.
If we’re implementing a kNearest Neighbors type query we might instead want to find the most similar k
users for a given user.
We can do that by passing in the topK
parameter.
The following will return a stream of users along with the most similar user to them (i.e. k=1
):
MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)[rated:RATED]>(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {topK:1, similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
from 
to 
similarity 

"Zhen" 
"Praveena" 
0.8865926413116155 
"Praveena" 
"Zhen" 
0.8865926413116155 
"Karin" 
"Zhen" 
0.8320502943378437 
"Arya" 
"Karin" 
0.8194651785206903 
These results will not necessarily be symmetrical. For example, the person most similar to Arya is Karin, but the person most similar to Karin is Zhen.
Name  Type  Default  Optional  Description 


list 
null 
no 
A list of maps of the following structure: 

int 
0 
yes 
The number of similar pairs to return. If 

int 
0 
yes 
The number of similar values to return per node. If 

int 
1 
yes 
The threshold for cosine similarity. Values below this will not be returned. 

int 
0 
yes 
The threshold for the number of items in the 

int 
available CPUs 
yes 
The number of concurrent threads. 
Name  Type  Description 


int 
The ID of one node in the similarity pair. 

int 
The ID of other node in the similarity pair. 

int 
The size of the 

int 
The size of the 

int 
The number of intersecting values in the two nodes 

int 
The cosine similarity of the two nodes. 
The following will find the most similar user for each user, and store a relationship between those users:
MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)[rated:RATED]>(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
nodes 
similarityPairs 
write 
writeRelationshipType 
writeProperty 
min 
max 
mean 
p95 

5 
5 
TRUE 
"SIMILAR" 
"score" 
0.5198593139648438 
1.0000038146972656 
0.8981185913085937 
1.0000038146972656 
We then could write a query to find out which are the movies that other people similar to us liked.
The following will find the most similar user to Praveena, and return their favourite movies that Praveena didn’t (yet!) rate:
MATCH (p:Person {name: "Praveena"})[:SIMILAR]>(other),
(other)[r:RATED]>(movie)
WHERE not((p)[:RATED]>(movie)) and r.score >= 8
RETURN movie.name AS movie
movie 

The Matrix 
If the similarity lists are very large they can take up a lot of memory. For cases where those lists contain lots of values that should be skipped, you can use the less memoryintensive approach of using Cypher statements to project the graph instead.
The Cypher loader expects to receive 3 fields:
item
 should contain node ids, which we can return using the id
function.
category
 should contain node ids, which we can return using the id
function.
weight
 should contain a double value.
Set graph:'cypher'
in the config:
WITH "MATCH (person:Person)[rated:RATED]>(c)
RETURN id(person) AS item, id(c) AS category, rated.score AS weight" AS query
CALL algo.similarity.pearson(query, {
graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
By default the skipValue
parameter is algo.Nan()
.
The algorithm checks every value against the skipValue
to determine whether that value should be considered as part of the similarity result.
For cases where no values should be skipped, skipping can be disabled by setting skipValue
to null
.
The following will create a sample graph:
MERGE (home_alone:Movie {name:'Home Alone'}) SET home_alone.embedding = [0.71, 0.33, 0.81, 0.52, 0.41]
MERGE (matrix:Movie {name:'The Matrix'}) SET matrix.embedding = [0.31, 0.72, 0.58, 0.67, 0.31]
MERGE (good_men:Movie {name:'A Few Good Men'}) SET good_men.embedding = [0.43, 0.26, 0.98, 0.51, 0.76]
MERGE (top_gun:Movie {name:'Top Gun'}) SET top_gun.embedding = [0.12, 0.23, 0.35, 0.31, 0.3]
MERGE (jerry:Movie {name:'Jerry Maguire'}) SET jerry.embedding = [0.47, 0.98, 0.81, 0.72, 0]
The following will find the similarity between movies based on the embedding
property:
MATCH (m:Movie)
WITH {item:id(m), weights: m.embedding} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {skipValue: null})
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
The following will run the algorithm and write back results:
CALL algo.similarity.pearson(userData:List<Map>> or String, {
topK: 1, similarityCutoff: 0.1, write:true, writeProperty: "pearsonSimilarity"
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
Name  Type  Default  Optional  Description 


list or string 
null 
no 
A list of maps of the following structure: 

int 
0 
yes 
The number of similar pairs to return. If 

int 
0 
yes 
The number of similar values to return per node. If 

int 
1 
yes 
The threshold for similarity. Values below this will not be returned. 

int 
0 
yes 
The threshold for the number of items in the 

double 
algo.NaN() 
yes 
Value to skip when executing similarity computation. A value of 

int 
available CPUs 
yes 
The number of concurrent threads. 

string 
dense 
yes 
The graph name ('dense' or 'cypher'). 

boolean 
false 
yes 
Indicates whether results should be stored. 

int 
10000 
yes 
The batch size to use when storing results. 

string 
SIMILAR 
yes 
The relationship type to use when storing results. 

string 
score 
yes 
The property to use when storing results. 
Name  Type  Description 


int 
The number of nodes passed in. 

int 
The number of pairs of similar nodes computed. 

boolean 
Indicates whether results were stored. 

string 
The relationship type used when storing results. 

string 
The property used when storing results. 

double 
The minimum similarity score computed. 

double 
The maximum similarity score computed. 

double 
The mean of similarities scores computed. 

double 
The standard deviation of similarities scores computed. 

double 
The 25 percentile of similarities scores computed. 

double 
The 50 percentile of similarities scores computed. 

double 
The 75 percentile of similarities scores computed. 

double 
The 90 percentile of similarities scores computed. 

double 
The 95 percentile of similarities scores computed. 

double 
The 99 percentile of similarities scores computed. 

double 
The 99.9 percentile of similarities scores computed. 

double 
The 100 percentile of similarities scores computed. 
The following will run the algorithm and stream results:
CALL algo.similarity.pearson.stream(userData:List<Map>> or String, {
degreeCutoff: 10, similarityCutoff: 0.1, concurrency:4
})
YIELD item1, item2, count1, count2, intersection, similarity
Name  Type  Default  Optional  Description 


list or string 
null 
no 
A list of maps of the following structure: 

int 
0 
yes 
The number of similar pairs to return. If 

int 
0 
yes 
The number of similar values to return per node. If 

int 
1 
yes 
The threshold for similarity. Values below this will not be returned. 

int 
0 
yes 
The threshold for the number of items in the 

double 
algo.NaN() 
yes 
Value to skip when executing similarity computation. A value of 

int 
available CPUs 
yes 
The number of concurrent threads. 

string 
dense 
yes 
The graph name ('dense' or 'cypher'). 
Name  Type  Description 


int 
The ID of one node in the similarity pair. 

int 
The ID of other node in the similarity pair. 

int 
The size of the 

int 
The size of the 

int 
The number of intersecting values in the two nodes 

int 
The cosine similarity of the two nodes. 