7.5.3. The Pearson Similarity algorithm

This section describes the Pearson Similarity algorithm in the Neo4j Labs Graph Algorithms library.

Pearson similarity is the covariance of the two n-dimensional vectors divided by the product of their standard deviations.

 The Pearson Similarity algorithm was developed by the Neo4j Labs team and is not officially supported.

This section includes:

7.5.3.1. History and explanation

Pearson similarity is computed using the following formula: Values range between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar.

The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation and are therefore more appropriate for computing similarities on bigger datasets.

7.5.3.2. Use-cases - when to use the Pearson Similarity algorithm

We can use the Pearson Similarity algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query. For example, to get movie recommendations based on the preferences of users who have given similar ratings to other movies that you’ve seen.

7.5.3.3. Pearson Similarity algorithm function sample

The Pearson Similarity function computes the similarity of two lists of numbers.

 Pearson Similarity is only calculated over non-NULL dimensions. When calling the function, we should provide lists that contain the overlapping items.

We can use it to compute the similarity of two hardcoded lists.

The following will return the Pearson similarity of two lists of numbers:

RETURN algo.similarity.pearson([5,8,7,5,4,9], [7,8,6,6,4,5]) AS similarity

Table 7.110. Results
similarity

0.28767798089123053

We can also use it to compute the similarity of nodes based on lists computed by a Cypher query.

The following will create a sample graph:

MERGE (home_alone:Movie {name:'Home Alone'})
MERGE (matrix:Movie {name:'The Matrix'})
MERGE (good_men:Movie {name:'A Few Good Men'})
MERGE (top_gun:Movie {name:'Top Gun'})
MERGE (jerry:Movie {name:'Jerry Maguire'})
MERGE (gruffalo:Movie {name:'The Gruffalo'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (zhen)-[:RATED {score: 2}]->(home_alone)
MERGE (zhen)-[:RATED {score: 2}]->(good_men)
MERGE (zhen)-[:RATED {score: 3}]->(matrix)
MERGE (zhen)-[:RATED {score: 6}]->(jerry)

MERGE (praveena)-[:RATED {score: 6}]->(home_alone)
MERGE (praveena)-[:RATED {score: 7}]->(good_men)
MERGE (praveena)-[:RATED {score: 8}]->(matrix)
MERGE (praveena)-[:RATED {score: 9}]->(jerry)

MERGE (michael)-[:RATED {score: 7}]->(home_alone)
MERGE (michael)-[:RATED {score: 9}]->(good_men)
MERGE (michael)-[:RATED {score: 3}]->(jerry)
MERGE (michael)-[:RATED {score: 4}]->(top_gun)

MERGE (arya)-[:RATED {score: 8}]->(top_gun)
MERGE (arya)-[:RATED {score: 1}]->(matrix)
MERGE (arya)-[:RATED {score: 10}]->(jerry)
MERGE (arya)-[:RATED {score: 10}]->(gruffalo)

MERGE (karin)-[:RATED {score: 9}]->(top_gun)
MERGE (karin)-[:RATED {score: 7}]->(matrix)
MERGE (karin)-[:RATED {score: 7}]->(home_alone)
MERGE (karin)-[:RATED {score: 9}]->(gruffalo)

The following will return the Pearson similarity of Arya and Karin:

MATCH (p1:Person {name: 'Arya'})-[rated:RATED]->(movie)
WITH p1, algo.similarity.asVector(movie, rated.score) AS p1Vector
MATCH (p2:Person {name: 'Karin'})-[rated:RATED]->(movie)
WITH p1, p2, p1Vector, algo.similarity.asVector(movie, rated.score) AS p2Vector
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.pearson(p1Vector, p2Vector, {vectorType: "maps"}) AS similarity

Table 7.111. Results
from to similarity

"Arya"

"Karin"

0.8194651785206903

In this example, we pass in vectorType: "maps" as an extra parameter, as well as using the algo.similarity.asVector function to construct a vector of maps containing each movie and the corresponding rating. We do this because the Pearson Similarity algorithm needs to compute the average of all the movies that a user has reviewed, not just the ones that they have in common with the user we’re comparing them to. We can’t therefore just pass in collections of the ratings of movies that have been reviewed by both people.

The following will return the Pearson similarity of Arya and other people that have rated at least one movie:

MATCH (p1:Person {name: 'Arya'})-[rated:RATED]->(movie)
WITH p1, algo.similarity.asVector(movie, rated.score) AS p1Vector
MATCH (p2:Person)-[rated:RATED]->(movie) WHERE p2 <> p1
WITH p1, p2, p1Vector, algo.similarity.asVector(movie, rated.score) AS p2Vector
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.pearson(p1Vector, p2Vector, {vectorType: "maps"}) AS similarity
ORDER BY similarity DESC

Table 7.112. Results
from to similarity

"Arya"

"Karin"

0.8194651785206903

"Arya"

"Zhen"

0.4839533792540704

"Arya"

"Praveena"

0.09262336892949784

"Arya"

"Michael"

-0.9551953674747637

7.5.3.4. Pearson Similarity algorithm procedures sample

The Pearson Similarity procedure computes similarity between all pairs of items. It is a symmetrical algorithm, which means that the result from computing the similarity of Item A to Item B is the same as computing the similarity of Item B to Item A. We can therefore compute the score for each pair of nodes once. We don’t compute the similarity of items to themselves.

The number of computations is ((# items)^2 / 2) - # items, which can be very computationally expensive if we have a lot of items.

 Pearson Similarity is only calculated over non-NULL dimensions. The procedures expect to receive the same length lists for all items, so we need to pad those lists with algo.NaN() where necessary.

The following will create a sample graph:

MERGE (home_alone:Movie {name:'Home Alone'})
MERGE (matrix:Movie {name:'The Matrix'})
MERGE (good_men:Movie {name:'A Few Good Men'})
MERGE (top_gun:Movie {name:'Top Gun'})
MERGE (jerry:Movie {name:'Jerry Maguire'})
MERGE (gruffalo:Movie {name:'The Gruffalo'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (zhen)-[:RATED {score: 2}]->(home_alone)
MERGE (zhen)-[:RATED {score: 2}]->(good_men)
MERGE (zhen)-[:RATED {score: 3}]->(matrix)
MERGE (zhen)-[:RATED {score: 6}]->(jerry)

MERGE (praveena)-[:RATED {score: 6}]->(home_alone)
MERGE (praveena)-[:RATED {score: 7}]->(good_men)
MERGE (praveena)-[:RATED {score: 8}]->(matrix)
MERGE (praveena)-[:RATED {score: 9}]->(jerry)

MERGE (michael)-[:RATED {score: 7}]->(home_alone)
MERGE (michael)-[:RATED {score: 9}]->(good_men)
MERGE (michael)-[:RATED {score: 3}]->(jerry)
MERGE (michael)-[:RATED {score: 4}]->(top_gun)

MERGE (arya)-[:RATED {score: 8}]->(top_gun)
MERGE (arya)-[:RATED {score: 1}]->(matrix)
MERGE (arya)-[:RATED {score: 10}]->(jerry)
MERGE (arya)-[:RATED {score: 10}]->(gruffalo)

MERGE (karin)-[:RATED {score: 9}]->(top_gun)
MERGE (karin)-[:RATED {score: 7}]->(matrix)
MERGE (karin)-[:RATED {score: 7}]->(home_alone)
MERGE (karin)-[:RATED {score: 9}]->(gruffalo)

The following will return a stream of node pairs along with their Pearson similarities:

MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data)
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC

Table 7.113. Results
from to similarity

"Zhen"

"Praveena"

0.8865926413116155

"Zhen"

"Karin"

0.8320502943378437

"Arya"

"Karin"

0.8194651785206903

"Zhen"

"Arya"

0.4839533792540704

"Praveena"

"Karin"

0.4472135954999579

"Praveena"

"Arya"

0.09262336892949784

"Praveena"

"Michael"

-0.788492846568306

"Zhen"

"Michael"

-0.9091365607973364

"Michael"

"Arya"

-0.9551953674747637

"Michael"

"Karin"

-0.9863939238321437

Zhen and Praveena are the most similar with a score of 0.88. The maximum score is 1.0 We also have 4 pairs of users who are not similar at all. We’d probably want to filter those out, which we can do by passing in the similarityCutoff parameter.

The following will return a stream of node pairs that have a similarity of at least 0.1, along with their Pearson similarities:

MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {similarityCutoff: 0.1})
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC

Table 7.114. Results
from to similarity

"Zhen"

"Praveena"

0.8865926413116155

"Zhen"

"Karin"

0.8320502943378437

"Arya"

"Karin"

0.8194651785206903

"Zhen"

"Arya"

0.4839533792540704

"Praveena"

"Karin"

0.4472135954999579

We can see that those users with no similarity have been filtered out. If we’re implementing a k-Nearest Neighbors type query we might instead want to find the most similar k users for a given user. We can do that by passing in the topK parameter.

The following will return a stream of users along with the most similar user to them (i.e. k=1):

MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {topK:1, similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC

Table 7.115. Results
from to similarity

"Zhen"

"Praveena"

0.8865926413116155

"Praveena"

"Zhen"

0.8865926413116155

"Karin"

"Zhen"

0.8320502943378437

"Arya"

"Karin"

0.8194651785206903

These results will not necessarily be symmetrical. For example, the person most similar to Arya is Karin, but the person most similar to Karin is Zhen.

Table 7.116. Parameters
Name Type Default Optional Description

data

list

null

no

A list of maps of the following structure: {item: nodeId, weights: [weight, weight, weight]}.

top

int

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

int

0

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

similarityCutoff

int

-1

yes

The threshold for cosine similarity. Values below this will not be returned.

degreeCutoff

int

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

concurrency

int

available CPUs

yes

The number of concurrent threads.

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

writeConcurrency

int

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

Table 7.117. Results
Name Type Description

item1

int

The ID of one node in the similarity pair.

item2

int

The ID of other node in the similarity pair.

count1

int

The size of the targets list of one node.

count2

int

The size of the targets list of other node.

intersection

int

The number of intersecting values in the two nodes targets lists.

similarity

int

The cosine similarity of the two nodes.

The following will find the most similar user for each user, and store a relationship between those users:

MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

Table 7.118. Results
nodes similarityPairs write writeRelationshipType writeProperty min max mean p95

5

5

TRUE

"SIMILAR"

"score"

0.5198593139648438

1.0000038146972656

0.8981185913085937

1.0000038146972656

We then could write a query to find out which are the movies that other people similar to us liked.

The following will find the most similar user to Praveena, and return their favourite movies that Praveena didn’t (yet!) rate:

MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
(other)-[r:RATED]->(movie)
WHERE not((p)-[:RATED]->(movie)) and r.score >= 8
RETURN movie.name AS movie

Table 7.119. Results
movie

The Matrix

7.5.3.5. Specifying source and target ids

Sometimes, we don’t want to compute all pairs similarity, but would rather specify subsets of items to compare to each other. We do this using the sourceIds and targetIds keys in the config.

We could use this technique to compute the similarity of a subset of items to all other items.

The following will find the most similar person (i.e. k=1) to Arya and Praveena:

MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), name: p.name, weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as personCuisines

// create sourceIds list containing ids for Praveena and Arya
WITH personCuisines,
[value in personCuisines WHERE value.name IN ["Praveena", "Arya"] | value.item ] AS sourceIds

CALL algo.similarity.pearson.stream(personCuisines, {sourceIds: sourceIds, topK: 1})
YIELD item1, item2, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN from.name AS from, to.name AS to, similarity
ORDER BY similarity DESC

Table 7.120. Results
from to similarity

Praveena

Zhen

0.8865926413116155

Arya

Karin

0.8194651785206903

7.5.3.6. Skipping values

By default the skipValue parameter is algo.Nan(). The algorithm checks every value against the skipValue to determine whether that value should be considered as part of the similarity result. For cases where no values should be skipped, skipping can be disabled by setting skipValue to null.

The following will create a sample graph:

MERGE (home_alone:Movie {name:'Home Alone'})    SET home_alone.embedding = [0.71, 0.33, 0.81, 0.52, 0.41]
MERGE (matrix:Movie {name:'The Matrix'})        SET matrix.embedding = [0.31, 0.72, 0.58, 0.67, 0.31]
MERGE (good_men:Movie {name:'A Few Good Men'})  SET good_men.embedding = [0.43, 0.26, 0.98, 0.51, 0.76]
MERGE (top_gun:Movie {name:'Top Gun'})          SET top_gun.embedding = [0.12, 0.23, 0.35, 0.31, 0.3]
MERGE (jerry:Movie {name:'Jerry Maguire'})      SET jerry.embedding = [0.47, 0.98, 0.81, 0.72, 0]

The following will find the similarity between movies based on the embedding property:

MATCH (m:Movie)
WITH {item:id(m), weights: m.embedding} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {skipValue: null})
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC

7.5.3.7. Cypher projection

If the similarity lists are very large they can take up a lot of memory. For cases where those lists contain lots of values that should be skipped, you can use the less memory-intensive approach of using Cypher statements to project the graph instead.

The Cypher loader expects to receive 3 fields:

• item - should contain node ids, which we can return using the id function.
• category - should contain node ids, which we can return using the id function.
• weight - should contain a double value.

Set graph:'cypher' in the config:

WITH "MATCH (person:Person)-[rated:RATED]->(c)
RETURN id(person) AS item, id(c) AS category, rated.score AS weight" AS query
CALL algo.similarity.pearson(query, {
graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

7.5.3.8. Syntax

The following will run the algorithm and write back results:

CALL algo.similarity.pearson(userData:List<Map>> or String, {
topK: 1, similarityCutoff: 0.1, write:true, writeProperty: "pearsonSimilarity"
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100

Table 7.121. Parameters
Name Type Default Optional Description

data

list or string

null

no

A list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

top

int

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

int

0

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

similarityCutoff

int

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

int

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

skipValue

double

algo.NaN()

yes

Value to skip when executing similarity computation. A value of null means that skipping is disabled.

concurrency

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'.

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

writeConcurrency

int

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

graph

string

dense

yes

The graph name ('dense' or 'cypher').

write

boolean

false

yes

Indicates whether results should be stored.

writeBatchSize

int

10000

yes

The batch size to use when storing results.

writeRelationshipType

string

SIMILAR

yes

The relationship type to use when storing results.

writeProperty

string

score

yes

The property to use when storing results.

sourceIds

long[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the data parameter.

targetIds

long[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the data parameter.

Table 7.122. Results
Name Type Description

nodes

int

The number of nodes passed in.

similarityPairs

int

The number of pairs of similar nodes computed.

write

boolean

Indicates whether results were stored.

writeRelationshipType

string

The relationship type used when storing results.

writeProperty

string

The property used when storing results.

min

double

The minimum similarity score computed.

max

double

The maximum similarity score computed.

mean

double

The mean of similarities scores computed.

stdDev

double

The standard deviation of similarities scores computed.

p25

double

The 25 percentile of similarities scores computed.

p50

double

The 50 percentile of similarities scores computed.

p75

double

The 75 percentile of similarities scores computed.

p90

double

The 90 percentile of similarities scores computed.

p95

double

The 95 percentile of similarities scores computed.

p99

double

The 99 percentile of similarities scores computed.

p999

double

The 99.9 percentile of similarities scores computed.

p100

double

The 100 percentile of similarities scores computed.

The following will run the algorithm and stream results:

CALL algo.similarity.pearson.stream(userData:List<Map>> or String, {
degreeCutoff: 10, similarityCutoff: 0.1, concurrency:4
})
YIELD item1, item2, count1, count2, intersection, similarity

Table 7.123. Parameters
Name Type Default Optional Description

data

list or string

null

no

A list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

top

int

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

int

0

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

similarityCutoff

int

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

int

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

skipValue

double

algo.NaN()

yes

Value to skip when executing similarity computation. A value of null means that skipping is disabled.

concurrency

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency'.

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

writeConcurrency

int

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

graph

string

dense

yes

The graph name ('dense' or 'cypher').

sourceIds

long[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the data parameter.

targetIds

long[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the data parameter.

Table 7.124. Results
Name Type Description

item1

int

The ID of one node in the similarity pair.

item2

int

The ID of other node in the similarity pair.

count1

int

The size of the targets list of one node.

count2

int

The size of the targets list of other node.

intersection

int

The number of intersecting values in the two nodes targets lists.

similarity

int

The cosine similarity of the two nodes.