# Cosine Similarity

This section describes the Cosine Similarity algorithm in the Neo4j Graph Data Science library.

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors' lengths (or magnitudes).

This algorithm is in the alpha tier. For more information on algorithm tiers, see Algorithms.

## 1. History and explanation

Cosine similarity is computed using the following formula: Values range between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar.

The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation and are therefore more appropriate for computing similarities on bigger datasets.

## 2. Use-cases - when to use the Cosine Similarity algorithm

We can use the Cosine Similarity algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query. For example, to get movie recommendations based on the preferences of users who have given similar ratings to other movies that you’ve seen.

## 3. Syntax

The following will create an anonymous graph to run the algorithm on and write back results:
``````CALL gds.alpha.similarity.cosine.write(configuration: Map)
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100``````
Table 1. Parameters
Name Type Default Optional Description

configuration

Map

n/a

no

Algorithm-specific configuration.

Table 2. Configuration
Name Type Default Optional Description

data

String[]

null

no

A list of maps of the following structure: `{item: nodeId, weights: [double, double, double]}` or a Cypher query.

top

Integer

0

yes

The number of similar pairs to return. If `0`, it will return as many as it finds.

topK

Integer

3

yes

The number of similar values to return per node. If `0`, it will return as many as it finds.

similarityCutoff

Integer

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

Integer

0

yes

The threshold for the number of items in the `targets` list. If the list contains less than this amount, that node will be excluded from the calculation.

skipValue

Float

gds.util.NaN()

yes

Value to skip when executing similarity computation. A value of `null` means that skipping is disabled.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

graph

String

dense

yes

The graph type ('dense' or 'cypher').

writeBatchSize

Integer

10000

yes

The batch size to use when storing results.

writeRelationshipType

String

SIMILAR

yes

The relationship type to use when storing results.

writeProperty

String

score

yes

The property to use when storing results.

sourceIds

Integer[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

targetIds

Integer[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

Table 3. Results
Name Type Description

nodes

Integer

The number of nodes passed in.

similarityPairs

Integer

The number of pairs of similar nodes computed.

writeRelationshipType

String

The relationship type used when storing results.

writeProperty

String

The property used when storing results.

min

Float

The minimum similarity score computed.

max

Float

The maximum similarity score computed.

mean

Float

The mean of similarities scores computed.

stdDev

Float

The standard deviation of similarities scores computed.

p25

Float

The 25 percentile of similarities scores computed.

p50

Float

The 50 percentile of similarities scores computed.

p75

Float

The 75 percentile of similarities scores computed.

p90

Float

The 90 percentile of similarities scores computed.

p95

Float

The 95 percentile of similarities scores computed.

p99

Float

The 99 percentile of similarities scores computed.

p999

Float

The 99.9 percentile of similarities scores computed.

p100

Float

The 100 percentile of similarities scores computed.

The following will create an anonymous graph to run the algorithm on and stream results:
``````CALL gds.alpha.similarity.cosine.stream(configuration: Map)
YIELD item1, item2, count1, count2, intersection, similarity``````
Table 4. Parameters
Name Type Default Optional Description

configuration

Map

n/a

no

Algorithm-specific configuration.

Table 5. Configuration
Name Type Default Optional Description

data

String[]

null

no

A list of maps of the following structure: `{item: nodeId, weights: [double, double, double]}` or a Cypher query.

top

Integer

0

yes

The number of similar pairs to return. If `0`, it will return as many as it finds.

topK

Integer

3

yes

The number of similar values to return per node. If `0`, it will return as many as it finds.

similarityCutoff

Integer

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

Integer

0

yes

The threshold for the number of items in the `targets` list. If the list contains less than this amount, that node will be excluded from the calculation.

skipValue

Float

null

yes

Value to skip when executing similarity computation. A value of `null` means that skipping is disabled.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

graph

String

dense

yes

The graph type ('dense' or 'cypher').

sourceIds

Integer[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

targetIds

Integer[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

Table 6. Results
Name Type Description

item1

Integer

The ID of one node in the similarity pair.

item2

Integer

The ID of other node in the similarity pair.

count1

Integer

The size of the `targets` list of one node.

count2

Integer

The size of the `targets` list of other node.

intersection

Integer

The number of intersecting values in the two nodes `targets` lists.

similarity

Integer

The cosine similarity of the two nodes.

## 4. Cosine Similarity algorithm function sample

The Cosine Similarity function computes the similarity of two lists of numbers.

 Cosine Similarity is only calculated over non-NULL dimensions. When calling the function, we should provide lists that contain the overlapping items.

We can use it to compute the similarity of two hardcoded lists.

The following will return the cosine similarity of two lists of numbers:
``RETURN gds.alpha.similarity.cosine([3,8,7,5,2,9], [10,8,6,6,4,5]) AS similarity``
Table 7. Results
`similarity`

0.8638935626791597

These two lists of numbers have a Cosine similarity of 0.863. We can see how this result is derived by breaking down the formula: We can also use it to compute the similarity of nodes based on lists computed by a Cypher query.

The following will create a sample graph:
``````CREATE (french:Cuisine {name:'French'})
CREATE (italian:Cuisine {name:'Italian'})
CREATE (indian:Cuisine {name:'Indian'})
CREATE (lebanese:Cuisine {name:'Lebanese'})
CREATE (portuguese:Cuisine {name:'Portuguese'})
CREATE (british:Cuisine {name:'British'})
CREATE (mauritian:Cuisine {name:'Mauritian'})

CREATE (zhen:Person {name: "Zhen"})
CREATE (praveena:Person {name: "Praveena"})
CREATE (michael:Person {name: "Michael"})
CREATE (arya:Person {name: "Arya"})
CREATE (karin:Person {name: "Karin"})

CREATE (praveena)-[:LIKES {score: 9}]->(indian)
CREATE (praveena)-[:LIKES {score: 7}]->(portuguese)
CREATE (praveena)-[:LIKES {score: 8}]->(british)
CREATE (praveena)-[:LIKES {score: 1}]->(mauritian)

CREATE (zhen)-[:LIKES {score: 10}]->(french)
CREATE (zhen)-[:LIKES {score: 6}]->(indian)
CREATE (zhen)-[:LIKES {score: 2}]->(british)

CREATE (michael)-[:LIKES {score: 8}]->(french)
CREATE (michael)-[:LIKES {score: 7}]->(italian)
CREATE (michael)-[:LIKES {score: 9}]->(indian)
CREATE (michael)-[:LIKES {score: 3}]->(portuguese)

CREATE (arya)-[:LIKES {score: 10}]->(lebanese)
CREATE (arya)-[:LIKES {score: 10}]->(italian)
CREATE (arya)-[:LIKES {score: 7}]->(portuguese)
CREATE (arya)-[:LIKES {score: 9}]->(mauritian)

CREATE (karin)-[:LIKES {score: 9}]->(lebanese)
CREATE (karin)-[:LIKES {score: 7}]->(italian)
CREATE (karin)-[:LIKES {score: 10}]->(portuguese)``````
The following will return the Cosine similarity of Michael and Arya:
``````MATCH (p1:Person {name: 'Michael'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person {name: "Arya"})-[likes2:LIKES]->(cuisine)
RETURN p1.name AS from,
p2.name AS to,
gds.alpha.similarity.cosine(collect(likes1.score), collect(likes2.score)) AS similarity``````
Table 8. Results
`from` `to` `similarity`

"Michael"

"Arya"

0.9788908326303921

The following will return the Cosine similarity of Michael and the other people that have a cuisine in common:
``````MATCH (p1:Person {name: 'Michael'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person)-[likes2:LIKES]->(cuisine) WHERE p2 <> p1
RETURN p1.name AS from,
p2.name AS to,
gds.alpha.similarity.cosine(collect(likes1.score), collect(likes2.score)) AS similarity
ORDER BY similarity DESC``````
Table 9. Results
`from` `to` `similarity`

"Michael"

"Arya"

0.9788908326303921

"Michael"

"Zhen"

0.9542262139256075

"Michael"

"Praveena"

0.9429903335828894

"Michael"

"Karin"

0.8498063272285821

## 5. Cosine Similarity algorithm procedures examples

The Cosine Similarity procedure computes similarity between all pairs of items. It is a symmetrical algorithm, which means that the result from computing the similarity of Item A to Item B is the same as computing the similarity of Item B to Item A. We can therefore compute the score for each pair of nodes once. We don’t compute the similarity of items to themselves.

The number of computations is `((# items)^2 / 2) - # items`, which can be very computationally expensive if we have a lot of items.

 Cosine Similarity is only calculated over non-NULL dimensions. The procedures expect to receive the same length lists for all items. Otherwise, longer lists will be trimmed to the length of the shortest list.
The following will create a sample graph:
``````CREATE (french:Cuisine {name:'French'})
CREATE (italian:Cuisine {name:'Italian'})
CREATE (indian:Cuisine {name:'Indian'})
CREATE (lebanese:Cuisine {name:'Lebanese'})
CREATE (portuguese:Cuisine {name:'Portuguese'})
CREATE (british:Cuisine {name:'British'})
CREATE (mauritian:Cuisine {name:'Mauritian'})

CREATE (zhen:Person {name: "Zhen"})
CREATE (praveena:Person {name: "Praveena"})
CREATE (michael:Person {name: "Michael"})
CREATE (arya:Person {name: "Arya"})
CREATE (karin:Person {name: "Karin"})

CREATE (praveena)-[:LIKES {score: 9}]->(indian)
CREATE (praveena)-[:LIKES {score: 7}]->(portuguese)
CREATE (praveena)-[:LIKES {score: 8}]->(british)
CREATE (praveena)-[:LIKES {score: 1}]->(mauritian)

CREATE (zhen)-[:LIKES {score: 10}]->(french)
CREATE (zhen)-[:LIKES {score: 6}]->(indian)
CREATE (zhen)-[:LIKES {score: 2}]->(british)

CREATE (michael)-[:LIKES {score: 8}]->(french)
CREATE (michael)-[:LIKES {score: 7}]->(italian)
CREATE (michael)-[:LIKES {score: 9}]->(indian)
CREATE (michael)-[:LIKES {score: 3}]->(portuguese)

CREATE (arya)-[:LIKES {score: 10}]->(lebanese)
CREATE (arya)-[:LIKES {score: 10}]->(italian)
CREATE (arya)-[:LIKES {score: 7}]->(portuguese)
CREATE (arya)-[:LIKES {score: 9}]->(mauritian)

CREATE (karin)-[:LIKES {score: 9}]->(lebanese)
CREATE (karin)-[:LIKES {score: 7}]->(italian)
CREATE (karin)-[:LIKES {score: 10}]->(portuguese)``````

### 5.1. Stream

The following will return a stream of node pairs along with their Cosine similarities:
``````MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stream({data: data})
YIELD item1, item2, count1, count2, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY similarity DESC``````
Table 10. Results
`from` `to` `similarity`

"Praveena"

"Karin"

1.0

"Michael"

"Arya"

0.9788908326303921

"Arya"

"Karin"

0.9610904115204073

"Zhen"

"Michael"

0.9542262139256075

"Praveena"

"Michael"

0.9429903335828895

"Zhen"

"Praveena"

0.9191450300180579

"Michael"

"Karin"

0.8498063272285821

"Praveena"

"Arya"

0.7194014606174091

"Zhen"

"Arya"

0.0

"Zhen"

"Karin"

0.0

Praveena and Karin have the most similar food tastes, with a score of 1.0, and there are also several other pairs of users with similar tastes. The scores here are unusually high because our users haven’t liked many of the same cuisines. We also have 2 pairs of users who are not similar at all. We’d probably want to filter those out, which we can do by passing in the `similarityCutoff` parameter.

The following will return a stream of node pairs that have a similarity of at least 0.1, along with their cosine similarities:
``````MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stream({
data: data,
similarityCutoff: 0.0
})
YIELD item1, item2, count1, count2, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY similarity DESC``````
Table 11. Results
`from` `to` `similarity`

"Praveena"

"Karin"

1.0

"Michael"

"Arya"

0.9788908326303921

"Arya"

"Karin"

0.9610904115204073

"Zhen"

"Michael"

0.9542262139256075

"Praveena"

"Michael"

0.9429903335828895

"Zhen"

"Praveena"

0.9191450300180579

"Michael"

"Karin"

0.8498063272285821

"Praveena"

"Arya"

0.7194014606174091

We can see that those users with no similarity have been filtered out. If we’re implementing a k-Nearest Neighbors type query we might instead want to find the most similar `k` users for a given user. We can do that by passing in the `topK` parameter.

The following will return a stream of users along with the most similar user to them (i.e. `k=1`):
``````MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stream({
data: data,
similarityCutoff: 0.0,
topK: 1
})
YIELD item1, item2, count1, count2, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY from``````
Table 12. Results
`from` `to` `similarity`

"Arya"

"Michael"

0.9788908326303921

"Karin"

"Praveena"

1.0

"Michael"

"Arya"

0.9788908326303921

"Praveena"

"Karin"

1.0

"Zhen"

"Michael"

0.9542262139256075

These results will not be symmetrical. For example, the person most similar to Zhen is Michael, but the person most similar to Michael is Arya.

### 5.2. Write

The following will find the most similar user for each user, and store a relationship between those users:
``````MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.write({
data: data,
topK: 1,
similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95``````
Table 13. Results
`nodes` `similarityPairs` `writeRelationshipType` `writeProperty` `min` `max` `mean` `p95`

5

5

"SIMILAR"

"score"

0.9542236328125

1.0000038146972656

0.9824020385742187

1.0000038146972656

We then could write a query to find out what types of cuisine that other people similar to us might like.

The following will find the most similar user to Praveena, and return their favourite cuisines that Praveena doesn’t (yet!) like:
``````MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
(other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine``````
Table 14. Results
`cuisine`

Italian

Lebanese

### 5.3. Stats

The following will run the algorithm and returns the result in form of statistical and measurement values
``````MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stats({
data: data,
topK: 1,
similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, min, max, mean, p95
RETURN nodes, similarityPairs, min, max, mean, p95``````

## 6. Specifying source and target ids

Sometimes, we don’t want to compute all pairs similarity, but would rather specify subsets of items to compare to each other. We do this using the `sourceIds` and `targetIds` keys in the config.

We could use this technique to compute the similarity of a subset of items to all other items.

The following will find the most similar person (i.e. `k=1`) to Arya and Praveena:
``````MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), name: p.name, weights: collect(coalesce(likes.score, gds.util.NaN()))} AS userData
WITH collect(userData) AS personCuisines
WITH personCuisines,
[value in personCuisines WHERE value.name IN ["Praveena", "Arya"] | value.item ] AS sourceIds
CALL gds.alpha.similarity.cosine.stream({
data: personCuisines,
sourceIds: sourceIds,
topK: 1
})
YIELD item1, item2, similarity
WITH gds.util.asNode(item1) AS from, gds.util.asNode(item2) AS to, similarity
RETURN from.name AS from, to.name AS to, similarity
ORDER BY similarity DESC``````
Table 15. Results
`from` `to` `similarity`

Praveena

Karin

1.0

Arya

Michael

0.9788908326303921

## 7. Skipping values

The algorithm checks every value in the input vectors against the `skipValue` to determine whether that value should be considered as part of the similarity computation. Vectors of different length are padded with `NaN` values which are skipped by default. Setting a `skipValue` allows skipping an additional value. A common value to skip is `0.0`.

The following will create a sample graph storing an embedding vector for each node:
``````CREATE (french:Cuisine {name:'French'})          SET french.embedding = [0.0, 0.33, 0.81, 0.52, 0.41]
CREATE (italian:Cuisine {name:'Italian'})        SET italian.embedding = [0.31, 0.72, 0.58, 0.67, 0.31]
CREATE (indian:Cuisine {name:'Indian'})          SET indian.embedding = [0.43, 0.0, 0.98, 0.51, 0.76]
CREATE (lebanese:Cuisine {name:'Lebanese'})      SET lebanese.embedding = [0.12, 0.23, 0.35, 0.31, 0.39]
CREATE (portuguese:Cuisine {name:'Portuguese'})  SET portuguese.embedding = [0.47, 0.98, 0.0, 0.72, 0.89]
CREATE (british:Cuisine {name:'British'})        SET british.embedding = [0.94, 0.12, 0.23, 0.4, 0.71]
CREATE (mauritian:Cuisine {name:'Mauritian'})    SET mauritian.embedding = [0.31, 0.56, 0.98, 0.0, 0.62]``````
The following will find the top 3 similarities between cuisines based on the `embedding` property:
``````MATCH (c:Cuisine)
WITH {item:id(c), weights: c.embedding} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.stream({
data: data,
skipValue: 0.0
})
YIELD item1, item2, count1, count2, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY similarity DESC, from ASC
LIMIT 3``````
Table 16. Results with skipping `0.0` values:
`from` `to` `similarity`

"Mauritian"

"Portuguese"

0.9955829148132149

"Portuguese"

"Mauritian"

0.9955829148132149

"Indian"

"Portuguese"

0.9954426605601884

Without skipping `0.0` values the result would look different:

Table 17. Results without skipping `0.0` values:
`from` `to` `similarity`

"Lebanese"

"French"

0.9372771447068958

"French"

"Lebanese"

0.9372771447068958

"Indian"

"Lebanese"

0.9110882139221992

## 8. Cypher projection

Set `graph:'cypher'` in the config:
``````WITH 'MATCH (person:Person)-[likes:LIKES]->(c)
RETURN id(person) AS item, id(c) AS category, likes.score AS weight' AS query
CALL gds.alpha.similarity.cosine.write({
data: query,
graph: 'cypher',
topK: 1,
similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95``````