### 9.5.1. The Jaccard Similarity algorithm

This section describes the Jaccard Similarity algorithm in the Neo4j Labs Graph Algorithms library.

Jaccard Similarity (coefficient), a term coined by Paul Jaccard, measures similarities between sets. It is defined as the size of the intersection divided by the size of the union of two sets.

 The Jaccard Similarity algorithm was developed by the Neo4j Labs team and is not officially supported.

This section includes:

#### 9.5.1.1. History and explanation

Jaccard similarity is computed using the following formula:

The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation, and are therefore more appropriate for computing similarities on bigger datasets.

#### 9.5.1.2. Use-cases - when to use the Jaccard Similarity algorithm

We can use the Jaccard Similarity algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query. For example, you can use the Jaccard Similarity algorithm to show the products that were purchased by similar customers, in terms of previous products purchased.

#### 9.5.1.3. Jaccard Similarity algorithm function sample

The Jaccard similarity function computes the similarity of two lists of numbers.

We can use it to compute the similarity of two hardcoded lists.

The following will return the Jaccard similarity of two lists of numbers:

``RETURN algo.similarity.jaccard([1,2,3], [1,2,4,5]) AS similarity``

Table 9.83. Results
`similarity`

0.4

These two lists of numbers have a Jaccard similarity of 0.4. We can see how this result is derived by breaking down the formula:

``````J(A,B) = ∣A ∩ B∣ / ∣A∣ + ∣B∣ - ∣A ∩ B|
J(A,B) = 2 / 3 + 4 - 2
= 2 / 5
= 0.4``````

We can also use it to compute the similarity of nodes based on lists computed by a Cypher query.

The following will create a sample graph:

``````MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)``````

The following will return the Jaccard similarity of Karin and Arya:

``````MATCH (p1:Person {name: 'Karin'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person {name: "Arya"})-[:LIKES]->(cuisine2)
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.jaccard(p1Cuisine, p2Cuisine) AS similarity``````

Table 9.84. Results
`from` `to` `similarity`

"Karin"

"Arya"

0.66

The following will return the Jaccard similarity of Karin and the other people that have a cuisine in common:

``````MATCH (p1:Person {name: 'Karin'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person)-[:LIKES]->(cuisine2) WHERE p1 <> p2
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.jaccard(p1Cuisine, p2Cuisine) AS similarity
ORDER BY similarity DESC``````

Table 9.85. Results
`from` `to` `similarity`

"Karin"

"Arya"

0.66

"Karin"

"Michael"

0.25

"Karin"

"Praveena"

0.0

"Karin"

"Zhen"

0.0

#### 9.5.1.4. Jaccard Similarity algorithm procedures sample

The Jaccard Similarity procedure computes similarity between all pairs of items. It is a symmetrical algorithm, which means that the result from computing the similarity of Item A to Item B is the same as computing the similarity of Item B to Item A. We can therefore compute the score for each pair of nodes once. We don’t compute the similarity of items to themselves.

The number of computations is `((# items)^2 / 2) - # items`, which can be very computationally expensive if we have a lot of items.

The following will create a sample graph:

``````MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (shrimp:Recipe {title: "Shrimp Bolognese"})
MERGE (saltimbocca:Recipe {title: "Saltimbocca alla roman"})
MERGE (periperi:Recipe {title: "Peri Peri Naan"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)

MERGE (shrimp)-[:TYPE]->(italian)
MERGE (shrimp)-[:TYPE]->(indian)

MERGE (saltimbocca)-[:TYPE]->(italian)
MERGE (saltimbocca)-[:TYPE]->(french)

MERGE (periperi)-[:TYPE]->(portuguese)
MERGE (periperi)-[:TYPE]->(indian)``````

The following will return a stream of node pairs along with their intersection and Jaccard similarities:

``````MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, intersection, similarity
ORDER BY similarity DESC``````

Table 9.86. Results
From To Intersection Similarity

Arya

Karin

2

0.66

Zhen

Michael

2

0.66

Zhen

Praveena

1

0.33

Michael

Karin

1

0.25

Praveena

Michael

1

0.25

Praveena

Arya

1

0.25

Michael

Arya

1

0.2

Praveena

Karin

0

0

Zhen

Arya

0

0

Zhen

Karin

0

0

Arya and Karin, and Zhen and Michael have the most similar food preferences, with two overlapping cuisines for a similarity of 0.66. We also have 3 pairs of users who are not similar at all. We’d probably want to filter those out, which we can do by passing in the `similarityCutoff` parameter.

The following will return a stream of node pairs that have a similarity of at least 0.1, along with their intersection and Jaccard similarities:

``````MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, intersection, similarity
ORDER BY similarity DESC``````

Table 9.87. Results
`from` `to` `intersection` `similarity`

Arya

Karin

2

0.66

Zhen

Michael

2

0.66

Zhen

Praveena

1

0.33

Michael

Karin

1

0.25

Praveena

Michael

1

0.25

Praveena

Arya

1

0.25

Michael

Arya

1

0.2

We can see that those users with no similarity have been filtered out. If we’re implementing a k-Nearest Neighbors type query we might instead want to find the most similar `k` users for a given user. We can do that by passing in the `topK` parameter.

The following will return a stream of users along with the most similar user to them (i.e. `k=1`):

``````MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY from``````

Table 9.88. Results
`from` `to` `similarity`

Arya

Karin

0.66

Karin

Arya

0.66

Michael

Zhen

0.66

Praveena

Zhen

0.33

Zhen

Michael

0.66

These results will not be symmetrical. For example, the person most similar to Praveena is Zhen, but the person most similar to Zhen is actually Michael.

The following will find the most similar user for each user, and store a relationship between those users:

``````MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95``````

Table 9.89. Results
`nodes` `similarityPairs` `write` `writeRelationshipType` `writeProperty` `min` `max` `mean` `p95`

5

5

true

SIMILAR

score

0.33

0.66

0.59

0.66

We then could write a query to find out what types of cuisine that other people similar to us might like.

The following will find the most similar user to Praveena, and return their favorite cuisines that Praveena doesn’t (yet!) like:

``````MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
(other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine``````

Table 9.90. Results
`cuisine`

French

#### 9.5.1.5. Specifying source and target ids

Sometimes, we don’t want to compute all pairs similarity, but would rather specify subsets of items to compare to each other. We do this using the `sourceIds` and `targetIds` keys in the config.

We might want to use this technique when comparing nodes with different labels that intersect on a common label. In our sample dataset, recipes and people both have relationships to cuisines, which means we have a way of computing similarities between recipes and people.

The following will find similarities between recipes and people, based on the cuisines that they have in common:

``````// compute categories for recipes
MATCH (recipe:Recipe)-[:TYPE]->(cuisine)
WITH {item:id(recipe), categories: collect(id(cuisine))} as data
WITH collect(data) AS recipeCuisines

// compute categories for people
MATCH (person:Person)-[:LIKES]->(cuisine)
WITH recipeCuisines, {item:id(person), categories: collect(id(cuisine))} as data
WITH recipeCuisines, collect(data) AS personCuisines

// create sourceIds and targetIds lists
WITH recipeCuisines, personCuisines,
[value in recipeCuisines | value.item] AS sourceIds,
[value in personCuisines | value.item] AS targetIds

CALL algo.similarity.jaccard.stream(recipeCuisines + personCuisines, {sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN from.title AS from, to.name AS to, similarity
ORDER BY similarity DESC
LIMIT 10``````

Table 9.91. Results
`from` `to` `similarity`

Peri Peri Naan

Praveena

1.0

Shrimp Bolognese

Michael

0.6666666666666666

Saltimbocca alla roman

Michael

0.6666666666666666

Peri Peri Naan

Zhen

0.3333333333333333

Shrimp Bolognese

Karin

0.3333333333333333

Shrimp Bolognese

Praveena

0.3333333333333333

Saltimbocca alla roman

Zhen

0.3333333333333333

Shrimp Bolognese

Zhen

0.3333333333333333

Saltimbocca alla roman

Karin

0.3333333333333333

Peri Peri Naan

Arya

0.25

The Peri Peri Naan and Praveena have a perfect score of 1.0 because they overlap via Portuguese and Indian cuisines. Michael likes French, Indian, and Italian food, and two of the fusion recipes combine two of those cuisines, giving us a score of 0.66.

We could also use this technique to compute the similarity of a subset of items to all other items.

The following will find the most similar person (i.e. `k=1`) to Arya and Praveena:

``````MATCH (person:Person)-[:LIKES]->(cuisine)
WITH {item:id(person), name: person.name, categories: collect(id(cuisine))} as data
WITH collect(data) AS personCuisines

// create sourceIds list containing ids for Praveena and Arya
WITH personCuisines,
[value in personCuisines WHERE value.name IN ["Praveena", "Arya"] | value.item ] AS sourceIds

CALL algo.similarity.jaccard.stream(personCuisines, {sourceIds: sourceIds, topK: 1})
YIELD item1, item2, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN from.name AS from, to.name AS to, similarity
ORDER BY similarity DESC``````

Table 9.92. Results
`from` `to` `similarity`

Arya

Karin

0.6666666666666666

Praveena

Zhen

0.3333333333333333

#### 9.5.1.6. Syntax

The following will run the algorithm and write back results:

``````CALL algo.similarity.jaccard(userData:List<Map>, {
topK: 1, similarityCutoff: 0.1, write:true, writeProperty: "jaccardSimilarity"
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100``````

Table 9.93. Parameters
Name Type Default Optional Description

`data`

list

null

no

A list of maps of the following structure: `{item: nodeId, categories: [nodeId, nodeId, nodeId]}`.

`top`

int

0

yes

The number of similar pairs to return. If `0`, it will return as many as it finds.

`topK`

int

0

yes

The number of similar values to return per node. If `0`, it will return as many as it finds.

`similarityCutoff`

int

-1

yes

The threshold for Jaccard similarity. Values below this will not be returned.

`degreeCutoff`

int

0

yes

The threshold for the number of items in the `targets` list. If the list contains less than this amount, that node will be excluded from the calculation.

`concurrency`

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'.

`readConcurrency`

int

value of 'concurrency'

yes

`writeConcurrency`

int

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

`write`

boolean

false

yes

Indicates whether results should be stored.

`writeBatchSize`

int

10000

yes

The batch size to use when storing results.

`writeRelationshipType`

string

SIMILAR

yes

The relationship type to use when storing results.

`writeProperty`

string

score

yes

The property to use when storing results.

`sourceIds`

long[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

`targetIds`

long[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

`sourceIds`

long[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

`targetIds`

long[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

Table 9.94. Results
Name Type Description

`nodes`

int

The number of nodes passed in.

`similarityPairs`

int

The number of pairs of similar nodes computed.

`write`

boolean

Indicates whether results were stored.

`writeRelationshipType`

string

The relationship type used when storing results.

`writeProperty`

string

The property used when storing results.

`min`

double

The minimum similarity score computed.

`max`

double

The maximum similarity score computed.

`mean`

double

The mean of similarities scores computed.

`stdDev`

double

The standard deviation of similarities scores computed.

`p25`

double

The 25 percentile of similarities scores computed.

`p50`

double

The 50 percentile of similarities scores computed.

`p75`

double

The 75 percentile of similarities scores computed.

`p90`

double

The 90 percentile of similarities scores computed.

`p95`

double

The 95 percentile of similarities scores computed.

`p99`

double

The 99 percentile of similarities scores computed.

`p999`

double

The 99.9 percentile of similarities scores computed.

`p100`

double

The 25 percentile of similarities scores computed.

The following will run the algorithm and stream results:

``````CALL algo.similarity.jaccard.stream(userData:List<Map>, {
degreeCutoff: 10, similarityCutoff: 0.1, concurrency:4
})
YIELD item1, item2, count1, count2, intersection, similarity``````

Table 9.95. Parameters
Name Type Default Optional Description

`data`

list

null

no

A list of maps of the following structure: `{item: nodeId, categories: [nodeId, nodeId, nodeId]}`.

`top`

int

0

yes

The number of similar pairs to return. If `0`, it will return as many as it finds.

`topK`

int

0

yes

The number of similar values to return per node. If `0`, it will return as many as it finds.

`similarityCutoff`

int

-1

yes

The threshold for Jaccard similarity. Values below this will not be returned.

`degreeCutoff`

int

0

yes

The threshold for the number of items in the `targets` list. If the list contains less than this amount, that node will be excluded from the calculation.

`concurrency`

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency'.

`readConcurrency`

int

value of 'concurrency'

yes

`sourceIds`

long[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

`targetIds`

long[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

`sourceIds`

long[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

`targetIds`

long[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the `data` parameter.

Table 9.96. Results
Name Type Description

`item1`

int

The ID of one node in the similarity pair.

`item2`

int

The ID of other node in the similarity pair.

`count1`

int

The size of the `targets` list of one node.

`count2`

int

The size of the `targets` list of other node.

`intersection`

int

The number of intersecting values in the two nodes `targets` lists.

`similarity`

int

The Jaccard similarity of the two nodes.