This section describes the Jaccard Similarity algorithm in the Neo4j Graph Algorithms library.
Jaccard similarity (coefficient), a term coined by Paul Jaccard, measures similarities between sets. It is defined as the size of the intersection divided by the size of the union of two sets.
Jaccard similarity is computed using the following formula:
J(A,B) = ∣A ∩ B∣ / ∣A ∪ B∣ = ∣A ∩ B∣ / (∣A∣ + ∣B∣ - ∣A ∩ B∣)
The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation, and are therefore more appropriate for computing similarities on bigger datasets.
We can use the Jaccard Similarity algorithm to work out the similarity between two things. We might then use the computed similarity as part of a recommendation query. For example, you can use the Jaccard Similarity algorithm to show the products that were purchased by similar customers, in terms of previous products purchased.
The following will return the Jaccard similarity of two lists of numbers:
RETURN algo.similarity.jaccard([1,2,3], [1,2,4,5]) AS similarity
similarity 

0.4 
These two lists of numbers have a Jaccard similarity of 0.4. We can see how this result is derived by breaking down the formula:
J(A,B) = ∣A ∩ B∣ / (∣A∣ + ∣B∣ - ∣A ∩ B∣)
J(A,B) = 2 / (3 + 4 - 2)
= 2 / 5
= 0.4
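The same arithmetic can be checked outside the database. The following is a minimal Python sketch, not the library's implementation. The helper name `jaccard` is hypothetical, and the sketch assumes that inputs are treated as sets (duplicates ignored) and that two empty sets score 0.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables, treating them as sets."""
    set_a, set_b = set(a), set(b)
    intersection = len(set_a & set_b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = len(set_a) + len(set_b) - intersection
    # Assumption: two empty sets get similarity 0.0 to avoid dividing by zero.
    return intersection / union if union else 0.0


print(jaccard([1, 2, 3], [1, 2, 4, 5]))  # 2 / (3 + 4 - 2) = 0.4
```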
The following will create a sample graph:
MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})
MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})
MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)
MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)
MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)
MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)
MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)
The following will return the Jaccard similarity of Karin and Arya:
MATCH (p1:Person {name: 'Karin'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person {name: "Arya"})-[:LIKES]->(cuisine2)
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.jaccard(p1Cuisine, p2Cuisine) AS similarity
from     to      similarity
"Karin"  "Arya"  0.66
The following will return the Jaccard similarity of Karin and each of the other people, including those with no cuisine in common:
MATCH (p1:Person {name: 'Karin'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person)-[:LIKES]->(cuisine2) WHERE p1 <> p2
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from,
p2.name AS to,
algo.similarity.jaccard(p1Cuisine, p2Cuisine) AS similarity
ORDER BY similarity DESC
from     to          similarity
"Karin"  "Arya"      0.66
"Karin"  "Michael"   0.25
"Karin"  "Praveena"  0.0
"Karin"  "Zhen"      0.0
The following will return a stream of node pairs along with their intersection and Jaccard similarities:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, intersection, similarity
ORDER BY similarity DESC
from      to        intersection  similarity
Arya      Karin     2             0.66
Zhen      Michael   2             0.66
Zhen      Praveena  1             0.33
Michael   Karin     1             0.25
Praveena  Michael   1             0.25
Praveena  Arya      1             0.25
Michael   Arya      1             0.2
Praveena  Karin     0             0
Zhen      Arya      0             0
Zhen      Karin     0             0
Arya and Karin, and Zhen and Michael, have the most similar food preferences, with two overlapping cuisines and a similarity of 0.66.
We also have three pairs of users who are not similar at all.
We’d probably want to filter those out, which we can do by passing in the similarityCutoff parameter.
The following will return a stream of node pairs that have a similarity of at least 0.1, along with their intersection and Jaccard similarities:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.1})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, intersection, similarity
ORDER BY similarity DESC
from      to        intersection  similarity
Arya      Karin     2             0.66
Zhen      Michael   2             0.66
Zhen      Praveena  1             0.33
Michael   Karin     1             0.25
Praveena  Michael   1             0.25
Praveena  Arya      1             0.25
Michael   Arya      1             0.2
We can see that those users with no similarity have been filtered out.
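To make the effect of the cutoff concrete, the pairwise scoring and filtering can be sketched in plain Python. This is an illustration under stated assumptions rather than the procedure's implementation: the `likes` dictionary mirrors the sample graph, the names `jaccard` and `similar_pairs` are hypothetical, and a pair is kept when its score is greater than or equal to the cutoff.

```python
from itertools import combinations

# Sample data mirroring the LIKES relationships of the example graph.
likes = {
    "Zhen": {"French", "Indian"},
    "Praveena": {"Indian", "Portuguese"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (assumption)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_pairs(data, similarity_cutoff=None):
    """Score every unordered pair, dropping scores below the cutoff if given."""
    rows = []
    for (name1, set1), (name2, set2) in combinations(data.items(), 2):
        score = jaccard(set1, set2)
        if similarity_cutoff is None or score >= similarity_cutoff:
            rows.append((name1, name2, len(set1 & set2), score))
    # Highest similarity first, as in the ORDER BY clause of the queries above.
    return sorted(rows, key=lambda row: row[3], reverse=True)


for row in similar_pairs(likes, similarity_cutoff=0.1):
    print(row)
```

Without a cutoff the sketch yields all ten pairs. With a cutoff of 0.1 the three pairs that share no cuisine are dropped, leaving seven.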
If we’re implementing a k-Nearest Neighbors type query, we might instead want to find the k most similar users for a given user.
We can do that by passing in the topK parameter.
The following will return a stream of users along with the most similar user to them (i.e. k=1):
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY from
from      to       similarity
Arya      Karin    0.66
Karin     Arya     0.66
Michael   Zhen     0.66
Praveena  Zhen     0.33
Zhen      Michael  0.66
These results are not necessarily symmetrical. For example, the person most similar to Praveena is Zhen, but the person most similar to Zhen is actually Michael.
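The asymmetry is easier to see when the per-user selection is written out. The following Python sketch is a hedged illustration, not the library's code: `likes` mirrors the sample graph, `jaccard` and `top_k` are hypothetical helpers, and ties are broken by input order, which is an arbitrary choice.

```python
# Sample data mirroring the LIKES relationships of the example graph.
likes = {
    "Zhen": {"French", "Indian"},
    "Praveena": {"Indian", "Portuguese"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (assumption)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k(data, k=1):
    """For each item, return its k highest-scoring other items."""
    result = {}
    for name, items in data.items():
        scored = [
            (other, jaccard(items, other_items))
            for other, other_items in data.items()
            if other != name
        ]
        # Sort by descending score; ties keep the input order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[name] = scored[:k]
    return result


nearest = top_k(likes, k=1)
for name in sorted(nearest):
    print(name, nearest[name])
```

Each user is ranked against everyone else independently, so Praveena's best match can be Zhen while Zhen's best match is Michael.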
The stream procedure accepts the following parameters:
Name              Type    Default         Optional  Description
data              list    null            no        A list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, ...]}.
                  int     0               yes       The number of similar pairs to return.
topK              int     0               yes       The number of similar values to return per node.
similarityCutoff  double  -1              yes       The threshold for Jaccard similarity. Values below this will not be returned.
                  int     0               yes       The threshold for the number of items in the categories list.
                  int     available CPUs  yes       The number of concurrent threads.
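The stream procedure returns the following results:
Name          Type    Description
item1         int     The ID of one node in the similarity pair.
item2         int     The ID of the other node in the similarity pair.
count1        int     The size of the categories list of one node.
count2        int     The size of the categories list of the other node.
intersection  int     The number of intersecting values in the two nodes' categories lists.
similarity    double  The Jaccard similarity of the two nodes.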
The following will find the most similar user for each user, and store a relationship between those users:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
nodes  similarityPairs  write  writeRelationshipType  writeProperty  min   max   mean  p95
5      5                true   SIMILAR                score          0.33  0.66  0.59  0.66
We could then write a query to find out which types of cuisine people similar to us might like.
The following will find the most similar user to Praveena, and return their favorite cuisines that Praveena doesn’t (yet!) like:
MATCH (p:Person {name: "Praveena"})-[:SIMILAR]->(other),
(other)-[:LIKES]->(cuisine)
WHERE not((p)-[:LIKES]->(cuisine))
RETURN cuisine.name AS cuisine
cuisine 

French 
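The logic of this recommendation query is a set difference: the cuisines the similar user likes, minus the ones the person already likes. A minimal Python sketch of that step follows. The `likes` dictionary mirrors the sample graph, the `similar` mapping is an assumption copied from the topK results shown earlier, and `recommend` is a hypothetical helper name.

```python
# Sample data mirroring the LIKES relationships of the example graph.
likes = {
    "Zhen": {"French", "Indian"},
    "Praveena": {"Indian", "Portuguese"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}

# Assumption: one SIMILAR relationship per person, taken from the topK: 1 results.
similar = {
    "Arya": "Karin",
    "Karin": "Arya",
    "Michael": "Zhen",
    "Praveena": "Zhen",
    "Zhen": "Michael",
}


def recommend(person):
    """Cuisines liked by the most similar person but not yet by `person`."""
    other = similar[person]
    return sorted(likes[other] - likes[person])


print(recommend("Praveena"))  # ['French']
```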
The algo.similarity.jaccard procedure accepts the following parameters:
Name                   Type     Default         Optional  Description
data                   list     null            no        A list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, ...]}.
                       int      0               yes       The number of similar pairs to return.
topK                   int      0               yes       The number of similar values to return per node.
similarityCutoff       double   -1              yes       The threshold for Jaccard similarity. Values below this will not be returned.
                       int      0               yes       The threshold for the number of items in the categories list.
                       int      available CPUs  yes       The number of concurrent threads.
write                  boolean  false           yes       Indicates whether results should be stored.
                       int      10000           yes       The batch size to use when storing results.
writeRelationshipType  string   SIMILAR         yes       The relationship type to use when storing results.
writeProperty          string   score           yes       The property to use when storing results.
The algo.similarity.jaccard procedure returns the following results:
Name                   Type     Description
nodes                  int      The number of nodes passed in.
similarityPairs        int      The number of pairs of similar nodes computed.
write                  boolean  Indicates whether results were stored.
writeRelationshipType  string   The relationship type used when storing results.
writeProperty          string   The property used when storing results.
min                    double   The minimum similarity score computed.
max                    double   The maximum similarity score computed.
mean                   double   The mean of the similarity scores computed.
stdDev                 double   The standard deviation of the similarity scores computed.
p25                    double   The 25th percentile of the similarity scores computed.
p50                    double   The 50th percentile of the similarity scores computed.
p75                    double   The 75th percentile of the similarity scores computed.
p90                    double   The 90th percentile of the similarity scores computed.
p95                    double   The 95th percentile of the similarity scores computed.
p99                    double   The 99th percentile of the similarity scores computed.
p999                   double   The 99.9th percentile of the similarity scores computed.
p100                   double   The 100th percentile of the similarity scores computed.