5.4.2. Jaccard Similarity

This section describes the Jaccard Similarity algorithm in the Neo4j Graph Data Science library.

Jaccard Similarity (coefficient), a term coined by Paul Jaccard, measures similarities between sets. It is defined as the size of the intersection divided by the size of the union of two sets. This notion has been generalized for multisets, where duplicate elements are counted as weights.

The GDS Jaccard Similarity function is defined for lists, which are interpreted as multisets.

This algorithm is in the alpha tier. For more information on algorithm tiers, see Chapter 5, Algorithms.

A related procedure for computing Jaccard similarity is described in Section 5.4.1, “Node Similarity”.

5.4.2.1. History and explanation

Jaccard Similarity is computed using the following formula:

J(A,B) = ∣A ∩ B∣ / ∣A ∪ B∣
       = ∣A ∩ B∣ / (∣A∣ + ∣B∣ - ∣A ∩ B∣)

The library contains functions to calculate similarity between sets of data. The Jaccard Similarity function is best used when calculating the similarity between small numbers of sets.
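To see how duplicate elements enter the counts, the definition can be sketched outside the database. The Python snippet below is a minimal illustration and not the library's implementation. The helper name `multiset_jaccard` is hypothetical. It assumes the common multiset convention in which the intersection keeps the minimum count of each element, and it returns 0.0 for two empty lists, which is a choice made for this sketch only. GDS may treat either case differently.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Jaccard similarity of two lists treated as multisets.

    Hypothetical helper for illustration only; GDS may handle
    duplicates or empty inputs differently.
    """
    # Counter & Counter keeps the minimum count of each shared element.
    intersection = sum((Counter(a) & Counter(b)).values())
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = len(a) + len(b) - intersection
    # Assumption of this sketch: two empty lists have similarity 0.0.
    return intersection / union if union else 0.0


print(multiset_jaccard([1, 2, 3], [1, 2, 4, 5]))  # 0.4
print(multiset_jaccard([1, 1, 2], [1, 1, 3]))     # 0.5
```

In the second call the element 1 occurs twice in both lists, so it contributes 2 to the intersection, giving 2 / (3 + 3 - 2) = 0.5 under this convention.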

5.4.2.2. Use-cases - when to use the Jaccard Similarity algorithm

We can use the Jaccard Similarity algorithm to measure how much two items overlap, for example the purchase histories of two customers. We might then use the computed similarity as part of a recommendation query. For example, we can recommend to a customer the products that were purchased by customers with a similar purchase history.

5.4.2.3. Jaccard Similarity algorithm function sample

The Jaccard Similarity function computes the similarity of two lists of numbers.

We can use it to compute the similarity of two hardcoded lists.

The following will return the Jaccard Similarity of two lists of numbers: 

RETURN gds.alpha.similarity.jaccard([1,2,3], [1,2,4,5]) AS similarity

Table 5.278. Results
similarity
0.4

These two lists of numbers have a Jaccard Similarity of 0.4. We can see how this result is derived by breaking down the formula:

J(A,B) = ∣A ∩ B∣ / (∣A∣ + ∣B∣ - ∣A ∩ B∣)
J(A,B) = 2 / (3 + 4 - 2)
       = 2 / 5
       = 0.4
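The same breakdown can be reproduced term by term in a few lines of Python. This is only a sketch of the arithmetic, not a call into GDS. The variable names are illustrative, and plain sets are used because neither example list contains duplicates.

```python
# Lists from the example above; they contain no duplicates,
# so plain sets are enough for this breakdown.
a = [1, 2, 3]
b = [1, 2, 4, 5]

shared = set(a) & set(b)                      # elements present in both lists
numerator = len(shared)                       # |A ∩ B| = 2
denominator = len(a) + len(b) - len(shared)   # 3 + 4 - 2 = 5

similarity = numerator / denominator
print(sorted(shared), numerator, denominator, similarity)  # [1, 2] 2 5 0.4
```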

We can also use it to compute the similarity of nodes based on lists computed by a Cypher query.

The following will create a sample graph: 

CREATE
  (french:Cuisine {name:'French'}),
  (italian:Cuisine {name:'Italian'}),
  (indian:Cuisine {name:'Indian'}),
  (lebanese:Cuisine {name:'Lebanese'}),
  (portuguese:Cuisine {name:'Portuguese'}),

  (zhen:Person {name: 'Zhen'}),
  (praveena:Person {name: 'Praveena'}),
  (michael:Person {name: 'Michael'}),
  (arya:Person {name: 'Arya'}),
  (karin:Person {name: 'Karin'}),

  (praveena)-[:LIKES]->(indian),
  (praveena)-[:LIKES]->(portuguese),

  (zhen)-[:LIKES]->(french),
  (zhen)-[:LIKES]->(indian),

  (michael)-[:LIKES]->(french),
  (michael)-[:LIKES]->(italian),
  (michael)-[:LIKES]->(indian),

  (arya)-[:LIKES]->(lebanese),
  (arya)-[:LIKES]->(italian),
  (arya)-[:LIKES]->(portuguese),

  (karin)-[:LIKES]->(lebanese),
  (karin)-[:LIKES]->(italian)

The following will return the Jaccard Similarity of Karin and Arya: 

MATCH (p1:Person {name: 'Karin'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person {name: 'Arya'})-[:LIKES]->(cuisine2)
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from,
       p2.name AS to,
       gds.alpha.similarity.jaccard(p1Cuisine, p2Cuisine) AS similarity

Table 5.279. Results
from     to      similarity
"Karin"  "Arya"  0.6666666666666666

The following will return the Jaccard Similarity of Karin and every other person who likes at least one cuisine. People who share no cuisine with Karin are included with a similarity of 0.0:

MATCH (p1:Person {name: 'Karin'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person)-[:LIKES]->(cuisine2) WHERE p1 <> p2
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from,
       p2.name AS to,
       gds.alpha.similarity.jaccard(p1Cuisine, p2Cuisine) AS similarity
ORDER BY to, similarity DESC

Table 5.280. Results
from     to          similarity
"Karin"  "Arya"      0.6666666666666666
"Karin"  "Michael"   0.25
"Karin"  "Praveena"  0.0
"Karin"  "Zhen"      0.0
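The values in the table above can be cross-checked without a database. The Python sketch below hard-codes the LIKES relationships from the sample graph as sets of cuisine names, which stand in for the node ids that the Cypher query collects. The names `likes` and `jaccard` are illustrative and not part of GDS, and the helper assumes that every person likes at least one cuisine, as is the case in the sample graph.

```python
# LIKES relationships copied from the sample graph, keyed by person name.
# Cuisine names stand in for the node ids that the Cypher query collects.
likes = {
    "Praveena": {"Indian", "Portuguese"},
    "Zhen": {"French", "Indian"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Set-based Jaccard similarity; illustrative helper, not the GDS function.

    Assumes at least one of the two sets is non-empty.
    """
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


source = "Karin"
# Compare Karin with every other person, sorted by the other person's name.
rows = sorted(
    (other, jaccard(likes[source], cuisines))
    for other, cuisines in likes.items()
    if other != source
)
for other, similarity in rows:
    print(source, other, similarity)
```

Karin and Arya share two of three distinct cuisines (2 / 3), Karin and Michael share one of four (1 / 4), and Praveena and Zhen share none with Karin, which matches the rows returned by the query.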