Similarity functions

Definitions

The Neo4j GDS library provides a set of measures that can be used to calculate similarity between two arrays of numbers, p_s and p_t.

The similarity functions fall into two groups. Categorical measures treat the arrays as sets and calculate similarity from the intersection of the two sets. Numerical measures compute similarity from how close the numbers at each position are to each other.

Function name                     Formula                  Type         Value range
gds.similarity.jaccard            Jaccard                  Categorical  [0, 1]
gds.similarity.overlap            Overlap                  Categorical  [0, 1]
gds.similarity.cosine             Cosine                   Numerical    [-1, 1]
gds.similarity.pearson            Pearson                  Numerical    [-1, 1]
gds.similarity.euclideanDistance  Euclidean distance (ED)  Numerical    [0, ∞)
gds.similarity.euclidean          Euclidean similarity     Numerical    (0, 1]
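
The table identifies each formula only by a short label. As a reference, the standard textbook definitions of these measures can be written as follows, for two arrays p_s and p_t. This is an assumption about what the library computes rather than a statement taken from it, although these definitions do reproduce the example results on this page. For the categorical measures, p_s and p_t stand for the sets of values in each array; for the numerical measures, p_s(i) is the value at position i and a bar denotes the mean.

```latex
\begin{align*}
J(p_s, p_t) &= \frac{|p_s \cap p_t|}{|p_s \cup p_t|} \\
O(p_s, p_t) &= \frac{|p_s \cap p_t|}{\min(|p_s|, |p_t|)} \\
\cos(p_s, p_t) &= \frac{\sum_i p_s(i)\, p_t(i)}
                       {\sqrt{\sum_i p_s(i)^2}\, \sqrt{\sum_i p_t(i)^2}} \\
\mathrm{pearson}(p_s, p_t) &= \frac{\sum_i (p_s(i) - \bar{p}_s)(p_t(i) - \bar{p}_t)}
                                   {\sqrt{\sum_i (p_s(i) - \bar{p}_s)^2}\, \sqrt{\sum_i (p_t(i) - \bar{p}_t)^2}} \\
ED(p_s, p_t) &= \sqrt{\sum_i \bigl(p_s(i) - p_t(i)\bigr)^2} \\
\mathrm{euclidean}(p_s, p_t) &= \frac{1}{1 + ED(p_s, p_t)}
\end{align*}
```

The last definition explains the value range (0, 1]: a distance of 0 gives a similarity of 1, and the similarity approaches 0 as the distance grows.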

Examples

An example of usage for each function is provided below:

Jaccard similarity function
RETURN gds.similarity.jaccard(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS jaccardSimilarity
Table 1. Results
jaccardSimilarity

0.142857142857143

Overlap similarity function
RETURN gds.similarity.overlap(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS overlapSimilarity
Table 2. Results
overlapSimilarity

0.25
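
The two categorical results above can be checked by hand with a few lines of Python. This is a minimal sketch that assumes the plain set-based definitions (intersection over union, and intersection over the smaller set); the helper names `jaccard` and `overlap` are illustrative and not part of GDS, and returning 0.0 for empty input is an assumption made here only to avoid division by zero.

```python
def jaccard(ps, pt):
    """Size of the intersection divided by the size of the union."""
    a, b = set(ps), set(pt)
    union = a | b
    # Assumption: two empty sets are given a similarity of 0.0.
    return len(a & b) / len(union) if union else 0.0


def overlap(ps, pt):
    """Size of the intersection divided by the size of the smaller set."""
    a, b = set(ps), set(pt)
    smaller = min(len(a), len(b))
    # Assumption: an empty set is given a similarity of 0.0.
    return len(a & b) / smaller if smaller else 0.0


ps = [1.0, 5.0, 3.0, 6.7]
pt = [5.0, 2.5, 3.1, 9.0]

# Only 5.0 is shared: the intersection has 1 element, the union has 7.
print(jaccard(ps, pt))  # 1/7, about 0.142857
# Both sets have 4 elements, so the smaller size is 4.
print(overlap(ps, pt))  # 1/4 = 0.25
```

Note that 3.0 and 3.1 do not count as a match: the categorical measures compare values for equality, not closeness.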

Cosine similarity function
RETURN gds.similarity.cosine(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS cosineSimilarity
Table 3. Results
cosineSimilarity

0.882757381034594

Pearson similarity function
RETURN gds.similarity.pearson(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS pearsonSimilarity
Table 4. Results
pearsonSimilarity

0.468277483648113
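
To see where the cosine and Pearson values come from, the sketch below recomputes them in plain Python. It assumes the standard definitions: cosine is the dot product divided by the product of the vector lengths, and Pearson is the cosine of the mean-centered vectors. The function names are illustrative, not GDS API, and the sketch does not handle zero-length vectors.

```python
import math


def cosine(ps, pt):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(ps, pt))
    norm_s = math.sqrt(sum(x * x for x in ps))
    norm_t = math.sqrt(sum(y * y for y in pt))
    return dot / (norm_s * norm_t)


def pearson(ps, pt):
    """Cosine similarity of the two vectors after subtracting their means."""
    mean_s = sum(ps) / len(ps)
    mean_t = sum(pt) / len(pt)
    centered_s = [x - mean_s for x in ps]
    centered_t = [y - mean_t for y in pt]
    return cosine(centered_s, centered_t)


ps = [1.0, 5.0, 3.0, 6.7]
pt = [5.0, 2.5, 3.1, 9.0]

print(cosine(ps, pt))   # about 0.8828, matching the cosine result above
print(pearson(ps, pt))  # about 0.4683, matching the Pearson result above
```

Because Pearson removes each vector's mean first, it ignores a constant offset between the arrays, which is why it comes out lower than cosine for this pair.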

Euclidean similarity function
RETURN gds.similarity.euclidean(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS euclideanSimilarity
Table 5. Results
euclideanSimilarity

0.160030485454022

Euclidean distance function
RETURN gds.similarity.euclideanDistance(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS euclideanDistance
Table 6. Results
euclideanDistance

5.248809388804284
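
The Euclidean distance and Euclidean similarity results are related: the similarity shown earlier equals 1 / (1 + distance). The following sketch illustrates that relationship, assuming the usual definition of Euclidean distance; the function names are illustrative and not part of GDS.

```python
import math


def euclidean_distance(ps, pt):
    """Square root of the summed squared differences at each position."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ps, pt)))


def euclidean_similarity(ps, pt):
    """Maps a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(ps, pt))


ps = [1.0, 5.0, 3.0, 6.7]
pt = [5.0, 2.5, 3.1, 9.0]

# Squared differences: 16 + 6.25 + 0.01 + 5.29 = 27.55
print(euclidean_distance(ps, pt))    # sqrt(27.55), about 5.2488
print(euclidean_similarity(ps, pt))  # 1 / (1 + 5.2488...), about 0.1600
```

Identical arrays have distance 0 and therefore similarity 1, the upper end of the (0, 1] range.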

The functions can also compute results when one or more values in the provided vectors are null. For the intersection-based functions, Jaccard and Overlap, null values are excluded from the set and from the computation. For all other functions, each null value is replaced with 0.0. See the examples below.

Jaccard with null values
RETURN gds.similarity.jaccard(
  [1.0, null, 3.0],
  [1.0, 2.0, 3.0]
) AS jaccardSimilarity
Table 7. Results
jaccardSimilarity

0.666666666666667

Cosine with null values
RETURN gds.similarity.cosine(
  [1.0, null, 3.0],
  [1.0, 2.0, 3.0]
) AS cosineSimilarity
Table 8. Results
cosineSimilarity

0.845154254728517
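
The two null-handling rules can be mimicked in Python, with `None` standing in for Cypher's null. This is a sketch of the behavior described above, not the library's implementation; `drop_nulls`, `zero_fill`, `jaccard_with_nulls` and `cosine_with_nulls` are hypothetical helper names.

```python
import math


def drop_nulls(values):
    """Categorical rule: nulls are left out of the set entirely."""
    return {v for v in values if v is not None}


def zero_fill(values):
    """Numerical rule: each null is replaced with 0.0."""
    return [0.0 if v is None else v for v in values]


def jaccard_with_nulls(ps, pt):
    a, b = drop_nulls(ps), drop_nulls(pt)
    union = a | b
    # Assumption: two empty sets are given a similarity of 0.0.
    return len(a & b) / len(union) if union else 0.0


def cosine_with_nulls(ps, pt):
    ps, pt = zero_fill(ps), zero_fill(pt)
    dot = sum(x * y for x, y in zip(ps, pt))
    norm_s = math.sqrt(sum(x * x for x in ps))
    norm_t = math.sqrt(sum(y * y for y in pt))
    return dot / (norm_s * norm_t)


ps = [1.0, None, 3.0]
pt = [1.0, 2.0, 3.0]

# Sets {1.0, 3.0} and {1.0, 2.0, 3.0}: intersection 2, union 3.
print(jaccard_with_nulls(ps, pt))  # 2/3, about 0.6667
# Vectors [1.0, 0.0, 3.0] and [1.0, 2.0, 3.0]: 10 / sqrt(10 * 14).
print(cosine_with_nulls(ps, pt))   # about 0.8452
```

The difference matters: dropping a null shrinks the set being compared, while zero-filling keeps the position and treats the missing value as 0.0.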