Similarity functions

Definitions

The Neo4j GDS library provides a set of functions that calculate the similarity between two arrays of numbers, p_s and p_t.

The similarity functions fall into two groups. Categorical measures treat the arrays as sets and calculate similarity from the intersection of the two sets. Numerical measures compute similarity from how close the numbers at each position are to each other.
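To make the distinction concrete, the query below passes the same three numbers in a different order to one function from each group. It is only an illustrative sketch; the input lists are made up for this purpose. Because both lists contain the same set of values, the categorical measure should report them as identical (1.0), while the numerical measure compares position by position and should return a value below 1.

```cypher
// Same values, different order (illustrative inputs, not from the examples on this page)
WITH [1.0, 2.0, 3.0] AS p_s, [3.0, 2.0, 1.0] AS p_t
RETURN gds.similarity.jaccard(p_s, p_t) AS jaccardSimilarity, // set-based: order is ignored
       gds.similarity.cosine(p_s, p_t) AS cosineSimilarity    // position-based: order matters
```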

Function name	Type	Value range
`gds.similarity.jaccard`	Categorical	`[0, 1]`
`gds.similarity.overlap`	Categorical	`[0, 1]`
`gds.similarity.cosine`	Numerical	`[-1, 1]`
`gds.similarity.pearson`	Numerical	`[-1, 1]`
`gds.similarity.euclideanDistance`	Numerical	`[0, ∞)`
`gds.similarity.euclidean`	Numerical	`(0, 1]`
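For reference, these measures are commonly defined as shown below. These are the standard textbook forms, not formulas quoted from the GDS implementation, which may differ in edge cases such as empty or duplicate-containing inputs. Here A and B denote the sets of values in p_s and p_t, and the index i runs over array positions. The form given for `gds.similarity.euclidean` is an assumption inferred from its `(0, 1]` range and from the example results in Tables 5 and 6.

```latex
\begin{aligned}
\text{jaccard}(A, B) &= \frac{|A \cap B|}{|A \cup B|} \\
\text{overlap}(A, B) &= \frac{|A \cap B|}{\min(|A|, |B|)} \\
\text{cosine}(p_s, p_t) &= \frac{\sum_i p_{s,i}\, p_{t,i}}{\sqrt{\sum_i p_{s,i}^2}\,\sqrt{\sum_i p_{t,i}^2}} \\
\text{pearson}(p_s, p_t) &= \frac{\sum_i (p_{s,i} - \bar{p}_s)(p_{t,i} - \bar{p}_t)}{\sqrt{\sum_i (p_{s,i} - \bar{p}_s)^2}\,\sqrt{\sum_i (p_{t,i} - \bar{p}_t)^2}} \\
\text{euclideanDistance}(p_s, p_t) &= \sqrt{\sum_i (p_{s,i} - p_{t,i})^2} \\
\text{euclidean}(p_s, p_t) &= \frac{1}{1 + \text{euclideanDistance}(p_s, p_t)}
\end{aligned}
```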

Examples

A usage example for each function is provided below:

Jaccard similarity function

RETURN gds.similarity.jaccard(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS jaccardSimilarity

Table 1. Results
jaccardSimilarity
0.142857142857143

Overlap similarity function

RETURN gds.similarity.overlap(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS overlapSimilarity

Table 2. Results
overlapSimilarity
0.25
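The two categorical results can be cross-checked by hand. The arrays share only the value 5.0, so the intersection has one element, the union has 4 + 4 - 1 = 7 elements, and the smaller set has 4 elements. The following plain-Cypher sketch performs the same arithmetic without calling GDS; it assumes that neither list contains duplicates or null values, and the aliases `shared`, `manualJaccard` and `manualOverlap` are names chosen for this illustration.

```cypher
WITH [1.0, 5.0, 3.0, 6.7] AS p_s, [5.0, 2.5, 3.1, 9.0] AS p_t
// Count the values that appear in both lists (assumes no duplicates)
WITH p_s, p_t, size([x IN p_s WHERE x IN p_t]) AS shared
RETURN shared,
       // |A ∩ B| / |A ∪ B|
       toFloat(shared) / (size(p_s) + size(p_t) - shared) AS manualJaccard,
       // |A ∩ B| / min(|A|, |B|)
       toFloat(shared) / CASE WHEN size(p_s) < size(p_t) THEN size(p_s) ELSE size(p_t) END AS manualOverlap
```

This should give 1/7 (approximately 0.1429) and 0.25, in line with Tables 1 and 2.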

Cosine similarity function

RETURN gds.similarity.cosine(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS cosineSimilarity

Table 3. Results
cosineSimilarity
0.882757381034594
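As a cross-check, the cosine value can be recomputed with core Cypher only, using the usual definition (dot product divided by the product of the two vector lengths). This is a sketch under the assumption that both lists have the same length and contain no null values; `manualCosine`, `dot`, `normS` and `normT` are names chosen for this illustration.

```cypher
WITH [1.0, 5.0, 3.0, 6.7] AS p_s, [5.0, 2.5, 3.1, 9.0] AS p_t
WITH p_s, p_t,
     // Dot product: sum of position-wise products
     reduce(acc = 0.0, i IN range(0, size(p_s) - 1) | acc + p_s[i] * p_t[i]) AS dot,
     // Vector lengths: square root of the sum of squares
     sqrt(reduce(acc = 0.0, x IN p_s | acc + x * x)) AS normS,
     sqrt(reduce(acc = 0.0, x IN p_t | acc + x * x)) AS normT
RETURN dot / (normS * normT) AS manualCosine,
       gds.similarity.cosine(p_s, p_t) AS cosineSimilarity
```

The two columns should agree up to floating-point rounding.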

Pearson similarity function

RETURN gds.similarity.pearson(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS pearsonSimilarity

Table 4. Results
pearsonSimilarity
0.468277483648113

Euclidean similarity function

RETURN gds.similarity.euclidean(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS euclideanSimilarity

Table 5. Results
euclideanSimilarity
0.160030485454022

Euclidean distance function

RETURN gds.similarity.euclideanDistance(
  [1.0, 5.0, 3.0, 6.7],
  [5.0, 2.5, 3.1, 9.0]
) AS euclideanDistance

Table 6. Results
euclideanDistance
5.248809388804284
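The results in Tables 5 and 6 are consistent with the Euclidean similarity being 1 / (1 + distance): 1 / (1 + 5.2488) is approximately 0.1600. This relationship is inferred from the example values rather than stated by the documentation, so the query below is a way to check it, not a specification. The alias `fromDistance` is a name chosen for this illustration.

```cypher
WITH [1.0, 5.0, 3.0, 6.7] AS p_s, [5.0, 2.5, 3.1, 9.0] AS p_t
WITH gds.similarity.euclideanDistance(p_s, p_t) AS euclideanDistance,
     gds.similarity.euclidean(p_s, p_t) AS euclideanSimilarity
RETURN euclideanDistance,
       euclideanSimilarity,
       // Assumed relationship: similarity = 1 / (1 + distance)
       1.0 / (1.0 + euclideanDistance) AS fromDistance
```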

The functions can also compute results when one or more values in the provided vectors are null. For the intersection-based functions, Jaccard and Overlap, null values are excluded from the sets and therefore from the computation. For all other functions, each null value is replaced with 0.0. See the examples below.

Jaccard with null values

RETURN gds.similarity.jaccard(
  [1.0, null, 3.0],
  [1.0, 2.0, 3.0]
) AS jaccardSimilarity

Table 7. Results
jaccardSimilarity
0.666666666666667

Cosine with null values

RETURN gds.similarity.cosine(
  [1.0, null, 3.0],
  [1.0, 2.0, 3.0]
) AS cosineSimilarity

Table 8. Results
cosineSimilarity
0.845154254728517
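One way to confirm the null handling described above is to compare each call against an explicit equivalent: the null removed from the list for Jaccard, and the null written as 0.0 for cosine. The sketch below does this with the same inputs as the two preceding examples; the aliases are names chosen for this illustration. If the description holds, each pair of columns should match.

```cypher
RETURN
  // Categorical: the null is expected to be dropped from the set
  gds.similarity.jaccard([1.0, null, 3.0], [1.0, 2.0, 3.0]) AS jaccardWithNull,
  gds.similarity.jaccard([1.0, 3.0], [1.0, 2.0, 3.0]) AS jaccardNullRemoved,
  // Numerical: the null is expected to be treated as 0.0
  gds.similarity.cosine([1.0, null, 3.0], [1.0, 2.0, 3.0]) AS cosineWithNull,
  gds.similarity.cosine([1.0, 0.0, 3.0], [1.0, 2.0, 3.0]) AS cosineNullAsZero
```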