5.4.6. Overlap Similarity

This section describes the Overlap Similarity algorithm in the Neo4j Graph Data Science library.

Overlap similarity measures overlap between two sets. It is defined as the size of the intersection of two sets, divided by the size of the smaller of the two sets.

This algorithm is in the alpha tier. For more information on algorithm tiers, see Chapter 5, Algorithms.

This section includes:

5.4.6.1. History and explanation

Overlap similarity is computed using the following formula:

overlap

The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation, and are therefore more appropriate for computing similarities on bigger datasets.

5.4.6.2. Use-cases - when to use the Overlap Similarity algorithm

We can use the Overlap Similarity algorithm to work out which things are subsets of others. We might then use these computed subsets to learn a taxonomy from tagged data, as described by Jesús Barrasa.

5.4.6.3. Overlap Similarity algorithm function sample

The following will return the Overlap similarity of two lists of numbers: 

RETURN gds.alpha.similarity.overlap([1,2,3], [1,2,4,5]) AS similarity

Table 5.329. Results
similarity

0.6666666666666666

These two lists of numbers have an overlap similarity of 0.66. We can see how this result is derived by breaking down the formula:

O(A,B) = (∣A ∩ B∣) / (min(∣A|,|B|))
O(A,B) = 2 / min(3,4)
       = 2 / 3
       = 0.66

5.4.6.4. Overlap Similarity algorithm procedures sample

The following will create a sample graph: 

 CREATE
 (fahrenheit451:Book {title:'Fahrenheit 451'}),
 (dune:Book {title:'Dune'}),
 (hungerGames:Book {title:'The Hunger Games'}),
 (nineteen84:Book {title:'1984'}),
 (gatsby:Book {title:'The Great Gatsby'}),

 (scienceFiction:Genre {name: "Science Fiction"}),
 (fantasy:Genre {name: "Fantasy"}),
 (dystopia:Genre {name: "Dystopia"}),
 (classics:Genre {name: "Classics"}),

 (fahrenheit451)-[:HAS_GENRE]->(dystopia),
 (fahrenheit451)-[:HAS_GENRE]->(scienceFiction),
 (fahrenheit451)-[:HAS_GENRE]->(fantasy),
 (fahrenheit451)-[:HAS_GENRE]->(classics),

 (hungerGames)-[:HAS_GENRE]->(scienceFiction),
 (hungerGames)-[:HAS_GENRE]->(fantasy),

 (nineteen84)-[:HAS_GENRE]->(scienceFiction),
 (nineteen84)-[:HAS_GENRE]->(dystopia),
 (nineteen84)-[:HAS_GENRE]->(classics),

 (dune)-[:HAS_GENRE]->(scienceFiction),
 (dune)-[:HAS_GENRE]->(fantasy),
 (dune)-[:HAS_GENRE]->(classics),

  (gatsby)-[:HAS_GENRE]->(classics)

Stream

The following will return a stream of node pairs, along with their intersection and overlap similarities: 

 MATCH (book:Book)-[:HAS_GENRE]->(genre)
 WITH {item:id(genre), categories: collect(id(book))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.overlap.stream({data: data})
 YIELD item1, item2, count1, count2, intersection, similarity
 RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to,
        count1, count2, intersection, similarity
 ORDER BY similarity DESC

Table 5.330. Results
from to count1 count2 intersection similarity

Fantasy

Science Fiction

3

4

3

1.0

Dystopia

Science Fiction

2

4

2

1.0

Dystopia

Classics

2

4

2

1.0

Science Fiction

Classics

4

4

3

0.75

Fantasy

Classics

3

4

2

0.66

Dystopia

Fantasy

2

3

1

0.5

Fantasy and Dystopia are both clear subgenres of Science Fiction - 100% of the books that list those as genres also list Science Fiction as a genre. Dystopia is also a subgenre of Classics. The others are less obvious; Dystopia probably isn’t a subgenre of Fantasy, but the other two pairs could be subgenres.

The following will return a stream of node pairs that have a similarity of at least 0.75, along with their intersection and overlap similarities: 

 MATCH (book:Book)-[:HAS_GENRE]->(genre)
 WITH {item:id(genre), categories: collect(id(book))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.overlap.stream({
   data: data,
   similarityCutoff: 0.75
 })
 YIELD item1, item2, count1, count2, intersection, similarity
 RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to,
        count1, count2, intersection, similarity
 ORDER BY similarity DESC

Table 5.331. Results
from to count1 count2 intersection similarity

Fantasy

Science Fiction

3

4

3

1.0

Dystopia

Classics

2

4

2

1.0

Dystopia

Science Fiction

2

4

2

1.0

Science Fiction

Classics

4

4

3

0.75

We can see that those genres with lower similarity have been filtered out. If we’re implementing a k-Nearest Neighbors type query we might instead want to find the most similar k super genres for a given genre. We can do that by passing in the topK parameter.

The following will return a stream of genres, along with the two most similar super genres to them (i.e. k=2): 

 MATCH (book:Book)-[:HAS_GENRE]->(genre)
 WITH {item:id(genre), categories: collect(id(book))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.overlap.stream({
  data: data,
  topK: 2
 })
 YIELD item1, item2, count1, count2, intersection, similarity
 RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to,
        count1, count2, intersection, similarity
 ORDER BY from

Table 5.332. Results
from to count1 count2 intersection similarity

Dystopia

Classics

2

4

2

1.0

Dystopia

Science Fiction

2

4

2

1.0

Fantasy

Science Fiction

3

4

3

1.0

Fantasy

Classics

3

4

2

0.6666666666666666

Science Fiction

Classics

4

4

3

0.75

Write

The following will find the most similar genre for each genre, and store a relationship between those genres: 

 MATCH (book:Book)-[:HAS_GENRE]->(genre)
 WITH {item:id(genre), categories: collect(id(book))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.overlap.write({
  data: data,
  topK: 2,
  similarityCutoff: 0.5
 })
 YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
 RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95

Table 5.333. Results
nodes similarityPairs writeRelationshipType writeProperty min max mean p95

4

5

NARROWER_THAN

score

0.6666641235351562

1.0000038146972656

0.8833351135253906

1.0000038146972656

We then could write a query to find out the genre hierarchy for a specific genre.

The following will find the genre hierarchy for the Fantasy genre. 

 MATCH path = (fantasy:Genre {name: "Fantasy"})-[:NARROWER_THAN*]->(genre)
 RETURN [node in nodes(path) | node.name] AS hierarchy
 ORDER BY length(path)

Table 5.334. Results
hierarchy

["Fantasy", "Science Fiction"]

["Fantasy", "Classics"]

["Fantasy", "Science Fiction", "Classics"]

Stats

The following will run the algorithm and returns the result in form of statistical and measurement values. 

 MATCH (book:Book)-[:HAS_GENRE]->(genre)
 WITH {item:id(genre), categories: collect(id(book))} AS userData
 WITH collect(userData) AS data
 CALL gds.alpha.similarity.overlap.stats({
  data: data,
  topK: 2,
  similarityCutoff: 0.5
 })
 YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
 RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95

5.4.6.5. Specifying source and target ids

Sometimes, we don’t want to compute all pairs similarity, but would rather specify subsets of items to compare to each other. We do this using the sourceIds and targetIds keys in the config.

We could use this technique to compute the similarity of a subset of items to all other items.

The following will return the super genres for the Fantasy and Classics genres: 

 MATCH (book:Book)-[:HAS_GENRE]->(genre)
 WITH {item:id(genre), name: genre.name, categories: collect(id(book))} AS userData
 WITH collect(userData) AS data
 WITH data,
      [value in data WHERE value.name IN ["Fantasy", "Classics"] | value.item ] AS sourceIds
 CALL gds.alpha.similarity.overlap.stream({
  data: data,
  sourceIds: sourceIds
 })
 YIELD item1, item2, count1, count2, intersection, similarity
 RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
 ORDER BY similarity DESC

Table 5.335. Results
from to similarity

Fantasy

Science Fiction

1.0

Classics

Science Fiction

0.75

Fantasy

Classics

0.6666666666666666

5.4.6.6. Syntax

The following will run the algorithm and write back results: 

CALL gds.alpha.similarity.overlap.write(configuration: Map)
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100

Table 5.336. Parameters
Name Type Default Optional Description

configuration

Map

n/a

no

Algorithm-specific configuration.

Table 5.337. Configuration
Name Type Default Optional Description

data

List or String

null

no

A list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

top

Integer

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

Integer

3

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

similarityCutoff

Integer

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

Integer

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

skipValue

Float

gds.util.NaN()

yes

Value to skip when executing similarity computation. A value of null means that skipping is disabled.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

graph

String

dense

yes

The graph name ('dense' or 'cypher').

writeBatchSize

Integer

10000

yes

The batch size to use when storing results.

writeRelationshipType

String

SIMILAR

yes

The relationship type to use when storing results.

writeProperty

String

score

yes

The property to use when storing results.

sourceIds

String[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the data parameter.

targetIds

String[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the data parameter.

Table 5.338. Results
Name Type Description

nodes

Integer

The number of nodes passed in.

similarityPairs

Integer

The number of pairs of similar nodes computed.

writeRelationshipType

String

The relationship type used when storing results.

writeProperty

String

The property used when storing results.

min

Float

The minimum similarity score computed.

max

Float

The maximum similarity score computed.

mean

Float

The mean of similarities scores computed.

stdDev

Float

The standard deviation of similarities scores computed.

p25

Float

The 25 percentile of similarities scores computed.

p50

Float

The 50 percentile of similarities scores computed.

p75

Float

The 75 percentile of similarities scores computed.

p90

Float

The 90 percentile of similarities scores computed.

p95

Float

The 95 percentile of similarities scores computed.

p99

Float

The 99 percentile of similarities scores computed.

p999

Float

The 99.9 percentile of similarities scores computed.

p100

Float

The 100 percentile of similarities scores computed.

The following will run the algorithm and stream results: 

CALL gds.alpha.similarity.overlap.stream(configuration: Map)
YIELD item1, item2, count1, count2, similarity

Table 5.339. Parameters
Name Type Default Optional Description

configuration

Map

n/a

no

Algorithm-specific configuration.

Table 5.340. Configuration
Name Type Default Optional Description

data

List or String

null

no

A list of maps of the following structure: {item: nodeId, weights: [double, double, double]} or a Cypher query.

top

Integer

0

yes

The number of similar pairs to return. If 0, it will return as many as it finds.

topK

Integer

3

yes

The number of similar values to return per node. If 0, it will return as many as it finds.

similarityCutoff

Integer

-1

yes

The threshold for similarity. Values below this will not be returned.

degreeCutoff

Integer

0

yes

The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.

skipValue

Float

gds.util.NaN()

yes

Value to skip when executing similarity computation. A value of null means that skipping is disabled.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

graph

String

dense

yes

The graph name ('dense' or 'cypher').

sourceIds

Integer[]

null

yes

The ids of items from which we need to compute similarities. Defaults to all the items provided in the data parameter.

targetIds

Integer[]

null

yes

The ids of items to which we need to compute similarities. Defaults to all the items provided in the data parameter.

Table 5.341. Results
Name Type Description

item1

Integer

The ID of one node in the similarity pair.

item2

Integer

The ID of other node in the similarity pair.

count1

Integer

The size of the targets list of one node.

count2

Integer

The size of the targets list of other node.

intersection

Integer

The number of intersecting values in the two nodes targets lists.

similarity

Integer

The overlap similarity of the two nodes.