# Overlap Similarity

This section describes the Overlap Similarity algorithm in the Neo4j Graph Data Science library.

Overlap similarity measures overlap between two sets. It is defined as the size of the intersection of two sets, divided by the size of the smaller of the two sets.

This algorithm is in the alpha tier. For more information on algorithm tiers, see Algorithms.

## 1. History and explanation

Overlap similarity is computed using the following formula:

The library contains both procedures and functions to calculate similarity between sets of data. The function is best used when calculating the similarity between small numbers of sets. The procedures parallelize the computation, and are therefore more appropriate for computing similarities on bigger datasets.

## 2. Use-cases - when to use the Overlap Similarity algorithm

We can use the Overlap Similarity algorithm to work out which things are subsets of others. We might then use these computed subsets to learn a taxonomy from tagged data, as described by Jesús Barrasa.

## 3. Overlap Similarity algorithm function sample

`RETURN gds.alpha.similarity.overlap([1,2,3], [1,2,4,5]) AS similarity`

`similarity` |
---|

0.6666666666666666 |

These two lists of numbers have an overlap similarity of 0.66. We can see how this result is derived by breaking down the formula:

O(A,B) = (∣A ∩ B∣) / (min(∣A|,|B|)) O(A,B) = 2 / min(3,4) = 2 / 3 = 0.66

## 4. Overlap Similarity algorithm procedures sample

```
CREATE
(fahrenheit451:Book {title:'Fahrenheit 451'}),
(dune:Book {title:'Dune'}),
(hungerGames:Book {title:'The Hunger Games'}),
(nineteen84:Book {title:'1984'}),
(gatsby:Book {title:'The Great Gatsby'}),
(scienceFiction:Genre {name: "Science Fiction"}),
(fantasy:Genre {name: "Fantasy"}),
(dystopia:Genre {name: "Dystopia"}),
(classics:Genre {name: "Classics"}),
(fahrenheit451)-[:HAS_GENRE]->(dystopia),
(fahrenheit451)-[:HAS_GENRE]->(scienceFiction),
(fahrenheit451)-[:HAS_GENRE]->(fantasy),
(fahrenheit451)-[:HAS_GENRE]->(classics),
(hungerGames)-[:HAS_GENRE]->(scienceFiction),
(hungerGames)-[:HAS_GENRE]->(fantasy),
(nineteen84)-[:HAS_GENRE]->(scienceFiction),
(nineteen84)-[:HAS_GENRE]->(dystopia),
(nineteen84)-[:HAS_GENRE]->(classics),
(dune)-[:HAS_GENRE]->(scienceFiction),
(dune)-[:HAS_GENRE]->(fantasy),
(dune)-[:HAS_GENRE]->(classics),
(gatsby)-[:HAS_GENRE]->(classics)
```

### 4.1. Stream

```
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.overlap.stream({data: data})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to,
count1, count2, intersection, similarity
ORDER BY similarity DESC
```

`from` |
`to` |
`count1` |
`count2` |
`intersection` |
`similarity` |
---|---|---|---|---|---|

Fantasy |
Science Fiction |
3 |
4 |
3 |
1.0 |

Dystopia |
Science Fiction |
2 |
4 |
2 |
1.0 |

Dystopia |
Classics |
2 |
4 |
2 |
1.0 |

Science Fiction |
Classics |
4 |
4 |
3 |
0.75 |

Fantasy |
Classics |
3 |
4 |
2 |
0.66 |

Dystopia |
Fantasy |
2 |
3 |
1 |
0.5 |

Fantasy and Dystopia are both clear subgenres of Science Fiction - 100% of the books that list those as genres also list Science Fiction as a genre. Dystopia is also a subgenre of Classics. The others are less obvious; Dystopia probably isn’t a subgenre of Fantasy, but the other two pairs could be subgenres.

```
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.overlap.stream({
data: data,
similarityCutoff: 0.75
})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to,
count1, count2, intersection, similarity
ORDER BY similarity DESC
```

`from` |
`to` |
`count1` |
`count2` |
`intersection` |
`similarity` |
---|---|---|---|---|---|

Fantasy |
Science Fiction |
3 |
4 |
3 |
1.0 |

Dystopia |
Classics |
2 |
4 |
2 |
1.0 |

Dystopia |
Science Fiction |
2 |
4 |
2 |
1.0 |

Science Fiction |
Classics |
4 |
4 |
3 |
0.75 |

We can see that those genres with lower similarity have been filtered out.
If we’re implementing a k-Nearest Neighbors type query we might instead want to find the most similar `k`

super genres for a given genre.
We can do that by passing in the `topK`

parameter.

`k=2`

):```
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.overlap.stream({
data: data,
topK: 2
})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to,
count1, count2, intersection, similarity
ORDER BY from
```

`from` |
`to` |
`count1` |
`count2` |
`intersection` |
`similarity` |
---|---|---|---|---|---|

Dystopia |
Classics |
2 |
4 |
2 |
1.0 |

Dystopia |
Science Fiction |
2 |
4 |
2 |
1.0 |

Fantasy |
Science Fiction |
3 |
4 |
3 |
1.0 |

Fantasy |
Classics |
3 |
4 |
2 |
0.6666666666666666 |

Science Fiction |
Classics |
4 |
4 |
3 |
0.75 |

### 4.2. Write

```
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.overlap.write({
data: data,
topK: 2,
similarityCutoff: 0.5
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
```

`nodes` |
`similarityPairs` |
`writeRelationshipType` |
`writeProperty` |
`min` |
`max` |
`mean` |
`p95` |
---|---|---|---|---|---|---|---|

4 |
5 |
NARROWER_THAN |
score |
0.6666641235351562 |
1.0000038146972656 |
0.8833351135253906 |
1.0000038146972656 |

We then could write a query to find out the genre hierarchy for a specific genre.

```
MATCH path = (fantasy:Genre {name: "Fantasy"})-[:NARROWER_THAN*]->(genre)
RETURN [node in nodes(path) | node.name] AS hierarchy
ORDER BY length(path)
```

`hierarchy` |
---|

["Fantasy", "Science Fiction"] |

["Fantasy", "Classics"] |

["Fantasy", "Science Fiction", "Classics"] |

### 4.3. Stats

```
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.overlap.stats({
data: data,
topK: 2,
similarityCutoff: 0.5
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
```

## 5. Specifying source and target ids

Sometimes, we don’t want to compute all pairs similarity, but would rather specify subsets of items to compare to each other.
We do this using the `sourceIds`

and `targetIds`

keys in the config.

We could use this technique to compute the similarity of a subset of items to all other items.

`Fantasy`

and `Classics`

genres:```
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), name: genre.name, categories: collect(id(book))} AS userData
WITH collect(userData) AS data
WITH data,
[value in data WHERE value.name IN ["Fantasy", "Classics"] | value.item ] AS sourceIds
CALL gds.alpha.similarity.overlap.stream({
data: data,
sourceIds: sourceIds
})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN gds.util.asNode(item1).name AS from, gds.util.asNode(item2).name AS to, similarity
ORDER BY similarity DESC
```

`from` |
`to` |
`similarity` |
---|---|---|

Fantasy |
Science Fiction |
1.0 |

Classics |
Science Fiction |
0.75 |

Fantasy |
Classics |
0.6666666666666666 |

## 6. Syntax

```
CALL gds.alpha.similarity.overlap.write(configuration: Map)
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
```

Name | Type | Default | Optional | Description |
---|---|---|---|---|

configuration |
Map |
n/a |
no |
Algorithm-specific configuration. |

Name | Type | Default | Optional | Description |
---|---|---|---|---|

data |
List or String |
null |
no |
A list of maps of the following structure: |

top |
Integer |
0 |
yes |
The number of similar pairs to return. If |

topK |
Integer |
3 |
yes |
The number of similar values to return per node. If |

similarityCutoff |
Integer |
-1 |
yes |
The threshold for similarity. Values below this will not be returned. |

degreeCutoff |
Integer |
0 |
yes |
The threshold for the number of items in the |

skipValue |
Float |
gds.util.NaN() |
yes |
Value to skip when executing similarity computation. A value of |

concurrency |
Integer |
4 |
yes |
The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'. |

writeConcurrency |
Integer |
value of 'concurrency' |
yes |
The number of concurrent threads used for writing the result. |

graph |
String |
dense |
yes |
The graph name ('dense' or 'cypher'). |

writeBatchSize |
Integer |
10000 |
yes |
The batch size to use when storing results. |

writeRelationshipType |
String |
SIMILAR |
yes |
The relationship type to use when storing results. |

writeProperty |
String |
score |
yes |
The property to use when storing results. |

sourceIds |
List of String |
null |
yes |
The ids of items from which we need to compute similarities. Defaults to all the items provided in the |

targetIds |
List of String |
null |
yes |
The ids of items to which we need to compute similarities. Defaults to all the items provided in the |

Name | Type | Description |
---|---|---|

nodes |
Integer |
The number of nodes passed in. |

similarityPairs |
Integer |
The number of pairs of similar nodes computed. |

writeRelationshipType |
String |
The relationship type used when storing results. |

writeProperty |
String |
The property used when storing results. |

min |
Float |
The minimum similarity score computed. |

max |
Float |
The maximum similarity score computed. |

mean |
Float |
The mean of similarities scores computed. |

stdDev |
Float |
The standard deviation of similarities scores computed. |

p25 |
Float |
The 25 percentile of similarities scores computed. |

p50 |
Float |
The 50 percentile of similarities scores computed. |

p75 |
Float |
The 75 percentile of similarities scores computed. |

p90 |
Float |
The 90 percentile of similarities scores computed. |

p95 |
Float |
The 95 percentile of similarities scores computed. |

p99 |
Float |
The 99 percentile of similarities scores computed. |

p999 |
Float |
The 99.9 percentile of similarities scores computed. |

p100 |
Float |
The 100 percentile of similarities scores computed. |

```
CALL gds.alpha.similarity.overlap.stream(configuration: Map)
YIELD item1, item2, count1, count2, similarity
```

Name | Type | Default | Optional | Description |
---|---|---|---|---|

configuration |
Map |
n/a |
no |
Algorithm-specific configuration. |

Name | Type | Default | Optional | Description |
---|---|---|---|---|

data |
List or String |
null |
no |
A list of maps of the following structure: |

top |
Integer |
0 |
yes |
The number of similar pairs to return. If |

topK |
Integer |
3 |
yes |
The number of similar values to return per node. If |

similarityCutoff |
Integer |
-1 |
yes |
The threshold for similarity. Values below this will not be returned. |

degreeCutoff |
Integer |
0 |
yes |
The threshold for the number of items in the |

skipValue |
Float |
gds.util.NaN() |
yes |
Value to skip when executing similarity computation. A value of |

concurrency |
Integer |
4 |
yes |
The number of concurrent threads used for running the algorithm. |

graph |
String |
dense |
yes |
The graph name ('dense' or 'cypher'). |

sourceIds |
List of Integer |
null |
yes |
The ids of items from which we need to compute similarities. Defaults to all the items provided in the |

targetIds |
List of Integer |
null |
yes |
The ids of items to which we need to compute similarities. Defaults to all the items provided in the |

Name | Type | Description |
---|---|---|

item1 |
Integer |
The ID of one node in the similarity pair. |

item2 |
Integer |
The ID of other node in the similarity pair. |

count1 |
Integer |
The size of the |

count2 |
Integer |
The size of the |

intersection |
Integer |
The number of intersecting values in the two nodes |

similarity |
Integer |
The overlap similarity of the two nodes. |

**Was this page helpful?**