7.1. The Node Similarity algorithm

This section describes the Node Similarity algorithm in the Neo4j Graph Algorithms library.

This topic includes:

7.1.1. Introduction

The Node Similarity algorithm compares a set of nodes based on the nodes they are connected to. Two nodes are considered similar if they share many of the same neighbors. Node Similarity computes pair-wise similarities based on the Jaccard metric.

The input of this algorithm is a bipartite, connected graph containing two disjoint node sets. Each relationship starts from a node in the first node set and ends at a node in the second node set. The Node Similarity algorithm compares all nodes from the first node set with each other based on their relationships to nodes in the second set. The complexity of this comparison grows quadratically with the number of nodes to compare. The algorithm reduces the complexity by ignoring disconnected nodes.

In addition to computational complexity, the memory requirement for producing results also scales roughly quadratically. In order to bound memory usage, the algorithm requires an explicit limit on the number of results to compute per node. This is the 'topK' parameter. It can be set to any value, except 0.

The output of the algorithm are new relationships between pairs of the first node set. Similarity scores are expressed via relationship properties.

For more information on this algorithm, see:

Running this algorithm requires sufficient available memory. Before running this algorithm, we recommend that you read Section 2.4, “Memory Requirements”.

7.1.2. Syntax

The following describes the API for running the algorithm and writing results back to Neo4j: 

CALL algo.nodeSimilarity(nodeFilter: STRING, relationshipFilter: STRING, {
  write: BOOLEAN,
  writeProperty: STRING
  // additional configuration
})
YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty

Table 7.1. Parameters
Name Type Default Optional Description

nodeFilter

string

null

yes

The node label to load from the graph. If null, load all nodes.

relationshipFilter

string

null

yes

The relationship types to load from the graph. If null, load all relationships.

config

map

{}

yes

Additional configuration, see below.

Table 7.2. Configuration
Name Type Default Optional Description

graph

string

'huge'

yes

Use 'huge' when describing the subset of the graph with node label and relationship type parameters. Use 'cypher' for describing the subset using Cypher queries for nodes and relationships.

concurrency

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. This is dependent on the Neo4j edition; for more information, see Section 1.4.2, “CPU”.

readConcurrency

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

writeConcurrency

int

value of 'concurrency'

yes

The number of concurrent threads used for writing the result.

write

boolean

true

yes

Specifies if the result should be written back as a node property.

writeRelationshipType

string

SIMILAR

yes

The relationship type used to represent a similarity score.

writeProperty

string

'score'

yes

The relationship property that stores the similarity score.

similarityCutoff

float

1E-42

yes

Lower limit for the similarity score to be present in the result.

degreeCutoff

int

1

yes

Lower limit on the node degree for a node to be considered in the comparisons. This value can not be lower than 1.

topK

int

10

yes

Limit on the number of scores per node. The K largest results are returned.

bottomK

int

10

yes

Limit on the number of scores per node. The K smallest results are returned.

topN

int

0

yes

Global limit on the number of scores computed. The N largest total results are returned.

bottomN

int

0

yes

Global limit on the number of scores computed. The N smallest total results are returned.

Table 7.3. Results
Name Type Description

nodesCompared

int

The number of nodes compared.

relationships

int

The number of relationships created.

write

boolean

Specifies if the result was written back as a new relationship.

writeRelationshipType

string

The relationship type used to represent a similarity score.

writeProperty

string

The relationship property that stores the similarity score.

loadMillis

int

Milliseconds for loading data.

computeMillis

int

Milliseconds for running the algorithm.

writeMillis

int

Milliseconds for writing result data back to Neo4j.

postProcessingMillis

int

Milliseconds for computing percentiles.

min

double

The minimum similarity score computed.

max

double

The maximum similarity score computed.

mean

double

The mean of similarities scores computed.

stdDev

double

The standard deviation of similarities scores computed.

p1

double

The 1 percentile of similarity scores computed.

p5

double

The 5 percentile of similarity scores computed.

p10

double

The 10 percentile of similarity scores computed.

p25

double

The 25 percentile of similarity scores computed.

p50

double

The 50 percentile of similarity scores computed.

p75

double

The 75 percentile of similarity scores computed.

p90

double

The 90 percentile of similarity scores computed.

p95

double

The 95 percentile of similarity scores computed.

p99

double

The 99 percentile of similarity scores computed.

p100

double

The 100 percentile of similarity scores computed.

The following describes the API for running the algorithm and streaming results: 

CALL algo.nodeSimilarity.stream(nodeFilter: STRING, relationshipFilter: STRING, {
  // configuration
})
YIELD node1, node2, similarity

Table 7.4. Parameters
Name Type Default Optional Description

nodeFilter

string

null

yes

The node label to load from the graph. If null, load all nodes.

relationshipFilter

string

null

yes

The relationship types to load from the graph. If null, load all relationships.

config

map

{}

yes

Additional configuration, see below.

Table 7.5. Configuration
Name Type Default Optional Description

graph

string

'huge'

yes

Use 'huge' when describing the subset of the graph with node label and relationship type parameters. Use 'cypher' for describing the subset using Cypher queries for nodes and relationships.

concurrency

int

available CPUs

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. This is dependent on the Neo4j edition; for more information, see Section 1.4.2, “CPU”.

readConcurrency

int

value of 'concurrency'

yes

The number of concurrent threads used for reading the graph.

similarityCutoff

float

1E-42

yes

Lower limit for the similarity score to be present in the result.

degreeCutoff

int

1

yes

Lower limit on the node degree for a node to be considered in the comparisons. This value can not be lower than 1.

topK

int

10

yes

Limit on the number of scores per node. The K largest results are returned.

bottomK

int

10

yes

Limit on the number of scores per node. The K smallest results are returned.

topN

int

0

yes

Global limit on the number of scores computed. The N largest total results are returned.

bottomN

int

0

yes

Global limit on the number of scores computed. The N smallest total results are returned.

Table 7.6. Results
Name Type Description

node1

int

The Neo4j ID of the first node.

node2

int

The Neo4j ID of the second node.

similarity

double

The similarity score for the two nodes.

7.1.3. Examples

Consider the graph created by the following Cypher statement:

CREATE (alice:Person {name: 'Alice'})
CREATE (bob:Person {name: 'Bob'})
CREATE (carol:Person {name: 'Carol'})
CREATE (dave:Person {name: 'Dave'})
CREATE (eve:Person {name: 'Eve'})
CREATE (guitar:Instrument {name: 'Guitar'})
CREATE (synth:Instrument {name: 'Synthesizer'})
CREATE (bongos:Instrument {name: 'Bongos'})
CREATE (trumpet:Instrument {name: 'Trumpet'})

CREATE (alice)-[:LIKES]->(guitar)
CREATE (alice)-[:LIKES]->(synth)
CREATE (alice)-[:LIKES]->(bongos)
CREATE (bob)-[:LIKES]->(guitar)
CREATE (bob)-[:LIKES]->(synth)
CREATE (carol)-[:LIKES]->(bongos)
CREATE (dave)-[:LIKES]->(guitar)
CREATE (dave)-[:LIKES]->(synth)
CREATE (dave)-[:LIKES]->(bongos);

This bipartite graph has two node sets, Person nodes and Instrument nodes. The two node sets are connected via LIKES relationships. Each relationship starts at a Person node and ends at an Instrument node.

In the example, we want to use the Node Similarity algorithm to compare persons based on the instruments they like.

The Node Similarity algorithm will only compute similarity for nodes that have a degree of at least 1. In the example graph, the Eve node will not be compared to other Person nodes.

7.1.3.1. Streaming results

The following will load the graph, run the algorithm, and stream results: 

CALL algo.nodeSimilarity.stream('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING'
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY similarity DESCENDING, Person1, Person2

Table 7.7. Results
Person1 Person2 similarity

"Alice"

"Dave"

1.0

"Dave"

"Alice"

1.0

"Alice"

"Bob"

0.6666666666666666

"Bob"

"Alice"

0.6666666666666666

"Bob"

"Dave"

0.6666666666666666

"Dave"

"Bob"

0.6666666666666666

"Alice"

"Carol"

0.3333333333333333

"Carol"

"Alice"

0.3333333333333333

"Carol"

"Dave"

0.3333333333333333

"Dave"

"Carol"

0.3333333333333333

10 rows

We use default values for the procedure configuration parameter. TopK is set to 10, topN is set to 0. Because of that the result set contains the top 10 similarity scores for each node.

7.1.3.2. Writing results

To instead write the similarity results back to the graph in Neo4j, use the following query. Each result is written as a new relationship between the compared nodes. The similarity score is written as a property on the relationship.

The following will load the graph, run the algorithm, and write back results: 

CALL algo.nodeSimilarity('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING',
  write: true
})
YIELD nodesCompared, relationships, write, writeProperty, writeRelationshipType;

Table 7.8. Results
nodesCompared relationships write writeProperty writeRelationshipType

4

10

true

"score"

"SIMILAR"

As we can see from the results, the number of created relationships is equal to the number of rows in the streaming example.

7.1.3.3. Limiting results

There are four limits that can be applied to the similarity results. Top limits the result to the highest similarity scores. Bottom limits the result to the lowest similarity scores. Both top and bottom limits can apply to the result as a whole ("N"), or to the result per node ("K").

There must always be a "K" limit, either bottomK or topK, which is a positive number. The default value for topK and bottomK is 10.

Table 7.9. Result limits
  total results results per node

highest score

topN

topK

lowest score

bottomN

bottomK

topK and bottomK

TopK and bottomK are limits on the number of scores computed per node. For topK, the K largest similarity scores per node are returned. For bottomK, the K smallest similarity scores per node are returned. TopK and bottomK cannot be 0, used in conjuntion, and the default value is 10. If neither is specified, topK is used.

The following will load the graph, run the algorithm, and stream the top 1 result per node: 

CALL algo.nodeSimilarity.stream('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING',
  topK: 1
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1

Table 7.10. Results
Person1 Person2 similarity

"Alice"

"Dave"

1.0

"Bob"

"Alice"

0.6666666666666666

"Carol"

"Alice"

0.3333333333333333

"Dave"

"Alice"

1.0

4 rows

The following will load the graph, run the algorithm, and stream the bottom 1 result per node: 

CALL algo.nodeSimilarity.stream('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING',
  bottomK: 1
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1

Table 7.11. Results
Person1 Person2 similarity

Alice

Carol

0.3333333333333333

Bob

Alice

0.6666666666666666

Carol

Alice

0.3333333333333333

Dave

Carol

0.3333333333333333

4 rows

topN and bottomN

TopN and bottomN limit the number of similarity scores across all nodes. This is a limit on the total result set, in addition to the topK or bottomK limit on the results per node. For topN, the N largest similarity scores are returned. For bottomN, the N smallest similarity scores are returned. A value of 0 means no global limit is imposed and all results from topK or bottomK are returned.

The following will load the graph, run the algorithm, and stream the 3 highest out of the top 1 results per node: 

CALL algo.nodeSimilarity.stream('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING',
  topK: 1,
  topN: 3
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY similarity DESC, Person1, Person2

Table 7.12. Results
Person1 Person2 similarity

"Alice"

"Dave"

1.0

"Dave"

"Alice"

1.0

"Bob"

"Alice"

0.6666666666666666

3 rows

7.1.3.4. Degree cutoff and similarity cutoff

Degree cutoff is a lower limit on the node degree for a node to be considered in the comparisons. This value can not be lower than 1.

The following will ignore nodes with less than 3 LIKES relationships: 

CALL algo.nodeSimilarity.stream('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING',
  degreeCutOff: 3
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1

Table 7.13. Results
Person1 Person2 similarity

"Alice"

"Dave"

1.0

"Dave"

"Alice"

1.0

2 rows

Similarity cutoff is a lower limit for the similarity score to be present in the result. The default value is very small (1E-42) to exclude results with a similarity score of 0.

Setting similarity cutoff to 0 may yield a very large result set, increased runtime and memory consumption.

The following will ignore node pairs with a similarty score less than 0.5: 

CALL algo.nodeSimilarity.stream('Person | Instrument', 'LIKES', {
  direction: 'OUTGOING',
  similarityCutoff: 0.5
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1

Table 7.14. Results
Person1 Person2 similarity

"Alice"

"Dave"

1.0

"Alice"

"Bob"

0.6666666666666666

"Bob"

"Dave"

0.6666666666666666

"Bob"

"Alice"

0.6666666666666666

"Dave"

"Alice"

1.0

"Dave"

"Bob"

0.6666666666666666

6 rows