This section describes the Node Similarity algorithm in the Neo4j Graph Algorithms library.
This topic includes:
The Node Similarity algorithm compares a set of nodes based on the nodes they are connected to. Two nodes are considered similar if they share many of the same neighbors. Node Similarity computes pairwise similarities based on the Jaccard metric.
The input of this algorithm is a bipartite, connected graph containing two disjoint node sets. Each relationship starts from a node in the first node set and ends at a node in the second node set. The Node Similarity algorithm compares all nodes from the first node set with each other based on their relationships to nodes in the second set. The complexity of this comparison grows quadratically with the number of nodes to compare. The algorithm reduces the complexity by ignoring disconnected nodes.
In addition to computational complexity, the memory requirement for producing results also scales roughly quadratically. In order to bound memory usage, the algorithm requires an explicit limit on the number of results to compute per node. This is the 'topK' parameter. It can be set to any value, except 0.
The output of the algorithm are new relationships between pairs of the first node set. Similarity scores are expressed via relationship properties.
For more information on this algorithm, see:
Running this algorithm requires sufficient available memory. Before running this algorithm, we recommend that you read Section 2.4, “Memory Requirements”. 
The following describes the API for running the algorithm and writing results back to Neo4j:
CALL algo.nodeSimilarity(nodeFilter: STRING, relationshipFilter: STRING, {
write: BOOLEAN,
writeProperty: STRING
// additional configuration
})
YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty
Name  Type  Default  Optional  Description 

nodeFilter 
string 

yes 
The node label to load from the graph. If null, load all nodes. 
relationshipFilter 
string 

yes 
The relationship types to load from the graph. If null, load all relationships. 
config 
map 

yes 
Additional configuration, see below. 
Name  Type  Default  Optional  Description 

graph 
string 
'huge' 
yes 
Use 'huge' when describing the subset of the graph with node label and relationship type parameters. Use 'cypher' for describing the subset using Cypher queries for nodes and relationships. 
concurrency 
int 
available CPUs 
yes 
The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. This is dependent on the Neo4j edition; for more information, see Section 1.4.2, “CPU”. 
readConcurrency 
int 
value of 'concurrency' 
yes 
The number of concurrent threads used for reading the graph. 
writeConcurrency 
int 
value of 'concurrency' 
yes 
The number of concurrent threads used for writing the result. 
write 
boolean 
true 
yes 
Specifies if the result should be written back as a node property. 
writeRelationshipType 
string 
SIMILAR 
yes 
The relationship type used to represent a similarity score. 
writeProperty 
string 
'score' 
yes 
The relationship property that stores the similarity score. 
similarityCutoff 
float 
1E42 
yes 
Lower limit for the similarity score to be present in the result. 
degreeCutoff 
int 
1 
yes 
Lower limit on the node degree for a node to be considered in the comparisons. This value can not be lower than 1. 
topK 
int 
10 
yes 
Limit on the number of scores per node. The K largest results are returned. 
bottomK 
int 
10 
yes 
Limit on the number of scores per node. The K smallest results are returned. 
topN 
int 
0 
yes 
Global limit on the number of scores computed. The N largest total results are returned. 
bottomN 
int 
0 
yes 
Global limit on the number of scores computed. The N smallest total results are returned. 
Name  Type  Description 

nodesCompared 
int 
The number of nodes compared. 
relationships 
int 
The number of relationships created. 
write 
boolean 
Specifies if the result was written back as a new relationship. 
writeRelationshipType 
string 
The relationship type used to represent a similarity score. 
writeProperty 
string 
The relationship property that stores the similarity score. 
loadMillis 
int 
Milliseconds for loading data. 
computeMillis 
int 
Milliseconds for running the algorithm. 
writeMillis 
int 
Milliseconds for writing result data back to Neo4j. 
postProcessingMillis 
int 
Milliseconds for computing percentiles. 
min 
double 
The minimum similarity score computed. 
max 
double 
The maximum similarity score computed. 
mean 
double 
The mean of similarities scores computed. 
stdDev 
double 
The standard deviation of similarities scores computed. 
p1 
double 
The 1 percentile of similarity scores computed. 
p5 
double 
The 5 percentile of similarity scores computed. 
p10 
double 
The 10 percentile of similarity scores computed. 
p25 
double 
The 25 percentile of similarity scores computed. 
p50 
double 
The 50 percentile of similarity scores computed. 
p75 
double 
The 75 percentile of similarity scores computed. 
p90 
double 
The 90 percentile of similarity scores computed. 
p95 
double 
The 95 percentile of similarity scores computed. 
p99 
double 
The 99 percentile of similarity scores computed. 
p100 
double 
The 100 percentile of similarity scores computed. 
The following describes the API for running the algorithm and streaming results:
CALL algo.nodeSimilarity.stream(nodeFilter: STRING, relationshipFilter: STRING, {
// configuration
})
YIELD node1, node2, similarity
Name  Type  Default  Optional  Description 

nodeFilter 
string 
null 
yes 
The node label to load from the graph. If null, load all nodes. 
relationshipFilter 
string 
null 
yes 
The relationship types to load from the graph. If null, load all relationships. 
config 
map 
{} 
yes 
Additional configuration, see below. 
Name  Type  Default  Optional  Description 


string 
'huge' 
yes 
Use 'huge' when describing the subset of the graph with node label and relationship type parameters. Use 'cypher' for describing the subset using Cypher queries for nodes and relationships. 

int 
available CPUs 
yes 
The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. This is dependent on the Neo4j edition; for more information, see Section 1.4.2, “CPU”. 

int 
value of 'concurrency' 
yes 
The number of concurrent threads used for reading the graph. 

float 
1E42 
yes 
Lower limit for the similarity score to be present in the result. 

int 
1 
yes 
Lower limit on the node degree for a node to be considered in the comparisons. This value can not be lower than 1. 

int 
10 
yes 
Limit on the number of scores per node. The K largest results are returned. 

int 
10 
yes 
Limit on the number of scores per node. The K smallest results are returned. 

int 
0 
yes 
Global limit on the number of scores computed. The N largest total results are returned. 

int 
0 
yes 
Global limit on the number of scores computed. The N smallest total results are returned. 
Name  Type  Description 


int 
The Neo4j ID of the first node. 

int 
The Neo4j ID of the second node. 

double 
The similarity score for the two nodes. 
Consider the graph created by the following Cypher statement:
CREATE (alice:Person {name: 'Alice'})
CREATE (bob:Person {name: 'Bob'})
CREATE (carol:Person {name: 'Carol'})
CREATE (dave:Person {name: 'Dave'})
CREATE (eve:Person {name: 'Eve'})
CREATE (guitar:Instrument {name: 'Guitar'})
CREATE (synth:Instrument {name: 'Synthesizer'})
CREATE (bongos:Instrument {name: 'Bongos'})
CREATE (trumpet:Instrument {name: 'Trumpet'})
CREATE (alice)[:LIKES]>(guitar)
CREATE (alice)[:LIKES]>(synth)
CREATE (alice)[:LIKES]>(bongos)
CREATE (bob)[:LIKES]>(guitar)
CREATE (bob)[:LIKES]>(synth)
CREATE (carol)[:LIKES]>(bongos)
CREATE (dave)[:LIKES]>(guitar)
CREATE (dave)[:LIKES]>(synth)
CREATE (dave)[:LIKES]>(bongos);
This bipartite graph has two node sets, Person nodes and Instrument nodes. The two node sets are connected via LIKES relationships. Each relationship starts at a Person node and ends at an Instrument node.
In the example, we want to use the Node Similarity algorithm to compare persons based on the instruments they like.
The Node Similarity algorithm will only compute similarity for nodes that have a degree of at least 1. In the example graph, the Eve node will not be compared to other Person nodes.
The following will load the graph, run the algorithm, and stream results:
CALL algo.nodeSimilarity.stream('Person  Instrument', 'LIKES', {
direction: 'OUTGOING'
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY similarity DESCENDING, Person1, Person2
Person1  Person2  similarity 

"Alice" 
"Dave" 
1.0 
"Dave" 
"Alice" 
1.0 
"Alice" 
"Bob" 
0.6666666666666666 
"Bob" 
"Alice" 
0.6666666666666666 
"Bob" 
"Dave" 
0.6666666666666666 
"Dave" 
"Bob" 
0.6666666666666666 
"Alice" 
"Carol" 
0.3333333333333333 
"Carol" 
"Alice" 
0.3333333333333333 
"Carol" 
"Dave" 
0.3333333333333333 
"Dave" 
"Carol" 
0.3333333333333333 
10 rows 
We use default values for the procedure configuration parameter. TopK is set to 10, topN is set to 0. Because of that the result set contains the top 10 similarity scores for each node.
To instead write the similarity results back to the graph in Neo4j, use the following query. Each result is written as a new relationship between the compared nodes. The similarity score is written as a property on the relationship.
The following will load the graph, run the algorithm, and write back results:
CALL algo.nodeSimilarity('Person  Instrument', 'LIKES', {
direction: 'OUTGOING',
write: true
})
YIELD nodesCompared, relationships, write, writeProperty, writeRelationshipType;
nodesCompared  relationships  write  writeProperty  writeRelationshipType 

4 
10 
true 
"score" 
"SIMILAR" 
As we can see from the results, the number of created relationships is equal to the number of rows in the streaming example.
There are four limits that can be applied to the similarity results. Top limits the result to the highest similarity scores. Bottom limits the result to the lowest similarity scores. Both top and bottom limits can apply to the result as a whole ("N"), or to the result per node ("K").
There must always be a "K" limit, either bottomK or topK, which is a positive number. The default value for topK and bottomK is 10. 
total results  results per node  

highest score 
topN 
topK 
lowest score 
bottomN 
bottomK 
TopK and bottomK are limits on the number of scores computed per node. For topK, the K largest similarity scores per node are returned. For bottomK, the K smallest similarity scores per node are returned. TopK and bottomK cannot be 0, used in conjuntion, and the default value is 10. If neither is specified, topK is used.
The following will load the graph, run the algorithm, and stream the top 1 result per node:
CALL algo.nodeSimilarity.stream('Person  Instrument', 'LIKES', {
direction: 'OUTGOING',
topK: 1
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1
Person1  Person2  similarity 

"Alice" 
"Dave" 
1.0 
"Bob" 
"Alice" 
0.6666666666666666 
"Carol" 
"Alice" 
0.3333333333333333 
"Dave" 
"Alice" 
1.0 
4 rows 
The following will load the graph, run the algorithm, and stream the bottom 1 result per node:
CALL algo.nodeSimilarity.stream('Person  Instrument', 'LIKES', {
direction: 'OUTGOING',
bottomK: 1
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1
Person1  Person2  similarity 

Alice 
Carol 
0.3333333333333333 
Bob 
Alice 
0.6666666666666666 
Carol 
Alice 
0.3333333333333333 
Dave 
Carol 
0.3333333333333333 
4 rows 
TopN and bottomN limit the number of similarity scores across all nodes. This is a limit on the total result set, in addition to the topK or bottomK limit on the results per node. For topN, the N largest similarity scores are returned. For bottomN, the N smallest similarity scores are returned. A value of 0 means no global limit is imposed and all results from topK or bottomK are returned.
The following will load the graph, run the algorithm, and stream the 3 highest out of the top 1 results per node:
CALL algo.nodeSimilarity.stream('Person  Instrument', 'LIKES', {
direction: 'OUTGOING',
topK: 1,
topN: 3
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY similarity DESC, Person1, Person2
Person1  Person2  similarity 

"Alice" 
"Dave" 
1.0 
"Dave" 
"Alice" 
1.0 
"Bob" 
"Alice" 
0.6666666666666666 
3 rows 
Degree cutoff is a lower limit on the node degree for a node to be considered in the comparisons. This value can not be lower than 1.
The following will ignore nodes with less than 3 LIKES relationships:
CALL algo.nodeSimilarity.stream('Person  Instrument', 'LIKES', {
direction: 'OUTGOING',
degreeCutOff: 3
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1
Person1  Person2  similarity 

"Alice" 
"Dave" 
1.0 
"Dave" 
"Alice" 
1.0 
2 rows 
Similarity cutoff is a lower limit for the similarity score to be present in the result.
The default value is very small (1E42
) to exclude results with a similarity score of 0.
Setting similarity cutoff to 0 may yield a very large result set, increased runtime and memory consumption. 
The following will ignore node pairs with a similarty score less than 0.5:
CALL algo.nodeSimilarity.stream('Person  Instrument', 'LIKES', {
direction: 'OUTGOING',
similarityCutoff: 0.5
})
YIELD node1, node2, similarity
RETURN algo.asNode(node1).name AS Person1, algo.asNode(node2).name AS Person2, similarity
ORDER BY Person1
Person1  Person2  similarity 

"Alice" 
"Dave" 
1.0 
"Alice" 
"Bob" 
0.6666666666666666 
"Bob" 
"Dave" 
0.6666666666666666 
"Bob" 
"Alice" 
0.6666666666666666 
"Dave" 
"Alice" 
1.0 
"Dave" 
"Bob" 
0.6666666666666666 
6 rows 