Node Similarity
Neo4j Graph Analytics for Snowflake is in Public Preview and is not intended for production use.
This section describes the Node Similarity algorithm in Neo4j Graph Analytics for Snowflake. The algorithm is based on the Jaccard, Overlap and Cosine similarity metrics.
Introduction
The Node Similarity algorithm compares a set of nodes based on the nodes they are connected to. Two nodes are considered similar if they share many of the same neighbors. Node Similarity computes pair-wise similarities based on the Jaccard metric, also known as the Jaccard Similarity Score, the Overlap coefficient, also known as the Szymkiewicz–Simpson coefficient, and the Cosine Similarity score. The first two are most frequently associated with unweighted sets, whereas Cosine Similarity is typically used with weighted input.
Given two sets $A$ and $B$, the Jaccard Similarity is computed using the following formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

The Overlap coefficient is computed using the following formula:

$$O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$

The Cosine Similarity score is computed using the following formula, where entries are implicitly given a weight of $1$ when $A$ and $B$ are unweighted:

$$\cos(A, B) = \frac{\sum_{i} A_i \cdot B_i}{\sqrt{\sum_{i} A_i^2} \cdot \sqrt{\sum_{i} B_i^2}}$$
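As a quick check of these formulas, consider the example graph used later in this section: Alice likes Guitar, Synthesizer and Bongos, while Bob likes Guitar and Synthesizer. Treating these neighborhoods as unweighted sets gives

$$J(\text{Alice}, \text{Bob}) = \frac{2}{3} \approx 0.667, \qquad O(\text{Alice}, \text{Bob}) = \frac{2}{\min(3, 2)} = 1$$

The Jaccard value of $2/3$ is exactly the score reported for the Alice and Bob pair in the results table below.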
The input of this algorithm is a bipartite, connected graph containing two disjoint node sets. Each relationship starts from a node in the first node set and ends at a node in the second node set.
The Node Similarity algorithm compares each node that has outgoing relationships with each other such node.
For every node $n$, we collect the outgoing neighborhood $N(n)$ of that node, that is, all nodes $m$ such that there is a relationship from $n$ to $m$.
For each pair $n$, $m$, the algorithm computes a similarity for that pair that equals the outcome of the selected similarity metric for $N(n)$ and $N(m)$.
Node Similarity has time complexity $O(n^3)$ and space complexity $O(n^2)$. We compute and store the neighbor sets in $O(n^2)$ time and space, then compute the pairwise similarity scores in $O(n^3)$ time.
In order to bound memory usage, you can specify an explicit limit on the number of results to keep per node via the topK parameter. It can be set to any value except 0. With topK set to $k$, at most $n \cdot k$ results are retained, so the output is bounded by $O(n \cdot k)$ instead of $O(n^2)$; scores outside the top $k$ per node are simply discarded. The running time is unaffected, as all pairwise scores still have to be computed before any are thrown away.
The output of the algorithm consists of new relationships between pairs of nodes from the first node set. Similarity scores are expressed via relationship properties.
Syntax
CALL Neo4j_Graph_Analytics.graph.node_similarity(
    'CPU_X64_XS',            (1)
    {
        'project': {...},    (2)
        'compute': {...},    (3)
        'write': {...}       (4)
    }
);
1. Compute pool selector.
2. Project config.
3. Compute config.
4. Write config.
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| computePoolSelector | String | | no | The selector for the compute pool on which to run the Node Similarity job. |
| configuration | Map | | no | Configuration for graph project, algorithm compute and result write back. |
The configuration map consists of the following three entries.
For more details on the project configuration below, refer to the Project documentation.
| Name | Description |
|---|---|
| nodeTables | List of node tables. |
| relationshipTables | Map of relationship types to relationship tables. |
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| mutateProperty | String | | yes | The relationship property that will be written back to the Snowflake database. |
| mutateRelationshipType | String | | yes | The relationship type used for the relationships written back to the Snowflake database. |
| similarityCutoff | Float | | yes | Lower limit for the similarity score to be present in the result. Values must be between 0 and 1. |
| degreeCutoff | Integer | | yes | Inclusive lower bound on the node degree for a node to be considered in the comparisons. This value cannot be lower than 1. |
| upperDegreeCutoff | Integer | | yes | Inclusive upper bound on the node degree for a node to be considered in the comparisons. This value cannot be lower than 1. |
| topK | Integer | | yes | Limit on the number of scores per node. The K largest results are returned. This value cannot be lower than 1. |
| bottomK | Integer | | yes | Limit on the number of scores per node. The K smallest results are returned. This value cannot be lower than 1. |
| topN | Integer | | yes | Global limit on the number of scores computed. The N largest total results are returned. This value cannot be negative; a value of 0 means no global limit. |
| bottomN | Integer | | yes | Global limit on the number of scores computed. The N smallest total results are returned. This value cannot be negative; a value of 0 means no global limit. |
| relationshipWeightProperty | String | | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
| similarityMetric | String | | yes | The metric used to compute similarity. Can be either JACCARD, OVERLAP or COSINE. |
| useComponents | Boolean or String | | yes | If enabled, Node Similarity will use components to improve the performance of the computation, skipping comparisons of nodes in different components. Set to true to enable the use of components, or to the name of a node property containing pre-computed components. |
For more details on the write configuration below, refer to the Write documentation.
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| sourceLabel | String | | no | Node label in the in-memory graph for start nodes of relationships to be written back. |
| targetLabel | String | | no | Node label in the in-memory graph for end nodes of relationships to be written back. |
| outputTable | String | | no | Table in the Snowflake database to which relationships are written. |
| relationshipType | String | | yes | The relationship type that will be written back to the Snowflake database. |
| relationshipProperty | String | | yes | The relationship property that will be written back to the Snowflake database. |
Examples
In this section we will show examples of running the Node Similarity algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small knowledge graph of a handful of nodes connected in a particular pattern. The example graph looks like this:

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID STRING);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.PERSONS VALUES
('Alice'),
('Bob'),
('Carol'),
('Dave'),
('Eve');
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS (NODEID STRING);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS VALUES
('Guitar'),
('Synthesizer'),
('Bongos'),
('Trumpet');
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.LIKES (SOURCENODEID STRING, TARGETNODEID STRING, WEIGHT FLOAT);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.LIKES VALUES
('Alice', 'Guitar', NULL),
('Alice', 'Synthesizer', NULL),
('Alice', 'Bongos', 0.5),
('Bob', 'Guitar', NULL),
('Bob', 'Synthesizer', NULL),
('Carol', 'Bongos', NULL),
('Dave', 'Guitar', NULL),
('Dave', 'Trumpet', 1.5),
('Dave', 'Bongos', NULL);
This bipartite graph has two node sets, Person nodes and Instrument nodes. The two node sets are connected via LIKES relationships. Each relationship starts at a Person node and ends at an Instrument node.
In the example, we want to use the Node Similarity algorithm to compare people based on the instruments they like.
The Node Similarity algorithm will only compute similarity for nodes that have a degree of at least 1. In the example graph, the Eve node will not be compared to other Person nodes.
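To see the outgoing neighborhoods that the algorithm will compare, we can aggregate the LIKES table with plain SQL. This query is only a sanity check and not part of the algorithm; note that Eve does not appear as a source at all:

SELECT SOURCENODEID AS PERSON, ARRAY_AGG(TARGETNODEID) AS LIKED_INSTRUMENTS
FROM EXAMPLE_DB.DATA_SCHEMA.LIKES
GROUP BY SOURCENODEID
ORDER BY PERSON;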
In the following examples we will demonstrate using the Node Similarity algorithm on this graph.
Run job
Running a Node Similarity job involves three steps: Project, Compute and Write.
To run the query, a setup of grants is required for the application, your consumer role and your environment. Please see the Getting started page for more on this.
We also assume that the application name is the default, Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it accordingly.
CALL Neo4j_Graph_Analytics.graph.node_similarity('CPU_X64_XS', {
    'project': {
        'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
        'nodeTables': ['PERSONS', 'INSTRUMENTS'],
        'relationshipTables': {
            'LIKES': {
                'sourceTable': 'PERSONS',
                'targetTable': 'INSTRUMENTS'
            }
        }
    },
    'compute': {
        'mutateProperty': 'score',
        'mutateRelationshipType': 'SIMILAR'
    },
    'write': [{
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.PERSONS_SIMILARITY',
        'sourceLabel': 'PERSONS',
        'targetLabel': 'PERSONS',
        'relationshipType': 'SIMILAR',
        'relationshipProperty': 'score'
    }]
});
| JOB_ID | JOB_START | JOB_END | JOB_RESULT |
|---|---|---|---|
| job_547003e336b44e83b9716fda49069336 | 2025-04-30 06:32:57.635000 | 2025-04-30 06:33:05.199000 | { "node_similarity_1": { "computeMillis": 22, "configuration": { "bottomK": 10, "bottomN": 0, "concurrency": 2, "degreeCutoff": 1, "jobId": "38eaf0c7-301f-4b56-8f4d-1372d2642ee2", "logProgress": true, "mutateProperty": "score", "mutateRelationshipType": "SIMILAR", "nodeLabels": [ "" ], "relationshipTypes": [ "" ], "similarityCutoff": 1.000000000000000e-42, "similarityMetric": "JACCARD", "sudo": false, "topK": 10, "topN": 0, "upperDegreeCutoff": 2147483647, "useComponents": false }, "mutateMillis": 186, "nodesCompared": 4, "postProcessingMillis": 0, "preProcessingMillis": 7, "relationshipsWritten": 10, "similarityDistribution": { "max": 0.6666679382324218, "mean": 0.41666641235351565, "min": 0.25, "p1": 0.25, "p10": 0.25, "p100": 0.6666660308837891, "p25": 0.3333320617675781, "p5": 0.25, "p50": 0.3333320617675781, "p75": 0.5000019073486328, "p90": 0.6666660308837891, "p95": 0.6666660308837891, "p99": 0.6666660308837891, "stdDev": 0.14907148283512542 } }, "project_1": { "graphName": "snowgraph", "nodeCount": 9, "nodeMillis": 734, "relationshipCount": 9, "relationshipMillis": 525, "totalMillis": 1259 }, "write_relationship_type_1": { "exportMillis": 2133, "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSONS_SIMILARITY", "relationshipProperty": "score", "relationshipType": "SIMILAR", "relationshipsExported": 10 } } |
The returned result contains information about the job execution and result distribution. Additionally, each similarity score computed for the compared node pairs has been written back to the Snowflake database. We can query it like so:
SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSONS_SIMILARITY ORDER BY SCORE DESC;
This shows the computation results as stored in the database:
| SOURCENODEID | TARGETNODEID | SCORE |
|---|---|---|
| Alice | Bob | 0.6666666666666666 |
| Bob | Alice | 0.6666666666666666 |
| Alice | Dave | 0.5 |
| Dave | Alice | 0.5 |
| Alice | Carol | 0.3333333333333333 |
| Carol | Alice | 0.3333333333333333 |
| Carol | Dave | 0.3333333333333333 |
| Dave | Carol | 0.3333333333333333 |
| Bob | Dave | 0.25 |
| Dave | Bob | 0.25 |
We use the default values for the procedure configuration parameters: topK is set to 10 and topN is set to 0. Because of that, the result set contains the top 10 similarity scores for each node.
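If we instead only wanted each person's single most similar peer, we could tighten the compute configuration, for example via topK and similarityCutoff. The following sketch uses purely illustrative values for those two parameters; everything else matches the job above:

CALL Neo4j_Graph_Analytics.graph.node_similarity('CPU_X64_XS', {
    'project': {
        'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
        'nodeTables': ['PERSONS', 'INSTRUMENTS'],
        'relationshipTables': {
            'LIKES': {
                'sourceTable': 'PERSONS',
                'targetTable': 'INSTRUMENTS'
            }
        }
    },
    'compute': {
        'mutateProperty': 'score',
        'mutateRelationshipType': 'SIMILAR',
        'topK': 1,
        'similarityCutoff': 0.5
    },
    'write': [{
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.PERSONS_SIMILARITY',
        'sourceLabel': 'PERSONS',
        'targetLabel': 'PERSONS',
        'relationshipType': 'SIMILAR',
        'relationshipProperty': 'score'
    }]
});

With these settings, at most one SIMILAR relationship per person is kept, and only if its score meets the 0.5 cutoff.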
If we would instead like to compare the Instruments to each other, we would need the LIKES relationships to point from Instrument nodes to Person nodes, since the algorithm compares nodes based on their outgoing neighborhoods, and we would write the results back with INSTRUMENTS as both source and target label.
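One way to achieve this is to project a reversed relationship table. The sketch below assumes that a view can serve as a relationship table; the LIKED_BY view and the INSTRUMENTS_SIMILARITY output table are hypothetical names introduced here for illustration:

CREATE OR REPLACE VIEW EXAMPLE_DB.DATA_SCHEMA.LIKED_BY AS
SELECT TARGETNODEID AS SOURCENODEID, SOURCENODEID AS TARGETNODEID
FROM EXAMPLE_DB.DATA_SCHEMA.LIKES;

CALL Neo4j_Graph_Analytics.graph.node_similarity('CPU_X64_XS', {
    'project': {
        'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
        'nodeTables': ['PERSONS', 'INSTRUMENTS'],
        'relationshipTables': {
            'LIKED_BY': {
                'sourceTable': 'INSTRUMENTS',
                'targetTable': 'PERSONS'
            }
        }
    },
    'compute': {
        'mutateProperty': 'score',
        'mutateRelationshipType': 'SIMILAR'
    },
    'write': [{
        'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS_SIMILARITY',
        'sourceLabel': 'INSTRUMENTS',
        'targetLabel': 'INSTRUMENTS',
        'relationshipType': 'SIMILAR',
        'relationshipProperty': 'score'
    }]
});

The algorithm then compares Instrument nodes based on the people who like them, and writes instrument-to-instrument similarity scores to the output table.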