HashGNN
Introduction
HashGNN is a node embedding algorithm which resembles Graph Neural Networks (GNN) but does not include a model or require training.
The neural networks of GNNs are replaced by random hash functions, in the style of min-hash locality sensitive hashing.
Thus, HashGNN combines ideas of GNNs and fast randomized algorithms.
The Neo4j Graph Analytics for Snowflake implementation of HashGNN is based on the paper "Hashing-Accelerated Graph Neural Networks for Link Prediction", and further introduces a few improvements and generalizations.
The generalizations include support for embedding heterogeneous graphs; relationships of different types are associated with different hash functions, which allows for preserving relationship-typed graph topology.
Moreover, the degree to which embeddings are updated using features from neighboring nodes versus features from the node itself can be configured via neighborInfluence.
The runtime of this algorithm is significantly lower than that of GNNs in general, but can still give comparable embedding quality for certain graphs as shown in the original paper. Moreover, the heterogeneous generalization also gives comparable results when compared to the paper "Graph Transformer Networks" when benchmarked on the same datasets.
The execution does not require GPUs, which GNNs typically use, and parallelizes well across many CPU cores.
The algorithm
To clarify how HashGNN works, we walk through a virtual example of a three-node graph below, for readers who are curious about the details of the feature selection and prefer to learn from examples.
The HashGNN algorithm can only run on binary features. Therefore, there is an optional first step to transform (possibly non-binary) input features into binary features as part of the algorithm.
For a number of iterations, a new binary embedding is computed for each node using the embeddings of the previous iteration. In the first iteration, the previous embeddings are the input feature vectors or the binarized input vectors.
During one iteration, each node embedding vector is constructed by taking K random samples.
The random sampling is carried out by successively selecting features with lowest min-hash values.
Features of each node itself and of its neighbors are both considered.
There are three types of hash functions involved: 1) a function applied to a node’s own features, 2) a function applied to a subset of neighbors' features, 3) a function applied to all neighbors' features to select the subset for hash function 2).
For each iteration and sampling round k < K, new hash functions are used, and the third function also varies depending on the type of the relationship connecting to the neighbor it is applied to.
The sampling is consistent in the sense that if nodes (a) and (b) have identical or similar local graphs, the samples for (a) and (b) are also identical or similar.
By local graph, we mean the subgraph with features and relationship types, containing all nodes at most iterations hops away.
The number K is called embeddingDensity in the configuration of the algorithm.
The algorithm ends with another optional step that maps the binary embeddings to dense vectors.
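To make the sampling rule concrete, here is a minimal, illustrative Python sketch of one iteration on a toy graph with binary features. It is not the Neo4j implementation: the three hash functions described above are simulated with salted SHA-256 hashes, and all names are hypothetical.

import hashlib

def make_hash(*salt):
    # Deterministic pseudo-random hash over feature names, salted by *salt.
    prefix = "|".join(map(str, salt))
    return lambda feature: int(hashlib.sha256(f"{prefix}|{feature}".encode()).hexdigest(), 16)

def hashgnn_iteration(features, neighbors, embedding_density, seed=0):
    # One iteration: for each sampling round k, every node keeps the feature with
    # the smallest hash value among its own features (hash "one") and one
    # pre-selected feature per neighbor (hash "three" selects, hash "two" scores).
    new_features = {node: set() for node in features}
    for k in range(embedding_density):
        h_own = make_hash("one", k, seed)
        h_nbr = make_hash("two", k, seed)
        h_sel = make_hash("three", k, seed)
        for node, own in features.items():
            candidates = [(h_own(f), f) for f in own]
            for nbr in neighbors.get(node, []):
                if features[nbr]:
                    selected = min(features[nbr], key=h_sel)
                    candidates.append((h_nbr(selected), selected))
            if candidates:
                new_features[node].add(min(candidates)[1])
    return new_features

features = {"a": {"f1"}, "b": {"f2"}, "c": {"f1", "f3"}}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(hashgnn_iteration(features, neighbors, embedding_density=2))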
Features
The original HashGNN algorithm assumes that nodes have binary features as input, and produces binary embedding vectors as output (unless output densification is opted for). Since this is not always the case for real-world graphs, our algorithm also comes with options to binarize node properties, or generate binary features from scratch.
Using binary node properties as features
If your node properties have only 0 or 1 values (or arrays of such values), you can use them directly as input to the HashGNN algorithm.
To do that, you provide them as featureProperties in the compute configuration.
Feature generation
To use feature generation, specify a map including dimension and densityLevel for the generateFeatures compute configuration parameter.
This will generate dimension features in total, of which each node has approximately densityLevel switched on.
The active features for each node are selected uniformly at random with replacement.
Although the active features are random, the feature vector for a node acts as an approximately unique signature for that node.
This is akin to one-hot encoding of the node IDs, but approximate in that it has a much lower dimension than the node count of the graph.
Please note that when using feature generation, it is not supported to supply any featureProperties, which is otherwise mandatory.
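As a minimal sketch of this idea (not the product code), generating features amounts to sampling densityLevel feature indices per node, uniformly at random with replacement, out of dimension possible features:

import random

def generate_binary_features(node_ids, dimension, density_level, seed=42):
    rng = random.Random(seed)
    features = {}
    for node in node_ids:
        # Sampling with replacement, so a node may end up with fewer than
        # density_level distinct active features.
        features[node] = {rng.randrange(dimension) for _ in range(density_level)}
    return features

print(generate_binary_features(["Dan", "Annie", "Matt"], dimension=6, density_level=2))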
Feature binarization
Feature binarization uses hyperplane rounding and is configured via featureProperties and a map parameter binarizeFeatures containing threshold and dimension.
The hyperplane rounding uses hyperplanes defined by vectors filled with Gaussian random values.
The dimension parameter determines the number of generated binary features that the input features are transformed into.
For each hyperplane (one for each dimension) and node, we compute the dot product of the node's input feature vector and the normal vector of the hyperplane.
If this dot product is larger than the given threshold, the node gets the feature corresponding to that hyperplane.
Although hyperplane rounding can be applied to a binary input, it is often best to use the already binary input directly.
However, sometimes using binarization with a different dimension than the number of input features can be useful, either to act as dimensionality reduction or to introduce redundancy that can be leveraged by HashGNN.
The hyperplane rounding may not work well if the input features are of different magnitudes, since those of larger magnitudes will influence the generated binary features more. If this is not the intended behavior for your application, we recommend normalizing your node properties (by feature dimension) prior to running HashGNN.
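The following is an illustrative sketch of hyperplane rounding under the assumptions above (Gaussian hyperplanes, a shared threshold); it is not the product implementation, and the example node vectors are hypothetical:

import random

def binarize(node_vectors, dimension, threshold, seed=42):
    rng = random.Random(seed)
    input_dim = len(next(iter(node_vectors.values())))
    # One Gaussian hyperplane (normal vector) per output binary feature.
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                   for _ in range(dimension)]
    return {
        node: [1 if sum(v * h for v, h in zip(vec, plane)) > threshold else 0
               for plane in hyperplanes]
        for node, vec in node_vectors.items()
    }

# Example input vectors [AGE, EXPERIENCE], as in the binarization example further below.
print(binarize({"Dan": [18.0, 0.63], "Annie": [12.0, 0.05]}, dimension=4, threshold=32))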
Neighbor influence
The parameter neighborInfluence determines how prone the algorithm is to select neighbors' features over features from the same node.
The default value of neighborInfluence is 1.0, and with this value, on average a feature will be selected from the neighbors 50% of the time.
Increasing the value leads to neighbors being selected more often.
The probability of selecting a feature from the neighbors as a function of neighborInfluence has a hockey-stick-like shape, somewhat similar to the shape of y = log(x) or y = C - 1/x.
This implies that the probability is more sensitive for low values of neighborInfluence.
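As an illustration only, one can model this behavior by letting neighborInfluence scale down the hash values drawn for neighbor features before the smallest value wins; this is an assumed toy model, not the product's exact formula, but it reproduces the behavior described above: about 50% at the default of 1.0, and a flattening, hockey-stick-like increase for larger values.

import random

def neighbor_selection_probability(neighbor_influence, trials=100_000, seed=0):
    # Toy model: own and neighbor features draw uniform hash values in [0, 1),
    # and the neighbor's value is divided by neighbor_influence before comparing.
    rng = random.Random(seed)
    wins = sum(rng.random() / neighbor_influence < rng.random() for _ in range(trials))
    return wins / trials

# Under this toy model the probability is x/2 for x <= 1 and 1 - 1/(2x) for x >= 1,
# i.e. exactly 0.5 at the default of 1.0, then flattening like C - 1/x.
for influence in (0.25, 0.5, 1.0, 2.0, 4.0, 16.0):
    print(influence, round(neighbor_selection_probability(influence), 3))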
Heterogeneity support
The Neo4j Graph Analytics for Snowflake implementation of HashGNN provides a new generalization to heterogeneous graphs in that it can distinguish between different relationship types.
To enable the heterogeneous support, set heterogeneous to true.
The generalization works as the original HashGNN algorithm, but whenever a hash function is applied to a feature of a neighbor node, the algorithm uses a hash function that depends not only on the iteration and on a number k < embeddingDensity, but also on the type of the relationship connecting to the neighbor.
Consider an example where HashGNN is run with one iteration, and we have (a)-[:R]→(x), (b)-[:R]→(x) and (c)-[:S]→(x).
Assume that a feature f of (x) is selected for (a) and the hash value is very small.
This will make it very likely that the feature is also selected for (b).
There will however be no correlation to f being selected for (c) when considering the relationship (c)-[:S]→(x), because a different hash function is used for S.
We can conclude that nodes with similar neighborhoods (including node properties and relationship types) get similar embeddings, while nodes that have less similar neighborhoods get less similar embeddings.
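To make the role of relationship-type-dependent hash functions concrete, here is a small illustrative sketch (hypothetical helper names, not the product implementation) in which the hash functions that select and score a neighbor's features are salted with the relationship type. Nodes (a) and (b), which both reach (x) via :R, see identical hash values for f, while (c), reaching (x) via :S, sees an unrelated value:

import hashlib

def make_hash(*salt):
    prefix = "|".join(map(str, salt))
    return lambda feature: int(hashlib.sha256(f"{prefix}|{feature}".encode()).hexdigest(), 16)

def score_neighbor_features(node, typed_neighbors, features, k=0, seed=0):
    # For each (relationship_type, neighbor) pair, a type-dependent hash "three"
    # selects the neighbor's min-hash feature, and a type-dependent hash "two" scores it.
    candidates = []
    for rel_type, nbr in typed_neighbors.get(node, []):
        h_select = make_hash("three", rel_type, k, seed)
        h_score = make_hash("two", rel_type, k, seed)
        feature = min(features[nbr], key=h_select)
        candidates.append((h_score(feature), feature))
    return candidates

features = {"a": set(), "b": set(), "c": set(), "x": {"f"}}
typed_neighbors = {"a": [("R", "x")], "b": [("R", "x")], "c": [("S", "x")]}
for node in ("a", "b", "c"):
    print(node, score_neighbor_features(node, typed_neighbors, features))
# (a) and (b) print identical scores for f; (c) prints a different, uncorrelated score.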
An advantage of running heterogeneous HashGNN over running a homogeneous embedding such as FastRP is that it is not necessary to manually select multiple projections or create meta-path graphs and run FastRP on each of them. With the heterogeneous algorithm, the full heterogeneous graph can be used in a single execution.
Node property schema for heterogeneous graphs
Heterogeneous graphs typically have different node properties for different node labels.
HashGNN assumes that all nodes have the same allowed features.
Therefore, use a default value of 0 for such properties in each graph projection.
This works both in the binary input case and when binarization is applied, because having a binary feature with value 0 behaves as if not having the feature.
The 0 values are represented in a sparse format, so storing 0 values for many nodes has a low memory overhead.
Orientation
Choosing the right orientation when creating the graph may have a large impact.
HashGNN works for any orientation, and the choice of orientation is problem specific.
Given a directed relationship type, you may pick one orientation, or use two projections with NATURAL and REVERSE.
Using the analogy with GNNs, using a different relationship type for the reversed relationships leads to using a different set of weights when considering a relationship vis-à-vis the reversed relationship.
For HashGNN, this means instead using different min-hash functions for the two relationships.
For example, in a citation network, a paper citing another paper is very different from the paper being cited.
Output densification
Since binary embeddings need to be of higher dimension than dense floating point embeddings to encode the same amount of information, binary embeddings require more memory and longer training time for downstream models.
The output embeddings can optionally be densified by using random projection, similar to what is done to initialize FastRP with node properties.
This behavior is activated by specifying outputDimension.
Output densification can improve runtime and memory of downstream tasks at the cost of introducing approximation error due to the random nature of the projection.
The larger the outputDimension, the lower the approximation error, but also the smaller the performance savings.
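As a rough illustration of the idea (the exact projection used by the implementation is not specified here, so the Gaussian matrix and scaling below are assumptions), densification multiplies the binary embedding with a random projection matrix:

import random

def densify(binary_embedding, output_dimension, seed=42):
    rng = random.Random(seed)
    d = len(binary_embedding)
    # One random Gaussian row per output dimension, scaled by 1/sqrt(output_dimension).
    projection = [[rng.gauss(0.0, 1.0) / output_dimension ** 0.5 for _ in range(d)]
                  for _ in range(output_dimension)]
    return [sum(b * w for b, w in zip(binary_embedding, row)) for row in projection]

print(densify([1, 0, 1, 1, 0, 0, 1, 0], output_dimension=3))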
Tuning algorithm parameters
In order to improve the embedding quality using HashGNN on one of your graphs, it is possible to tune the algorithm parameters. This process of finding the best parameters for your specific use case and graph is typically referred to as hyperparameter tuning. We will go through each of the compute configuration parameters and explain how they behave.
Iterations
The maximum number of hops between a node and other nodes that affect its embedding is equal to the number of iterations of HashGNN, which is configured with iterations.
This is analogous to the number of layers in a GNN or the number of iterations in FastRP.
Often a value of 2 to 4 is sufficient, but sometimes more iterations are useful.
Embedding density
The embeddingDensity parameter is what the original paper denotes by k.
For each iteration of HashGNN, k features are selected from the previous iteration's embeddings for the same node and for its neighbors.
The selected features are represented as a set, so the number of distinct selected features may be smaller than k.
The higher this parameter is set, the longer it will take to run the algorithm; the runtime increases linearly.
To a large extent, higher values give better embeddings.
As a loose guideline, one may try setting embeddingDensity to 128, 256, 512, or roughly 25%-50% of the embedding dimension, i.e. the number of binary features.
Feature generation
The dimension parameter determines the number of binary features when feature generation is applied.
A high dimension increases expressiveness but requires more data in order to be useful, and can lead to the curse of dimensionality for downstream machine learning tasks.
Additionally, more computation resources will be required.
However, binary embeddings only have a single bit of information per dimension.
In contrast, dense Float embeddings have 64 bits of information per dimension.
Consequently, in order to obtain similarly good embeddings with HashGNN as with an algorithm that produces dense embeddings (e.g. FastRP or GraphSAGE) one typically needs a significantly higher dimension.
Some values to consider trying for densityLevel are very low values such as 1 or 2, increasing as appropriate.
Feature binarization
The dimension parameter determines the number of binary features when binarization is applied.
A high dimension increases expressiveness, but also the sparsity of features.
Therefore, a higher dimension should also be coupled with a higher embeddingDensity and/or a lower threshold.
Higher dimension also leads to longer training times of downstream models and higher memory footprint.
Increasing the threshold leads to sparser feature vectors.
However, binary embeddings only have a single bit of information per dimension.
In contrast, dense Float embeddings have 64 bits of information per dimension.
Consequently, in order to obtain similarly good embeddings with HashGNN as with an algorithm that produces dense embeddings (e.g. FastRP or GraphSAGE) one typically needs a significantly higher dimension.
The default threshold of 0 leads to fairly many features being active for each node.
Often sparse feature vectors are better, and it may therefore be useful to increase the threshold beyond the default.
One heuristic for choosing a good threshold is based on the average and standard deviation of the dot products between the hyperplanes and the node feature vectors.
For example, one can set the threshold to the average plus two times the standard deviation.
To obtain these values, run HashGNN and read them off from the logs.
Then you can use those values to reconfigure the threshold accordingly.
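As a sketch of this heuristic (using hypothetical data and self-generated hyperplanes instead of the statistics reported in the logs), a candidate threshold can be computed like this:

import random
from statistics import mean, stdev

def suggest_threshold(node_vectors, dimension, seed=42):
    rng = random.Random(seed)
    input_dim = len(next(iter(node_vectors.values())))
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                   for _ in range(dimension)]
    # All dot products between node feature vectors and hyperplane normal vectors.
    dots = [sum(v * h for v, h in zip(vec, plane))
            for vec in node_vectors.values() for plane in hyperplanes]
    return mean(dots) + 2 * stdev(dots)

print(suggest_threshold({"Dan": [18.0, 0.63], "Annie": [12.0, 0.05], "Matt": [22.0, 0.42]},
                        dimension=4))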
Neighbor influence
As explained above, the default value is a reasonable starting point.
If using a hyperparameter tuning library, this parameter may favorably be transformed by a function with increasing derivative, such as the exponential function or a function of the type a/(b - x).
The probability of selecting (and keeping throughout the iterations) a feature from different nodes depends on neighborInfluence and the number of hops to the node.
Therefore, neighborInfluence should be re-tuned when iterations is changed.
Heterogeneous
In general, there is a large amount of information to store about paths containing multiple relationship types in a heterogeneous graph, so with many iterations and relationship types, a very high embedding dimension may be necessary. This is especially true for unsupervised embedding algorithms such as HashGNN. Therefore, caution should be taken when using many iterations in the heterogeneous mode.
Random seed
The random seed has a special role in this algorithm.
Other than making all steps of the algorithm deterministic, the randomSeed parameter determines (to some degree) which hash functions are used inside the algorithm.
This is important since it greatly affects which features are sampled each iteration.
The hashing plays a similar role to the (typically neural) transformations in each layer of Graph Neural Networks, which tells us something about how important the hash functions are.
Indeed, one can often see a significant difference in the quality of the node embeddings output from the algorithm when only the randomSeed is different in the compute configuration.
For these reasons, it can actually make sense to tune the random seed parameter. Note that it should be tuned as a categorical (i.e. non-ordinal) number, meaning that values 1 and 2 can be considered just as similar or different as 1 and 100. A good way to start doing this is to choose 5 - 10 arbitrary integers (e.g. values 1, 2, 3, 4 and 5) as the candidates for the random seed.
The randomSeed parameter co-depends on several compute configuration parameters, and in particular on the neighborInfluence parameter, which also directly influences which hash functions are used.
Therefore, if neighborInfluence is changed, the randomSeed parameter likely needs to be re-tuned.
Syntax
This section covers the syntax used to execute the HashGNN algorithm.
CALL Neo4j_Graph_Analytics.graph.hashgnn(
'CPU_X64_XS', (1)
{
['defaultTablePrefix': '...',] (2)
'project': {...}, (3)
'compute': {...}, (4)
'write': {...} (5)
}
);
1 | Compute pool selector. |
2 | Optional prefix for table references. |
3 | Project config. |
4 | Compute config. |
5 | Write config. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
computePoolSelector | String | | no | The selector for the compute pool on which to run the HashGNN job. |
configuration | Map | | no | Configuration for graph project, algorithm compute and result write back. |
The configuration map consists of the following three entries.
For more details on the Project configuration below, refer to the Project documentation.
Name | Type |
---|---|
nodeTables | List of node tables. |
relationshipTables | Map of relationship types to relationship tables. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
featureProperties | List of String | | yes | The names of the node properties that should be used as input features. All property names must exist in the projected graph and be of type Float or List of Float. |
iterations | Integer | | no | The number of iterations to run HashGNN. Must be at least 1. |
embeddingDensity | Integer | | no | The number of features to sample per node in each iteration. Called k in the original paper. |
heterogeneous | Boolean | | yes | Whether different relationship types should be treated differently. |
neighborInfluence | Float | | yes | Controls how often neighbors' features are sampled in each iteration relative to sampling the node's own features. Must be non-negative. |
binarizeFeatures | Map | | yes | A map with keys dimension and threshold to configure binarization of the input features. |
generateFeatures | Map | | yes | A map with keys dimension and densityLevel to configure generation of binary features. |
outputDimension | Integer | | yes | If given, the embeddings are projected randomly into outputDimension dense dimensions. |
randomSeed | Integer | | yes | A random seed which is used for all randomness in computing the embeddings. |
For more details on the Write configuration below, refer to the Write documentation.
Name | Type | Default | Optional | Description |
---|---|---|---|---|
nodeLabel | String | | no | Node label in the in-memory graph from which to write a node property. |
nodeProperty | String | | yes | The node property that will be written back to the Snowflake database. |
outputTable | String | | no | Table in Snowflake database to which node properties are written. |
Example
In this section we will show examples of running the HashGNN algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small social-network graph of a handful of nodes connected in a particular pattern.
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID VARCHAR, AGE INT, EXPERIENCE FLOAT, HIPSTER INT);
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS ADD COLUMN SOURNESS FLOAT DEFAULT 0.0;
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS ADD COLUMN SWEETNESS FLOAT DEFAULT 0.0;
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS ADD COLUMN TROPICAL INT DEFAULT 0;
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID, AGE, EXPERIENCE, HIPSTER) VALUES
('Dan', 18, 0.63, 0),
('Annie', 12, 0.05, 0),
('Matt', 22, 0.42, 0),
('Jeff', 51, 0.12, 0),
('Brie', 31, 0.06, 0),
('John', 65, 0.23, 1),
('Brie', 4, 1.0, 0);
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.FRUITS (NODEID VARCHAR, TROPICAL INT, SOURNESS FLOAT, SWEETNESS FLOAT);
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.FRUITS ADD COLUMN EXPERIENCE FLOAT DEFAULT 0.0;
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.FRUITS (NODEID, TROPICAL, SOURNESS, SWEETNESS) VALUES
('Apple', 0, 0.3, 0.6),
('Banana', 1, 0.1, 0.9),
('Mango', 1, 0.3, 1.0),
('Plum', 0, 0.5, 0.8);
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.LIKES (SOURCENODEID VARCHAR, TARGETNODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.LIKES VALUES
('Dan', 'Apple'),
('Annie', 'Banan'),
('Matt', 'Mango'),
('Jeff', 'Mango'),
('Brie', 'Banana'),
('Elsa', 'Plum'),
('John', 'Plum');
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.KNOWS (SOURCENODEID VARCHAR, TARGETNODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.KNOWS VALUES
('Dan', 'Annie'),
('Dan', 'Matt'),
('Annie', 'Matt'),
('Annie', 'Jeff'),
('Annie', 'Brie'),
('Matt', 'Brie'),
('Brie', 'Elsa'),
('Brie', 'Jeff'),
('John', 'Jeff');
This graph has two node tables, person nodes and fruit nodes. The two node sets are connected via LIKES relationships, and there are also KNOWS relationships between the person nodes.
Please note that we add some node table columns with only default values. The reason is that HashGNN requires all nodes to have the same set of features, and we want to use the node columns as features.
Run with binarization
To begin with we will run the algorithm only on the person nodes, using the AGE
and EXPERIENCE
columns as features.
Since these properties are not binary, we will use the binarization feature of HashGNN to transform them into binary features.
To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.
We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'project': {
'nodeTables': [ 'PERSONS' ],
'relationshipTables': {
'KNOWS': {
'sourceTable': 'PERSONS',
'targetTable': 'PERSONS',
'orientation': 'UNDIRECTED'
}
}
},
'compute': {
'iterations': 1,
'embeddingDensity': 2,
'binarizeFeatures': {'dimension': 4, 'threshold': 32},
'featureProperties': ['AGE', 'EXPERIENCE'],
'randomSeed': 42
},
'write': [{
'nodeLabel': 'PERSONS',
'outputTable': 'PERSON_EMBEDDINGS'
}]
});
JOB_ID | JOB_START | JOB_END | JOB_RESULT |
---|---|---|---|
job_fe099995c12b431cbf5fa46d4a88a30f |
2025-08-06 07:33:11.282 |
2025-08-06 07:33:17.678 |
{ "hashgnn_1": { "computeMillis": 32, "configuration": { "binarizeFeatures": { "dimension": 4, "threshold": 32 }, "concurrency": 6, "embeddingDensity": 2, "featureProperties": [ "AGE", "EXPERIENCE" ], "heterogeneous": false, "iterations": 1, "jobId": "bc9f591d-b233-42c2-ae47-07135a006974", "logProgress": true, "mutateProperty": "hashgnn", "neighborInfluence": 1, "nodeLabels": [ "*" ], "randomSeed": 42, "relationshipTypes": [ "*" ], "sudo": false }, "mutateMillis": 2, "nodeCount": 6, "nodePropertiesWritten": 6, "preProcessingMillis": 7 }, "project_1": { "graphName": "snowgraph", "nodeCount": 6, "nodeMillis": 218, "relationshipCount": 16, "relationshipMillis": 462, "totalMillis": 680 }, "write_node_property_1": { "copyIntoTableMillis": 1510, "exportMillis": 2411, "nodeLabel": "PERSONS", "nodeProperty": "hashgnn", "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS", "propertiesExported": 6, "stageUploadMillis": 658 } } |
The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. We can query it like so:
SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
NODEID | HASHGNN |
---|---|
Dan | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Annie | [ 1.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ] |
Matt | [ 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Jeff | [ 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Brie | [ 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
John | [ 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
The results of the algorithm are not very intuitively interpretable, as the node embedding format is a mathematical abstraction of the node within its neighborhood, designed for machine learning.
What we can see is that the embeddings have four elements (as configured using binarizeFeatures.dimension).
Due to the random nature of the algorithm, the results will vary between runs unless a fixed randomSeed is specified, as is done in this example.
Run without binarization
Next we will run the algorithm, again only on the person nodes, using the HIPSTER column as the feature.
Since this property is binary, we will not use the binarization feature of HashGNN.
To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.
We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'project': {
'nodeTables': [ 'PERSONS' ],
'relationshipTables': {
'KNOWS': {
'sourceTable': 'PERSONS',
'targetTable': 'PERSONS',
'orientation': 'UNDIRECTED'
}
}
},
'compute': {
'iterations': 1,
'embeddingDensity': 2,
'featureProperties': ['HIPSTER'],
'randomSeed': 123
},
'write': [{
'nodeLabel': 'PERSONS',
'outputTable': 'PERSON_EMBEDDINGS'
}]
});
JOB_ID | JOB_START | JOB_END | JOB_RESULT |
---|---|---|---|
job_92acd5e9bc374455bdd1a5a3361168c9 |
2025-08-06 07:39:46.436 |
2025-08-06 07:39:52.459 |
{ "hashgnn_1": { "computeMillis": 34, "configuration": { "concurrency": 6, "embeddingDensity": 2, "featureProperties": [ "HIPSTER" ], "heterogeneous": false, "iterations": 1, "jobId": "71425d87-f242-4776-99c7-23c90dceb946", "logProgress": true, "mutateProperty": "hashgnn", "neighborInfluence": 1, "nodeLabels": [ "*" ], "randomSeed": 123, "relationshipTypes": [ "*" ], "sudo": false }, "mutateMillis": 1, "nodeCount": 6, "nodePropertiesWritten": 6, "preProcessingMillis": 9 }, "project_1": { "graphName": "snowgraph", "nodeCount": 6, "nodeMillis": 139, "relationshipCount": 16, "relationshipMillis": 300, "totalMillis": 439 }, "write_node_property_1": { "copyIntoTableMillis": 1002, "exportMillis": 1865, "nodeLabel": "PERSONS", "nodeProperty": "hashgnn", "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS", "propertiesExported": 6, "stageUploadMillis": 562 } } |
The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. We can query it like so:
SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
NODEID | HASHGNN |
---|---|
Dan | [ 0.000000000000000e+00 ] |
Annie | [ 0.000000000000000e+00 ] |
Matt | [ 0.000000000000000e+00 ] |
Jeff | [ 1.000000000000000e+00 ] |
Brie | [ 0.000000000000000e+00 ] |
John | [ 1.000000000000000e+00 ] |
In this example the embedding dimension becomes 1, because without binarization it is the number of features given, which is 1 due to the single 'HIPSTER' column.
Run with feature generation
Next we will run the algorithm, again only on the person nodes, using no features from the node tables, but instead generating random binary features. This is useful when the nodes do not have any features, or when the features are not useful.
To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.
We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'project': {
'nodeTables': [ 'PERSONS' ],
'relationshipTables': {
'KNOWS': {
'sourceTable': 'PERSONS',
'targetTable': 'PERSONS',
'orientation': 'UNDIRECTED'
}
}
},
'compute': {
'iterations': 1,
'embeddingDensity': 2,
'generateFeatures': {'dimension': 6, 'densityLevel': 1},
'randomSeed': 42
},
'write': [{
'nodeLabel': 'PERSONS',
'outputTable': 'PERSON_EMBEDDINGS'
}]
});
JOB_ID | JOB_START | JOB_END | JOB_RESULT |
---|---|---|---|
job_65555fca32dd4088a4d18b2b888a1b96 |
2025-08-06 07:43:16.691 |
2025-08-06 07:43:23.157 |
{ "hashgnn_1": { "computeMillis": 25, "configuration": { "concurrency": 6, "embeddingDensity": 2, "featureProperties": [], "generateFeatures": { "densityLevel": 1, "dimension": 6 }, "heterogeneous": false, "iterations": 1, "jobId": "7117cf04-0d01-4777-b964-f72dabd93c09", "logProgress": true, "mutateProperty": "hashgnn", "neighborInfluence": 1, "nodeLabels": [ "*" ], "randomSeed": 42, "relationshipTypes": [ "*" ], "sudo": false }, "mutateMillis": 1, "nodeCount": 6, "nodePropertiesWritten": 6, "preProcessingMillis": 7 }, "project_1": { "graphName": "snowgraph", "nodeCount": 6, "nodeMillis": 182, "relationshipCount": 16, "relationshipMillis": 413, "totalMillis": 595 }, "write_node_property_1": { "copyIntoTableMillis": 1012, "exportMillis": 1825, "nodeLabel": "PERSONS", "nodeProperty": "hashgnn", "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS", "propertiesExported": 6, "stageUploadMillis": 545 } } |
The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. We can query it like so:
SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
NODEID | HASHGNN |
---|---|
Dan | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ] |
Annie | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ] |
Matt | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ] |
Jeff | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ] |
Brie | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00 ] |
John | [ 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
And as we can see, each node has at least one feature active, and no node has more than two features active (limited by the embeddingDensity).
Run on heterogeneous graph
Lastly, we will run the algorithm on the heterogeneous graph, also including fruit nodes, and using the EXPERIENCE
, TROPICAL
, SOURNESS
and SWEETNESS
columns as features.
To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.
We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'project': {
'nodeTables': [ 'PERSONS', 'FRUITS' ],
'relationshipTables': {
'KNOWS': {
'sourceTable': 'PERSONS',
'targetTable': 'PERSONS',
'orientation': 'UNDIRECTED'
},
'LIKES': {
'sourceTable': 'PERSONS',
'targetTable': 'FRUITS',
'orientation': 'UNDIRECTED'
}
}
},
'compute': {
'heterogeneous': true,
'iterations': 2,
'embeddingDensity': 4,
'binarizeFeatures': {'dimension': 6, 'threshold': 0.2},
'featureProperties': ['EXPERIENCE', 'SOURNESS', 'SWEETNESS', 'TROPICAL'],
'randomSeed': 42
},
'write': [
{
'nodeLabel': 'PERSONS',
'outputTable': 'PERSON_EMBEDDINGS'
},
{
'nodeLabel': 'FRUITS',
'outputTable': 'FRUIT_EMBEDDINGS'
}
]
});
JOB_ID | JOB_START | JOB_END | JOB_RESULT |
---|---|---|---|
job_2fbffb14f7c6402588192cbe8eee793f |
2025-08-06 07:47:04.670 |
2025-08-06 07:47:13.095 |
{ "hashgnn_1": { "computeMillis": 47, "configuration": { "binarizeFeatures": { "dimension": 6, "threshold": 0.2 }, "concurrency": 6, "embeddingDensity": 4, "featureProperties": [ "EXPERIENCE", "SOURNESS", "SWEETNESS", "TROPICAL" ], "heterogeneous": true, "iterations": 2, "jobId": "b7a65d45-22b5-4884-bd4f-84c692355d5f", "logProgress": true, "mutateProperty": "hashgnn", "neighborInfluence": 1, "nodeLabels": [ "*" ], "randomSeed": 42, "relationshipTypes": [ "*" ], "sudo": false }, "mutateMillis": 1, "nodeCount": 10, "nodePropertiesWritten": 10, "preProcessingMillis": 10 }, "project_1": { "graphName": "snowgraph", "nodeCount": 10, "nodeMillis": 242, "relationshipCount": 26, "relationshipMillis": 333, "totalMillis": 575 }, "write_node_property_1": { "copyIntoTableMillis": 1015, "exportMillis": 1809, "nodeLabel": "PERSONS", "nodeProperty": "hashgnn", "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS", "propertiesExported": 6, "stageUploadMillis": 550 }, "write_node_property_2": { "copyIntoTableMillis": 936, "exportMillis": 1583, "nodeLabel": "FRUITS", "nodeProperty": "hashgnn", "outputTable": "EXAMPLE_DB.DATA_SCHEMA.FRUIT_EMBEDDINGS", "propertiesExported": 4, "stageUploadMillis": 428 } } |
The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. Let us inspect the embeddings for the person nodes:
SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
NODEID | HASHGNN |
---|---|
Dan | [ 1.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Annie | [ 1.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Matt | [ 1.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Jeff | [ 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
Brie | [ 1.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
John | [ 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ] |
We could inspect FRUIT_EMBEDDINGS and find similar results, but we will not do that here for brevity.
Virtual example
Perhaps the below example is best enjoyed with a pen and paper.
Let's say we have a node a with feature f1, a node b with feature f2 and a node c with features f1 and f3.
The graph structure is a—b—c.
We imagine running HashGNN for one iteration with embeddingDensity=2.
For simplicity, we will assume that the hash functions return some made up numbers as we go.
During the first iteration and k=0, we compute an embedding for (a).
A hash value for f1 turns out to be 7.
Since (b) is a neighbor of (a), we generate a value for its feature f2, which turns out to be 11.
The value 7 is sampled from a hash function which we call "one" and 11 from a hash function "two".
Thus f1 is added to the new features for (a) since it has a smaller hash value.
We repeat for k=1, and this time the hash values are 4 and 2, so now f2 is added as a feature to (a).
We now consider (b).
The feature f2 gets hash value 8 using hash function "one".
Looking at the neighbor (a), we sample a hash value for f1, which becomes 5 using hash function "two".
Since (c) has more than one feature, we also have to select one of the two features f1 and f3 before considering the "winning" feature as before as input to hash function "two".
We use a third hash function "three" for this purpose, and f3 gets the smaller value of 1.
We now compute a hash of f3 using "two" and it becomes 6.
Since 5 is smaller than 6, f1 is the "winning" neighbor feature for (b), and since 5 is also smaller than 8, it is the overall "winning" feature.
Therefore, we add f1 to the embedding of (b).
We proceed similarly with k=1, and f1 is selected again.
Since the embeddings consist of binary features, this second addition has no effect.
We omit the details of computing the embedding of (c).
After the 2 sampling rounds, the iteration is complete and since there is only one iteration, we are done.
Each node has a binary embedding that contains some subset of the original binary features.
In particular, (a) has features f1 and f2, while (b) has only the feature f1.