HashGNN

Introduction

HashGNN is a node embedding algorithm which resembles Graph Neural Networks (GNNs) but does not include a model or require training. The neural networks of GNNs are replaced by random hash functions, in the flavor of min-hash locality-sensitive hashing. Thus, HashGNN combines ideas of GNNs and fast randomized algorithms.

The Neo4j Graph Analytics for Snowflake implementation of HashGNN is based on the paper "Hashing-Accelerated Graph Neural Networks for Link Prediction", and further introduces a few improvements and generalizations. The generalizations include support for embedding heterogeneous graphs; relationships of different types are associated with different hash functions, which allows for preserving relationship-typed graph topology. Moreover, the degree to which embeddings are updated using features from neighboring nodes versus features of the node itself can be configured via neighborInfluence.

The runtime of this algorithm is significantly lower than that of GNNs in general, but it can still give comparable embedding quality for certain graphs, as shown in the original paper. Moreover, the heterogeneous generalization gives results comparable to those reported in the paper "Graph Transformer Networks" when benchmarked on the same datasets.

The execution does not require GPUs, which GNNs typically use, and parallelizes well across many CPU cores.

The algorithm

To clarify how HashGNN works, we walk through a small example of a three-node graph in the Virtual example section below, for readers who are curious about the details of the feature selection and prefer to learn from examples.

The HashGNN algorithm can only run on binary features. Therefore, there is an optional first step to transform (possibly non-binary) input features into binary features as part of the algorithm.

For a number of iterations, a new binary embedding is computed for each node using the embeddings of the previous iteration. In the first iteration, the previous embeddings are the input feature vectors or the binarized input vectors.

During one iteration, each node embedding vector is constructed by taking K random samples. The random sampling is carried out by successively selecting features with lowest min-hash values. Features of each node itself and of its neighbors are both considered.

There are three types of hash functions involved: 1) a function applied to a node’s own features, 2) a function applied to a subset of neighbors' features, 3) a function applied to all neighbors' features to select the subset for hash function 2). For each iteration and sampling round k<K, new hash functions are used, and the third function also varies depending on the relationship type connecting to the neighbor it is being applied on.

The sampling is consistent in the sense that if nodes (a) and (b) have identical or similar local graphs, the samples for (a) and (b) are also identical or similar. By local graph, we mean the subgraph with features and relationship types, containing all nodes at most iterations hops away.

The number K is called embeddingDensity in the configuration of the algorithm.
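To make the sampling procedure more concrete, below is a minimal Python sketch of a single HashGNN iteration. It is only illustrative: it uses toy pseudo-random hash functions rather than proper min-hash permutations, ignores neighborInfluence, and runs in homogeneous mode (a single selection function for all neighbors). All function and variable names are our own.

import random

def min_hash(seed):
    """A toy stand-in for a min-hash function: maps a feature id to a pseudo-random value in [0, 1)."""
    def h(feature):
        return random.Random(str((seed, feature))).random()
    return h

def hashgnn_iteration(features, neighbors, embedding_density, iteration):
    """One illustrative HashGNN iteration.
    features: dict mapping node -> set of active binary features (previous iteration's embedding)
    neighbors: dict mapping node -> list of neighbor nodes
    """
    new_features = {}
    for node in features:
        selected = set()
        for k in range(embedding_density):
            h_own = min_hash(("own", iteration, k))       # hash function 1
            h_nbr = min_hash(("neighbor", iteration, k))  # hash function 2
            h_sel = min_hash(("select", iteration, k))    # hash function 3
            # The node's own features compete with one candidate feature per neighbor.
            candidates = [(h_own(f), f) for f in features[node]]
            for nbr in neighbors.get(node, []):
                if features[nbr]:
                    # Hash function 3 picks one feature of the neighbor ...
                    _, f = min((h_sel(g), g) for g in features[nbr])
                    # ... which then competes via hash function 2.
                    candidates.append((h_nbr(f), f))
            if candidates:
                # The feature with the lowest hash value wins this sampling round.
                selected.add(min(candidates)[1])
        new_features[node] = selected
    return new_features

# The three-node graph from the Virtual example section at the end of this page: (a)--(b)--(c).
features = {"a": {"f1"}, "b": {"f2"}, "c": {"f1", "f3"}}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(hashgnn_iteration(features, neighbors, embedding_density=2, iteration=0))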

The algorithm ends with another optional step that maps the binary embeddings to dense vectors.

Features

The original HashGNN algorithm assumes that nodes have binary features as input, and produces binary embedding vectors as output (unless output densification is opted for). Since this is not always the case for real-world graphs, our algorithm also comes with options to binarize node properties, or generate binary features from scratch.

Using binary node properties as features

If your node properties have only 0 or 1 values (or arrays of such values), you can use them directly as input to the HashGNN algorithm. To do that, you provide them as featureProperties in the compute configuration.

Feature generation

To use feature generation, specify a map containing dimension and densityLevel for the generateFeatures compute configuration parameter. This will generate a total of dimension features, of which approximately densityLevel are switched on for each node. The active features for each node are selected uniformly at random with replacement. Although the active features are random, the feature vector for a node acts as an approximately unique signature for that node. This is akin to one-hot encoding of the node IDs, but approximate in that it has a much lower dimension than the node count of the graph. Please note that when feature generation is used, it is not supported to also supply featureProperties, which is otherwise mandatory.
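The following Python sketch shows the idea behind such feature generation; it is an assumption-laden illustration rather than the actual implementation, and all names in it are our own:

import random

def generate_features(node_ids, dimension, density_level, seed=42):
    """Give each node approximately density_level active features out of dimension,
    sampled uniformly at random with replacement (so a node may end up with
    fewer distinct active features than density_level)."""
    rng = random.Random(seed)
    return {
        node: {rng.randrange(dimension) for _ in range(density_level)}
        for node in node_ids
    }

print(generate_features(["Dan", "Annie", "Matt"], dimension=6, density_level=2))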

Feature binarization

Feature binarization uses hyperplane rounding and is configured via featureProperties and a map parameter binarizeFeatures containing threshold and dimension. The hyperplane rounding uses hyperplanes defined by vectors filled with Gaussian random values. The dimension parameter determines the number of generated binary features that the input features are transformed into. For each hyperplane (one for each dimension) and node we compute the dot product of the node’s input feature vector and the normal vector of the hyperplane. If this dot product is larger than the given threshold, the node gets the feature corresponding to that hyperplane.
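As a rough illustration of hyperplane rounding, here is a small Python sketch; the exact random number generation used by the implementation may differ, and the example input values are taken from the PERSONS table used later on this page:

import numpy as np

def binarize_features(inputs, dimension, threshold, seed=42):
    """Hyperplane rounding: inputs is a (num_nodes, num_input_features) float matrix.
    Each of the `dimension` hyperplanes has a Gaussian random normal vector; a node
    gets the corresponding binary feature iff the dot product of its input feature
    vector with that normal vector exceeds the threshold."""
    rng = np.random.default_rng(seed)
    normals = rng.standard_normal((dimension, inputs.shape[1]))
    return (inputs @ normals.T > threshold).astype(np.int8)

# AGE and EXPERIENCE for three of the example persons, mapped to 4 binary features.
inputs = np.array([[18.0, 0.63], [12.0, 0.05], [22.0, 0.42]])
print(binarize_features(inputs, dimension=4, threshold=32.0))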

Although hyperplane rounding can be applied to a binary input, it is often best to use the already binary input directly. However, sometimes using binarization with a different dimension than the number of input features can be useful to either act as dimensionality reduction or introduce redundancy that can be leveraged by HashGNN.

The hyperplane rounding may not work well if the input features are of different magnitudes since those of larger magnitudes will influence the generated binary features more. If this is not the intended behavior for your application we recommend normalizing your node properties (by feature dimension) prior to running HashGNN.

Neighbor influence

The parameter neighborInfluence determines how prone the algorithm is to select neighbors' features over features from the same node. The default value of neighborInfluence is 1.0 and with this value, on average a feature will be selected from the neighbors 50% of the time. Increasing the value leads to neighbors being selected more often. The probability of selecting a feature from the neighbors as a function of neighborInfluence has a hockey-stick-like shape, somewhat similar to the shape of y=log(x) or y=C - 1/x. This implies that the probability is more sensitive for low values of neighborInfluence.

Heterogeneity support

The Neo4j Graph Analytics for Snowflake implementation of HashGNN provides a new generalization to heterogeneous graphs in that it can distinguish between different relationship types. To enable the heterogeneous support set heterogeneous to true. The generalization works as the original HashGNN algorithm, but whenever a hash function is applied to a feature of a neighbor node, the algorithm uses a hash function that depends not only on the iteration and on a number k < embeddingDensity, but also on the type of the relationship connecting to the neighbor. Consider an example where HashGNN is run with one iteration, and we have (a)-[:R]→(x), (b)-[:R]→(x) and (c)-[:S]→(x). Assume that a feature f of (x) is selected for (a) and the hash value is very small. This will make it very likely that the feature is also selected for (b). There will however be no correlation to f being selected for (c) when considering the relationship (c)-[:S]→(x), because a different hash function is used for S. We can conclude that nodes with similar neighborhoods (including node properties and relationship types) get similar embeddings, while nodes that have less similar neighborhoods get less similar embeddings.
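The following Python fragment sketches how relationship-type-dependent hashing produces the correlation described above. It is a simplified illustration under the same toy hashing assumption as the earlier sketch:

import random

def min_hash(seed):
    """Toy hash function keyed by an arbitrary seed."""
    def h(feature):
        return random.Random(str((seed, feature))).random()
    return h

# In heterogeneous mode, hash functions applied to neighbor features are
# additionally keyed by the relationship type connecting to that neighbor.
iteration, k = 0, 0
h_R = min_hash(("neighbor", iteration, k, "R"))
h_S = min_hash(("neighbor", iteration, k, "S"))

# (a)-[:R]->(x) and (b)-[:R]->(x) see the same hash value for feature f of (x),
# so a small value makes f likely to be selected for both (a) and (b).
# (c)-[:S]->(x) uses an independent hash function, so there is no such correlation for (c).
print(h_R("f") == h_R("f"), h_R("f") == h_S("f"))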

An advantage of heterogeneous HashGNN over a homogeneous embedding algorithm such as FastRP is that it is not necessary to manually select multiple projections or create meta-path graphs and then run FastRP on each of them. With the heterogeneous algorithm, the full heterogeneous graph can be used in a single execution.

Node property schema for heterogeneous graphs

Heterogeneous graphs typically have different node properties for different node labels. HashGNN assumes that all nodes have the same allowed features. Therefore, use a default value of 0 for each feature property in each node table of the graph projection (as done in the example below). This works both in the binary input case and when binarization is applied, because having a binary feature with value 0 behaves as if the feature is absent. The 0 values are represented in a sparse format, so storing 0 values for many nodes incurs only a small memory overhead.

Orientation

Choosing the right orientation when creating the graph may have a large impact. HashGNN works for any orientation, and the choice of orientation is problem specific. Given a directed relationship type, you may pick one orientation, or use two projections with NATURAL and REVERSE. Using the analogy with GNNs, using a different relationship type for the reversed relationships leads to using a different set of weights when considering a relationship versus its reverse. For HashGNN, this means using different min-hash functions for the two directions. For example, in a citation network, a paper citing another paper is very different from the paper being cited.

Output densification

Since binary embeddings need to be of higher dimension than dense floating-point embeddings to encode the same amount of information, binary embeddings require more memory and longer training time for downstream models. The output embeddings can optionally be densified by using random projection, similar to what is done to initialize FastRP with node properties. This behavior is activated by specifying outputDimension. Output densification can improve runtime and memory of downstream tasks at the cost of introducing approximation error due to the random nature of the projection. The larger the outputDimension, the lower the approximation error, but also the lower the performance savings.
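The sketch below illustrates the general idea of such a random projection in Python; the actual kind of projection used internally may differ (a Gaussian projection matrix is assumed here purely for illustration):

import numpy as np

def densify(binary_embeddings, output_dimension, seed=42):
    """Randomly project 0/1 embeddings into a dense float space of size output_dimension."""
    rng = np.random.default_rng(seed)
    num_features = binary_embeddings.shape[1]
    projection = rng.standard_normal((num_features, output_dimension)) / np.sqrt(output_dimension)
    return binary_embeddings @ projection

binary = np.array([[1, 0, 1, 0], [1, 1, 0, 0]], dtype=float)
print(densify(binary, output_dimension=2))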

Tuning algorithm parameters

In order to improve the embedding quality using HashGNN on one of your graphs, it is possible to tune the algorithm parameters. This process of finding the best parameters for your specific use case and graph is typically referred to as hyperparameter tuning. We will go through each of the compute configuration parameters and explain how they behave.

Iterations

The maximum number of hops between a node and other nodes that affect its embedding is equal to the number of iterations of HashGNN which is configured with iterations. This is analogous to the number of layers in a GNN or the number of iterations in FastRP. Often a value of 2 to 4 is sufficient, but sometimes more iterations are useful.

Embedding density

The embeddingDensity parameter is what the original paper denotes by k. For each iteration of HashGNN, k features are selected from the previous iteration’s embeddings for the same node and for its neighbors. The selected features are represented as a set, so the number of distinct selected features may be smaller than k. The higher this parameter is set, the longer it will take to run the algorithm, and the runtime increases in a linear fashion. To a large extent, higher values give better embeddings. As a loose guideline, one may try to set embeddingDensity to 128, 256, 512, or roughly 25%-50% of the embedding dimension, i.e. the number of binary features.

Feature generation

The dimension parameter determines the number of binary features when feature generation is applied. A high dimension increases expressiveness but requires more data in order to be useful and can lead to the curse of high dimensionality for downstream machine learning tasks. Additionally, more computation resources will be required. However, binary embeddings only have a single bit of information per dimension. In contrast, dense Float embeddings have 64 bits of information per dimension. Consequently, in order to obtain similarly good embeddings with HashGNN as with an algorithm that produces dense embeddings (e.g. FastRP or GraphSAGE), one typically needs a significantly higher dimension. For densityLevel, consider trying very low values such as 1 or 2, and increase as appropriate.

Feature binarization

The dimension parameter determines the number of binary features when binarization is applied. A high dimension increases expressiveness, but also the sparsity of features. Therefore, a higher dimension should also be coupled with higher embeddingDensity and/or lower threshold. Higher dimension also leads to longer training times of downstream models and higher memory footprint. Increasing the threshold leads to sparser feature vectors.

However, binary embeddings only have a single bit of information per dimension. In contrast, dense Float embeddings have 64 bits of information per dimension. Consequently, in order to obtain similarly good embeddings with HashGNN as with an algorithm that produces dense embeddings (e.g. FastRP or GraphSAGE) one typically needs a significantly higher dimension.

The default threshold of 0 leads to fairly many features being active for each node. Often sparse feature vectors are better, and it may therefore be useful to increase the threshold beyond the default. One heuristic for choosing a good threshold is based on the average and standard deviation of the dot products between the hyperplanes' normal vectors and the node feature vectors. For example, one can set the threshold to the average plus two times the standard deviation. To obtain these values, run HashGNN and read them off from the logs; then reconfigure the threshold accordingly.
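Assuming you have the (pre-binarization) feature matrix available locally and that the hyperplanes are Gaussian as described above, the heuristic itself is simple arithmetic, as the following Python sketch shows:

import numpy as np

def threshold_heuristic(inputs, dimension, seed=42):
    """Estimate a binarization threshold as mean + 2 * std of the dot products
    between Gaussian hyperplane normal vectors and the node feature vectors."""
    rng = np.random.default_rng(seed)
    normals = rng.standard_normal((dimension, inputs.shape[1]))
    dots = inputs @ normals.T
    return float(dots.mean() + 2.0 * dots.std())

inputs = np.array([[18.0, 0.63], [12.0, 0.05], [22.0, 0.42]])
print(threshold_heuristic(inputs, dimension=4))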

Neighbor influence

As explained above, the default value is a reasonable starting point. If using a hyperparameter tuning library, this parameter may favorably be transformed by a function with increasing derivative such as the exponential function, or a function of the type a/(b - x). The probability of selecting (and keeping throughout the iterations) a feature from different nodes depends on neighborInfluence and the number of hops to the node. Therefore, neighborInfluence should be re-tuned when iterations is changed.
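For example, when sampling candidate values for neighborInfluence, one might draw them on an exponential scale so that more resolution is spent on the sensitive low range; the bounds below are arbitrary illustrative values:

import math
import random

rng = random.Random(0)
# Log-uniform sampling: uniform in log-space, then mapped back via exp.
candidates = sorted(math.exp(rng.uniform(math.log(0.25), math.log(4.0))) for _ in range(5))
print(candidates)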

Heterogeneous

In general, there is a large amount of information to store about paths containing multiple relationship types in a heterogeneous graph, so with many iterations and relationship types, a very high embedding dimension may be necessary. This is especially true for unsupervised embedding algorithms such as HashGNN. Therefore, caution should be taken when using many iterations in the heterogeneous mode.

Random seed

The random seed has a special role in this algorithm. Other than making all steps of the algorithm deterministic, the randomSeed parameter determines (to some degree) which hash functions are used inside the algorithm. This is important since it greatly affects which features are sampled in each iteration. The hashing plays a similar role to the (typically neural) transformations in each layer of Graph Neural Networks, which tells us something about how important the hash functions are. Indeed, one can often see a significant difference in the quality of the node embeddings output from the algorithm when only the randomSeed is different in the compute configuration.

For these reasons, it can actually make sense to tune the random seed parameter. Note that it should be tuned as a categorical (i.e. non-ordinal) number, meaning that values 1 and 2 can be considered just as similar or different as 1 and 100. A good way to start doing this is to choose 5 - 10 arbitrary integers (e.g. values 1, 2, 3, 4 and 5) as the candidates for the random seed.
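Purely as an illustration of treating the seed categorically, a tuning loop might look like the following, where evaluate_downstream is a hypothetical scoring function (in practice it would run HashGNN with the given randomSeed and score the resulting embeddings on your downstream task):

import random

def evaluate_downstream(seed):
    # Placeholder: substitute a real pipeline that produces embeddings with
    # randomSeed=seed and returns a validation score for the downstream model.
    return random.Random(seed).random()

seed_candidates = [1, 2, 3, 4, 5]
best_seed = max(seed_candidates, key=evaluate_downstream)
print(best_seed)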

randomSeed co-depends on several compute configuration parameters, and in particular on the neighborInfluence parameter, which also directly influences which hash functions are used. Therefore, if neighborInfluence is changed, the randomSeed parameter likely needs to be retuned.

Syntax

This section covers the syntax used to execute the HashGNN algorithm.

Run HashGNN.
CALL Neo4j_Graph_Analytics.graph.hashgnn(
  'CPU_X64_XS',                    (1)
  {
    ['defaultTablePrefix': '...',] (2)
    'project': {...},              (3)
    'compute': {...},              (4)
    'write':   {...}               (5)
  }
);
1 Compute pool selector.
2 Optional prefix for table references.
3 Project config.
4 Compute config.
5 Write config.
Table 1. Parameters
Name                | Type   | Default | Optional | Description
computePoolSelector | String | n/a     | no       | The selector for the compute pool on which to run the HashGNN job.
configuration       | Map    | {}      | no       | Configuration for graph project, algorithm compute and result write back.

The configuration map consists of the following three entries.

For more details on the Project configuration below, refer to the Project documentation.
Table 2. Project configuration
Name               | Type
nodeTables         | List of node tables.
relationshipTables | Map of relationship types to relationship tables.

Table 3. Compute configuration
Name              | Type           | Default | Optional | Description
featureProperties | List of String | []      | yes      | The names of the node properties that should be used as input features. All property names must exist in the projected graph and be of type Float or List of Float.
iterations        | Integer        | n/a     | no       | The number of iterations to run HashGNN. Must be at least 1.
embeddingDensity  | Integer        | n/a     | no       | The number of features to sample per node in each iteration. Called K in the original paper. Must be at least 1.
heterogeneous     | Boolean        | false   | yes      | Whether different relationship types should be treated differently.
neighborInfluence | Float          | 1.0     | yes      | Controls how often neighbors' features are sampled in each iteration relative to sampling the node's own features. Must be non-negative.
binarizeFeatures  | Map            | n/a     | yes      | A map with keys dimension and threshold. If given, features are transformed into dimension binary features via hyperplane rounding. Increasing threshold makes the output more sparse, and it defaults to 0. The value of dimension must be at least 1.
generateFeatures  | Map            | n/a     | yes      | A map with keys dimension and densityLevel. Should be given if and only if featureProperties is empty. If given, dimension binary features are generated with approximately densityLevel active features per node. Both must be at least 1 and densityLevel at most dimension.
outputDimension   | Integer        | n/a     | yes      | If given, the embeddings are projected randomly into outputDimension dense features. Must be at least 1.
randomSeed        | Integer        | n/a     | yes      | A random seed which is used for all randomness in computing the embeddings.

For more details on the Write configuration below, refer to the Write documentation.
Table 4. Write configuration
Name         | Type   | Default   | Optional | Description
nodeLabel    | String | n/a       | no       | Node label in the in-memory graph from which to write a node property.
nodeProperty | String | 'hashgnn' | yes      | The node property that will be written back to the Snowflake database.
outputTable  | String | n/a       | no       | Table in Snowflake database to which node properties are written.

Example

In this section we will show examples of running the HashGNN algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small social-network graph of a handful of nodes connected in a particular pattern.

The following SQL statement will create the example graph tables in Snowflake:
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID VARCHAR, AGE INT, EXPERIENCE FLOAT, HIPSTER INT);
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS ADD COLUMN SOURNESS FLOAT DEFAULT 0.0;
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS ADD COLUMN SWEETNESS FLOAT DEFAULT 0.0;
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS ADD COLUMN TROPICAL INT DEFAULT 0;
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID, AGE, EXPERIENCE, HIPSTER) VALUES
  ('Dan', 18, 0.63, 0),
  ('Annie', 12, 0.05, 0),
  ('Matt', 22, 0.42, 0),
  ('Jeff', 51, 0.12, 0),
  ('Brie', 31, 0.06, 0),
  ('John', 65, 0.23, 1),
  ('Brie', 4, 1.0, 0);

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.FRUITS (NODEID VARCHAR, TROPICAL INT, SOURNESS FLOAT, SWEETNESS FLOAT);
ALTER TABLE EXAMPLE_DB.DATA_SCHEMA.FRUITS ADD COLUMN EXPERIENCE FLOAT DEFAULT 0.0;
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.FRUITS (NODEID, TROPICAL, SOURNESS, SWEETNESS) VALUES
  ('Apple', 0, 0.3, 0.6),
  ('Banana', 1, 0.1, 0.9),
  ('Mango', 1, 0.3, 1.0),
  ('Plum', 0, 0.5, 0.8);

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.LIKES (SOURCENODEID VARCHAR, TARGETNODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.LIKES VALUES
  ('Dan', 'Apple'),
  ('Annie', 'Banan'),
  ('Matt', 'Mango'),
  ('Jeff', 'Mango'),
  ('Brie', 'Banana'),
  ('Elsa', 'Plum'),
  ('John',  'Plum');

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.KNOWS (SOURCENODEID VARCHAR, TARGETNODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.KNOWS VALUES
  ('Dan', 'Annie'),
  ('Dan', 'Matt'),
  ('Annie', 'Matt'),
  ('Annie', 'Jeff'),
  ('Annie', 'Brie'),
  ('Matt', 'Brie'),
  ('Brie',  'Elsa'),
  ('Brie', 'Jeff'),
  ('John', 'Jeff');

This graph has two node tables, person nodes and fruit nodes. The two node sets are connected via LIKES relationships, and there are also KNOWS relationships between the person nodes.

Please note that we add some node table columns with only default values. The reason is that HashGNN requires all nodes to have the same set of features, and we want to use the node columns as features.

Run with binarization

To begin with we will run the algorithm only on the person nodes, using the AGE and EXPERIENCE columns as features. Since these properties are not binary, we will use the binarization feature of HashGNN to transform them into binary features.

To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The following will run the algorithm on person nodes with binarization:
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': [ 'PERSONS' ],
        'relationshipTables': {
            'KNOWS': {
                'sourceTable': 'PERSONS',
                'targetTable': 'PERSONS',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'iterations': 1,
        'embeddingDensity': 2,
        'binarizeFeatures': {'dimension': 4, 'threshold': 32},
        'featureProperties': ['AGE', 'EXPERIENCE'],
        'randomSeed': 42
    },
    'write': [{
        'nodeLabel': 'PERSONS',
        'outputTable': 'PERSON_EMBEDDINGS'
    }]
});
Table 5. Results
JOB_ID: job_fe099995c12b431cbf5fa46d4a88a30f
JOB_START: 2025-08-06 07:33:11.282
JOB_END: 2025-08-06 07:33:17.678
JOB_RESULT:
 {
  "hashgnn_1": {
    "computeMillis": 32,
    "configuration": {
      "binarizeFeatures": {
        "dimension": 4,
        "threshold": 32
      },
      "concurrency": 6,
      "embeddingDensity": 2,
      "featureProperties": [
        "AGE",
        "EXPERIENCE"
      ],
      "heterogeneous": false,
      "iterations": 1,
      "jobId": "bc9f591d-b233-42c2-ae47-07135a006974",
      "logProgress": true,
      "mutateProperty": "hashgnn",
      "neighborInfluence": 1,
      "nodeLabels": [
        "*"
      ],
      "randomSeed": 42,
      "relationshipTypes": [
        "*"
      ],
      "sudo": false
    },
    "mutateMillis": 2,
    "nodeCount": 6,
    "nodePropertiesWritten": 6,
    "preProcessingMillis": 7
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 6,
    "nodeMillis": 218,
    "relationshipCount": 16,
    "relationshipMillis": 462,
    "totalMillis": 680
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 1510,
    "exportMillis": 2411,
    "nodeLabel": "PERSONS",
    "nodeProperty": "hashgnn",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS",
    "propertiesExported": 6,
    "stageUploadMillis": 658
  }
}

The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. We can query it like so:

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
Table 6. Results
NODEID | HASHGNN
Dan    | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Annie  | [ 1.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ]
Matt   | [ 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Jeff   | [ 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Brie   | [ 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
John   | [ 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]

The results of the algorithm are not very intuitively interpretable, as the node embedding format is a mathematical abstraction of the node within its neighborhood, designed for machine learning. What we can see is that the embeddings have four elements (as configured using binarizeFeatures.dimension).

Due to the random nature of the algorithm the results will vary between the runs, unless randomSeed is specified.

Run without binarization

Next, we will again run the algorithm only on the person nodes, this time using the HIPSTER column as the feature. Since this property is binary, we will not use the binarization feature of HashGNN.

To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The following will run the algorithm on person nodes without binarization:
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': [ 'PERSONS' ],
        'relationshipTables': {
            'KNOWS': {
                'sourceTable': 'PERSONS',
                'targetTable': 'PERSONS',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'iterations': 1,
        'embeddingDensity': 2,
        'featureProperties': ['HIPSTER'],
        'randomSeed': 123
    },
    'write': [{
        'nodeLabel': 'PERSONS',
        'outputTable': 'PERSON_EMBEDDINGS'
    }]
});
Table 7. Results
JOB_ID: job_92acd5e9bc374455bdd1a5a3361168c9
JOB_START: 2025-08-06 07:39:46.436
JOB_END: 2025-08-06 07:39:52.459
JOB_RESULT:
 {
  "hashgnn_1": {
    "computeMillis": 34,
    "configuration": {
      "concurrency": 6,
      "embeddingDensity": 2,
      "featureProperties": [
        "HIPSTER"
      ],
      "heterogeneous": false,
      "iterations": 1,
      "jobId": "71425d87-f242-4776-99c7-23c90dceb946",
      "logProgress": true,
      "mutateProperty": "hashgnn",
      "neighborInfluence": 1,
      "nodeLabels": [
        "*"
      ],
      "randomSeed": 123,
      "relationshipTypes": [
        "*"
      ],
      "sudo": false
    },
    "mutateMillis": 1,
    "nodeCount": 6,
    "nodePropertiesWritten": 6,
    "preProcessingMillis": 9
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 6,
    "nodeMillis": 139,
    "relationshipCount": 16,
    "relationshipMillis": 300,
    "totalMillis": 439
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 1002,
    "exportMillis": 1865,
    "nodeLabel": "PERSONS",
    "nodeProperty": "hashgnn",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS",
    "propertiesExported": 6,
    "stageUploadMillis": 562
  }
}

The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. We can query it like so:

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
Table 8. Results
NODEID | HASHGNN
Dan    | [ 0.000000000000000e+00 ]
Annie  | [ 0.000000000000000e+00 ]
Matt   | [ 0.000000000000000e+00 ]
Jeff   | [ 1.000000000000000e+00 ]
Brie   | [ 0.000000000000000e+00 ]
John   | [ 1.000000000000000e+00 ]

In this example the embedding dimension becomes 1, because without binarization the dimension equals the number of input features, which is 1 here due to the single HIPSTER column.

Run with feature generation

Next, we will again run the algorithm only on the person nodes, this time using no features from the node tables but instead generating random binary features. This is useful when the nodes do not have any features, or when the existing features are not useful.

To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The following will run the algorithm on person nodes with feature generation:
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': [ 'PERSONS' ],
        'relationshipTables': {
            'KNOWS': {
                'sourceTable': 'PERSONS',
                'targetTable': 'PERSONS',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'iterations': 1,
        'embeddingDensity': 2,
        'generateFeatures': {'dimension': 6, 'densityLevel': 1},
        'randomSeed': 42
    },
    'write': [{
        'nodeLabel': 'PERSONS',
        'outputTable': 'PERSON_EMBEDDINGS'
    }]
});
Table 9. Results
JOB_ID: job_65555fca32dd4088a4d18b2b888a1b96
JOB_START: 2025-08-06 07:43:16.691
JOB_END: 2025-08-06 07:43:23.157
JOB_RESULT:
 {
  "hashgnn_1": {
    "computeMillis": 25,
    "configuration": {
      "concurrency": 6,
      "embeddingDensity": 2,
      "featureProperties": [],
      "generateFeatures": {
        "densityLevel": 1,
        "dimension": 6
      },
      "heterogeneous": false,
      "iterations": 1,
      "jobId": "7117cf04-0d01-4777-b964-f72dabd93c09",
      "logProgress": true,
      "mutateProperty": "hashgnn",
      "neighborInfluence": 1,
      "nodeLabels": [
        "*"
      ],
      "randomSeed": 42,
      "relationshipTypes": [
        "*"
      ],
      "sudo": false
    },
    "mutateMillis": 1,
    "nodeCount": 6,
    "nodePropertiesWritten": 6,
    "preProcessingMillis": 7
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 6,
    "nodeMillis": 182,
    "relationshipCount": 16,
    "relationshipMillis": 413,
    "totalMillis": 595
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 1012,
    "exportMillis": 1825,
    "nodeLabel": "PERSONS",
    "nodeProperty": "hashgnn",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS",
    "propertiesExported": 6,
    "stageUploadMillis": 545
  }
}

The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. We can query it like so:

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
Table 10. Results
NODEID | HASHGNN
Dan    | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ]
Annie  | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ]
Matt   | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ]
Jeff   | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00 ]
Brie   | [ 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00 ]
John   | [ 0.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]

And as we can see, each node has at least one feature active, and no node has more than two features active (limited by the embeddingDensity).

Run on heterogeneous graph

Lastly, we will run the algorithm on the heterogeneous graph, also including fruit nodes, and using the EXPERIENCE, TROPICAL, SOURNESS and SWEETNESS columns as features.

To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The following will run the algorithm on the heterogeneous graph:
CALL Neo4j_Graph_Analytics.graph.hashgnn('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': [ 'PERSONS', 'FRUITS' ],
        'relationshipTables': {
            'KNOWS': {
                'sourceTable': 'PERSONS',
                'targetTable': 'PERSONS',
                'orientation': 'UNDIRECTED'
            },
            'LIKES': {
                'sourceTable': 'PERSONS',
                'targetTable': 'FRUITS',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'heterogeneous': true,
        'iterations': 2,
        'embeddingDensity': 4,
        'binarizeFeatures': {'dimension': 6, 'threshold': 0.2},
        'featureProperties': ['EXPERIENCE', 'SOURNESS', 'SWEETNESS', 'TROPICAL'],
        'randomSeed': 42
    },
    'write': [
        {
            'nodeLabel': 'PERSONS',
            'outputTable': 'PERSON_EMBEDDINGS'
        },
        {
            'nodeLabel': 'FRUITS',
            'outputTable': 'FRUIT_EMBEDDINGS'
        }
    ]
});
Table 11. Results
JOB_ID: job_2fbffb14f7c6402588192cbe8eee793f
JOB_START: 2025-08-06 07:47:04.670
JOB_END: 2025-08-06 07:47:13.095
JOB_RESULT:
 {
  "hashgnn_1": {
    "computeMillis": 47,
    "configuration": {
      "binarizeFeatures": {
        "dimension": 6,
        "threshold": 0.2
      },
      "concurrency": 6,
      "embeddingDensity": 4,
      "featureProperties": [
        "EXPERIENCE",
        "SOURNESS",
        "SWEETNESS",
        "TROPICAL"
      ],
      "heterogeneous": true,
      "iterations": 2,
      "jobId": "b7a65d45-22b5-4884-bd4f-84c692355d5f",
      "logProgress": true,
      "mutateProperty": "hashgnn",
      "neighborInfluence": 1,
      "nodeLabels": [
        "*"
      ],
      "randomSeed": 42,
      "relationshipTypes": [
        "*"
      ],
      "sudo": false
    },
    "mutateMillis": 1,
    "nodeCount": 10,
    "nodePropertiesWritten": 10,
    "preProcessingMillis": 10
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 10,
    "nodeMillis": 242,
    "relationshipCount": 26,
    "relationshipMillis": 333,
    "totalMillis": 575
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 1015,
    "exportMillis": 1809,
    "nodeLabel": "PERSONS",
    "nodeProperty": "hashgnn",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS",
    "propertiesExported": 6,
    "stageUploadMillis": 550
  },
  "write_node_property_2": {
    "copyIntoTableMillis": 936,
    "exportMillis": 1583,
    "nodeLabel": "FRUITS",
    "nodeProperty": "hashgnn",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.FRUIT_EMBEDDINGS",
    "propertiesExported": 4,
    "stageUploadMillis": 428
  }
}

The returned result contains information about the job execution and result distribution. Additionally, the embedding for each of the nodes has been written back to Snowflake. Let us inspect the embeddings for the person nodes:

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
Table 12. Results
NODEID | HASHGNN
Dan    | [ 1.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Annie  | [ 1.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Matt   | [ 1.000000000000000e+00, 0.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Jeff   | [ 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
Brie   | [ 1.000000000000000e+00, 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]
John   | [ 1.000000000000000e+00, 1.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00, 0.000000000000000e+00 ]

We could inspect FRUIT_EMBEDDINGS and find similar results, but we will not do that here for brevity.

Virtual example

Perhaps the below example is best enjoyed with a pen and paper.

Let's say we have a node (a) with feature f1, a node (b) with feature f2, and a node (c) with features f1 and f3. The graph structure is (a)--(b)--(c). We imagine running HashGNN for one iteration with embeddingDensity=2. For simplicity, we will assume that the hash functions return some made-up numbers as we go.

During the first iteration and k=0, we compute an embedding for (a). A hash value for f1 turns out to be 7. Since (b) is a neighbor of (a), we generate a value for its feature f2 which turns out to be 11. The value 7 is sampled from a hash function which we call "one" and 11 from a hash function "two". Thus f1 is added to the new features for (a) since it has a smaller hash value. We repeat for k=1 and this time the hash values are 4 and 2, so now f2 is added as a feature to (a).

We now consider (b). The feature f2 gets hash value 8 using hash function "one". Looking at the neighbor (a), we sample a hash value for f1 which becomes 5 using hash function "two". Since (c) has more than one feature, we also have to select one of the two features f1 and f3 before considering the "winning" feature as before as input to hash function "two". We use a third hash function "three" for this purpose and f3 gets the smaller value of 1. We now compute a hash of f3 using "two" and it becomes 6. Since 5 is smaller than 6, f1 is the "winning" neighbor feature for (b), and since 5 is also smaller than 8, it is the overall "winning" feature. Therefore, we add f1 to the embedding of (b). We proceed similarly with k=1 and f1 is selected again. Since the embeddings consist of binary features, this second addition has no effect.

We omit the details of computing the embedding of (c).

After the two sampling rounds, the iteration is complete, and since there is only one iteration, we are done. Each node has a binary embedding that contains some subset of the original binary features. In particular, (a) has features f1 and f2, and (b) has only the feature f1.