GraphSAGE node embedding training
GraphSAGE can be used as an unsupervised algorithm to generate embeddings for nodes in a graph. This page provides instructions for how to use the GraphSAGE node embedding training endpoint.
Syntax
This section covers the syntax used to execute the GraphSAGE node embedding training algorithm.
CALL graph.gs_unsup_train(
    'CPU_X64_XS',                        (1)
    {
        ['defaultTablePrefix': '...',]   (2)
        'project': {...},                (3)
        'compute': {...},                (4)
    }
);
1 | Compute pool selector. |
2 | Optional prefix for table references. |
3 | Project config. |
4 | Compute config. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
computePoolSelector | String | | no | The selector for the compute pool on which to run the GraphSAGE node embedding training job. |
configuration | Map | | no | Configuration for graph project, algorithm compute and result write back. |
For this algorithm we strongly recommend using a GPU compute pool, unless the dataset is very small and the model shallow.
The configuration map consists of the following three entries.
For more details on the project configuration below, refer to the Project documentation.
Name | Type |
---|---|
nodeTables | List of node tables. |
relationshipTables | Map of relationship types to relationship tables. |
Please note that in order for GraphSAGE to properly propagate updates of node embeddings, each type of node must be the target of at least one relationship type.
The orientation parameter can be useful to add reverse direction relationships for types of nodes that are only the source of relationships (using the "REVERSE" or "UNDIRECTED" orientations).
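As an illustration of the orientation parameter, the following project fragment is a minimal sketch. The tables person, city and lives_in are hypothetical and not part of this page's dataset. Because person nodes only appear as sources of lives_in, the relationship is declared with the "UNDIRECTED" orientation so that person nodes also become targets.

```sql
-- Hypothetical tables: person, city, lives_in (illustration only).
'project': {
    'nodeTables': ['person', 'city'],
    'relationshipTables': {
        'lives_in': {
            'sourceTable': 'person',
            'targetTable': 'city',
            -- person is never a target of lives_in as stored,
            -- so the reverse direction is added as well.
            'orientation': 'UNDIRECTED'
        }
    }
}
```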
Name | Type | Default | Optional | Description |
---|---|---|---|---|
numWalks | Integer | | yes | The number of random walks to perform for each node in the graph |
walkDepth | Integer | | yes | The number of steps in every random walk |
negSamplingRatio | Float | | yes | The ratio of negative to positive samples to sample for training |
modelname | String | | no | The name of the model to train (must be unique) |
numEpochs | Integer | | no | The number of epochs to train the model |
numSamples | List of Integer | | no | The number of neighbors to sample for each layer. Note that this also determines the number of layers |
hiddenChannels | Integer | | yes | The node embedding dimension of the model layers' outputs |
activation | String | | yes | The activation function to use. Valid values are "relu" and "sigmoid" |
aggregator | String | | yes | The neighborhood embedding aggregator to use. Valid values are "mean" and "max" |
learningRate | Float | | yes | The learning rate for the optimizer |
dropout | Float | | yes | The dropout probability for each layer. Must be a value >= 0.0 and < 1.0 |
layerNormalization | Boolean | | yes | Whether to apply layer normalization between the model layers |
epochsPerCheckpoint | Integer | | yes | The number of epochs between saving model checkpoints |
randomSeed | Integer | | yes | A number used to seed all randomness of the computation |
batchSize | Integer | | yes | The number of target nodes to train on in each batch. If not provided, the algorithm will automatically infer the maximally allowed batch size within the constraints of available memory |
lossReduction | String | | yes | The reduction method to apply to the loss. Valid values are "mean" and "sum". If not provided, the reduction method will be "mean" if explicit |
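To show how the optional parameters fit together, here is a sketch of a compute entry that sets several of them alongside the three required ones. All values, including the model name, are illustrative assumptions and not recommended defaults.

```sql
-- Illustrative values only; tune them for your own graph.
'compute': {
    'modelname': 'my-unsup-model',    -- required, must be unique
    'numEpochs': 20,                  -- required
    'numSamples': [25, 10],           -- required; two entries, so two layers
    'hiddenChannels': 128,
    'activation': 'relu',             -- "relu" or "sigmoid"
    'aggregator': 'mean',             -- "mean" or "max"
    'learningRate': 0.001,
    'dropout': 0.1,                   -- must be >= 0.0 and < 1.0
    'layerNormalization': true,
    'epochsPerCheckpoint': 5,
    'randomSeed': 42
}
```

Leaving batchSize out, as above, lets the algorithm infer the largest batch size that fits in available memory.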
Example
For our example we will use an IMDB dataset with actors, directors, movies, and genres. These all have keywords associated with them, which we will use as features for the nodes. They are connected by relationships where actors act in movies and directors direct movies.
We have a database called imdb that contains the tables:
- actor with columns nodeid and plot_keywords
- movie with columns nodeid and plot_keywords
- director with columns nodeid and plot_keywords
- acted_in with columns sourcenodeid and targetnodeid that represent actor and movie node IDs
- directed_in with columns sourcenodeid and targetnodeid that represent director and movie node IDs
The plot_keywords columns contain keywords associated with the nodes, encoded as vectors of floats.
You can upload this dataset to your Snowflake account by following the instructions on GitHub: neo4j-product-examples/snowflake-graph-analytics.
The training query
In the following query we train a GraphSAGE model for node embeddings on the dataset [1].
We train for 10 epochs, with two hidden layers.
Running the query requires a setup of grants for the application, your consumer role, and your environment. Please see the Getting started page for more on this.
We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.gs_unsup_train('GPU_NV_S', {
    'defaultTablePrefix': 'imdb.gml',
    'project': {
        'nodeTables': ['actor', 'director', 'movie'],
        'relationshipTables': {
            'acted_in': {
                'sourceTable': 'actor',
                'targetTable': 'movie',
                'orientation': 'UNDIRECTED'
            },
            'directed_in': {
                'sourceTable': 'director',
                'targetTable': 'movie',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'modelname': 'unsup-imdb',
        'numEpochs': 10,
        'numSamples': [20, 20]
    }
});
The above query should produce a row with an empty job result.
JOB_ID | JOB_START | JOB_END | JOB_RESULT |
---|---|---|---|
job_c047364f8c3c4dc19f1e06fc3711483f | 2025-04-29 12:39:08.215 | 2025-04-29 12:42:12.820 | {} |
[1] We do not use the genre of movies when computing the embeddings because not all movies have genres and, moreover, using genre would make the embeddings inappropriate for predicting movie genres. One way to remove the genre property is to create a Snowflake view of the movie table that selects only the nodeid and plot_keywords columns. Also, remember to grant the application the SELECT privilege on the updated movie view. For simplicity, we keep the name movie and assume it does not have a genre column.
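A view like the one described above could be created along these lines. This is a sketch under stated assumptions: the original table with the genre column is assumed to be stored under the hypothetical name movie_raw in the imdb.gml schema, and the application is assumed to use the default name Neo4j_Graph_Analytics. Any further grants your environment needs are covered on the Getting started page.

```sql
-- Assumption: imdb.gml.movie_raw is a hypothetical name for the original
-- movie table that still contains a genre column.
CREATE OR REPLACE VIEW imdb.gml.movie AS
SELECT nodeid, plot_keywords
FROM imdb.gml.movie_raw;

-- Let the application read the view.
GRANT SELECT ON VIEW imdb.gml.movie TO APPLICATION Neo4j_Graph_Analytics;
```

Because the view keeps the name movie, the training query can keep referring to 'movie' in nodeTables and as the targetTable of both relationship types.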