GraphSAGE node classification training

GraphSAGE is a graph neural network (GNN) architecture that can be used as a supervised algorithm to predict the class labels of nodes in a graph. This section explains how to use the GraphSAGE endpoint to train a node classification model with Neo4j Graph Analytics for Snowflake.

Syntax

This section covers the syntax used to execute the GraphSAGE node classification training algorithm.

Run GraphSAGE node classification training.

CALL graph.gs_nc_train(
  'CPU_X64_XS',                    (1)
  {
    ['defaultTablePrefix': '...',] (2)
    'project': {...},              (3)
    'compute': {...},              (4)
  }
);

1	Compute pool selector.
2	Optional prefix for table references.
3	Project config.
4	Compute config.

Table 1. Parameters
Name	Type	Default	Optional	Description
computePoolSelector	String	`n/a`	no	The selector for the compute pool on which to run the GraphSAGE node classification training job.
configuration	Map	`{}`	no	Configuration for graph project, algorithm compute and result write back.

For this algorithm we strongly recommend using a GPU compute pool, unless the dataset is very small and the model shallow.

The configuration map consists of the following three entries.

For more details on the Project configuration below, refer to the Project documentation.

Table 2. Project configuration
Name	Description
nodeTables	List of node tables.
relationshipTables	Map of relationship types to relationship tables.

Note that for GraphSAGE to properly propagate updates of node embeddings, each type of node must be the target of at least one relationship type. For node types that are only the source of relationships, the orientation parameter can be used to add reverse-direction relationships (using the "REVERSE" or "UNDIRECTED" orientations).

Table 3. Compute configuration
Name	Type	Default	Optional	Description
targetLabel	String	`n/a`	no	The node label (i.e. type) whose nodes the model is trained to classify
targetProperty	String	`n/a`	no	The node property to predict, represented by a column in the input node table of the specified targetLabel
modelname	String	`n/a`	no	The name of the model to train (must be unique)
numEpochs	Integer	`n/a`	no	The number of epochs to train the model
numSamples	List of Integer	`n/a`	no	The number of neighbors to sample for each layer. Note that this also determines the number of layers
hiddenChannels	Integer	`256`	yes	The node embedding dimension of the model layers' outputs
activation	String	`"relu"`	yes	The activation function to use. Valid values are "relu" and "sigmoid"
aggregator	String	`"mean"`	yes	The neighborhood embedding aggregator to use. Valid values are "mean" and "max"
learningRate	Float	`0.001`	yes	The learning rate for the optimizer
dropout	Float	`0.1`	yes	The dropout probability for each layer. Must be a value >= 0.0 and < 1.0
layerNormalization	Boolean	`true`	yes	Whether to apply layer normalization between the model layers
epochsPerCheckpoint	Integer	`max(numEpochs / 10, 1)`	yes	The number of epochs between saving model checkpoints
randomSeed	Integer	`A random integer`	yes	A number used to seed all randomness of the computation
splitRatios	Map	`{"TRAIN": 0.6, "TEST": 0.2, "VALID": 0.2}`	yes	The ratios as a map to split the target nodes of the input graph into training, test, and validation sets. The keys must be "TRAIN", "TEST" and "VALID". The sum of the values must be 1.0
epochsPerVal	Integer	`0`	yes	The number of epochs between evaluating the model on the validation set. If set to 0, the model will not be evaluated on the validation set
trainBatchSize	Integer	`Automatically inferred`	yes	The number of target nodes to train on in each batch. If not provided, the algorithm will automatically infer the maximally allowed batch size within the constraints of available memory
evalBatchSize	Integer	`train batch size`	yes	The batch size to use for evaluation
classWeights	Boolean or Map	`false`	yes	Whether to use class weights to balance the training data. If set to true, class weights will be calculated based on the distribution of the target labels in the training set. If set to a map, the map must contain the class weight for each target class label
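
To illustrate how the optional entries fit into the compute map, the sketch below overrides several of the defaults from Table 3. It reuses the compute pool and table names of the IMDB example later in this section. The model name 'nc-imdb-tuned' and every value chosen here are hypothetical illustrations, not recommended settings.

```sql
-- Sketch only: parameter names come from Table 3; the values are illustrative assumptions.
CALL Neo4j_Graph_Analytics.graph.gs_nc_train('GPU_NV_S', {
    'defaultTablePrefix': 'imdb.gml',
    'project': {
        'nodeTables': ['actor', 'director', 'movie'],
        'relationshipTables': {
            'acted_in': {'sourceTable': 'actor', 'targetTable': 'movie', 'orientation': 'UNDIRECTED'},
            'directed_in': {'sourceTable': 'director', 'targetTable': 'movie', 'orientation': 'UNDIRECTED'}
        }
    },
    'compute': {
        'modelname': 'nc-imdb-tuned',   -- hypothetical name; must be unique
        'targetLabel': 'movie',
        'targetProperty': 'genre',
        'numEpochs': 20,
        'numSamples': [10, 5],          -- two entries, so the model has two layers
        'hiddenChannels': 128,
        'aggregator': 'max',
        'dropout': 0.2,                 -- must be >= 0.0 and < 1.0
        'splitRatios': {'TRAIN': 0.7, 'TEST': 0.15, 'VALID': 0.15},  -- values sum to 1.0
        'epochsPerVal': 2,              -- evaluate on the validation set every 2 epochs
        'randomSeed': 42                -- fixed seed for repeatable runs
    }
});
```

Any optional entry left out of the compute map falls back to the default listed in Table 3.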

Example

For our example we will use an IMDB dataset with actors, directors, movies, and genres. These all have keywords associated with them, which we will use as features for the nodes. They are connected by relationships where actors act in movies and directors direct movies. The goal is to predict the genre of movies.

We have a database called imdb that contains the tables:

actor with columns nodeid and plot_keywords
movie with columns nodeid, plot_keywords and genre
director with columns nodeid and plot_keywords
acted_in with columns sourcenodeid and targetnodeid that represent actor and movie node IDs
directed_in with columns sourcenodeid and targetnodeid that represent director and movie node IDs

The plot_keywords columns contain keywords associated with the nodes, encoded as vectors of floats. The genre column contains the target class labels for the movie nodes, which we want to predict.
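
Since the training query in this section uses class weights, it can be useful to check first how unevenly the genres are distributed. A minimal sketch, assuming the tables are reachable under the imdb.gml prefix used by that query:

```sql
-- Count movies per genre to see how imbalanced the target classes are.
-- Assumes the movie table is reachable as imdb.gml.movie.
SELECT genre, COUNT(*) AS num_movies
FROM imdb.gml.movie
GROUP BY genre
ORDER BY num_movies DESC;
```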

You can upload this dataset to your Snowflake account by following the instructions on GitHub: neo4j-product-examples/snowflake-graph-analytics.

The training query

In the following query we train a GraphSAGE model for node classification on the dataset. We train for 10 epochs with a two-layer model (numSamples has two entries, sampling 20 neighbors per layer) and use class weights to balance the class distribution.

Before running the query, you must set up the required grants for the application, your consumer role, and your environment. See the Getting started page for details.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

CALL Neo4j_Graph_Analytics.graph.gs_nc_train('GPU_NV_S', {
    'defaultTablePrefix': 'imdb.gml',
    'project': {
        'nodeTables': ['actor', 'director', 'movie'],
        'relationshipTables': {
            'acted_in': {
                'sourceTable': 'actor',
                'targetTable': 'movie',
                'orientation': 'UNDIRECTED'
            },
            'directed_in': {
                'sourceTable': 'director',
                'targetTable': 'movie',
                'orientation': 'UNDIRECTED'
            }
        }
    },
    'compute': {
        'modelname': 'nc-imdb',
        'numEpochs': 10,
        'numSamples': [20, 20],
        'targetLabel': 'movie',
        'targetProperty': 'genre',
        'classWeights': true
    }
});

The above query should produce a result similar to the one below. The numerical results may vary.

JOB_ID	JOB_START	JOB_END	JOB_RESULT
job_63b8083fc8ef463ab38cd95d2ac345ea	2025-04-29 12:06:28.791	2025-04-29 12:07:10.318	{ "metrics": { "test_acc": 0.7441860437393188, "test_f1_macro": 0.7236689925193787, "test_f1_micro": 0.7441860437393188, "train_acc": 0.9911160469055176, "train_f1_macro": 0.9900508522987366, "train_f1_micro": 0.9911160469055176 } }
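
To work with individual metrics rather than the whole JSON value, one option is to read the result of the previous statement with Snowflake's RESULT_SCAN. The sketch below is an assumption-laden illustration: it presumes that it is run in the same session directly after the training CALL, and that JOB_RESULT comes back as a semi-structured (VARIANT or OBJECT) column. If it is returned as a string, wrap it in PARSE_JSON first.

```sql
-- Run immediately after the training CALL, in the same session.
-- Assumes JOB_RESULT is a semi-structured column holding the "metrics" object.
SELECT
    "JOB_ID",
    "JOB_RESULT":metrics.test_acc::FLOAT      AS test_acc,
    "JOB_RESULT":metrics.test_f1_macro::FLOAT AS test_f1_macro,
    "JOB_RESULT":metrics.train_acc::FLOAT     AS train_acc
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```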