GraphSAGE

This section describes the GraphSAGE node embedding algorithm in the Neo4j Graph Data Science library.

GraphSAGE is an inductive algorithm for computing node embeddings. It uses node feature information to generate node embeddings for unseen nodes or graphs. Instead of training individual embeddings for each node, the algorithm learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood.

The algorithm is defined for UNDIRECTED graphs.

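For more information on this algorithm, see William L. Hamilton, Rex Ying, and Jure Leskovec, "Inductive Representation Learning on Large Graphs".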

1. Syntax

GraphSAGE syntax per mode
Run GraphSAGE in train mode on a named graph.
CALL gds.beta.graphSage.train(
  graphName: String,
  configuration: Map
) YIELD
  graphName: String,
  graphCreateConfig: Map,
  modelInfo: Map,
  configuration: Map,
  trainMillis: Integer
Table 1. Parameters
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| graphName | String | n/a | no | The name of a graph stored in the catalog. |
| configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering. |

Table 2. General configuration for algorithm execution on a named graph.
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| modelName | String | n/a | no | The name of a GraphSAGE model in the model catalog. |
| nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels. |
| relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types. |
| concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. |

Table 3. Algorithm specific configuration
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| modelName | String | n/a | no | The name of the model to train. A model with this name must not already exist in the model catalog. |
| featureProperties | List of String | n/a | no | The names of the node properties that should be used as input features. All property names must exist in the in-memory graph and be of type Float or List of Float. |
| embeddingDimension | Integer | 64 | yes | The dimension of the generated node embeddings as well as their hidden layer representations. |
| aggregator | String | "mean" | yes | The aggregator to be used by the layers. Supported values are "mean" and "pool". |
| activationFunction | String | "sigmoid" | yes | The activation function to be used in the model architecture. Supported values are "sigmoid" and "relu". |
| sampleSizes | List of Integer | [25, 10] | yes | A list of Integer values. The size of the list determines the number of layers, and the values determine how many nodes will be sampled by the layers. |
| projectedFeatureDimension | Integer | n/a | yes | The dimension of the projected featureProperties. This enables multi-label GraphSAGE, where each label can have a subset of the featureProperties. |
| batchSize | Integer | 100 | yes | The number of nodes per batch. |
| tolerance | Float | 1e-4 | yes | Tolerance used for the early convergence of an epoch. |
| learningRate | Float | 0.1 | yes | The learning rate determines the step size at each iteration while moving toward a minimum of a loss function. |
| epochs | Integer | 1 | yes | Number of times to traverse the graph. |
| maxIterations | Integer | 10 | yes | Maximum number of weight updates per batch. Batches can also converge early based on tolerance. |
| searchDepth | Integer | 5 | yes | Maximum depth of the RandomWalks to sample nearby nodes for the training. |
| negativeSampleWeight | Integer | 20 | yes | The weight of the negative samples. Higher values increase the impact of negative samples in the loss. |
| relationshipWeightProperty | String | null | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
| randomSeed | Integer | random | yes | A random seed which is used to control the randomness in computing the embeddings. |

Table 4. Results
| Name | Type | Description |
| --- | --- | --- |
| graphName | String | The name of the in-memory graph used during training. |
| graphCreateConfig | Map | Configuration used to create the in-memory graph. Only has a value if an anonymous graph was used. |
| modelInfo | Map | Details of the trained model. |
| configuration | Map | The configuration used to run the procedure. |
| trainMillis | Integer | Milliseconds to train the model. |

Table 5. Details on modelInfo
| Name | Type | Description |
| --- | --- | --- |
| name | String | The name of the trained model. |
| type | String | The type of the trained model. Always graphSage. |
| metrics | Map | Metrics related to running the training, details in the table below. |

Table 6. Metrics collected during training
| Name | Type | Description |
| --- | --- | --- |
| ranEpochs | Integer | The number of epochs run during training. |
| epochLosses | List | Ordered list of the losses after each epoch. |
| didConverge | Boolean | Indicates if the training has converged. |

Run GraphSAGE in stream mode on a named graph.
CALL gds.beta.graphSage.stream(
  graphName: String,
  configuration: Map
) YIELD
  nodeId: Integer,
  embedding: List
Table 7. Parameters
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| graphName | String | n/a | no | The name of a graph stored in the catalog. |
| configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering. |

Table 8. General configuration for algorithm execution on a named graph.
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| modelName | String | n/a | no | The name of a GraphSAGE model in the model catalog. |
| nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels. |
| relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types. |
| concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. |

Table 9. Algorithm specific configuration
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| batchSize | Integer | 100 | yes | The number of nodes per batch. |

Table 10. Results
| Name | Type | Description |
| --- | --- | --- |
| nodeId | Integer | The Neo4j node ID. |
| embedding | List of Float | The computed node embedding. |

Run GraphSAGE in mutate mode on a graph stored in the catalog.
CALL gds.beta.graphSage.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  createMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  configuration: Map
Table 11. Parameters
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| graphName | String | n/a | no | The name of a graph stored in the catalog. |
| configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering. |

Table 12. General configuration for algorithm execution on a named graph.
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| modelName | String | n/a | no | The name of a GraphSAGE model in the model catalog. |
| nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels. |
| relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types. |
| concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. |
| mutateProperty | String | n/a | no | The node property in the GDS graph to which the embedding is written. |

Table 13. Algorithm specific configuration
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| batchSize | Integer | 100 | yes | The number of nodes per batch. |

Table 14. Results
| Name | Type | Description |
| --- | --- | --- |
| nodeCount | Integer | The number of nodes processed. |
| nodePropertiesWritten | Integer | The number of node properties written. |
| createMillis | Integer | Milliseconds for loading data. |
| computeMillis | Integer | Milliseconds for running the algorithm. |
| mutateMillis | Integer | Milliseconds for writing result data back to the in-memory graph. |
| configuration | Map | The configuration used for running the algorithm. |

Run GraphSAGE in write mode on a graph stored in the catalog.
CALL gds.beta.graphSage.write(
  graphName: String,
  configuration: Map
)
YIELD
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  configuration: Map
Table 15. Parameters
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| graphName | String | n/a | no | The name of a graph stored in the catalog. |
| configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering. |

Table 16. General configuration for algorithm execution on a named graph.
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| modelName | String | n/a | no | The name of a GraphSAGE model in the model catalog. |
| nodeLabels | List of String | ['*'] | yes | Filter the named graph using the given node labels. |
| relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types. |
| concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'. |
| writeConcurrency | Integer | value of 'concurrency' | yes | The number of concurrent threads used for writing the result to Neo4j. |
| writeProperty | String | n/a | no | The node property in the Neo4j database to which the embedding is written. |

Table 17. Algorithm specific configuration
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| batchSize | Integer | 100 | yes | The number of nodes per batch. |

Table 18. Results
| Name | Type | Description |
| --- | --- | --- |
| nodeCount | Integer | The number of nodes processed. |
| nodePropertiesWritten | Integer | The number of node properties written. |
| createMillis | Integer | Milliseconds for loading data. |
| computeMillis | Integer | Milliseconds for running the algorithm. |
| writeMillis | Integer | Milliseconds for writing result data back to Neo4j. |
| configuration | Map | The configuration used for running the algorithm. |

1.1. Anonymous graphs

It is also possible to execute the algorithm on a graph that is projected in conjunction with the algorithm execution. In this case, the graph does not have a name, and we call it anonymous. When executing over an anonymous graph the configuration map contains a graph projection configuration as well as an algorithm configuration. All execution modes support execution on anonymous graphs, although we only show syntax and mode-specific configuration for the write mode for brevity.

For more information on syntax variants, see Syntax overview.

Run GraphSAGE in write mode on an anonymous graph.
CALL gds.beta.graphSage.write(
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 19. General configuration for algorithm execution on an anonymous graph.
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| nodeProjection | String, List of String or Map | null | yes | The node projection used for anonymous graph creation via a Native projection. |
| relationshipProjection | String, List of String or Map | null | yes | The relationship projection used for anonymous graph creation via a Native projection. |
| nodeQuery | String | null | yes | The Cypher query used to select the nodes for anonymous graph creation via a Cypher projection. |
| relationshipQuery | String | null | yes | The Cypher query used to select the relationships for anonymous graph creation via a Cypher projection. |
| nodeProperties | String, List of String or Map | null | yes | The node properties to project during anonymous graph creation. |
| relationshipProperties | String, List of String or Map | null | yes | The relationship properties to project during anonymous graph creation. |
| concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'. |
| readConcurrency | Integer | value of 'concurrency' | yes | The number of concurrent threads used for creating the graph. |
| writeConcurrency | Integer | value of 'concurrency' | yes | WRITE mode only: The number of concurrent threads used for writing the result. |
| writeProperty | String | n/a | no | WRITE mode only: The node property to which the embedding is written. |

Table 20. Algorithm specific configuration
| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| batchSize | Integer | 100 | yes | The number of nodes per batch. |

The results are the same as for running write mode with a named graph, see the write mode syntax above.
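
As a sketch of the anonymous syntax, the following call projects the Person nodes and KNOWS relationships on the fly and writes the embeddings in a single step. It assumes that the example graph from the Examples section exists in the database and that a model named 'graphSage' has already been trained on the age and heightAndWeight features. Both the model name and the write property name are taken from the examples on this page and are not required by the procedure.

```cypher
// Anonymous graph: there is no graphName argument, only a configuration map.
CALL gds.beta.graphSage.write({
  // Graph projection configuration (see Table 19)
  nodeProjection: 'Person',
  relationshipProjection: {
    KNOWS: {
      type: 'KNOWS',
      orientation: 'UNDIRECTED'
    }
  },
  nodeProperties: ['age', 'heightAndWeight'],
  // Algorithm configuration
  modelName: 'graphSage',     // assumption: this model already exists in the model catalog
  writeProperty: 'embedding'
})
YIELD nodeCount, nodePropertiesWritten
```

The projected graph is discarded once the procedure finishes, so this variant is mostly useful for one-off runs.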

2. Examples

In this section we will show examples of running the GraphSAGE algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small friends network graph of a handful of nodes connected in a particular pattern. The example graph looks like this:

Visualization of the example graph
The following Cypher statement will create the example graph in the Neo4j database:
CREATE
  // Persons
  (  dan:Person {name: 'Dan',   age: 20, heightAndWeight: [185, 75]}),
  (annie:Person {name: 'Annie', age: 12, heightAndWeight: [124, 42]}),
  ( matt:Person {name: 'Matt',  age: 67, heightAndWeight: [170, 80]}),
  ( jeff:Person {name: 'Jeff',  age: 45, heightAndWeight: [192, 85]}),
  ( brie:Person {name: 'Brie',  age: 27, heightAndWeight: [176, 57]}),
  ( elsa:Person {name: 'Elsa',  age: 32, heightAndWeight: [158, 55]}),
  ( john:Person {name: 'John',  age: 35, heightAndWeight: [172, 76]}),

  (dan)-[:KNOWS {relWeight: 1.0}]->(annie),
  (dan)-[:KNOWS {relWeight: 1.6}]->(matt),
  (annie)-[:KNOWS {relWeight: 0.1}]->(matt),
  (annie)-[:KNOWS {relWeight: 3.0}]->(jeff),
  (annie)-[:KNOWS {relWeight: 1.2}]->(brie),
  (matt)-[:KNOWS {relWeight: 10.0}]->(brie),
  (brie)-[:KNOWS {relWeight: 1.0}]->(elsa),
  (brie)-[:KNOWS {relWeight: 2.2}]->(jeff),
  (john)-[:KNOWS {relWeight: 5.0}]->(jeff)
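The following statement will project the graph and store it in the graph catalog under the name 'persons':
CALL gds.graph.create(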
  'persons',
  {
    Person: {
      label: 'Person',
      properties: ['age', 'heightAndWeight']
    }
  }, {
    KNOWS: {
      type: 'KNOWS',
      orientation: 'UNDIRECTED',
      properties: ['relWeight']
    }
})
The algorithm is defined for UNDIRECTED graphs.

2.1. Train

Before we are able to generate node embeddings we need to train a model and store it in the model catalog. Below is an example of how to do that.

The names specified in the featureProperties configuration parameter must exist in the in-memory graph.
CALL gds.beta.graphSage.train(
  'persons',
  {
    modelName: 'exampleTrainModel',
    featureProperties: ['age', 'heightAndWeight'],
    aggregator: 'mean',
    activationFunction: 'sigmoid',
    sampleSizes: [25, 10]
  }
) YIELD modelInfo as info
RETURN
  info.name as modelName,
  info.metrics.didConverge as didConverge,
  info.metrics.ranEpochs as ranEpochs,
  info.metrics.epochLosses as epochLosses
Table 21. Results
| modelName | didConverge | ranEpochs | epochLosses |
| --- | --- | --- | --- |
| exampleTrainModel | true | 1 | [186.0494816886275, 186.04946806237382] |

Due to the random initialisation of the weight variables the results may vary between different runs.

Looking at the results, we can draw the following conclusions: the training converged after a single epoch, and the losses are almost identical. Tuning the algorithm parameters, such as trying out different sampleSizes, searchDepth, embeddingDimension or batchSize, can improve the losses. For different datasets, GraphSAGE may require different train parameters to produce good models.
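
As an illustration of such tuning, the following sketch trains a second model on the same graph with several parameters changed from their defaults. The model name 'tunedTrainModel' and the specific values are arbitrary choices for this example, not recommendations. Only parameters listed in the algorithm specific configuration table are used.

```cypher
CALL gds.beta.graphSage.train(
  'persons',
  {
    modelName: 'tunedTrainModel',   // hypothetical name, must not exist in the model catalog yet
    featureProperties: ['age', 'heightAndWeight'],
    sampleSizes: [10, 5],           // two layers, sampling fewer neighbors than the default [25, 10]
    searchDepth: 3,
    embeddingDimension: 16,
    batchSize: 50,
    epochs: 5
  }
) YIELD modelInfo AS info
RETURN
  info.metrics.ranEpochs AS ranEpochs,
  info.metrics.epochLosses AS epochLosses
```

Comparing epochLosses between runs gives a rough indication of whether a change in parameters helped on your data.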

The trained model is automatically registered in the model catalog.
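
To confirm that the model was registered, we can query the model catalog. This sketch assumes that the gds.beta.model.exists procedure is available in the installed GDS version, which should be the case for versions that provide the gds.beta.graphSage procedures.

```cypher
// Check that the model trained above is present in the model catalog.
CALL gds.beta.model.exists('exampleTrainModel')
YIELD modelName, modelType, exists
RETURN modelName, modelType, exists
```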

2.2. Train with multiple node labels

In this section we describe how to train on a graph with multiple labels. The different labels may have different sets of properties. To run on such a graph, GraphSAGE is run in multi-label mode, in which the feature properties are projected into a common feature space. Therefore, all nodes have feature vectors of the same dimension after the projection.

The projection for a label is linear and given by a matrix of weights. The weights for each label are learned jointly with the other weights of the GraphSAGE model.

In the multi-label mode, the following is applied prior to the usual aggregation layers:

  1. A property representing the label is added to the feature properties for that label

  2. The feature properties for each label are projected into a feature vector of a shared dimension

The projected feature dimension is configured with projectedFeatureDimension, and specifying it enables the multi-label mode.

The feature properties used for a label are those present in the featureProperties configuration parameter which exist in the graph for that label. In the multi-label mode, it is no longer required that all labels have all the specified properties.

2.2.1. Assumptions

  • A requirement for multi-label mode is that each node has exactly one label.

  • A GraphSAGE model trained in this mode must be applied on graphs with the same schema with regard to node labels and properties.
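
The first assumption can be checked before projecting the graph. The following plain Cypher query is a minimal sketch that lists every node in the database that does not have exactly one label. In a larger database you would likely restrict the MATCH to the labels you intend to project.

```cypher
// Find nodes that violate the exactly-one-label assumption.
MATCH (n)
WHERE NOT size(labels(n)) = 1
RETURN id(n) AS nodeId, labels(n) AS labels
```

If the query returns no rows, every node has exactly one label.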

2.2.2. Examples

In order to demonstrate GraphSAGE with multiple labels, we add instruments and relationships of type LIKES between persons and instruments to the example graph.

Visualization of the multi-label example graph
The following Cypher statement will extend the example graph in the Neo4j database:
MATCH
  (dan:Person {name: "Dan"}),
  (annie:Person {name: "Annie"}),
  (matt:Person {name: "Matt"}),
  (brie:Person {name: "Brie"}),
  (john:Person {name: "John"})
CREATE
  (guitar:Instrument {name: 'Guitar', cost: 1337.0}),
  (synth:Instrument {name: 'Synthesizer', cost: 1337.0}),
  (bongos:Instrument {name: 'Bongos', cost: 42.0}),
  (trumpet:Instrument {name: 'Trumpet', cost: 1337.0}),
  (dan)-[:LIKES]->(guitar),
  (dan)-[:LIKES]->(synth),
  (dan)-[:LIKES]->(bongos),
  (annie)-[:LIKES]->(guitar),
  (annie)-[:LIKES]->(synth),
  (matt)-[:LIKES]->(bongos),
  (brie)-[:LIKES]->(guitar),
  (brie)-[:LIKES]->(synth),
  (brie)-[:LIKES]->(bongos),
  (john)-[:LIKES]->(trumpet)
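The following statement will project the extended graph and store it in the graph catalog under the name 'persons_with_instruments':
CALL gds.graph.create(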
  'persons_with_instruments',
  {
    Person: {
      label: 'Person',
      properties: ['age', 'heightAndWeight']
    },
    Instrument: {
      label: 'Instrument',
      properties: ['cost']
    }
  }, {
    KNOWS: {
      type: 'KNOWS',
      orientation: 'UNDIRECTED'
    },
    LIKES: {
      type: 'LIKES',
      orientation: 'UNDIRECTED'
    }
})

We can now run GraphSAGE in multi-label mode on that graph by specifying the projectedFeatureDimension parameter. Multi-label GraphSAGE removes the requirement that each node in the in-memory graph must have all featureProperties. However, the projections are independent per label: even if two labels share a featureProperty, it is treated as a different feature for each label before projection. In this example, projectedFeatureDimension is set to 4, which matches the combined length of the feature properties: age and cost are scalar features, and heightAndWeight is a list feature of length two. For each node, the properties of its label are projected into a vector space of dimension projectedFeatureDimension using a label-specific projection. Note that the cost feature is only defined for the Instrument nodes, while age and heightAndWeight are only defined for Person nodes.

CALL gds.beta.graphSage.train(
  'persons_with_instruments',
  {
    modelName: 'multiLabelModel',
    featureProperties: ['age', 'heightAndWeight', 'cost'],
    projectedFeatureDimension: 4
  }
)

2.3. Train with relationship weights

The GraphSAGE implementation supports training using relationship weights. A greater relationship weight between two nodes signifies that the nodes should have more similar embedding values.

The following Cypher query trains a GraphSAGE model using relationship weights:
CALL gds.beta.graphSage.train(
  'persons',
  {
    modelName: 'weightedTrainedModel',
    featureProperties: ['age', 'heightAndWeight'],
    relationshipWeightProperty: 'relWeight',
    nodeLabels: ['Person'],
    relationshipTypes: ['KNOWS']
  }
)

2.4. Train when there are no node properties present in the graph

If your graph does not have node properties, we recommend using an existing algorithm in mutate mode to create them. Good candidates are Centrality algorithms or Community algorithms.

The following example illustrates calling Degree Centrality in mutate mode and then using the mutated property as a feature for GraphSAGE training. For the purpose of this example we are going to use the Persons graph, but we will not load any properties into the in-memory graph.

Create the in-memory graph without loading any node properties
CALL gds.graph.create(
  'noPropertiesGraph',
  'Person', {
    KNOWS: {
      type: 'KNOWS',
      orientation: 'UNDIRECTED'
    }
})
Run DegreeCentrality mutate to create a new property for each node
CALL gds.degree.mutate(
  'noPropertiesGraph',
  {
    mutateProperty: 'degree'
  }
) YIELD nodePropertiesWritten
Run GraphSAGE train using the property produced by DegreeCentrality as feature property
CALL gds.beta.graphSage.train(
  'noPropertiesGraph',
  {
    modelName: 'myModel',
    featureProperties: ['degree']
  }
)
YIELD trainMillis
RETURN trainMillis

gds.degree.mutate will create a new node property degree for each of the nodes in the in-memory graph, which can then be used as a featureProperty when training GraphSAGE.

Using separate algorithms to produce featureProperties can also be very useful to capture graph topology properties.
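
As a sketch of combining several topology-derived features, the following statements add a PageRank score next to the degree property and train on both. They assume that the 'noPropertiesGraph' projection and its degree property from the example above still exist. The property name 'pageRank' and the model name 'topologyModel' are arbitrary choices for this illustration. The two statements are separated by a semicolon and are meant to be run one after the other.

```cypher
// Step 1: add a second topology feature to the in-memory graph.
CALL gds.pageRank.mutate(
  'noPropertiesGraph',
  {
    mutateProperty: 'pageRank'
  }
) YIELD nodePropertiesWritten;

// Step 2: train GraphSAGE using both mutated properties as features.
CALL gds.beta.graphSage.train(
  'noPropertiesGraph',
  {
    modelName: 'topologyModel',   // hypothetical name, must not exist in the model catalog yet
    featureProperties: ['degree', 'pageRank']
  }
)
YIELD trainMillis
RETURN trainMillis
```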

2.5. Stream

To generate embeddings and stream them back to the client we can use the stream mode. We must first train a model, which we do using the gds.beta.graphSage.train procedure.

CALL gds.beta.graphSage.train(
  'persons',
  {
    modelName: 'graphSage',
    featureProperties: ['age', 'heightAndWeight'],
    embeddingDimension: 3,
    randomSeed: 19
  }
)

Once we have trained a model (named 'graphSage') we can use it to generate and stream the embeddings.

CALL gds.beta.graphSage.stream(
  'persons',
  {
    modelName: 'graphSage'
  }
)
YIELD nodeId, embedding
Table 22. Results
| nodeId | embedding |
| --- | --- |
| 0 | [0.5285002502143177, 0.4682181762801141, 0.7081378570737874] |
| 1 | [0.5285002502147674, 0.46821817628034773, 0.7081378570732975] |
| 2 | [0.5285002502143014, 0.46821817628010554, 0.7081378570738053] |
| 3 | [0.5285002502129178, 0.46821817627938667, 0.7081378570753134] |
| 4 | [0.5285002502572376, 0.46821817630241636, 0.7081378570270093] |
| 5 | [0.5285002503196665, 0.46821817633485613, 0.7081378569589678] |
| 6 | [0.528500250213112, 0.46821817627948753, 0.7081378570751017] |

Due to the random initialisation of the weight variables the results may vary slightly between the runs.
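
The streamed nodeId values are internal Neo4j node IDs. To make the output easier to read, the stream can be joined back to the stored nodes. The following sketch uses the gds.util.asNode function to look up each node and return its name property instead of the ID.

```cypher
CALL gds.beta.graphSage.stream(
  'persons',
  {
    modelName: 'graphSage'
  }
)
YIELD nodeId, embedding
// Resolve the internal node ID to the node stored in Neo4j.
RETURN gds.util.asNode(nodeId).name AS name, embedding
ORDER BY name
```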

2.6. Mutate

The model trained as part of the stream example can be reused to write the results to the in-memory graph using the mutate mode of the procedure. Below is an example of how to achieve this.

CALL gds.beta.graphSage.mutate(
  'persons',
  {
    mutateProperty: 'inMemoryEmbedding',
    modelName: 'graphSage'
  }
) YIELD
  nodeCount,
  nodePropertiesWritten
Table 23. Results
| nodeCount | nodePropertiesWritten |
| --- | --- |
| 7 | 7 |
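
The mutated property lives only in the in-memory graph and is not visible through regular Cypher queries on the database. As a sketch, it can be inspected with the graph catalog procedure gds.graph.streamNodeProperty. This assumes a GDS version that uses the gds.graph.create naming shown on this page; later versions may expose the same functionality under a different procedure name.

```cypher
// Stream the property created by the mutate call from the in-memory graph.
CALL gds.graph.streamNodeProperty('persons', 'inMemoryEmbedding')
YIELD nodeId, propertyValue
RETURN gds.util.asNode(nodeId).name AS name, propertyValue AS embedding
ORDER BY name
```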

2.7. Write

The model trained as part of the stream example can be reused to write the results to Neo4j. Below is an example of how to achieve this.

CALL gds.beta.graphSage.write(
  'persons',
  {
    writeProperty: 'embedding',
    modelName: 'graphSage'
  }
) YIELD
  nodeCount,
  nodePropertiesWritten
Table 24. Results
| nodeCount | nodePropertiesWritten |
| --- | --- |
| 7 | 7 |
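
After the write has completed, the embeddings are stored as regular node properties and can be read back with plain Cypher. The following query assumes the writeProperty name 'embedding' used in the example above.

```cypher
// Read the embeddings written to the Neo4j database.
MATCH (p:Person)
RETURN p.name AS name, p.embedding AS embedding
ORDER BY name
```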

3. Caveats

If you are embedding a graph that has an isolated node, the aggregation step in GraphSAGE can only draw information from the node itself. When all the properties of that node are 0.0 and the activation function is relu, this leads to an all-zero vector for that node. Since GraphSAGE normalizes node embeddings using the L2-norm, and a zero vector cannot be normalized, we assign all-zero embeddings to such nodes under these special circumstances. All-zero embeddings for orphan nodes may affect downstream tasks such as nearest neighbor or other similarity algorithms. It may be more appropriate to filter out these disconnected nodes prior to running GraphSAGE.
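
One way to find such nodes ahead of time is a plain Cypher query on the database. The following is a minimal sketch using the Person label from the example graph; adapt the label and the relationship pattern to the graph you intend to project.

```cypher
// List Person nodes that have no relationships at all.
MATCH (p:Person)
WHERE NOT (p)--()
RETURN p.name AS isolatedPerson
```

In the example graph every person has at least one KNOWS relationship, so the query returns no rows there.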

When running gds.beta.graphSage.train.estimate, the feature dimension is computed as if each feature property is scalar.
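
As a sketch, the estimate for the training example on the 'persons' graph could be requested as follows. The YIELD fields shown are those commonly returned by GDS estimate procedures and are assumed to be available here. Because heightAndWeight is a list property, the estimate may not fully account for its actual length.

```cypher
CALL gds.beta.graphSage.train.estimate(
  'persons',
  {
    modelName: 'exampleTrainModel',
    featureProperties: ['age', 'heightAndWeight']   // treated as two scalar features by the estimate
  }
)
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
RETURN nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
```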