Node2Vec

This section describes the Node2Vec node embedding algorithm in the Neo4j Graph Data Science library.

Node2Vec is a node embedding algorithm that computes a vector representation of a node based on random walks in the graph. The neighborhood is sampled through random walks. Using a number of random neighborhood samples, the algorithm trains a single hidden layer neural network. The neural network is trained to predict the likelihood that a node will occur in a walk based on the occurrence of another node.

For more information on this algorithm, see:

1. Random Walks

A main concept of the Node2Vec algorithm are the second order random walks. A random walk simulates a traversal of the graph in which the traversed relationships are chosen at random. In a classic random walk, each relationship has the same, possibly weighted, probability of being picked. This probability is not influenced by the previously visited nodes. The concept of second order random walks, however, tries to model the transition probability based on the currently visited node v, the node t visited before the current one, and the node x which is the target of a candidate relationship. Node2Vec random walks are thus influenced by two parameters: the returnFactor and the inOutFactor:

  • The returnFactor is used if t equals x, i.e., the random walk returns to the previously visited node.

  • The inOutFactor is used if the distance from t to x is equal to 2, i.e., the walk traverses further away from the node t

Visuzalition of random walk parameters

The probabilities for traversing a relationship during a random walk can be further influenced by specifying a relationshipWeightProperty. A relationship property value greater than 1 will increase the likelihood of a relationship being traversed, a property value between 0 and 1 will decrease that probability.

For every node in the graph Node2Vec generates a series of random walks with the particular node as start node. The number of random walks per node can be influenced by the walkPerNode configuration parameters, the walk length is controlled by the walkLength parameter.

2. Syntax

Node2Vec syntax per mode
Run Node2Vec in stream mode on a named graph.
CALL gds.beta.node2vec.stream(
  graphName: String,
  configuration: Map
) YIELD
  nodeId: Integer,
  embedding: List of Float
Table 1. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name Type Default Optional Description

walkLength

Integer

80

yes

The number of steps in a single random walk.

walksPerNode

Integer

10

yes

The number of random walks generated for each node.

inOutFactor

Float

1.0

yes

Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.

returnFactor

Float

1.0

yes

Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.

relationshipWeightProperty

String

null

yes

Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.

windowSize

Integer

10

yes

Size of the context window when training the neural network.

negativeSamplingRate

Integer

5

yes

Number of negative samples to produce for each positive sample.

positiveSamplingFactor

Float

0.001

yes

Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.

negativeSamplingExponent

Float

0.75

yes

Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.

embeddingDimension

Integer

128

yes

Size of the computed node embeddings.

iterations

Integer

1

yes

Number of training iterations.

initialLearningRate

Float

0.01

yes

Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.

minLearningRate

Float

0.0001

yes

Lower bound for learning rate as it is decreased during training.

randomSeed

Integer

random

yes

Seed value for the random number generator used to generate the random walks.

walkBufferSize

Integer

1000

yes

The number of random walks to complete before starting training.

Table 4. Results
Name Type Description

nodeId

Integer

The Neo4j node ID.

embedding

List of Float

The computed node embedding.

Run Node2Vec in mutate mode on a graph stored in the catalog.
CALL gds.beta.node2vec.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 5. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 6. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateProperty

String

n/a

no

The node property in the GDS graph to which the embedding is written.

Table 7. Algorithm specific configuration
Name Type Default Optional Description

walkLength

Integer

80

yes

The number of steps in a single random walk.

walksPerNode

Integer

10

yes

The number of random walks generated for each node.

inOutFactor

Float

1.0

yes

Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.

returnFactor

Float

1.0

yes

Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.

relationshipWeightProperty

String

null

yes

Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.

windowSize

Integer

10

yes

Size of the context window when training the neural network.

negativeSamplingRate

Integer

5

yes

Number of negative samples to produce for each positive sample.

positiveSamplingFactor

Float

0.001

yes

Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.

negativeSamplingExponent

Float

0.75

yes

Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.

embeddingDimension

Integer

128

yes

Size of the computed node embeddings.

iterations

Integer

1

yes

Number of training iterations.

initialLearningRate

Float

0.01

yes

Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.

minLearningRate

Float

0.0001

yes

Lower bound for learning rate as it is decreased during training.

randomSeed

Integer

random

yes

Seed value for the random number generator used to generate the random walks.

walkBufferSize

Integer

1000

yes

The number of random walks to complete before starting training.

Table 8. Results
Name Type Description

nodeCount

Integer

The number of nodes processed.

nodePropertiesWritten

Integer

The number of node properties written.

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

mutateMillis

Integer

Milliseconds for adding properties to the in-memory graph.

postProcessingMillis

Integer

Milliseconds for post-processing of the results.

configuration

Map

The configuration used for running the algorithm.

Run Node2Vec in write mode on a graph stored in the catalog.
CALL gds.beta.node2vec.write(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 9. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 10. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

List of String

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

List of String

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result to Neo4j.

writeProperty

String

n/a

no

The node property in the Neo4j database to which the embedding is written.

Table 11. Algorithm specific configuration
Name Type Default Optional Description

walkLength

Integer

80

yes

The number of steps in a single random walk.

walksPerNode

Integer

10

yes

The number of random walks generated for each node.

inOutFactor

Float

1.0

yes

Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.

returnFactor

Float

1.0

yes

Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.

relationshipWeightProperty

String

null

yes

Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.

windowSize

Integer

10

yes

Size of the context window when training the neural network.

negativeSamplingRate

Integer

5

yes

Number of negative samples to produce for each positive sample.

positiveSamplingFactor

Float

0.001

yes

Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.

negativeSamplingExponent

Float

0.75

yes

Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.

embeddingDimension

Integer

128

yes

Size of the computed node embeddings.

iterations

Integer

1

yes

Number of training iterations.

initialLearningRate

Float

0.01

yes

Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.

minLearningRate

Float

0.0001

yes

Lower bound for learning rate as it is decreased during training.

randomSeed

Integer

random

yes

Seed value for the random number generator used to generate the random walks.

walkBufferSize

Integer

1000

yes

The number of random walks to complete before starting training.

Table 12. Results
Name Type Description

nodeCount

Integer

The number of nodes processed.

nodePropertiesWritten

Integer

The number of node properties written.

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

writeMillis

Integer

Milliseconds for writing result data back to Neo4j.

configuration

Map

The configuration used for running the algorithm.

2.1. Anonymous graphs

It is also possible to execute the algorithm on a graph that is projected in conjunction with the algorithm execution. In this case, the graph does not have a name, and we call it anonymous. When executing over an anonymous graph the configuration map contains a graph projection configuration as well as an algorithm configuration. All execution modes support execution on anonymous graphs, although we only show syntax and mode-specific configuration for the write mode for brevity.

For more information on syntax variants, see Syntax overview.

Run Node2Vec in write mode on an anonymous graph.
CALL gds.beta.node2vec.write(
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  configuration: Map
Table 13. General configuration for algorithm execution on an anonymous graph.
Name Type Default Optional Description

nodeProjection

String, List of String or Map

null

yes

The node projection used for anonymous graph creation via a Native projection.

relationshipProjection

String, List of String or Map

null

yes

The relationship projection used for anonymous graph creation a Native projection.

nodeQuery

String

null

yes

The Cypher query used to select the nodes for anonymous graph creation via a Cypher projection.

relationshipQuery

String

null

yes

The Cypher query used to select the relationships for anonymous graph creation via a Cypher projection.

nodeProperties

String, List of String or Map

null

yes

The node properties to project during anonymous graph creation.

relationshipProperties

String, List of String or Map

null

yes

The relationship properties to project during anonymous graph creation.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'.

readConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for creating the graph.

writeConcurrency

Integer

value of 'concurrency'

yes

WRITE mode only: The number of concurrent threads used for writing the result.

writeProperty

String

n/a

no

WRITE mode only: The node property to which the embedding is written to.

Table 14. Algorithm specific configuration
Name Type Default Optional Description

walkLength

Integer

80

yes

The number of steps in a single random walk.

walksPerNode

Integer

10

yes

The number of random walks generated for each node.

inOutFactor

Float

1.0

yes

Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.

returnFactor

Float

1.0

yes

Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.

relationshipWeightProperty

String

null

yes

Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.

windowSize

Integer

10

yes

Size of the context window when training the neural network.

negativeSamplingRate

Integer

5

yes

Number of negative samples to produce for each positive sample.

positiveSamplingFactor

Float

0.001

yes

Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.

negativeSamplingExponent

Float

0.75

yes

Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.

embeddingDimension

Integer

128

yes

Size of the computed node embeddings.

iterations

Integer

1

yes

Number of training iterations.

initialLearningRate

Float

0.01

yes

Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.

minLearningRate

Float

0.0001

yes

Lower bound for learning rate as it is decreased during training.

randomSeed

Integer

random

yes

Seed value for the random number generator used to generate the random walks.

walkBufferSize

Integer

1000

yes

The number of random walks to complete before starting training.

The results are the same as for running write mode with a named graph, see the write mode syntax above.

3. Examples

Consider the graph created by the following Cypher statement:

CREATE (alice:Person {name: 'Alice'})
CREATE (bob:Person {name: 'Bob'})
CREATE (carol:Person {name: 'Carol'})
CREATE (dave:Person {name: 'Dave'})
CREATE (eve:Person {name: 'Eve'})
CREATE (guitar:Instrument {name: 'Guitar'})
CREATE (synth:Instrument {name: 'Synthesizer'})
CREATE (bongos:Instrument {name: 'Bongos'})
CREATE (trumpet:Instrument {name: 'Trumpet'})

CREATE (alice)-[:LIKES]->(guitar)
CREATE (alice)-[:LIKES]->(synth)
CREATE (alice)-[:LIKES]->(bongos)
CREATE (bob)-[:LIKES]->(guitar)
CREATE (bob)-[:LIKES]->(synth)
CREATE (carol)-[:LIKES]->(bongos)
CREATE (dave)-[:LIKES]->(guitar)
CREATE (dave)-[:LIKES]->(synth)
CREATE (dave)-[:LIKES]->(bongos);
CALL gds.graph.create('myGraph', ['Person', 'Instrument'], 'LIKES');
Run the Node2Vec algorithm on myGraph
CALL gds.beta.node2vec.stream('myGraph', {embeddingDimension: 2})
YIELD nodeId, embedding
RETURN nodeId, embedding
Table 15. Results
nodeId embedding

0

[-0.14295829832553864, 0.08884537220001221]

1

[0.016700705513358116, 0.2253911793231964]

2

[-0.06589698046445847, 0.042405471205711365]

3

[0.05862073227763176, 0.1193704605102539]

4

[0.10888434946537018, -0.18204474449157715]

5

[0.16728264093399048, 0.14098615944385529]

6

[-0.007779224775731564, 0.02114257402718067]

7

[-0.213893860578537, 0.06195802614092827]

8

[0.2479933649301529, -0.137322798371315]