Node2Vec

This feature is in the beta tier. For more information on feature tiers, see API Tiers.

Node2Vec is a node embedding algorithm that computes a vector representation of each node based on random walks in the graph. The random walks sample the neighborhood of each node. Using a number of these random neighborhood samples, the algorithm trains a neural network with a single hidden layer. The network is trained to predict the likelihood that a node will occur in a walk based on the occurrence of another node.

For more information on this algorithm, see:

Random Walks

A central concept of the Node2Vec algorithm is the second-order random walk. A random walk simulates a traversal of the graph in which the traversed relationships are chosen at random. In a classic random walk, each relationship has the same, possibly weighted, probability of being picked, and this probability is not influenced by the previously visited nodes. A second-order random walk, however, models the transition probability based on the currently visited node v, the node t visited before the current one, and the node x which is the target of a candidate relationship. Node2Vec random walks are thus influenced by two parameters, the returnFactor and the inOutFactor:

The returnFactor is used if t equals x, i.e., the random walk returns to the previously visited node.
The inOutFactor is used if the distance from t to x is equal to 2, i.e., the walk moves further away from the node t.
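
The two cases above can be written as a bias applied to each candidate relationship from v to x. The following is a sketch in the notation of the original node2vec paper. It assumes that the implementation follows the paper, with returnFactor in the role of p and inOutFactor in the role of q. The symbol α and the distance function d(t, x) (the shortest-path distance between t and x) are borrowed from the paper and do not appear elsewhere on this page.

```latex
\alpha(t, x) =
\begin{cases}
  \dfrac{1}{\text{returnFactor}} & \text{if } d(t, x) = 0 \quad (x = t) \\[6pt]
  1                              & \text{if } d(t, x) = 1 \\[6pt]
  \dfrac{1}{\text{inOutFactor}}  & \text{if } d(t, x) = 2
\end{cases}
\qquad
P(x \mid v, t) = \frac{\alpha(t, x)}{\sum_{x'} \alpha(t, x')}
```

Under this reading, a returnFactor below 1.0 makes stepping back to t more likely, and an inOutFactor above 1.0 makes moving away from t less likely, which agrees with the parameter descriptions in the configuration tables.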

The probabilities for traversing a relationship during a random walk can be further influenced by specifying a relationshipWeightProperty. A relationship property value greater than 1 will increase the likelihood of a relationship being traversed, while a property value between 0 and 1 will decrease that probability.
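
As a minimal sketch of a weighted run, the statements below create a small graph, project it together with a relationship property, and pass that property as relationshipWeightProperty. The User label, the KNOWS type, the strength property, and the graph name weightedGraph are all hypothetical and chosen only for illustration.

```cypher
// Hypothetical data: label, relationship type and 'strength' values are illustrative only.
CREATE (a:User {name: 'A'}), (b:User {name: 'B'}), (c:User {name: 'C'})
CREATE (a)-[:KNOWS {strength: 2.0}]->(b),
       (a)-[:KNOWS {strength: 0.5}]->(c),
       (b)-[:KNOWS {strength: 1.0}]->(c);

// Project the graph and include the relationship property so it can serve as a weight.
CALL gds.graph.project(
  'weightedGraph',
  'User',
  {KNOWS: {properties: 'strength'}}
);

// From node A, the walk should favor the relationship to B (weight 2.0) over the one to C (weight 0.5).
CALL gds.node2vec.stream('weightedGraph', {
  embeddingDimension: 2,
  relationshipWeightProperty: 'strength'
})
YIELD nodeId, embedding
RETURN nodeId, embedding;
```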

For every node in the graph, Node2Vec generates a series of random walks with that node as the start node. The number of random walks per node is set by the walksPerNode configuration parameter, and the length of each walk is controlled by the walkLength parameter.
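
For illustration, the call below sets the four random walk parameters explicitly. It assumes the graph myGraph that is projected in the Examples section. The specific values are arbitrary and not a tuning recommendation.

```cypher
CALL gds.node2vec.stream('myGraph', {
  embeddingDimension: 2,
  walksPerNode: 5,    // five walks start from every node
  walkLength: 20,     // each walk takes up to 20 steps
  returnFactor: 0.5,  // below 1.0: higher tendency to step back to the previous node
  inOutFactor: 2.0    // above 1.0: walks tend to stay local
})
YIELD nodeId, embedding
RETURN nodeId, embedding;
```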

Usage in machine learning pipelines

At this time, using Node2Vec as a node property step in a machine learning pipeline (like Link prediction pipelines and Node property prediction) is not well supported, at least if the end goal is to apply a prediction model using its embeddings.

In order for a machine learning model to be able to make useful predictions, it is important that features produced during prediction are of a similar distribution to the features produced during training of the model. Moreover, node property steps (whether Node2Vec or not) added to a pipeline are executed both during training, and during the prediction by the trained model. It is therefore problematic when a pipeline contains an embedding step which yields all too dissimilar embeddings during training and prediction.

The final embeddings produced by Node2Vec depend on the randomness in generating the initial node embedding vectors as well as on the random walks taken in the computation. At this time, Node2Vec will produce non-deterministic results even if the randomSeed configuration parameter is set. Since embeddings will not be deterministic between runs, Node2Vec should not be used as a node property step in a pipeline at this time, unless the purpose is experimental and only the train mode is used.

It may still be useful to use Node2Vec node embeddings as features in a pipeline if they are produced outside the pipeline, as long as one is aware of the data leakage risks of not using the dataset split in the pipeline.
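
One way to produce the embeddings outside a pipeline is to run the algorithm once in mutate mode, so that the same embeddings are stored as a node property on the projected graph and can be reused afterwards. The sketch below assumes the graph myGraph from the Examples section. The property name n2vEmbedding is a placeholder.

```cypher
// Compute the embeddings once and store them on the in-memory graph.
CALL gds.node2vec.mutate('myGraph', {
  mutateProperty: 'n2vEmbedding',
  embeddingDimension: 8
})
YIELD nodeCount, nodePropertiesWritten
RETURN nodeCount, nodePropertiesWritten;
```

Because the embeddings are computed on the whole graph before any split is made, the data leakage caveat above still applies.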

Syntax

Node2Vec syntax per mode

Run Node2Vec in stream mode on a named graph.

CALL gds.node2vec.stream(
  graphName: String,
  configuration: Map
) YIELD
  nodeId: Integer,
  embedding: List of Float

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
walkLength	Integer	`80`	yes	The number of steps in a single random walk.
walksPerNode	Integer	`10`	yes	The number of random walks generated for each node.
inOutFactor	Float	`1.0`	yes	Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.
returnFactor	Float	`1.0`	yes	Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.
windowSize	Integer	`10`	yes	Size of the context window when training the neural network.
negativeSamplingRate	Integer	`5`	yes	Number of negative samples to produce for each positive sample.
positiveSamplingFactor	Float	`0.001`	yes	Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.
negativeSamplingExponent	Float	`0.75`	yes	Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.
embeddingDimension	Integer	`128`	yes	Size of the computed node embeddings.
embeddingInitializer	String	`NORMALIZED`	yes	Method to initialize embeddings. Values are sampled uniformly from a range `[-a, a]`. With `NORMALIZED`, `a=0.5/embeddingDimension`; with `UNIFORM`, `a=1`.
iterations	Integer	`1`	yes	Number of training iterations.
initialLearningRate	Float	`0.01`	yes	Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.
minLearningRate	Float	`0.0001`	yes	Lower bound for learning rate as it is decreased during training.
randomSeed	Integer	`random`	yes	Seed value used to generate the random walks, which are used as the training set of the neural network. Note that the generated embeddings are still nondeterministic.
walkBufferSize	Integer	`1000`	yes	The number of random walks to complete before starting training.
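
Two of the descriptions above can be stated as formulas. This is a sketch, not a statement about the implementation: it assumes that "node frequency" means the number of occurrences f(v) of node v in the generated walks, and the symbols f, γ and d are introduced here for illustration only.

```latex
% Negative sampling distribution, with \gamma = negativeSamplingExponent
P_{\text{neg}}(v) = \frac{f(v)^{\gamma}}{\sum_{u} f(u)^{\gamma}}
\qquad
\gamma = 1 \Rightarrow P_{\text{neg}}(v) \propto f(v),
\qquad
\gamma = 0 \Rightarrow P_{\text{neg}}(v) = \frac{1}{|V|}

% Initialization range for embeddingInitializer = NORMALIZED, with d = embeddingDimension
a = \frac{0.5}{d},
\qquad
d = 128 \Rightarrow a = \frac{0.5}{128} = 0.00390625
```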

Table 3. Results
Name	Type	Description
`nodeId`	Integer	The Neo4j node ID.
`embedding`	List of Float	The computed node embedding.

Run Node2Vec in mutate mode on a graph stored in the catalog.

CALL gds.node2vec.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  mutateMillis: Integer,
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  lossPerIteration: List of Float,
  configuration: Map

Table 4. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 5. Configuration
Name	Type	Default	Optional	Description
mutateProperty	String	`n/a`	no	The node property in the GDS graph to which the embedding is written.
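nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.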
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
walkLength	Integer	`80`	yes	The number of steps in a single random walk.
walksPerNode	Integer	`10`	yes	The number of random walks generated for each node.
inOutFactor	Float	`1.0`	yes	Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.
returnFactor	Float	`1.0`	yes	Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.
windowSize	Integer	`10`	yes	Size of the context window when training the neural network.
negativeSamplingRate	Integer	`5`	yes	Number of negative samples to produce for each positive sample.
positiveSamplingFactor	Float	`0.001`	yes	Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.
negativeSamplingExponent	Float	`0.75`	yes	Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.
embeddingDimension	Integer	`128`	yes	Size of the computed node embeddings.
embeddingInitializer	String	`NORMALIZED`	yes	Method to initialize embeddings. Values are sampled uniformly from a range `[-a, a]`. With `NORMALIZED`, `a=0.5/embeddingDimension`; with `UNIFORM`, `a=1`.
iterations	Integer	`1`	yes	Number of training iterations.
initialLearningRate	Float	`0.01`	yes	Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.
minLearningRate	Float	`0.0001`	yes	Lower bound for learning rate as it is decreased during training.
randomSeed	Integer	`random`	yes	Seed value used to generate the random walks, which are used as the training set of the neural network. Note that the generated embeddings are still nondeterministic.
walkBufferSize	Integer	`1000`	yes	The number of random walks to complete before starting training.

Table 6. Results
Name	Type	Description
nodeCount	Integer	The number of nodes processed.
nodePropertiesWritten	Integer	The number of node properties written.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
mutateMillis	Integer	Milliseconds for adding properties to the projected graph.
postProcessingMillis	Integer	Milliseconds for post-processing of the results.
lossPerIteration	List of Float	The sum of the losses registered per training iteration.
configuration	Map	The configuration used for running the algorithm.

Run Node2Vec in write mode on a graph stored in the catalog.

CALL gds.node2vec.write(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  nodeCount: Integer,
  nodePropertiesWritten: Integer,
  lossPerIteration: List of Float,
  configuration: Map

Table 7. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 8. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
writeConcurrency	Integer	`value of 'concurrency'`	yes	The number of concurrent threads used for writing the result to Neo4j.
writeProperty	String	`n/a`	no	The node property in the Neo4j database to which the embedding is written.
walkLength	Integer	`80`	yes	The number of steps in a single random walk.
walksPerNode	Integer	`10`	yes	The number of random walks generated for each node.
inOutFactor	Float	`1.0`	yes	Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.
returnFactor	Float	`1.0`	yes	Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.
windowSize	Integer	`10`	yes	Size of the context window when training the neural network.
negativeSamplingRate	Integer	`5`	yes	Number of negative samples to produce for each positive sample.
positiveSamplingFactor	Float	`0.001`	yes	Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.
negativeSamplingExponent	Float	`0.75`	yes	Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.
embeddingDimension	Integer	`128`	yes	Size of the computed node embeddings.
embeddingInitializer	String	`NORMALIZED`	yes	Method to initialize embeddings. Values are sampled uniformly from a range `[-a, a]`. With `NORMALIZED`, `a=0.5/embeddingDimension`; with `UNIFORM`, `a=1`.
iterations	Integer	`1`	yes	Number of training iterations.
initialLearningRate	Float	`0.01`	yes	Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.
minLearningRate	Float	`0.0001`	yes	Lower bound for learning rate as it is decreased during training.
randomSeed	Integer	`random`	yes	Seed value used to generate the random walks, which are used as the training set of the neural network. Note that the generated embeddings are still nondeterministic.
walkBufferSize	Integer	`1000`	yes	The number of random walks to complete before starting training.

Table 9. Results
Name	Type	Description
nodeCount	Integer	The number of nodes processed.
nodePropertiesWritten	Integer	The number of node properties written.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
writeMillis	Integer	Milliseconds for writing result data back to Neo4j.
lossPerIteration	List of Float	The sum of the losses registered per training iteration.
configuration	Map	The configuration used for running the algorithm.
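
As a brief illustration of write mode, the sketch below writes the embeddings back to the database, returns the loss recorded for each training iteration, and then reads the stored property with a regular Cypher query. It assumes the graph myGraph and the Person nodes from the Examples section. The property name n2vEmbedding and the value of iterations are placeholders.

```cypher
// Train for three iterations and write the embeddings to the Neo4j database.
CALL gds.node2vec.write('myGraph', {
  writeProperty: 'n2vEmbedding',
  embeddingDimension: 2,
  iterations: 3
})
YIELD nodeCount, nodePropertiesWritten, lossPerIteration, writeMillis
RETURN nodeCount, nodePropertiesWritten, lossPerIteration, writeMillis;

// Read the written property back from the database.
MATCH (p:Person)
RETURN p.name AS name, p.n2vEmbedding AS embedding
ORDER BY name;
```

The lossPerIteration list contains one entry per training iteration, so it has three entries here.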

Examples

All the examples below should be run in an empty database.

The examples use native projections as the norm, although Cypher projections can be used as well.

Consider the graph created by the following Cypher statement:

CREATE (alice:Person {name: 'Alice'})
CREATE (bob:Person {name: 'Bob'})
CREATE (carol:Person {name: 'Carol'})
CREATE (dave:Person {name: 'Dave'})
CREATE (eve:Person {name: 'Eve'})
CREATE (guitar:Instrument {name: 'Guitar'})
CREATE (synth:Instrument {name: 'Synthesizer'})
CREATE (bongos:Instrument {name: 'Bongos'})
CREATE (trumpet:Instrument {name: 'Trumpet'})

CREATE (alice)-[:LIKES]->(guitar)
CREATE (alice)-[:LIKES]->(synth)
CREATE (alice)-[:LIKES]->(bongos)
CREATE (bob)-[:LIKES]->(guitar)
CREATE (bob)-[:LIKES]->(synth)
CREATE (carol)-[:LIKES]->(bongos)
CREATE (dave)-[:LIKES]->(guitar)
CREATE (dave)-[:LIKES]->(synth)
CREATE (dave)-[:LIKES]->(bongos);

CALL gds.graph.project('myGraph', ['Person', 'Instrument'], 'LIKES');

Run the Node2Vec algorithm on myGraph:

CALL gds.node2vec.stream('myGraph', {embeddingDimension: 2})
YIELD nodeId, embedding
RETURN nodeId, embedding

Table 10. Results
nodeId	embedding
0	[-0.14295829832553864, 0.08884537220001221]
1	[0.016700705513358116, 0.2253911793231964]
2	[-0.06589698046445847, 0.042405471205711365]
3	[0.05862073227763176, 0.1193704605102539]
4	[0.10888434946537018, -0.18204474449157715]
5	[0.16728264093399048, 0.14098615944385529]
6	[-0.007779224775731564, 0.02114257402718067]
7	[-0.213893860578537, 0.06195802614092827]
8	[0.2479933649301529, -0.137322798371315]