Node2Vec

Node2Vec is a node embedding algorithm that computes a vector representation of each node based on random walks in the graph. The random walks sample the neighborhood of each node. Using a number of these neighborhood samples, the algorithm trains a neural network with a single hidden layer. The neural network is trained to predict the likelihood that a node will occur in a walk based on the occurrence of another node.
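
To make the training objective more concrete, the following minimal sketch shows how co-occurring node pairs could be extracted from a single walk. It is an illustration only, not the product's implementation: the function `context_pairs`, the parameter `radius`, and the node names are hypothetical, and `radius` may not map one-to-one to the windowSize parameter.

```python
def context_pairs(walk, radius):
    """Return (center, context) pairs for every node in a walk.

    `radius` is a hypothetical parameter for this sketch: the number of
    positions to the left and right of a node that count as its context.
    """
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - radius)
        hi = min(len(walk), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# A made-up walk over four nodes, used only for illustration.
walk = ["a", "b", "c", "d"]
pairs = context_pairs(walk, radius=1)
for center, context in pairs:
    print(center, "->", context)
```

Pairs like these serve as positive samples: the network learns to assign a high likelihood to a context node given its center node.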

For more information on this algorithm, see:

Random Walks

A main concept of the Node2Vec algorithm is that of second-order random walks. A random walk simulates a traversal of the graph in which the traversed relationships are chosen at random. In a classic random walk, each relationship has the same, possibly weighted, probability of being picked. This probability is not influenced by the previously visited nodes. Second-order random walks, however, model the transition probability based on the currently visited node v, the node t visited before the current one, and the node x which is the target of a candidate relationship. Node2Vec random walks are thus influenced by two parameters, returnFactor and inOutFactor:

  • The returnFactor is used if t equals x, i.e., the random walk returns to the previously visited node.

  • The inOutFactor is used if the distance from t to x is equal to 2, i.e., the walk moves further away from the node t.

Visualization of random walk parameters

The probabilities for traversing a relationship during a random walk can be further influenced by specifying a relationshipWeightProperty. A relationship property value greater than 1 increases the likelihood of a relationship being traversed, while a property value between 0 and 1 decreases it.
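
The following sketch shows one way these rules could be combined into transition probabilities. It assumes the formulation of the original node2vec paper, in which the relationship weight is scaled by 1/returnFactor when x equals t, left unscaled when x is a direct neighbor of t, and scaled by 1/inOutFactor when x is at distance 2 from t. Whether the product follows exactly this formula is an assumption. The function name, the adjacency-map layout, and the toy graph are hypothetical.

```python
def transition_probabilities(adj, t, v, return_factor=1.0, in_out_factor=1.0):
    """Probabilities for the next step from v, given the previous node t.

    `adj` maps each node to a dict {neighbor: weight}. The graph is treated
    as undirected in this sketch (an assumption).
    """
    scores = {}
    for x, weight in adj[v].items():
        if x == t:
            bias = 1.0 / return_factor   # walk returns to the previous node
        elif x in adj[t]:
            bias = 1.0                   # x is a neighbor of t (distance 1)
        else:
            bias = 1.0 / in_out_factor   # x is at distance 2 from t
        scores[x] = weight * bias
    total = sum(scores.values())
    return {x: score / total for x, score in scores.items()}


# Toy graph: the walk arrived at v from t; a neighbors both t and v; b only v.
adj = {
    "t": {"v": 1.0, "a": 1.0},
    "v": {"t": 1.0, "a": 1.0, "b": 1.0},
    "a": {"t": 1.0, "v": 1.0},
    "b": {"v": 1.0},
}
probs = transition_probabilities(adj, t="t", v="v",
                                 return_factor=2.0, in_out_factor=0.5)
print(probs)
```

With a returnFactor above 1 and an inOutFactor below 1, as in this call, the step back to t becomes less likely and the step to the more distant node b becomes more likely than the step to a.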

For every node in the graph, Node2Vec generates a series of random walks with that node as the start node. The number of random walks per node is controlled by the walksPerNode configuration parameter, and the length of each walk is controlled by the walkLength parameter.
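
As a rough illustration of how these two parameters shape the training data, the sketch below generates walks on a small made-up graph. For brevity it takes uniform, first-order steps rather than the biased second-order steps that Node2Vec uses, and it interprets the walk length as a maximum number of hops, which is an assumption. The function `generate_walks` and the graph are hypothetical.

```python
import random


def generate_walks(adj, walks_per_node, walk_length, seed=42):
    """Start `walks_per_node` walks at every node, each with at most
    `walk_length` hops. A walk ends early at a node without neighbors."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) - 1 < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks


# Made-up graph as adjacency lists; node "d" has no relationships.
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
walks = generate_walks(adj, walks_per_node=3, walk_length=5)
print(len(walks))  # number of nodes times walks_per_node
```

The total number of walks grows linearly with both the node count and the number of walks per node, so these parameters directly affect the size of the training set.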

Syntax

This section covers the syntax used to execute the Node2Vec algorithm.

Run Node2Vec.
CALL Neo4j_Graph_Analytics.graph.node2vec(
  'CPU_X64_XS',                    (1)
  {
    ['defaultTablePrefix': '...',] (2)
    'project': {...},              (3)
    'compute': {...},              (4)
    'write':   {...}               (5)
  }
);
1 Compute pool selector.
2 Optional prefix for table references.
3 Project config.
4 Compute config.
5 Write config.
Table 1. Parameters

| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| computePoolSelector | String | n/a | no | The selector for the compute pool on which to run the Node2Vec job. |
| configuration | Map | {} | no | Configuration for graph project, algorithm compute and result write back. |

The configuration map consists of the following three entries.

For more details on the Project configuration below, refer to the Project documentation.

Table 2. Project configuration

| Name | Description |
| --- | --- |
| nodeTables | List of node tables. |
| relationshipTables | Map of relationship types to relationship tables. |

Table 3. Compute configuration

| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| walkLength | Integer | 80 | yes | The number of steps in a single random walk. |
| walksPerNode | Integer | 10 | yes | The number of random walks generated for each node. |
| inOutFactor | Float | 1.0 | yes | Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local. |
| returnFactor | Float | 1.0 | yes | Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency. |
| relationshipWeightProperty | String | null | yes | Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted. |
| windowSize | Integer | 10 | yes | Size of the context window when training the neural network. |
| negativeSamplingRate | Integer | 5 | yes | Number of negative samples to produce for each positive sample. |
| positiveSamplingFactor | Float | 0.001 | yes | Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled. |
| negativeSamplingExponent | Float | 0.75 | yes | Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally. |
| embeddingDimension | Integer | 128 | yes | Size of the computed node embeddings. |
| embeddingInitializer | String | NORMALIZED | yes | Method to initialize embeddings. Values are sampled uniformly from a range [-a, a]. With NORMALIZED, a=0.5/embeddingDimension and with UNIFORM instead a=1. |
| iterations | Integer | 1 | yes | Number of training iterations. Higher iterations still sample more random walks, and therefore the set of walks will generally become more representative of the entire graph. |
| initialLearningRate | Float | 0.01 | yes | Learning rate used initially for training the neural network. The learning rate decreases after each training iteration. |
| minLearningRate | Float | 0.0001 | yes | Lower bound for learning rate as it is decreased during training. |
| randomSeed | Integer | random | yes | Seed value used to generate the random walks, which are used as the training set of the neural network. Note that the generated embeddings are still nondeterministic. |
| walkBufferSize | Integer | 1000 | yes | The number of random walks to complete before starting training. |
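
Two of the descriptions above translate directly into small formulas. The sketch below follows the wording of the negativeSamplingExponent and embeddingInitializer descriptions. The function names and the sample frequencies are hypothetical, and the product's actual implementation may differ in detail.

```python
def negative_sampling_distribution(frequencies, exponent=0.75):
    """Raise each node frequency to `exponent` and normalize the result."""
    weights = {node: count ** exponent for node, count in frequencies.items()}
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


def initializer_bound(initializer, embedding_dimension):
    """Return a, where initial values are sampled uniformly from [-a, a]."""
    if initializer == "NORMALIZED":
        return 0.5 / embedding_dimension
    if initializer == "UNIFORM":
        return 1.0
    raise ValueError("unknown initializer: " + initializer)


# Made-up node frequencies: how often each node occurs in the walks.
frequencies = {"a": 8, "b": 1, "c": 1}
for exponent in (1.0, 0.75, 0.0):
    print(exponent, negative_sampling_distribution(frequencies, exponent))

print(initializer_bound("NORMALIZED", 128))
print(initializer_bound("UNIFORM", 128))
```

An exponent between 0.0 and 1.0 flattens the distribution, so that rarely visited nodes are drawn as negative samples more often than their raw frequency would suggest.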

For more details on the Write configuration below, refer to the Write documentation.

Table 4. Write configuration

| Name | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| nodeLabel | String | n/a | no | Node label in the in-memory graph from which to write a node property. |
| nodeProperty | String | 'node2vec' | yes | The node property that will be written back to the Snowflake database. |
| outputTable | String | n/a | no | Table in Snowflake database to which node properties are written. |

Example

In this section we will show examples of running the Node2Vec algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small knowledge graph of a handful of nodes, connected in a particular pattern. The example graph looks like this:

Visualization of the example graph
The following SQL statements will create the example graph tables in the Snowflake database:
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.PERSONS VALUES
  ('Alice'),
  ('Bob'),
  ('Carol'),
  ('Dave'),
  ('Eve');

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS (NODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS VALUES
  ('Guitar'),
  ('Synthesizer'),
  ('Bongos'),
  ('Trumpet');

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.LIKES (SOURCENODEID VARCHAR, TARGETNODEID VARCHAR, WEIGHT FLOAT);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.LIKES VALUES
  ('Alice', 'Guitar',      1.0),
  ('Alice', 'Synthesizer', 1.0),
  ('Alice', 'Bongos',      0.5),
  ('Bob',   'Guitar',      1.0),
  ('Bob',   'Synthesizer', 1.0),
  ('Carol', 'Bongos',      1.0),
  ('Dave',  'Guitar',      1.0),
  ('Dave',  'Trumpet',     1.5),
  ('Dave',  'Bongos',      1.0);

This bipartite graph has two node sets, Person nodes and Instrument nodes. The two node sets are connected via LIKES relationships. Each relationship starts at a Person node and ends at an Instrument node.

Run job

To run the query, grants must first be set up for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The following will run a Node2Vec job:
CALL Neo4j_Graph_Analytics.graph.node2vec('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': [ 'PERSONS', 'INSTRUMENTS' ],
        'relationshipTables': {
            'LIKES': {
                'sourceTable': 'PERSONS',
                'targetTable': 'INSTRUMENTS'
            }
        }
    },
    'compute': {
        'embeddingDimension': 2
    },
    'write': [{
        'nodeLabel': 'PERSONS',
        'outputTable': 'PERSON_EMBEDDINGS'
    }]
});
Table 5. Results
JOB_ID: job_9f036be61fe043dbbef168b9bae4da25
JOB_START: 2025-07-17 08:43:17.050
JOB_END: 2025-07-17 08:43:21.622
JOB_RESULT:

{
  "node2vec_1": {
    "computeMillis": 33,
    "configuration": {
      "concurrency": 6,
      "embeddingDimension": 2,
      "embeddingInitializer": "NORMALIZED",
      "inOutFactor": 1,
      "initialLearningRate": 0.01,
      "iterations": 1,
      "jobId": "235ad57f-7555-44d7-85f0-7a78bf21d30d",
      "logProgress": true,
      "minLearningRate": 1.000000000000000e-04,
      "mutateProperty": "node2vec",
      "negativeSamplingExponent": 0.75,
      "negativeSamplingRate": 5,
      "nodeLabels": [
        "*"
      ],
      "positiveSamplingFactor": 0.001,
      "relationshipTypes": [
        "*"
      ],
      "returnFactor": 1,
      "sudo": false,
      "walkBufferSize": 1000,
      "walkLength": 80,
      "walksPerNode": 10,
      "windowSize": 10
    },
    "lossPerIteration": [
      8.362697137375363
    ],
    "mutateMillis": 2,
    "nodeCount": 9,
    "nodePropertiesWritten": 9,
    "postProcessingMillis": 0,
    "preProcessingMillis": 7
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 9,
    "nodeMillis": 189,
    "relationshipCount": 9,
    "relationshipMillis": 296,
    "totalMillis": 485
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 915,
    "exportMillis": 1714,
    "nodeLabel": "PERSONS",
    "nodeProperty": "node2vec",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS",
    "propertiesExported": 5,
    "stageUploadMillis": 583
  }
}

The returned result contains information about the job execution, including the project, compute and write steps. Additionally, the embedding for each of the Person nodes has been written back to the Snowflake database. We can query it like so:

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;
Table 6. Results

| NODEID | NODE2VEC |
| --- | --- |
| Alice | [-7.200873643159866e-02, -1.017554104328156e-01] |
| Bob | [-7.187437266111374e-02, 1.439279913902283e-01] |
| Carol | [-7.191916555166245e-02, 2.287001907825470e-01] |
| Dave | [-7.068923115730286e-02, 1.508352905511856e-01] |
| Eve | [-7.218790799379349e-02, 2.373333722352982e-01] |

The results of the algorithm are not very intuitively interpretable, as the node embedding format is a mathematical abstraction of the node within its neighborhood, designed for machine learning programs. What we can see is that the embeddings have two elements (as configured using embeddingDimension) and that the numbers are relatively small (they all fit in the range of [-1, 1]).

Due to the random nature of the algorithm, results will vary between runs. However, this does not necessarily mean that the pairwise distances between node embeddings vary as much.
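
One way to compare embeddings pairwise is cosine similarity. The following sketch computes it for the embeddings listed in Table 6, copied into a Python dictionary by hand. The helper `cosine_similarity` is a hypothetical name, and the choice of cosine similarity over another distance measure is an assumption made for illustration. Since the values come from a single run on a tiny graph with a two-dimensional embedding, the resulting numbers should not be read as meaningful.

```python
import math

# Embeddings copied from Table 6 (one particular run).
embeddings = {
    "Alice": [-7.200873643159866e-02, -1.017554104328156e-01],
    "Bob":   [-7.187437266111374e-02, 1.439279913902283e-01],
    "Carol": [-7.191916555166245e-02, 2.287001907825470e-01],
    "Dave":  [-7.068923115730286e-02, 1.508352905511856e-01],
    "Eve":   [-7.218790799379349e-02, 2.373333722352982e-01],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Print the similarity for every unordered pair of persons.
names = sorted(embeddings)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        similarity = cosine_similarity(embeddings[first], embeddings[second])
        print(first, second, round(similarity, 4))
```

Running the job several times and comparing these pairwise values gives a rough sense of how stable the relative positions of the nodes are, even when the raw coordinates change.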