Node2Vec

Node2Vec is a node embedding algorithm that computes a vector representation of each node based on random walks in the graph. The random walks sample the neighborhood of a node. Using a number of these neighborhood samples, the algorithm trains a neural network with a single hidden layer. The network is trained to predict the likelihood that a node occurs in a walk given the occurrence of another node.
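The training signal described above can be pictured as pairs of nodes that occur close together in the same walk. The following is a minimal Python sketch of that idea, not the product's implementation. The function name `context_pairs` and the symmetric window (`window` positions on each side of the center node) are assumptions made for illustration; the actual interpretation of the context window may differ.

```python
from typing import List, Tuple


def context_pairs(walk: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for nodes at most `window` positions apart.

    Assumption: the window is symmetric around the center node.
    """
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


# A hypothetical walk, used only to show the shape of the training pairs.
walk = ["Alice", "Guitar", "Bob", "Synthesizer"]
pairs = context_pairs(walk, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is a positive example: the network is asked to assign a high likelihood to the context node given the center node.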

For more information on this algorithm, see:

Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.
https://snap.stanford.edu/node2vec/

Random Walks

A central concept of the Node2Vec algorithm is that of second-order random walks. A random walk simulates a traversal of the graph in which the traversed relationships are chosen at random. In a classic random walk, each relationship has the same, possibly weighted, probability of being picked. This probability is not influenced by the previously visited nodes. A second-order random walk, however, models the transition probability based on the currently visited node v, the node t visited before the current one, and the node x which is the target of a candidate relationship. Node2Vec random walks are thus influenced by two parameters, the returnFactor and the inOutFactor:

The returnFactor is used if t equals x, i.e., the random walk returns to the previously visited node.
The inOutFactor is used if the distance from t to x is equal to 2, i.e., the walk moves further away from the node t.

The probabilities for traversing a relationship during a random walk can be further influenced by specifying a relationshipWeightProperty. A relationship property value greater than 1 will increase the likelihood of a relationship being traversed, while a property value between 0 and 1 will decrease that likelihood.
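The interplay of the two factors and the relationship weight can be sketched in a few lines of Python. This is an illustration only. It assumes the biases follow the original node2vec paper (a candidate's weight is multiplied by 1/returnFactor when x equals t, by 1 when x is a neighbor of t, and by 1/inOutFactor when x is at distance 2 from t), and it assumes an undirected graph. The function `transition_probabilities` and the toy graph are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical weighted, undirected toy graph: node -> {neighbor: weight}.
Graph = Dict[str, Dict[str, float]]


def transition_probabilities(
    graph: Graph,
    previous: Optional[str],
    current: str,
    return_factor: float = 1.0,
    in_out_factor: float = 1.0,
) -> Dict[str, float]:
    """Probability of stepping from `current` to each neighbor, given `previous`."""
    weights = {}
    for candidate, weight in graph[current].items():
        if previous is None:
            bias = 1.0  # first step of a walk: no previous node yet
        elif candidate == previous:
            bias = 1.0 / return_factor  # t equals x: return to the previous node
        elif candidate in graph[previous]:
            bias = 1.0  # x is a neighbor of t: distance 1
        else:
            bias = 1.0 / in_out_factor  # distance from t to x is 2
        weights[candidate] = weight * bias
    total = sum(weights.values())
    if total == 0:
        return {}  # all candidate weights are zero: no step possible
    return {node: w / total for node, w in weights.items()}


toy_graph: Graph = {
    "t": {"v": 1.0, "x1": 1.0},
    "v": {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}

unbiased = transition_probabilities(toy_graph, "t", "v")
biased = transition_probabilities(toy_graph, "t", "v", return_factor=2.0, in_out_factor=0.5)
print(unbiased)
print(biased)
```

With both factors at their default of 1.0 the walk reduces to a classic random walk. In the biased call, a returnFactor above 1.0 makes returning to t less likely and an inOutFactor below 1.0 makes moving away from t more likely.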

For every node in the graph, Node2Vec generates a series of random walks with that node as the start node. The number of random walks per node is controlled by the walksPerNode configuration parameter, and the walk length is controlled by the walkLength parameter.
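As a rough illustration of how these two parameters shape the training data, the sketch below generates walks over a small adjacency list. It is a simplification under stated assumptions: steps are chosen uniformly (which corresponds to an unweighted graph with returnFactor and inOutFactor both at 1.0), `walk_length` is treated as the maximum number of nodes in a walk, and a walk stops early at a node without neighbors. The function `generate_walks` and the toy graph are hypothetical.

```python
import random
from typing import Dict, List


def generate_walks(
    adjacency: Dict[str, List[str]],
    walks_per_node: int,
    walk_length: int,
    seed: int,
) -> List[List[str]]:
    """Start `walks_per_node` uniform random walks from every node."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            # Stop at the length limit or when the current node has no neighbors.
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks


# Hypothetical toy graph; node "d" has no relationships.
toy_adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
walks = generate_walks(toy_adjacency, walks_per_node=3, walk_length=5, seed=42)
print(len(walks), "walks generated")
```

The number of walks grows with the node count times walksPerNode, so both parameters directly affect the running time.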

Syntax

This section covers the syntax used to execute the Node2Vec algorithm.

Run Node2Vec.

CALL Neo4j_Graph_Analytics.graph.node2vec(
  'CPU_X64_XS',                    (1)
  {
    ['defaultTablePrefix': '...',] (2)
    'project': {...},              (3)
    'compute': {...},              (4)
    'write':   {...}               (5)
  }
);

1	Compute pool selector.
2	Optional prefix for table references.
3	Project config.
4	Compute config.
5	Write config.

Table 1. Parameters
Name	Type	Default	Optional	Description
computePoolSelector	String	`n/a`	no	The selector for the compute pool on which to run the Node2Vec job.
configuration	Map	`{}`	no	Configuration for graph project, algorithm compute and result write back.

The configuration map consists of the following three entries.

For more details on the Project configuration below, refer to the Project documentation.

Table 2. Project configuration
Name	Description
nodeTables	List of node tables.
relationshipTables	Map of relationship types to relationship tables.

Table 3. Compute configuration
Name	Type	Default	Optional	Description
walkLength	Integer	`80`	yes	The number of steps in a single random walk.
walksPerNode	Integer	`10`	yes	The number of random walks generated for each node.
inOutFactor	Float	`1.0`	yes	Tendency of the random walk to stay close to the start node or fan out in the graph. Higher value means stay local.
returnFactor	Float	`1.0`	yes	Tendency of the random walk to return to the last visited node. A value below 1.0 means a higher tendency.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights to influence the probabilities of the random walks. The weights need to be >= 0. If unspecified, the algorithm runs unweighted.
windowSize	Integer	`10`	yes	Size of the context window when training the neural network.
negativeSamplingRate	Integer	`5`	yes	Number of negative samples to produce for each positive sample.
positiveSamplingFactor	Float	`0.001`	yes	Factor for influencing the distribution for positive samples. A higher value increases the probability that frequent nodes are down-sampled.
negativeSamplingExponent	Float	`0.75`	yes	Exponent applied to the node frequency to obtain the negative sampling distribution. A value of 1.0 samples proportionally to the frequency. A value of 0.0 samples each node equally.
embeddingDimension	Integer	`128`	yes	Size of the computed node embeddings.
embeddingInitializer	String	`NORMALIZED`	yes	Method to initialize embeddings. Values are sampled uniformly from a range `[-a, a]`. With `NORMALIZED`, `a=0.5/embeddingDimension` and with `UNIFORM` instead `a=1`.
iterations	Integer	`1`	yes	Number of training iterations. Each additional iteration also samples more random walks, so the set of walks generally becomes more representative of the entire graph.
initialLearningRate	Float	`0.01`	yes	Learning rate used initially for training the neural network. The learning rate decreases after each training iteration.
minLearningRate	Float	`0.0001`	yes	Lower bound for learning rate as it is decreased during training.
randomSeed	Integer	`random`	yes	Seed value used to generate the random walks, which are used as the training set of the neural network. Note that the generated embeddings are still nondeterministic.
walkBufferSize	Integer	`1000`	yes	The number of random walks to complete before starting training.
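Two of the descriptions in the table above are small formulas, and a short Python sketch may make them easier to check. The helper names `initializer_bound` and `negative_sampling_distribution` are hypothetical, and the frequency counts are made up. The sketch only restates what the table says about embeddingInitializer and negativeSamplingExponent; it is not the product's implementation.

```python
from typing import Dict


def initializer_bound(embedding_dimension: int, initializer: str = "NORMALIZED") -> float:
    """Return `a` such that initial values are sampled uniformly from [-a, a]."""
    if initializer == "NORMALIZED":
        return 0.5 / embedding_dimension
    if initializer == "UNIFORM":
        return 1.0
    raise ValueError(f"unknown initializer: {initializer}")


def negative_sampling_distribution(
    frequencies: Dict[str, int], exponent: float = 0.75
) -> Dict[str, float]:
    """Raise each node frequency to `exponent` and normalize to probabilities."""
    scaled = {node: count ** exponent for node, count in frequencies.items()}
    total = sum(scaled.values())
    return {node: value / total for node, value in scaled.items()}


# Made-up node frequencies, for illustration only.
frequencies = {"a": 8, "b": 2}
print(initializer_bound(128))
print(negative_sampling_distribution(frequencies, exponent=1.0))
print(negative_sampling_distribution(frequencies, exponent=0.75))
print(negative_sampling_distribution(frequencies, exponent=0.0))
```

An exponent between 0.0 and 1.0 flattens the distribution: frequent nodes are still sampled more often as negatives, but less so than their raw frequency would suggest.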

For more details on the Write configuration below, refer to the Write documentation.

Table 4. Write configuration
Name	Type	Default	Optional	Description
nodeLabel	String	`n/a`	no	Node label in the in-memory graph from which to write a node property.
nodeProperty	String	`'node2vec'`	yes	The node property that will be written back to the Snowflake database.
outputTable	String	`n/a`	no	Table in Snowflake database to which node properties are written.

Example

In this section we will show examples of running the Node2Vec algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small knowledge graph of a handful of nodes, connected in a particular pattern. The example graph looks like this:

The following SQL statements will create the example graph tables in the Snowflake database:

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.PERSONS VALUES
  ('Alice'),
  ('Bob'),
  ('Carol'),
  ('Dave'),
  ('Eve');

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS (NODEID VARCHAR);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.INSTRUMENTS VALUES
  ('Guitar'),
  ('Synthesizer'),
  ('Bongos'),
  ('Trumpet');

CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.LIKES (SOURCENODEID VARCHAR, TARGETNODEID VARCHAR, WEIGHT FLOAT);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.LIKES VALUES
  ('Alice', 'Guitar',      1.0),
  ('Alice', 'Synthesizer', 1.0),
  ('Alice', 'Bongos',      0.5),
  ('Bob',   'Guitar',      1.0),
  ('Bob',   'Synthesizer', 1.0),
  ('Carol', 'Bongos',      1.0),
  ('Dave',  'Guitar',      1.0),
  ('Dave',  'Trumpet',     1.5),
  ('Dave',  'Bongos',      1.0);

This bipartite graph has two node sets, Person nodes and Instrument nodes. The two node sets are connected via LIKES relationships. Each relationship starts at a Person node and ends at an Instrument node.
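As a quick consistency check, the example tables can be mirrored as Python literals. The sketch below is for illustration only and does not touch the database. It confirms that every relationship runs from a person to an instrument and lists the persons that have no LIKES relationship at all.

```python
# Rows copied from the PERSONS, INSTRUMENTS and LIKES tables above.
persons = ["Alice", "Bob", "Carol", "Dave", "Eve"]
instruments = ["Guitar", "Synthesizer", "Bongos", "Trumpet"]
likes = [
    ("Alice", "Guitar", 1.0),
    ("Alice", "Synthesizer", 1.0),
    ("Alice", "Bongos", 0.5),
    ("Bob", "Guitar", 1.0),
    ("Bob", "Synthesizer", 1.0),
    ("Carol", "Bongos", 1.0),
    ("Dave", "Guitar", 1.0),
    ("Dave", "Trumpet", 1.5),
    ("Dave", "Bongos", 1.0),
]

# Bipartite: every relationship starts at a person and ends at an instrument.
is_bipartite = all(src in persons and tgt in instruments for src, tgt, _ in likes)

# Persons that do not appear as the source of any relationship.
sources = {src for src, _, _ in likes}
isolated_persons = [p for p in persons if p not in sources]

node_count = len(persons) + len(instruments)
relationship_count = len(likes)
print(node_count, relationship_count, is_bipartite, isolated_persons)
```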

Run job

Running the query requires a setup of grants for the application, your consumer role, and your environment. See the Getting started page for more on this.

We also assume that the application name is the default, Neo4j_Graph_Analytics. If you chose a different application name during installation, use that name instead.

The following will run a Node2Vec job:

CALL Neo4j_Graph_Analytics.graph.node2vec('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': [ 'PERSONS', 'INSTRUMENTS' ],
        'relationshipTables': {
            'LIKES': {
                'sourceTable': 'PERSONS',
                'targetTable': 'INSTRUMENTS'
            }
        }
    },
    'compute': {
        'embeddingDimension': 2
    },
    'write': [{
        'nodeLabel': 'PERSONS',
        'outputTable': 'PERSON_EMBEDDINGS'
    }]
});

Table 5. Results
JOB_ID	JOB_START	JOB_END	JOB_RESULT
job_9f036be61fe043dbbef168b9bae4da25	2025-07-17 08:43:17.050	2025-07-17 08:43:21.622	{ "node2vec_1": { "computeMillis": 33, "configuration": { "concurrency": 6, "embeddingDimension": 2, "embeddingInitializer": "NORMALIZED", "inOutFactor": 1, "initialLearningRate": 0.01, "iterations": 1, "jobId": "235ad57f-7555-44d7-85f0-7a78bf21d30d", "logProgress": true, "minLearningRate": 1.000000000000000e-04, "mutateProperty": "node2vec", "negativeSamplingExponent": 0.75, "negativeSamplingRate": 5, "nodeLabels": [ "*" ], "positiveSamplingFactor": 0.001, "relationshipTypes": [ "*" ], "returnFactor": 1, "sudo": false, "walkBufferSize": 1000, "walkLength": 80, "walksPerNode": 10, "windowSize": 10 }, "lossPerIteration": [ 8.362697137375363 ], "mutateMillis": 2, "nodeCount": 9, "nodePropertiesWritten": 9, "postProcessingMillis": 0, "preProcessingMillis": 7 }, "project_1": { "graphName": "snowgraph", "nodeCount": 9, "nodeMillis": 189, "relationshipCount": 9, "relationshipMillis": 296, "totalMillis": 485 }, "write_node_property_1": { "copyIntoTableMillis": 915, "exportMillis": 1714, "nodeLabel": "PERSONS", "nodeProperty": "node2vec", "outputTable": "EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS", "propertiesExported": 5, "stageUploadMillis": 583 } }

The returned result contains information about the job execution, such as the configuration used and the training loss per iteration. Additionally, the embedding for each of the nodes with the configured node label has been written back to the Snowflake database. We can query it like so:

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.PERSON_EMBEDDINGS;

Table 6. Results
NODEID	NODE2VEC
Alice	[-7.200873643159866e-02, -1.017554104328156e-01]
Bob	[-7.187437266111374e-02, 1.439279913902283e-01]
Carol	[-7.191916555166245e-02, 2.287001907825470e-01]
Dave	[-7.068923115730286e-02, 1.508352905511856e-01]
Eve	[-7.218790799379349e-02, 2.373333722352982e-01]

The results of the algorithm are not easy to interpret intuitively, as a node embedding is a mathematical abstraction of the node within its neighborhood, designed as input for machine learning programs. What we can see is that the embeddings have two elements (as configured using embeddingDimension) and that the numbers are relatively small (they all fit in the range of [-1, 1]).

Due to the random nature of the algorithm, results will vary between runs. However, this does not necessarily mean that the pairwise distances between node embeddings vary as much.
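One common way to compare embeddings pairwise is cosine similarity. The Python sketch below applies it to the values from Table 6. It is only an illustration: the choice of cosine similarity is an assumption, not something the algorithm prescribes, and the numbers come from a single run, so a rerun will produce different embeddings.

```python
import math
from typing import Sequence

# Embeddings copied from Table 6 (one particular run).
embeddings = {
    "Alice": [-7.200873643159866e-02, -1.017554104328156e-01],
    "Bob": [-7.187437266111374e-02, 1.439279913902283e-01],
    "Carol": [-7.191916555166245e-02, 2.287001907825470e-01],
    "Dave": [-7.068923115730286e-02, 1.508352905511856e-01],
    "Eve": [-7.218790799379349e-02, 2.373333722352982e-01],
}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Print the similarity of every unordered pair of persons.
names = list(embeddings)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        similarity = cosine_similarity(embeddings[first], embeddings[second])
        print(f"{first:5s} {second:5s} {similarity:+.4f}")
```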