Random walk with restarts sampling

This feature is in the alpha tier. For more information on feature tiers, see API Tiers.

Directed

Undirected

Heterogeneous nodes

Heterogeneous relationships

Weighted relationships

Random walk with restarts sampling is featured in the end-to-end example Jupyter notebooks.

Introduction

Sometimes it is useful to have a smaller but structurally representative sample of a given graph. For instance, such a sample could be used to train an inductive embedding algorithm (such as a graph neural network, like GraphSAGE). Training on the sample is faster than training on the entire graph, and the trained model can still be used to predict embeddings on the entire graph.

Random walk with restarts (RWR) samples the graph by taking random walks from a set of start nodes (see the startNodes parameter below). On each step of a random walk, there is some probability (see the restartProbability parameter below) that the walk stops and a new walk starts from one of the start nodes instead (i.e. the walk restarts). Each node visited on these walks becomes part of the sampled subgraph. The algorithm stops walking when the requested number of nodes has been visited (see the samplingRatio parameter below). The relationships of the sampled subgraph are those induced by the sampled nodes (i.e. the relationships of the original graph that connect nodes that have been sampled).

If at some point it’s very unlikely to visit new nodes by random walking from the current set of start nodes (possibly due to the original graph being disconnected), the algorithm will lazily expand the pool of start nodes one at a time by picking nodes uniformly at random from the original graph.
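The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the GDS implementation: the function name `rwr_sample`, the `patience` threshold that triggers start-node expansion, the choice to expand only with unsampled nodes, and the rounding of the target node count are all assumptions made for this sketch.

```python
import random


def rwr_sample(adjacency, start_nodes, sampling_ratio=0.15,
               restart_probability=0.1, seed=None, patience=1000):
    """Sample nodes by random walks with restarts.

    `adjacency` maps every node to the list of its outgoing neighbours
    (nodes without outgoing relationships map to an empty list).
    Returns the sampled node set and the induced relationships.
    """
    rng = random.Random(seed)
    all_nodes = sorted(adjacency)
    # Assumption: the target count is the ratio times the node count, rounded.
    target = max(1, round(sampling_ratio * len(all_nodes)))

    starts = list(start_nodes)
    current = rng.choice(starts)
    sampled = {current}
    fruitless_steps = 0

    while len(sampled) < target:
        neighbours = adjacency[current]
        # Restart with the configured probability, or when the walk is stuck.
        if not neighbours or rng.random() < restart_probability:
            current = rng.choice(starts)
        else:
            current = rng.choice(neighbours)

        if current in sampled:
            fruitless_steps += 1
        else:
            sampled.add(current)
            fruitless_steps = 0

        # Hypothetical expansion rule: after `patience` steps without a new
        # node, add one uniformly chosen unsampled node to the start pool.
        if fruitless_steps >= patience:
            current = rng.choice([n for n in all_nodes if n not in sampled])
            starts.append(current)
            sampled.add(current)
            fruitless_steps = 0

    # Induced relationships: both endpoints must have been sampled.
    induced = [(u, v) for u in sampled for v in adjacency[u] if v in sampled]
    return sampled, induced
```

Calling `rwr_sample(adjacency, ["a"], sampling_ratio=0.5, seed=7)` on a small adjacency map returns the visited nodes together with every original relationship whose two endpoints were both visited.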

Leskovec et al. showed in the paper "Sampling from Large Graphs" that RWR preserves the structural features of the original graph well. Additionally, RWR has been used successfully throughout the literature to sample batches for graph neural network (GNN) training.

Random walk with restarts is sometimes also referred to as rooted or personalized random walk.

Relationship weights

If the graph is weighted and relationshipWeightProperty is specified, the random walks are weighted. This means that the probability of walking along a relationship is the weight of that relationship divided by the sum of weights of outgoing relationships from the current node.
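As a small worked example of this rule, the following Python sketch normalizes the outgoing weights of a single node into step probabilities. The helper name `transition_probabilities` and the weights are hypothetical, and keying the input by target node is a simplification that ignores parallel relationships.

```python
def transition_probabilities(outgoing):
    """Map each target node to the probability of walking to it.

    `outgoing` maps target node -> weight of the relationship leading there,
    for the node the walk is currently at.
    """
    total = sum(outgoing.values())
    if total <= 0:
        raise ValueError("the sum of outgoing weights must be positive")
    # Probability of a relationship = its weight / sum of outgoing weights.
    return {target: weight / total for target, weight in outgoing.items()}


# Hypothetical weights on three outgoing relationships of one node.
probabilities = transition_probabilities(
    {"Bridget": 1.0, "Charles": 2.0, "Doug": 1.0}
)
print(probabilities)  # {'Bridget': 0.25, 'Charles': 0.5, 'Doug': 0.25}
```

With these weights the walk is twice as likely to step to "Charles" as to either of the other two neighbours.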

Node label stratification

In some cases it may be desirable for the sampled graph to preserve the distribution of node labels of the original graph. To enable such stratification, one can set nodeLabelStratification to true in the algorithm configuration. The stratified sampling is performed by only adding a node to the sampled graph if more nodes of that node’s particular set of labels are needed to uphold the node label distribution of the original graph.

By default, the algorithm treats all nodes in the same way regardless of their labels and makes no special effort to preserve the node label distribution of the original graph. Note that stratified sampling may be somewhat slower, since it restricts which nodes can be added to the sampled graph during the walks.
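One way to picture the stratified acceptance rule is as a quota per distinct label set. The Python sketch below is only an illustration under stated assumptions: the helper names `label_quotas` and `try_add`, the example labels, and the rounding up of quotas are not taken from the GDS implementation.

```python
import math
from collections import Counter


def label_quotas(node_labels, sampling_ratio):
    """Per label set, how many nodes the sample may hold."""
    counts = Counter(frozenset(labels) for labels in node_labels.values())
    # Assumption: quotas are rounded up; the real rounding rule may differ.
    return {key: math.ceil(sampling_ratio * n) for key, n in counts.items()}


def try_add(node, node_labels, quotas, sampled_counts, sampled):
    """Add `node` only if its label set is still under quota."""
    key = frozenset(node_labels[node])
    if node in sampled or sampled_counts[key] >= quotas[key]:
        return False
    sampled.add(node)
    sampled_counts[key] += 1
    return True


# Hypothetical graph: four User nodes and two Admin nodes.
labels = {"a": ["User"], "b": ["User"], "c": ["User"], "d": ["User"],
          "e": ["Admin"], "f": ["Admin"]}
quotas = label_quotas(labels, 0.5)  # 2 User nodes, 1 Admin node

sampled, counts = set(), Counter()
for visited in ["a", "b", "c", "e", "f"]:  # order in which a walk visits nodes
    try_add(visited, labels, quotas, counts, sampled)
print(sorted(sampled))  # ['a', 'b', 'e']
```

Nodes "c" and "f" are visited but skipped because their label sets have already reached their quota, which is also why a stratified run may need more walking to fill the sample.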

At this time there is no support for relationship type stratification.

Syntax

The following describes the API for running the algorithm:

CALL gds.graph.sample.rwr(
  graphName: String,
  fromGraphName: String,
  configuration: Map
)
YIELD
  graphName,
  fromGraphName,
  nodeCount,
  relationshipCount,
  startNodeCount,
  projectMillis

Table 1. Parameters
Name	Type	Description
graphName	String	The name of the new graph that is stored in the graph catalog.
fromGraphName	String	The name of the original graph in the graph catalog.
configuration	Map	Additional parameters to configure the subgraph sampling.

Table 2. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4` [1]	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted.
samplingRatio	Float	`0.15`	yes	The fraction of nodes in the original graph to be sampled.
restartProbability	Float	`0.1`	yes	The probability that a sampling random walk restarts from one of the start nodes.
startNodes	List of Integer	`A node chosen uniformly at random`	yes	IDs of the initial set of nodes of the original graph from which the sampling random walks will start.
nodeLabelStratification	Boolean	`false`	yes	If true, preserves the node label distribution of the original graph.
randomSeed	Integer	`n/a`	yes	A random seed which is used for all randomness in the computation. Requires `concurrency = 1`.
[1] In a GDS Session the default is the number of available processors.

Table 3. Results
Name	Type	Description
graphName	String	The name of the new graph that is stored in the graph catalog.
fromGraphName	String	The name of the original graph in the graph catalog.
nodeCount	Integer	Number of nodes in the subgraph.
relationshipCount	Integer	Number of relationships in the subgraph.
startNodeCount	Integer	Number of start nodes actually used by the algorithm.
projectMillis	Integer	Milliseconds for projecting the subgraph.

Examples

All the examples below should be run in an empty database.

The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.

In this section we will demonstrate the usage of the RWR sampling algorithm on a small toy graph.

Setting up

In this section we will show examples of running the Random walk with restarts sampling algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide to using the algorithm in a real setting. We will do this on a small social network graph of a handful of nodes connected in a particular pattern. The example graph looks like this:

The following Cypher statement will create the example graph in the Neo4j database:

CREATE
  (nAlice:User {name: 'Alice'}),
  (nBridget:User {name: 'Bridget'}),
  (nCharles:User {name: 'Charles'}),
  (nDoug:User {name: 'Doug'}),
  (nMark:User {name: 'Mark'}),
  (nMichael:User {name: 'Michael'}),

  (nAlice)-[:LINK]->(nBridget),
  (nAlice)-[:LINK]->(nCharles),
  (nCharles)-[:LINK]->(nBridget),

  (nAlice)-[:LINK]->(nDoug),

  (nMark)-[:LINK]->(nDoug),
  (nMark)-[:LINK]->(nMichael),
  (nMichael)-[:LINK]->(nMark);

This graph has two clusters of closely connected Users. A single relationship connects the two clusters.

The following statement will project the graph and store it in the graph catalog under the name "myGraph":

MATCH (n:User)-[r:LINK]->(m:User)
RETURN gds.graph.project('myGraph', n, m)

Sampling

We can now sample a subgraph from "myGraph" using RWR. Using the "Alice" User node as our only start node, we aim to visit four nodes in the graph for our sample. Since the graph has six nodes in total and 4/6 ≈ 0.66, we use 0.66 as our sampling ratio.

The following will run the Random walk with restarts sampling algorithm:

MATCH (start:User {name: 'Alice'})
CALL gds.graph.sample.rwr('mySample', 'myGraph', { samplingRatio: 0.66, startNodes: [start] })
YIELD nodeCount, relationshipCount
RETURN nodeCount, relationshipCount

Table 4. Results
nodeCount	relationshipCount
4	4

As we can see, we did indeed visit four nodes. Looking at the topology of our original graph, "myGraph", we can conclude that these must be the User nodes with the name properties "Alice", "Bridget", "Charles" and "Doug", since those are the only nodes a walk starting at "Alice" can reach by following relationships in their direction. The sampled relationships are those connecting these four nodes.
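The conclusion about which nodes were sampled can be checked outside the database with a short Python sketch. It assumes that walks follow relationship direction, as in the directed projection used here, and that the start-node pool did not need to be expanded. The helper name `reachable_from` is made up for this example.

```python
from collections import deque

# The LINK relationships of the example graph, as (source, target) pairs.
LINKS = [
    ("Alice", "Bridget"), ("Alice", "Charles"), ("Charles", "Bridget"),
    ("Alice", "Doug"),
    ("Mark", "Doug"), ("Mark", "Michael"), ("Michael", "Mark"),
]


def reachable_from(start, links):
    """Return every node reachable from `start` along directed relationships."""
    outgoing = {}
    for source, target in links:
        outgoing.setdefault(source, []).append(target)
    seen, queue = {start}, deque([start])
    while queue:
        for target in outgoing.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


nodes = reachable_from("Alice", LINKS)
# Induced relationships: both endpoints belong to the sampled node set.
induced = [(s, t) for s, t in LINKS if s in nodes and t in nodes]
print(sorted(nodes))  # ['Alice', 'Bridget', 'Charles', 'Doug']
print(len(induced))   # 4
```

Exactly four nodes are reachable from "Alice", and they induce four relationships, which matches the nodeCount and relationshipCount in the results table.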