Common Neighbour Aware Random Walk sampling

This feature is in the beta tier. For more information on feature tiers, see API Tiers.

Directed

Undirected

Heterogeneous nodes

Heterogeneous relationships

Weighted relationships

Glossary

Directed: Directed trait. The algorithm is well-defined on a directed graph.
Directed: Directed trait. The algorithm ignores the direction of the graph.
Directed: Directed trait. The algorithm does not run on a directed graph.
Undirected: Undirected trait. The algorithm is well-defined on an undirected graph.
Undirected: Undirected trait. The algorithm ignores the undirectedness of the graph.
Heterogeneous nodes: Heterogeneous nodes fully supported. The algorithm has the ability to distinguish between nodes of different types.
Heterogeneous nodes: Heterogeneous nodes allowed. The algorithm treats all selected nodes similarly regardless of their label.
Heterogeneous relationships: Heterogeneous relationships fully supported. The algorithm has the ability to distinguish between relationships of different types.
Heterogeneous relationships: Heterogeneous relationships allowed. The algorithm treats all selected relationships similarly regardless of their type.
Weighted relationships: Weighted trait. The algorithm supports a relationship property to be used as weight, specified via the relationshipWeightProperty configuration parameter.
Weighted relationships: Weighted trait. The algorithm treats each relationship as equally important, discarding the value of any relationship weight.

Introduction

Graph sampling algorithms can be used to reduce the size of large and complex graphs while preserving structural properties. This can help to reduce bias, and ensure privacy, and making graph analysis more scalable. Sampling algorithms are widely used in machine learning, social network analysis, and many other applications.

The Common Neighbour Aware Random Walk (CNARW) is a graph sampling technique that involves optimizing the selection of the next-hop node. It takes into account the number of common neighbours between the current node and the next-hop candidates.

According to the paper, a major reason why simple random walks tend to converge slowly is due to the high clustering feature that is typical for some kinds of graphs e.g. for online social networks (OSNs). When moving to neighbours uniformly at random, it is easy to get caught in local loops and revisit previously visited nodes, which slows down convergence.

To address this issue, one solution is to prioritize nodes that offer a higher likelihood of exploring unvisited nodes in each step. Nodes with higher degrees may provide this opportunity, but they may also have more common neighbours with previously visited nodes, increasing the likelihood of revisits.

Therefore, choosing a node with a higher degree and fewer common neighbours with previously visited nodes (or the current node) not only increases the chances of discovering unvisited nodes but also reduces the probability of revisiting previously visited nodes in future steps.

The implementation of the algorithm is based on the following paper:

Common Neighbors Matter: Fast Random Walk Sampling with Common Neighbor Awareness

Relationship weights

Same as in the relationshipWeightProperty parameter in RandomWalksWithRestarts algorithm.

Node label stratification

Same as in the nodeLabelStratification parameter in RandomWalksWithRestarts algorithm.

Syntax

The following describes the API for running the algorithm

CALL gds.graph.sample.cnarw(
  graphName: String,
  fromGraphName: String,
  configuration: Map
)
YIELD
  graphName,
  fromGraphName,
  nodeCount,
  relationshipCount,
  startNodeCount,
  projectMillis

Table 1. Parameters
Name	Type	Description
graphName	String	The name of the new graph that is stored in the graph catalog.
fromGraphName	String	The name of the original graph in the graph catalog.
configuration	Map	Additional parameters to configure the subgraph sampling.

Table 2. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[1]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted.
samplingRatio	Float	`0.15`	yes	The fraction of nodes in the original graph to be sampled.
restartProbability	Float	`0.1`	yes	The probability that a sampling random walk restarts from one of the start nodes.
startNodes	List of Integer or Node	`A node chosen uniformly at random`	yes	IDs of the initial set of nodes of the original graph from which the sampling random walks will start.
nodeLabelStratification	Boolean	`false`	yes	If true, preserves the node label distribution of the original graph.
randomSeed	Integer	`n/a`	yes	The seed value to control the randomness of the algorithm. Note that `concurrency` must be set to 1 when setting this parameter.
1. In a GDS Session the default is the number of available processors

Table 3. Results
Name	Type	Description
graphName	String	The name of the new graph that is stored in the graph catalog.
fromGraphName	String	The name of the original graph in the graph catalog.
nodeCount	Integer	Number of nodes in the subgraph.
relationshipCount	Integer	Number of relationships in the subgraph.
startNodeCount	Integer	Number of start nodes actually used by the algorithm.
projectMillis	Integer	Milliseconds for projecting the subgraph.

Examples

All the examples below should be run in an empty database.

The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.

In this section we will demonstrate the usage of the CNARW sampling algorithm on a small toy graph.

Setting up

In this section we will show examples of running the Common Neighbour Aware Random Walk graph sampling algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small social network graph of a handful nodes connected in a particular pattern. The example graph looks like this:

The following Cypher statement will create the example graph in the Neo4j database:

CREATE
    (J:Female {name:'Juliette'}),
    (R:Male {name:'Romeo'}),
    (r1:Male {name:'Ryan'}),
    (r2:Male {name:'Robert'}),
    (r3:Male {name:'Riley'}),
    (r4:Female {name:'Ruby'}),
    (j1:Female {name:'Josie'}),
    (j2:Male {name:'Joseph'}),
    (j3:Female {name:'Jasmine'}),
    (j4:Female {name:'June'}),
    (J)-[:LINK]->(R),
    (R)-[:LINK]->(J),
    (r1)-[:LINK]->(R),   (R)-[:LINK]->(r1),
    (r2)-[:LINK]->(R),   (R)-[:LINK]->(r2),
    (r3)-[:LINK]->(R),   (R)-[:LINK]->(r3),
    (r4)-[:LINK]->(R),   (R)-[:LINK]->(r4),
    (r1)-[:LINK]->(r2),  (r2)-[:LINK]->(r1),
    (r1)-[:LINK]->(r3),  (r3)-[:LINK]->(r1),
    (r1)-[:LINK]->(r4),  (r4)-[:LINK]->(r1),
    (r2)-[:LINK]->(r3),  (r3)-[:LINK]->(r2),
    (r2)-[:LINK]->(r4),  (r4)-[:LINK]->(r2),
    (r3)-[:LINK]->(r4),  (r4)-[:LINK]->(r3),
    (j1)-[:LINK]->(J),   (J)-[:LINK]->(j1),
    (j2)-[:LINK]->(J),   (J)-[:LINK]->(j2),
    (j3)-[:LINK]->(J),   (J)-[:LINK]->(j3),
    (j4)-[:LINK]->(J),   (J)-[:LINK]->(j4),
    (j1)-[:LINK]->(j2),  (j2)-[:LINK]->(j1),
    (j1)-[:LINK]->(j3),  (j3)-[:LINK]->(j1),
    (j1)-[:LINK]->(j4),  (j4)-[:LINK]->(j1),
    (j2)-[:LINK]->(j3),  (j3)-[:LINK]->(j2),
    (j2)-[:LINK]->(j4),  (j4)-[:LINK]->(j2),
    (j3)-[:LINK]->(j4),  (j4)-[:LINK]->(j3) ;

This graph has two clusters of Users, that are closely connected. Between those clusters there is bidirectional relationship.

We can now project the graph and store it in the graph catalog.

The following statement will project the graph and store it in the graph catalog.

MATCH (n:Male|Female)-[r:LINK]->(m:Male|Female)
RETURN gds.graph.project('graph', n, m,
  {
    sourceNodeLabels: labels(n),
    targetNodeLabels: labels(m),
    relationshipType: type(r)
  }
)

Sampling

We can now go on to sample a subgraph from "myGraph" using CNARW. Using the "Juliette" node as our set of start nodes, we will venture to visit five nodes in the graph for our sample. Since we have six nodes total in our graph, and 5/10 = 0.5 we will use this as our sampling ratio.

The following will run the Common Neighbour Aware random walk sampling algorithm:

MATCH (start:Female {name: 'Juliette'})
CALL gds.graph.sample.cnarw('sampledGraph', 'graph',
{
    samplingRatio: 0.5,
    startNodes: [start]
})
YIELD nodeCount
RETURN nodeCount;

Table 4. Results
nodeCount
5

Due to the random nature of the sampling the algorithm may return different counts in different runs.

The main difference between the Common Neighbour Aware Random Walk and Random Walks with Restarts graphs sampling algorithms is that there are more chances to go into another cluster for the first one, which is colored in blue in the example above. The relationships sampled are those connecting these nodes.

Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. Estimating the sampling procedure is useful to understand the memory impact that running the procedure on your graph will have. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the sampling algorithm:

CALL gds.graph.sample.cnarw.estimate('graph',
{
    samplingRatio: 0.5
})
YIELD requiredMemory
RETURN requiredMemory;

Table 5. Results
requiredMemory
"1272 Bytes"