Weakly Connected Components

Glossary

Directed: Directed trait. The algorithm is well-defined on a directed graph.
Directed: Directed trait. The algorithm ignores the direction of the graph.
Directed: Directed trait. The algorithm does not run on a directed graph.
Undirected: Undirected trait. The algorithm is well-defined on an undirected graph.
Undirected: Undirected trait. The algorithm ignores the undirectedness of the graph.
Heterogeneous nodes: Heterogeneous nodes fully supported. The algorithm has the ability to distinguish between nodes of different types.
Heterogeneous nodes: Heterogeneous nodes allowed. The algorithm treats all selected nodes similarly regardless of their label.
Heterogeneous relationships: Heterogeneous relationships fully supported. The algorithm has the ability to distinguish between relationships of different types.
Heterogeneous relationships: Heterogeneous relationships allowed. The algorithm treats all selected relationships similarly regardless of their type.
Weighted relationships: Weighted trait. The algorithm supports a relationship property to be used as weight, specified via the relationshipWeightProperty configuration parameter.
Weighted relationships: Weighted trait. The algorithm treats each relationship as equally important, discarding the value of any relationship weight.

Introduction

The Weakly Connected Components (WCC) algorithm finds sets of connected nodes in directed and undirected graphs. Two nodes are connected if there exists a path between them. The set of all nodes that are connected with each other forms a component. In contrast to Strongly Connected Components (SCC), the direction of relationships on the path between two nodes is not considered. For example, in a directed graph (a)→(b), a and b will be in the same component, even if there is no directed relationship (b)→(a).

WCC is often used early in an analysis to understand the structure of a graph. Using WCC to understand the graph structure enables running other algorithms independently on an identified cluster.

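The implementation of the algorithm is based on the following papers:

Wait-free Parallel Algorithms for the Union-Find Problem
Optimizing Parallel Graph Connectivity Computation via Subgraph Sampling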

Syntax

This section covers the syntax used to execute the Weakly Connected Components algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

WCC syntax per mode

Run WCC in stream mode on a named graph.

CALL gds.wcc.stream(
  graphName: String,
  configuration: Map
)
YIELD
  nodeId: Integer,
  componentId: Integer

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[1]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted.
seedProperty	String	`n/a`	yes	Used to set the initial component for a node. The property value needs to be a number.
threshold	Float	`null`	yes	The value of the weight above which the relationship is considered in the computation.
consecutiveIds	Boolean	`false`	yes	Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).
minComponentSize	Integer	`0`	yes	Only nodes inside components larger than or equal to the given value are returned.
1. In a GDS Session the default is the number of available processors

Table 3. Results
Name	Type	Description
nodeId	Integer	Node ID.
componentId	Integer	Component ID.

Run WCC in stats mode on a named graph.

CALL gds.wcc.stats(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 4. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 5. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[2]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted.
seedProperty	String	`n/a`	yes	Used to set the initial component for a node. The property value needs to be a number.
threshold	Float	`null`	yes	The value of the weight above which the relationship is considered in the computation.
consecutiveIds	Boolean	`false`	yes	Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).
2. In a GDS Session the default is the number of available processors

Table 6. Results
Name	Type	Description
componentCount	Integer	The number of computed components.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
postProcessingMillis	Integer	Milliseconds for computing component count and distribution statistics.
componentDistribution	Map	Map containing min, max, mean as well as p1, p5, p10, p25, p50, p75, p90, p95, p99 and p999 percentile values of component sizes.
configuration	Map	The configuration used for running the algorithm.

Run WCC in mutate mode on a named graph.

CALL gds.wcc.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 7. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 8. Configuration
Name	Type	Default	Optional	Description
mutateProperty	String	`n/a`	no	The node property in the GDS graph to which the component ID is written.
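nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.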
concurrency	Integer	`4`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted.
seedProperty	String	`n/a`	yes	Used to set the initial component for a node. The property value needs to be a number.
threshold	Float	`null`	yes	The value of the weight above which the relationship is considered in the computation.
consecutiveIds	Boolean	`false`	yes	Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).

Table 9. Results
Name	Type	Description
componentCount	Integer	The number of computed components.
nodePropertiesWritten	Integer	The number of node properties written.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
mutateMillis	Integer	Milliseconds for adding properties to the projected graph.
postProcessingMillis	Integer	Milliseconds for computing component count and distribution statistics.
componentDistribution	Map	Map containing min, max, mean as well as p1, p5, p10, p25, p50, p75, p90, p95, p99 and p999 percentile values of component sizes.
configuration	Map	The configuration used for running the algorithm.

Run WCC in write mode on a named graph.

CALL gds.wcc.write(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 10. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 11. Configuration
Name	Type	Default	Optional	Description
nodeLabels	List of String	`['*']`	yes	Filter the named graph using the given node labels. Nodes with any of the given labels will be included.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types. Relationships with any of the given types will be included.
concurrency	Integer	`4 ^[3]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
writeConcurrency	Integer	`value of 'concurrency'`	yes	The number of concurrent threads used for writing the result to Neo4j.
writeProperty	String	`n/a`	no	The node property in the Neo4j database to which the component ID is written.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted.
seedProperty	String	`n/a`	yes	Used to set the initial component for a node. The property value needs to be a number.
threshold	Float	`null`	yes	The value of the weight above which the relationship is considered in the computation.
consecutiveIds	Boolean	`false`	yes	Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).
minComponentSize	Integer	`0`	yes	Only nodes inside components larger than or equal to the given value will be written to the underlying Neo4j database.
3. In a GDS Session the default is the number of available processors

Table 12. Results
Name	Type	Description
componentCount	Integer	The number of computed components.
nodePropertiesWritten	Integer	The number of node properties written.
preProcessingMillis	Integer	Milliseconds for preprocessing the data.
computeMillis	Integer	Milliseconds for running the algorithm.
writeMillis	Integer	Milliseconds for writing result back to Neo4j.
postProcessingMillis	Integer	Milliseconds for computing component count and distribution statistics.
componentDistribution	Map	Map containing min, max, mean as well as p1, p5, p10, p25, p50, p75, p90, p95, p99 and p999 percentile values of component sizes.
configuration	Map	The configuration used for running the algorithm.

Examples

All the examples below should be run in an empty database.

The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.

In this section we will show examples of running the Weakly Connected Components algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small user network graph of a handful of nodes connected in a particular pattern. The example graph looks like this:

The following Cypher statement will create the example graph in the Neo4j database:

CREATE
  (nAlice:User {name: 'Alice'}),
  (nBridget:User {name: 'Bridget'}),
  (nCharles:User {name: 'Charles'}),
  (nDoug:User {name: 'Doug'}),
  (nMark:User {name: 'Mark'}),
  (nMichael:User {name: 'Michael'}),

  (nAlice)-[:LINK {weight: 0.5}]->(nBridget),
  (nAlice)-[:LINK {weight: 4}]->(nCharles),
  (nMark)-[:LINK {weight: 1.1}]->(nDoug),
  (nMark)-[:LINK {weight: 2}]->(nMichael);

This graph has two connected components, each with three nodes. The relationships that connect the nodes in each component have a property weight which determines the strength of the relationship.

The following statement will project a graph using a Cypher projection and store it in the graph catalog under the name 'myGraph'.

MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
  'myGraph',
  source,
  target,
  { relationshipProperties: r { .weight } }
)

In the following examples we will demonstrate using the Weakly Connected Components algorithm on this graph.

Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the write mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Automatic estimation and execution blocking.

For more details on estimate in general, see Memory Estimation.

The following will estimate the memory requirements for running the algorithm in write mode:

CALL gds.wcc.write.estimate('myGraph', { writeProperty: 'component' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory

Table 13. Results
nodeCount	relationshipCount	bytesMin	bytesMax	requiredMemory
6	4	112	112	"112 Bytes"
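Since estimation is available for every execution mode, the same check can be sketched for stream mode. This is a minimal example that assumes the 'myGraph' projection from above; the empty map stands in for the algorithm configuration, and no expected numbers are given because they depend on the mode and GDS version.

```cypher
// Estimate memory for stream mode instead of write mode.
// The second argument is the (here empty) algorithm configuration.
CALL gds.wcc.stream.estimate('myGraph', {})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
```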

Stream

In the stream execution mode, the algorithm returns the component ID for each node. This allows us to inspect the results directly or post-process them in Cypher without any side effects. For example, we can order the results to see the nodes that belong to the same component displayed next to each other.

For more details on the stream mode in general, see Stream.

The following will run the algorithm and stream results:

CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name

Table 14. Results
name	componentId
"Alice"	0
"Bridget"	0
"Charles"	0
"Doug"	3
"Mark"	3
"Michael"	3

The result shows that the algorithm identifies two components. This can be verified in the example graph.
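As one possible way to post-process the streamed rows in Cypher, the sketch below groups node names by component and reports each component's size. The column aliases members and size are chosen here for illustration only.

```cypher
// Collect the members of each component and count them.
CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN componentId,
       collect(gds.util.asNode(nodeId).name) AS members,
       count(*) AS size
ORDER BY size DESC, componentId
```

On the example graph this should return two rows of three members each, although the component IDs and the order of names inside each list are not guaranteed.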

The default behaviour of the algorithm is to run unweighted, i.e. without using relationship weights. The weighted option is demonstrated in the Weighted section.

The actual component IDs may differ because the order of nodes projected in the in-memory graph is not guaranteed. In this case it is equally plausible to get the inverse solution, for instance with the nodes of component 0 assigned to component 3 and vice versa.
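If gaps between component IDs are inconvenient, the consecutiveIds option from the configuration table can be tried. The sketch below assumes the option behaves as described there, mapping the IDs into a consecutive ID space at the cost of additional memory; the exact IDs returned are not asserted here.

```cypher
// Same stream query, but ask for component IDs in a consecutive ID space.
CALL gds.wcc.stream('myGraph', { consecutiveIds: true })
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name
```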

Stats

In the stats execution mode, the algorithm returns a single row containing a summary of the algorithm result. This execution mode does not have any side effects. It can be useful for evaluating algorithm performance by inspecting the computeMillis return item. In the examples below we will omit returning the timings. The full signature of the procedure can be found in the syntax section.

For more details on the stats mode in general, see Stats.

The following will run the algorithm in stats mode:

CALL gds.wcc.stats('myGraph')
YIELD componentCount

Table 15. Results
componentCount
2

The result shows that myGraph has two components and this can be verified by looking at the example graph.
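The stats mode also returns componentDistribution, which the syntax section describes as a map containing min, max, mean and percentile values of component sizes. A minimal sketch of reading a few of those keys follows; the aliases minSize, maxSize and meanSize are illustrative.

```cypher
// Inspect the size distribution of the computed components.
CALL gds.wcc.stats('myGraph')
YIELD componentCount, componentDistribution
RETURN componentCount,
       componentDistribution.min AS minSize,
       componentDistribution.max AS maxSize,
       componentDistribution.mean AS meanSize
```

For the example graph, with two components of three nodes each, all three size values are expected to be 3.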

Mutate

The mutate execution mode extends the stats mode with an important side effect: updating the named graph with a new node property containing the component ID for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row, similar to stats, but with some additional metrics. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Mutate.

The following will run the algorithm in mutate mode:

CALL gds.wcc.mutate('myGraph', { mutateProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;

Table 16. Results
nodePropertiesWritten	componentCount
6	2
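To check what was added to the in-memory graph, the mutated property can be streamed back from the graph catalog. This sketch assumes a GDS version that provides gds.graph.nodeProperty.stream and that the mutate call above has already been run.

```cypher
// Read the 'componentId' property back from the projected graph 'myGraph'.
CALL gds.graph.nodeProperty.stream('myGraph', 'componentId')
YIELD nodeId, propertyValue
RETURN gds.util.asNode(nodeId).name AS name, propertyValue AS componentId
ORDER BY componentId, name
```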

Write

The write execution mode extends the stats mode with an important side effect: writing the component ID for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row, similar to stats, but with some additional metrics. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Write.

The following will run the algorithm in write mode:

CALL gds.wcc.write('myGraph', { writeProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;

Table 17. Results
nodePropertiesWritten	componentCount
6	2

As we can see from the results, the nodes connected to one another are calculated by the algorithm as belonging to the same connected component.
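Because write mode persists the result, the property can be read back with plain Cypher. The following sketch assumes the write call above has completed with writeProperty set to 'componentId'.

```cypher
// Read the persisted component ID directly from the database.
MATCH (u:User)
RETURN u.name AS name, u.componentId AS componentId
ORDER BY componentId, name
```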

Weighted

By configuring the algorithm to use a weight we can increase granularity in the way the algorithm calculates component assignment. We do this by specifying the property key with the relationshipWeightProperty configuration parameter. Additionally, we can specify a threshold for the weight value. Then, only weights greater than the threshold value will be considered by the algorithm. We do this by specifying the threshold value with the threshold configuration parameter.

If a relationship does not have the specified weight property, the algorithm falls back to using a default value of zero.

The following will run the algorithm using relationship weight and stream results:

CALL gds.wcc.stream('myGraph', {
  relationshipWeightProperty: 'weight',
  threshold: 1.0
}) YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS Name, componentId AS ComponentId
ORDER BY ComponentId, Name

Table 18. Results
Name	ComponentId
"Alice"	0
"Charles"	0
"Bridget"	1
"Doug"	3
"Mark"	3
"Michael"	3

As we can see from the results, the node named 'Bridget' is now in its own component, due to its relationship weight being less than the configured threshold and thus ignored.

We are using stream mode to illustrate running the algorithm as weighted or unweighted; all the other algorithm modes also support this configuration parameter.
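The weighted run can also be combined with the minComponentSize option listed in the stream configuration table. The sketch below assumes the option filters out nodes in components smaller than the given size; the value 2 is chosen for illustration.

```cypher
// Weighted WCC, returning only nodes in components with at least 2 members.
CALL gds.wcc.stream('myGraph', {
  relationshipWeightProperty: 'weight',
  threshold: 1.0,
  minComponentSize: 2
})
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name
```

Under that assumption, 'Bridget', who forms a single-node component at this threshold, should no longer appear in the result.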

Seeded components

It is possible to define preliminary component IDs for nodes using the seedProperty configuration parameter. This is helpful if we want to retain components from a previous run and it is known that no components have been split by removing relationships. The property value needs to be a number.

The algorithm first checks if there is a seeded component ID assigned to the node. If there is one, that component ID is used. Otherwise, a new unique component ID is assigned to the node.

Once every node belongs to a component, the algorithm merges components of connected nodes. When components are merged, the resulting component is always the one with the lower component ID. Note that the consecutiveIds configuration option cannot be used in combination with seeding in order to retain the seeding values.

The algorithm assumes that nodes with the same seed value do in fact belong to the same component. If any two nodes in different components have the same seed, the behavior is undefined. In that case, it is recommended to run WCC without seeds.

To demonstrate this in practice, we will go through a few steps:

1. We will run the algorithm and write the results to Neo4j.
2. Then we will add another node to our graph. This node will not have the property computed in Step 1.
3. We will project a new graph that has the result from Step 1 as a node property.
4. Finally, we will run the algorithm again, this time in stream mode, using the seedProperty configuration parameter.

We will use the weighted variant of WCC.

Step 1

The following will run the algorithm in write mode:

CALL gds.wcc.write('myGraph', {
  writeProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
})
YIELD nodePropertiesWritten, componentCount;

Table 19. Results
nodePropertiesWritten	componentCount
6	3

Step 2

After the algorithm has finished writing to Neo4j we want to create a new node in the database.

The following will create a new node in the Neo4j graph, with no component ID:

MATCH (b:User {name: 'Bridget'})
CREATE (b)-[:LINK {weight: 2.0}]->(new:User {name: 'Mats'})
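Before projecting again, it can be useful to confirm which nodes are still missing a seed value. A minimal sketch, assuming Step 1 wrote 'componentId' to all six original nodes:

```cypher
// List users that have no component ID yet.
MATCH (u:User)
WHERE u.componentId IS NULL
RETURN u.name AS name
```

If Steps 1 and 2 ran as shown, only 'Mats' should be returned.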

Step 3

Note that we cannot use our already projected graph, as it does not contain the component ID. We will therefore project a second graph that contains the previously computed component ID.

The following will project a new graph containing the previously computed component id:

MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
  'myGraph-seeded',
  source,
  target,
  {
    sourceNodeProperties: source { .componentId },
    targetNodeProperties: target { .componentId },
    relationshipProperties: r { .weight }
  }
)
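As a quick sanity check on the new projection, the graph catalog can be queried for its size. This sketch uses gds.graph.list, filtered to the graph name from the statement above.

```cypher
// Confirm that the new node and relationship are part of the projection.
CALL gds.graph.list('myGraph-seeded')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
```

If Steps 1 and 2 ran as shown, this should report 7 nodes and 5 relationships.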

Step 4

The following will run the algorithm in stream mode using seedProperty:

CALL gds.wcc.stream('myGraph-seeded', {
  seedProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
}) YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name

Table 20. Results
name	componentId
"Alice"	0
"Charles"	0
"Bridget"	1
"Mats"	1
"Doug"	3
"Mark"	3
"Michael"	3

The result shows that despite not having the seedProperty when it was projected, the node 'Mats' has been assigned to the same component as the node 'Bridget'. This is correct because these two nodes are connected.

Writing Seeded components

In the previous section we demonstrated the seedProperty usage in stream mode. It is also available in the other modes of the algorithm. Below is an example of how to use seedProperty in write mode. Note that the example below relies on Steps 1 to 3 from the previous section.

The following will run the algorithm in write mode using seedProperty:

CALL gds.wcc.write('myGraph-seeded', {
  seedProperty: 'componentId',
  writeProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
})
YIELD nodePropertiesWritten, componentCount;

Table 21. Results
nodePropertiesWritten	componentCount
1	3

If the seedProperty configuration parameter has the same value as writeProperty, the algorithm only writes properties for nodes where the component ID has changed. If they differ, the algorithm writes properties for all nodes.
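To see the effect of the seeded write in the database, the persisted IDs can be grouped per component. A minimal sketch, assuming the seeded write above has completed; the alias members is illustrative.

```cypher
// Group users by their persisted component ID after the seeded write.
MATCH (u:User)
RETURN u.componentId AS componentId, collect(u.name) AS members
ORDER BY componentId
```

'Mats' is expected to share a component ID with 'Bridget', while the other nodes keep the IDs written in Step 1.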

Graph Sampling optimization

The WCC implementation provides two compute strategies:

The unsampled strategy as described in Wait-free Parallel Algorithms for the Union-Find Problem.
The sampled strategy as described in Optimizing Parallel Graph Connectivity Computation via Subgraph Sampling.

While both strategies provide very good performance, the sampled strategy is usually the faster one. Which strategy is used depends on the input graph. If the relationships of the graph are …

… undirected, the algorithm picks the sampled strategy.
… directed, the algorithm picks the unsampled strategy.
… directed and inverse indexed, the algorithm picks the sampled strategy.

The direction of a relationship is defined by the orientation which can be set during a graph projection. While NATURAL and REVERSE orientation result in a directed graph, the UNDIRECTED orientation leads to undirected relationships. In order to create a directed graph with inverse indexed relationships, one can use the indexInverse parameter as part of the relationship projection. An inverse index allows the algorithm to traverse the relationships of a node according to the opposite orientation. If the graph is projected using a NATURAL orientation, the inverse index represents the REVERSE orientation and vice versa.

The following statement will project the above example graph using a Cypher projection with inverse index and store it in the graph catalog under the name myIndexedGraph.

MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
  'myIndexedGraph',
  source,
  target,
  {},
  { inverseIndexedRelationshipTypes: ['*'] }
)

The following query is identical to the stream example in the previous section. This time, we execute WCC on myIndexedGraph which will allow the algorithm to use the sampled strategy.

The following will run the algorithm with sampled strategy and stream results:

CALL gds.wcc.stream('myIndexedGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name

Table 22. Results
name	componentId
"Alice"	0
"Bridget"	0
"Charles"	0
"Doug"	3
"Mark"	3
"Michael"	3

The actual component IDs may differ due to the randomness in the graph sampling optimization. In this case it is equally plausible to get the inverse solution, for instance with the nodes of component 0 assigned to component 3 and vice versa.
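The undirected case mentioned above can be sketched in the same way. The example assumes a GDS version whose Cypher projection accepts the undirectedRelationshipTypes configuration key; the graph name 'myUndirectedGraph' is hypothetical.

```cypher
// Project the example graph with all relationship types treated as undirected.
MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
  'myUndirectedGraph',
  source,
  target,
  {},
  { undirectedRelationshipTypes: ['*'] }
)
```

Running gds.wcc.stream on such a graph should, per the strategy rules above, let the algorithm pick the sampled strategy while producing the same two components.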