Weakly Connected Components
Glossary
- Directed: Directed trait. The algorithm is well-defined on a directed graph.
- Directed: Directed trait. The algorithm ignores the direction of the graph.
- Directed: Directed trait. The algorithm does not run on a directed graph.
- Undirected: Undirected trait. The algorithm is well-defined on an undirected graph.
- Undirected: Undirected trait. The algorithm ignores the undirectedness of the graph.
- Heterogeneous nodes: Heterogeneous nodes fully supported. The algorithm has the ability to distinguish between nodes of different types.
- Heterogeneous nodes: Heterogeneous nodes allowed. The algorithm treats all selected nodes similarly regardless of their label.
- Heterogeneous relationships: Heterogeneous relationships fully supported. The algorithm has the ability to distinguish between relationships of different types.
- Heterogeneous relationships: Heterogeneous relationships allowed. The algorithm treats all selected relationships similarly regardless of their type.
- Weighted relationships: Weighted trait. The algorithm supports a relationship property to be used as weight, specified via the relationshipWeightProperty configuration parameter.
- Weighted relationships: Weighted trait. The algorithm treats each relationship as equally important, discarding the value of any relationship weight.
Introduction
The Weakly Connected Components (WCC) algorithm finds sets of connected nodes in directed and undirected graphs.
Two nodes are connected if there exists a path between them.
The set of all nodes that are connected with each other forms a component.
In contrast to Strongly Connected Components (SCC), the direction of relationships on the path between two nodes is not considered.
For example, in a directed graph (a)→(b), a and b will be in the same component, even if there is no directed relationship (b)→(a).
WCC is often used early in an analysis to understand the structure of a graph. Using WCC to understand the graph structure enables running other algorithms independently on an identified cluster.
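The definition above can be illustrated with a minimal union-find sketch in Python. This is not the GDS implementation, only an illustration of the idea that relationship direction is ignored; the function name `weakly_connected_components` and the node names are assumptions made for this example.

```python
def weakly_connected_components(nodes, relationships):
    """Map each node to a representative node of its component (illustrative sketch)."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in relationships:
        # Direction is ignored: (a)->(b) joins a and b just like (b)->(a) would.
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_target] = root_source

    return {node: find(node) for node in nodes}


# A directed graph with a single relationship (a)->(b) and an isolated node c.
components = weakly_connected_components(["a", "b", "c"], [("a", "b")])
print(components["a"] == components["b"])  # True
print(components["a"] == components["c"])  # False
```

Here a and b share a component even though no relationship (b)→(a) exists, while c forms a component of its own.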
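The implementation of the algorithm is based on the following papers:
- Wait-free Parallel Algorithms for the Union-Find Problem
- Optimizing Parallel Graph Connectivity Computation via Subgraph Sampling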
Syntax
This section covers the syntax used to execute the Weakly Connected Components algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.
CALL gds.wcc.stream(
  graphName: String,
  configuration: Map
)
YIELD
  nodeId: Integer,
  componentId: Integer
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
logProgress | Boolean | | yes | If disabled the progress percentage will not be logged. |
relationshipWeightProperty | String | | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
seedProperty | String | | yes | Used to set the initial component for a node. The property value needs to be a number. |
threshold | Float | | yes | The value of the weight above which the relationship is considered in the computation. |
consecutiveIds | Boolean | | yes | Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory). |
minComponentSize | Integer | | yes | Only nodes inside components larger than or equal to the given value are returned. |
Name | Type | Description |
---|---|---|
nodeId | Integer | Node ID. |
componentId | Integer | Component ID. |
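The effect of the consecutiveIds and minComponentSize options on streamed rows can be sketched in Python as plain post-processing. This is an assumption-laden illustration, not how GDS implements either option: the helper names `to_consecutive_ids` and `filter_min_component_size` are hypothetical, and the order in which GDS hands out consecutive identifiers is not specified here.

```python
from collections import Counter


def to_consecutive_ids(component_ids):
    """Remap arbitrary component IDs to 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    remapped = []
    for component_id in component_ids:
        if component_id not in mapping:
            mapping[component_id] = len(mapping)
        remapped.append(mapping[component_id])
    return remapped


def filter_min_component_size(rows, min_component_size):
    """Keep only (node, componentId) rows whose component has at least the given size."""
    sizes = Counter(component_id for _, component_id in rows)
    return [(node, cid) for node, cid in rows if sizes[cid] >= min_component_size]


print(to_consecutive_ids([0, 0, 0, 3, 3, 3]))  # [0, 0, 0, 1, 1, 1]

# Hypothetical rows: one component of two nodes and one singleton component.
rows = [("n1", 0), ("n2", 0), ("n3", 1)]
print(filter_min_component_size(rows, 2))  # [('n1', 0), ('n2', 0)]
```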
CALL gds.wcc.stats(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
logProgress | Boolean | | yes | If disabled the progress percentage will not be logged. |
relationshipWeightProperty | String | | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
seedProperty | String | | yes | Used to set the initial component for a node. The property value needs to be a number. |
threshold | Float | | yes | The value of the weight above which the relationship is considered in the computation. |
consecutiveIds | Boolean | | yes | Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory). |
Name | Type | Description |
---|---|---|
componentCount | Integer | The number of computed components. |
preProcessingMillis | Integer | Milliseconds for preprocessing the data. |
computeMillis | Integer | Milliseconds for running the algorithm. |
postProcessingMillis | Integer | Milliseconds for computing component count and distribution statistics. |
componentDistribution | Map | Map containing min, max, mean as well as p1, p5, p10, p25, p50, p75, p90, p95, p99 and p999 percentile values of component sizes. |
configuration | Map | The configuration used for running the algorithm. |
CALL gds.wcc.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
mutateProperty | String | | no | The node property in the GDS graph to which the component ID is written. |
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
relationshipWeightProperty | String | | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
seedProperty | String | | yes | Used to set the initial component for a node. The property value needs to be a number. |
threshold | Float | | yes | The value of the weight above which the relationship is considered in the computation. |
consecutiveIds | Boolean | | yes | Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory). |
Name | Type | Description |
---|---|---|
componentCount | Integer | The number of computed components. |
nodePropertiesWritten | Integer | The number of node properties written. |
preProcessingMillis | Integer | Milliseconds for preprocessing the data. |
computeMillis | Integer | Milliseconds for running the algorithm. |
mutateMillis | Integer | Milliseconds for adding properties to the projected graph. |
postProcessingMillis | Integer | Milliseconds for computing component count and distribution statistics. |
componentDistribution | Map | Map containing min, max, mean as well as p1, p5, p10, p25, p50, p75, p90, p95, p99 and p999 percentile values of component sizes. |
configuration | Map | The configuration used for running the algorithm. |
CALL gds.wcc.write(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  preProcessingMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
logProgress | Boolean | | yes | If disabled the progress percentage will not be logged. |
writeConcurrency | Integer | | yes | The number of concurrent threads used for writing the result to Neo4j. |
writeProperty | String | | no | The node property in the Neo4j database to which the component ID is written. |
relationshipWeightProperty | String | | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
seedProperty | String | | yes | Used to set the initial component for a node. The property value needs to be a number. |
threshold | Float | | yes | The value of the weight above which the relationship is considered in the computation. |
consecutiveIds | Boolean | | yes | Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory). |
minComponentSize | Integer | | yes | Only nodes inside components larger than or equal to the given value will be written to the underlying Neo4j database. |
Name | Type | Description |
---|---|---|
componentCount | Integer | The number of computed components. |
nodePropertiesWritten | Integer | The number of node properties written. |
preProcessingMillis | Integer | Milliseconds for preprocessing the data. |
computeMillis | Integer | Milliseconds for running the algorithm. |
writeMillis | Integer | Milliseconds for writing result back to Neo4j. |
postProcessingMillis | Integer | Milliseconds for computing component count and distribution statistics. |
componentDistribution | Map | Map containing min, max, mean as well as p1, p5, p10, p25, p50, p75, p90, p95, p99 and p999 percentile values of component sizes. |
configuration | Map | The configuration used for running the algorithm. |
Examples
All the examples below should be run in an empty database. The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.
In this section we will show examples of running the Weakly Connected Components algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small user network graph of a handful of nodes connected in a particular pattern. The example graph looks like this:
CREATE
(nAlice:User {name: 'Alice'}),
(nBridget:User {name: 'Bridget'}),
(nCharles:User {name: 'Charles'}),
(nDoug:User {name: 'Doug'}),
(nMark:User {name: 'Mark'}),
(nMichael:User {name: 'Michael'}),
(nAlice)-[:LINK {weight: 0.5}]->(nBridget),
(nAlice)-[:LINK {weight: 4}]->(nCharles),
(nMark)-[:LINK {weight: 1.1}]->(nDoug),
(nMark)-[:LINK {weight: 2}]->(nMichael);
This graph has two connected components, each with three nodes.
The relationships that connect the nodes in each component have a property weight which determines the strength of the relationship.
MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
'myGraph',
source,
target,
{ relationshipProperties: r { .weight } }
)
In the following examples we will demonstrate using the Weakly Connected Components algorithm on this graph.
Memory Estimation
First off, we will estimate the cost of running the algorithm using the estimate procedure.
This can be done with any execution mode.
We will use the write mode in this example.
Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have.
When you later actually run the algorithm in one of the execution modes, the system will perform an estimation.
If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited.
To read more about this, see Automatic estimation and execution blocking.
For more details on estimate in general, see Memory Estimation.
CALL gds.wcc.write.estimate('myGraph', { writeProperty: 'component' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
nodeCount | relationshipCount | bytesMin | bytesMax | requiredMemory |
---|---|---|---|---|
6 | 4 | 112 | 112 | "112 Bytes" |
Stream
In the stream execution mode, the algorithm returns the component ID for each node.
This allows us to inspect the results directly or post-process them in Cypher without any side effects.
For example, we can order the results to see the nodes that belong to the same component displayed next to each other.
For more details on the stream mode in general, see Stream.
CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name
name | componentId |
---|---|
"Alice" | 0 |
"Bridget" | 0 |
"Charles" | 0 |
"Doug" | 3 |
"Mark" | 3 |
"Michael" | 3 |
The result shows that the algorithm identifies two components. This can be verified in the example graph.
The default behaviour of the algorithm is to run unweighted, i.e. without using relationship weights.
The weighted option will be demonstrated in Weighted.
The actual component ids may differ because the order of nodes projected in the in-memory graph is not guaranteed.
For this case it is equally plausible to get the inverse solution.
Stats
In the stats execution mode, the algorithm returns a single row containing a summary of the algorithm result.
This execution mode does not have any side effects.
It can be useful for evaluating algorithm performance by inspecting the computeMillis return item.
In the examples below we will omit returning the timings.
The full signature of the procedure can be found in the syntax section.
For more details on the stats mode in general, see Stats.
The following runs the algorithm in stats mode:
CALL gds.wcc.stats('myGraph')
YIELD componentCount
componentCount |
---|
2 |
The result shows that myGraph has two components. This can be verified by looking at the example graph.
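To make the summary values concrete, the following Python sketch derives a component count and a few size statistics from the rows shown in the stream example. It is only an illustration of what the numbers mean, not the procedure's implementation; the percentile entries of componentDistribution are left out, and the variable names are assumptions.

```python
from collections import Counter
from statistics import mean

# (name, componentId) rows as returned by the stream example above.
rows = [
    ("Alice", 0), ("Bridget", 0), ("Charles", 0),
    ("Doug", 3), ("Mark", 3), ("Michael", 3),
]

# Count how many nodes fall into each component.
sizes = Counter(component_id for _, component_id in rows)

component_count = len(sizes)
distribution = {
    "min": min(sizes.values()),
    "max": max(sizes.values()),
    "mean": mean(sizes.values()),
}

print(component_count)  # 2
print(distribution["min"], distribution["max"])  # 3 3
```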
Mutate
The mutate execution mode extends the stats mode with an important side effect: updating the named graph with a new node property containing the component ID for that node.
The name of the new property is specified using the mandatory configuration parameter mutateProperty.
The result is a single summary row, similar to stats, but with some additional metrics.
The mutate mode is especially useful when multiple algorithms are used in conjunction.
For more details on the mutate mode in general, see Mutate.
The following runs the algorithm in mutate mode:
CALL gds.wcc.mutate('myGraph', { mutateProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;
nodePropertiesWritten | componentCount |
---|---|
6 | 2 |
Write
The write execution mode extends the stats mode with an important side effect: writing the component ID for each node as a property to the Neo4j database.
The name of the new property is specified using the mandatory configuration parameter writeProperty.
The result is a single summary row, similar to stats, but with some additional metrics.
The write mode enables directly persisting the results to the database.
For more details on the write mode in general, see Write.
The following runs the algorithm in write mode:
CALL gds.wcc.write('myGraph', { writeProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;
nodePropertiesWritten | componentCount |
---|---|
6 | 2 |
As we can see from the results, the nodes connected to one another are calculated by the algorithm as belonging to the same connected component.
Weighted
By configuring the algorithm to use a weight we can increase granularity in the way the algorithm calculates component assignment.
We do this by specifying the property key with the relationshipWeightProperty configuration parameter.
Additionally, we can specify a threshold for the weight value.
Then, only weights greater than the threshold value will be considered by the algorithm.
We do this by specifying the threshold value with the threshold configuration parameter.
If a relationship does not have the specified weight property, the algorithm falls back to using a default value of zero.
CALL gds.wcc.stream('myGraph', {
relationshipWeightProperty: 'weight',
threshold: 1.0
}) YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS Name, componentId AS ComponentId
ORDER BY ComponentId, Name
Name | ComponentId |
---|---|
"Alice" | 0 |
"Charles" | 0 |
"Bridget" | 1 |
"Doug" | 3 |
"Mark" | 3 |
"Michael" | 3 |
As we can see from the results, the node named 'Bridget' is now in its own component, because the weight of its only relationship is below the configured threshold and the relationship is therefore ignored.
The actual component ids may differ because the order of nodes projected in the in-memory graph is not guaranteed.
For this case it is equally plausible to get the inverse solution.
We are using stream mode to illustrate running the algorithm as weighted or unweighted. All the other algorithm modes also support this configuration parameter.
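The threshold rule can be sketched in Python on the example graph: a relationship only joins two components if its weight is strictly greater than the threshold. This is a simplified illustration rather than the GDS implementation; the function name `wcc_with_threshold` is an assumption, and component membership is reported as groups of names instead of numeric IDs.

```python
def wcc_with_threshold(nodes, weighted_relationships, threshold):
    """Group nodes into components, using only relationships with weight > threshold."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target, weight in weighted_relationships:
        if weight > threshold:  # relationships at or below the threshold are skipped
            root_source, root_target = find(source), find(target)
            if root_source != root_target:
                parent[root_target] = root_source

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


users = ["Alice", "Bridget", "Charles", "Doug", "Mark", "Michael"]
links = [
    ("Alice", "Bridget", 0.5),
    ("Alice", "Charles", 4),
    ("Mark", "Doug", 1.1),
    ("Mark", "Michael", 2),
]

print(wcc_with_threshold(users, links, 1.0))
# [['Alice', 'Charles'], ['Bridget'], ['Doug', 'Mark', 'Michael']]
```

With a threshold of 1.0, the relationship between Alice and Bridget (weight 0.5) is dropped, which leaves Bridget in a component of her own.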
Seeded components
It is possible to define preliminary component IDs for nodes using the seedProperty configuration parameter.
This is helpful if we want to retain components from a previous run and it is known that no components have been split by removing relationships.
The property value needs to be a number.
The algorithm first checks if there is a seeded component ID assigned to the node. If there is one, that component ID is used. Otherwise, a new unique component ID is assigned to the node.
Once every node belongs to a component, the algorithm merges components of connected nodes.
When components are merged, the resulting component is always the one with the lower component ID.
Note that the consecutiveIds configuration option cannot be used in combination with seeding in order to retain the seeding values.
The algorithm assumes that nodes with the same seed value do in fact belong to the same component. If any two nodes in different components have the same seed, behavior is undefined. In that case, we recommend running WCC without seeds.
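The seeding rules described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GDS implementation: the function name `seeded_wcc` is hypothetical, fresh IDs for unseeded nodes are simply taken above the largest seed (how GDS chooses them is not specified here), and the node 'Mats' stands for a newly added node without a seed.

```python
def seeded_wcc(nodes, relationships, seeds):
    """Assign component IDs from seeds, then merge connected components (lower ID wins)."""
    # Assumption: unseeded nodes receive fresh IDs above the largest seed value.
    next_id = max(seeds.values(), default=-1) + 1
    component = {}
    for node in nodes:
        if node in seeds:
            component[node] = seeds[node]
        else:
            component[node] = next_id
            next_id += 1

    # Union-find over component IDs; the root of a merged set is its lowest ID.
    parent = {cid: cid for cid in set(component.values())}

    def find(cid):
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for source, target in relationships:
        a, b = find(component[source]), find(component[target])
        if a != b:
            parent[max(a, b)] = min(a, b)

    return {node: find(component[node]) for node in nodes}


seeds = {"Alice": 0, "Charles": 0, "Bridget": 1, "Doug": 3, "Mark": 3, "Michael": 3}
nodes = list(seeds) + ["Mats"]  # 'Mats' has no seed
relationships = [("Alice", "Charles"), ("Mark", "Doug"), ("Mark", "Michael"), ("Bridget", "Mats")]

result = seeded_wcc(nodes, relationships, seeds)
print(result["Mats"])   # 1
print(result["Alice"])  # 0
```

The unseeded node first receives a fresh ID and is then merged into the component of its seeded neighbour, because that component has the lower ID.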
To demonstrate this in practice, we will go through a few steps:
- We will run the algorithm and write the results to Neo4j.
- Then we will add another node to our graph. This node will not have the property computed in Step 1.
- We will project a new graph that has the result from Step 1 as a node property.
- And then we will run the algorithm again, this time in stream mode, and we will use the seedProperty configuration parameter.
We will use the weighted variant of WCC.
Step 1
The following runs the algorithm in write mode:
CALL gds.wcc.write('myGraph', {
  writeProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
})
YIELD nodePropertiesWritten, componentCount;
nodePropertiesWritten | componentCount |
---|---|
6 | 3 |
Step 2
After the algorithm has finished writing to Neo4j we want to create a new node in the database.
MATCH (b:User {name: 'Bridget'})
CREATE (b)-[:LINK {weight: 2.0}]->(new:User {name: 'Mats'})
Step 3
Note that we cannot use our already projected graph as it does not contain the component id. We will therefore project a second graph that contains the previously computed component id.
MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
'myGraph-seeded',
source,
target,
{
sourceNodeProperties: source { .componentId },
targetNodeProperties: target { .componentId },
relationshipProperties: r { .weight }
}
)
Step 4
The following runs the algorithm in stream mode using seedProperty:
CALL gds.wcc.stream('myGraph-seeded', {
  seedProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
}) YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name
name | componentId |
---|---|
"Alice" | 0 |
"Charles" | 0 |
"Bridget" | 1 |
"Mats" | 1 |
"Doug" | 3 |
"Mark" | 3 |
"Michael" | 3 |
The result shows that despite not having the seedProperty when it was projected, the node 'Mats' has been assigned to the same component as the node 'Bridget'.
This is correct because these two nodes are connected.
The actual component ids may differ because the order of nodes projected in the in-memory graph is not guaranteed.
For this case it is equally plausible to get the inverse solution.
Writing Seeded components
In the previous section we demonstrated the seedProperty usage in stream mode.
It is also available in the other modes of the algorithm.
Below is an example of how to use seedProperty in write mode.
Note that the example below relies on Steps 1 - 3 from the previous section.
The following runs the algorithm in write mode using seedProperty:
CALL gds.wcc.write('myGraph-seeded', {
  seedProperty: 'componentId',
  writeProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
})
YIELD nodePropertiesWritten, componentCount;
nodePropertiesWritten | componentCount |
---|---|
1 | 3 |
If the |
Graph Sampling optimization
The WCC implementation provides two compute strategies:
- The unsampled strategy as described in Wait-free Parallel Algorithms for the Union-Find Problem.
- The sampled strategy as described in Optimizing Parallel Graph Connectivity Computation via Subgraph Sampling.
While both strategies provide very good performance, the sampled strategy is usually the faster one. The decision, which strategy to use, depends on the input graph. If the relationships of the graph are …
- … undirected, the algorithm picks the sampled strategy.
- … directed, the algorithm picks the unsampled strategy.
- … directed and inverse indexed, the algorithm picks the sampled strategy.
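The selection rule listed above can be written down as a small Python function. This only restates the rule for illustration; the function name `pick_strategy` and its parameters are hypothetical and do not correspond to a GDS API.

```python
def pick_strategy(undirected, inverse_indexed=False):
    """Return the compute strategy for a graph, following the rule listed above."""
    # Undirected graphs, and directed graphs with an inverse index, use sampling.
    if undirected or inverse_indexed:
        return "sampled"
    # Plain directed graphs fall back to the unsampled strategy.
    return "unsampled"


print(pick_strategy(undirected=True))                          # sampled
print(pick_strategy(undirected=False))                         # unsampled
print(pick_strategy(undirected=False, inverse_indexed=True))   # sampled
```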
The direction of a relationship is defined by the orientation which can be set during a graph projection.
While the NATURAL and REVERSE orientations result in a directed graph, the UNDIRECTED orientation leads to undirected relationships.
In order to create a directed graph with inverse indexed relationships, one can use the indexInverse parameter as part of the relationship projection.
An inverse index allows the algorithm to traverse the relationships of a node according to the opposite orientation.
If the graph is projected using a NATURAL orientation, the inverse index represents the REVERSE orientation and vice versa.
The following projects the example graph with inverse indexed relationships, named myIndexedGraph.
MATCH (source:User)-[r:LINK]->(target:User)
RETURN gds.graph.project(
'myIndexedGraph',
source,
target,
{},
{ inverseIndexedRelationshipTypes: ['*'] }
)
The following query is identical to the stream example in the previous section.
This time, we execute WCC on myIndexedGraph, which will allow the algorithm to use the sampled strategy.
CALL gds.wcc.stream('myIndexedGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name
name | componentId |
---|---|
"Alice" | 0 |
"Bridget" | 0 |
"Charles" | 0 |
"Doug" | 3 |
"Mark" | 3 |
"Michael" | 3 |
The actual component ids may differ due to the randomness in the Graph sampling optimization.
For this case it is equally plausible to get the inverse solution.