5.3.3. Weakly Connected Components

This section describes the Weakly Connected Components (WCC) algorithm in the Neo4j Graph Data Science library.

This topic includes:

5.3.3.1. Introduction

The WCC algorithm finds sets of connected nodes in an undirected graph, where all nodes in the same set form a connected component. WCC is often used early in an analysis to understand the structure of a graph. Using WCC to understand the graph structure enables running other algorithms independently on an identified cluster. As a preprocessing step for directed graphs, it helps quickly identify disconnected groups.

For more information on this algorithm, see:

Running this algorithm requires sufficient memory availability. Before running this algorithm, we recommend that you read Section 3.1, “Memory Estimation”.

5.3.3.2. Syntax

This section covers the syntax used to execute the Weakly Connected Components algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Section 5.1, “Syntax overview”.

Example 5.5. WCC syntax per mode

Run WCC in stream mode on a named graph. 

CALL gds.wcc.stream(
  graphName: String,
  configuration: Map
)
YIELD
  nodeId: Integer,
  componentId: Integer

Table 5.137. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 5.138. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 5.139. Algorithm specific configuration
Name Type Default Optional Description

relationshipWeightProperty

String

null

yes

The relationship property that contains the weight. If null, the graph is treated as unweighted. Must be numeric.

defaultValue

Float

null

yes

The default value of the relationship weight in case it is missing or invalid.

seedProperty

String

n/a

yes

Used to set the initial component for a node. The property value needs to be a number.

threshold

Float

null

yes

The value of the weight above which the relationship is considered in the computation.

consecutiveIds

Boolean

false

yes

Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).

Table 5.140. Results
Name Type Description

nodeId

Integer

The Neo4j node ID.

componentId

Integer

The component ID.

Run WCC in stats mode on a named graph. 

CALL gds.wcc.stats(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  createMillis: Integer,
  computeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 5.141. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 5.142. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

Table 5.143. Algorithm specific configuration
Name Type Default Optional Description

relationshipWeightProperty

String

null

yes

The relationship property that contains the weight. If null, the graph is treated as unweighted. Must be numeric.

defaultValue

Float

null

yes

The default value of the relationship weight in case it is missing or invalid.

seedProperty

String

n/a

yes

Used to set the initial component for a node. The property value needs to be a number.

threshold

Float

null

yes

The value of the weight above which the relationship is considered in the computation.

consecutiveIds

Boolean

false

yes

Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).

Table 5.144. Results
Name Type Description

componentCount

Integer

The number of computed components.

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

postProcessingMillis

Integer

Milliseconds for computing component count and distribution statistics.

componentDistribution

Map

Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of component sizes.

configuration

Map

The configuration used for running the algorithm.

Run WCC in mutate mode on a named graph. 

CALL gds.wcc.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  relationshipPropertiesWritten: Integer,
  createMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 5.145. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 5.146. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm.

mutateProperty

String

n/a

no

The node property in the GDS graph to which the component ID is written.

Table 5.147. Algorithm specific configuration
Name Type Default Optional Description

relationshipWeightProperty

String

null

yes

The relationship property that contains the weight. If null, the graph is treated as unweighted. Must be numeric.

defaultValue

Float

null

yes

The default value of the relationship weight in case it is missing or invalid.

seedProperty

String

n/a

yes

Used to set the initial component for a node. The property value needs to be a number.

threshold

Float

null

yes

The value of the weight above which the relationship is considered in the computation.

consecutiveIds

Boolean

false

yes

Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).

Table 5.148. Results
Name Type Description

componentCount

Integer

The number of computed components.

nodePropertiesWritten

Integer

The number of node properties written.

relationshipPropertiesWritten

Integer

The number of relationship properties written.

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

mutateMillis

Integer

Milliseconds for adding properties to the in-memory graph.

postProcessingMillis

Integer

Milliseconds for computing component count and distribution statistics.

componentDistribution

Map

Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of component sizes.

configuration

Map

The configuration used for running the algorithm.

Run WCC in write mode on a named graph. 

CALL gds.wcc.write(
  graphName: String,
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  relationshipPropertiesWritten: Integer,
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 5.149. Parameters
Name Type Default Optional Description

graphName

String

n/a

no

The name of a graph stored in the catalog.

configuration

Map

{}

yes

Configuration for algorithm-specifics and/or graph filtering.

Table 5.150. General configuration for algorithm execution on a named graph.
Name Type Default Optional Description

nodeLabels

String[]

['*']

yes

Filter the named graph using the given node labels.

relationshipTypes

String[]

['*']

yes

Filter the named graph using the given relationship types.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'writeConcurrency'.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result to Neo4j.

writeProperty

String

n/a

no

The node property in the Neo4j database to which the component ID is written.

Table 5.151. Algorithm specific configuration
Name Type Default Optional Description

relationshipWeightProperty

String

null

yes

The relationship property that contains the weight. If null, the graph is treated as unweighted. Must be numeric.

defaultValue

Float

null

yes

The default value of the relationship weight in case it is missing or invalid.

seedProperty

String

n/a

yes

Used to set the initial component for a node. The property value needs to be a number.

threshold

Float

null

yes

The value of the weight above which the relationship is considered in the computation.

consecutiveIds

Boolean

false

yes

Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).

Table 5.152. Results
Name Type Description

componentCount

Integer

The number of computed components.

nodePropertiesWritten

Integer

The number of node properties written.

relationshipPropertiesWritten

Integer

The number of relationship properties written.

createMillis

Integer

Milliseconds for loading data.

computeMillis

Integer

Milliseconds for running the algorithm.

writeMillis

Integer

Milliseconds for writing result back to Neo4j.

postProcessingMillis

Integer

Milliseconds for computing component count and distribution statistics.

componentDistribution

Map

Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of component sizes.

configuration

Map

The configuration used for running the algorithm.

Anonymous graphs

It is also possible to execute the algorithm on a graph that is projected in conjunction with the algorithm execution. In this case, the graph does not have a name, and we call it anonymous. When executing over an anonymous graph the configuration map contains a graph projection configuration as well as an algorithm configuration. All execution modes support execution on anonymous graphs, although we only show syntax and mode-specific configuration for the write mode for brevity.

For more information on syntax variants, see Section 5.1, “Syntax overview”.

Run WCC in write mode on an anonymous graph: 

CALL gds.wcc.write(
  configuration: Map
)
YIELD
  componentCount: Integer,
  nodePropertiesWritten: Integer,
  relationshipPropertiesWritten: Integer,
  createMillis: Integer,
  computeMillis: Integer,
  writeMillis: Integer,
  postProcessingMillis: Integer,
  componentDistribution: Map,
  configuration: Map

Table 5.153. General configuration for algorithm execution on an anonymous graph.
Name Type Default Optional Description

nodeProjection

String, String[] or Map

null

yes

The node projection used for anonymous graph creation via a Native projection.

relationshipProjection

String, String[] or Map

null

yes

The relationship projection used for anonymous graph creation a Native projection.

nodeQuery

String

null

yes

The Cypher query used to select the nodes for anonymous graph creation via a Cypher projection.

relationshipQuery

String

null

yes

The Cypher query used to select the relationships for anonymous graph creation via a Cypher projection.

nodeProperties

String, String[] or Map

null

yes

The node properties to project during anonymous graph creation.

relationshipProperties

String, String[] or Map

null

yes

The relationship properties to project during anonymous graph creation.

concurrency

Integer

4

yes

The number of concurrent threads used for running the algorithm. Also provides the default value for 'readConcurrency' and 'writeConcurrency'.

readConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for creating the graph.

writeConcurrency

Integer

value of 'concurrency'

yes

The number of concurrent threads used for writing the result to Neo4j.

writeProperty

String

n/a

no

The node property in the Neo4j database to which the component ID is written.

Table 5.154. Algorithm specific configuration
Name Type Default Optional Description

relationshipWeightProperty

String

null

yes

The relationship property that contains the weight. If null, the graph is treated as unweighted. Must be numeric.

defaultValue

Float

null

yes

The default value of the relationship weight in case it is missing or invalid.

seedProperty

String

n/a

yes

Used to set the initial component for a node. The property value needs to be a number.

threshold

Float

null

yes

The value of the weight above which the relationship is considered in the computation.

consecutiveIds

Boolean

false

yes

Flag to decide whether component identifiers are mapped into a consecutive id space (requires additional memory).

The results are the same as for running write mode with a named graph, see the write mode syntax above.

5.3.3.3. Examples

In this section we will show examples of executing the Weakly Connected Components algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small social network graph of a handful nodes connected in a particular pattern. The example graph looks like this:

wcc graph

The following Cypher statement will create the example graph in the Neo4j database: 

CREATE
  (nAlice:User {name: 'Alice'}),
  (nBridget:User {name: 'Bridget'}),
  (nCharles:User {name: 'Charles'}),
  (nDoug:User {name: 'Doug'}),
  (nMark:User {name: 'Mark'}),
  (nMichael:User {name: 'Michael'}),

  (nAlice)-[:LINK {weight: 0.5}]->(nBridget),
  (nAlice)-[:LINK {weight: 4}]->(nCharles),
  (nMark)-[:LINK {weight: 1.1}]->(nDoug),
  (nMark)-[:LINK {weight: 2}]->(nMichael);

This graph has two connected components, each with three nodes. The relationships that connect the nodes in each component have a property weight which determines the strength of the relationship.

In the examples below we will use named graphs and native projections as the norm. However, anonymous graphs and/or Cypher projections can also be used.

The following statement will create a graph using a native projection and store it in the graph catalog under the name 'myGraph'. 

CALL gds.graph.create(
  'myGraph',
  'User',
  'LINK',
  {
    relationshipProperties: 'weight'
  }
)

In the following examples we will demonstrate using the Weakly Connected Components algorithm on this graph.

Memory Estimation

First off, we will estimate the cost of running the algorithm using the estimate procedure. This can be done with any execution mode. We will use the write mode in this example. Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have. When you later actually run the algorithm in one of the execution modes the system will perform an estimation. If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited. To read more about this, see Section 3.1.3, “Automatic estimation and execution blocking”.

For more details on estimate in general, see Section 3.1, “Memory Estimation”.

The following will estimate the memory requirements for running the algorithm in write mode: 

CALL gds.wcc.write.estimate('myGraph', { writeProperty: 'component' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory

Table 5.155. Results
nodeCount relationshipCount bytesMin bytesMax requiredMemory

6

4

176

176

"176 Bytes"

Stream

In the stream execution mode, the algorithm returns the component ID for each node. This allows us to inspect the results directly or post-process them in Cypher without any side effects. For example, we can order the results to see the nodes that belong to the same component displayed next to each other.

For more details on the stream mode in general, see Section 3.3.1, “Stream”.

The following will run the algorithm and stream results: 

CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name

Table 5.156. Results
name componentId

"Alice"

0

"Bridget"

0

"Charles"

0

"Doug"

3

"Mark"

3

"Michael"

3

The result shows that the algorithm identifies two components. This can be verified in the example graph.

The default behaviour of the algorithm is to run unweighted, e.g. without using relationship weights. The weighted option will be demonstrated in the section called “Weighted”

Stats

In the stats execution mode, the algorithm returns a single row containing a summary of the algorithm result. In particular, Betweenness Centrality returns the minimum, maximum and sum of all centrality scores. This execution mode does not have any side effects. It can be useful for evaluating algorithm performance by inspecting the computeMillis return item. In the examples below we will omit returning the timings. The full signature of the procedure can be found in the syntax section.

For more details on the stats mode in general, see Section 3.3.2, “Stats”.

The following will run the algorithm in stats mode: 

CALL gds.wcc.stats('myGraph')
YIELD componentCount

Table 5.157. Results
componentCount

2

The result shows that myGraph has two components and this can be verified by looking at the example graph.

Mutate

The mutate execution mode extends the stats mode with an important side effect: updating the named graph with a new node property containing the component ID for that node. The name of the new property is specified using the mandatory configuration parameter mutateProperty. The result is a single summary row, similar to stats, but with some additional metrics. The mutate mode is especially useful when multiple algorithms are used in conjunction.

For more details on the mutate mode in general, see Section 3.3.3, “Mutate”.

The following will run the algorithm in mutate mode: 

CALL gds.wcc.mutate('myGraph', { mutateProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;

Table 5.158. Results
nodePropertiesWritten componentCount

6

2

Write

The write execution mode extends the stats mode with an important side effect: writing the component ID for each node as a property to the Neo4j database. The name of the new property is specified using the mandatory configuration parameter writeProperty. The result is a single summary row, similar to stats, but with some additional metrics. The write mode enables directly persisting the results to the database.

For more details on the write mode in general, see Section 3.3.4, “Write”.

The following will run the algorithm in write mode: 

CALL gds.wcc.write('myGraph', { writeProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;

Table 5.159. Results
nodePropertiesWritten componentCount

6

2

As we can see from the results, the nodes connected to one another are calculated by the algorithm as belonging to the same connected component.

Weighted

By configuring the algorithm to use a weight we can increase granularity in the way the algorithm calculates component assignment. We do this by specifying the property key with the relationshipWeightProperty configuration parameter. Additionally, we can specify a threshold for the weight value. Then, only weights greater than the threshold value will be considered by the algorithm. We do this by specifying the threshold value with the threshold configuration parameter.

If a relationship does not have the specified weight property, the algorithm falls back to using a default value. The default fallback value is zero, but can be configured to using the defaultValue configuration parameter.

The following will run the algorithm and stream results: 

CALL gds.wcc.stream('myGraph', {
  relationshipWeightProperty: 'weight',
  threshold: 1.0
}) YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS Name, componentId AS ComponentId
ORDER BY ComponentId, Name

Table 5.160. Results
Name ComponentId

"Alice"

0

"Charles"

0

"Bridget"

1

"Doug"

3

"Mark"

3

"Michael"

3

As we can see from the results, the node named 'Bridget' is now in its own component, due to its relationship weight being less than the configured threshold and thus ignored.

We are using stream mode to illustrate running the algorithm as weighted or unweighted, all the other algorithm modes also support this configuration parameter.

Seeded components

It is possible to define preliminary component IDs for nodes using the seedProperty configuration parameter. This is helpful if we want to retain components from a previous run and it is known that no components have been split by removing relationships. The property value needs to be a number.

The algorithm first checks if there is a seeded component ID assigned to the node. If there is one, that component ID is used. Otherwise, a new unique component ID is assigned to the node.

Once every node belongs to a component, the algorithm merges components of connected nodes. When components are merged, the resulting component is always the one with the lower component ID. Note that the consecutiveIds configuration option cannot be used in combination with seeding in order to retain the seeding values.

The algorithm assumes that nodes with the same seed value do in fact belong to the same component. If any two nodes in different components have the same seed, behavior is undefined. It is then recommended running WCC without seeds.

To demonstrate this in practice, we will go through a few steps:

  1. We will run the algorithm and write the results to Neo4j.
  2. Then we will add another node to our graph, this node will not have the property computed in Step 1.
  3. We will create a new in-memory graph that has the result from Step 1 as nodeProperty
  4. And then we will run the algorithm again, this time in stream mode, and we will use the seedProperty configuration parameter.

We will use the weighted variant of WCC.

Step 1

The following will run the algorithm in write mode: 

CALL gds.wcc.write('myGraph', {
  writeProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
})
YIELD nodePropertiesWritten, componentCount;

Table 5.161. Results
nodePropertiesWritten componentCount

6

3

Step 2

After the algorithm has finished writing to Neo4j we want to create a new node in the database.

The following will create a new node in the Neo4j graph, with no component ID: 

MATCH (b:User {name: 'Bridget'})
CREATE (b)-[:LINK {weight: 2.0}]->(new:User {name: 'Mats'})

Step 3

Note, that we cannot use our already created graph as it does not contain the component id. We will therefore create a second in-memory graph that contains the previously computed component id.

The following will create a new graph containing the previously computed component id: 

CALL gds.graph.create(
  'myGraph-seeded',
  'User',
  'LINK',
  {
    nodeProperties: 'componentId',
    relationshipProperties: 'weight'
  }
)

Step 4

The following will run the algorithm in stream mode using seedProperty

CALL gds.wcc.stream('myGraph-seeded', {
  seedProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
}) YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name

Table 5.162. Results
name componentId

"Alice"

0

"Charles"

0

"Bridget"

1

"Mats"

1

"Doug"

3

"Mark"

3

"Michael"

3

The result shows that despite not having the seedProperty when it was created, the node 'Mats' has been assigned to the same component as the node 'Bridget'. This is correct because these two nodes are connected.

Writing Seeded components

In the previous section we demonstrated the seedProperty usage in stream mode. It is also available in the other modes of the algorithm. Below is an example on how to use seedProperty in write mode. Note that the example below relies on Steps 1 - 3 from the previous section.

The following will run the algorithm in write mode using seedProperty

CALL gds.wcc.write('myGraph-seeded', {
  seedProperty: 'componentId',
  writeProperty: 'componentId',
  relationshipWeightProperty: 'weight',
  threshold: 1.0
})
YIELD nodePropertiesWritten, componentCount;

Table 5.163. Results
nodePropertiesWritten componentCount

1

3

If the seedProperty configuration parameter has the same value as writeProperty, the algorithm only writes properties for nodes where the component ID has changed. If they differ, the algorithm writes properties for all nodes.