6.3. The Connected Components algorithm

This section describes the Connected Components algorithm in the Neo4j Graph Algorithms library.

The Connected Components, or Union Find, algorithm finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set. It differs from the Strongly Connected Components algorithm (SCC) because it only needs a path to exist between pairs of nodes in one direction, whereas SCC needs a path to exist in both directions. As with SCC, UnionFind is often used early in an analysis to understand a graph’s structure.

This section includes:

6.3.1. History and explanation

The algorithm was first described by Bernard A. Galler and Michael J. Fischer in 1964. The components in a graph are computed using either the breadth-first search or depth-first search algorithms.

6.3.2. Use-cases - when to use the Connected Components algorithm

  • Testing whether a graph is connected is an essential pre-processing step for every graph algorithm. Such tests can be performed so quickly, and easily, that you should always verify that your input graph is connected, even when you know it has to be. Subtle, difficult-to-detect, bugs often result when your algorithm is run only on one component of a disconnected graph.
  • Union Find can be used to keep track of clusters of database records, as part of the de-duplication process - an important task in master data management applications. Read more in "An efficient domain-independent algorithm for detecting approximately duplicate database records".
  • Weakly connected components (WCC) can be used to analyse citation networks. One study uses WCC to work out how well connected the network is, and then to see whether the connectivity remains if 'hub' or 'authority' nodes are moved from the graph. Read more in "Characterizing and Mining Citation Graph of Computer Science Literature".

6.3.3. Connected Components algorithm sample

If we recall that an undirected graph is connected if, for every pair of vertices there is a path in the graph between those vertices. A connected component of an undirected graph is a maximal connected subgraph of the graph. That means that the direction of the relationships in our graph are ignored - we treat the graph as undirected.

We have two implementations of the Connected Components algorithm. The first treats the graph as unweighted and the second treats it as weighted, where you can define the threshold of the weight above which relationships are included.

connected components

The following will create a sample graph: 

MERGE (nAlice:User {id:'Alice'})
MERGE (nBridget:User {id:'Bridget'})
MERGE (nCharles:User {id:'Charles'})
MERGE (nDoug:User {id:'Doug'})
MERGE (nMark:User {id:'Mark'})
MERGE (nMichael:User {id:'Michael'})

MERGE (nAlice)-[:FRIEND {weight:0.5}]->(nBridget)
MERGE (nAlice)-[:FRIEND {weight:4}]->(nCharles)
MERGE (nMark)-[:FRIEND {weight:1}]->(nDoug)
MERGE (nMark)-[:FRIEND {weight:2}]->(nMichael);

6.3.3.1. Unweighted version

The following will run the algorithm and stream results: 

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId

RETURN algo.asNode(nodeId).id AS user, setId

The following will run the algorithm and write back results: 

CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;

Table 6.13. Results
Name Partition

Alice

0

Charles

0

Bridget

0

Michael

4

Doug

4

Mark

4

We have two distinct group of users, that have no link between them.

The first group contains Alice, Charles, and Bridget, while the second group contains Michael, Doug, and Mark.

The following will check the number and size of partitions, using Cypher: 

MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC
LIMIT 20;

6.3.3.2. Weighted version

If you define the property that holds the weight (weightProperty) and the threshold, it means the nodes are only connected, if the threshold on the weight of the relationship is high enough, otherwise the relationship is thrown away.

The following will run the algorithm and stream results: 

CALL algo.unionFind.stream('User', 'FRIEND', {weightProperty:'weight', defaultValue:0.0, threshold:1.0, concurrency: 1})
YIELD nodeId,setId

RETURN algo.asNode(nodeId).id AS user, setId

The following will run the algorithm and write back results: 

CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition",weightProperty:'weight', defaultValue:0.0, threshold:1.0, concurrency: 1})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;

Table 6.14. Results
Name Partition

Alice

0

Charles

0

Bridget

1

Michael

4

Doug

4

Mark

4

In this case we can see that, because the weight of the relationship between Bridget and Alice is only 0.5, the relationship is ignored by the algorithm, and Bridget ends up in her own component.

6.3.4. Example usage

As mentioned above, connected components are an essential step in preprocessing your data. One reason is that most centralities suffer from disconnected components, or you just want to find disconnected groups of nodes. Int his example, Yelp’s social network will be used to demonstrate how to proceed when dealing with real world data. A typical social network consists of one big component and a number of small disconnected components.

The following will get the count of connected components: 

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN count(distinct setId) as count_of_components;

We get back the count of disconnected components being 18512 if we do not count users without friends. Let’s now check the size of top 20 components to get a better picture:

The following will get the size of top 20 components: 

CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component
LIMIT 20;

The biggest component has 8938630 out of total 8981389 (99,5%). It is quite high, but not shocking, as we have a friendship social network where we can expect small world effect and 6 degree of separation rule, where you can get to any person in a social network, just depends how long is the path.

We can now move on to next step of analysis and run centralities on only the biggest components, so that our results will be more accurate. We will write back the results to the node, and use centralities with Cypher loading, or set a new label for the biggest component.

6.3.5. Huge graph projection

The default label and relationship-type projection has a limitation of 2 billion nodes and 2 billion relationships. Therefore, if our projected graph contains more than 2 billion nodes or relationships, we will need to use huge graph projection.

Set graph:'huge' in the config: 

CALL algo.unionFind('User', 'FRIEND', {graph:'huge'})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;

6.3.6. Cypher projection

If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. You can learn more in the Section 2.2, “Cypher projection” section of the manual.

Set graph:'cypher' in the config: 

CALL algo.unionFind(
  'MATCH (p:User) RETURN id(p) as id',
  'MATCH (p1:User)-[f:FRIEND]->(p2:User)
   RETURN id(p1) as source, id(p2) as target, f.weight as weight',
  {graph:'cypher',write:true}
);

6.3.7. Syntax

The following will run the algorithm and write back results: 

CALL algo.unionFind(label:String, relationship:String, {threshold:0.42,
    defaultValue:1.0, write: true, writeProperty:'partition', weightProperty:'weight', graph:'heavy', concurrency:4})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis

Table 6.15. Parameters
Name Type Default Optional Description

label

string

null

yes

The label to load from the graph. If null, load all nodes

relationship

string

null

yes

The relationship-type to load from the graph. If null, load all relationships

weightProperty

string

null

yes

The property name that contains weight. If null, treats the graph as unweighted. Must be numeric.

write

boolean

true

yes

Specifies if the result should be written back as a node property

writeProperty

string

'partition'

yes

The property name written back the ID of the partition particular node belongs to

threshold

float

null

yes

The value of the weight above which the relationship is not thrown away

defaultValue

float

null

yes

The default value of the weight in case it is missing or invalid

concurrency

int

available CPUs

yes

The number of concurrent threads

graph

string

'heavy'

yes

Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 6.16. Results
Name Type Description

loadMillis

int

Milliseconds for loading data

computeMillis

int

Milliseconds for running the algorithm

writeMillis

int

Milliseconds for writing result data back

postProcessingMillis

int

Milliseconds for computing percentiles and community count

nodes

int

The number of nodes considered

communityCount

int

The number of communities found

p1

double

The 1 percentile of community size.

p5

double

The 5 percentile of community size.

p10

double

The 10 percentile of community size.

p25

double

The 25 percentile of community size.

p50

double

The 50 percentile of community size.

p75

double

The 75 percentile of community size.

p90

double

The 90 percentile of community size.

p95

double

The 95 percentile of community size.

p99

double

The 99 percentile of community size.

p100

double

The 100 percentile of community size.

write

boolean

Specifies if the result was written back as a node property

writeProperty

string

The property name written back to

The following will run the algorithm and stream results: 

CALL algo.unionFind.stream(label:String, relationship:String,
    {weightProperty:'weight', threshold:0.42, defaultValue:1.0, concurrency:4})
YIELD nodeId, setId - yields a setId to each node id

Table 6.17. Parameters
Name Type Default Optional Description

label

string

null

yes

The label to load from the graph. If null, load all nodes

relationship

string

null

yes

The relationship-type to load from the graph. If null, load all relationships

concurrency

int

available CPUs

yes

The number of concurrent threads

weightProperty

string

null

yes

The property name that contains weight. If null, treats the graph as unweighted. Must be numeric.

threshold

float

null

yes

The value of the weight above which the relationship is not thrown away

defaultValue

float

null

yes

The default value of the weight in case it is missing or invalid

graph

string

'heavy'

yes

Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 6.18. Results
Name Type Description

nodeId

int

Node ID

setId

int

Partition ID

6.3.8. Implementations

algo.unionFind

  • If a threshold configuration parameter is supplied, only relationships with a property value higher than the threshold are merged.

    algo.unionFind.queue

  • Parallel Union Find, using ExecutorService only.
  • Algorithm based on the idea that DisjointSetStruct can be built using just a partition of the nodes, which are then merged pairwise.
  • The implementation is based on a queue which acts as a buffer for each computed DisjointSetStruct. As long as there are more elements on the queue, the algorithm takes two, merges them, and adds its result to the queue until only 1 element remains.

algo.unionFind.forkJoinMerge

  • Like in algo.unionFind.queue, the resulting DisjointSetStruct of each node-partition is merged by the ForkJoin pool, while the calculation of the DisjointSetStruct is done by the ExecutorService.

algo.unionFind.forkJoin

  • Calculation and merge using forkJoinPool

algo.unionFind.mscoloring

  • Coloring based parallel algorithm