This section describes the Connected Components algorithm in the Neo4j Graph Algorithms library.
The Connected Components, or Union Find
, algorithm finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same
set.
It differs from the Strongly Connected Components algorithm (SCC) because it only needs a path to exist between pairs of nodes
in one direction, whereas SCC needs a path to exist in both directions.
As with SCC, UnionFind
is often used early in an analysis to understand a graph’s structure.
This section includes:
The algorithm was first described by Bernard A. Galler and Michael J. Fischer in 1964. The components in a graph are computed using either the breadth-first search or depth-first search algorithms.
Union Find
can be used to keep track of clusters of database records, as part of the de-duplication process - an important task in master
data management applications.
Read more in "An efficient domain-independent algorithm for detecting approximately duplicate database records".
If we recall that an undirected graph is connected if, for every pair of vertices there is a path in the graph between those vertices. A connected component of an undirected graph is a maximal connected subgraph of the graph. That means that the direction of the relationships in our graph are ignored - we treat the graph as undirected.
We have two implementations of the Connected Components algorithm. The first treats the graph as unweighted and the second treats it as weighted, where you can define the threshold of the weight above which relationships are included.
The following will create a sample graph:
MERGE (nAlice:User {id:'Alice'})
MERGE (nBridget:User {id:'Bridget'})
MERGE (nCharles:User {id:'Charles'})
MERGE (nDoug:User {id:'Doug'})
MERGE (nMark:User {id:'Mark'})
MERGE (nMichael:User {id:'Michael'})
MERGE (nAlice)-[:FRIEND {weight:0.5}]->(nBridget)
MERGE (nAlice)-[:FRIEND {weight:4}]->(nCharles)
MERGE (nMark)-[:FRIEND {weight:1}]->(nDoug)
MERGE (nMark)-[:FRIEND {weight:2}]->(nMichael);
The following will run the algorithm and stream results:
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN algo.asNode(nodeId).id AS user, setId
The following will run the algorithm and write back results:
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition"})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
Name | Partition |
---|---|
Alice |
0 |
Charles |
0 |
Bridget |
0 |
Michael |
4 |
Doug |
4 |
Mark |
4 |
We have two distinct group of users, that have no link between them.
The first group contains Alice, Charles, and Bridget, while the second group contains Michael, Doug, and Mark.
The following will check the number and size of partitions, using Cypher:
MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC
LIMIT 20;
If you define the property that holds the weight (weightProperty
) and the threshold, it means the nodes are only connected, if the threshold on the weight of the relationship is high enough,
otherwise the relationship is thrown away.
The following will run the algorithm and stream results:
CALL algo.unionFind.stream('User', 'FRIEND', {weightProperty:'weight', defaultValue:0.0, threshold:1.0, concurrency: 1})
YIELD nodeId,setId
RETURN algo.asNode(nodeId).id AS user, setId
The following will run the algorithm and write back results:
CALL algo.unionFind('User', 'FRIEND', {write:true, partitionProperty:"partition",weightProperty:'weight', defaultValue:0.0, threshold:1.0, concurrency: 1})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
Name | Partition |
---|---|
Alice |
0 |
Charles |
0 |
Bridget |
1 |
Michael |
4 |
Doug |
4 |
Mark |
4 |
In this case we can see that, because the weight of the relationship between Bridget and Alice is only 0.5, the relationship is ignored by the algorithm, and Bridget ends up in her own component.
As mentioned above, connected components are an essential step in preprocessing your data. One reason is that most centralities suffer from disconnected components, or you just want to find disconnected groups of nodes. Int his example, Yelp’s social network will be used to demonstrate how to proceed when dealing with real world data. A typical social network consists of one big component and a number of small disconnected components.
The following will get the count of connected components:
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN count(distinct setId) as count_of_components;
We get back the count of disconnected components being 18512 if we do not count users without friends. Let’s now check the size of top 20 components to get a better picture:
The following will get the size of top 20 components:
CALL algo.unionFind.stream('User', 'FRIEND', {})
YIELD nodeId,setId
RETURN setId,count(*) as size_of_component
ORDER BY size_of_component
LIMIT 20;
The biggest component has 8938630 out of total 8981389 (99,5%). It is quite high, but not shocking, as we have a friendship social network where we can expect small world effect and 6 degree of separation rule, where you can get to any person in a social network, just depends how long is the path.
We can now move on to next step of analysis and run centralities on only the biggest components, so that our results will be more accurate. We will write back the results to the node, and use centralities with Cypher loading, or set a new label for the biggest component.
The default label and relationship-type projection has a limitation of 2 billion nodes and 2 billion relationships. Therefore, if our projected graph contains more than 2 billion nodes or relationships, we will need to use huge graph projection.
Set graph:'huge'
in the config:
CALL algo.unionFind('User', 'FRIEND', {graph:'huge'})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis;
If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. You can learn more in the Section 2.2, “Cypher projection” section of the manual.
Set graph:'cypher'
in the config:
CALL algo.unionFind(
'MATCH (p:User) RETURN id(p) as id',
'MATCH (p1:User)-[f:FRIEND]->(p2:User)
RETURN id(p1) as source, id(p2) as target, f.weight as weight',
{graph:'cypher',write:true}
);
The following will run the algorithm and write back results:
CALL algo.unionFind(label:String, relationship:String, {threshold:0.42,
defaultValue:1.0, write: true, writeProperty:'partition', weightProperty:'weight', graph:'heavy', concurrency:4})
YIELD nodes, setCount, loadMillis, computeMillis, writeMillis
Name | Type | Default | Optional | Description |
---|---|---|---|---|
label |
string |
null |
yes |
The label to load from the graph. If null, load all nodes |
relationship |
string |
null |
yes |
The relationship-type to load from the graph. If null, load all relationships |
weightProperty |
string |
null |
yes |
The property name that contains weight. If null, treats the graph as unweighted. Must be numeric. |
write |
boolean |
true |
yes |
Specifies if the result should be written back as a node property |
writeProperty |
string |
'partition' |
yes |
The property name written back the ID of the partition particular node belongs to |
threshold |
float |
null |
yes |
The value of the weight above which the relationship is not thrown away |
defaultValue |
float |
null |
yes |
The default value of the weight in case it is missing or invalid |
concurrency |
int |
available CPUs |
yes |
The number of concurrent threads |
graph |
string |
'heavy' |
yes |
Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement |
Name | Type | Description |
---|---|---|
loadMillis |
int |
Milliseconds for loading data |
computeMillis |
int |
Milliseconds for running the algorithm |
writeMillis |
int |
Milliseconds for writing result data back |
postProcessingMillis |
int |
Milliseconds for computing percentiles and community count |
nodes |
int |
The number of nodes considered |
communityCount |
int |
The number of communities found |
p1 |
double |
The 1 percentile of community size. |
p5 |
double |
The 5 percentile of community size. |
p10 |
double |
The 10 percentile of community size. |
p25 |
double |
The 25 percentile of community size. |
p50 |
double |
The 50 percentile of community size. |
p75 |
double |
The 75 percentile of community size. |
p90 |
double |
The 90 percentile of community size. |
p95 |
double |
The 95 percentile of community size. |
p99 |
double |
The 99 percentile of community size. |
p100 |
double |
The 100 percentile of community size. |
write |
boolean |
Specifies if the result was written back as a node property |
writeProperty |
string |
The property name written back to |
The following will run the algorithm and stream results:
CALL algo.unionFind.stream(label:String, relationship:String,
{weightProperty:'weight', threshold:0.42, defaultValue:1.0, concurrency:4})
YIELD nodeId, setId - yields a setId to each node id
Name | Type | Default | Optional | Description |
---|---|---|---|---|
label |
string |
null |
yes |
The label to load from the graph. If null, load all nodes |
relationship |
string |
null |
yes |
The relationship-type to load from the graph. If null, load all relationships |
concurrency |
int |
available CPUs |
yes |
The number of concurrent threads |
weightProperty |
string |
null |
yes |
The property name that contains weight. If null, treats the graph as unweighted. Must be numeric. |
threshold |
float |
null |
yes |
The value of the weight above which the relationship is not thrown away |
defaultValue |
float |
null |
yes |
The default value of the weight in case it is missing or invalid |
graph |
string |
'heavy' |
yes |
Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement |
Name | Type | Description |
---|---|---|
nodeId |
int |
Node ID |
setId |
int |
Partition ID |
algo.unionFind
If a threshold configuration parameter is supplied, only relationships with a property value higher than the threshold are merged.
algo.unionFind.queue
Union Find
, using ExecutorService
only.
DisjointSetStruct
can be built using just a partition of the nodes, which are then merged pairwise.
DisjointSetStruct
.
As long as there are more elements on the queue, the algorithm takes two, merges them, and adds its result to the queue until
only 1 element remains.
algo.unionFind.forkJoinMerge
algo.unionFind.queue
, the resulting DisjointSetStruct
of each node-partition is merged by the ForkJoin
pool, while the calculation of the DisjointSetStruct
is done by the ExecutorService
.
algo.unionFind.forkJoin
forkJoinPool
algo.unionFind.mscoloring