7.3.2. The Strongly Connected Components algorithm

This section describes the Strongly Connected Components algorithm in the Neo4j Graph Algorithms library.

The Strongly Connected Components (SCC) algorithm finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set. It is often used early in a graph analysis process to help us get an idea of how our graph is structured.

The Strongly Connected Components algorithms are experimental and not officially supported.

This section includes:

7.3.2.1. History and explanation

SCC is one of the earliest graph algorithms, and the first linear-time algorithm was described by Tarjan in 1972. Decomposing a directed graph into its strongly connected components is a classic application of the depth-first search algorithm.

7.3.2.2. Use-cases - when to use the Strongly Connected Components algorithm

  • In the analysis of powerful transnational corporations, SCC can be used to find the set of firms in which every member owns directly and/or indirectly owns shares in every other member. Although it has benefits, such as reducing transaction costs and increasing trust, this type of structure can weaken market competition. Read more in "The Network of Global Corporate Control".
  • SCC can be used to compute the connectivity of different network configurations when measuring routing performance in multihop wireless networks. Read more in "Routing performance in the presence of unidirectional links in multihop wireless networks"
  • Strongly Connected Components algorithms can be used as a first step in many graph algorithms that work only on strongly connected graph. In social networks, a group of people are generally strongly connected (For example, students of a class or any other common place). Many people in these groups generally like some common pages, or play common games. The SCC algorithms can be used to find such groups, and suggest the commonly liked pages or games to the people in the group who have not yet liked those pages or games.

7.3.2.3. Strongly Connected Components algorithm sample

strongly connected components

A directed graph is strongly connected if there is a path between all pairs of vertices. This algorithm treats the graph as directed, which means that the direction of the relationship is important. A strongly connected component only exists if there are relationships between nodes in both direction.

The following will create a sample graph: 

MERGE (nAlice:User {id:'Alice'})
MERGE (nBridget:User {id:'Bridget'})
MERGE (nCharles:User {id:'Charles'})
MERGE (nDoug:User {id:'Doug'})
MERGE (nMark:User {id:'Mark'})
MERGE (nMichael:User {id:'Michael'})

MERGE (nAlice)-[:FOLLOW]->(nBridget)
MERGE (nAlice)-[:FOLLOW]->(nCharles)
MERGE (nMark)-[:FOLLOW]->(nDoug)
MERGE (nMark)-[:FOLLOW]->(nMichael)
MERGE (nBridget)-[:FOLLOW]->(nMichael)
MERGE (nDoug)-[:FOLLOW]->(nMark)
MERGE (nMichael)-[:FOLLOW]->(nAlice)
MERGE (nAlice)-[:FOLLOW]->(nMichael)
MERGE (nBridget)-[:FOLLOW]->(nAlice)
MERGE (nMichael)-[:FOLLOW]->(nBridget);

The following will run the algorithm and write back results: 

CALL algo.scc('User','FOLLOW', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize;

Table 7.16. Results
Name Partition

Alice

1

Bridget

1

Michael

1

Charles

0

Doug

2

Mark

2

We have 3 strongly connected components in our sample graph.

The first, and biggest, component has members Alice, Bridget, and Michael, while the second component has Doug and Mark. Charles ends up in his own component because there isn’t an outgoing relationship from that node to any of the others.

The following will find the largest partition: 

MATCH (u:User)
RETURN u.partition as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC
LIMIT 1

7.3.2.4. Huge graph projection

The default label and relationship-type projection has a limitation of 2 billion nodes and 2 billion relationships. Therefore, if our projected graph contains more than 2 billion nodes or relationships, we will need to use huge graph projection.

Set graph:'huge' in the config: 

CALL algo.scc('User','FOLLOW', {graph:'huge'})
YIELD loadMillis, computeMillis, writeMillis, setCount;

7.3.2.5. Cypher projection

If label and relationship-type are not selective enough to describe your subgraph to run the algorithm on, you can use Cypher statements to load or project subsets of your graph. This can also be used to run algorithms on a virtual graph. You can learn more in the Section 2.2, “Cypher projection” section of the manual.

Set graph:'cypher' in the config: 

CALL algo.scc(
  'MATCH (u:User) RETURN id(u) as id',
  'MATCH (u1:User)-[:FOLLOW]->(u2:User) RETURN id(u1) as source,id(u2) as target',
  {write:true,graph:'cypher'})
YIELD loadMillis, computeMillis, writeMillis;

7.3.2.6. Syntax

The following will run the algorithm and write back results: 

CALL algo.scc(label:String, relationship:String,
    {write:true,writeProperty:'partition',concurrency:4, graph:'heavy'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize

Table 7.17. Parameters
Name Type Default Optional Description

label

string

null

yes

The label to load from the graph. If null, load all nodes

relationship

string

null

yes

The relationship-type to load from the graph. If null, load all relationships

write

boolean

true

yes

Specifies if the result should be written back as a node property

writeProperty

string

'partition'

yes

The property name written back to

concurrency

int

available CPUs

yes

The number of concurrent threads

graph

string

'heavy'

yes

Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 7.18. Results
Name Type Description

loadMillis

int

Milliseconds for loading data

computeMillis

int

Milliseconds for running the algorithm

writeMillis

int

Milliseconds for writing result data back

postProcessingMillis

int

Milliseconds for computing percentiles and community count

nodes

int

The number of nodes considered

communityCount

int

The number of communities found

p1

double

The 1 percentile of community size.

p5

double

The 5 percentile of community size.

p10

double

The 10 percentile of community size.

p25

double

The 25 percentile of community size.

p50

double

The 50 percentile of community size.

p75

double

The 75 percentile of community size.

p90

double

The 90 percentile of community size.

p95

double

The 95 percentile of community size.

p99

double

The 99 percentile of community size.

p100

double

The 100 percentile of community size.

write

boolean

Specifies if the result was written back as a node property

writeProperty

string

The property name written back to

The following will run the algorithm and stream results: 

CALL algo.scc.stream(label:String, relationship:String, {concurrency:4})
YIELD nodeId, partition

Table 7.19. Parameters
Name Type Default Optional Description

label

string

null

yes

The label to load from the graph. If null, load all nodes

relationship

string

null

yes

The relationship-type to load from the graph. If null, load all relationships

concurrency

int

available CPUs

yes

The number of concurrent threads

graph

string

'heavy'

yes

Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement

Table 7.20. Results
Name Type Description

nodeId

int

Node ID

partition

int

Partition ID

7.3.2.7. Implementations

algo.scc

  • Iterative adaptation (same as algo.scc.iterative).

algo.scc.recursive.tarjan

  • Original recursive tarjan implementation.

algo.scc.recursive.tunedTarjan

  • Also a recursive tarjan implementation.

algo.scc.iterative

  • Iterative adaption of tarjan algorithm.

algo.scc.multistep

  • Parallel SCC algorithm.