Community Detection
This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client Github repository.
The notebook shows the usage of the graphdatascience
library for
community detection on the Reddit Hyperlink Network dataset that can be
downloaded
here. We will
use the soc-redditHyperlinks-body.tsv
file.
The tasks we cover here include performing initial graph preprocessing using Weakly Connected Components and then performing community detection on the largest component using the Louvain algorithm.
1. Setup
We start by importing our dependencies and setting up our GDS client connection to the database.
# Install necessary dependencies
%pip install graphdatascience pandas
import os
import pandas as pd
from graphdatascience import GraphDataScience
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
NEO4J_AUTH = (
os.environ.get("NEO4J_USER"),
os.environ.get("NEO4J_PASSWORD"),
)
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
from graphdatascience import ServerVersion
assert gds.server_version() >= ServerVersion(2, 5, 0)
2. Importing the dataset
We import the dataset as a pandas dataframe first. We work with only a subset of the dataset. The sampled data is only till 1st March 2014.
df = pd.read_csv("https://snap.stanford.edu/data/soc-redditHyperlinks-body.tsv", sep="\t")
df = df[df["TIMESTAMP"] < "2014-03-01 02:51:13"]
df.head()
The LINK_SENTIMENT
column tells if there is a positive (+1) or
negative (-1) relationship from the source subreddit to destination
subreddit. We filter out the negative sentiment relationships as they
won’t add to any meaningful communities. We also drop duplicate
relationships.
relationship_df = df[df["LINK_SENTIMENT"] == 1]
columns = ["SOURCE_SUBREDDIT", "TARGET_SUBREDDIT"]
relationship_df = relationship_df[columns]
relationship_df = relationship_df.drop_duplicates()
relationship_df.head()
Next, we get a list of all the distinct nodes (source or destination) and load them as a dataframe.
# Get unique nodes for each column
source_nodes = pd.Series(df["SOURCE_SUBREDDIT"])
target_nodes = pd.Series(df["TARGET_SUBREDDIT"])
# Get unique nodes for both columns
all_nodes = pd.Series(pd.concat([df["SOURCE_SUBREDDIT"], df["TARGET_SUBREDDIT"]])).unique()
# Create new dataframe with distinct nodes
nodes_df = pd.DataFrame({"SUBREDDIT": all_nodes})
nodes_df.head()
Finally, we load this data (nodes and edges) into a Graph Database and a GDS graph.
gds.run_cypher(
"UNWIND $nodes AS node CREATE (n:Subreddit {name: node.SUBREDDIT})",
params={"nodes": nodes_df.to_dict("records")},
)
gds.run_cypher(
"""
UNWIND $rels AS rel
MATCH (source:Subreddit {name: rel.SOURCE_SUBREDDIT}), (target:Subreddit {name: rel.TARGET_SUBREDDIT})
CREATE (source)-[:HYPERLINKED_TO]->(target)
""",
params={"rels": relationship_df.to_dict("records")},
)
G, result = gds.graph.project("reddit", "Subreddit", "HYPERLINKED_TO")
print(f"The projection took {result['projectMillis']} ms")
# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")
gds.graph.list()
3. Weakly Connected Components
A graph dataset need not always be connected. That is, there may not exist a path from every node to every other node in the graph dataset (subgraphs in it may not be connected to each other at all). Hence, we need to find the total number of nodes in each subgraph to see if it is big enough for further graph analysis. Smaller subgraphs or lone nodes will not contribute to the community detection task and should be eliminated. Weakly Connected Components is often used as one of the early steps of graph preprocessing.
We use the Weakly Connected Components algorithm to find sets of connected nodes and assign each set a component ID.
result = gds.wcc.mutate(G, mutateProperty="componentId")
print(f"Components found: {result.componentCount}")
# We can verify that the componentId was mutated
G.node_properties()
Next, we will see the size of each connected component and depending on that, we can pick the subgraph that needs further analysis.
We use run_cypher
here instead of the direct GDS client call since
we want to see the size of the connected components.
query = """
CALL gds.graph.nodeProperties.stream('reddit', 'componentId')
YIELD nodeId, propertyValue
WITH gds.util.asNode(nodeId).name AS node, propertyValue AS componentId
WITH componentId, collect(node) AS subreddits
WITH componentId, subreddits, size(subreddits) AS componentSize
RETURN componentId, componentSize, subreddits
ORDER BY componentSize DESC
"""
components = gds.run_cypher(query)
components
largest_component = components["componentId"][0]
print(f"The largest component has the id {largest_component} with {components['componentSize'][0]} subreddits.")
For our further analysis we will work only with that subgraph.
largest_component_graph, _ = gds.graph.filter(
"largest_connected_components", G, f"n.componentId={largest_component}", "*"
)
largest_component_graph
4. Community Detection using Louvain
We use the
Louvain
algorithm to detect communities in our subgraph and assign a
louvainCommunityId
to each community.
gds.louvain.mutate(largest_component_graph, mutateProperty="louvainCommunityId")
We get a modularity score of 0.5898 for our community detection algorithm.
gds.graph.nodeProperties.write(largest_component_graph, ["louvainCommunityId"])
We can also check that the property was written by the below command.
gds.run_cypher(
"""
MATCH (n) WHERE 'louvainCommunityId' IN keys(n)
RETURN n.name, n.louvainCommunityId LIMIT 10
"""
)
Now we want to inspect the communities produced by Louvain.
query = """
CALL gds.graph.nodeProperties.stream('largest_connected_components', 'louvainCommunityId')
YIELD nodeId, propertyValue
WITH gds.util.asNode(nodeId).name AS node, propertyValue AS communityId
WITH communityId, collect(node) AS subreddits
WITH communityId, subreddits, size(subreddits) AS communitySize
RETURN communityId, communitySize, subreddits
ORDER BY communitySize DESC
"""
communities = gds.run_cypher(query)
communities
6. Cleanup
Before finishing we can clean up the example data from both the GDS in-memory state and the database.
# Cleanup GDS
largest_component_graph.drop()
G.drop()
# Cleanup database
gds.run_cypher("MATCH (n:Subreddit) DETACH DELETE n")
7. References
Srijan Kumar, William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2018. Community Interaction and Conflict on the Web. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 933–943. https://doi.org/10.1145/3178876.3186141