GraphGist: Insights from GitHub public timeline

by Harish Chakravarthy


Inspiration

Open-source developers all over the world are working on millions of projects: writing code and documentation, filing and fixing bugs, and so forth. What is trending now on GitHub? Who are the contributors? This curiosity got me started on this journey. Using different event types from the GitHub public timeline, I gathered data for further analysis, to build a story and offer insights.

Last month I launched AskGitHub to answer questions based on the GitHub public timeline. Building a graph model and learning Cypher has got me thinking and energized! I just can’t wait to implement new features using Cypher. Feel free to contribute to AskGitHub on GitHub.

This Neo4j GraphGist is an entry for the GraphGist Winter Challenge 2015.


Data Source

The public GitHub timeline from GitHub Archive is parsed hourly using a node.js streaming parser. Currently the event types PushEvent, CreateEvent and WatchEvent are captured. PushEvent contains information about commits and authors. CreateEvent contains new repositories. WatchEvent contains information about popular repositories. All the data is first stored in MongoDB, then processed by Neo4jSync.py to generate CSV files, which are imported into GrapheneDB. This data model will change. Hello Neo4j!
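A minimal sketch of the CSV-generation step, in Python. This is not the actual Neo4jSync.py: the event dictionaries, their field names (repo, org, email) and the helper write_rows are assumptions made for illustration. Only the pipe delimiter and the id/now and a/b column headers are taken from the LOAD CSV statements in the Insights section.

```python
import csv
import io
import time

# Hypothetical, simplified events; real GitHub Archive records carry many more fields.
events = [
    {"type": "PushEvent", "repo": "openstack/nova", "org": "openstack", "email": "dev1@example.com"},
    {"type": "PushEvent", "repo": "openstack/nova", "org": "openstack", "email": "dev2@example.com"},
    {"type": "PushEvent", "repo": "openstack/neutron", "org": "openstack", "email": "dev1@example.com"},
]


def write_rows(header, rows):
    """Return pipe-delimited CSV text with a header row (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


now = int(time.time())  # timestamp stored as created_at on each node

# De-duplicate nodes and relationships before writing them out.
repos = sorted({e["repo"] for e in events})
orgs = sorted({e["org"] for e in events})
people = sorted({e["email"] for e in events})
in_org = sorted({(e["repo"], e["org"]) for e in events})
is_actor = sorted({(e["repo"], e["email"]) for e in events})

repositories_csv = write_rows(["id", "now"], [(r, now) for r in repos])
organizations_csv = write_rows(["id", "now"], [(o, now) for o in orgs])
people_csv = write_rows(["id", "now"], [(p, now) for p in people])
inorganization_csv = write_rows(["a", "b"], in_org)
isactor_csv = write_rows(["a", "b"], is_actor)

print(isactor_csv)
```

De-duplicating in Python keeps the files small; the MERGE and CREATE UNIQUE clauses used at import time would tolerate duplicates anyway.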


Data Model

Currently there are three types of nodes: Repository, Organization and People. A Repository node contains information about the repository and when the node was created. An Organization node contains information about the organization a specific repository belongs to and when the node was created. A People node contains information about a contributor (the contributor's email address) and when the node was created. An IN_ORGANIZATION relationship exists between a Repository node and an Organization node. An IS_ACTOR relationship exists between a Repository node and a People node. There can be more than one person contributing to a repository.

Nodes & Relationships model developed using YUML

Screenshot #1: Repositories for organization openstack

Screenshot #2: Repository openstack/openstack


Insights

//Clean up
MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r;

//Unique Constraints
CREATE CONSTRAINT ON (a:Repository) ASSERT a.id IS UNIQUE;
CREATE CONSTRAINT ON (b:People) ASSERT b.id IS UNIQUE;
CREATE CONSTRAINT ON (c:Organization) ASSERT c.id IS UNIQUE;

// Data set is from GitHub public timeline ~ 8:00 PM PST, March 20, 2015

//Load Repositories
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/harishvc/githubanalytics/master/cypher-dataset-20March2015-211400/repositories.csv' AS csvLine FIELDTERMINATOR '|'
MERGE (r:Repository {id: csvLine.id})
SET r.created_at = toInt(csvLine.now);

//Load Organizations
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/harishvc/githubanalytics/master/cypher-dataset-20March2015-211400/organizations.csv' AS csvLine FIELDTERMINATOR '|'
MERGE (o:Organization {id: csvLine.id})
SET o.created_at = toInt(csvLine.now);

//Create relationship between repository and organization
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/harishvc/githubanalytics/master/cypher-dataset-20March2015-211400/inorganization-relations.csv' AS csvLine FIELDTERMINATOR '|'
MATCH (a:Repository { id: csvLine.a}),(b:Organization {id: csvLine.b})
CREATE UNIQUE (a)-[:IN_ORGANIZATION]->(b);

//Load People
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/harishvc/githubanalytics/master/cypher-dataset-20March2015-211400/people.csv' AS csvLine FIELDTERMINATOR '|'
MERGE (p:People {id: csvLine.id})
SET p.created_at = toInt(csvLine.now);

//Create relationship between repository and people
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/harishvc/githubanalytics/master/cypher-dataset-20March2015-211400/isactor-relations.csv' AS csvLine FIELDTERMINATOR '|'
MATCH (a:Repository { id: csvLine.a}),(b:People {id: csvLine.b})
CREATE UNIQUE (a)-[:IS_ACTOR]->(b);
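
The load script above uses MERGE and CREATE UNIQUE, so running it a second time should not duplicate nodes or relationships. As a rough analogy only, the Python sketch below mimics that behaviour with a toy in-memory graph. The Graph class, its method names and the sample values are hypothetical and are not part of any Neo4j API.

```python
class Graph:
    """Toy in-memory graph; a hypothetical stand-in, not a Neo4j client."""

    def __init__(self):
        self.nodes = {}             # (label, id) -> properties
        self.relationships = set()  # (start_key, type, end_key)

    def merge_node(self, label, node_id, **props):
        # Like MERGE ... SET: reuse the node if it exists, then update properties.
        node = self.nodes.setdefault((label, node_id), {"id": node_id})
        node.update(props)
        return (label, node_id)

    def create_unique(self, start, rel_type, end):
        # Like MATCH ... CREATE UNIQUE: both endpoints must already exist,
        # and an identical relationship is never added twice.
        if start in self.nodes and end in self.nodes:
            self.relationships.add((start, rel_type, end))


def load(graph):
    # Sample values are made up for illustration.
    repo = graph.merge_node("Repository", "openstack/nova", created_at=1426907640)
    person = graph.merge_node("People", "dev1@example.com", created_at=1426907640)
    graph.create_unique(repo, "IS_ACTOR", person)


g = Graph()
load(g)
load(g)  # the second run leaves the graph unchanged
print(len(g.nodes), len(g.relationships))
```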

Insight #1: How many repositories?

MATCH (a:Repository) RETURN count(a) AS repositories;

Insight #2: How many people?

MATCH (b:People) RETURN count(b) AS people;

Insight #3: How many organizations?

MATCH (c:Organization) RETURN count(c) AS organization;

Insight #4: Similar Repositories

Limitation: grouping by the number of contributors will exclude repositories whose commits are made using a single account.

MATCH (a:Repository)-[:IS_ACTOR]->(p:People)<-[:IS_ACTOR]-(b:Repository)
WHERE a.id > b.id //report each pair of repositories only once
WITH a, b, collect(DISTINCT p.id) AS connections
WHERE length(connections) >= 1 //set minimum # of connections
RETURN a.id, b.id, length(connections) AS count
ORDER BY count DESC
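
To make the logic of this query easier to follow, here is a plain-Python sketch of the same idea: two repositories count as similar when they share contributors. The sample pairs and the function name similar_repositories are assumptions for illustration and are not taken from the data set.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (repository, contributor) pairs, mirroring IS_ACTOR relationships.
is_actor = [
    ("openstack/nova", "dev1@example.com"),
    ("openstack/nova", "dev2@example.com"),
    ("openstack/neutron", "dev1@example.com"),
    ("openstack/neutron", "dev2@example.com"),
    ("openstack/glance", "dev2@example.com"),
    ("example/solo", "dev3@example.com"),
]


def similar_repositories(pairs, minimum=1):
    """Return (a, b, shared_count) rows, most shared contributors first."""
    repos_by_person = defaultdict(set)
    for repo, person in pairs:
        repos_by_person[person].add(repo)

    shared = defaultdict(set)
    for person, repos in repos_by_person.items():
        # Sorting yields each unordered pair once, like WHERE a.id > b.id.
        for b, a in combinations(sorted(repos), 2):
            shared[(a, b)].add(person)

    rows = [(a, b, len(p)) for (a, b), p in shared.items() if len(p) >= minimum]
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


rows = similar_repositories(is_actor)
for row in rows:
    print(row)
```

In this sample, example/solo never appears in the result because its only contributor works on no other repository.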

Insight #5: Active Organizations

MATCH (a:Repository)-[:IN_ORGANIZATION]->(b:Organization)
RETURN b.id AS organization, count(a) AS count, collect(DISTINCT a.id) AS repositories
ORDER BY count DESC

Insight #6: Active Contributors

Limitation: grouping by the number of repositories a person has contributed to will also include the specific case where commits are pushed via a single GitHub account to numerous repositories.

MATCH (a:Repository)-[:IS_ACTOR]->(b:People)
WITH b.id AS contributor, count(a) AS count, collect(DISTINCT a.id) AS repositories
RETURN contributor, count, repositories
ORDER BY count DESC
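
The limitation noted above can be seen in a small Python sketch of the same grouping. The sample pairs, including the shared ci-bot account, and the function name active_contributors are hypothetical.

```python
from collections import defaultdict

# Hypothetical (repository, contributor) pairs; ci-bot stands for a single
# account that pushes commits to many repositories.
is_actor = [
    ("openstack/nova", "dev1@example.com"),
    ("openstack/neutron", "dev1@example.com"),
    ("openstack/nova", "dev2@example.com"),
    ("example/a", "ci-bot@example.com"),
    ("example/b", "ci-bot@example.com"),
    ("example/c", "ci-bot@example.com"),
]


def active_contributors(pairs):
    """Return (contributor, repository_count, repositories), busiest first."""
    repos_by_person = defaultdict(set)
    for repo, person in pairs:
        repos_by_person[person].add(repo)
    rows = [(person, len(repos), sorted(repos))
            for person, repos in repos_by_person.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


ranking = active_contributors(is_actor)
for row in ranking:
    print(row)
```

Here the shared account ranks first, ahead of the individual developers, which is the case the limitation describes.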

Next Steps

Building nodes and relationships, understanding Cypher, and learning from the different use cases posted on Neo4j GraphGist has got me thinking and energized! I just can’t wait to implement new features inside AskGitHub. I am also thinking about content analysis and adding repository languages.

Perspectives welcome!
