GraphGists

Every developer has a tab open Stack Overflow, the massively popular question-and-answer site. This dataset explores questions, answers, comments, and tags related to the neo4j- and cypher- tagged questions on Stack Overflow.

Load JSON Import

Update this dataset using apoc.load.json. Run the query as-is or add a tag to the tags array.

// look for several pages of questions
WITH ["neo4j","cypher"] as tags
UNWIND tags as tagName
UNWIND range(1,2) as page // careful with throttling
WITH "https://api.stackexchange.com/2.3/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged="+tagName+"&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" as url
CALL apoc.load.json(url) YIELD value
CALL apoc.util.sleep(250)
UNWIND value.items AS q
// create the questions
MERGE (question:Question {uuid:q.question_id})
  ON CREATE SET question.title = q.title,
  	question.link = q.share_link,
  	question.creation_date = q.creation_date,
  	question.accepted_answer_id=q.accepted_answer_id,
  	question.view_count=q.view_count,
   	question.answer_count=q.answer_count,
   	question.body_markdown=q.body_markdown
 // who asked the question
MERGE (owner:User {uuid:coalesce(q.owner.user_id,'deleted')})
  ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)
// what tags do the questions have
FOREACH (tagName IN q.tags |
  MERGE (tag:Tag {name:tagName})
    ON CREATE SET tag.link = "https://stackoverflow.com/questions/tagged/" + tag.name
  MERGE (question)-[:TAGGED]->(tag))
// who answered the questions?
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERED]-(answer:Answer {uuid:a.answer_id})
    ON CREATE SET answer.is_accepted = a.is_accepted,
    answer.link=a.share_link,
    answer.title=a.title,
    answer.body_markdown=a.body_markdown,
    answer.score=a.score,
   	answer.favorite_score=a.favorite_score,
   	answer.view_count=a.view_count
   MERGE (answerer:User {uuid:coalesce(a.owner.user_id,'deleted')})
    ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)
FOREACH (c in q.comments |
  MERGE (question)<-[:COMMENTED_ON]-(comment:Comment {uuid:c.comment_id})
    ON CREATE SET comment.link=c.link, comment.score=c.score
  MERGE (commenter:User {uuid:coalesce(c.owner.user_id,'deleted')})
    ON CREATE SET commenter.display_name = c.owner.display_name
  MERGE (comment)<-[:COMMENTED]-(commenter)
);

The Shape of the Data

---
CALL db.schema.visualization;
---

The Model

Basic Queries

Labels in the graph and counts for each label:

MATCH (n)
RETURN labels(n) as label, count(*);

Top Tags:

MATCH (q:Question)-[:TAGGED]->(t:Tag)
RETURN t.name,  count(q) AS questions
ORDER BY questions DESC
LIMIT 5;

Exploring Users

Top users asking questions:

MATCH (u:User)-[:ASKED]->(q:Question)
RETURN u.display_name, count(*) AS questions
ORDER by questions DESC
LIMIT 10;

Top users answering:

MATCH (u:User)-[:PROVIDED]->(a:Answer)-[:ANSWERED]->(q:Question)
RETURN u.display_name as user,COUNT(a) AS answers
ORDER BY answers DESC LIMIT 10;

Shortest path between users:

MATCH path = allShortestPaths((u1:User {display_name:"alexanoid"})-[*]-(u2:User {display_name:"InverseFalcon"})
)
RETURN path LIMIT 1;

Unanswered Questions

Tags for unanswered questions:

MATCH (q:Question)-[:TAGGED]->(t:Tag)
WHERE not t.name IN ['neo4j','cypher'] AND NOT (q)<-[:ANSWERED]-()
RETURN t.name as tag, count(q) AS questions
ORDER BY questions DESC LIMIT 10;

Correlation

Tag correlations:

MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
RETURN t1.name, t2.name,count(*) as freq
ORDER BY freq desc LIMIT 10;

Engagement

User engagement over time:

MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->(q:Question)-[:TAGGED]->(t:Tag)
WHERE u.display_name = "InverseFalcon"
RETURN apoc.date.format(q.creation_date,'s','yyyy-MM') as month, count(distinct q) as count, collect(distinct t.name) as tags
ORDER BY month asc

Virtual Graphs

Project tags via co-occurrence:

MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
WITH t1, t2,count(*) as freq  where freq > 3
RETURN t1,t2, apoc.create.vRelationship(t1,'OCCURRED',{freq:freq},t2) as rel