StackOverflow

Stackoverflow is the question and answer site for software developers.

The Neo4j tag alone has more than 15.000 questions and answers.

You can participate and contribute yourself by answering Neo4j related questions.

We wrote about the stackoverflow API before:

Look at JSON

StackExchange has an API at api.stackexchange.com that allows querying for questions, users, tags.

Here we use the query API by tag.

api.stackexchange.com/2.2/questions?page=1&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow

The response looks like this:

{"items": [
  {"tags": ["neo4j"],
   "owner": {
     "reputation": 105,
     "user_id": 6566172,
     "user_type": "registered",
     "accept_rate": 53,
     "display_name": "scroobius",
     "link": "https://stackoverflow.com/users/6566172/scroobius"
   },
   "is_answered": false,
   "view_count": 6,
   "favorite_count": 0,
   "down_vote_count": 0,
   "up_vote_count": 0,
   "last_activity_date": 1522116996,
   "creation_date": 1522116996,
   "question_id": 49503383,
   "share_link": "https://stackoverflow.com/q/49503383",
   "body_markdown": "I have two kinds of nodes in my database:..."

Which we can load into Neo4j and deconstruct in Cypher to create a graph.

Graph Model

(:User)-[:ASKED]->(:Question)
(:Question)-[:TAGGED]->(:Tag)

(:User)-[:PROVIDED]->(:Answer)
(:Answer)-[:ANSWERED]->(:Question)

Import

WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value

UNWIND value.items AS q

MERGE (question:Question {id:q.question_id})
  ON CREATE SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count, question.creation_date = q.creation_date

WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value

UNWIND value.items AS q
MATCH (question:Question {id: q.question_id})
MERGE (owner:User {id:q.owner.user_id})
  ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)

Import, part 2

WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value

UNWIND value.items AS q
MATCH (question:Question {id: q.question_id})

FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERED]-(answer:Answer {id:a.answer_id})
   MERGE (answerer:User {id:a.owner.user_id}) ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)

Indexes and Constraints

Create indexes / constraints (ensure multistatement option is enabled with a checked box in Neo4j Browser settings):

CREATE CONSTRAINT on (q:Question) ASSERT q.id IS UNIQUE;
CREATE CONSTRAINT on (t:Tag) ASSERT t.name IS UNIQUE;
CREATE CONSTRAINT on (u:User) ASSERT u.id IS UNIQUE;

Larger Import (10 pages)

Importing a larger amount of data:

UNWIND range(1,10) as page
WITH "https://api.stackexchange.com/2.2/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value

UNWIND value.items AS q

MERGE (question:Question {id:q.question_id})
  ON CREATE SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count, question.creation_date = q.creation_date

UNWIND range(1,10) as page
WITH "https://api.stackexchange.com/2.2/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value

UNWIND value.items AS q
MATCH (question:Question {id:q.question_id})

MERGE (owner:User {id:coalesce(q.owner.user_id,'deleted')})
  ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)

Larger import, part 2

UNWIND range(1,10) as page
WITH "https://api.stackexchange.com/2.2/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value

UNWIND value.items AS q
MATCH (question:Question {id:q.question_id})

FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERED]-(answer:Answer {id:a.answer_id})
   MERGE (answerer:User {id:coalesce(a.owner.user_id,'deleted')}) ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)

Statistics

Labels in the graph and counts for each label:

MATCH (n)
RETURN labels(n) as label, count(*);

Top Tags:

MATCH ()-[:TAGGED]->(t:Tag)
RETURN t.name,count(*) as posts order by posts desc limit 5;

The Top 10 Stack Overflow Users

Unanswered Questions

Tags for unanswered questions:

MATCH (q:Question)-[:TAGGED]->(t:Tag)
WHERE not t.name IN ['neo4j','cypher'] and not (q)<-[:ANSWERED]-()
RETURN t.name, count(*) as posts
ORDER BY posts desc LIMIT 10;

Days for unanswered questions:

MATCH (q:Question)
WITH apoc.date.format(q.creation_date,'s','EE') as day, count(*) as posts, sum(case when not (q)<-[:ANSWERED]-() then 1 else 0 end) as answered
RETURN day, posts, answered, 100*answered/posts as percentage
ORDER BY percentage desc LIMIT 10

Correlation

Tag correlations:

MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
RETURN t1.name, t2.name,count(*) as freq
ORDER BY freq desc LIMIT 10;

Engagement

User engagement over time:

MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->(q:Question)-[:TAGGED]->(t:Tag)
WHERE u.display_name = "InverseFalcon"
RETURN apoc.date.format(q.creation_date,'s','yyyy-MM') as month, count(distinct q) as count, collect(distinct t.name) as tags
ORDER BY month asc

MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->(q:Question)-[:TAGGED]->(t:Tag)
WHERE u.display_name = "Bruno Peres"
RETURN apoc.date.format(q.creation_date,'s','yyyy-MM') as month, count(distinct q) as count, collect(distinct t.name) as tags
ORDER BY month asc

Virtual Graphs

Project tags via co-occurrence:

MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
WITH t1, t2,count(*) as freq  where freq > 3
RETURN t1,t2, apoc.create.vRelationship(t1,'OCCURRED',{freq:freq},t2) as rel

Is this page helpful?

GraphGists

StackOverflow

StackOverflow

Look at JSON

Graph Model

Import

Import, part 2

Indexes and Constraints

Larger Import (10 pages)

Larger import, part 2

Statistics

The Top 10 Stack Overflow Users

The Top 10 Stack Overflow Users

Unanswered Questions

Correlation

Engagement

Virtual Graphs