StackOverflow
StackOverflow
Stackoverflow is the question and answer site for software developers.
The Neo4j tag alone has more than 15.000 questions and answers.
You can participate and contribute yourself by answering Neo4j related questions.
We wrote about the stackoverflow API before:
Look at JSON
StackExchange has an API at api.stackexchange.com that allows querying for questions, users, tags.
Here we use the query API by tag.
api.stackexchange.com/2.2/questions?page=1&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow
The response looks like this:
{"items": [
{"tags": ["neo4j"],
"owner": {
"reputation": 105,
"user_id": 6566172,
"user_type": "registered",
"accept_rate": 53,
"display_name": "scroobius",
"link": "https://stackoverflow.com/users/6566172/scroobius"
},
"is_answered": false,
"view_count": 6,
"favorite_count": 0,
"down_vote_count": 0,
"up_vote_count": 0,
"last_activity_date": 1522116996,
"creation_date": 1522116996,
"question_id": 49503383,
"share_link": "https://stackoverflow.com/q/49503383",
"body_markdown": "I have two kinds of nodes in my database:..."
Which we can load into Neo4j and deconstruct in Cypher to create a graph.
Graph Model
(:User)-[:ASKED]->(:Question) (:Question)-[:TAGGED]->(:Tag) (:User)-[:PROVIDED]->(:Answer) (:Answer)-[:ANSWERED]->(:Question)
Import
WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MERGE (question:Question {id:q.question_id})
ON CREATE SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count, question.creation_date = q.creation_date
WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MATCH (question:Question {id: q.question_id})
MERGE (owner:User {id:q.owner.user_id})
ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)
Import, part 2
WITH "https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MATCH (question:Question {id: q.question_id})
FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
MERGE (question)<-[:ANSWERED]-(answer:Answer {id:a.answer_id})
MERGE (answerer:User {id:a.owner.user_id}) ON CREATE SET answerer.display_name = a.owner.display_name
MERGE (answer)<-[:PROVIDED]-(answerer)
)
Indexes and Constraints
Create indexes / constraints (ensure multistatement option is enabled with a checked box in Neo4j Browser settings):
CREATE CONSTRAINT on (q:Question) ASSERT q.id IS UNIQUE;
CREATE CONSTRAINT on (t:Tag) ASSERT t.name IS UNIQUE;
CREATE CONSTRAINT on (u:User) ASSERT u.id IS UNIQUE;
Larger Import (10 pages)
Importing a larger amount of data:
UNWIND range(1,10) as page
WITH "https://api.stackexchange.com/2.2/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MERGE (question:Question {id:q.question_id})
ON CREATE SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count, question.creation_date = q.creation_date
UNWIND range(1,10) as page
WITH "https://api.stackexchange.com/2.2/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MATCH (question:Question {id:q.question_id})
MERGE (owner:User {id:coalesce(q.owner.user_id,'deleted')})
ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)
Larger import, part 2
UNWIND range(1,10) as page
WITH "https://api.stackexchange.com/2.2/questions?page="+page+"&pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.items AS q
MATCH (question:Question {id:q.question_id})
FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
MERGE (question)<-[:ANSWERED]-(answer:Answer {id:a.answer_id})
MERGE (answerer:User {id:coalesce(a.owner.user_id,'deleted')}) ON CREATE SET answerer.display_name = a.owner.display_name
MERGE (answer)<-[:PROVIDED]-(answerer)
)
Statistics
Labels in the graph and counts for each label:
MATCH (n)
RETURN labels(n) as label, count(*);
Top Tags:
MATCH ()-[:TAGGED]->(t:Tag)
RETURN t.name,count(*) as posts order by posts desc limit 5;
The Top 10 Stack Overflow Users
Top users asking questions:
MATCH (u:User)-[:ASKED]->()
RETURN u.display_name, count(*) as posts
ORDER by posts desc limit 10;
Top users answering:
MATCH (u:User)-[:PROVIDED]->()
RETURN u.display_name, count(*) as posts
ORDER by posts desc limit 10;
Top days for questions:
MATCH (q:Question)
RETURN apoc.date.format(q.creation_date,'s','EE') as day, count(*) as freq
ORDER BY freq desc;
The Top 10 Stack Overflow Users
Top times for questions (GMT?):
MATCH (q:Question)
RETURN apoc.date.format(q.creation_date,'s','HH') as hour, count(*) as freq
ORDER BY freq desc LIMIT 7;
Top tags per user:
MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->()-[:TAGGED]->(t:Tag)
WHERE u.display_name = "cybersam"
RETURN t.name,count(*) as posts order by posts desc limit 10;
Shortest path between users:
MATCH path = allShortestPaths(
(u:User {display_name:"alexanoid"})-[*]-(me:User {display_name:"InverseFalcon"})
)
RETURN path;
Unanswered Questions
Tags for unanswered questions:
MATCH (q:Question)-[:TAGGED]->(t:Tag)
WHERE not t.name IN ['neo4j','cypher'] and not (q)<-[:ANSWERED]-()
RETURN t.name, count(*) as posts
ORDER BY posts desc LIMIT 10;
Days for unanswered questions:
MATCH (q:Question)
WITH apoc.date.format(q.creation_date,'s','EE') as day, count(*) as posts, sum(case when not (q)<-[:ANSWERED]-() then 1 else 0 end) as answered
RETURN day, posts, answered, 100*answered/posts as percentage
ORDER BY percentage desc LIMIT 10
Correlation
Tag correlations:
MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
RETURN t1.name, t2.name,count(*) as freq
ORDER BY freq desc LIMIT 10;
Engagement
User engagement over time:
MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->(q:Question)-[:TAGGED]->(t:Tag)
WHERE u.display_name = "InverseFalcon"
RETURN apoc.date.format(q.creation_date,'s','yyyy-MM') as month, count(distinct q) as count, collect(distinct t.name) as tags
ORDER BY month asc
MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->(q:Question)-[:TAGGED]->(t:Tag)
WHERE u.display_name = "Bruno Peres"
RETURN apoc.date.format(q.creation_date,'s','yyyy-MM') as month, count(distinct q) as count, collect(distinct t.name) as tags
ORDER BY month asc
Virtual Graphs
Project tags via co-occurrence:
MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
WITH t1, t2,count(*) as freq where freq > 3
RETURN t1,t2, apoc.create.vRelationship(t1,'OCCURRED',{freq:freq},t2) as rel
Is this page helpful?