Cypher: LOAD JSON from URL AS Data


Update: Much of this got much easier today with user defined procedures, like apoc.load.json, which add this kind of capability to Cypher directly.

Neo4j’s query language Cypher supports loading data from CSV directly but not from JSON files or URLs.

Almost every site offers some kind of API or endpoint that returns JSON and we can also query many NOSQL databases via HTTP and get JSON responses back.

It’s quite useful to be able to ingest document structured information from all those different sources into a more usable graph model.

I want to show here that retrieving that data and ingesting it into Neo4j using Cypher is really straightforward and takes only little effort.

As Cypher is already pretty good at deconstructing nested documents, it’s actually not that hard to achieve it from a tiny program.

I want to show you today how you can achieve this from Python, Javascript, Ruby, Java, and Bash.

The Domain: Stack Overflow


Being a developer I love Stack Overflow; just crossed 20k reputation by only answering 1100 Neo4j-related questions :). You can do that too. That’s why I want to use Stack Overflow users with their questions, answers, comments and tags as our domain today.

Loading JSON Data from a Stack Overflow URL


Pulling Stack Overflow information into a graph model allows me to find interesting insights, like:

    • What are the people asking or answering about Neo4j also interested in
    • How is their activity distributed across tags and between questions, answers and comments
    • Which kinds of questions attract answers and which don’t
    • Looking at my own data, which answers to what kinds of questions got the highest approval rates
We need some data and a model suited to answer those questions.

Stack Overflow API


Stack Overflow offers an API to retrieve that information, it’s credential protected as usual, but there is the cool option to pre-generate an API-URL that encodes your secrets and allows you to retrieve data without sharing them. You can still control some parameters like tags, page size and page-number though.

With this API-URL below, we load the last 10 questions with the Neo4j tag.

https://api.stackexchange.com/2.2/questions?pagesize=100&order=desc&sort=creation&tagged=neo4j&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf

The response should look something like this (or scroll to the far bottom).

Overall Response Structure
{ "items": [{
	"question_id": 24620768,
	"link": "https://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
	"title": "Neo4j cypher query: get last N elements",
 	"answer_count": 1,
 	"score": 1,
 	.....
 	"creation_date": 1404771217,
 	"body_markdown": "I have a graph....How can I do that?",
 	"tags": ["neo4j", "cypher"],
 	"owner": {
 		"reputation": 815,
 		"user_id": 1212067,
        ....
 		"link": "https://stackoverflow.com/users/1212067/"
 	},
 	"answers": [{
 		"owner": {
 			"reputation": 488,
 			"user_id": 737080,
 			"display_name": "Chris Leishman",
            ....
 		},
 		"answer_id": 24620959,
 		"share_link": "https://stackoverflow.com/a/24620959",
        ....
 		"body_markdown": "The simplest would be to use an ... some discussion on this here:...",
 		"title": "Neo4j cypher query: get last N elements"
 	}]
 }

Graph Model


So what does the graph-model look like? We can develop it by looking at the questions we want to answer and the entities and relationships they refer to.

We need this model upfront to know where to put our data when we insert it into the graph. After all we don’t want to have loose ends.

Discover How to LOAD JSON Files from URLs AS Graph-Ready Data


Cypher Import Statement


The Cypher query to create that domain is also straightforward. You can deconstruct maps with dot notation map.key and arrays with slices array[0..4]. You’d use UNWIND to convert collections into rows and FOREACH to iterate over a collection with update statements. To create nodes and relationships we use MERGE and CREATE commands.

My friend Mark just published a blog post explaining in detail how you apply these operations to your data.

The JSON response that we retrieved from the API call is passed in as a parameter {json} to the Cypher statement, which we alias with the more handy data identifier. Then we use the aforementioned means to extract the relevant information out of the data collection of questions, treating each as q.

For each question we access the direct attributes but also related information like the owner or contained collections like tags or answers which we deconstruct in turn.

WITH {json} as data
UNWIND data.items as q
MERGE (question:Question {id:q.question_id}) ON CREATE
  SET question.title = q.title, question.share_link = q.share_link, question.favorite_count = q.favorite_count

MERGE (owner:User {id:q.owner.user_id}) ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)

FOREACH (tagName IN q.tags | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag))
FOREACH (a IN q.answers |
   MERGE (question)<-[:ANSWERS]-(answer:Answer {id:a.answer_id})
   MERGE (answerer:User {id:a.owner.user_id}) ON CREATE SET answerer.display_name = a.owner.display_name
   MERGE (answer)<-[:PROVIDED]-(answerer)
)

Calling Cypher with the JSON parameters


To pass in the JSON to Cypher we have to programmatically call the Cypher endpoint of the Neo4j server, which can be done via one of the many drivers for Neo4j or manually by POSTing the necessary payload to Neo4j. We can also call the Java API.

So without further ado here are our examples for a selection of different languages, drivers and APIs:

Python

We use the py2neo driver by Nigel Small to execute the statement:

import os
import requests
from py2neo import neo4j

# Connect to graph and add constraints.
neo4jUrl = os.environ.get('NEO4J_URL',"https://localhost:7474/db/data/")
graph = neo4j.GraphDatabaseService(neo4jUrl)

# Add uniqueness constraints.
neo4j.CypherQuery(graph, "CREATE CONSTRAINT ON (q:Question) ASSERT q.id IS UNIQUE;").run()

# Build URL.
apiUrl = "https://api.stackexchange.com/2.2/questions...." % (tag,page,page_size)
# Send GET request.
json = requests.get(apiUrl, headers = {"accept":"application/json"}).json()

# Build query.
query = """
UNWIND {json} AS data ....
"""

# Send Cypher query.
neo4j.CypherQuery(graph, query).run(json=json)

We also did something similar with getting tweets from the Twitter search API into Ne4oj for the OSCON conference.

Javascript

For JavaScript I want to show how to call the transactional Cypher endpoint directly, by just using the request node module.

var r=require("request");
var neo4jUrl = (env["NEO4J_URL"] || "https://localhost:7474") + "/db/data/transaction/commit";

function cypher(query,params,cb) {
  r.post({uri:neo4jUrl,
          json:{statements:[{statement:query,parameters:params}]}},
         function(err,res) { cb(err,res.body)})
}

var query="UNWIND {json} AS data ....";
var apiUrl = "https://api.stackexchange.com/2.2/questions....";

r.get({url:apiUrl,json:true,gzip:true}, function(err,res,json) {
  cypher(query,{json:json},function(err, result) { console.log(err, JSON.stringify(result))});
});

Java

With Java I want to show how to use the Neo4j embedded API to execute Cypher.

import org.apache.http.*;
import org.codehaus.jackson.map.ObjectMapper;
import org.neo4j.graphdb.*;

// somewhere in your application-scoped setup code
ObjectMapper mapper = new ObjectMapper();
HttpClient http = HttpClients.createMinimal();
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedGraphDatabase(PATH);

// execute API request and parse response as JSON
HttpResponse response = http.execute(new HttpGet( apiUrl ));
Map json = mapper.readValue(response.getEntity().getContent(), Map.class)

// execute Cypher
String query = "UNWIND {json} AS data ....";
db.execute(query, singletonMap("json",json));

// application scoped shutdown, or JVM-shutdown-hook
db.shutdown();

Ruby

Using the neo4j-core Gem, we can talk to Neo4j Server or embedded (using jRuby) by just changing a single line of configuration.

require 'rubygems'
require 'neo4j-core'
require 'rest-client'
require 'json'

QUERY="UNWIND {json} AS data ...."
API = "https://api.stackexchange.com/2.2/questions...."

res = RestClient.get(API)
json = JSON.parse(res.to_str)

session = Neo4j::Session.open
session.query(QUERY, json: json)

Bash

Bash is of course most fun, as we have to do fancy text substitutions to make this work.

load_json.sh
#!/bin/bash
echo "Usage load_json.sh 'https://json.api.com?params=values' import_json.cypher"
echo "Use {data} as parameter in your query for the JSON data"
JSON_API="$1"
QUERY=`cat "$2"` # cypher file
JSON_DATA=`curl --compress -s -H accept:application/json -s "$JSON_API"`
POST_DATA="{\"statements\":[{\"statement\": \"$QUERY\", \"parameters\": {\"data\":\"$JSON_DATA\"}}]}"
DB_URL=${NEO4J_URL-https://localhost:7474}
curl -i -H accept:application/json -H content-type:application/json -d "$POST_DATA" -XPOST 
"$DB_URL/db/data/transaction/commit"

Example Use-Cases


Here are some simple example queries that I now can run on top of this imported dataset.

To not overload this blog post with too much information, we’ll answer our original questions in Part 2.

Find the User who was most active


MATCH (u:User)
OPTIONAL MATCH (u)-[:PROVIDED|ASKED|COMMENTED]->()
RETURN u,count(*)
ORDER BY count(*) DESC
LIMIT 5

Find co-used Tags


MATCH (t:Tag)
OPTIONAL MATCH (t)<-[:TAGGED]-(question)-[:TAGGED]->(t2)
RETURN t.name,t2.name,count(distinct question) as questions
ORDER BY questions DESC

MATCH (t:Tag)<-[r:TAGGED]->(question)
RETURN t,r,question

Conclusion


So as you can see, even with LOAD JSON not being part of the language, it’s easy enough to retrieve JSON data from an API endpoint and deconstruct and insert it into Neo4j by just using plain Cypher.

Accessing web-APIs is a simple task in all stacks and languages and JSON as transport format is ubiquitous.

Fortunately, the unfortunately lesser known capabilities of Cypher to deconstruct complex JSON documents allow us to quickly turn them into a really nice graph structure without duplication of information and rich relationships.

I encourage you to try it with your favorite web-APIs and send us your example with graph model, Cypher import query and 2-3 use-case queries that reveal some interesting insights into the data you ingested to content@neotechnology.com.


Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and start building better apps powered by graph technologies.



Appendix: Stack Overflow Response


{
	"items": [{
		"answers": [{
			"owner": {
				"reputation": 488,
				"user_id": 737080,
				"user_type": "registered",
				"accept_rate": 45,
				"profile_image": "https://www.gravatar.com/avatar/
ffa6eed1e8a9c1b2adb37ca88c07dede?s=128&d=identicon&r=PG",
				"display_name": "Chris Leishman",
				"link": "https://stackoverflow.com/users/737080/chris-leishman"
			},
			"tags": [],
			"comment_count": 0,
			"down_vote_count": 0,
			"up_vote_count": 2,
			"is_accepted": false,
			"score": 2,
			"last_activity_date": 1404772223,
			"creation_date": 1404772223,
			"answer_id": 24620959,
			"question_id": 24620768,
			"share_link": "https://stackoverflow.com/a/24620959",
			"body_markdown": "The simplest would be to use an ... some discussion on this here:
 https://docs.neo4j.org/chunked/stable/cypherdoc-linked-lists.html)",
			"link": "https://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements/24620959#24620959",
			"title": "Neo4j cypher query: get last N elements"
		}],
		"tags": ["neo4j", "cypher"],
		"owner": {
			"reputation": 815,
			"user_id": 1212067,
			"user_type": "registered",
			"accept_rate": 73,
			"profile_image": "https://i.stack.imgur.com/nnyS1.png?s=128&g=1",
			"display_name": "C&#233;sar Garc&#237;a Tapia",
			"link": "https://stackoverflow.com/users/1212067/c%c3%a9sar-garc%c3%ada-tapia"
		},
		"comment_count": 0,
		"delete_vote_count": 0,
		"close_vote_count": 0,
		"is_answered": true,
		"view_count": 14,
		"favorite_count": 0,
		"down_vote_count": 0,
		"up_vote_count": 1,
		"answer_count": 1,
		"score": 1,
		"last_activity_date": 1404772230,
		"creation_date": 1404771217,
		"question_id": 24620768,
		"share_link": "https://stackoverflow.com/q/24620768",
		"body_markdown": "I have a graph that...How can I do that?",
		"link": "https://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
		"title": "Neo4j cypher query: get last N elements"
	}, {
		"tags": ["neo4j", "cypher"],
		"owner": {
			"reputation": 63,
			"user_id": 845435,
			"user_type": "registered",
			"accept_rate": 67,
			"profile_image": "https://www.gravatar.com/avatar/
610458a30958c9d336ee691fa1a87369?s=128&d=identicon&r=PG",
			"display_name": "user845435",
			"link": "https://stackoverflow.com/users/845435/user845435"
		},
		"comment_count": 0,
		"delete_vote_count": 0,
		"close_vote_count": 0,
		"is_answered": false,
		"view_count": 16,
		"favorite_count": 0,
		"down_vote_count": 0,
		"up_vote_count": 0,
		"answer_count": 0,
		"score": 0,
		"last_activity_date": 1404768987,
		"creation_date": 1404768987,
		"question_id": 24620297,
		"share_link": "https://stackoverflow.com/q/24620297",
		"body_markdown": 
"I&#39;m trying to implement a simple graph db for NYC subway................Thanks!\r\n",
		"link": "https://stackoverflow.com/questions/24620297/cypher-query-with-infinite-relationship-takes-forever",
		"title": "Cypher query with infinite relationship takes forever"
	}],
	"has_more": true,
	"quota_max": 300,
	"quota_remaining": 205
}