Goals This article demonstrates some techniques for loading data from JSON based REST APIs into Neo4j. Prerequisites Before importing data from a JSON based REST API you should have installed the APOC library. Intermediate Overview Importing JSON Data into Neo4j… Read more →
This article demonstrates some techniques for loading data from JSON based REST APIs into Neo4j.
Before importing data from a JSON based REST API you should have installed the APOC library.
Importing JSON Data into Neo4j
There are a plethora of JSON based Web APIs that we can import into Neo4j, and we can use one of the Load JSON procedures to retrieve data from these APIs and turn it into map values ready for Cypher to consume.
The APOC user guide provides a worked example showing how to import data from StackOverflow into Neo4j.
The Strava API
Strava is an application used by runners and cyclists to record their activities and share them with their friends. This data is available to users via a JSON based REST API.
Before we start calling the API we need to create an application. We will then be provided with an access token that we will need to use in all our requests to the API.
We can create a parameter in the Neo4j Browser or Cypher shell by executing the following command:
Working with a paginated endpoint
We are interested in importing the activities for the logged in athlete. That endpoint takes the following parameters:
We’re interested in
per_page, where we can defined the number of activities returned per call to the endpoint, and
after, where we can tell the API to only return results after a provided epoch timestamp.
Let’s imagine that we have more activities that we can return in one request to the API. We’ll need to paginate to retrieve all our activities and import them into Neo4j.
Before we paginate the API let’s first learn how to import one page worth of activities into Neo4j. The following query will return activities starting from the earliest timestamp:
We create a node with the label
Run for each activity and set a few properties as well.
The most interesting one for this example is
startDate which we’ll pass to the
after parameter later on.
This query will load the first 30 activities, but what if we want to get the next 30?
We can change the first line of the query to find the most recent timestamp of any of our
Run nodes and then pass that to the API.
If there aren’t any
Run nodes then we can use a value of 0:
We could continue to run this query manually, but it’s about time that we automated it.
Automated API pagination
One way to do this is by using a scripting language and creating a loop inside which we make calls to that endpoint until we run out of activities to retrieve.
If we’re a bit creative we can achieve the same outcome with the
From the APOC documentation, this is the description of the periodic iterate procedure:
It is useful to run a query repeatedly in separate transactions until it doesn’t process and generates any results anymore. So you can iterate in batches over elements that don’t fulfil a condition and update them so that they do afterwards.
In our case the exit condition will be when we receive less than 30 activities from the API.
Let’s first update our query to return a value of
0 if less than 30 activities are returned and the actual count if it’s 30:
All that’s left to do now is wrap the whole thing in periodic commit:
This query will now send multiple commits to the API until we have loaded all our activities.