Using (Spring Data) Neo4j for the Hubway Data Challenge
Using Spring Data Neo4j it was incredibly easy to model and import the Hubway Challenge dataset into a Neo4j graph database, to make it available for advanced querying and visualization.
The Challenge and Data
Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.
Getting Started
As midnight had just passed and Spring Data Neo4j 2.1.0.RELEASE had been built unofficially during the day, I thought it would be a good exercise to model the data using entities and import it into Neo4j. So the first step was the domain model, which is pretty straightforward:
Based on the Spring Data book example project, I created the pom.xml with the dependencies (org.springframework.data:spring-data-neo4j:2.1.0.RELEASE) and the Spring application context files.
Import Stations
The Station was the easiest entity to model and import. It has several names, one of which (terminalName) is the unique identifier; the station name itself can be searched through a fulltext index. As Hubway also provides geo-information for the stations, we use the Neo4j Spatial index provider so that we can later integrate spatial searches (near, bounding box, etc.).
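import org.springframework.data.annotation.TypeAlias;
import org.springframework.data.neo4j.annotation.GraphId;
import org.springframework.data.neo4j.annotation.Indexed;
import org.springframework.data.neo4j.annotation.NodeEntity;
import org.springframework.data.neo4j.support.index.IndexType;

@NodeEntity
@TypeAlias("Station")
public class Station {
    @GraphId Long id;

    @Indexed(numeric = false)
    private Short stationId;

    // Unique identifier of the station
    @Indexed(unique=true)
    private String terminalName;

    // Station name, searchable via the fulltext index "stations"
    @Indexed(indexType = IndexType.FULLTEXT, indexName = "stations")
    private String name;

    boolean installed, locked, temporary;
    double lat, lon;

    // Well-known-text representation of the location for the spatial index
    @Indexed(indexType = IndexType.POINT, indexName = "locations")
    String wkt;

    protected Station() {
    }

    public Station(Short stationId, String terminalName, String name,
                   double lat, double lon) {
        this.stationId = stationId;
        this.name = name;
        this.terminalName = terminalName;
        this.lon = lon;
        this.lat = lat;
        // WKT expects "POINT(x y)", i.e. longitude first; the replace() guards
        // against locales that format decimals with a comma
        this.wkt = String.format("POINT(%f %f)",lon,lat).replace(",",".");
    }

    // Getter used by the importer to key its station cache
    public Short getStationId() {
        return stationId;
    }
}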
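The WKT string has to use a dot as decimal separator no matter which default locale the JVM runs with. As a minimal sketch of an alternative to replacing commas afterwards, the formatting can be pinned to Locale.ROOT. The class name WktPoint and the sample coordinates are made up for illustration and are not part of the project.

```java
import java.util.Locale;

public class WktPoint {

    /** Formats a longitude/latitude pair as WKT "POINT(x y)", longitude first. */
    public static String toWkt(double lon, double lat) {
        // Locale.ROOT always formats decimals with '.', independent of the JVM default locale
        return String.format(Locale.ROOT, "POINT(%f %f)", lon, lat);
    }

    public static void main(String[] args) {
        // Hypothetical coordinates somewhere in Boston
        System.out.println(toWkt(-71.0603, 42.3555)); // prints POINT(-71.060300 42.355500)
    }
}
```

The result is the same kind of string the Station constructor builds, just without depending on the default locale.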
I used the JavaCSV library for reading the data files. The importer just creates a Spring context and retrieves the service, which comes with injected dependencies and declarative transaction management. Then the actual import is as simple as creating entity instances and passing them to the Neo4jTemplate for saving.
ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("classpath:META-INF/spring/application-context.xml");
ImportService importer = ctx.getBean(ImportService.class);
CsvReader stationsFile = new CsvReader(stationsCsv);
stationsFile.readHeaders();
importer.importStations(stationsFile);
stationsFile.close();
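import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import com.csvreader.CsvReader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.neo4j.support.Neo4jTemplate;
import org.springframework.transaction.annotation.Transactional;

public class ImportService {
    @Autowired private Neo4jTemplate template;

    // Imported stations, keyed by station id
    private final Map<Short, Station> stations = new HashMap<Short, Station>();

    @Transactional
    public void importStations(CsvReader stationsFile) throws IOException {
        // id,terminalName,name,installed,locked,temporary,lat,lng
        while (stationsFile.readRecord()) {
            Station station = new Station(asShort(stationsFile,"id"),
                    stationsFile.get("terminalName"),
                    stationsFile.get("name"),
                    asDouble(stationsFile, "lat"),
                    asDouble(stationsFile, "lng"));
            template.save(station);
            stations.put(station.getStationId(), station);
        }
    }
}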
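The asShort and asDouble helpers used above are not shown in this post. A minimal sketch of what such helpers might look like follows, assuming they simply parse the text of a named column. To keep it runnable without the JavaCSV library, the record is represented by a hypothetical Row interface instead of CsvReader; the class name CsvFields, the null result for blank values, and the sample values are assumptions as well.

```java
import java.util.HashMap;
import java.util.Map;

public class CsvFields {

    /** Hypothetical stand-in for a CSV record: returns the raw text of a named column. */
    public interface Row {
        String get(String column);
    }

    /** Parses a column as Short; returns null for a missing or blank value (an assumption). */
    public static Short asShort(Row row, String column) {
        String value = row.get(column);
        if (value == null || value.trim().isEmpty()) return null;
        return Short.valueOf(value.trim());
    }

    /** Parses a column as double; fails with an exception on missing or malformed input. */
    public static double asDouble(Row row, String column) {
        return Double.parseDouble(row.get(column).trim());
    }

    public static void main(String[] args) {
        // Made-up record using the station column names from the CSV header comment
        Map<String, String> record = new HashMap<String, String>();
        record.put("id", "3");
        record.put("lat", "42.340021");
        record.put("lng", "-71.100812");
        Row row = record::get;
        System.out.println(asShort(row, "id") + " @ " + asDouble(row, "lat") + "," + asDouble(row, "lng"));
    }
}
```

Returning null for a blank id makes it possible to detect and skip incomplete rows instead of aborting the whole import.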
Import trips
Importing the trips themselves is only a little more involved. In modeling the trip I chose to create a RelationshipEntity called Action to represent the start or end of a trip. That entity connects the trip to a station and holds the date at which it happened. During the import I found a number of data rows to be inconsistent (missing stations), so those were skipped. As half a million entries are a bit too much for a single transaction, I split the import up into batches of 5k trips each.
@Transactional
public boolean importTrips(CsvReader trips, int count) throws IOException {
    //"id","status","duration","start_date","start_station_id",
    // "end_date","end_station_id","bike_nr","subscription_type",
    // "zip_code","birth_date","gender"
    while (trips.readRecord()) {
        Station start = findStation(trips, "start_station_id");
        Station end = findStation(trips, "end_station_id");
        // Skip inconsistent rows that reference a missing station
        if (start==null || end==null) continue;
        Member member = obtainMember(trips);
        Bike bike = obtainBike(trips);
        Trip trip = new Trip(member, bike)
                .from(start, date(trips.get("start_date")))
                .to(end, date(trips.get("end_date")));
        template.save(trip);
        count--;
        // Batch is full: report that there may be more rows to import
        if (count==0) return true;
    }
    // Input exhausted before the batch filled up
    return false;
}
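Because importTrips returns true when a batch fills up and false once the file is exhausted, the caller only has to invoke it in a loop until it returns false. The following is a minimal sketch of that driver loop, not the actual importer: a plain Iterator stands in for the CsvReader, and the names BatchedImport, importBatch and importAll are invented for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

public class BatchedImport {

    /** Number of rows processed so far. */
    public int imported = 0;

    /** Stand-in for importTrips: consumes up to batchSize rows, returns true if the batch filled up. */
    public boolean importBatch(Iterator<String> rows, int batchSize) {
        int remaining = batchSize;
        while (rows.hasNext()) {
            rows.next(); // the real importer would build and save a Trip here
            imported++;
            remaining--;
            if (remaining == 0) return true;
        }
        return false;
    }

    /** Calls importBatch until the input is exhausted; returns the number of calls made. */
    public int importAll(Iterator<String> rows, int batchSize) {
        int batches = 0;
        boolean more;
        do {
            more = importBatch(rows, batchSize);
            batches++;
        } while (more);
        return batches;
    }

    public static void main(String[] args) {
        BatchedImport importer = new BatchedImport();
        Iterator<String> rows = Arrays.asList("t1", "t2", "t3", "t4", "t5").iterator();
        int batches = importer.importAll(rows, 2);
        System.out.println(importer.imported + " rows in " + batches + " batches"); // prints 5 rows in 3 batches
    }
}
```

In the real service each call would run in its own transaction, provided it goes through the Spring proxy so that @Transactional applies. Note that when the row count is an exact multiple of the batch size, one extra, empty call is needed to find out that the input is finished.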
First look at the data
Two minutes after starting the import, we have a Neo4j database (227MB) that contains all those connections. I uploaded it to our sample dataset site. Please get a Neo4j server and put the content of the zip-file into data/graph.db; then it is easy to visualize the graph and run some interesting queries. I list a few, but those should only be seen as a starting point; feel free to explore and find new and interesting insights.
Stations most often used by a user
START n=node(205)
MATCH n-[:TRIP]->(t)-[:`START`|END]->stat
RETURN stat.name,count(*)
ORDER BY count(*) desc LIMIT 5;
+------------------------------------------------+
| stat.name                           | count(*) |
+------------------------------------------------+
| "South Station - 700 Atlantic Ave." | 22       |
| "Post Office Square"                | 21       |
| "TD Garden - Legends Way"           | 10       |
| "Boylston St. at Arlington St."     | 5        |
| "Rowes Wharf - Atlantic Ave"        | 5        |
+------------------------------------------------+
5 rows
31 ms
Most beloved bikes
START bike=node:Bike("bikeId:*")
MATCH bike<-[:BIKE]->trip
RETURN bike.bikeId,count(*)
ORDER BY count(*) DESC LIMIT 5;
+------------------------+
| bike.bikeId | count(*) |
+------------------------+
| "B00145"    | 1074     |
| "B00114"    | 1065     |
| "B00538"    | 1061     |
| "B00490"    | 1059     |
| "B00401"    | 1057     |
+------------------------+
5 rows
2906 ms
Heroku
The data can also be easily added to a Heroku Neo4j Add-On and from there you can use any programming language and rendering framework (d3, jsplumb, raphael, processing) to visualize the dataset.
What’s next
Next steps for us are to import the supplied shapefile for Boston and the stations into the Neo4j database as well, connect them with the data, and create a cool visualization. I rely on @maxdemarzi for it to be awesome. Another path to follow is to craft more advanced Cypher queries for exploring the dataset and to make them and their results available.
Boston Hubway Data-Challenge Hackathon
Hubway will host a Hack Day at The Bocoup Loft in Downtown Boston on Saturday, October 27, 2012. Register here and spread some graph love.
The source code is available here on GitHub, and Max de Marzi wrote a great follow-up post visualizing the results.