Streaming REST API – Interview with Michael Hunger

Recently, Michael Hunger blogged about his lab work to use streaming in Neo4j’s REST interface. On lab days, everyone on the Neo4j team gets to bump the priority of any engineering work that has been lingering in a background thread. I chatted with Michael about his work with streaming.

ABK:  What inspired you to focus on streaming for Neo4j?
MH:  Because performance is a major concern for Neo4j, especially with so many languages and stacks connecting via the REST API. The existing approach is several orders of magnitude slower than embedded use [note: Neo4j is embeddable on the JVM], not just one order as was originally envisioned.

ABK:  What do you mean by “streaming” in this context? Is this http streaming?
MH:  Yes, it is http streaming combined with json streaming and having the internal calls to Neo4j generate lazy results (Iterables) instead of pulling all results from the db in one go. So writing to the stream will advance the database operations (or their “cursors”). This applies to: indexing, cypher, and traversals.
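As a minimal sketch of the idea (in Python rather than Neo4j’s actual Java internals, so all names here are illustrative), the difference is between materializing every result up front and producing results lazily so that writing to the stream pulls the next row from the database:

```python
def eager_results(cursor):
    # Eager: every row is pulled from the "database" before
    # a single byte can be written to the client.
    return list(cursor)

def lazy_results(cursor):
    # Lazy: rows are produced one at a time; serializing the
    # response advances the underlying cursor.
    for row in cursor:
        yield row

# A stand-in cursor that records when each row is actually fetched.
fetched = []
def cursor():
    for i in range(3):
        fetched.append(i)
        yield {"node": i}

stream = lazy_results(cursor())
assert fetched == []    # nothing fetched until the writer pulls
next(stream)
assert fetched == [0]   # exactly one row fetched so far
```

With the eager variant, memory use grows with the full result set; with the lazy variant, only the row currently being serialized is held.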

ABK:  Ah, so this isn’t for streaming binary data, like video or something, right?
MH:  Yes. The “binary” data is actually json results from the Neo4j REST API.

ABK:  Does it require any changes to the existing clients?
MH:  The only change that is required is to signal to the server to return the data in a streaming manner. Right now that is through an extended accept header (application/json;stream=true) but that will probably change to a more standards compliant transport encoding header.
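As a rough sketch of the client-side change (the header values come straight from the interview; the helper function itself is hypothetical), opting in to streaming is purely a matter of what the client sends in the accept header:

```python
def neo4j_headers(stream=True):
    """Build request headers for a Neo4j REST call.

    Opting in to streaming is purely a header change on the
    client side; no other changes to the request are needed.
    """
    accept = "application/json;stream=true" if stream else "application/json"
    return {"Accept": accept, "Content-Type": "application/json"}

assert neo4j_headers()["Accept"] == "application/json;stream=true"
assert neo4j_headers(stream=False)["Accept"] == "application/json"
```

As Michael notes, this extended accept header may later be replaced by a more standards-compliant transport encoding header.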

ABK:  Could a client use this streaming to “page” results?
MH:  Good question. In theory yes, but then it would have to keep the connection open and stop receiving until the next page is requested, which would probably result in hogged connections and timeouts. But it can be used to retrieve just as much as is needed and then close the connection.

ABK:  Is multipart/mixed used to indicate chunks, or is the json stream left intact?
MH:  Right now the json-stream is left intact. The chunking will be part of the transport encoding negotiation that will be added later. We put that in now (without requiring additional changes on the client) so that we can gather feedback from driver authors and users.

ABK:  When will this be available in Neo4j for public review?
MH:  It is already available in the SNAPSHOT version of Neo4j and part of the first 1.8 milestone release, which is due this week.

ABK:  Thanks, Michael. I’m looking forward to trying it out.
MH:  And I look forward to your feedback. Thanks, Andreas.

Performance Results

With a freshly installed Neo4j 1.8-SNAPSHOT server started, I tried out the streaming as Michael recommended. First, I created a sample data set of 50,000 nodes with a Gremlin one-liner:
(0..50000).inject(0) { count,idx -> v2=g.addVertex(); g.addEdge(g.v(idx),v2,"TYPE"); count+1;}
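Read literally, the one-liner starts from node 0 and on each iteration adds a new vertex plus a TYPE edge from the previous one, building a chain. A small Python model of that construction (the graph class here is a stand-in for illustration, not the Gremlin API):

```python
class Graph:
    # Minimal stand-in for the Gremlin graph in the one-liner.
    def __init__(self):
        self.vertices = [0]  # node 0 pre-exists, as g.v(0) does in Neo4j
        self.edges = []

    def add_vertex(self):
        self.vertices.append(len(self.vertices))
        return self.vertices[-1]

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))

g = Graph()
for idx in range(0, 50001):  # (0..50000) is inclusive in Groovy
    v2 = g.add_vertex()
    g.add_edge(idx, v2, "TYPE")

assert len(g.edges) == 50001  # one TYPE edge per iteration
```

Because each new vertex receives the next sequential id, the result is a single long chain of TYPE relationships, which is why the Cypher query below can match a path from every node.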
Then from bash I ran curl to compare retrieving query results with and without streaming, the difference just being the “;stream=true” in the accept header.
curl -i -o streamed.txt -XPOST -d'{ "query" : "start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p" }' -H "accept:application/json;stream=true" -H content-type:application/json http://localhost:7474/db/data/cypher
curl -i -o nonstreamed.txt -XPOST -d'{ "query" : "start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p" }' -H "accept:application/json" -H content-type:application/json http://localhost:7474/db/data/cypher
Running on my humble Mac laptop, the streaming request took 10 seconds to return the complete result, transferring 130MB of data at 8 to 15 MB/s. The non-streaming request took 1 minute, 8 seconds to produce the same result and required a 2GB heap. Pretty impressive. This is something to look forward to in the upcoming milestone release. Check out Michael’s blog for even more detail.