The Spark of Neo4j

Data Engineer, LARUS Business Automation


TL;DR: If you just want to see how to use the connector API, jump down to the “Final Result” section, or check out our examples in the docs.

We wanted to give users an official library, sparing them from custom “hacky” solutions, and to offer a continuous service from both the development and the support points of view. We decided to deprecate the old connector in favour of a complete rewrite on top of the new DataSource API V2, which enables us to leverage the multi-language support of Apache Spark. We also put a lot of work into the documentation website, and keeping the docs up to date is actually part of our backlog.

I also presented this story at the GraphRM meetup in Rome, if you’d rather watch the video.
Challenges
Getting to where we are now wasn’t really straightforward. We faced some problems, and here’s a quick overview of how we dealt with them.

Lack of Documentation
Since the DataSource API V2 is relatively new, we couldn’t find much official Databricks documentation (especially for Spark 3). Examples, videos, and tutorials didn’t go deep enough for what we had to do, so we had to find another way to build things; the good old “look at the source code” mantra came in handy here, and it let us figure out what we were doing wrong whenever the documentation fell short.

Breaking Changes
We found breaking changes even between minor versions, and of course between Spark 2.4 and Spark 3.0 as well.
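To give a sense of the scale: here is a minimal sketch, assuming the Spark 3.0 APIs, of where a DataSource V2 read implementation starts (the class name is hypothetical and the bodies are stubbed; this is not the connector’s actual code). In Spark 2.4 the equivalent entry point was ReadSupport.createReader in the org.apache.spark.sql.sources.v2 package, which no longer exists in 3.0, so code like this had to be restructured rather than just recompiled:

```scala
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Spark 3.0: a DataSource V2 read starts from TableProvider.
// Spark 2.4: the same role was played by DataSourceV2 with ReadSupport,
// in the (now removed) org.apache.spark.sql.sources.v2 package.
class Neo4jTableProvider extends TableProvider {

  // Derive a schema from the options, e.g. by sampling the query result.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

  // Return the Table that Spark will later ask for a ScanBuilder.
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: java.util.Map[String, String]): Table = ???
}
```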
Dealing with Versions
Handling these breaking changes across versions wasn’t easy. We couldn’t address every Spark version in the first release, so we had to pick one. Spark 2.3 was relatively old and would soon be unsupported; on the other hand, Spark 3.0 was new and not yet widely adopted. So we decided to start with Spark 2.4. The Scala version matrix added another constraint:

- Spark 2.4 supports Scala 2.11 and Scala 2.12.
- Spark 3.0 supports only Scala 2.12, and dropped support for Scala 2.11.

Finally, after some months of coding, we published our first pre-release on September 30th! 🎉
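On the build side, supporting two Scala binary versions per Spark line means cross-building the artifact. A minimal sbt sketch of the idea, with illustrative version numbers (this is not the connector’s actual build file):

```scala
// build.sbt: cross-build against the Scala versions supported by Spark 2.4.
// Run `sbt +package` to produce one artifact per Scala version; the Spark
// version can be swapped at build time, e.g. `sbt -Dspark.version=3.0.0 +package`.
ThisBuild / crossScalaVersions := Seq("2.11.12", "2.12.12")

val sparkVersion = sys.props.getOrElse("spark.version", "2.4.7")

libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
```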
Neo4j vs Tables
The pain point of integrating Spark and Neo4j is that Spark works with tables, while Neo4j has nothing similar: it has nodes and relationships instead. We had to solve a couple of problems here as well.

Tables or Labels?
To map a graph made of nodes and relationships into a table, we create one column for each node property, plus two columns per relationship that contain the source node ID and the target node ID. This lets us represent nodes and relationships in a tabular form that works with the DataSource APIs.
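As an illustration (the column names below follow the connector’s documented conventions, but treat them as indicative; the docs have the exact schema):

```
(:Person {name: "Alice"})-[:KNOWS]->(:Person {name: "Bob"})

Nodes become one row per node, one column per property:

| <id> | <labels> | name  |
|------|----------|-------|
|  0   | [Person] | Alice |
|  1   | [Person] | Bob   |

Relationships carry the IDs of the nodes they connect:

| <rel.type> | <source.id> | <target.id> |
|------------|-------------|-------------|
| KNOWS      | 0           | 1           |
```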
Schema or Not Schema?
Tables have a schema; Neo4j doesn’t. We had to find a way to extract a schema from a schema-less graph. To do this, once we get a result back from Neo4j, we flatten it and go through each property, extracting its type and finally building the schema of the result set.
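A minimal sketch of that idea in Scala (illustrative only, not the connector’s actual code):

```scala
import org.apache.spark.sql.types._

// Infer a Spark schema from a sample of flattened Neo4j records.
// Each record is a property-name -> value map extracted from a node.
def inferSchema(records: Seq[Map[String, Any]]): StructType = {
  val fields = records
    .flatMap(_.toSeq)                      // flatten every (property, value) pair
    .groupBy { case (name, _) => name }    // group values by property name
    .map { case (name, pairs) =>
      // Pick the Spark type from the first value whose type we recognise.
      val dataType = pairs.map(_._2).collectFirst {
        case _: String  => StringType
        case _: Long    => LongType
        case _: Int     => IntegerType
        case _: Double  => DoubleType
        case _: Boolean => BooleanType
      }.getOrElse(StringType)              // unknown types fall back to string
      StructField(name, dataType, nullable = true)
    }
  StructType(fields.toSeq.sortBy(_.name))
}
```

Sampling more than one record matters here: since the graph is schema-less, two nodes with the same label can carry different sets of properties.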
Final Result — And How To Use It
Let’s take a quick look at the API of the Neo4j Connector for Apache Spark. Here you can see how to read all the nodes that have the labels :Person:Admin using Scala, Python, and R.
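In Scala, for instance, it looks like this (connection details are illustrative; the Python and R variants differ only in surface syntax):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read every node carrying both the :Person and :Admin labels into a DataFrame.
val df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "bolt://localhost:7687") // illustrative connection details
  .option("authentication.basic.username", "neo4j")
  .option("authentication.basic.password", "password")
  .option("labels", ":Person:Admin")
  .load()

df.show()
```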


