Want to Extend the Graph Capabilities of Spark?
Let’s get straight to it: The Apache Spark community is joining hands with the growing graph data movement, and we need your help to keep moving forward.
Databricks contributor Xiangrui Meng is sponsoring a proposal to add property graphs built on Spark DataFrames, queried with Cypher, along with graph algorithms from the GraphFrames project. This would bring Cypher queries into the core Spark project as part of Spark 3.0 (slated for release mid-year 2019).
In a recent post on the Spark users list, Xiangrui pointed out a joint Spark Project Improvement Proposal from Neo4j and Databricks technical staff.
If you approve of extending the graph capabilities of Spark, please express your support and describe how this would benefit you and the community. We especially need your feedback in advance of an upcoming vote in the Spark community.
Please leave a reply to Xiangrui’s thread with your thoughts and feedback. Even a simple +1 will go a long way in moving this forward.
A Bit of Background
We helped progress this collaboration – after an initial discussion back in September – because we felt that our work on the Cypher for Apache Spark project had shown that a DataFrame-based property graph data model, with a built-in graph schema, is a great fit for Spark.
We could also see that graph algorithms would benefit from a strongly-typed property graph model and from a well-developed open source graph query language like Cypher.
We’d like to see Spark Cypher mirror Spark SQL: the right language for the right model, in one common analytics environment. We’re particularly excited by the way this work shows how tabular and graph data (as well as queries) can be mixed together to give data engineers, analysts and data scientists the information they need, in the shape they need it.
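To make the idea concrete, here is a minimal sketch, in plain Python rather than Spark, of what "a property graph on DataFrames" means: nodes and relationships live in ordinary tables, and a Cypher-style pattern match reduces to a join over those tables. The function and table names below are illustrative only and are not the actual Cypher for Apache Spark API.

```python
# Node table: one row per node, with an id and properties
# (in Spark this would be a DataFrame with a fixed schema).
persons = [
    {"id": 0, "name": "Alice"},
    {"id": 1, "name": "Bob"},
]

# Relationship table: one row per edge, referencing node ids by column.
knows = [
    {"src": 0, "dst": 1, "since": 2017},
]

def match_knows(persons, knows):
    """Evaluate the pattern (a:Person)-[:KNOWS]->(b:Person) as a join
    between the relationship table and the node table."""
    by_id = {p["id"]: p for p in persons}
    return [
        {"a": by_id[e["src"]]["name"], "b": by_id[e["dst"]]["name"]}
        for e in knows
        if e["src"] in by_id and e["dst"] in by_id
    ]

print(match_knows(persons, knows))  # [{'a': 'Alice', 'b': 'Bob'}]
```

The same tables could equally be fed to a SQL query or a graph algorithm, which is the "right language for the right model, one environment" point above.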
Similarities to the Push for Graph Query Language (GQL)
Recent work by the machine-learning community on using graph networks, and the push this past year for industry unity around Graph Query Language (GQL), add to the sense that we’re at a turning point for graph data management. Spark is well positioned to take advantage of that development.
(GQL is a proposed ISO standard for property graph querying to stand alongside SQL for relational data. The Cypher language has been implemented by many products and projects, and its designers back the move towards GQL as the next step in property graph querying.)
It’s Time to Take Action
This vision for graph and Cypher capabilities in Apache Spark will only become a reality if you express your support in a comment or +1 vote on Xiangrui’s thread. (If you aren’t already on the Apache Spark User List, a brief registration step is required.)
About the Author
Alastair Green & Martin Junghanns, Neo4j Cypher Team
Alastair Green leads Neo4j’s work on graph query language development and standards, and he is part of the team making the openCypher language available in Apache Spark. He has a background in enterprise data integration and transaction processing product design and deployment.
Martin Junghanns is part of the Cypher for Apache Spark engineering team at Neo4j. He has a research background in distributed graph analytics. His main interests are query engines, graph algorithms and bringing graph querying into the world of Apache Spark. Martin holds an MSc in Computer Science from the University of Leipzig.