Can’t Stop, Won’t Stop: Graph Data Science 2.1 Is Better Than Ever

Senior Director of Product Management, Graph Data Science

June 23, 2022

We don’t take breaks at Neo4j – we’re following up GraphConnect with yet another awesome release of the Graph Data Science (GDS) library. Our engineers are constantly raising the bar. Here are some of the highlights in this release:

🆕 New algorithms: K-means clustering and Leiden for community detection

👌 Source and Target node filtering for KNN and Node Similarity

🔮 Node regression pipelines: now you can predict numerical properties

🪄 Autotuning for ML pipelines: we figure out all the right parameters to generate the best possible model

🔛 Arrow Support for fast graph projection, database creation, and graph export, enabling you to move up to 30M objects/second of data!

🏃🏿 ML Performance improvements: Train models up to 10 times faster than before!

With this release, you not only get more algorithms than ever before, you also get access to the easiest-to-use and most scalable framework available.

Want to learn more? Keep reading 😉

Making Graph Data Science Simple

With every release, we deliver features to make graph data science easier to use. This 2.1 update delivers on that promise with autotuning for machine learning, visual progress logging in the Python client, and filtering for similarity.

Autotuning: ML pipelines (nodeClassification, nodeRegression, linkPrediction) now support automated hyperparameter tuning. Users configure the system, and Neo4j Graph Data Science finds the parameter combinations that produce the best-performing models.

Source and Target filtering for KNN and Node Similarity: Similarity algorithms are some of the most popular, but users often do not need to compare every possible pair of nodes in their graph. Source and target filtering lets users limit the scope of similarity calculations to just the relevant nodes for each use case.

Visual Progress Logging in the Graph Data Science Python Client: Now, when users run algorithms or project graphs, a progress bar is displayed that shows the status of tasks.
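To make the idea behind source and target filtering concrete, here is a minimal sketch in plain Python. It is not the GDS implementation or its API: the function names, the toy graph, and the choice of Jaccard similarity over neighbor sets are all assumptions made for illustration. The point is only that scoring is restricted to (source, target) pairs instead of every pair of nodes.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filtered_similarity(neighbors, sources, targets):
    """Score only (source, target) pairs rather than all node pairs.

    All names here are hypothetical; this is not the GDS API.
    """
    scores = {}
    for s, t in product(sources, targets):
        if s != t:  # never compare a node with itself
            scores[(s, t)] = jaccard(neighbors[s], neighbors[t])
    return scores


# Hypothetical toy graph: people and the products they bought.
neighbors = {
    "alice": {"laptop", "mouse", "monitor"},
    "bob": {"laptop", "mouse"},
    "carol": {"monitor", "desk"},
    "dave": {"desk", "chair"},
}

# Only alice is a source, so 3 pairs are scored instead of all 6.
scores = filtered_similarity(
    neighbors, sources=["alice"], targets=["bob", "carol", "dave"]
)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

With one source node and three target nodes, the sketch computes three scores and never touches pairs such as bob and carol, which is where the savings come from on large graphs.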

Essential Data Science Capabilities

If we’re the Graph Data Science library, we have to empower data scientists to make sense of connected data. In this release, we’ve added new community detection algorithms and regression pipelines.

New alpha tier algorithm – Leiden: a hierarchical community detection algorithm that guarantees well-connected communities. Leiden is similar to Louvain, and users have requested it to create more cohesive communities.

New alpha tier algorithm – K-means clustering: community detection algorithm intended to cluster nodes based on properties (like embeddings). Users can specify the number of clusters desired and Graph Data Science finds the optimal groupings.

New alpha tier ML pipeline – Node Regression: users can predict numerical property values for nodes using node regression pipelines. Node regression lets users fill in missing property values based on other node properties and graph topology.
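For readers who have not used K-means before, the following is a rough sketch of clustering nodes by a vector property such as an embedding. It uses the textbook Lloyd's algorithm in plain Python and is not the GDS implementation; the node names, the two-dimensional "embeddings", and the function signatures are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=42):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids


# Hypothetical node embeddings forming two well-separated groups.
embeddings = {
    "a": (0.0, 0.1), "b": (0.2, 0.0), "c": (0.1, 0.2),
    "d": (5.0, 5.1), "e": (5.2, 4.9), "f": (4.9, 5.0),
}
names = list(embeddings)
assignment, centroids = kmeans([embeddings[n] for n in names], k=2)
labels = dict(zip(names, assignment))
print(labels)
```

The caller picks k, and the algorithm groups nodes whose property vectors are close together; here nodes a, b, and c end up in one cluster and d, e, and f in the other.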

Enterprise Scalability

It doesn’t matter how cool our codebase is if we can’t handle your enterprise data. One of the biggest improvements to GDS 2.1 is the integration of Apache Arrow, so you can directly import and export data from your graph projections – at crazy fast speeds.

Apache Arrow Integration for Graph Projections: import and export massive graphs directly into Graph Data Science at speeds up to 30 million objects/second.

Leveraging Arrow to directly build graph connections makes it simple to insert Graph Data Science seamlessly into your existing ML pipelines and run analytics that need to be exported to a downstream system.

The Neo4j Graph Data Science Arrow integration provides: a built-in Arrow Flight server, bundled with Graph Data Science; Arrow convenience functions in the Graph Data Science Python Client to load from and export to data frames; and access to a low-level Arrow API to integrate with any Apache Arrow-supported product, such as Google BigQuery, Beam, or Parquet files.

Note: Apache Arrow integration for graph projections is available to Graph Data Science Enterprise Edition customers only.

Performance Improvements for Machine Learning: Through optimization of internal machine learning code, the training time for GraphSAGE embeddings is up to 90 percent faster, Random Forest model training is up to 80 percent faster, and Logistic Regression is up to 40 percent faster.
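As a quick sanity check on how these percentages relate to the "up to 10 times faster" headline, the small sketch below converts a percent figure into a speedup factor. It assumes that "N percent faster" means an N percent reduction in training time, which is an interpretation and not something the figures above spell out; the function name is hypothetical.

```python
def speedup_factor(percent_reduction):
    """Convert a percent reduction in training time into an 'N times faster' factor.

    Assumes 'N percent faster' means training takes N percent less time.
    """
    remaining = 1.0 - percent_reduction / 100.0  # fraction of the old time still needed
    return 1.0 / remaining


for name, pct in [
    ("GraphSAGE", 90),
    ("Random Forest", 80),
    ("Logistic Regression", 40),
]:
    print(f"{name}: up to {pct}% less time is about {speedup_factor(pct):.2f}x faster")
```

Under that reading, a 90 percent reduction corresponds to roughly a 10x speedup, 80 percent to about 5x, and 40 percent to about 1.67x.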

But Wait, There’s More!

We know graph data science doesn’t exist in a vacuum, and it’s critical that we provide the tooling to integrate with the rest of your tech stack. We’ve introduced the data warehouse connector – to integrate with all your data sources – and continued to iterate on our native Python client.

Graph Data Science Python Client Improvements: The Graph Data Science Python Client can automatically use Apache Arrow for data movement on Enterprise-licensed instances. Users can now specify the return format of data frames when streaming node properties or relationship results (pivoting rows and columns). The Graph Data Science Python Client supports all Graph Data Science 2.1 features.

Neo4j Data Warehouse Connector: Offers a simple way to move data between the Neo4j database and data warehouses like Snowflake, Google BigQuery, Amazon Redshift, or Microsoft Azure Synapse Analytics. It can be used as a Spark Submit Job (by providing a JSON configuration), or with a Scala/Python API that simplifies writing the Spark job to move data between the Neo4j database and the data warehouse.

Ready to Get Started?

Try out graph data science with our free Sandbox, pre-populated with data sets to teach you the fundamentals. Or you can build your own project on AuraDS, our data-science-as-a-service offering, which includes Enterprise Edition GDS and Bloom.

Start Your Sandbox

Get AuraDS Now