As Neo4j Graph Data Science approaches year three, we’re excited to announce Graph Data Science 2.3, which includes new algorithms, a new graph embedding, and other performance and integration improvements that enhance the ease and speed with which you conduct your analytics.
Our goal is to continue helping you improve the performance and accuracy of your models. We want you to harness graphs to make better predictions and deliver more ROI than conventional data science methods, because important relationships and topology can provide an uplift in model accuracy.
I’m going to highlight the three areas of the latest release that I’m most excited about.
If you want to find out everything we’ve added, improved, or fixed, please visit our release changelog on GitHub for full details, or watch Zach’s highlight reel:
1. Make Better, Faster Decisions With Knowledge Graph Embedding – HashGNN
We’ve heard our community of data scientists ask how they can generate embeddings for heterogeneous graphs, so we’ve added a new algorithm to our portfolio of embeddings: the knowledge graph embedding HashGNN.
This knowledge graph embedding enables you to make better predictions using fast, scalable, and high-performing graph ML by efficiently generating embeddings on heterogeneous graphs. It takes architectural inspiration from a Graph Neural Network (GNN) but avoids the high computational cost and complexity of model training that often hinders GNNs.
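As a sketch of what this looks like in practice, HashGNN embeddings can be streamed from a previously projected graph roughly as follows. The graph name `purchases` and the configuration values are illustrative, and the exact parameters should be checked against the HashGNN documentation:

```cypher
// Stream HashGNN embeddings for a projected graph.
// 'purchases' and all parameter values are illustrative.
CALL gds.beta.hashgnn.stream('purchases', {
  iterations: 3,          // number of message-passing rounds
  embeddingDensity: 8,    // controls how many bits are set per embedding
  generateFeatures: {     // derive binary features if none are projected
    dimension: 32,
    densityLevel: 2
  },
  randomSeed: 42
})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId) AS node, embedding
LIMIT 5
```

Because HashGNN uses hashing rather than trained weights, there is no model to fit: the embeddings come straight out of the procedure call.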
Knowledge Graph Embeddings – HashGNN Resources:
2. Reduce Time to Insight with New Graph Algorithms, Configuration Parameters, and Performance Boosts
We are constantly raising the bar to enable data science teams to make better predictions using any data source so they can discover what’s important, what’s unusual, and what’s next faster. Highlights from this release include:
Minimum Directed Steiner Tree: a new algorithm that is great for supply chain use cases. It computes a directed spanning tree that minimizes the total cost of the paths connecting a source node to a set of target nodes. This can be super useful for finding the shortest or least expensive supply chain routes.
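A hedged sketch of a supply chain query using the new algorithm: the labels, property names, and graph name below are made up for illustration, and the procedure signature should be confirmed against the Steiner tree documentation.

```cypher
// Sketch: minimum-cost directed tree from a distribution hub
// to a set of stores. All names here are illustrative.
MATCH (hub:Warehouse {name: 'Central'})
MATCH (s:Store) WHERE s.region = 'West'
WITH hub, collect(s) AS stores
CALL gds.beta.steinerTree.stream('supplyChain', {
  sourceNode: hub,
  targetNodes: stores,
  relationshipWeightProperty: 'cost'
})
YIELD nodeId, parentId, weight
RETURN gds.util.asNode(parentId) AS from,
       gds.util.asNode(nodeId) AS to,
       weight
```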
Algorithm performance improvements: In Graph Data Science 2.3, we’ve also managed to boost performance for some of our most popular algorithms.
Undirected relationship types: Graphs can be either directed or undirected. In a directed graph, relationships (edges) point in a specific direction, while relationships in an undirected graph have no direction. With Graph Data Science 2.3, all our API surfaces now support undirected relationship projections, whether you’re using the Python client’s construct method or our Apache Arrow integration.
We’ve also included a convenience procedure to convert existing directed graphs to an undirected projection.
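To make both enhancements concrete, here is a rough Cypher sketch: first a projection that declares one relationship type as undirected, then the conversion procedure applied to a type that was projected as directed. Graph, label, and relationship names are illustrative, and the conversion procedure name should be checked against the documentation:

```cypher
// Project 'FRIENDS_WITH' as undirected and 'FOLLOWS' as directed.
CALL gds.graph.project(
  'social',
  'Person',
  {
    FRIENDS_WITH: {orientation: 'UNDIRECTED'},
    FOLLOWS: {orientation: 'NATURAL'}
  }
)

// Later, convert the directed 'FOLLOWS' relationships to an
// undirected type on the in-memory graph.
CALL gds.beta.graph.relationships.toUndirected(
  'social',
  {relationshipType: 'FOLLOWS', mutateRelationshipType: 'FOLLOWS_UNDIRECTED'}
)
```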
You can learn more about these enhancements in our documentation.
3. Improved Machine Learning Predictions on Imbalanced or Skewed Datasets
Train models faster on imbalanced datasets. In link prediction, the set of positive examples (node pairs with a relationship) is typically far smaller than the set of negative examples (node pairs without one), a situation known as class imbalance. Previously, link prediction models sampled negative examples at random during training. With Graph Data Science 2.3, data scientists can instead train the model on their own negative relationship examples, which helps tune and improve accuracy when working with imperfect data.
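As a rough illustration of supplying your own negative examples, a training call might look like the sketch below. The graph, pipeline, and relationship-type names are invented, and the configuration key for negative examples (shown here as `negativeRelationshipType`) should be verified against the link prediction pipeline documentation:

```cypher
// Sketch: train a link prediction pipeline using a projected
// relationship type that marks known non-links as negatives.
// All names and the negative-example parameter are illustrative.
CALL gds.beta.pipeline.linkPrediction.train('friendsGraph', {
  pipeline: 'lp-pipe',
  modelName: 'lp-model',
  targetRelationshipType: 'FRIENDS_WITH',
  negativeRelationshipType: 'NOT_FRIENDS'
})
YIELD modelInfo
RETURN modelInfo
```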
Learn more about ML prediction in our documentation.
Bonus Content 👀
Common Datasets in the Graph Data Science Python Client
If you just want to play around to get a better understanding of algorithm capabilities, we’ve introduced some common data science community datasets: Cora, Karate Club, & IMDB.
Cora: A well-known citation network introduced in the paper “Automating the Construction of Internet Portals with Machine Learning” and used in many node classification and link prediction publications.
Karate club: A well-known social network introduced by Zachary.
IMDB: A heterogeneous graph used to benchmark node classification and link prediction models such as Heterogeneous Graph Attention Network, MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding, and Graph Transformer Networks.
Learn more about these common datasets in our documentation.
Write Labels Based on Properties
From community identifiers to known fraudsters, our customers use our algorithms and pipelines to predict new properties. To support this, we’ve added a new procedure that writes a new node label to the Neo4j database, filtered on node property values.
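A hedged sketch of the idea: after an algorithm has mutated a score onto the in-memory graph, you can materialize a label for the nodes that pass a filter. The graph name, label, property, and threshold below are illustrative, and the procedure name should be confirmed in the documentation:

```cypher
// Sketch: write a 'SuspectedFraudster' label back to the database
// for nodes whose in-memory fraud score exceeds a threshold.
// All names and values are illustrative.
CALL gds.alpha.graph.nodeLabel.write('fraudGraph', 'SuspectedFraudster', {
  nodeFilter: 'n.fraudScore > 0.8'
})
YIELD nodeLabelsWritten
RETURN nodeLabelsWritten
```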
Learn more about these node operations in our documentation.
Cypher Aggregation Is the New Cypher Projection
Cypher Aggregation is targeted to become the primary surface for projecting graphs from the database using Cypher.
Cypher Aggregation is more intuitive and expressive than the current Cypher Projection API, and it can directly consume the results of arbitrary Cypher queries. The best part about Cypher Aggregations is that they can be used in conjunction with Composite Database architectures (previously known as Neo4j Fabric).
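To show the shape of the new surface, here is a sketch that projects a filtered subgraph straight out of a Cypher query using the aggregation function. Labels, property names, and the graph name are illustrative:

```cypher
// Sketch: project only high-value transactions into an
// in-memory graph using the Cypher aggregation function.
// All names and the threshold are illustrative.
MATCH (p1:Person)-[t:TRANSACTED_WITH]->(p2:Person)
WHERE t.amount > 1000
RETURN gds.alpha.graph.project(
  'bigTransfers',
  p1, p2,
  {sourceNodeLabels: labels(p1), targetNodeLabels: labels(p2)},
  {relationshipType: type(t), properties: {amount: t.amount}}
)
```

Because the aggregation runs inside an ordinary Cypher query, any `MATCH`/`WHERE` logic can shape the projection, which is what makes it more expressive than the older projection API.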
Learn more about Cypher aggregation in our documentation.
Shard Local Algorithm Execution With Composite Databases
The Neo4j database uses Fabric to shard your data across multiple machines, and Composite Databases allow Cypher queries to be coordinated and distributed across the different shards. Today, we support running algorithms locally on each of the shards.
It’s not currently possible to run algorithms across shards, but we’re hoping to remove this limitation in subsequent releases. If you’re really interested in this feature, please reach out to me!
Learn more about Fabric in our documentation.
2023 is set to be an exciting year. Keep an eye out for the general availability announcements of our fully managed graph data science as a service: AuraDS on Amazon Web Services (AWS) and Microsoft Azure.