A large bank in the European Union faced exactly this issue. With a growing database, batch loading of incoming transactions became increasingly time consuming, up to a point when the decision was made for the single Neo4j cluster to be scaled out across different clusters. This is where Neo4j Fabric comes in.
Neo4j allows a very large graph database to be divided into a set of smaller databases, called shards. Each shard is in a separate database that can reside on any server in the cluster. Conveniently, you can create a Neo4j Fabric composite database to facilitate federated queries across the shards, as if they were still a single database. Time-based data structures like financial and accounting transactions lend themselves extremely well to sharding. Partitioning by year or by financial quarter matches the bank’s archiving and retention policies for each database, and is a natural way to split a graph database of transactions.
- Workload distribution. For large analytical workloads, bundling the power of five machines instead of one will give a significant performance boost, since queries are performed in parallel across all shards.
- Easier batch operations. Common database techniques such as daily differential loads become easier and faster to execute because the daily transaction load can occur in its own shard.
- Easier maintenance. It is a lot easier to perform operational tasks like backup and restore in parallel across smaller size shards versus performing them against a multi-terabyte database.
- Less infrastructure cost. Having shards in place allows you to put older and infrequently accessed data on cheaper hardware while you can keep your most recent data on the best hardware available.
While sharding a very large database provides compelling benefits, for a graph database like Neo4j it is important to think about your optimal sharding strategies up front. Different situations require different sharding strategies, since you do not want your graph data to be randomly distributed all over the shards.
While in this use case the bank ended up choosing a sharding strategy based on time window (good for analytics and event-based systems like transactions), other choices include sharding by logical domain entity such as Customer ID (good for SaaS and multi-tenant applications), or by geographical location (good for social networks, sensor networks, and more). However, once that important decision is made, you’ll be pleasantly surprised about the ease of distributing a graph database across a scaled out infrastructure.