Chapter 4. Clustering

This chapter describes the Neo4j clustering solutions Causal Clustering and Highly Available Cluster.

The clustering features are available in Neo4j Enterprise Edition.

This chapter describes Neo4j’s architecture with regard to the clustering features.

Neo4j offers two separate solutions, Causal Clustering and Highly Available Cluster, for ensuring redundancy and performance in a high-demand production environment.

Clustering for the enterprise

Enterprise IT requirements are demanding. Our solutions are expected to provide high throughput, continuous availability, and reliability. Further, in most IT ecosystems we want to run long-lived queries on operational data for analytics and reporting purposes. When designing our solutions, we must ensure that the technology choices we make can underpin those critical enterprise requirements.

High throughput

To meet demanding graph workloads, Neo4j clusters allow work to be federated across a number of cooperating machines.

Figure 4.1. Throughput

In a clustered environment, throughput goals (graph queries) can be met by allowing each machine to process a subset of the overall queries. This scheme also reduces latency, as the shorter (logical) queue lengths in the diagram above show.
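The effect on queue length can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `distribute` function and the round-robin policy are hypothetical and are not part of Neo4j; they stand in for whatever mechanism spreads queries across cluster members.

```python
from collections import deque


def distribute(queries, machine_count):
    """Assign queries round-robin to `machine_count` logical queues.

    Hypothetical helper for illustration only; it does not reflect
    how Neo4j itself routes queries.
    """
    queues = [deque() for _ in range(machine_count)]
    for index, query in enumerate(queries):
        queues[index % machine_count].append(query)
    return queues


# Twelve pending queries, first on one machine, then on three.
queries = [f"q{i}" for i in range(12)]
single = distribute(queries, 1)
cluster = distribute(queries, 3)

# The longest logical queue is a rough proxy for worst-case waiting time.
longest_single = max(len(q) for q in single)
longest_cluster = max(len(q) for q in cluster)
print(longest_single, longest_cluster)  # prints: 12 4
```

With the same twelve queries, each of the three machines holds a queue one third as long as the single machine's queue, which is the intuition behind the reduced latency.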

Continuous availability

A fundamental requirement for most enterprise-grade systems is high availability. That is, even in the presence of failures, the system continues to deliver its functionality to end users (humans or other computer systems).

Figure 4.2. Availability

Neo4j’s clustering architecture is an automated solution for ensuring that Neo4j is continuously available. The premise is that we deploy redundancy into the cluster so that, if failures occur, they can be masked by the remaining live instances. In the case shown above, a single failed instance does not cause the cluster to stop (though the throughput of the cluster may be lower).
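The idea of masking a failure can be sketched conceptually as follows. The `Instance` class and `run_with_failover` function are hypothetical names invented for this illustration; in a real deployment the cluster handles this automatically, and this code does not use or represent any Neo4j API.

```python
class Instance:
    """A stand-in for one cluster member (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def run(self, query):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled {query}"


def run_with_failover(instances, query):
    """Try each instance in turn so that a failed one is masked by a live one."""
    for instance in instances:
        try:
            return instance.run(query)
        except ConnectionError:
            continue  # skip the failed instance and try the next
    raise RuntimeError("no live instances remain")


cluster = [Instance("a"), Instance("b"), Instance("c")]
cluster[0].alive = False  # simulate a single failed instance

result = run_with_failover(cluster, "MATCH (n) RETURN count(n)")
print(result)  # prints: b handled MATCH (n) RETURN count(n)
```

The caller still receives an answer even though one instance is down; only when every instance has failed does the request itself fail.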

Disaster recovery

Disaster recovery is the ability to recover from major service outages, greater than can be accommodated by the redundant capacity in a continuously available cluster. Typically these manifest as data center outages, physical network severance, or even denial-of-service attacks that render large amounts of infrastructure inoperable.

Figure 4.3. Safety

In these cases, a disaster recovery strategy can define a failover data center along with a strategy for bringing services back online. Neo4j clustering can accommodate disaster recovery strategies that require very short windows of downtime or low tolerances for data loss in disaster scenarios. By deploying a cluster instance to an alternate location, you have an active copy of your database available in your designated disaster recovery location, up to date with the transactions executed against your operational database cluster.

By provisioning one or more disaster recovery instances in advance, we help to minimize downtime and ensure the safety of data. Given that disaster recovery by definition happens at stressful and inconvenient times, having a well-designed recovery scenario as part of the database cluster is a sensible plan, albeit one we hope never to put into action.

Analytics and reporting

Operational data is the lifeblood of our online processing systems. However, other stakeholders in the enterprise require access to that data for their own business intelligence purposes. Analytics and reporting queries are often ad hoc from the database’s perspective. Queries may be speculative or wide-ranging as new analyses are performed, which means the workload can be unpredictable and onerous. Such workloads risk upsetting the balance of work in the system, leaving fewer resources available for the online workloads (e.g. customers). Nevertheless, we must also service the needs of the analytics requests. Fortunately, Neo4j clustering can be used to provide separate instances dedicated entirely to analytics queries, whether from end users or from BI tools. As a consequence of being part of the cluster, the analytics instances are up to date and do not require any external ETL jobs or other complexity.