4.2.11. Data center disaster recovery

This section describes how to recover your Neo4j Causal Cluster following a data center failure. Specifically, it covers how to safely turn a small number of surviving instances from a read-only state back into a fully operational cluster of read/write instances.

Data center loss scenario

This section describes how to recover a multi-data center deployment which owing to external circumstances has reduced the cluster below half of its members. It is most easily typified by a 2x2 deployment with 2 data centers each containing two instances. This deployment topology can either arise because of other data center failures, or be a deliberate choice to ensure the geographic survival of data for catastrophe planning.

Under normal operation this provides a stable majority quorum where the fastest three out of four machines will execute users' transactions, as we see highlighted in Figure 4.19, “Two Data Center Deployment with Four Core Instances”.

Figure 4.19. Two Data Center Deployment with Four Core Instances

However, if an entire data center goes offline because of some disaster, a majority quorum can no longer be formed.

Neo4j Core clusters are based on the Raft consensus protocol for processing transactions. The Raft protocol requires a majority of cluster members to agree in order to ensure the safety of the cluster and data. As such, the loss of a majority quorum results in a read-only situation for the remaining cluster members.
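
As a concrete illustration of the arithmetic: a Raft cluster of N Core members can only commit a transaction once a majority of floor(N/2) + 1 members agree. In the 2x2 deployment above, N = 4, so 3 members must agree. When one data center is lost, only 2 members survive, which falls short of the required 3, and the remaining members can therefore serve reads but not writes.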

When a data center is lost abruptly in a disaster, rather than having its instances cleanly shut down, the surviving members still believe that they are part of a larger cluster. This is different even from the case of rapid failures of individual instances in a live data center, which can often be detected by the underlying cluster middleware, allowing the cluster to reconfigure automatically.

Conversely, if we lose an entire data center, there is no opportunity for the cluster to automatically reconfigure; the loss appears instantaneous to the other cluster members. However, because each remaining machine has only a partial view of the state of the cluster (its own), it is not safe to allow any individual machine to make an arbitrary decision to re-form the cluster.

In this case we are left with two surviving machines, which cannot form a majority quorum and thus cannot make progress.

Figure 4.20. Data Center Loss Requires Guided Recovery

But from a bird's-eye view, it is clear that we have enough surviving machines to allow a non-fault-tolerant cluster to form under operator supervision.

Groups of individual cluster members (e.g. those in a single data center) may become isolated from the rest of the cluster, during a network partition for example. If they arbitrarily re-formed a new, smaller cluster, there would be a risk of split-brain: from the clients' point of view, two or more smaller clusters might be available for reads and writes, depending on the nature of the partition. Such situations lead to divergence that is tricky and laborious to reconcile, and so are best avoided.

To be safe, the decision to re-form the cluster must be made by an operator or another out-of-band agent (e.g. scripts triggered by well-understood, trustworthy alerts) that has a trusted view of the whole system estate. In the surviving data center, the cluster can then be rebooted into a smaller configuration whilst retaining all data committed up to that point. While end users may experience unavailability during the switchover, no committed data will be lost.

Procedure for recovering from data center loss

The following procedure for recovering from data center loss should not be undertaken lightly. It assumes that we are completely confident that a disaster has occurred and that our previously data center-spanning cluster has been reduced to a read-only cluster in a single data center. Further, it assumes that, from a data-quality point of view, the remaining cluster members are fit to provide a seed from which a new cluster can be created.

Having acknowledged the above, the procedure for returning the cluster to full availability following the catastrophic loss of all but one data center is as follows:

  1. Ensure that a catastrophe has occurred and that we have access to the surviving members of the cluster in the surviving data center. Then, for each instance:
     a. Stop the instance with bin/neo4j stop or shut down the service.
     b. Change the configuration in neo4j.conf such that the causal_clustering.initial_discovery_members property contains the DNS names or IP addresses of the other surviving instances (an example configuration is sketched after this list).
     c. Ensure that the setting causal_clustering.expected_core_cluster_size=2 (assuming 2 surviving instances) appears in the neo4j.conf settings file.
     d. Start the instance with bin/neo4j start or start the neo4j service.
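
For illustration only, here is a minimal sketch of the relevant neo4j.conf entries on each of the two survivors. The host names server1.dc1.example.com and server2.dc1.example.com and the discovery port 5000 are assumptions made for this example; substitute the addresses and ports used in your own environment.

    # Hypothetical addresses of the two surviving instances; replace with your own.
    causal_clustering.initial_discovery_members=server1.dc1.example.com:5000,server2.dc1.example.com:5000
    # Two instances survived the disaster.
    causal_clustering.expected_core_cluster_size=2

With these entries in place on both machines, starting them again allows them to discover each other and form a two-member, non-fault-tolerant cluster.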

Once this procedure is completed for each instance, they will form a cluster that is available for reads and writes. It is recommended at this point that other cluster members are incorporated into the cluster to improve its load handling and fault tolerance. See Section 4.2.3, “Create a new Causal Cluster” for details of how to configure instances to join the cluster from scratch.
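
As a sketch of what such a joining instance's configuration might look like, using the same hypothetical host names as in the example above, a new Core member's neo4j.conf would typically include at least the following; the full set of required settings is described in Section 4.2.3.

    # Run this instance as a Core member of the causal cluster.
    dbms.mode=CORE
    # Point the new member at the recovered cluster's existing instances.
    causal_clustering.initial_discovery_members=server1.dc1.example.com:5000,server2.dc1.example.com:5000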