4.2.5. Backup planning for a Causal Cluster

This section discusses considerations when designing the backup strategy for a Neo4j Causal Cluster.

In a Neo4j Causal Cluster, both Core Servers and Read Replicas support the backup protocol, so servers of either role can be used for cluster backups. The considerations below should be weighed before settling on a backup strategy. Detailed instructions for the backup and recovery commands are given in the backup chapter.

Read Replica backups

Generally we prefer to select Read Replicas to act as our backup providers since they are far more numerous than Core Servers in typical cluster deployments.

However, since Read Replicas are replicated asynchronously from Core Servers, they may lag behind the Core cluster in applying transactions. A Read Replica may even become orphaned from the Core Servers, leaving its contents quite stale. In the pathological case, a backup taken now is less up to date than a previous backup.

Fortunately, we can check the last transaction ID processed on any server and verify that the Read Replica's value is sufficiently close to the latest transaction ID processed by the Core Servers. If the two are close enough, we can back up from the Read Replica with confidence that it is reasonably up to date with respect to the Core Servers.

Transaction IDs in Neo4j are strictly increasing integer values. A higher transaction ID is therefore more recent than a lower one.
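The freshness check described above can be sketched as a small comparison of transaction IDs. This is only an illustration: the function name, the `max_lag` threshold and its default, and the `last_backup_tx_id` parameter are hypothetical, and the sketch assumes the transaction IDs have already been read from the servers by some other means.

```python
from typing import Optional


def replica_is_fresh(
    replica_tx_id: int,
    core_tx_id: int,
    max_lag: int = 100,  # hypothetical threshold; tune for your workload
    last_backup_tx_id: Optional[int] = None,
) -> bool:
    """Decide whether a Read Replica is up to date enough to back up from.

    All names here are illustrative and not part of any Neo4j API.
    """
    # Guard against the pathological case: a new backup that would be
    # older than the previous one.
    if last_backup_tx_id is not None and replica_tx_id < last_backup_tx_id:
        return False
    # Transaction IDs are strictly increasing, so the difference is the
    # number of transactions the replica has yet to apply.
    return core_tx_id - replica_tx_id <= max_lag


print(replica_is_fresh(990, 1000))   # True: only 10 transactions behind
print(replica_is_fresh(500, 1000))   # False: 500 transactions behind
```

An acceptable lag depends on the write rate of the cluster and on how stale a backup may be, so any threshold should be chosen per deployment.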

Neo4j servers expose the last processed transaction ID through JMX, through Neo4j metrics, and in the Neo4j Browser. To view the latest processed transaction ID (and other metrics) in the Neo4j Browser, type :sysinfo at the prompt.

Core Server backups

In a Core-only cluster, we don’t have the luxury of numerous Read Replicas to scale out workload. As such we pick a server based on factors like its physical proximity, bandwidth, performance, liveness and so forth.

Generally speaking, the cluster will function as normal even while large backups are taking place. However, backing up will place additional IO burdens on the backup server which may impact its performance.

A very conservative view would be to treat the backup server as an unavailable instance, assuming its performance will be lower than the other instances in the cluster. In such cases, it is recommended that there is sufficient redundancy in the cluster such that one slower server does not reduce the capacity to mask faults.

We can factor this conservative strategy into our cluster planning. The equation M = 2F + 1 gives the number of members M that a cluster requires in order to tolerate F faults. To allow for one slower machine in the cluster during backup, we increase F by one. Thus, if we originally envisaged a cluster of three Core Servers to tolerate one fault, we could increase that to five to maintain a comfortably safe level of redundancy.
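As a worked illustration of the sizing rule M = 2F + 1, the arithmetic can be written out as below. The function names are hypothetical and exist only for this example.

```python
def members_required(faults: int) -> int:
    """Return the cluster size M needed to tolerate `faults` failures (M = 2F + 1)."""
    if faults < 0:
        raise ValueError("faults must be non-negative")
    return 2 * faults + 1


def faults_tolerated(members: int) -> int:
    """Return the number of faults F a cluster of `members` can tolerate."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return (members - 1) // 2


print(members_required(1))  # 3: tolerate one fault
print(members_required(2))  # 5: one fault plus one server treated as unavailable during backup
```

Treating the backup server as a potential fault raises F from 1 to 2, which is why the example cluster grows from three Core Servers to five.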