4.2.2. Causal Cluster lifecycle

This section describes the lifecycle of a Neo4j Causal Cluster.

Section 4.2.1, “Introduction” provided an overview of a Causal Cluster. In this section we will develop some deeper knowledge of how the cluster operates. By developing our understanding of how the cluster works we will be better equipped to design, deploy, and troubleshoot our production systems.

Our in-depth tour will follow the lifecycle of a cluster. We will boot a Core cluster and pick up key architectural foundations as the cluster forms and transacts. We will then add Read Replicas and show how they join the cluster, catch up with the Core Servers, and remain caught up. Finally we will see how backup is used in live cluster environments before shutting down Read Replicas and Core Servers.

Discovery protocol

The discovery protocol is the first step in forming a Causal Cluster. It takes in some hints about existing Core cluster servers and uses these hints to initiate a network join protocol.

Figure 4.6. Causal Cluster discovery protocol: Core-to-Core or Read replica-to-Core only.

From these hints the server will either join an existing cluster or form one of its own (don’t worry about forming split-brain clusters: Core cluster formation is safe since it is underpinned by the Raft protocol).

The discovery protocol targets Core Servers only, regardless of whether it is a Core Server or a Read replica performing discovery. This is because we expect Read Replicas to be both numerous and, relatively speaking, transient, whereas Core Servers will likely be fewer in number and relatively stable over time.

The hints are delivered as causal_clustering.initial_discovery_members in neo4j.conf, typically as dotted-decimal IP addresses and advertised ports. On consuming the hints the server will try to handshake with the other listed servers. On successful handshake with another server or servers the current server will discover the whole current topology.
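For illustration, a three-member Core cluster might carry its hints in neo4j.conf as shown below. The addresses are placeholders and the port shown is the default discovery port; the dbms.mode line is included only as a reminder of how a Core Server identifies itself.

    # Example discovery hints (placeholder addresses); every Core Server and
    # Read replica is typically given the same list.
    dbms.mode=CORE
    causal_clustering.initial_discovery_members=10.0.0.1:5000,10.0.0.2:5000,10.0.0.3:5000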

The discovery service continues to run throughout the lifetime of the Causal Cluster and is used to maintain the current state of available servers and to help clients route queries to an appropriate server via the client-side drivers.

Core membership

If it is a Core Server that is performing discovery, once it has made a connection to one of the existing Core Servers it then joins the Raft protocol.

Raft is a distributed algorithm for maintaining a consistent log across multiple shared-nothing servers designed by Diego Ongaro for his 2014 Ph.D. thesis. See the Raft thesis for details.

Raft handles cluster membership by making it a normal part of keeping a distributed log in sync. Joining a cluster involves the insertion of a cluster membership entry into the Raft log which is then reliably replicated around the existing cluster. Once that entry is applied to enough members of the Raft consensus group (those machines running the specific instance of the algorithm), they update their view of the cluster to include the new server. Thus membership changes benefit from the same safety properties as other data transacted via Raft (see Section, “Transacting via the Raft protocol” for more information).
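As a rough sketch (not Neo4j's implementation), a membership change can be pictured as an ordinary entry flowing through the same commit-then-apply path as any data entry; the names below are purely illustrative.

    from dataclasses import dataclass, field

    @dataclass
    class MembershipEntry:
        joining_server: str

    @dataclass
    class CoreServerState:
        members: set = field(default_factory=set)

        def apply(self, entry):
            # Applying a committed membership entry updates this server's view
            # of the cluster, just as applying a data entry updates the graph.
            if isinstance(entry, MembershipEntry):
                self.members.add(entry.joining_server)

    state = CoreServerState({"core1", "core2", "core3"})
    state.apply(MembershipEntry("core4"))   # entry already committed via Raft
    assert "core4" in state.members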

The new Core Server must also catch up its own Raft log with respect to the other Core Servers as it initializes its internal Raft instance. This is the normal case when a cluster is first booted and has performed few operations. There will be a delay before the new Core Server becomes available if it also needs to catch up graph data from other servers (as per Section, “Catchup protocol”). This is the normal case for a long-lived cluster where the servers hold a great deal of graph data.

When an instance establishes a connection to any other instance, it determines the current state of the cluster and ensures that it is eligible to join. To be eligible, the Neo4j instance must host the same database store as other members of the cluster (although it is allowed to be in an older, outdated state), or be a new deployment without a database store.

Read replica membership

When a Read replica performs discovery, once it has made a connection to any of the available Core Servers it proceeds to add itself into a shared whiteboard.

Figure 4.7. All Read replicas registered with shared whiteboard.

This whiteboard provides a view of all live Read Replicas and is used both for routing requests from database drivers that support end-user applications and for monitoring the state of the cluster.

The Read Replicas are not involved in the Raft protocol, nor are they able to influence cluster topology. Hence a shared whiteboard outside of Raft comfortably scales to very large numbers of Read Replicas.
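Conceptually (and only as a sketch, not Neo4j internals), the whiteboard behaves like a map of the Read Replicas currently visible to the Core Servers, from which a read-routing view can be derived:

    read_replicas = {}

    def register(replica_id, address):
        read_replicas[replica_id] = address      # replica appears

    def deregister(replica_id):
        read_replicas.pop(replica_id, None)      # replica leaves or times out

    def read_routing_table():
        # Drivers can be handed the currently live Read Replicas for reads.
        return sorted(read_replicas.values())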

The whiteboard is kept up to date as Read Replicas join and leave the cluster, even if they fail abruptly rather than leaving gracefully.

Transacting via the Raft protocol

Once bootstrapped, each Core Server spends its time processing database transactions. Updates are reliably replicated around Core Servers via the Raft protocol. Updates appear in the form of a (committed) Raft log entry containing transaction commands which is subsequently applied to the graph model.

One of Raft’s primary design goals is to be easily understandable so that there are fewer places for tricky bugs to hide in implementations. As a side-effect, it is also easy for database operators to reason about their Core Servers in their Causal Clusters.

The Raft Leader for the current term (a logical clock) appends the transaction (an 'entry' in Raft terminology) to the head of its local log and asks the other instances to do the same. When the Leader can see that a majority of instances have appended the entry, it can be considered committed into the Raft log. The client application can now be informed that the transaction has safely committed since there is sufficient redundancy in the system to tolerate any (non-pathological) faults.
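The commit rule itself is simple majority arithmetic. The sketch below is an illustration of that rule rather than Neo4j code:

    def majority(cluster_size):
        # Smallest number of servers that constitutes a majority.
        return cluster_size // 2 + 1

    def is_committed(acknowledged_appends, cluster_size):
        # acknowledged_appends counts the Leader itself plus every Follower
        # known to have appended the entry to its local Raft log.
        return acknowledged_appends >= majority(cluster_size)

    assert majority(3) == 2 and majority(5) == 3
    assert is_committed(3, 5) and not is_committed(2, 5)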

The Raft protocol describes three roles that an instance can be playing: Leader, Follower, and Candidate. These are transient roles and any Core Server can expect to play them throughout the lifetime of a cluster. While it is interesting from a computing science point of view to understand those states, operators should not be overly concerned: they are an implementation detail.

For safety, within any Raft protocol instance there is only one Leader able to make forward progress in any given term. The Leader bears the responsibility for imposing order on Raft log entries and driving the log forward with respect to the Followers.

Followers maintain their logs with respect to the current Leader’s log. Should any participant in the cluster suspect that the Leader has failed, then they can instigate a leadership election by entering the Candidate state. In Neo4j Core Servers this happens at a millisecond timescale, around 500 ms by default.

Whichever instance is in the best state (including the existing Leader, if it remains available) can emerge from the election as Leader. The "best state" for a Leader is decided by highest term, then by longest log, then by highest committed entry.
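Raft expresses this as a vote-granting rule rather than a global comparison, but purely as an illustration of the ordering described above, the "best state" can be sketched as a tuple comparison:

    from dataclasses import dataclass

    @dataclass
    class CandidateState:
        term: int             # highest term this server has seen
        log_length: int       # number of entries in its local Raft log
        committed_index: int  # highest entry it knows to be committed

    def election_key(state):
        # Highest term first, then longest log, then highest committed entry.
        return (state.term, state.log_length, state.committed_index)

    candidates = [CandidateState(4, 120, 115), CandidateState(4, 118, 118)]
    best = max(candidates, key=election_key)   # here the longer log wins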

The ability to fail over roles without losing data allows forward progress even in the event of faults. Even where Raft instances fail, the protocol can rapidly piece together which of the remaining instances is best placed to take over from the failed instance (or instances) without data loss. This is the essence of a non-blocking consensus protocol which allows Neo4j Causal Clustering to provide continuous availability to applications.

Catchup protocol

Read Replicas spend their time concurrently processing graph queries and applying a stream of transactions from the Core Servers to update their local graph store.

Figure 4.8. Transactions shipped from Core to Read replica.

Updates from Core Servers to Read Replicas are propagated by transaction shipping. Transaction shipping is instigated by Read Replicas frequently polling any of the Core Servers specifying the ID of the last transaction they received and processed. The frequency of polling is an operational choice.
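In Neo4j 3.x this polling interval is exposed as a Read replica setting, which we believe is causal_clustering.pull_interval (check the configuration settings reference for your version); for example:

    # neo4j.conf on a Read replica: how often to poll a Core Server for
    # new transactions (the value shown is illustrative).
    causal_clustering.pull_interval=1s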

Neo4j transaction IDs are strictly monotonic integer values (they always increase). This makes it simple to determine whether or not a transaction has been applied to a Read Replica by comparing its last processed transaction ID with that of a Core Server.

If there is a large difference between a Read replica’s transaction history and that of a Core Server, polling may not result in any transactions being shipped. This is quite expected, for example when a new Read replica is introduced to a long-running cluster or where a Read replica has been down for some significant period of time. In such cases the catchup protocol will realize the gap between the Core Servers and the Read replica is too large to fill via transaction shipping and will fall back to copying the database store directly from Core Server to Read replica. Since we are working with a live system, at the end of the database store copy the Core Server’s database is likely to have changed. The Read replica completes the catchup by asking for any transactions missed during the copy operation before becoming available.
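The decision can be sketched as follows; the names and the notion of an "oldest shippable transaction" are illustrative rather than Neo4j internals, but the shape of the logic matches the description above:

    def catch_up(replica_last_tx, core_last_tx, oldest_shippable_tx):
        # Transaction IDs always increase, so simple comparisons tell us how
        # far behind the Read replica is and which mechanism to use.
        if replica_last_tx >= core_last_tx:
            return "up to date"
        if replica_last_tx >= oldest_shippable_tx:
            return "ship missing transactions"
        # Gap too large to fill from the Core Server's transaction logs:
        # copy the store, then ship whatever committed during the copy.
        return "store copy, then ship remaining transactions"

    assert catch_up(100, 100, 40) == "up to date"
    assert catch_up(90, 100, 40) == "ship missing transactions"
    assert catch_up(10, 100, 40) == "store copy, then ship remaining transactions"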

A very slow database store copy could conceivably leave the Read replica too far behind to catch up via transaction log shipping as the Core Server has substantially moved on. In such cases the Read replica server repeats the catchup protocol. In pathological cases the operator can intervene to snapshot, restore, or file copy recent store files from a fast backup.

Backup protocol

During the lifetime of the Causal Cluster, operators will want to back up the cluster state for disaster recovery purposes. Backup is a strategy that places a deliberate gap between the online system and its recent state such that the two do not share common failure points (such as the same cloud storage). Backup is in addition to and orthogonal to any strategies for spreading Core Servers and Read Replicas across data centers.

For operational details on how to backup a Neo4j cluster, see Section 4.2.5, “Backup planning for a Causal Cluster”.

The Backup protocol is actually implemented as an instance of the Catchup protocol. Instead of the client being a Read replica, it is the neo4j-admin backup tool, which spools the data out to disk rather than into a live database.
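For example, a backup taken over the backup port of a cluster member might look like the following; the address, directory, and backup name are placeholders, and the full set of options is covered in the backup documentation:

    # Addresses and paths are placeholders.
    neo4j-admin backup --from=10.0.0.10:6362 --backup-dir=/mnt/backups --name=graph.db-backup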

Both full and incremental backups can be taken via neo4j-admin backup, and both Core Servers and Read Replicas can fulfil backups. However, given the relative abundance of Read Replicas it is typical for backups to target one of them rather than the less plentiful Core Servers (see Section 4.2.5, “Backup planning for a Causal Cluster” for more on Core versus Read replica backups).

Read replica shutdown

On clean shutdown, a Read replica will invoke the discovery protocol to remove itself from the shared whiteboard overview of the cluster. It will also ensure that the database is cleanly shut down and consistent, immediately ready for future use.

On an unclean shutdown such as a power outage, the Core Servers maintaining the overview of the cluster will notice that the Read replica’s connection has abruptly been cut. The discovery machinery will initially hide the Read replica’s whiteboard entry, and if the Read replica does not reappear quickly its modest memory use in the shared whiteboard will be reclaimed.

On unclean shutdown it is possible the Read replica will not have entirely consistent store files or transaction logs. On subsequent reboot the Read replica will roll back any partially applied transactions such that the database is in a consistent state.

Core shutdown

A clean Core Server shutdown, like Core Server booting, is handled via the Raft protocol. When a Core Server is shut down, it appends a membership entry to the Raft log which is then replicated around the Core Servers. Once a majority of Core Servers have committed that membership entry, the leaver has logically left the cluster and can safely shut down. All remaining instances accept that the cluster has grown smaller and is therefore less fault tolerant. If the leaver happened to be playing the Leader role at the point of leaving, that role will be transitioned to another Core Server after a brief election.

An unclean shutdown does not directly inform the cluster that a Core Server has left. Instead the Core cluster size remains the same for the purposes of computing majorities for commits. Thus after an unclean shutdown in a cluster of 5 Core Servers, commits now require agreement from 3 of the 4 remaining members, a tighter margin than 3 of 5 before the unclean shutdown.
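To make the arithmetic concrete (a sketch, not Neo4j code), the commit majority is computed from the configured cluster size, not from the number of servers currently reachable:

    configured_core_servers = 5
    commit_majority = configured_core_servers // 2 + 1   # still 3
    reachable_servers = 4                                # one failed uncleanly
    # Commits now need 3 acknowledgements from the 4 reachable servers.
    assert commit_majority == 3 and commit_majority <= reachable_servers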

Of course when Core Servers fail, operators or monitoring scripts can be alerted so that they can intervene in the cluster if necessary.

If the leaver was playing the Leader role, there will be a brief election to produce a new Leader. Once the new Leader is established, the Core cluster continues albeit with less redundancy. However even with this failure, a Core cluster of 5 servers reduced to 4 can still tolerate one more fault before becoming read-only.