7.6. Internals of clustering

This section details a few selected internals of a Neo4j Causal Cluster. Understanding the internals is not vital but can be helpful in diagnosing and resolving operational issues.

This section includes:

Elections and leadership
Leadership balancing
Multi-database and the reconciler
Store copy
On-disk state

7.6.1. Elections and leadership

The Core Servers in a Causal Cluster use the Raft protocol to ensure consistency and safety. An implementation detail of Raft is that it uses a Leader role to impose an ordering on an underlying log, with the other instances acting as Followers which replicate the leader’s state. Specifically in Neo4j, this means that writes to a database are ordered by the Core instance currently playing the Leader role for that database. If multiple databases have been created, each of those databases operates within a logically separate Raft group, and therefore each has an individual leader. This means that a Core Server may act as Leader for some databases and as Follower for other databases.
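For example, on a Neo4j 4.x cluster the leaderships can be inspected with Cypher. The output below is illustrative and trimmed; the exact columns vary between versions:

    SHOW DATABASES;

    name      address          role        requestedStatus  currentStatus
    "neo4j"   "server1:7687"   "leader"    "online"         "online"
    "neo4j"   "server2:7687"   "follower"  "online"         "online"
    "system"  "server1:7687"   "follower"  "online"         "online"
    "system"  "server2:7687"   "leader"    "online"         "online"

Here server1 is the leader for neo4j but a follower for system, which is exactly the mixed-role situation described above.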

If a follower has not heard from the leader for a while, then it can initiate an election and attempt to become the new leader. The follower makes itself a Candidate and asks the other Cores to vote for it. If it gains a majority of the votes, it assumes the leader role. Cores will not vote for a candidate that is less up-to-date than they are. There can only be one leader at any time per database, and that leader is guaranteed to have the most up-to-date log.

Elections are expected to occur during the normal running of a cluster, and they do not pose an issue in and of themselves. If you are experiencing frequent re-elections and they are disturbing the operation of the cluster, then you should try to figure out what is causing them. Some common causes are environmental issues (e.g. flaky networking) and work overload conditions (e.g. more concurrent queries and transactions than the hardware can handle). As a starting point for such an investigation, the current role assignments in the cluster can be inspected as shown below.
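A sketch of such an inspection on a 4.x cluster; dbms.cluster.overview() is a built-in procedure, but the exact columns it returns vary between versions:

    // Lists every cluster member together with its addresses,
    // the databases it hosts, and its role for each of them.
    CALL dbms.cluster.overview();

Repeated role flapping in this output, correlated with role-switch messages in debug.log, usually points at one of the causes above.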

7.6.2. Leadership balancing

Write transactions will always be routed to the leader for the respective database. As a result, unevenly distributed leaderships may cause write queries to be disproportionately directed to a subset of instances. By default, Neo4j avoids this by automatically transferring database leaderships so that they are evenly distributed throughout the cluster.
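In recent 4.x versions this behavior is controlled by a dedicated setting. As an assumption to verify against your exact version, automatic transfers can be disabled if you need leaderships to stay where they are:

    # neo4j.conf
    # Default is EQUAL_DBS, which spreads database leaderships evenly
    # across the Cores. NO_BALANCING turns automatic transfers off.
    causal_clustering.leadership_balancing=NO_BALANCING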

7.6.3. Multi-database and the reconciler

Databases operate as independent entities in a Neo4j DBMS, both in a standalone deployment and in a cluster. Since a cluster consists of multiple independent server instances, the effects of administrative operations, like creating a new database, happen asynchronously and independently for each server. However, the immediate effect of an administrative operation is to safely commit the desired state in the system database.

The desired state committed in the system database gets replicated and is picked up by an internal component called the reconciler. It runs on every instance and takes the appropriate actions required locally on that instance for reaching the desired state: creating, starting, stopping and dropping databases.
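This split between desired and actual state is directly visible in Cypher. A minimal illustration using 4.x syntax; the status columns are part of the documented SHOW DATABASES output:

    CREATE DATABASE sales;
    SHOW DATABASE sales;

    -- requestedStatus: the desired state committed in the system database
    -- currentStatus:   what the reconciler has achieved locally so far

Until every instance’s reconciler has finished, the two columns can differ, which is the asynchronous behavior described above.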

Every database runs in an independent Raft group. Since there are two databases in a fresh cluster, system and neo4j, a fresh cluster also contains two Raft groups. Every Raft group has an independent leader, and thus a particular Core Server could be the leader for one database and a follower for another.

7.6.4. Store copy

Store copies are initiated when an instance does not have an up-to-date copy of the database. For example, this will be the case when a new instance is joining a cluster (without a seed). It can also happen as a consequence of falling behind the rest of the cluster, for reasons such as connectivity issues or having been shut down. Upon re-establishing connection with the cluster, an instance will recognize that it is too far behind and fetch a new copy from the rest of the cluster.

A store copy is a major operation which may disrupt the availability of instances in the cluster. Store copies should not be a frequent occurrence in a well-functioning cluster, but rather an exceptional operation that happens due to specific causes, e.g. network outages or planned maintenance. If store copies happen during regular operation, then the configuration of the cluster, or the workload directed at it, might have to be reviewed so that all instances can keep up, and so that there is a big enough buffer of Raft logs and transaction logs to handle smaller transient issues.
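The size of those buffers is governed by the Raft log pruning and transaction log retention settings. A hedged sketch with illustrative values, not recommendations; check the defaults for your version:

    # neo4j.conf
    # Keep more Raft log on disk so that members which fall slightly
    # behind can catch up without a full store copy (illustrative value).
    causal_clustering.raft_log_pruning_strategy=2g size

    # Retain transaction logs long enough to cover short outages
    # (illustrative value).
    dbms.tx_log.rotation.retention_policy=2 days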

The protocol used for store copies is robust and configurable. The network requests are directed at an upstream member according to configuration, and they are retried despite transient failures. The maximum amount of time to keep retrying each request can be modified by setting causal_clustering.store_copy_max_retry_time_per_request. If a request fails and the maximum retry time has elapsed, then it will stop retrying and the store copy will fail.

Use causal_clustering.catch_up_client_inactivity_timeout to configure the inactivity timeout for any particular request. Note that this setting applies to all requests from the catchup client, including the pulling of transactions.
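A hedged configuration sketch combining the two settings above; the values shown are illustrative only:

    # neo4j.conf
    # Give up on a single store copy request after retrying for this long.
    causal_clustering.store_copy_max_retry_time_per_request=30m

    # Fail any catchup request, including transaction pulling, if the
    # connection has been inactive for this long.
    causal_clustering.catch_up_client_inactivity_timeout=10m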

The default upstream strategy differs for Cores and Read Replicas. Cores always send the initial request to the leader in order to get the most up-to-date information about the store. For the subsequent file and index requests, Cores alternate between a random Read Replica and a random Core member.

Read Replicas use the same strategy for store copies as they use for pulling transactions. The default is to pull from a random Core member.

If you are running a multi-data center cluster, then upstream strategies for both Cores and Read Replicas can be configured. Remember that for Read Replicas this also affects where transactions are pulled from. See more in Configure for multi-data center operations.
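As a sketch, and as an assumption to verify against the multi-data center chapter for your version, the feature is enabled with a license setting and a strategy is then selected per instance:

    # neo4j.conf
    # Enable the multi-data center features.
    causal_clustering.multi_dc_license=true

    # On a Read Replica: prefer upstream members within the same server
    # group. Note that this affects both store copies and transaction
    # pulling.
    causal_clustering.upstream_selection_strategy=connect-randomly-within-server-group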

7.6.5. On-disk state

The on-disk state of cluster instances is different to that of standalone instances. The biggest difference is the existence of additional cluster state. Most of the files in the cluster state are relatively small, but the Raft logs can become quite large depending on the configuration and workload.
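On a default installation the cluster state lives under the data directory. The layout below is an illustrative sketch of a 4.x installation and varies between versions:

    <NEO4J_HOME>/data/cluster-state/
        db/
            neo4j/
                raft-log/    # can grow large; see Raft log pruning above
            system/
                raft-log/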

It is important to understand that once a database has been extracted from a cluster and used in a standalone deployment, it must not be put back into an operational cluster. This is because the cluster and the standalone deployment now have separate databases, with different and irreconcilable writes applied to them.

If you try to reinsert the modified database back into the cluster, then the logs and stores will mismatch. Operators should not try to merge standalone databases into the cluster in the optimistic hope that their data will become replicated. That will not happen and will likely lead to unpredictable cluster behavior.
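Relatedly, if you do want to take the store from a cluster member and use it in a standalone deployment, or seed a new cluster from it, the documented route is to first remove the instance’s cluster state with neo4j-admin while the instance is stopped:

    # Run on a stopped instance; removes the cluster state so that the
    # store can be used standalone or seeded into a new cluster.
    neo4j-admin unbind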