Internals of clustering

This section details a few selected internals of a Neo4j cluster. Understanding the internals is not vital but can be helpful in diagnosing and resolving operational issues.

1. Elections and leadership

The Core instances used as Primary servers in a cluster use the Raft protocol to ensure consistency and safety. See Advanced Causal Clustering for more information on the Raft protocol. An implementation detail of Raft is that it uses a Leader role to impose an ordering on an underlying log, with the other instances acting as Followers which replicate the leader's state. Specifically in Neo4j, this means that writes to a database are ordered by the Core instance currently playing the Leader role for that database. If multiple databases have been created, each of them operates within a logically separate Raft group and therefore has its own leader. This means that a Core instance may act as Leader for some databases and as Follower for others.

If a follower has not heard from the leader for a while, it can initiate an election and attempt to become the new leader. The follower makes itself a Candidate and asks the other Cores to vote for it. If it gains a majority of the votes, it assumes the leader role. A Core will not vote for a candidate that is less up-to-date than itself. There can only be one leader at any time per database, and that leader is guaranteed to have the most up-to-date log.

Elections are expected to occur during the normal running of a cluster and they do not pose an issue in and of themselves. If you are experiencing frequent re-elections and they are disturbing the operation of the cluster, you should try to figure out what is causing them. Common causes are environmental issues (e.g. flaky networking) and work overload conditions (e.g. more concurrent queries and transactions than the hardware can handle).

2. Leadership balancing

Write transactions will always be routed to the leader for the respective database. As a result, unevenly distributed leaderships may cause write queries to be disproportionately directed to a subset of instances. By default, Neo4j avoids this by automatically transferring database leaderships so that they are evenly distributed throughout the cluster. Additionally, Neo4j will automatically transfer database leaderships away from instances where those databases are configured to be read-only using dbms.databases.read_only or similar.
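
For example, a Core member can be kept from holding the leadership for a particular database by declaring that database read-only on that member. A minimal neo4j.conf sketch, where the database name "sales" is hypothetical:

# Make the "sales" database read-only on this instance; any leadership this
# instance holds for "sales" is then transferred to another Core member.
dbms.databases.read_only=sales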

3. Multi-database and the reconciler

Databases operate as independent entities in a Neo4j DBMS, both in standalone and in a cluster. Since a cluster can consist of multiple independent server instances, the effects of administrative operations like creating a new database happen asynchronously and independently for each server. However, the immediate effect of an administrative operation is to safely commit the desired state in the system database.

The desired state committed in the system database gets replicated and is picked up by an internal component called the reconciler. It runs on every instance and takes the actions required locally on that instance to reach the desired state: creating, starting, stopping, and dropping databases.

Every database runs in an independent Raft group. Since a fresh cluster contains two databases, system and neo4j, it therefore also contains two Raft groups. Every Raft group has an independent leader, and thus a particular Core instance may be the leader for one database and a follower for another.

This does not apply to clusters where a Single instance is the Primary server. In such clusters, the Single instance is the leader of all databases and there is no Raft at all.

4. Server-side routing

Server-side routing is a complement to client-side routing, which is performed by a Neo4j Driver.

In a cluster deployment of Neo4j, Cypher queries may be directed to a cluster member that is unable to run the given query. With server-side routing enabled, such queries will be rerouted internally to a cluster member that is expected to be able to run it. This situation can occur for write-transaction queries, when they address a database for which the receiving cluster member is not the leader.

The cluster role for Core cluster members is per database. Thus, if a write-transaction query is sent to a cluster member which is not the leader for the specified database (specified either via the Bolt Protocol or with the Cypher USE clause), server-side routing is performed, provided it is properly configured.

Server-side routing is enabled for the DBMS by setting dbms.routing.enabled=true on each cluster member. The listen address (dbms.routing.listen_address) and advertised address (dbms.routing.advertised_address) for server-side routing communication also need to be configured.
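
A minimal neo4j.conf sketch for one cluster member (the hostname and port below are illustrative):

# Enable server-side routing and the communication channel it uses.
dbms.routing.enabled=true
dbms.routing.listen_address=0.0.0.0:7688
dbms.routing.advertised_address=core01.example.com:7688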

Client connections need to state that server-side routing should be used and this is only available for Neo4j Drivers that use the Bolt Protocol.

In order to determine which cluster member to query, Neo4j Drivers fetch routing information from the cluster, using dbms.cluster.routing.getRoutingTable(), when needed. When you use the neo4j:// URI scheme, the Neo4j Drivers do this using client-side routing.

In addition, when using the neo4j:// URI scheme, the Neo4j Drivers state that server-side routing should be used.

With the exception of the Python Driver, if you use the bolt:// URI scheme, the Neo4j Drivers do not state that server-side routing should be used.

The specification for setting the flag that states whether server-side routing should be used was introduced in Bolt Protocol 4.1 (server-side routing).

When using the neo4j:// URI scheme, a Neo4j Driver does the following:

  1. Query the cluster for routing information.

  2. Use the routing table (client-side routing) to send a query.

  3. State that server-side routing is allowed to be used.

Provided that server-side routing has been configured for the cluster, when the Neo4j Driver sends a write-transaction query to a non-leader cluster member, then that cluster member tries to route it correctly to the cluster member that is the leader.

The setting dbms.routing.default_router=SERVER configures a cluster member's routing table to behave like that of a standalone instance. The implication is that if a Neo4j Driver connects to a cluster member configured with dbms.routing.default_router=SERVER, the Neo4j Driver sends all requests to that cluster member. The default configuration is dbms.routing.default_router=CLIENT. See dbms.routing.default_router for more information.
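
For example, a member that should receive all requests from the drivers that connect to it directly can be configured with:

dbms.routing.default_router=SERVER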

The instructions for server-side routing assume that the cluster member is configured as:

dbms.routing.default_router=CLIENT

The table below shows the criteria that determine when server-side routing is performed:

Table 1. Server-side routing criteria

The first four columns describe the client (a Neo4j Driver using the Bolt Protocol); the last two describe the server (the receiving cluster member). All rows assume that server-side routing is enabled (dbms.routing.enabled=true) on the receiving member. A query is only rerouted when the driver has requested server-side routing, the transaction is a write, and the receiving member is not the leader for the addressed database (i.e. a Core follower or a Read Replica).

URI scheme              | Client-side routing | Requests server-side routing | Transaction type | Server - Instance > Role (per database) | Routes the query
------------------------+---------------------+------------------------------+------------------+-----------------------------------------+-----------------
neo4j://                | yes                 | yes                          | write            | Primary - Single                        | no
neo4j://                | yes                 | yes                          | read             | Primary - Single                        | no
neo4j://                | yes                 | yes                          | write            | Primary - Core > leader                 | no
neo4j://                | yes                 | yes                          | read             | Primary - Core > leader                 | no
neo4j://                | yes                 | yes                          | write            | Primary - Core > follower               | yes
neo4j://                | yes                 | yes                          | read             | Primary - Core > follower               | no
neo4j://                | yes                 | yes                          | write            | Secondary - Read Replica                | yes
neo4j://                | yes                 | yes                          | read             | Secondary - Read Replica                | no
bolt://                 | no                  | no                           | write            | Primary - Single                        | no
bolt://                 | no                  | no                           | read             | Primary - Single                        | no
bolt://                 | no                  | no                           | write            | Primary - Core > leader                 | no
bolt://                 | no                  | no                           | read             | Primary - Core > leader                 | no
bolt://                 | no                  | no                           | write            | Primary - Core > follower               | no
bolt://                 | no                  | no                           | read             | Primary - Core > follower               | no
bolt://                 | no                  | no                           | write            | Secondary - Read Replica                | no
bolt://                 | no                  | no                           | read             | Secondary - Read Replica                | no
bolt:// (Python Driver) | no                  | yes                          | write            | Primary - Single                        | no
bolt:// (Python Driver) | no                  | yes                          | read             | Primary - Single                        | no
bolt:// (Python Driver) | no                  | yes                          | write            | Primary - Core > leader                 | no
bolt:// (Python Driver) | no                  | yes                          | read             | Primary - Core > leader                 | no
bolt:// (Python Driver) | no                  | yes                          | write            | Primary - Core > follower               | yes
bolt:// (Python Driver) | no                  | yes                          | read             | Primary - Core > follower               | no
bolt:// (Python Driver) | no                  | yes                          | write            | Secondary - Read Replica                | yes
bolt:// (Python Driver) | no                  | yes                          | read             | Secondary - Read Replica                | no

Server-side routing connector configuration

Rerouted queries are communicated over the Bolt Protocol using a designated communication channel. The receiving end of the communication is configured using the dbms.routing.listen_address and dbms.routing.advertised_address settings.

Server-side routing driver configuration

Server-side routing uses the Neo4j Java driver to connect to other cluster members. This driver is configured with settings of the format dbms.routing.driver.<setting name>.
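
The exact setting names depend on the Neo4j version; the lines below are an illustrative sketch only, and both the setting names and the values should be checked against your version's configuration reference:

# Illustrative (assumed) server-side routing driver settings:
dbms.routing.driver.connection.connect_timeout=5s
dbms.routing.driver.logging.level=INFO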

Server-side routing encryption

Encryption of server-side routing communication is configured by the cluster SSL policy. For more information, see Cluster Encryption.

5. Store copy

Store copies are initiated when an instance does not have an up-to-date copy of the database. For example, this is the case when a new instance joins a cluster without a seed. It can also happen as a consequence of falling behind the rest of the cluster, for reasons such as connectivity issues or having been shut down. Upon re-establishing connection with the cluster, the instance recognizes that it is too far behind and fetches a new copy from the rest of the cluster.

A store copy is a major operation which may disrupt the availability of instances in the cluster. Store copies should not be a frequent occurrence in a well-functioning cluster, but rather an exceptional operation with a specific cause, e.g. a network outage or a planned maintenance outage. If store copies happen during regular operation, the configuration of the cluster, or the workload directed at it, might have to be reviewed so that all instances can keep up and so that there is a large enough buffer of Raft logs and transaction logs to handle smaller transient issues.
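
As an illustration, the size of that buffer is influenced by Raft log pruning and transaction log retention. The settings and values below are examples, not recommendations, and assume a Neo4j 4.x configuration:

# Keep roughly 1 GiB of Raft log entries available for lagging members to catch up from.
causal_clustering.raft_log_pruning_strategy=1g size
# Retain transaction logs for two days so that lagging members can still pull them.
dbms.tx_log.rotation.retention_policy=2 days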

The protocol used for store copies is robust and configurable. The network requests are directed at an upstream member according to configuration and they are retried despite transient failures. The maximum amount of time to retry every request can be configured with causal_clustering.store_copy_max_retry_time_per_request. If a request fails and the maximum retry time has elapsed then it stops retrying and the store copy fails.

Use causal_clustering.catch_up_client_inactivity_timeout to configure the inactivity timeout for any particular request.

The causal_clustering.catch_up_client_inactivity_timeout configuration is for all requests from the catchup client, including the pulling of transactions.
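
For example (the durations below are illustrative, not recommended values):

# Stop retrying a single store copy request after this much time has elapsed.
causal_clustering.store_copy_max_retry_time_per_request=20m
# Fail any catchup request, including transaction pulling, after this period of inactivity.
causal_clustering.catch_up_client_inactivity_timeout=10m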

The default upstream strategy does not apply to Single instances, and it differs between Core and Read Replica instances. Core instances always send the initial request to the leader in order to get the most up-to-date information about the store. For the subsequent file and index requests, Core instances alternate between a random Read Replica instance and a random Core instance.

Read Replica instances use the same strategy for store copies as they use for pulling transactions. The default is to pull from a random Core instance.

If you are running a multi-data center cluster, then upstream strategies for both Core and Read Replica instances can be configured. Remember that for Read Replica instances, this also affects from where transactions are pulled. See more in Configure for multi-data center operations.
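
As a sketch only, and assuming your version provides the multi-data center settings causal_clustering.multi_dc_license, causal_clustering.server_groups, and causal_clustering.upstream_selection_strategy, a Read Replica could prefer upstream members in its own data center (the group name "dc-east" is hypothetical):

causal_clustering.multi_dc_license=true
causal_clustering.server_groups=dc-east
causal_clustering.upstream_selection_strategy=connect-randomly-within-server-group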

Note also the following restrictions:

  • Do not transform a Read Replica instance into a Core instance.

  • Do not transform a Core instance into a Read Replica instance.

6. On-disk state

The on-disk state of cluster instances is different from that of standalone instances. The biggest difference is the existence of additional cluster state. Most of the files in the cluster state are relatively small, but the Raft logs can become quite large, depending on the configuration and workload.

It is important to understand that once a database has been extracted from a cluster and used in a standalone deployment, it must not be put back into an operational cluster. This is because the cluster and the standalone deployment now have separate databases, with different and irreconcilable writes applied to them.

If you try to reinsert a modified database back into the cluster, then the logs and stores will mismatch. Operators should not try to merge standalone databases into the cluster in the optimistic hope that their data will become replicated. That does not happen and instead it likely leads to unpredictable cluster behavior.