4.3. Create a new cluster

This section describes how to deploy a new Neo4j Causal Cluster.

In this section we will learn how to set up a brand new Causal Cluster consisting of three Core instances, the minimum number of servers needed for a fault-tolerant Core cluster. We then learn how to add more Core servers as well as Read Replicas. From this basic pattern, we can extend what we have learned here to create a cluster of any size.

The following subjects are described:

  • Configure a Core-only cluster
  • Add a Core server to an existing cluster
  • Add a Read Replica to an existing cluster
  • Remove a Core from a cluster
  • Bias cluster leadership with follower-only instances
  • Initial discovery of cluster members
  • Initial discovery of cluster members with DNS
  • Store copy

For a description of the clustering architecture and cluster concepts encountered here, please refer to Introduction to Causal Clustering. We will not describe how to import data from an existing Neo4j instance; for help on using an existing Neo4j database to seed a new Causal Cluster, please see Seed a Causal Cluster.

If you want to try to set up a Causal Cluster on your local machine, refer to Section B.1, “Set up a local Causal Cluster”.

4.3.1. Configure a Core-only cluster

The instructions in this section assume that you have already downloaded the appropriate version of Neo4j Enterprise Edition from the Neo4j download site and have installed it on the servers that will make up the cluster.

When deploying a new Causal Cluster, the following configuration settings are important to consider for basic cluster operation.

The settings below are located in neo4j.conf under the header "Network connector configuration".

dbms.connectors.default_listen_address
The address or network interface this machine uses to listen for incoming messages. Setting this value to 0.0.0.0 allows Neo4j to bind to all available network interfaces. Uncomment the line: #dbms.connectors.default_listen_address=0.0.0.0.
dbms.connectors.default_advertised_address
The address that other machines are told to connect to. In the typical case, this should be set to the public IP address of this server. For example, if the IP address is 33.44.55.66, this setting should be: dbms.connectors.default_advertised_address=33.44.55.66

The settings below are located in neo4j.conf under the header "Causal Clustering Configuration".

dbms.mode
The operating mode of a single database instance. For Causal Clustering, there are two possible modes: CORE or READ_REPLICA. On each of our three instances, we will use the CORE mode. Uncomment the line: #dbms.mode=CORE.
causal_clustering.expected_core_cluster_size
The initial cluster size at startup. It is necessary for achieving an early stable membership state and subsequently for safe writes to the cluster. This value is the number of Core instances you intend to have as part of your cluster. For example, causal_clustering.expected_core_cluster_size=3 will specify that the cluster has three Core members.
causal_clustering.initial_discovery_members
The network addresses of an initial set of Core cluster members that are available to bootstrap this Core or Read Replica instance. How this value is interpreted is determined by causal_clustering.discovery_type. When discovery type is set to LIST, which is its default value, the initial discovery members are given as a comma-separated list of address/port pairs, for example core1:5000,core2:5000,core3:5000. The default port for the discovery service is :5000. You should include the address of the local machine in this setting.
causal_clustering.discovery_type
The mechanism to use, together with the value of causal_clustering.initial_discovery_members, to determine the addresses of other members of the cluster on startup. In the simplest case, causal_clustering.discovery_type is set to LIST and each of the specified network addresses resolves to a working Neo4j Core server instance. This value (LIST) requires you to know either the hostnames or the IP addresses of the Core instances in the cluster. If you do not know the addresses of the other Core instances, alternative discovery mechanisms are described in Initial discovery of cluster members with DNS.

Apply these settings to the configuration file on each instance. The values can be the same for each.
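
As an illustration, the relevant part of neo4j.conf on each Core instance might look like the sketch below. The hostnames core1.example.com, core2.example.com and core3.example.com are placeholders for your own addresses, and the advertised address should be adjusted per server.

# Hypothetical Causal Clustering settings for a three-Core cluster.
dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=core1.example.com
dbms.mode=CORE
causal_clustering.expected_core_cluster_size=3
causal_clustering.initial_discovery_members=core1.example.com:5000,core2.example.com:5000,core3.example.com:5000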

Start the Neo4j servers as usual. Note that the startup order does not matter.

Example 4.1. Start the Core-only cluster
server-1$ ./bin/neo4j start
server-2$ ./bin/neo4j start
server-3$ ./bin/neo4j start
Startup Time

If you want to follow along with the startup of a server you can follow the messages in neo4j.log. While an instance is joining the cluster, the server may appear unavailable.

Now you can access the three servers and check their status. Open Neo4j Browser on each server (by default at http://<server-address>:7474) and issue the following query: CALL dbms.cluster.overview(). This will show you the status of the cluster and information about each member of the cluster.
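
The same procedure can also be run from the command line with cypher-shell. A sketch, assuming default ports and hypothetical credentials:

server-1$ ./bin/cypher-shell -a bolt://server-1:7687 -u neo4j -p <password> "CALL dbms.cluster.overview();"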

You now have a Neo4j Causal Cluster of three instances running.

4.3.2. Add a Core server to an existing cluster

Adding instances to the Core cluster is simply a matter of starting a new database server with the appropriate configuration as described in Section 4.3.1, “Configure a Core-only cluster”. Following those instructions, we need to change neo4j.conf to reflect the new Core Server’s desired configuration like so:

  • Set dbms.mode=CORE.
  • Set causal_clustering.initial_discovery_members to be identical to the corresponding parameter on the existing Core servers.

Once we’ve done that, start the server and it will integrate itself with the existing cluster. If the new instance is joining a cluster that holds a lot of data, it may take a number of minutes for it to copy the graph data from its peers; it becomes available once the copy is complete.
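
For example, a fourth Core server (hypothetical hostname core4.example.com) could be configured along these lines, pointing at the same discovery members as the existing cluster:

dbms.mode=CORE
dbms.connectors.default_advertised_address=core4.example.com
causal_clustering.initial_discovery_members=core1.example.com:5000,core2.example.com:5000,core3.example.com:5000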

4.3.3. Add a Read Replica to an existing cluster

Initial Read Replica configuration is provided similarly to Core Servers via neo4j.conf. Since Read Replicas do not participate in cluster quorum decisions, their configuration is shorter. They simply need to know the addresses of some of the Core Servers which they can bind to in order to run the discovery protocol (see Section 4.2.1, “Discovery protocol” for details). Once it has completed the initial discovery, the Read Replica becomes aware of the available Core Servers and can choose an appropriate one from which to catch up (see Section 4.2.5, “Catchup protocol” for how that happens).

In the neo4j.conf file in the section "Causal Clustering Configuration", the following settings need to be changed:

  • Set operating mode to Read Replica: dbms.mode=READ_REPLICA.
  • Set the parameter causal_clustering.initial_discovery_members to be identical to the corresponding parameter on the Core servers.
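
A sketch of the corresponding neo4j.conf lines, reusing the hypothetical Core addresses from earlier:

dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=core1.example.com:5000,core2.example.com:5000,core3.example.com:5000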

4.3.4. Remove a Core from a cluster

A Core Server can be downgraded to a standalone instance, using the neo4j-admin unbind command.

Once a server has been unbound from a cluster, the store files are equivalent to a Neo4j standalone instance. From this point those files could be used to run a standalone instance by restarting it in SINGLE mode.
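
A minimal sketch of that procedure, assuming the default installation layout (the instance must be stopped before it can be unbound):

server-3$ ./bin/neo4j stop
server-3$ ./bin/neo4j-admin unbind
server-3$ ./bin/neo4j start

Before the final start, set dbms.mode=SINGLE in neo4j.conf so that the instance comes back up as a standalone database.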

The on-disk state of Core server instances is different to that of standalone server instances. It is important to understand that once an instance unbinds from a cluster, it cannot be re-integrated with that cluster. This is because both the cluster and the single instance are now separate databases with different and irreconcilable writes having been applied to them. Technically the cluster will have written entries to its Raft log, whilst the standalone instance will have written directly to the transaction log (since there is no Raft log in a standalone instance).

If we try to reinsert the standalone instance into the cluster, then the logs and stores will mismatch. Operators should not try to merge standalone databases into the cluster in the optimistic hope that their data will become replicated. That will not happen and will likely lead to unpredictable cluster behavior.

4.3.5. Bias cluster leadership with follower-only instances

The Core servers in a Causal Cluster use the Raft protocol to ensure consistency and safety. An implementation detail of Raft is that it uses a Leader role to impose an ordering on an underlying log with other instances acting as Followers that replicate the leader’s state. In Neo4j terms this means writes to the database are ordered by the Core instance currently playing the leader role.

In some situations, operators might want to actively prevent some instances from taking on the leader role. For example in a multi-data center scenario, we might want to ensure that the leader remains in a favored geographic location for performance or governance reasons. In Neo4j Causal Clustering we can configure instances to refuse to become leader, which is equivalent to always remaining a follower. This is done by configuring the setting causal_clustering.refuse_to_be_leader. It is not generally advisable to use this option, since the first priority in the cluster is to maximize availability. The highest availability stems from having any healthy instance take leadership of the cluster on pathological failure.
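
A sketch of the relevant neo4j.conf lines for a Core instance that should never take on the leader role:

dbms.mode=CORE
causal_clustering.refuse_to_be_leader=true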

Despite superficial similarities, a non-leader capable Core instance is not the same as a Read Replica. Read Replicas do not participate in transaction processing, nor are they permitted to be involved in cluster topology management.

Conversely, a follower-only Core instance will still process transactions and cluster membership requests as per Raft to ensure consistency and safety.

4.3.6. Initial discovery of cluster members

When a Neo4j Core server starts up, it needs to contact other Core members in order to form a new cluster or to join an existing cluster. To contact the other members, the Core server needs to know their addresses. An address consists of the hostname or IP address, and the port on which the discovery service is listening.

When a cluster member discovers another, they share what they know of the rest of the cluster. It is therefore not necessary for each instance to have the address of every other instance in the cluster. However, if the initial discovery members form two or more disjoint sets, then a cluster will not be able to form.

When the addresses of the other cluster members are known they can be listed explicitly. In this case we can use the default causal_clustering.discovery_type=LIST and hard code the addresses in the configuration of each machine, for example causal_clustering.initial_discovery_members=10.0.0.1:5000,10.0.0.2:5000,10.0.0.3:5000.

Explicitly listing the addresses of Core members for discovery is convenient, but has limitations.

  • First, if Core members are replaced and the new members have different addresses, the list will become outdated. An outdated list can be avoided by ensuring that the new members can be reached via the same address as the old members, but this is not always practical.
  • Second, under some circumstances the addresses are not known when configuring the cluster. This can be the case for example when using container orchestration to deploy a Causal Cluster.

For cases where it is not practical or possible to explicitly list the addresses of cluster members to discover, additional discovery mechanisms are provided using DNS.

4.3.7. Initial discovery of cluster members with DNS

Beyond the default LIST discovery type, there are two further mechanisms that can be used to get the addresses of Core cluster members for discovery. Both methods use DNS.

  • With causal_clustering.discovery_type=DNS, the initial discovery members will be resolved from DNS A records to find the IP addresses to contact.
  • With causal_clustering.discovery_type=SRV, the initial discovery members will be resolved from DNS SRV records to find the IP addresses/hostnames and discovery service ports to contact.

When using discovery_type=DNS the causal_clustering.initial_discovery_members should be set to a single domain name and the port of the discovery service, for example causal_clustering.initial_discovery_members=cluster1.neo4j.example.com:5000. The domain name should return an A record for every Core member when a DNS lookup is performed. Each A record returned by DNS should contain the IP address of the Core member. The configured Core server will use all the IP addresses from the A records to join or form a cluster.
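
For illustration, the configuration above could be backed by hypothetical A records such as these (one per Core member), together with the matching settings:

cluster1.neo4j.example.com. 300 IN A 10.0.0.1
cluster1.neo4j.example.com. 300 IN A 10.0.0.2
cluster1.neo4j.example.com. 300 IN A 10.0.0.3

causal_clustering.discovery_type=DNS
causal_clustering.initial_discovery_members=cluster1.neo4j.example.com:5000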

If discovery_type=DNS is used, the discovery port on all Cores must be the same. If this is not possible, consider using the discovery_type=SRV configuration.

When using discovery_type=SRV, the causal_clustering.initial_discovery_members should be set to a single domain name and the port set to 0, for example causal_clustering.initial_discovery_members=cluster1.neo4j.example.com:0. The domain name should return a single SRV record when a DNS lookup is performed. The SRV record returned by DNS should contain the IP address or hostname, and the discovery port, for the Core servers to be discovered. The configured Core server will use all the addresses from the SRV record to join or form a cluster.
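
As an illustration, a hypothetical SRV record set for the name (the fields after the record type are priority, weight, port and target) and the matching configuration might look like this:

cluster1.neo4j.example.com. 300 IN SRV 0 5 5000 core1.neo4j.example.com.
cluster1.neo4j.example.com. 300 IN SRV 0 5 5000 core2.neo4j.example.com.
cluster1.neo4j.example.com. 300 IN SRV 0 5 5000 core3.neo4j.example.com.

causal_clustering.discovery_type=SRV
causal_clustering.initial_discovery_members=cluster1.neo4j.example.com:0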

For both DNS and SRV configurations, the DNS record lookup is performed when an instance starts up. Once an instance has joined a cluster, further membership changes are communicated amongst Core members as part of the discovery service.

4.3.8. Store copy

Causal Clustering uses a robust and configurable store copy protocol. When a store copy is started, it first sends a prepare request to the specified instance. If the prepare request is successful, the client sends file and index requests, one request per file or index, to the provided upstream members in the cluster. The retry logic per request can be modified through causal_clustering.store_copy_max_retry_time_per_request. If a request fails and the maximum retry time has been reached, the client stops retrying and the store copy fails.

Use causal_clustering.catch_up_client_inactivity_timeout to configure the inactivity timeout for any catchup request. Bear in mind that this setting is for all requests from the catchup client, including pulling of transactions.
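
A sketch of the two settings in neo4j.conf; the durations shown are purely illustrative, not recommended values:

causal_clustering.store_copy_max_retry_time_per_request=20m
causal_clustering.catch_up_client_inactivity_timeout=10m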

There are three scenarios that will start a store copy. The upstream selection strategy is different for each scenario.

Backup
Upstream strategy is set to a fixed member by the neo4j-admin backup command. All requests will go to the specified member.
Seeding a new member with an empty store
Will use the configured upstream strategy for that instance.
Instance falling too far behind
Will use the configured upstream strategy for that instance.

The upstream strategy differs for Cores and Read Replicas. Cores always send the prepare request to the leader, to get the most up-to-date information about the store. For the file and index requests, Cores alternate between a random Read Replica and a random Core member. The Read Replicas' strategy is the same as for pulling transactions: the default is to pull from a random Core member.

If you are running a multi-data-center cluster, the upstream strategy of both Cores and Read Replicas can be configured. Remember that for Read Replicas this also affects from whom transactions are pulled. See Configure for multi-data center operations for more details.