A method to replicate a Causal Cluster to new hardware with minimum downtime

If you need to replicate an existing Causal Cluster to a new hardware setup, the following steps allow you to do so with minimal downtime.

Let us start with an existing 3 instance cluster with the following characteristics:

call dbms.cluster.overview
+---------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                    | role       | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------+
| "ffc16977-4ab8-41b5-a4e2-e0e32e8abd6f" | ["bolt://10.1.1.1:7687", "http://10.1.1.1:7474", "https://10.1.1.1:7473"] | "LEADER"   | []     |
| "f0a78cd1-7ba3-45f6-aba3-0abb60d785ef" | ["bolt://10.1.1.2:7687", "http://10.1.1.2:7474", "https://10.1.1.2:7473"] | "FOLLOWER" | []     |
| "2fe26571-6fcc-4d1e-9f42-b81d08579057" | ["bolt://10.1.1.3:7687", "http://10.1.1.3:7474", "https://10.1.1.3:7473"] | "FOLLOWER" | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------+

and each instance has the following settings defined in its conf/neo4j.conf:

causal_clustering.expected_core_cluster_size=3
causal_clustering.initial_discovery_members=10.1.1.1:5001,10.1.1.2:5001,10.1.1.3:5001

All other ports use their default values.

To add 3 new instances, for example at IP addresses 10.2.2.1, 10.2.2.2 and 10.2.2.3, perform the following steps:

  1. Install Neo4j on the 3 new instances at IP addresses 10.2.2.1, 10.2.2.2 and 10.2.2.3.
  2. In the conf/neo4j.conf of each of these 3 new instances, set causal_clustering.initial_discovery_members to point to the existing cluster:
    causal_clustering.initial_discovery_members=10.1.1.1:5001,10.1.1.2:5001,10.1.1.3:5001
  3. Start up each instance at 10.2.2.1, 10.2.2.2, and 10.2.2.3. These 3 new instances will then join the initial cluster at 10.1.1.1, 10.1.1.2 and 10.1.1.3 and copy down databases/graph.db. Running CALL dbms.cluster.overview(); will return output similar to
    +---------------------------------------------------------------------------------------------------------------------------------------------+
    | id                                     | addresses                                                                    | role       | groups |
    +---------------------------------------------------------------------------------------------------------------------------------------------+
    | "ffc16977-4ab8-41b5-a4e2-e0e32e8abd6f" | ["bolt://10.1.1.1:7687", "http://10.1.1.1:7474", "https://10.1.1.1:7473"] | "LEADER"   | []     |
    | "f0a78cd1-7ba3-45f6-aba3-0abb60d785ef" | ["bolt://10.1.1.2:7687", "http://10.1.1.2:7474", "https://10.1.1.2:7473"] | "FOLLOWER" | []     |
    | "2fe26571-6fcc-4d1e-9f42-b81d08579057" | ["bolt://10.1.1.3:7687", "http://10.1.1.3:7474", "https://10.1.1.3:7473"] | "FOLLOWER" | []     |
    | "847b74c2-34a9-4458-b0e2-ea36cf25fdbf" | ["bolt://10.2.2.1:7687", "http://10.2.2.1:7474", "https://10.2.2.1:7473"] | "FOLLOWER" | []     |
    | "39f92686-f581-4454-b288-a2254d38ea5c" | ["bolt://10.2.2.2:7687", "http://10.2.2.2:7474", "https://10.2.2.2:7473"] | "FOLLOWER" | []     |
    | "e4114ad2-dcd1-4d22-8f56-a085524c9ed0" | ["bolt://10.2.2.3:7687", "http://10.2.2.3:7474", "https://10.2.2.3:7473"] | "FOLLOWER" | []     |
    +---------------------------------------------------------------------------------------------------------------------------------------------+
  4. Once the 3 new instances have finished copying graph.db from the existing cluster members, cleanly stop the 3 initial instances at 10.1.1.1, 10.1.1.2, and 10.1.1.3 via bin/neo4j stop. The 3 new instances will continue to run, and CALL dbms.cluster.overview(); will return output similar to
    +---------------------------------------------------------------------------------------------------------------------------------------------+
    | id                                     | addresses                                                                    | role       | groups |
    +---------------------------------------------------------------------------------------------------------------------------------------------+
    | "847b74c2-34a9-4458-b0e2-ea36cf25fdbf" | ["bolt://10.2.2.1:7687", "http://10.2.2.1:7474", "https://10.2.2.1:7473"] | "LEADER"   | []     |
    | "39f92686-f581-4454-b288-a2254d38ea5c" | ["bolt://10.2.2.2:7687", "http://10.2.2.2:7474", "https://10.2.2.2:7473"] | "FOLLOWER" | []     |
    | "e4114ad2-dcd1-4d22-8f56-a085524c9ed0" | ["bolt://10.2.2.3:7687", "http://10.2.2.3:7474", "https://10.2.2.3:7473"] | "FOLLOWER" | []     |
    +---------------------------------------------------------------------------------------------------------------------------------------------+
  5. If a load balancer was in front of the 3 instance cluster at 10.1.1.1, 10.1.1.2, and 10.1.1.3, update it to point to 10.2.2.1, 10.2.2.2, and 10.2.2.3.
  6. Since the initial 3 instances have been shut down, and so that the 3 new instances can successfully restart at some later time, update causal_clustering.initial_discovery_members in the conf/neo4j.conf of each of the 3 new instances and change
    causal_clustering.initial_discovery_members=10.1.1.1:5001,10.1.1.2:5001,10.1.1.3:5001

    to

    causal_clustering.initial_discovery_members=10.2.2.1:5001,10.2.2.2:5001,10.2.2.3:5001
  7. If you are currently using the Bolt driver to connect to the cluster, update the connection string to reference one of the new instances, for example changing bolt+routing://10.1.1.1:7687 to bolt+routing://10.2.2.1:7687
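
The membership checks in steps 3 and 4 and the configuration change in step 6 can be scripted. The following is a minimal Python sketch and not part of Neo4j itself: the helper names missing_hosts and rewrite_discovery_members are hypothetical, the rows are assumed to be dictionaries with an "addresses" list shaped like the dbms.cluster.overview output above, and the discovery port 5001 is taken from the example configuration. Note that a host appearing in the overview only shows that it has joined the cluster; it does not by itself prove that the store copy has finished.

```python
import re
from urllib.parse import urlparse

DISCOVERY_KEY = "causal_clustering.initial_discovery_members"


def missing_hosts(rows, expected_hosts):
    """Return the expected hosts that do not appear in the overview rows.

    rows is assumed to be a list of dicts, each with an "addresses" list
    such as ["bolt://10.2.2.1:7687", "http://10.2.2.1:7474"].
    """
    present = set()
    for row in rows:
        for address in row["addresses"]:
            present.add(urlparse(address).hostname)
    return sorted(set(expected_hosts) - present)


def rewrite_discovery_members(conf_text, hosts, port=5001):
    """Return conf_text with the discovery members setting pointing at hosts.

    An existing uncommented setting line is replaced; if none is found,
    the setting is appended at the end of the text.
    """
    members = ",".join(f"{host}:{port}" for host in hosts)
    new_line = f"{DISCOVERY_KEY}={members}"
    pattern = re.compile(rf"^{re.escape(DISCOVERY_KEY)}\s*=.*$", re.MULTILINE)
    if pattern.search(conf_text):
        # A function replacement avoids any backslash interpretation.
        return pattern.sub(lambda _match: new_line, conf_text)
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"


if __name__ == "__main__":
    new_hosts = ["10.2.2.1", "10.2.2.2", "10.2.2.3"]
    rows = [
        {"addresses": ["bolt://10.1.1.1:7687", "http://10.1.1.1:7474"]},
        {"addresses": ["bolt://10.2.2.1:7687", "http://10.2.2.1:7474"]},
    ]
    # Hosts that have not yet joined according to the overview rows.
    print(missing_hosts(rows, new_hosts))
    old_conf = DISCOVERY_KEY + "=10.1.1.1:5001,10.1.1.2:5001,10.1.1.3:5001\n"
    print(rewrite_discovery_members(old_conf, new_hosts), end="")
```

Before stopping the initial instances in step 4, missing_hosts should return an empty list for the 3 new hosts. The rewrite helper only transforms text, so reading and writing conf/neo4j.conf, and keeping a backup copy of it, is left to the caller.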