Disaster recovery

Databases can become unavailable for different reasons. For the purposes of this section, an unavailable database is defined as a database that is incapable of serving writes, though it may still be able to serve reads. Databases that perform poorly for other reasons are not considered unavailable and are not covered by this section. This section contains a step-by-step guide on how to recover databases that have become unavailable. By performing the actions described here, the unavailable databases are recovered and made fully operational, with as little impact as possible on the other databases in the cluster.

There are many reasons why a database may become unavailable, and the cause can lie at different levels of the system. For example, a data-center failover may lead to the loss of multiple servers, which in turn may cause a set of databases to become unavailable. It is also possible for databases to become quarantined due to a critical failure in the system, which may lead to unavailability even without the loss of any servers.

If all servers in a Neo4j cluster are lost in a data-center failover, it is not possible to recover the current cluster. A new cluster has to be created and the databases restored. See Deploy a basic cluster and Seed a database for more information.

Faults in clusters

Databases in clusters follow an allocation strategy. This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries. As a consequence, servers differ in which databases they host. Losing a server in a cluster may cause some databases to lose a member while others are unaffected. Consequently, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.

Guide to disaster recovery

The following guide defines three stages of disaster recovery. Always follow the guide in the order described, and only move to the next stage once the current one has been validated to work properly.

In this section, an offline server is a server that is not running but may be restartable. A lost server, however, is a server that is currently not running and cannot be restarted.

Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the neo4j scheme for routing. One way to remedy this is to connect directly to the server using bolt instead of neo4j. See Server-side routing for more information on the bolt scheme.
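As an illustration, with cypher-shell a direct connection can be made by changing the scheme in the connection address (the hostname and port below are placeholders, not values from this guide):

```shell
# Routed connection; may fail if the disaster has affected the routing tables:
cypher-shell -a neo4j://server01.example.com:7687 -u neo4j

# Direct connection to a single server, bypassing routing:
cypher-shell -a bolt://server01.example.com:7687 -u neo4j
```

The same scheme change applies to driver connection URIs.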

Restore the system database

The first step of recovery is to ensure that the system database is available. The system database is required for clusters to function properly.

  1. Start all servers that are offline. (If a server is unable to start, inspect the logs and contact support personnel. The server may have to be considered indefinitely lost.)

  2. Validate the system database’s availability.

    1. Run SHOW DATABASE system. If the response doesn’t contain a writer, the system database is unavailable and needs to be recovered; continue to step 3.

    2. Run CREATE USER temporaryUser SET PASSWORD 'temporaryPassword' to create a temporary user.

      1. Confirm that the query executed successfully and that the temporary user was created as expected by running SHOW USERS, then continue to Recover servers. If not, continue to step 3.

  3. Restore the system database.

    Only do the steps below if the system database’s availability could not be validated by the first two steps in this section.

    The following steps create a new system database from a backup of the current system database. This is required because the current system database has lost too many members in the disaster.

    1. Shut down the Neo4j process on all servers. Note that this causes downtime for all databases in the cluster.

    2. On each server, run neo4j-admin dbms unbind-system-db to reset the system database state on the server. See neo4j-admin commands for more information.

    3. On each server, run neo4j-admin database info system to find out which server is most up-to-date, i.e. has the highest last-committed transaction id.

    4. On the most up-to-date server, take a dump of the current system database by running neo4j-admin database dump system --to-path=[path-to-dump] and store the dump in an accessible location. See neo4j-admin commands for more information.

    5. Ensure there are enough system database primaries to create the new system database with fault tolerance. Either:

      1. Add completely new servers (see Add a server to the cluster) or

      2. Change the system database mode (server.cluster.system_database_mode) on the current system database’s secondary servers to allow them to be primaries for the new system database.

    6. On each server, run neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true to load the current system database dump.

    7. Ensure that dbms.cluster.discovery.endpoints are set correctly on all servers, see Cluster server discovery for more information.

    8. Return to step 1.
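The neo4j-admin steps above can be sketched as a sequence of commands (the dump path is an example chosen for illustration; use any location accessible to all servers):

```shell
# On every server, after shutting down the Neo4j process:
neo4j-admin dbms unbind-system-db          # reset the system database cluster state
neo4j-admin database info system           # note the last-committed transaction id

# On the most up-to-date server only (highest last-committed transaction id):
neo4j-admin database dump system --to-path=/backups/system-dump

# On every server, after copying the dump to /backups/system-dump:
neo4j-admin database load system --from-path=/backups/system-dump --overwrite-destination=true
```

The unbind and info commands run on all servers, while the dump is taken from a single server so that every server loads the same, most recent copy of the system database.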

Recover servers

Once the system database is available, the cluster can be managed. Following the loss of one or more servers, the cluster’s view of its servers must be updated, i.e. the lost servers must be replaced by new ones. The steps here identify the lost servers and safely detach them from the cluster.

  1. Run SHOW SERVERS. If all servers show health AVAILABLE and status ENABLED continue to Recover databases.

  2. On each UNAVAILABLE server, run CALL dbms.cluster.cordonServer("unavailable-server-id").

  3. On each CORDONED server, run DEALLOCATE DATABASES FROM SERVER cordoned-server-id.

  4. For each server that failed to deallocate, take the action matching the reported message:

    1. Could not deallocate server [server]. Can’t move databases in single mode [database]

      or

      Could not deallocate server [server]. Database [database] has lost quorum of servers, only found [existing number of primaries] of [expected number of primaries]. Cannot be safely deallocated. Please drop the database before retrying.

      First ensure that there is a backup for the database in question (see Online backup), and then drop the database by running DROP DATABASE database-name. Return to step 3.

    2. Could not deallocate server [server]. Cannot change allocations for database [stopped-db] because it is offline.

      Try to start the offline database by running START DATABASE stopped-db WAIT. If it starts successfully, return to step 3. Otherwise, ensure that there is a backup for the database before dropping it with DROP DATABASE stopped-db. Return to step 3.

      To avoid writes to a database that is intended to remain stopped, it can be set to read-only access before it is started: ALTER DATABASE database-name SET ACCESS READ ONLY.

    3. Could not deallocate server [server]. Reallocation of [database] not possible, no new target found. All existing servers: [existing-servers]. Actual allocated server with mode [mode] is [current-hostings].

      Add new servers and enable them and then return to step 3, see Add a server to the cluster for more information.

  5. Run SHOW SERVERS YIELD *. Once all enabled servers host the requested databases (the hosting field contains exactly the databases in the requestedHosting field), proceed to the next step. Note that this may take a few minutes.

  6. For each deallocated server, run DROP SERVER deallocated-server-id.

  7. Return to step 1.
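Taken together, one pass of the server-recovery loop above can be sketched in Cypher (the server id is a made-up placeholder; use the id reported by SHOW SERVERS):

```cypher
SHOW SERVERS;

// For a server reported with health UNAVAILABLE, e.g. id '25a29f17-...':
CALL dbms.cluster.cordonServer('25a29f17-0000-0000-0000-000000000000');

// Move the cordoned server's database allocations to other servers:
DEALLOCATE DATABASES FROM SERVER '25a29f17-0000-0000-0000-000000000000';

// Once SHOW SERVERS YIELD * shows hosting matching requestedHosting,
// remove the deallocated server from the cluster's view:
DROP SERVER '25a29f17-0000-0000-0000-000000000000';
```

Cordoning prevents new allocations from being placed on the failed server while its existing databases are moved away, so the order of the commands matters.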

Recover databases

Once the system database is verified to be available and all servers are online, the databases can be managed. The steps here aim to make the unavailable databases available again.

  1. If you have previously dropped databases as part of this guide, re-create each one from backup. See the Administrative commands section for more information on how to create a database.

  2. Run SHOW DATABASES. If all databases are in desired states on all servers (requestedStatus=currentStatus), disaster recovery is complete.
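If a database was dropped during this guide, re-creating it seeded from a backup can be sketched as follows, assuming a Neo4j version that supports the seedURI option; the database name and backup URI are illustrative placeholders:

```cypher
// Create the database from a backup artifact reachable by all servers:
CREATE DATABASE neo4j
  OPTIONS {
    existingData: 'use',
    seedURI: 'file:///backups/neo4j.backup'
  };

// Verify that requestedStatus matches currentStatus on all servers:
SHOW DATABASES;
```

See Seed a database for the supported URI schemes and alternatives such as restoring with neo4j-admin before creating the database.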