Neo4j’s clustering provides these main features:
Safety: Servers hosting databases in primary mode provide a fault tolerant platform for transaction processing which remains available while a simple majority of those Primary Servers are functioning.
Scale: Servers hosting databases in secondary mode provide a massively scalable platform for graph queries that enables very large graph workloads to be executed in a widely distributed topology.
Causal consistency: When invoked, a client application is guaranteed to read at least its own writes.
Operability: Database management is separated from server management.
Together, this allows the end-user system to be fully functional and both read and write to the database in the event of multiple hardware and network failures and makes reasoning about database interactions straightforward. Additionally, the administration of a cluster is uncomplicated, including scaling the size of the cluster and distributing and balancing the available resources.
The remainder of this section contains an overview of how clustering works in production, including both operational and application aspects.
From an operational point of view, it is useful to view the cluster as a homogenous pool of servers which run a number of databases. The servers have two different database-hosting capabilities, referred to as Primary and Secondary modes. A server can simultaneously act as a primary host for one or more databases and as a secondary host for other databases. Similarly, it is possible for a database to be hosted on only one server, even when that server is part of a cluster. In such cases, the server is always hosting that database in primary mode.
The two modes are foundational in any production deployment but are managed at different scales from one another and undertake different roles in managing the fault tolerance and scalability of the overall cluster.
A server hosting a database in primary mode allows read and write operations. A database can be hosted by one or more primary hosts.
To achieve high availability, a database should be created with multiple primaries. If high availability is not required, then a database may be created with a single primary for minimum write latency. The remainder of this section assumes a database has multiple primaries.
Database primaries achieve high availability by replicating all transactions using the Raft protocol. Raft ensures that the data is safely durable by waiting for a majority of primaries in a database (N/2+1) to acknowledge a transaction, before acknowledging its commit to the end user application. In practice, only one of the multiple primaries execute write transactions from clients. This writer is elected automatically from amongst a database’s primaries and may change over time. The writer primary synchronously replicates writes to the other primaries. The database secondaries replicates the writes asynchronously from more up-to-date members of the cluster.
This synchronous replication has an impact on write transaction latency. Implicitly, write transactions are acknowledged by the fastest majority, but as the number of primaries of the database grows, so does the size of the majority needed to acknowledge a write.
The fault tolerance for a database is calculated with the formula M = 2F + 1, where M is the number of primaries required to tolerate F faults. For example:
In order to tolerate two failed primaries, you need a topology of five servers hosting your database in primary mode.
The smallest fault-tolerant cluster, a cluster that can tolerate one fault, must have three database primaries.
It is also possible to create a cluster consisting of only two primaries. However, that cluster is not fault-tolerant. If one of the two servers fails, the remaining server becomes read-only.
A database with a single primary server cannot tolerate any faults either. Therefore it is recommended to have three or more primaries to achieve high availability.
With database primaries, should the database suffer enough primary failures, it can no longer process writes and becomes read-only to preserve safety.
Database secondaries are asynchronously replicated from primaries via transaction log shipping. They periodically poll an upstream server for new transactions and have these shipped over. Many secondaries can be fed data from a relatively small number of primaries, allowing for a large fan out of the query workload for scale.
Databases can typically have relatively large numbers of secondaries. Losing a secondary does not impact the database’s availability, aside from the loss of its fraction of graph query throughput. It does not affect the fault tolerance of the database.
The main responsibility of database secondaries is to scale out read workloads. Secondaries act like caches for the graph data and are fully capable of executing arbitrary (read-only) queries and procedures.
Due to its asynchronous nature, secondaries may not provide all transactions committed on the primary server(s).
While the operational mechanics of the cluster are interesting from an application point of view, it is also helpful to think about how applications use the database to get their work done. In many applications, it is typically desirable to both read from the graph and write to the graph. Depending on the nature of the workload, it is common to want reads from the graph to take into account previous writes to ensure causal consistency.
Causal consistency is one of numerous consistency models used in distributed computing. It ensures that causally related operations are seen by every instance in the system in the same order. Consequently, client applications are guaranteed to read their own writes, regardless of which instance they communicate with. This simplifies interaction with large clusters, allowing clients to treat them as a single (logical) server.
Causal consistency makes it possible to write to databases hosted on servers in primary mode and read those writes from databases hosted on servers in secondary mode (where graph operations are scaled out). For example, causal consistency guarantees that the write which created a user account is present when that same user subsequently attempts to log in.
On executing a transaction, the client can ask for a bookmark which it then presents as a parameter to subsequent transactions. Using that bookmark, the cluster can ensure that only servers which have processed the client’s bookmarked transaction will run its next transaction. This provides a causal chain which ensures correct read-after-write semantics from the client’s point of view.
Aside from the bookmark everything else is handled by the cluster. The database drivers work with the cluster topology manager to choose the most appropriate servers to route queries to. For instance, routing reads to database secondaries and writes to database primaries.