Introduction: Neo4j clustering architecture

Overview

Neo4j’s clustering provides these main features:

  1. Scalability: A Neo4j cluster is a set of servers running multiple databases. Servers and databases are decoupled: servers provide computation and storage power for databases to use. Each database has its own independent topology, organized into primaries and secondaries (for read scaling).

  2. Fault tolerance: Primary database allocations provide a fault tolerant platform for transaction processing. A database remains available for writes as long as a simple majority of its primary allocations are functioning.

  3. Operability: Database management is separated from server management. For details, see Managing databases in a cluster and Managing servers in a cluster.

  4. Causal consistency: When invoked, a client application is guaranteed to read at least its own writes.

For information about cluster design patterns and anti-patterns, see Designing a resilient multi-data center cluster.

Operational view

From an operational point of view, it is useful to view the cluster as a homogeneous pool of servers which run a number of databases.

Note that primary and secondary are roles for a copy of a database; servers themselves do not have roles. Instead, a server can be constrained by a modeConstraint set to PRIMARY, SECONDARY, or NONE. With a PRIMARY or SECONDARY constraint, a server can host only copies of standard databases in that role (see servers 1 and 3 in Figure 1). If the modeConstraint is set to NONE, a server can host copies of databases in either role; in other words, it can host primaries for some databases and secondaries for others (see server 6 in Figure 1).

Similarly, it is possible for a database to be hosted on only one server, even when that server is part of a cluster. In such cases, the database allocation is always primary. See single primary for details.

Figure 1. Cluster architecture
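As an illustration, a server's constraint can be changed with Cypher along these lines. This is a minimal sketch: the server name is hypothetical, and the exact options syntax may vary by version.

    // Restrict this server to hosting only secondary copies of databases.
    ALTER SERVER "server-3" SET OPTIONS {modeConstraint: "SECONDARY"};

    // Remove the constraint so the server can host copies in either role.
    ALTER SERVER "server-3" SET OPTIONS {modeConstraint: "NONE"};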

Database primaries

A primary is a copy of a database that participates in processing write operations and can act as the writer for that database. A database can have one or more primary allocations within a cluster.

Only primaries are eligible to act as the writer for a database. At any given time, only one primary is automatically elected as a writer among the database’s primaries. The writer may change over time.

The database writer synchronously pushes writes to the other primaries and does not allow a commit to complete until it receives confirmation that the data has been written to enough members (a majority of the primaries).

For high availability, create a database with multiple primaries. If high availability is not required, a database can be created with a single primary to achieve minimum write latency.
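For example, the topology can be declared when the database is created. A minimal Cypher sketch (Enterprise Edition; the database names are hypothetical):

    // Highly available: three primaries, tolerating the failure of one.
    CREATE DATABASE sales TOPOLOGY 3 PRIMARIES;

    // Minimum write latency, but no fault tolerance: a single primary.
    CREATE DATABASE scratch TOPOLOGY 1 PRIMARY;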

If too many primaries fail, the database can no longer process writes and becomes read-only.

Primaries provide:

  • The ability to write to your database (with optional fault tolerance).

  • The ability to read from your database.

  • Fault tolerance for different failure scenarios.

The fault tolerance is calculated with the formula:
M = 2F + 1, where M is the number of primaries required to tolerate F faults.

Generally speaking, fault tolerance is the number of primary database copies you can lose without affecting a certain operation. For instance, with one primary copy you have no fault tolerance: if it goes offline, nothing is available. If you have two servers, each with a primary copy of the database, and one goes offline, the other still has a copy of the data, so read availability is preserved.
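Rearranged, the number of faults that M primaries can tolerate is F = (M − 1) / 2, rounded down. For example, three primaries tolerate one fault and five tolerate two, while an even count adds nothing: four primaries still tolerate only one fault, because three of the four are needed for a majority.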

Types of fault tolerance, listed from easiest to hardest to lose, are as follows:

  • Write availability:
    If write availability is lost, your database cannot accept any more writes.

  • Read availability:
    If read availability is lost, your database cannot serve any more reads.

  • Durability:
    If durability is lost, the data written to your database is lost, and you need to restore the database from a backup.

Operations such as shutting the Neo4j process down to upgrade the binaries, or taking the server it runs on offline for maintenance, count as faults in this context, since they make the database copy unavailable.

If you want upgrades with no downtime, you need fault tolerance.

Therefore, you need a minimum of three primaries (and three servers, one to host each copy) to maintain write availability through the failure of one member. This is enough for most deployments.

If you want to retain write availability with the failure of two primary members, you need five primaries.

The maximum number of primaries you can have is nine, but having that many is not recommended: the more primaries you have, the more servers must be contacted for each write operation, which can increase write latency.

Database secondaries

A secondary is a database copy asynchronously replicated from primaries via transaction log shipping. Secondaries periodically check an upstream database member for new transactions, which are then transferred to them.

The main purpose of database secondaries is to scale out read workloads. Secondaries act like caches for graph data and can execute arbitrary read-only queries and procedures.

Multiple secondaries can be fed data from a relatively small number of primaries, providing a significant distribution of query workloads for better scalability. Databases can have a fairly large number of secondaries.

The loss of a secondary does not affect the database’s availability or its fault tolerance; however, it reduces query throughput.

While secondaries serve as copies of your database and provide some additional durability (what is committed cannot be lost), they do not guarantee it completely. Secondaries pull updates from a selected upstream member on their own schedule. Because replication is asynchronous, secondaries may temporarily lag behind the primaries, meaning recently committed transactions may not be immediately visible on them.

Secondaries typically provide read scaling: if more read queries arrive than your primaries can handle, you can add secondaries to share the load, or even configure query routing so that reads preferentially target secondaries, leaving the primaries free to handle just the write workload.

There is no hard rule about the number of secondaries to have. The usual approach is to start with zero and add more until your read performance and cluster stability are acceptable.

The maximum number of secondaries you can have is 20.
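Following that approach, secondaries can be added later by altering the database’s topology. A minimal Cypher sketch, assuming a hypothetical database named sales that currently has three primaries:

    // Add two secondaries for read scaling; the primaries are unchanged.
    ALTER DATABASE sales SET TOPOLOGY 3 PRIMARIES 2 SECONDARIES;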

Primaries and secondaries for the system database

The system database, which records which databases exist in the DBMS, can also be hosted in primary or secondary mode. However, unlike standard databases, its topology is not defined using Cypher commands. Instead, it is controlled through the server.cluster.system_database_mode setting.
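For example, in neo4j.conf (a sketch; the value is one of PRIMARY or SECONDARY):

    # Host the system database in primary mode on this server.
    server.cluster.system_database_mode=PRIMARY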

Use the following guidelines when deciding how many primary and secondary system databases to have and which servers should host them:

  • Stable, long-lived servers are good candidates to host a system primary, since they are expected to remain online and can be intentionally shut down when needed.

  • Ephemeral or frequently changing servers are good candidates to host a system secondary, as they may be added or removed more often.

  • A single system primary provides no fault tolerance for writes to the system database. Therefore, in a typical cluster deployment, it is best to start with three system primaries to ensure write availability.

  • Although the write volume for the system database is low and it can tolerate higher write latency, which would make more than nine system primaries feasible, doing so is generally not recommended.

Examples of database topologies

For information about cluster deployment across multiple data centers, refer to Designing a resilient multi-data center cluster.

Single primary

If you have a single copy of the database, it is a primary. All writes and reads go through this copy. If the copy becomes unavailable, no writes or reads are possible. If the disk for that copy is lost or corrupted, durability is lost and you must restore from the latest full backup.

Three primaries

In a cluster with three primaries, the members elect a leader to process write operations. Each write is replicated to at least one additional primary before being considered committed (durable). This ensures that if any single primary fails, the update remains available on another member, even if the failed copy is fully lost, for example because its disk is unrecoverable.

Figure 2. Communication between three primaries and a client

If one primary copy fails, the database is still write-available with the remaining two primaries, but it no longer has fault tolerance for its write availability.

Another failure would prevent any new writes from being processed until either one of the other members is brought back or the database is recreated with new members. The database would, however, still be read-available on the last remaining member.

The non-writer primaries also provide read capacity and fault tolerance. By default, read queries are routed away from the writer (see dbms.routing.reads_on_writers_enabled).

See Geo-distribution of user database primaries for the pattern of deploying a cluster with three primaries across three data centers.

Five primaries

If you want fault tolerance beyond the failure of one arbitrary database member, deploy five primaries. The database then tolerates the failure of any two primaries; the remaining three can still form a majority and ensure the database continues to operate.

For information about deploying five primaries across multiple data centers, see Resilient multi-region cluster deployment → Designing a resilient multi-data center cluster.

Single primary plus secondaries

As described above, a single primary provides no fault tolerance for either write availability or durability. If the single primary fails, no write operations can be processed, and if its disk is lost, the most recent updates may be lost.

However, adding one or more secondaries means that read availability can be maintained despite the loss of the primary.

Keep in mind that the secondaries may not have the most up-to-date data, which is only guaranteed to be present on the primary.

The secondaries typically handle all of the read queries (see dbms.routing.reads_on_writers_enabled).

Three primaries plus secondaries

As described above, three primaries provide fault tolerance for both write availability and durability. Both secondaries and non-writer primaries can handle read queries, although you can configure reads to be handled only by secondaries (see dbms.routing.reads_on_primaries_enabled) if you need the primaries to focus on the write workload.
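A neo4j.conf sketch of the two routing settings mentioned in this section, shown with illustrative values (verify the defaults for your version):

    # Route read queries away from the current writer.
    dbms.routing.reads_on_writers_enabled=false
    # Also route read queries away from all primaries, so that only
    # secondaries serve reads.
    dbms.routing.reads_on_primaries_enabled=false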

The loss of any single database copy does not affect write availability, read availability, or durability.

If all the secondaries fail, then primaries that are not acting as the writer start handling read queries to maintain read availability.

Causal consistency

While the operational mechanics of the cluster are interesting from an application point of view, it is also helpful to think about how applications use the database to get their work done. In many applications, it is typically desirable to both read from the graph and write to the graph. Depending on the nature of the workload, it is common to want reads from the graph to take into account previous writes to ensure causal consistency.

Causal consistency is one of numerous consistency models used in distributed computing. It ensures that causally related operations are seen by every instance in the system in the same order. Consequently, client applications are guaranteed to read their own writes, regardless of which instance they communicate with. This simplifies interaction with large clusters, allowing clients to treat them as a single (logical) server.

Causal consistency makes it possible to write to databases hosted on servers in primary mode and read those writes from databases hosted on servers in secondary mode (where graph operations are scaled out). For example, causal consistency guarantees that the write which created a user account is present when that same user subsequently attempts to log in.

When executing a transaction, the client can ask for a bookmark, which it then presents as a parameter to subsequent transactions. Using that bookmark, the cluster can ensure that only servers which have processed the client’s bookmarked transaction will run its next transaction. This provides a causal chain, ensuring correct read-after-write semantics from the client’s point of view.
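A minimal sketch of this flow with the official Neo4j Python driver (5.x API); the URI, credentials, database name, and query are placeholders:

    from neo4j import GraphDatabase

    # Placeholder URI and credentials; the neo4j:// scheme enables the
    # driver's cluster routing (writes to the writer, reads elsewhere).
    driver = GraphDatabase.driver("neo4j://cluster.example.com:7687",
                                  auth=("neo4j", "secret"))

    # Execute a write and capture the bookmark of the committed transaction.
    with driver.session(database="neo4j") as session:
        session.execute_write(
            lambda tx: tx.run("CREATE (:User {name: $name})",
                              name="alice").consume())
        bookmarks = session.last_bookmarks()

    # Present the bookmark to a later session: the read runs only on a
    # member that has already applied the bookmarked transaction.
    with driver.session(database="neo4j", bookmarks=bookmarks) as session:
        record = session.execute_read(
            lambda tx: tx.run("MATCH (u:User {name: $name}) RETURN u.name AS name",
                              name="alice").single())
        print(record["name"])  # guaranteed to see the write above

    driver.close()

Within a single session the driver chains bookmarks automatically; passing bookmarks explicitly is only needed to carry causal ordering across sessions or processes, as above.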

Aside from the bookmark, everything else is handled by the cluster. The database drivers work with the cluster topology manager to choose the most appropriate servers to route queries to: for instance, routing reads to database secondaries and writes to database primaries.