Introduction to the Neo4j Causal Clustering architecture.

1. Overview

Neo4j’s Causal Clustering provides three main features:

  1. Safety: Core Servers provide a fault tolerant platform for transaction processing which will remain available while a simple majority of those Core Servers are functioning.

  2. Scale: Read Replicas provide a massively scalable platform for graph queries that enables very large graph workloads to be executed in a widely distributed topology.

  3. Causal consistency: when invoked, a client application is guaranteed to read at least its own writes.

Together, this allows the end-user system to be fully functional and both read and write to the database in the event of multiple hardware and network failures and makes reasoning about database interactions straightforward.

In the remainder of this section we will provide an overview of how causal clustering works in production, including both operational and application aspects.

2. Operational view

From an operational point of view, it is useful to view the cluster as being composed of servers with two different roles, referred to as Primary and Secondary servers.

cluster architecture
Figure 1. Causal Cluster Architecture

The two roles are foundational in any production deployment but are managed at different scales from one another and undertake different roles in managing the fault tolerance and scalability of the overall cluster.

3. Primary servers

The Primary servers are based on two types on instances:

  • Single instance is an instance that operates without redundancy within the set of Primary servers and allows read and write operations. Redundancy is achieved by adding Secondary servers, which guarantee causal consistency but they do not safeguard data as Primary servers do. Therefore, clusters based on a Single instance as Primary server are good for read scalability, but they are not fault tolerant. If a fault occurs on the Single instance, there is a potential risk of data loss: it is the responsibility of the application or of the tooling around the cluster to eliminate or minimize such risk.

  • Core instance is an instance that allows read and write operations and its main responsibility is to safeguard data. Core instances do so by replicating all transactions using the Raft protocol. Raft ensures that the data is safely durable before confirming a transaction commit to the end user application. In practice, this means once a majority of Core instances in a cluster (N/2+1) have accepted the transaction, it is safe to acknowledge the commit to the end user application.

    The safety requirement has an impact on write latency. Implicitly, writes are acknowledged by the fastest majority, but as the number of Core instances in the cluster grows, so does the size of the majority needed to acknowledge a write.

    In practice, this means that there are relatively few machines in a typical Core instance cluster, enough to provide sufficient fault tolerance for the specific deployment. This is calculated with the formula M = 2F + 1, where M is the number of Core instances required to tolerate F faults. For example:

    • In order to tolerate two failed Core instances, you need to deploy a cluster of five Core instances.

    • The smallest fault tolerant cluster, a cluster that can tolerate one fault, must have three Core instances.

    • It is also possible to create a Causal Cluster consisting of only two Core instances. However, that cluster is not fault-tolerant. If one of the two servers fails, the remaining server becomes read-only.

With Core instances, should the cluster suffer enough Core failures, it can no longer process writes and becomes read-only to preserve safety.

In version 4.4 of Neo4j Causal Cluster, Primary servers cannot be mixed: either one Single instance is the Primary server or a set of Core instances are the Primary servers.

4. Secondary servers

In version 4.4 of Neo4j Causal Cluster, Secondary servers can only be one type of instance, called Read Replica instances.

Read Replica instances are asynchronously replicated from Primary Servers via transaction log shipping. They will periodically poll an upstream server for new transactions and have these shipped over. Many Read Replicas can be fed data from a relatively small number of Primary Servers, allowing for a large fan out of the query workload for scale.

Read Replica instances should typically be run in relatively large numbers and treated as disposable. Losing a Read Replica does not impact the cluster’s availability, aside from the loss of its fraction of graph query throughput. It does not affect the fault tolerance capabilities of the cluster.

Read Replicas are read-only, but you should not create one (or convert the dbms.mode of a CORE to READ_REPLICA) if you simply desire a neo4j instance which temporarily cannot execute write queries. Instead you should use the dbms.databases.default_to_read_only config setting to prevent writes to databases on the instance in question.

The main responsibility of Read Replica instances is to scale out read workloads. Read Replica instances act like caches for the graph data and are fully capable of executing arbitrary (read-only) queries and procedures.

When the Primary server is a Single instance, Secondary servers may be part of a Disaster Recovery strategy. Due to its asynchronous nature, Read Replica instances may not provide all transactions committed on the Primary server, but they may be set as a new Primary server in case the Single instance is no longer available. The change of a Read Replica instance into a Single instance is a manual operation that must be executed by a Database Administrator or by some tooling and it requires careful checks, in order to identify the most up-to-date instance and the status of the other instances.

5. Causal consistency

While the operational mechanics of the cluster are interesting from an application point of view, it is also helpful to think about how applications will use the database to get their work done. In many applications, it is typically desirable to both read from the graph and write to the graph. Depending on the nature of the workload, it is common to want reads from the graph to take into account previous writes to ensure causal consistency.

Causal consistency is one of numerous consistency models used in distributed computing. It ensures that causally related operations are seen by every instance in the system in the same order. Consequently, client applications are guaranteed to read their own writes, regardless of which instance they communicate with. This simplifies interaction with large clusters, allowing clients to treat them as a single (logical) server.

Causal consistency makes it possible to write to Core Servers (where data is safe) and read those writes from a Read Replica (where graph operations are scaled out). For example, causal consistency guarantees that the write which created a user account will be present when that same user subsequently attempts to log in.

causal clustering drivers
Figure 2. Causal Cluster setup with causal consistency via Neo4j drivers

On executing a transaction, the client can ask for a bookmark which it then presents as a parameter to subsequent transactions. Using that bookmark the cluster can ensure that only servers which have processed the client’s bookmarked transaction will run its next transaction. This provides a causal chain which ensures correct read-after-write semantics from the client’s point of view.

Aside from the bookmark everything else is handled by the cluster. The database drivers work with the cluster topology manager to choose the most appropriate Core Servers and Read Replicas to provide high quality of service.

Since Neo4j clusters are causally consistent, in the remainder of this chapter, the terms causal cluster or cluster are used to denote Neo4j installations consisting of primary and secondary servers.