4.2.10. Load balancing for multi-data center systems

This section describes the topology-aware load balancing options available for client applications in a multi-data center Neo4j deployment. It describes how to configure load balancing for the cluster so that client applications can direct their workload at the most appropriate cluster members, such as those nearby.

Enabling load balancing

The load balancing functionality is part of the separately licensed multi-data center package and must be specifically enabled. See the section called “Licensing for multi-data center operations” for details.

4.2.10.1. Introduction

When deploying a multi-data center cluster we often wish to take advantage of locality to reduce latency and improve performance. For example, we would like our graph-intensive workloads to be executed in the local data center at LAN latencies rather than in a faraway data center at WAN latencies. Neo4j’s enhanced load balancing for multi-data center scenarios facilitates precisely this and can also be used to define fall-back behaviors. This means that failures can be planned for up front and persistent overload conditions can be avoided.

The load balancing system is a cooperative system in which the driver asks the cluster on a recurring basis where it should direct the different classes of its workload (e.g. writes and reads). This allows the driver to work independently for long stretches of time, yet check back periodically to adapt to changes, such as a new server having been added for increased capacity. There are also failure situations where the driver will ask again immediately, for example when it cannot use any of its allocated servers.

This is mostly transparent from the perspective of a client. On the server side we configure the load balancing behaviors and expose them under a named load balancing policy which the driver can bind to. All server-side configuration is performed on the Core Servers.

Use load balancing from Neo4j drivers

This chapter describes how to configure a Causal Cluster to use custom load balancing policies. Once enabled and configured, the custom load balancing feature is used by drivers to route traffic as intended. See the Drivers chapter for instructions on how to configure drivers to use custom load balancing.
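As a hedged sketch of the driver side, a policy is typically selected by attaching its name to the routing context of the connection URI; the host name, port, and policy name below are invented for illustration:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical connection URI selecting the server-side policy "north1";
# the host, port, and policy name are invented for illustration.
uri = "bolt+routing://core-01.example.com:7687?policy=north1"

# The policy name is carried to the cluster as part of the routing context.
policy = parse_qs(urlparse(uri).query)["policy"][0]
print(policy)  # → north1

# With an official Neo4j driver, the URI would be passed as-is, e.g.:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(uri, auth=("neo4j", "password"))
```

See the Drivers chapter for the authoritative per-language configuration details.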

4.2.10.2. Prerequisite configuration

Enable multi-data center operations

In order to configure a cluster for load balancing we must enable the multi-data center functionality. This is described in the section called “Licensing for multi-data center operations”.

Server groups

In common with server-to-server catchup, load balancing across multiple data centers is predicated on the server group concept. Servers can belong to one or more potentially overlapping server groups, and decisions about where to route requests from client to cluster member are parameterized based on that configuration. For details on server group configuration, refer to Section 4.2.9.2, “Server groups”.

Cores for reading

Depending on the deployment and the number of servers available in the cluster, different strategies make sense for whether or not read workload should be routed to the Core Servers. The following configuration allows the routing of read workload to Core Servers. Valid values are true and false.

causal_clustering.cluster_allow_reads_on_followers=true

4.2.10.3. The load balancing framework

The load balancing system is based on a plugin architecture for future extensibility and for allowing user customizations. The current version ships with exactly one such canned plugin called the server policies plugin.

The server policies plugin is selected by setting the following property:

causal_clustering.load_balancing.plugin=server_policies

Under the server policies plugin, a number of load balancing policies can be configured server-side and exposed to drivers under unique names. A driver, in turn, must select an appropriate policy by name on instantiation. Common patterns are to name policies after geographical regions or intended application groups.

It is of crucial importance to define the exact same policies on all core machines since this is to be regarded as cluster-wide configuration and failure to do so will lead to surprising behavior. Similarly, policies which are in active use should not be removed or renamed since it will break applications trying to use these policies. It is perfectly acceptable and expected however that policies be modified under the same name.

If a driver asks for a policy name which is not available, then it will not be able to use the cluster. A driver which does not specify any name at all will get the behavior of the default policy as configured. The default policy, if left unchanged, distributes the load across all servers. It is possible to change the default policy to any behavior that a named policy can have.

A misconfigured driver or load balancing policy will result in suboptimal routing choices or even prevent successful interactions with the cluster entirely.

The details of how to write a custom plugin are not documented here. Please get in contact with Neo4j Professional Services if you think that you need a custom plugin.

Policy definitions

The configuration of load balancing policies is transparent to client applications and expressed via a simple DSL. The syntax consists of a set of rules which are considered in order. The first rule to produce a non-empty result will be the final result.

rule1; rule2; rule3

Each rule in turn consists of a set of filters which limit the considered servers, starting with the complete set. Note that the evaluation of each rule starts fresh with the complete set of available servers.

There is a fixed set of filters which compose a rule, and they are chained together using arrows:

filter1 -> filter2 -> filter3

If there are any servers still left after the last filter, then the rule evaluation has produced a result and it will be returned to the driver. However, if no servers are left, then the next rule will be considered. If no rule is able to produce a usable result, then the driver will be signalled a failure.
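The evaluation described above can be modelled with a short sketch; this is not the actual server-side implementation, and the filter helpers, server names, and groups below are all invented for illustration:

```python
# Illustrative model of rule evaluation: each rule is a chain of filters,
# each rule starts from the full server set, and the first rule whose
# chain leaves a non-empty result wins.

def groups(*names):
    """Pass servers belonging to any of the named groups."""
    return lambda servers: [s for s in servers if s["groups"] & set(names)]

def minimum(count):
    """Pass the remaining servers only if at least `count` are left."""
    return lambda servers: servers if len(servers) >= count else []

def evaluate(rules, servers):
    """Return the first non-empty rule result; an empty final result
    means the driver is signalled a failure."""
    for rule in rules:
        remaining = servers
        for filt in rule:
            remaining = filt(remaining)
        if remaining:
            return remaining
    return []

cluster = [
    {"name": "core1", "groups": {"north1"}},
    {"name": "core2", "groups": {"north2"}},
    {"name": "core3", "groups": {"south1"}},
]

# Models "groups(north1)->min(2); all();" — only one server is in north1,
# so the first rule yields nothing and the implicit all() rule applies.
policy = [
    [groups("north1"), minimum(2)],
    [lambda servers: servers],  # all()
]
result = [s["name"] for s in evaluate(policy, cluster)]
print(result)  # → ['core1', 'core2', 'core3']
```

Note how the first rule produces no result (only one server is in north1), so evaluation falls through to the next rule rather than returning an empty set.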

Policy names

The policies are configured under the namespace of the server policies plugin and named as desired. Policy names can contain alphanumeric characters and underscores, and they are case sensitive. Below is the property key for a policy with the name mypolicy.

causal_clustering.load_balancing.config.server_policies.mypolicy=

The actual policy is defined in the value part using the DSL.

The default policy name is reserved for the default policy. It is possible to configure this policy like any other and it will be used by driver clients which do not specify a policy.
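For instance, the default policy could be redefined to prefer a local server group before spanning the whole cluster; the group name local below is an assumption for illustration:

causal_clustering.load_balancing.config.server_policies.default=\
groups(local)->min(2); all();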

Additionally, any number of policies can be created using unique policy names. The policy name can suggest a particular region or an application for which it is intended to be used.

Filters

There are four filters available for specifying rules, detailed below. The syntax is similar to a method call with parameters.

  • groups(name1, name2, …​)

    • Only servers which are part of any of the specified groups will pass the filter.
    • The defined names must match those of the server groups.
  • min(count)

    • Servers are allowed to pass only if at least count of them remain; otherwise none pass.
    • Allows overload conditions to be managed.
  • all()

    • No need to specify since it is implicit at the beginning of each rule.
    • Implicitly the last rule (override this behavior using halt).
  • halt()

    • Only makes sense as the last filter in the last rule.
    • Will stop the processing of any more rules.

The groups filter is essentially an OR-filter: groups(A,B) will pass any server in either A, B, or both (the union of the server groups). An AND-filter can be created by chaining two filters, as in groups(A) -> groups(B), which will only pass servers that are in both groups (the intersection of the server groups).
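The union and intersection semantics can be illustrated with plain Python sets; the server and group names are invented:

```python
# Invented servers with their group memberships.
servers = {
    "s1": {"A"},
    "s2": {"B"},
    "s3": {"A", "B"},
}

# groups(A,B): OR-filter — any server in A, B, or both passes (union).
or_pass = {name for name, g in servers.items() if g & {"A", "B"}}
print(sorted(or_pass))  # → ['s1', 's2', 's3']

# groups(A) -> groups(B): AND-filter — only servers in both groups pass
# (intersection).
and_pass = {name for name, g in servers.items() if {"A", "B"} <= g}
print(sorted(and_pass))  # → ['s3']
```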

4.2.10.4. Load balancing examples

In our discussion on multi-data center clusters we introduced a four region, multi-data center setup. We used the cardinal compass points for regions and numbered data centers within those regions. We’ll use the same hypothetical setup here too.

Figure 4.18. Mapping regions and data centers onto server groups

We configure the behavior of the load balancer in the property causal_clustering.load_balancing.config.server_policies.<policy-name>. The rules we specify will allow us to fine-tune how the cluster routes requests under load.

In the examples we will make use of the line continuation character \ for better readability. It is valid syntax in neo4j.conf as well, and it is recommended to break up complicated rule definitions by placing each rule on its own line.

The most restrictive strategy would be to insist on a particular data center to the exclusion of all others:

Example 4.16. Specific data center only
causal_clustering.load_balancing.config.server_policies.north1_only=\
groups(north1)->min(2); halt();

In this case we’re stating that we are only interested in sending queries to servers in the north1 server group, which maps onto a specific physical data center, provided there are two of them available. If we cannot provide at least two servers in north1 then we should halt(), i.e. not try any other data center.

While the previous example demonstrates the basic form of our load balancing rules, we can be a little more expansive:

Example 4.17. Specific data center preferably
causal_clustering.load_balancing.config.server_policies.north1=\
groups(north1)->min(2);

In this case if at least two servers are available in the north1 data center then we will load balance across them. Otherwise we will use any server in the whole cluster, falling back to the implicit, final all() rule.

The previous example considered only a single data center before resorting to the whole cluster. If we have a hierarchy or region concept exposed through our server groups we can make the fall back more graceful:

Example 4.18. Gracefully falling back to neighbors
causal_clustering.load_balancing.config.server_policies.north_app1=\
groups(north1,north2)->min(2);\
groups(north);\
all();

In this case we’re saying that the cluster should load balance across the north1 and north2 data centers provided there are at least two machines available across them. Failing that, we’ll resort to any instance in the north region, and if the whole of the north is offline we’ll resort to any instances in the cluster.
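This fall-back behavior can be traced with a small sketch; the server names and group memberships below are invented, and the function simply mirrors the three rules of the north_app1 policy:

```python
# Invented server groups mirroring the north_app1 policy's three rules.
north1 = {"core-n1a"}
north2 = {"core-n2a"}
north = north1 | north2 | {"replica-n3"}
cluster = north | {"core-s1", "core-s2"}

def north_app1(available):
    # Rule 1: groups(north1,north2) -> min(2)
    candidates = available & (north1 | north2)
    if len(candidates) >= 2:
        return candidates
    # Rule 2: groups(north)
    candidates = available & north
    if candidates:
        return candidates
    # Rule 3: all()
    return available

everything_up = north_app1(cluster)         # both northern DCs available
north2_down = north_app1(cluster - north2)  # falls back to the north region
north_down = north_app1(cluster - north)    # falls back to the whole cluster
print(sorted(everything_up), sorted(north2_down), sorted(north_down))
```

With all servers up, only the north1 and north2 machines are returned; losing north2 drops the candidate count below two and widens the result to the north region; losing the whole north region widens it to the entire cluster.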