Mitigating Causal Cluster re-elections caused by high GCs
This article describes the effects of JVM stop-the-world GC pauses, on a causal cluster. A brief introduction to garbage collection, heap sizing and memory leak troubleshooting, is followed by a discussion on best practices and configurations to help mitigate the resulting heartbeat timeouts and cluster re-elections.
A JVM (java virtual machine, e.g. the Neo4J server) allocates required memory to newly created java objects. This memory is part
of the total heap memory allocation configured at startup via neo4j.conf. The JVM periodically checks for unused objects in heap,
which are subsequently discarded to reclaim heap memory via a process called
Garbage Collection or
GC in short.
Assuming heap sizing has been optimally done using Neo4j memory configurations (see https://neo4j.com/developer/kb/understanding-memory-consumption/), spikes in usage/data volume may still lead to increased heap utilisation and consequently longer and/or more frequent garbage collection. Memory leaks are another cause of increased heap utilisation, in many cases leading to a heap out of memory situation. A memory leak in Java is a situation where some objects are no longer used by the application but Garbage Collection fails to recognise this.
In such cases, adding more heap may simply postpone the JVM running out of heap space
java.lang.OutOfMemoryError: Java heap space error).
Additionally, increasing the amount of Java heap space also tends to increase the length of GC pauses.
One can execute heap dumps and use e.g. Eclipse MAT (ensure that your machine has sufficient memory for the analysis)
to diagnose memory leaks.
Other tools like JDK,
jstat and indeed the neo4j gc.log may help identify any outlier transactions.
One common consequence of the above long GCs or memory leaks, is heartbeat timeouts in a cluster, which is essentially the cluster members not being able to timely reach other members. That is because a full GC enforces a stop-the-world pause, during which time, the JVM halts all other operations, including network communication. In a causal cluster, this often results in a re-election, whereby unreachable members are removed from the cluster and a new cluster leader is elected. Note that there are other factors that may lead to re-elections but they generally benign and are out of the scope of this article.
Re-election in a Causal Cluster, is a quick, seamless process, which under normal operation, occurs within a few tens/hundreds milliseconds without any noticeable effects on transaction clients. Re-elections are a natural consequence of what is, effectively, the leader becoming unavailable for the duration of the pause, the followers don’t know that the leader is garbage collecting, nor should they care. In high workload situations though, coupled with frequent high GC pauses, re-elections may be undesirable. That is because the number of inbound and therefore, rejected transactions per unit time are high, during a re-election. Also because the newly elected leader may need to commit any pending transactions to store via checkpointing as well as become the new source for followers to catch up from, whilst a high number of newly incoming transactions will again require frequent and possibly long GC pauses, having a cumulative effect on the OS CPU and memory resources. Additionally, the previous leader may rejoin the cluster within a short timespan, still having higher transaction IDs in its logs and therefore becoming the leader again. Importantly, whilst an election itself may last a few milliseconds, during an election, the cluster will not accept writes. These are some of the key factors which may result in cyclic continuous leader switches, especially under heavy workload situations.
Election timeout refers to the time after which a
FOLLOWER will attempt to start an election, becoming a
timeouts continue during elections, and a
CANDIDATE will either become a
FOLLOWER or start another election if it fails to
get elected, or hear from a newly elected
LEADER, after another
causal_clustering.leader_election_timeout period. Long
election timeouts are undesirable in general. They represent a greater period of downtime during leader fail over. The aim should
not be to increase it, but rather manage the workload to cause less GC. Having a slightly longer period for the second timeout,
may reduce the overall downtime caused by elections. The documented definition of
does not mention heartbeat timeouts since heartbeats are an implementation detail. Most intra-cluster messages are also treated
as heartbeats. If a cluster member receives a message from the leader, it is considered to have heard from that leader and so
the election timeout resets the timeout after which a new leader_election will be triggered.
Causal Cluster heartbeat timeouts are controlled via the configuration
causal_clustering.leader_election_timeout which defaults
to 7s and the recommended value is less than 10s. For graphs with large heaps, GC pauses may sometimes be of the order of
minutes, in which case, such timeout values wouldn’t prevent re-elections. Note that it is not large heaps per se that cause GC,
but allocating far more than the GC is able to collect, eventually causing
stop-the-world pauses. If there were numerous smaller
pauses which were very occasionally causing leadership elections (so just occasionally missing the 7s timeout), then increasing
causal_clustering.leader_election_timeout by a small value could be a solution, but the only real solution is to generate
less garbage, which usually involves rewriting cypher queries and avoiding large data ingest queries etc.
Query optimisation is beyond the scope of this article, but is part of the Neo4J cypher manual, available at:
Another configuration which helps prevent frequent re-elections, is
causal_clustering.enable_pre_voting (added in Neo4j 3.2.9),
which controls whether "pre-elections" are enabled. Pre_voting is where an instance asks other cluster members: “if I was to call
an election, would you vote for me?”. It is an optimistic approach to stop potentially unfruitful elections (which as we know
stop writes while they’re happening). “Fruitless” in such case typically means the previous leader remains leader and all we
get is election number of seconds of unavailability while the “Fruitless” election happens. Following an election, the cluster
member with the highest "term" wins and becomes the new leader. Note that members do not increase their term unless
another member votes for them, and that leaders must step down if a follower has a higher term, it means the leader is behind.
Pre-voting therefore helps reduce the number of unnecessary elections in two ways:
It ensures that an election is not triggered unless a quorum has lost heartbeat with the leader.
It ensures that a member at a term lesser than others in the cluster, does not trigger an election.
Is this page helpful?