tags : Consensus Protocols, Distributed Systems, Nomad

Leader selection

  • The protocol doesn’t require a quorum (majority) for a leader election vote to pass, it can function on 2 servers.
  • When the 3rd one returns, it will see it’s raft log is out of date and will synchronise back up and start working again. During that phase it will not make itself a candidate for leader election

Failure

  • A Raft group can tolerate ƒ failures given 2ƒ+1 nodes. For example, in a cluster with five nodes and a topic with a replication factor of five, the topic remains fully operational if two nodes fail.
  • Raft is a majority vote algorithm. For a leader to acknowledge that an event has been committed to a partition, a majority of its replicas must have written that event to their copy of the log. When a majority (quorum) of responses have been received, the leader can make the event available to consumers and acknowledge receipt of the event when acks=all (-1).

TODO What happens

  1. node in the quorum goes down, a timeout happens

1.1 We assume we have enough replicas

  1. Raft elect new leader, and make it

Election

Raft uses a heartbeat mechanism to maintain leader authority and to trigger leader elections. The partition leader sends a periodic heartbeat to all followers to assert its leadership in the current term (default = 150 milliseconds). A term is an arbitrary period of time that starts when a leader election is triggered. If a follower does not receive a heartbeat over a period of time (default = 1.5 seconds), then it triggers an election to choose a new partition leader. The follower increments its term and votes for itself to be the leader for that term. It then sends a vote request to the other nodes and waits for one of the following scenarios:

It receives a majority of votes and becomes the leader. Raft guarantees that at most one candidate can be elected the leader for a given term.

Another follower establishes itself as the leader. While waiting for votes, the candidate may receive communication from another node in the group claiming to be the leader. The candidate only accepts the claim if its term is greater than or equal to the candidate’s term; otherwise, the communication is rejected and the candidate continues to wait for votes.

No leader is elected over a period of time. If multiple followers timeout and become election candidates at the same time, it’s possible that no candidate gets a majority of votes. When this happens, each candidate increments its term and triggers a new election round. Raft uses a random timeout between 150-300 milliseconds to ensure that split votes are rare and resolved quickly.

As long as there is a timing inequality between heartbeat time, election timeout, and mean time between node failures (MTBF), then Raft can elect and maintain a steady leader and make progress. A leader can maintain its position as long as one of the ten heartbeat messages it sends to all of its followers every 1.5 seconds is received; otherwise, a new leader is elected.

If a follower triggers an election, but the incumbent leader subsequently springs back to life and starts sending data again, then it’s too late. As part of the election process, the follower (now an election candidate) incremented the term and rejects requests from the previous term, essentially forcing a leadership change. If a cluster is experiencing wider network infrastructure problems that result in latencies above the heartbeat timeout, then back-to-back election rounds can be triggered. During this period, unstable Raft groups may not be able to form a quorum. This results in partitions rejecting writes, but data previously written to disk is not lost. Redpanda has a Raft-priority implementation that allows the system to settle quickly after network outages.