Atomic Broadcast and Leader Election
SummaryThis section introduces the core mechanics of crash-fault-tolerant...
This section introduces the core mechanics of crash-fault-tolerant...
This section introduces the core mechanics of crash-fault-tolerant consensus via Paxos and Raft. It defines the consensus problem as log replication, implemented through atomic broadcast. The two key sub-problems are detailed: Leader Election (establishing a unique coordinator per term via quorum votes and monotonic term numbers) and Log Replication (the leader appending and replicating entries, committing via quorum, applying to a state machine). The immutable safety guarantee is explained through the Quorum Intersection property, which ensures at most one leader per term and prevents committed entries from being lost. Key components—Term Number, Quorum, Leader Election, AppendEntries RPC, Commit Index, and Election Timeout—are analyzed as trade-offs between safety (CP), liveness, and performance. The section is anchored by a detailed Python RaftServer class and diagrams illustrating quorum intersection and replication flow.
CH5-S1: Atomic Broadcast and Leader Election
Log replication is the core consensus problem in distributed systems: ensuring all nodes agree on a total order of operations despite failures and network partitions. This agreement—atomic broadcast—requires mechanisms that enforce strict ordering and durability across replicas. The fundamental trade-off is immutable: safety (consistency) is prioritized over availability during partitions (CP), enforced through quorum-based coordination. Paxos and Raft resolve this by structuring consensus around leader election and deterministic log replication, ensuring that once a value is committed, it remains so across failures.
Leader Election
Leader election establishes a single coordinator responsible for sequencing log entries and responding to clients within a term. The process enforces an immutable constraint: at most one leader may exist per term, ensuring safety. Nodes operate with a monotonically increasing term number, transitioning to candidate state when leadership fails. A candidate increments its term, votes for itself, and solicits votes from peers. Only upon receiving a quorum of votes does it become leader, guaranteeing exclusivity through majority agreement.
class RaftServer:
def __init__(self, server_id, all_servers):
self.id = server_id
self.all_servers = all_servers
self.current_term = 0
self.voted_for = None
self.log = [] # list of entries: {'term': int, 'command': any}
self.commit_index = -1
self.last_applied = -1
self.state = 'follower'
self.election_timeout = random.uniform(150, 300) # ms
self.leader_id = None
self.next_index = {s: 0 for s in all_servers}
self.match_index = {s: -1 for s in all_servers}
# --- Leader Election ---
def become_candidate(self):
"""Transition to candidate, start election."""
self.state = 'candidate'
self.current_term += 1 # Increment term for new election
self.voted_for = self.id # Vote for self
self.leader_id = None
self.reset_election_timer()
# Request votes from all other servers
votes_received = 1 # self-vote
for server in self.all_servers:
if server != self.id:
# Send RequestVote RPC
last_log_index = len(self.log) - 1
last_log_term = self.log[-1]['term'] if self.log else 0
# ... RPC logic to collect votes
# If votes_received > len(self.all_servers)/2: become_leader()
# --- Log Replication ---
def leader_append_entries(self, command):
"""Leader appends a new entry and starts replication."""
if self.state != 'leader':
return False
new_entry = {'term': self.current_term, 'command': command}
self.log.append(new_entry)
# Replicate to followers via AppendEntries
for follower in self.all_servers:
if follower != self.id:
self.send_append_entries(follower)
return True
def send_append_entries(self, follower_id):
"""Send AppendEntries RPC to a follower."""
prev_log_index = self.next_index[follower_id] - 1
prev_log_term = self.log[prev_log_index]['term'] if prev_log_index >= 0 else 0
entries = self.log[self.next_index[follower_id]:]
# Send RPC with: term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit
# On successful reply, update next_index and match_index for follower
# If reply fails due to log inconsistency, decrement next_index and retry
def advance_commit_index(self):
"""Leader advances commitIndex based on quorum match."""
# For N servers, quorum size = N//2 + 1
quorum_size = len(self.all_servers) // 2 + 1
for idx in range(self.commit_index + 1, len(self.log)):
# Count followers that have replicated this index
count = 1 # leader has it
for s in self.all_servers:
if s != self.id and self.match_index.get(s, -1) >= idx:
count += 1
if count >= quorum_size and self.log[idx]['term'] == self.current_term:
self.commit_index = idx
# Apply committed entries to state machine
self.apply_committed()
def apply_committed(self):
"""Apply committed entries to state machine in order."""
while self.last_applied < self.commit_index:
self.last_applied += 1
entry = self.log[self.last_applied]
# Apply entry['command'] to the state machine
print(f"Server {self.id} applying: {entry['command']} at index {self.last_applied}")
The code above implements the core state machine of a Raft server. The `become_candidate` method initiates leader election upon timeout, enforcing progress via randomized election timers. Upon securing a quorum of votes, the node becomes leader and drives log replication through `leader_append_entries`, ensuring total order.
### Log Replication
Log replication ensures all nodes maintain identical command sequences by having the leader replicate entries to followers. The leader tracks `next_index` and `match_index` per follower to manage replication state. Entries are appended only after achieving quorum-level replication, at which point they are marked committed and applied to the state machine in order. This mechanism enforces the Log Matching Property: if two logs contain an entry at the same index and term, then all preceding entries are identical.
### Quorum Intersection and Safety
Quorum intersection is the foundational invariant ensuring safety in consensus protocols. Any two quorums must share at least one node, preventing conflicting decisions. This property guarantees that a new leader’s log contains all previously committed entries, preserving consistency across leadership changes.
#### Diagram: Quorum Intersection Ensuring Safety
**Description:** A Venn diagram showing two overlapping circles within a larger set representing all cluster nodes (e.g., 5 nodes: A, B, C, D, E).
- **Circle 1 (Quorum Q1):** Contains nodes {A, B, C}. This represents a majority (3 out of 5).
- **Circle 2 (Quorum Q2):** Contains nodes {C, D, E}. This represents a different majority.
- **Intersection Region:** Contains node {C}. This illustrates the quorum intersection property: any two majorities must share at least one node (C).
- **Implication Text:**
1. **Leader Election:** Candidate X winning Q1 and candidate Y winning Q2 both require node C’s vote. Since a node votes at most once per term, only one candidate can achieve a majority.
2. **Log Commitment:** An entry committed by leader L1 on Q1 must be present on node C. For leader L2 to be elected on Q2, it must receive C’s vote, ensuring L2’s log includes all entries known to L1. This enforces the Leader Completeness Property.
### Component Trade-offs
The following table outlines the key components of consensus protocols like Paxos and Raft, their purposes, and the immutable trade-offs they enforce:
| Component | Purpose | Key Property | Trade-off/Immutable Constraint |
|-----------|---------|--------------|--------------------------------|
| **Term Number** | Logical timeline; unique leader identifier per term. | Monotonically increasing. | Immutable: Prevents stale leaders but requires strict monotonicity (cannot regress). |
| **Quorum (Majority)** | Decision threshold for election & commitment. | Quorums must intersect. | Immutable: Provides safety (CP) but reduces write availability during partitions. |
| **Leader Election** | Establish a coordinator for log replication. | At most one leader per term. | Immutable: Provides liveness via timeouts, but timeout tuning is critical (too fast causes churn, too slow reduces availability). |
| **AppendEntries RPC** | Replicate log entries and heartbeat. | Enforces log matching property. | Immutable: Provides consistency but requires leader to manage follower log divergence (complex recovery). |
| **Commit Index** | Tracks which log entries are committed (safe to apply). | Entry committed only if stored on a quorum. | Immutable: Ensures durability but adds latency (waiting for quorum acknowledgments). |
| **Election Timeout** | Triggers new election if leader is presumed dead. | Randomized to avoid split votes. | Immutable: Provides liveness but introduces a timing assumption (circumvents FLP). |
Each component enforces an invariant critical to safety or liveness. The use of majority quorums, for example, makes split-brain impossible but sacrifices availability under partition—a non-negotiable trade-off in CP systems.
### Conclusion
Leader election and log replication are not merely protocol steps; they are mechanisms that collectively enforce the Leader Completeness Property. By requiring leaders to prove quorum membership during election and ensuring all committed entries are replicated before application, the system guarantees that no committed entry is ever lost, even amid failures and leadership transitions. This design ensures linearizable consistency: operations are applied in a globally agreed order, and recovery is deterministic. The cost—reduced availability during network partitions—is the immutable price of safety in asynchronous distributed systems.