Atomic Broadcast and Leader Election - data systems mechanics invariants in distributed architectures • Dev|Journal

CH5-S1: Atomic Broadcast and Leader Election

Log replication is the core consensus problem in distributed systems: ensuring all nodes agree on a total order of operations despite failures and network partitions. This agreement—atomic broadcast—requires mechanisms that enforce strict ordering and durability across replicas. The fundamental trade-off is immutable: safety (consistency) is prioritized over availability during partitions (CP), enforced through quorum-based coordination. Paxos and Raft resolve this by structuring consensus around leader election and deterministic log replication, ensuring that once a value is committed, it remains so across failures.

Leader Election

Leader election establishes a single coordinator responsible for sequencing log entries and responding to clients within a term. The process enforces an immutable constraint: at most one leader may exist per term, ensuring safety. Nodes operate with a monotonically increasing term number, transitioning to candidate state when leadership fails. A candidate increments its term, votes for itself, and solicits votes from peers. Only upon receiving a quorum of votes does it become leader, guaranteeing exclusivity through majority agreement.

class RaftServer:
    def __init__(self, server_id, all_servers):
        self.id = server_id
        self.all_servers = all_servers
        self.current_term = 0
        self.voted_for = None
        self.log = []  # list of entries: {'term': int, 'command': any}
        self.commit_index = -1
        self.last_applied = -1
        self.state = 'follower'
        self.election_timeout = random.uniform(150, 300)  # ms
        self.leader_id = None
        self.next_index = {s: 0 for s in all_servers}
        self.match_index = {s: -1 for s in all_servers}

    # --- Leader Election ---
    def become_candidate(self):
        """Transition to candidate, start election."""
        self.state = 'candidate'
        self.current_term += 1  # Increment term for new election
        self.voted_for = self.id  # Vote for self
        self.leader_id = None
        self.reset_election_timer()
        # Request votes from all other servers
        votes_received = 1  # self-vote
        for server in self.all_servers:
            if server != self.id:
                # Send RequestVote RPC
                last_log_index = len(self.log) - 1
                last_log_term = self.log[-1]['term'] if self.log else 0
                # ... RPC logic to collect votes
        # If votes_received > len(self.all_servers)/2: become_leader()

    # --- Log Replication ---
    def leader_append_entries(self, command):
        """Leader appends a new entry and starts replication."""
        if self.state != 'leader':
            return False
        new_entry = {'term': self.current_term, 'command': command}
        self.log.append(new_entry)
        # Replicate to followers via AppendEntries
        for follower in self.all_servers:
            if follower != self.id:
                self.send_append_entries(follower)
        return True

    def send_append_entries(self, follower_id):
        """Send AppendEntries RPC to a follower."""
        prev_log_index = self.next_index[follower_id] - 1
        prev_log_term = self.log[prev_log_index]['term'] if prev_log_index >= 0 else 0
        entries = self.log[self.next_index[follower_id]:]
        # Send RPC with: term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit
        # On successful reply, update next_index and match_index for follower
        # If reply fails due to log inconsistency, decrement next_index and retry

    def advance_commit_index(self):
        """Leader advances commitIndex based on quorum match."""
        # For N servers, quorum size = N//2 + 1
        quorum_size = len(self.all_servers) // 2 + 1
        for idx in range(self.commit_index + 1, len(self.log)):
            # Count followers that have replicated this index
            count = 1  # leader has it
            for s in self.all_servers:
                if s != self.id and self.match_index.get(s, -1) >= idx:
                    count += 1
            if count >= quorum_size and self.log[idx]['term'] == self.current_term:
                self.commit_index = idx
        # Apply committed entries to state machine
        self.apply_committed()

    def apply_committed(self):
        """Apply committed entries to state machine in order."""
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            entry = self.log[self.last_applied]
            # Apply entry['command'] to the state machine
            print(f"Server {self.id} applying: {entry['command']} at index {self.last_applied}")

The code above implements the core state machine of a Raft server. The `become_candidate` method initiates leader election upon timeout, enforcing progress via randomized election timers. Upon securing a quorum of votes, the node becomes leader and drives log replication through `leader_append_entries`, ensuring total order.

### Log Replication

Log replication ensures all nodes maintain identical command sequences by having the leader replicate entries to followers. The leader tracks `next_index` and `match_index` per follower to manage replication state. Entries are appended only after achieving quorum-level replication, at which point they are marked committed and applied to the state machine in order. This mechanism enforces the Log Matching Property: if two logs contain an entry at the same index and term, then all preceding entries are identical.

### Quorum Intersection and Safety

Quorum intersection is the foundational invariant ensuring safety in consensus protocols. Any two quorums must share at least one node, preventing conflicting decisions. This property guarantees that a new leader’s log contains all previously committed entries, preserving consistency across leadership changes.

#### Diagram: Quorum Intersection Ensuring Safety

**Description:** A Venn diagram showing two overlapping circles within a larger set representing all cluster nodes (e.g., 5 nodes: A, B, C, D, E).
- **Circle 1 (Quorum Q1):** Contains nodes {A, B, C}. This represents a majority (3 out of 5).
- **Circle 2 (Quorum Q2):** Contains nodes {C, D, E}. This represents a different majority.
- **Intersection Region:** Contains node {C}. This illustrates the quorum intersection property: any two majorities must share at least one node (C).
- **Implication Text:** 
  1. **Leader Election:** Candidate X winning Q1 and candidate Y winning Q2 both require node C’s vote. Since a node votes at most once per term, only one candidate can achieve a majority.
  2. **Log Commitment:** An entry committed by leader L1 on Q1 must be present on node C. For leader L2 to be elected on Q2, it must receive C’s vote, ensuring L2’s log includes all entries known to L1. This enforces the Leader Completeness Property.

### Component Trade-offs

The following table outlines the key components of consensus protocols like Paxos and Raft, their purposes, and the immutable trade-offs they enforce:

| Component | Purpose | Key Property | Trade-off/Immutable Constraint |
|-----------|---------|--------------|--------------------------------|
| **Term Number** | Logical timeline; unique leader identifier per term. | Monotonically increasing. | Immutable: Prevents stale leaders but requires strict monotonicity (cannot regress). |
| **Quorum (Majority)** | Decision threshold for election & commitment. | Quorums must intersect. | Immutable: Provides safety (CP) but reduces write availability during partitions. |
| **Leader Election** | Establish a coordinator for log replication. | At most one leader per term. | Immutable: Provides liveness via timeouts, but timeout tuning is critical (too fast causes churn, too slow reduces availability). |
| **AppendEntries RPC** | Replicate log entries and heartbeat. | Enforces log matching property. | Immutable: Provides consistency but requires leader to manage follower log divergence (complex recovery). |
| **Commit Index** | Tracks which log entries are committed (safe to apply). | Entry committed only if stored on a quorum. | Immutable: Ensures durability but adds latency (waiting for quorum acknowledgments). |
| **Election Timeout** | Triggers new election if leader is presumed dead. | Randomized to avoid split votes. | Immutable: Provides liveness but introduces a timing assumption (circumvents FLP). |

Each component enforces an invariant critical to safety or liveness. The use of majority quorums, for example, makes split-brain impossible but sacrifices availability under partition—a non-negotiable trade-off in CP systems.

### Conclusion

Leader election and log replication are not merely protocol steps; they are mechanisms that collectively enforce the Leader Completeness Property. By requiring leaders to prove quorum membership during election and ensuring all committed entries are replicated before application, the system guarantees that no committed entry is ever lost, even amid failures and leadership transitions. This design ensures linearizable consistency: operations are applied in a globally agreed order, and recovery is deterministic. The cost—reduced availability during network partitions—is the immutable price of safety in asynchronous distributed systems.