Skip to main content
data systems mechanics invariants in distributed architectures

Replication Topologies: Leaders and Followers

7 min read Chapter 6 of 28
Summary

This section introduces replication topologies as fundamental mechanisms...

This section introduces replication topologies as fundamental mechanisms for achieving high availability and fault tolerance in distributed database systems. It contrasts single-leader and multi-leader replication strategies, highlighting their trade-offs between consistency, availability, and performance. Single-leader replication provides strong consistency by funneling all writes through a designated leader node, creating a potential write bottleneck but ensuring a single source of truth. Multi-leader replication improves write availability by allowing multiple nodes to accept writes independently, but introduces complexity through required conflict resolution mechanisms like last-write-wins (LWW) and application-defined merge logic (CRDTs). The section explores practical challenges including replication lag, stale reads, and failure handling through automatic failover. Key terminology introduced includes: single-leader replication (one designated node accepts all writes), multi-leader replication (multiple nodes accept writes requiring conflict resolution), synchronous replication (waits for follower acknowledgments), asynchronous replication (confirms writes before replication), and replication lag (delay between leader commit and follower apply). The included comparison table systematically evaluates consistency models, write availability, read scalability, and failure handling across different replication types. Code examples demonstrate synchronous vs. asynchronous write paths and conflict resolution strategies.

Replication Topologies: Enforcing Consistency Through Invariants

Replication is not a feature—it is a necessary consequence of distributed systems’ exposure to failure. The only question is how consistency is enforced when data exists in multiple places simultaneously. Every replication topology makes immutable tradeoffs between consistency, availability, and operational complexity. These tradeoffs are not negotiable; they are dictated by the laws of distributed computing. This section defines replication topologies by their consistency invariants, the mechanisms that enforce them, and the failure recovery protocols that dominate real-world operation.

Single-Leader Replication: Strong Consistency at the Cost of Write Bottleneck

Invariant: All clients observe a total order of writes. Reads reflect the latest committed write.

This guarantee is enforced by routing all writes through a single leader. The leader serializes writes using a durable log, ensuring global consistency. Followers apply the log in the same order, converging to the same state. This model provides strong consistency but introduces a write bottleneck: throughput is limited by the capacity of one node.

Failure is the default state. If the leader fails, the system cannot accept writes until a new leader is elected. Automatic failover is required, but it risks split-brain if not coordinated with quorum enforcement.

from typing import List, Dict, Optional
from dataclasses import dataclass
from enum import Enum
import time

class ReplicationMode(Enum):
    SYNC = "sync"
    ASYNC = "async"

@dataclass
class WALRecord:
    key: str
    value: str
    term: int  # Leader's epoch
    index: int  # Log position
    timestamp: float = time.time()


class FollowerNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.last_applied_index: int = -1
        self.log: Dict[int, WALRecord] = {}

    def replicate(self, record: WALRecord) -> bool:
        """Append to local log and return success."""
        self.log[record.index] = record
        return True

    def apply_up_to(self, index: int) -> None:
        """Apply log entries in order up to index."""
        while self.last_applied_index + 1 <= index:
            next_index = self.last_applied_index + 1
            if next_index in self.log:
                record = self.log[next_index]
                # Apply to state machine (e.g., key-value store)
                self.apply_record(record)
                self.last_applied_index = next_index

    def apply_record(self, record: WALRecord) -> None:
        # Simulate state update
        pass


class LeaderNode:
    def __init__(self):
        self.current_term: int = 0
        self.log: List[WALRecord] = []
        self.commit_index: int = -1
        self.wal_offset: int = 0

    def appendToWAL(self, key: str, value: str) -> WALRecord:
        record = WALRecord(
            key=key,
            value=value,
            term=self.current_term,
            index=len(self.log)
        )
        self.log.append(record)
        return record

    def apply(self, record: WALRecord) -> None:
        # Apply to local state machine
        pass


class ReplicatedDatabase:
    def __init__(self, mode: ReplicationMode = ReplicationMode.SYNC):
        self.leader = LeaderNode()
        self.followers: List[FollowerNode] = [FollowerNode(f"f{i}") for i in range(3)]
        self.mode = mode

    def write(self, key: str, value: str) -> bool:
        # 1. Write to leader's WAL
        record = self.leader.appendToWAL(key, value)

        # 2. Replicate based on mode
        if self.mode == ReplicationMode.SYNC:
            # Wait for all followers to acknowledge
            for follower in self.followers:
                if not follower.replicate(record):
                    return False  # Fail fast on any failure

        else:  # ASYNC
            for follower in self.followers:
                # Fire and forget
                follower.replicate(record)

        # 3. Apply to leader's state
        self.leader.apply(record)
        self.leader.commit_index = record.index

        # 4. Apply to followers asynchronously in background
        if self.mode == ReplicationMode.ASYNC:
            self._replicate_async(record)

        return True

    def _replicate_async(self, record: WALRecord) -> None:
        """Background task to push to followers."""
        for follower in self.followers:
            try:
                follower.replicate(record)
                # Background apply
                follower.apply_up_to(record.index)
            except Exception:
                # Retry logic omitted
                pass

    def read(self, key: str, consistent: bool = False) -> Optional[str]:
        """If consistent, read from leader. Otherwise, read from any follower."""
        if consistent:
            # Simulate read from leader's state
            return self._read_from_leader(key)
        else:
            # Read from any follower (eventual consistency)
            return self.followers[0].log.get(self.followers[0].last_applied_index, None)  # Simplified

    def _read_from_leader(self, key: str) -> Optional[str]:
        # Simulate key-value lookup
        for record in reversed(self.leader.log):
            if record.key == key:
                return record.value
        return None

Split-brain occurs when two nodes believe they are the leader. Recovery requires fencing: the old leader must be isolated before the new one takes over. This is enforced via distributed locks (e.g., using Raft or Paxos) or lease-based coordination. Without fencing, concurrent writes violate the total order invariant and cause irreversible divergence.

Multi-Leader Replication: Write Availability at the Cost of Conflict Complexity

Invariant: Writes succeed on any leader during network partition. Consistency is eventual and application-defined.

This model allows multiple leaders to accept writes independently. It maximizes write availability but sacrifices strong consistency. Concurrent writes to the same key on different leaders create conflicting versions. The system cannot resolve these automatically without policy.

Conflict resolution is not optional—it is the core design constraint. The choice of strategy determines correctness.

from typing import List, Dict, Tuple
from dataclasses import dataclass
import time

@dataclass
class VersionedValue:
    value: str
    timestamp: float = time.time()
    leader_id: str = ""
    version_vector: Dict[str, int] = None  # Per-leader counter

    def __post_init__(self):
        if self.version_vector is None:
            self.version_vector = {self.leader_id: 1}

    def increment_version(self) -> None:
        self.version_vector[self.leader_id] = self.version_vector.get(self.leader_id, 0) + 1
        self.timestamp = time.time()


def resolve_conflict_lww(conflicting: List[VersionedValue]) -> VersionedValue:
    """Last-write-wins by timestamp. Loses data on clock skew."""
    return max(conflicting, key=lambda v: v.timestamp)


def resolve_conflict_version_vector(conflicting: List[VersionedValue]) -> List[VersionedValue]:
    """Detect concurrent writes. Return all conflicting versions for application resolution."""
    # Simplified: compare version vectors for causality
    # If A > B: A overwrites B
    # If A || B: concurrent, return both
    survivors = []
    for v in conflicting:
        is_overwritten = False
        for other in conflicting:
            if other != v and _causally_dominates(other.version_vector, v.version_vector):
                is_overwritten = True
                break
        if not is_overwritten:
            survivors.append(v)
    return survivors


def _causally_dominates(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """True if a >= b for all keys and a > b for at least one."""
    if set(a.keys()) < set(b.keys()):
        return False
    greater_or_equal = all(a.get(k, 0) >= b[k] for k in b)
    strictly_greater = any(a.get(k, 0) > b[k] for k in b)
    return greater_or_equal and strictly_greater


def merge_shopping_carts(cart_a: Dict[str, int], cart_b: Dict[str, int]) -> Dict[str, int]:
    """CRDT-style merge: sum quantities per item."""
    result = cart_a.copy()
    for item, qty in cart_b.items():
        result[item] = result.get(item, 0) + qty
    return result

Split-brain is not a failure mode—it is a normal operating condition in multi-leader systems. Leaders operate independently during partitions. Recovery requires merging divergent histories using version vectors or application-specific logic. Read repair detects inconsistencies during reads and triggers reconciliation.

Replication Strategy Tradeoffs

Replication TypeConsistency GuaranteeWrite AvailabilityRead ScalabilityFailure RecoveryConflict Resolution
Single-Leader (Sync)Strong (total order)Low (single writer)High (read from followers)Requires leader election + fencingNone (single source of truth)
Single-Leader (Async)Eventual (stale reads possible)ModerateHighRisk of data loss; requires WAL replayNone
Multi-LeaderEventual (with conflicts)HighHighAutomatic during partition; merge on healRequired (LWW, version vectors, CRDTs)
Leaderless (Quorum)Configurable (R + W > N)HighHighRead repair, anti-entropyVersion vectors, reconciliation

Design Implications

  • Strong consistency requires a single serialization point. This is enforced by a leader, a lock service, or a consensus algorithm. There is no alternative.
  • Write availability requires relinquishing immediate consistency. Multi-leader and leaderless systems achieve this by allowing concurrent writes, but they shift conflict resolution to the application.
  • Replication lag is not a performance issue—it is a consistency boundary. Asynchronous replication decouples durability from latency but exposes stale reads. Read-after-write consistency requires routing recent reads to the leader or tracking replication watermarks.
  • Split-brain recovery is a mandatory design requirement. Systems must detect and resolve divergent states. Fencing (via leases or consensus) prevents dual leadership in single-leader systems. Multi-leader systems must merge concurrent writes using causal metadata.
  • The choice of replication topology is determined by the application’s tolerance for inconsistency. Financial ledgers require strong consistency. Collaborative editors tolerate eventual consistency with merge logic. Choose accordingly.