data systems mechanics invariants in distributed architectures

Lambda and Kappa Architectures

Chapter 27 of 28
Summary

This section introduces Lambda and Kappa Architectures as patterns for unifying batch and stream processing. Lambda Architecture combines a batch layer (source of truth) with a speed layer (low-latency approximations), enforcing the invariant that batch overrides speed. Kappa Architecture uses a single stream processing engine for all data via stream replay, enforcing the invariant that all data flows through one layer. The comparison table highlights trade-offs: Lambda has high code duplication but clear separation; Kappa has low duplication but requires state management across replay. Key concepts include immutable logs as the system of record, materialized views as derived data, and idempotent/deterministic processing for fault tolerance. The section synthesizes previous concepts (batch vs. stream, immutable logs) into architectural patterns, emphasizing trade-offs between latency, throughput, and fault tolerance.

Lambda and Kappa Architectures

The design of resilient data systems is governed by unavoidable trade-offs. There is no neutral ground: every architectural decision enforces an invariant at a measurable cost. This chapter formalizes how systems achieve deterministic recovery, maintain consistency under failure, and evolve without corruption by treating data as an append-only sequence of facts. The log is the system of record; all views are derived. This outside-in perspective, in which the database is a materialization of the log, is foundational to both Lambda and Kappa architectures.

Lambda Architecture

Invariant: The batch layer is the source of truth. The speed layer provides low-latency approximations that are eventually superseded by batch results.

Lambda enforces accuracy by separating processing into two paths: a batch layer for correctness and a speed layer for latency. This duality ensures that even if the speed layer fails or produces stale results, the batch layer can deterministically reconstruct the ground truth. The cost is code duplication and operational complexity.

Batch Layer

Invariant: Batch processing is deterministic, idempotent, and fully recomputable from the raw log.

The batch layer consumes the entire event log and applies deterministic transformations to produce batch views. These views are written to the serving layer and serve as the authoritative state. Because processing is idempotent, recomputation yields identical results, enabling recovery from any failure by replaying the log.

from typing import Any, Dict, List

def compute_batch_view(events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Deterministic batch processing: the same log always yields the same view.
    Recomputation is idempotent because the view is rebuilt from scratch
    with key-based upserts.
    """
    state: Dict[str, Any] = {}
    # sorted() is stable, so events with equal timestamps keep their log order.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = event["key"]
        # Field-level merge: later events overwrite fields set by earlier ones.
        state[key] = {**state.get(key, {}), **event["value"]}
    return state

Speed Layer

Invariant: Real-time views are approximate and ephemeral; they may be temporarily incorrect, but they are derived only from events that exist in the log.

The speed layer processes events in real time, maintaining stateful operators to compute incremental updates. These real-time views are served alongside batch views but are not durable. Upon failure, the speed layer restarts from the log and rebuilds state. Approximations are acceptable because the batch layer will eventually correct them.
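The restart behavior described above can be sketched as follows. This is a minimal illustration, assuming a per-key event count as the incremental computation and an in-memory list standing in for the log; the names `SpeedLayer`, `apply`, and `rebuild` are hypothetical and not taken from any framework.

```python
from typing import Any, Dict, Iterable, List


class SpeedLayer:
    """Hypothetical speed layer holding an in-memory, non-durable view."""

    def __init__(self) -> None:
        self.view: Dict[str, int] = {}

    def apply(self, event: Dict[str, Any]) -> None:
        # Incremental update: touch only the key the event refers to.
        key = event["key"]
        self.view[key] = self.view.get(key, 0) + 1

    def rebuild(self, log: Iterable[Dict[str, Any]]) -> None:
        # After a crash the in-memory view is gone; rebuild it from the log.
        self.view = {}
        for event in log:
            self.apply(event)


log: List[Dict[str, Any]] = [
    {"key": "page:/home", "timestamp": 1},
    {"key": "page:/docs", "timestamp": 2},
    {"key": "page:/home", "timestamp": 3},
]

# Normal operation: events are applied as they arrive.
live = SpeedLayer()
for event in log:
    live.apply(event)

# Simulated restart: a fresh instance with empty state replays the log.
recovered = SpeedLayer()
recovered.rebuild(log)
```

In a real deployment the speed layer would typically replay only the portion of the log that the batch layer has not yet covered, which keeps the rebuild short.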

Serving Layer

Invariant: Query results are consistent with the batch view for all data older than the batch window.

The serving layer merges results from the batch and speed layers. For historical data, it returns the batch view. For recent data, it returns the real-time view. When the batch layer updates the view, it atomically overwrites the corresponding segment, ensuring that approximations are bounded in time and scope.
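One way to picture the merge is sketched below, assuming per-key counts, integer timestamps, and a batch "horizon" marking the newest timestamp the batch view covers. The class `ServingLayer` and its methods are illustrative names, not part of the original design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ServingLayer:
    """Hypothetical serving layer merging a batch view with recent events."""

    # (batch_view, horizon): counts for all events with timestamp <= horizon.
    batch: Tuple[Dict[str, int], int] = field(default_factory=lambda: ({}, 0))
    # Recent (timestamp, key) pairs reported by the speed layer.
    recent: List[Tuple[int, str]] = field(default_factory=list)

    def record_recent(self, timestamp: int, key: str) -> None:
        self.recent.append((timestamp, key))

    def publish_batch(self, view: Dict[str, int], horizon: int) -> None:
        # Swap view and horizon in one assignment so a query never sees
        # a new view paired with an old horizon.
        self.batch = (dict(view), horizon)
        # Real-time entries now covered by the batch view are superseded.
        self.recent = [(ts, k) for ts, k in self.recent if ts > horizon]

    def query(self, key: str) -> int:
        view, horizon = self.batch
        # Count only real-time events newer than the batch horizon.
        realtime = sum(1 for ts, k in self.recent if k == key and ts > horizon)
        return view.get(key, 0) + realtime


serving = ServingLayer()
for ts, key in [(1, "a"), (2, "a"), (3, "b")]:
    serving.record_recent(ts, key)

before = serving.query("a")              # answered from the real-time view only
serving.publish_batch({"a": 2}, horizon=2)
after = serving.query("a")               # answered from the batch view
```

The horizon is what bounds the approximation: anything at or below it is answered from batch results, and only the tail above it depends on the speed layer.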

Kappa Architecture

Invariant: All processing is unified under a single, replayable stream processor.

Kappa eliminates code duplication by routing all data—historical and real-time—through a single stream processing engine. Historical computation is achieved by replaying the log from the beginning. The cost is increased complexity in state management and longer recovery times.

Stream Processing Engine

Invariant: The stream processor can reset its state and replay the log to produce bit-identical results.

The engine consumes the event log and applies the same logic for both real-time and historical processing. State is checkpointed to enable recovery. Reprocessing requires the ability to reset state and replay events in deterministic order.

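from typing import Any, Dict, Iterable, Protocol

class Processor(Protocol):
    """A callable processor that can also produce a fresh initial state."""
    def reset_state(self) -> Dict[str, Any]: ...
    def __call__(self, state: Dict[str, Any], event: Dict[str, Any]) -> Dict[str, Any]: ...

def replay_stream(log: Iterable[Dict[str, Any]], processor_fn: Processor) -> Dict[str, Any]:
    """
    Replay the entire stream from the start of the log to reconstruct state.
    The processor must be deterministic and idempotent.
    """
    # Discard any previous state so the result depends only on the log.
    state = processor_fn.reset_state()
    for event in log:
        state = processor_fn(state, event)
    return state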

Output Views

Invariant: All materialized views are derived from a single code path.

The stream processor writes directly to the serving layer. Because the same logic processes both new and replayed events, there is no second implementation that can diverge from the first. Provided that logic is deterministic, a view rebuilt by replay matches the view computed live.
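As a small illustration of the single code path, the sketch below applies one pure update function both to events arriving live and to a replay of the whole log. The function name `update_view` and the account-balance events are assumptions made for the example.

```python
from functools import reduce
from typing import Any, Dict, List

View = Dict[str, int]


def update_view(view: View, event: Dict[str, Any]) -> View:
    """The single code path: a pure function from (view, event) to a new view."""
    updated = dict(view)
    updated[event["key"]] = updated.get(event["key"], 0) + event["amount"]
    return updated


log: List[Dict[str, Any]] = [
    {"key": "acct-1", "amount": 10},
    {"key": "acct-2", "amount": 5},
    {"key": "acct-1", "amount": -3},
]

# Live path: events arrive one at a time.
live_view: View = {}
for event in log:
    live_view = update_view(live_view, event)

# Replay path: fold the same function over the whole log from the start.
replayed_view: View = reduce(update_view, log, {})
```

Because `update_view` has no hidden inputs such as wall-clock time or randomness, the two views are equal; introducing such inputs would break the replay guarantee even with a single codebase.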

Comparison of Lambda and Kappa Architectures

| Architecture | Batch Layer | Speed Layer | Consistency Guarantee | Latency | Throughput | Fault Tolerance | Code Duplication |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lambda | Yes (deterministic recomputation) | Yes (approximate) | Eventual (batch overrides speed) | High (batch), Low (speed) | High (batch), Moderate (speed) | High (full recomputation) | High (dual implementations) |
| Kappa | No | Yes (single engine) | Eventual (via deterministic replay) | Low (stream) | High (stream) | High (checkpointing) | Low (single codebase) |

Trade-Offs

  • Code Duplication: Lambda requires two implementations of the same logic, increasing the risk of divergence. Kappa avoids this by design but demands that the stream processor support full replay.
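  • Reprocessing: Lambda recomputes the batch view from scratch; Kappa replays the stream. Full replay time depends mainly on the length of the retained log and the processor's throughput, while checkpoint frequency determines how much must be replayed after a failure.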
  • Latency: Lambda incurs batch latency for accurate results; Kappa delivers low-latency updates but may delay full consistency during replay.
  • State Management: Lambda recomputes batch state from the log; Kappa maintains state across runs, requiring robust checkpointing.
  • Failure Recovery: Lambda recovers by restarting batch jobs; Kappa recovers by restoring from checkpoints and replaying missed events.
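The Kappa recovery path in the last bullet can be sketched as follows: restore the most recent checkpoint, then replay only the events after its offset. This is a simplified model that assumes a list-backed log, integer offsets, and a counting processor; `process`, `run`, and `checkpoint_every` are hypothetical names.

```python
from typing import Dict, List, Tuple

State = Dict[str, int]
Checkpoint = Tuple[int, State]  # (next log offset to read, state snapshot)


def process(state: State, key: str) -> State:
    """Deterministic update: count occurrences of each key."""
    new_state = dict(state)
    new_state[key] = new_state.get(key, 0) + 1
    return new_state


def run(
    log: List[str], checkpoint: Checkpoint, checkpoint_every: int
) -> Tuple[State, List[Checkpoint]]:
    """Process the log from the checkpoint's offset, snapshotting periodically."""
    offset, state = checkpoint
    checkpoints: List[Checkpoint] = []
    for position in range(offset, len(log)):
        state = process(state, log[position])
        if (position + 1) % checkpoint_every == 0:
            # Snapshot a copy so later updates cannot alter the checkpoint.
            checkpoints.append((position + 1, dict(state)))
    return state, checkpoints


log = ["a", "b", "a", "c", "a"]

# Full replay from offset 0 with empty state.
full_state, checkpoints = run(log, (0, {}), checkpoint_every=2)

# Recovery: restore the latest checkpoint and replay only the events after it.
recovered_state, _ = run(log, checkpoints[-1], checkpoint_every=2)
```

Both paths end in the same state, which is the property that lets checkpointing shorten recovery without changing results. Lambda's batch layer sidesteps the snapshot bookkeeping by always starting from offset 0.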

Conclusion

Lambda and Kappa are two expressions of the same principle: data systems must be designed for failure, recovery, and consistency through immutability. Lambda enforces correctness by isolating batch truth from real-time approximation, accepting code duplication as the price. Kappa unifies processing to eliminate duplication but demands that the stream engine support deterministic replay and state reset. The invariant in both is the immutable log; they differ in how they manage the cost of recomputation. Choose Lambda when processing logic is stable, so that maintaining two implementations is cheap, and low-latency approximation is acceptable. Choose Kappa when code consistency is paramount and replay infrastructure is available. In both, the log is the system of record and everything else is a view.