Skip to main content
data systems from the ground up

Backpressure, Flow Control, and When to Slow Down

5 min read Chapter 33 of 36

Backpressure, Flow Control, and When to Slow Down

The Black Box

The logistics platform’s notification service consumes package events from Kafka and sends push notifications to delivery recipients. During a warehouse batch scan, 50,000 events arrive in 30 seconds. The notification service’s push notification API can handle 100 notifications per second. At 1,667 events per second, the notification service falls behind immediately. Without backpressure, the service buffers events in memory, exhausts heap, and crashes.

The Mechanism

Backpressure is the propagation of “slow down” signals from a slow consumer to a fast producer. The signal can propagate through multiple layers:

  1. TCP level: The receiver’s TCP window fills. The sender pauses.
  2. Kafka level: The consumer polls less frequently. Messages accumulate in the partition log.
  3. Application level: An in-memory queue fills. The consumer rejects new work.

Without explicit backpressure design, the default behavior is unbounded buffering followed by an OOM crash.

Bounded In-Memory Queue

// Concept: bounded queue with rejection as backpressure signal

// BLACK BOX: unbounded queue (leads to OOM under load)
Queue<PackageEvent> eventQueue = new LinkedList<>();  // Grows without limit

// MECHANISM: bounded queue with rejection policy
BlockingQueue<PackageEvent> eventQueue = new ArrayBlockingQueue<>(1000);

// Producer (Kafka consumer thread):
void onEvent(PackageEvent event) {
    boolean accepted = eventQueue.offer(event);  // Non-blocking
    if (!accepted) {
        // Queue full. Backpressure signal.
        // Option 1: Block until space is available (slows Kafka consumption)
        // Option 2: Drop the event and increment a counter
        // Option 3: Write to a spillover file (bounded disk buffer)
        droppedEvents.increment();
        log.warn("Event queue full. Dropped event for {}", event.packageId());
    }
}

// Consumer (notification sender thread pool):
void processEvents() {
    while (true) {
        PackageEvent event = eventQueue.take();  // Blocks if queue is empty
        sendPushNotification(event);              // 10ms per notification
    }
}

The bounded queue caps memory usage at 1000 × sizeof(PackageEvent). When the queue is full, the producer must decide: block (propagating backpressure to Kafka), drop (losing events), or spill to disk (trading memory for disk).

Kafka Consumer Backpressure

The Kafka consumer’s backpressure mechanism is the poll() loop. If the consumer processes slowly, it polls less frequently. Messages accumulate in the Kafka partition log. The partition log is a disk-backed buffer with configurable retention.

// Concept: Kafka consumer with processing-rate-limited consumption

Properties props = new Properties();
props.put("max.poll.records", 50);             // Fetch 50 messages per poll
props.put("max.poll.interval.ms", 60000);      // 60 seconds between polls

// At 100 notifications/sec and 50 records/poll:
// Processing 50 records takes 500ms.
// The consumer polls every ~500ms.
// Effective consumption rate: 100 events/sec.
// If production rate exceeds 100 events/sec, lag grows.
// The Kafka partition log absorbs the excess.

// The lag is the backpressure signal.
// If lag exceeds a threshold (e.g., 10,000 messages),
// either scale out the consumer group or reduce the production rate.

The Observable Consequence

The logistics platform without backpressure during a 50,000-event burst:

TimeEvents producedEvents consumedIn-memory bufferStatus
T+0s000Normal
T+5s8,3355007,835Buffer growing
T+10s16,6701,00015,670Heap pressure
T+15s25,0051,50023,505GC pauses starting
T+20s33,3402,00031,340OOM crash imminent

With a bounded queue (capacity 1,000) and Kafka-level backpressure:

TimeEvents producedEvents consumedKafka lagDroppedStatus
T+0s0000Normal
T+5s8,3355007,8350Lag growing in Kafka
T+30s50,0003,00047,0000Burst complete, consumer catches up
T+8min50,00050,00000Fully caught up

Kafka absorbed the burst in its partition log. The consumer processed at its maximum rate (100/sec) and caught up in 8 minutes. No data was lost. No OOM crash.

The Code

Monitoring backpressure health:

# Concept: monitoring consumer lag as a backpressure indicator

# Consumer lag over time (Kafka)
kafka-consumer-groups.sh --describe --group notification-service \
    | awk 'NR>1 {sum += $6} END {print "Total lag: " sum}'

# Total lag: 47000

# If total lag is:
# < 100:   Healthy. Consumer keeps pace.
# 100-10000: Elevated. Consumer is slower than producer but within tolerance.
# > 10000:  Alert. Consumer needs scaling or optimization.

# Queue depth (application-level, exposed as Prometheus metric)
# notification_queue_size{service="notification"} 842
# notification_dropped_events_total{service="notification"} 0

The Decision Rule

Design backpressure before the first burst. A system that works under steady load but crashes under burst load is missing backpressure.

For Kafka consumers: the partition log is your buffer. Set max.poll.records to a value your consumer can process within max.poll.interval.ms. Monitor lag. Scale the consumer group when lag grows consistently.

For in-memory queues: always use bounded queues. Choose a queue size that absorbs normal variance (2-5 seconds of burst) without consuming excessive memory. When the queue is full, prefer dropping events with a metric over blocking the producer, unless the producer can tolerate blocking (the Kafka consumer can tolerate slower polling).

For cross-service communication: set deadlines (gRPC) or timeouts (HTTP) on every call. A slow downstream service that does not reject requests fast enough becomes a backpressure bottleneck. Reject early, signal the caller, and let the caller decide whether to retry or shed load. Chapter 12 covers the failure modes that arise when backpressure is absent or insufficient.