Backpressure, Flow Control, and When to Slow Down
Backpressure, Flow Control, and When to Slow Down
The Black Box
The logistics platform’s notification service consumes package events from Kafka and sends push notifications to delivery recipients. During a warehouse batch scan, 50,000 events arrive in 30 seconds. The notification service’s push notification API can handle 100 notifications per second. At 1,667 events per second, the notification service falls behind immediately. Without backpressure, the service buffers events in memory, exhausts heap, and crashes.
The Mechanism
Backpressure is the propagation of “slow down” signals from a slow consumer to a fast producer. The signal can propagate through multiple layers:
- TCP level: The receiver’s TCP window fills. The sender pauses.
- Kafka level: The consumer polls less frequently. Messages accumulate in the partition log.
- Application level: An in-memory queue fills. The consumer rejects new work.
Without explicit backpressure design, the default behavior is unbounded buffering followed by an OOM crash.
Bounded In-Memory Queue
// Concept: bounded queue with rejection as backpressure signal
// BLACK BOX: unbounded queue (leads to OOM under load)
Queue<PackageEvent> eventQueue = new LinkedList<>(); // Grows without limit
// MECHANISM: bounded queue with rejection policy
BlockingQueue<PackageEvent> eventQueue = new ArrayBlockingQueue<>(1000);
// Producer (Kafka consumer thread):
void onEvent(PackageEvent event) {
boolean accepted = eventQueue.offer(event); // Non-blocking
if (!accepted) {
// Queue full. Backpressure signal.
// Option 1: Block until space is available (slows Kafka consumption)
// Option 2: Drop the event and increment a counter
// Option 3: Write to a spillover file (bounded disk buffer)
droppedEvents.increment();
log.warn("Event queue full. Dropped event for {}", event.packageId());
}
}
// Consumer (notification sender thread pool):
void processEvents() {
while (true) {
PackageEvent event = eventQueue.take(); // Blocks if queue is empty
sendPushNotification(event); // 10ms per notification
}
}
The bounded queue caps memory usage at 1000 × sizeof(PackageEvent). When the queue is full, the producer must decide: block (propagating backpressure to Kafka), drop (losing events), or spill to disk (trading memory for disk).
Kafka Consumer Backpressure
The Kafka consumer’s backpressure mechanism is the poll() loop. If the consumer processes slowly, it polls less frequently. Messages accumulate in the Kafka partition log. The partition log is a disk-backed buffer with configurable retention.
// Concept: Kafka consumer with processing-rate-limited consumption
Properties props = new Properties();
props.put("max.poll.records", 50); // Fetch 50 messages per poll
props.put("max.poll.interval.ms", 60000); // 60 seconds between polls
// At 100 notifications/sec and 50 records/poll:
// Processing 50 records takes 500ms.
// The consumer polls every ~500ms.
// Effective consumption rate: 100 events/sec.
// If production rate exceeds 100 events/sec, lag grows.
// The Kafka partition log absorbs the excess.
// The lag is the backpressure signal.
// If lag exceeds a threshold (e.g., 10,000 messages),
// either scale out the consumer group or reduce the production rate.
The Observable Consequence
The logistics platform without backpressure during a 50,000-event burst:
| Time | Events produced | Events consumed | In-memory buffer | Status |
|---|---|---|---|---|
| T+0s | 0 | 0 | 0 | Normal |
| T+5s | 8,335 | 500 | 7,835 | Buffer growing |
| T+10s | 16,670 | 1,000 | 15,670 | Heap pressure |
| T+15s | 25,005 | 1,500 | 23,505 | GC pauses starting |
| T+20s | 33,340 | 2,000 | 31,340 | OOM crash imminent |
With a bounded queue (capacity 1,000) and Kafka-level backpressure:
| Time | Events produced | Events consumed | Kafka lag | Dropped | Status |
|---|---|---|---|---|---|
| T+0s | 0 | 0 | 0 | 0 | Normal |
| T+5s | 8,335 | 500 | 7,835 | 0 | Lag growing in Kafka |
| T+30s | 50,000 | 3,000 | 47,000 | 0 | Burst complete, consumer catches up |
| T+8min | 50,000 | 50,000 | 0 | 0 | Fully caught up |
Kafka absorbed the burst in its partition log. The consumer processed at its maximum rate (100/sec) and caught up in 8 minutes. No data was lost. No OOM crash.
The Code
Monitoring backpressure health:
# Concept: monitoring consumer lag as a backpressure indicator
# Consumer lag over time (Kafka)
kafka-consumer-groups.sh --describe --group notification-service \
| awk 'NR>1 {sum += $6} END {print "Total lag: " sum}'
# Total lag: 47000
# If total lag is:
# < 100: Healthy. Consumer keeps pace.
# 100-10000: Elevated. Consumer is slower than producer but within tolerance.
# > 10000: Alert. Consumer needs scaling or optimization.
# Queue depth (application-level, exposed as Prometheus metric)
# notification_queue_size{service="notification"} 842
# notification_dropped_events_total{service="notification"} 0
The Decision Rule
Design backpressure before the first burst. A system that works under steady load but crashes under burst load is missing backpressure.
For Kafka consumers: the partition log is your buffer. Set max.poll.records to a value your consumer can process within max.poll.interval.ms. Monitor lag. Scale the consumer group when lag grows consistently.
For in-memory queues: always use bounded queues. Choose a queue size that absorbs normal variance (2-5 seconds of burst) without consuming excessive memory. When the queue is full, prefer dropping events with a metric over blocking the producer, unless the producer can tolerate blocking (the Kafka consumer can tolerate slower polling).
For cross-service communication: set deadlines (gRPC) or timeouts (HTTP) on every call. A slow downstream service that does not reject requests fast enough becomes a backpressure bottleneck. Reject early, signal the caller, and let the caller decide whether to retry or shed load. Chapter 12 covers the failure modes that arise when backpressure is absent or insufficient.