surviving the spike

Cache Failure Modes and How to Survive Them

14 min read Chapter 21 of 66

Every caching layer fails. The question is whether it fails in a way you anticipated and instrumented, or in a way that wakes up the on-call engineer at 3 AM with a revenue dashboard showing negative fares. This section covers the five cache failure modes that will hit your ride-hailing platform, the specific Redis configurations that cause each one, and the configurations that prevent them.

Failure Mode 1: Thundering Herd

The Symptom

Friday 18:52:30.000. The surge pricing cache for zone-47 expires. 847 concurrent requests hit the endpoint in the next 200ms. All 847 see a cache miss. All 847 call computationService.compute("zone-47"). PostgreSQL receives 847 identical GROUP BY queries in a single burst. Connection pool is exhausted. P95 latency spikes to 4,200ms across all endpoints. The error rate jumps from 0.1% to 12%.

This repeats every 30 seconds, synchronized to the TTL cycle.

The Cause

Fixed TTL with no stampede protection. When the key expires, Redis removes it, and every request that arrives before the first recompute finishes sees the miss and triggers its own recompute. At the baseline rate of 8,400 requests per minute (140/sec) with a 200ms recompute time, that is 140 × 0.2 = 28 concurrent recomputes per expiration cycle. During a burst like the Friday spike above, the same 200ms window holds 847.

# Redis config that CAUSES thundering herd
# (no configuration prevents it, the fix is in the application layer)
maxmemory-policy allkeys-lru
# Fixed 30s TTL in application code, no XFetch, no locking
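
The 28 in the cause description is simply arrival rate multiplied by recompute time. A minimal sketch of that arithmetic, using this section's own figures; the function name is made up for illustration:

```python
def concurrent_recomputes(requests_per_minute: float, recompute_ms: float) -> float:
    """Requests that arrive, miss, and recompute while the first recompute is still running."""
    requests_per_second = requests_per_minute / 60.0
    return requests_per_second * (recompute_ms / 1000.0)


baseline = concurrent_recomputes(8_400, 200)
print(round(baseline))  # baseline load from this section
```

The same formula explains why slower queries make the herd worse: doubling the recompute time doubles the number of duplicate recomputes at the same request rate.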

The Fix: XFetch + SETNX Distributed Lock

Two layers of protection. XFetch prevents most stampedes by recomputing before expiration. The SETNX lock prevents cold-start stampedes when a key does not exist at all.
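
To get a feel for how XFetch behaves, it helps to write the early-recompute rule as a probability. A request recomputes when the remaining TTL is at most -delta × beta × ln(U) for a uniform random U, which works out to a per-request probability of exp(-ttl / (delta × beta)). The sketch below assumes the 150ms average recompute time and beta of 1.0 used as defaults in this section's service; the function name is illustrative.

```python
import math


def early_recompute_probability(ttl_remaining_ms: float,
                                avg_compute_ms: float = 150.0,
                                beta: float = 1.0) -> float:
    """Chance that a single request triggers an early recompute."""
    return math.exp(-ttl_remaining_ms / (avg_compute_ms * beta))


for ttl_ms in (30_000, 1_000, 300, 150, 0):
    p = early_recompute_probability(ttl_ms)
    print(f"{ttl_ms:>6} ms left -> p = {p:.4f}")
```

With a full 30 seconds left the probability is effectively zero, and it climbs to roughly 37% per request when one average recompute time remains. At 140 requests per second, that is enough for some request to refresh the key shortly before it expires.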

// SCALED: XFetch with SETNX lock for complete stampede prevention
@Service
public class StampedeProtectedSurgeService {

    private final ReactiveRedisTemplate<String, String> redisTemplate;
    private final SurgeComputationService computationService;
    private final MeterRegistry meterRegistry;

    // Measured average recompute time in milliseconds
    private final AtomicDouble avgComputeTimeMs = new AtomicDouble(150.0);
    private static final Duration SURGE_TTL = Duration.ofSeconds(30);
    private static final double XFETCH_BETA = 1.0;

    public Mono<SurgeResponse> getSurge(String zoneId) {
        String key = "surge:" + zoneId;
        return redisTemplate.opsForHash().entries(key)
            .collectMap(e -> (String) e.getKey(), e -> (String) e.getValue())
            .zipWith(redisTemplate.getExpire(key).defaultIfEmpty(Duration.ZERO))
            .flatMap(tuple -> {
                Map<String, String> cached = tuple.getT1();
                Duration ttlRemaining = tuple.getT2();

                if (cached.isEmpty()) {
                    return computeWithLock(zoneId, key);
                }

                // XFetch: probabilistic early recomputation
                if (shouldRecomputeEarly(ttlRemaining)) {
                    meterRegistry.counter("surge.xfetch.triggered", "zone", zoneId).increment();
                    computeWithLock(zoneId, key).subscribe(); // fire-and-forget
                }

                return Mono.just(fromCache(zoneId, cached));
            });
    }

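    private boolean shouldRecomputeEarly(Duration ttlRemaining) {
        double delta = avgComputeTimeMs.get();
        // 1.0 - Math.random() lies in (0, 1], so the log is finite and <= 0
        // (Math.log(0.0) would be -Infinity). Negating it gives a
        // non-negative head start in milliseconds.
        double headStartMs = -delta * XFETCH_BETA * Math.log(1.0 - Math.random());
        // Recompute once the remaining TTL falls inside the head start.
        // As TTL remaining decreases, probability of recompute increases.
        return ttlRemaining.toMillis() <= headStartMs;
    }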

    private Mono<SurgeResponse> computeWithLock(String zoneId, String key) {
        String lockKey = "lock:surge:" + zoneId;
        return redisTemplate.opsForValue()
            .setIfAbsent(lockKey, "1", Duration.ofSeconds(5))
            .flatMap(acquired -> {
                if (Boolean.TRUE.equals(acquired)) {
                    long start = System.nanoTime();
                    return computationService.compute(zoneId)
                        .flatMap(surge -> {
                            long elapsed = (System.nanoTime() - start) / 1_000_000;
                            avgComputeTimeMs.set(
                                avgComputeTimeMs.get() * 0.9 + elapsed * 0.1);
                            return writeSurgeCache(key, surge);
                        })
                        .doFinally(sig -> redisTemplate.delete(lockKey).subscribe());
                }
                // Lock held by another request. Retry from cache after short delay.
                return Mono.delay(Duration.ofMillis(50))
                    .then(readFromCacheOrWait(key, zoneId, 3));
            });
    }

    private Mono<SurgeResponse> readFromCacheOrWait(
            String key, String zoneId, int retriesLeft) {
        if (retriesLeft <= 0) {
            // All retries exhausted, compute directly (lock may be stuck)
            return computationService.compute(zoneId);
        }
        return redisTemplate.opsForHash().entries(key)
            .collectMap(e -> (String) e.getKey(), e -> (String) e.getValue())
            .filter(map -> !map.isEmpty())
            .map(cached -> fromCache(zoneId, cached))
            .switchIfEmpty(Mono.delay(Duration.ofMillis(100))
                .then(readFromCacheOrWait(key, zoneId, retriesLeft - 1)));
    }

    private Mono<SurgeResponse> writeSurgeCache(String key, SurgeResponse surge) {
        Map<String, String> fields = Map.of(
            "multiplier", String.valueOf(surge.multiplier()),
            "computedAt", surge.computedAt().toString(),
            "demand", String.valueOf(surge.demand()),
            "supply", String.valueOf(surge.supply()),
            "invalidatedBy", "compute"
        );
        return redisTemplate.opsForHash().putAll(key, fields)
            .then(redisTemplate.expire(key, SURGE_TTL))
            .thenReturn(surge);
    }

    private SurgeResponse fromCache(String zoneId, Map<String, String> cached) {
        return new SurgeResponse(
            zoneId,
            Double.parseDouble(cached.get("multiplier")),
            Instant.parse(cached.get("computedAt")),
            Long.parseLong(cached.get("demand")),
            Long.parseLong(cached.get("supply")),
            cached.get("invalidatedBy")
        );
    }
}

The SETNX lock TTL of 5 seconds is deliberate. If the lock holder crashes, the lock auto-releases. The 50ms retry delay prevents busy-waiting. The exponential moving average for compute time keeps XFetch calibrated as system load changes.
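
One caveat with the unlock in doFinally: it deletes the lock key unconditionally. If a recompute ever runs longer than the 5-second lock TTL, the lock expires, a second request acquires it, and the first request's cleanup then deletes the second request's lock. A common remedy is to store a random token as the lock value and delete the key only if the token still matches. The sketch below shows the idea against a hypothetical in-memory stand-in (FakeRedis is not a real client); against a real Redis the compare-and-delete has to be one atomic step, typically a short Lua script.

```python
import uuid


class FakeRedis:
    """Hypothetical in-memory stand-in for illustration; not a real Redis client."""

    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        """Set only if the key is absent, like SET ... NX."""
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def expire_now(self, key):
        """Simulate the lock TTL elapsing."""
        self.store.pop(key, None)

    def delete_if_equals(self, key, expected):
        """Compare-and-delete; must be atomic when done against real Redis."""
        if self.store.get(key) == expected:
            del self.store[key]
            return True
        return False


redis = FakeRedis()
lock_key = "lock:surge:zone-47"

token_a = str(uuid.uuid4())
assert redis.set_nx(lock_key, token_a)   # request A takes the lock
redis.expire_now(lock_key)               # A's recompute outlives the lock TTL
token_b = str(uuid.uuid4())
assert redis.set_nx(lock_key, token_b)   # request B takes the lock

released_by_a = redis.delete_if_equals(lock_key, token_a)  # A's late cleanup
released_by_b = redis.delete_if_equals(lock_key, token_b)  # B's own cleanup
print(released_by_a, released_by_b)
```

Request A's late cleanup leaves B's lock alone, and only B's own token releases it.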

Failure Mode 2: Stale Reads

The Symptom

The rider app shows “15 drivers available” in zone-22 with a green availability badge. The rider requests a ride. The matching service searches zone-22 and finds 2 available drivers. Wait time: 11 minutes. The rider gives the app a 1-star rating.

The Cause

The Kafka consumer group for surge-cache-invalidator is rebalancing after a rolling deployment. Partitions 3, 7, and 12 are unassigned for 38 seconds. During this window, 94 driver-status-changed events queue in the topic. The TTL on the driver count cache expired 12 seconds ago, but the periodic recompute only runs every 30 seconds and last ran 18 seconds ago.

The Fix: Read-Repair with Freshness Check

Every cache read checks the computedAt timestamp. If the data is older than a configurable threshold, trigger a background recompute and serve the stale value. The rider sees slightly stale data for one request, but the cache is corrected for subsequent requests.

// SCALED: Read-repair for stale surge data
public Mono<SurgeResponse> getSurgeWithReadRepair(String zoneId) {
    String key = "surge:" + zoneId;
    Duration maxStaleness = Duration.ofSeconds(10);

    return readFromCache(key, zoneId)
        .flatMap(cached -> {
            Instant computedAt = cached.computedAt();
            Duration age = Duration.between(computedAt, Instant.now());

            if (age.compareTo(maxStaleness) > 0) {
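                // Tag by zone only: tagging with the raw age would create a
                // new time series for every distinct value.
                meterRegistry.counter("surge.read_repair.triggered",
                    "zone", zoneId).increment();
                log.debug("Read-repair for zone {}: cached value is {}s old",
                    zoneId, age.toSeconds());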
                // Background recompute, serve stale value now
                computeWithLock(zoneId, key).subscribe();
            }

            return Mono.just(cached);
        })
        .switchIfEmpty(computeWithLock(zoneId, key));
}

Redis configuration for supporting read-repair:

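# Redis config that PREVENTS stale reads from going undetected
# volatile-ttl evicts the keys closest to expiry first
# Only one maxmemory-policy applies per instance: if you adopt volatile-lfu
# (see Failure Mode 4), keep that policy and take only the notification setting below
maxmemory-policy volatile-ttl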

# Enable keyspace notifications for expired keys
notify-keyspace-events Ex

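Keyspace notifications let you detect when a key reaches TTL expiration, which means no invalidation event refreshed it first: the event-driven path failed. Subscribe to __keyevent@0__:expired and increment a metric. Redis emits the event when it actually deletes the key (on access or during its background expiry sweep), so the event can trail the nominal expiry time slightly. If expired-key events spike during deployments, your Kafka consumer rebalance time is too long.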
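
A minimal sketch of the metric side, assuming messages shaped like the dicts that redis-py's pub/sub delivers (type, channel, and data keys) and the key naming used in this section. The handler name and the grouping by key prefix are illustrative, and the live subscription is left out so the example runs without a server.

```python
from collections import Counter

EXPIRED_CHANNEL = b"__keyevent@0__:expired"


def record_expired(message: dict, counts: Counter) -> None:
    """Count expired-key events by key prefix (the part before the first ':')."""
    if message.get("type") not in ("message", "pmessage"):
        return  # subscribe/unsubscribe confirmations carry no key name
    if message.get("channel") != EXPIRED_CHANNEL:
        return
    key = message["data"].decode("utf-8")  # payload is the expired key's name
    prefix = key.split(":", 1)[0]
    counts[prefix] += 1


counts = Counter()
sample_messages = [  # made-up messages for illustration
    {"type": "subscribe", "channel": EXPIRED_CHANNEL, "data": 1},
    {"type": "message", "channel": EXPIRED_CHANNEL, "data": b"surge:zone-47"},
    {"type": "message", "channel": EXPIRED_CHANNEL, "data": b"surge:zone-22"},
    {"type": "message", "channel": EXPIRED_CHANNEL, "data": b"drivers:zone-22"},
]
for msg in sample_messages:
    record_expired(msg, counts)
print(dict(counts))
```

Grouping by prefix rather than by full key keeps the metric's cardinality bounded while still showing which cache family is falling back to TTL expiry.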

Failure Mode 3: Cache Poisoning

The Symptom

The surge pricing endpoint returns a multiplier of -0.42 for zone-31. The fare calculation shows “$-18.90” on the rider’s screen. 340 ride requests in zone-31 over the next 28 seconds all get the negative fare. Customer support receives 12 tickets before anyone notices.

The Cause

A bug in the fare calculation service computed a negative multiplier when supply exceeded demand by more than 10x (integer underflow in the demand-supply ratio). The negative value was cached with a 30-second TTL. No validation existed on the cache write path.

The Fix: Validation Before Cache Write

Never cache a value without checking invariants. Surge multipliers must be between 1.0 and 5.0. Driver counts must be non-negative. Fare estimates must be positive.

// SCALED: Validated cache writes with poisoning detection
public Mono<SurgeResponse> writeSurgeCacheValidated(
        String key, SurgeResponse surge) {
    // Invariant checks before caching. The range test is negated so that NaN,
    // for which every comparison is false, is rejected as well.
    double multiplier = surge.multiplier();
    if (!(multiplier >= 1.0 && multiplier <= 5.0)) {
        // The offending value goes in the log, not in a metric tag, to keep
        // tag cardinality bounded.
        meterRegistry.counter("surge.cache.poisoning_prevented",
            "reason", "multiplier_out_of_range").increment();
        log.error("Attempted to cache invalid surge multiplier: {} for key: {}",
            multiplier, key);
        return Mono.error(new InvalidSurgeException(
            "Multiplier " + multiplier + " out of valid range [1.0, 5.0]"));
    }

    if (surge.demand() < 0 || surge.supply() < 0) {
        meterRegistry.counter("surge.cache.poisoning_prevented",
            "reason", "negative_demand_supply").increment();
        return Mono.error(new InvalidSurgeException(
            "Negative demand or supply values"));
    }

    Map<String, String> fields = Map.of(
        "multiplier", String.valueOf(surge.multiplier()),
        "computedAt", surge.computedAt().toString(),
        "demand", String.valueOf(surge.demand()),
        "supply", String.valueOf(surge.supply()),
        "invalidatedBy", "validated_write"
    );

    return redisTemplate.opsForHash().putAll(key, fields)
        .then(redisTemplate.expire(key, SURGE_TTL))
        .thenReturn(surge);
}

For defense in depth, add a read-side validation too. If a poisoned value somehow gets into the cache (race condition, manual redis-cli write, Lua script bug), catch it before serving to users.

// SCALED: Read-side validation as a second defense layer
private Mono<SurgeResponse> readFromCacheValidated(String key, String zoneId) {
    return readFromCache(key, zoneId)
        .flatMap(cached -> {
            double multiplier = cached.multiplier();
            // Negated range test so that NaN is treated as poisoned too
            if (!(multiplier >= 1.0 && multiplier <= 5.0)) {
                meterRegistry.counter("surge.cache.poisoned_read_detected",
                    "zone", zoneId).increment();
                // Delete the poisoned entry and recompute
                return redisTemplate.delete(key)
                    .then(computeWithLock(zoneId, key));
            }
            return Mono.just(cached);
        });
}
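
A subtlety in range checks like these: a check written as two rejections (value below 1.0 or value above 5.0) lets NaN through, because every ordered comparison with NaN is false. A minimal Python sketch of the difference, with illustrative function names; floating-point comparison behaves the same way in Java:

```python
import math


def naive_is_invalid(multiplier: float) -> bool:
    """Rejects values known to be out of range; NaN is neither, so it passes."""
    return multiplier < 1.0 or multiplier > 5.0


def strict_is_invalid(multiplier: float) -> bool:
    """Accepts only values positively known to be inside [1.0, 5.0]."""
    return not (1.0 <= multiplier <= 5.0)


for value in (-0.42, 1.0, 5.0, 7.5, math.nan):
    print(value, naive_is_invalid(value), strict_is_invalid(value))
```

The two functions agree on every ordinary number and disagree only on NaN, which is exactly the kind of value a division bug upstream can produce.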

Failure Mode 4: Memory Pressure

The Symptom

Redis memory usage hits the maxmemory limit of 2GB. Redis starts evicting keys. The surge pricing cache for the 8 highest-traffic zones gets evicted. Those zones now hit PostgreSQL on every request. The database connection pool saturates within 15 seconds. P95 latency across all endpoints jumps from 12ms to 3,400ms.

The Cause

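The allkeys-lru eviction policy makes every key an eviction candidate and ranks candidates only by how recently they were touched, never by how often. Debug trace keys (debug:request-trace:*) that a developer switched on two weeks ago have grown to 400MB; each is written once, never read again, and has no TTL. Meanwhile, the surge pricing keys are small (200 bytes each) but are accessed thousands of times per minute. Because a write counts as a use, the newest trace keys look “recently used”, and Redis’s sampled LRU can evict a small surge key that sat idle for a moment instead of a large trace key written seconds ago.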

Redis Configs: Causing vs Preventing

# CONFIG THAT CAUSES MEMORY PRESSURE FAILURES
maxmemory 2gb
maxmemory-policy allkeys-lru
# Problem: all keys are eviction candidates regardless of importance
# LRU ranks keys by last touch (reads and writes), not by access frequency

# CONFIG THAT PREVENTS MEMORY PRESSURE FAILURES
maxmemory 2gb
maxmemory-policy volatile-lfu
maxmemory-samples 10
# Only keys with an explicit TTL are eviction candidates
# LFU evicts least FREQUENTLY used keys, not least recently used
# Surge pricing keys (accessed 140/sec) survive
# Debug keys without TTL are never eviction candidates, so set TTL on everything

# Require explicit TTL on all keys via application convention
# TTL by key family: surge (30s), drivers (60s), trips (300s), profiles (3600s)
# Debug/temp keys: always set TTL of 600s max

The critical distinction: volatile-lfu only considers keys with a TTL for eviction, and among those it prefers keys that are accessed infrequently. Your surge pricing keys, accessed 140 times per second, will be the last keys evicted. That cold analytics cache from last Tuesday’s experiment? First to go, provided it was written with a TTL.
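
A toy model can make the difference concrete. The sketch below picks an eviction victim exactly, whereas real Redis approximates both policies by sampling a few keys, and every key and number in it is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Key:
    name: str
    ttl_s: Optional[int]   # None means the key has no TTL
    idle_s: float          # seconds since last touch (read or write)
    hits_per_min: float    # recent access frequency


def pick_victim(keys: List[Key], policy: str) -> Optional[Key]:
    """Exact (unsampled) victim choice for two policies."""
    if policy == "allkeys-lru":
        return max(keys, key=lambda k: k.idle_s)
    if policy == "volatile-lfu":
        candidates = [k for k in keys if k.ttl_s is not None]
        if not candidates:
            return None  # nothing is evictable
        return min(candidates, key=lambda k: k.hits_per_min)
    raise ValueError(f"unsupported policy: {policy}")


keys = [
    Key("surge:zone-47", ttl_s=30, idle_s=0.01, hits_per_min=8400),
    Key("drivers:zone-9", ttl_s=60, idle_s=20, hits_per_min=300),
    Key("analytics:experiment", ttl_s=3600, idle_s=2, hits_per_min=1),
    Key("debug:request-trace:1", ttl_s=None, idle_s=5, hits_per_min=0),
]

print(pick_victim(keys, "allkeys-lru").name)
print(pick_victim(keys, "volatile-lfu").name)
print(pick_victim([keys[3]], "volatile-lfu"))
```

Recency alone evicts the busy but momentarily idle driver key, while frequency among TTL-bearing keys evicts the cold analytics key. The last call shows the flip side: a key without a TTL is never a candidate under volatile-lfu.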

This requires discipline. Every key your application writes must have an explicit TTL. If a key has no TTL under volatile-lfu, it can never be evicted, and it will eventually consume all available memory. Enforce this with a wrapper around your Redis template:

// SCALED: TTL-enforcing Redis wrapper
@Component
public class SafeRedisTemplate {

    private final ReactiveRedisTemplate<String, String> delegate;
    private static final Duration MAX_TTL = Duration.ofHours(24);

    public Mono<Boolean> setWithRequiredTtl(
            String key, String value, Duration ttl) {
        if (ttl == null || ttl.isZero() || ttl.isNegative()) {
            throw new IllegalArgumentException(
                "TTL is required for all Redis keys. Key: " + key);
        }
        if (ttl.compareTo(MAX_TTL) > 0) {
            throw new IllegalArgumentException(
                "TTL exceeds maximum of 24 hours. Key: " + key + ", TTL: " + ttl);
        }
        return delegate.opsForValue().set(key, value, ttl);
    }
}

Failure Mode 5: Eviction Policy Mismatch

The Symptom

The cache hit rate for driver location data drops from 94% to 57% after a Redis restart. The INFO stats output shows evicted_keys climbing at 200 keys per second. The hot driver location cache is being evicted while cold session data from expired user sessions survives.

The Cause

Redis is configured with allkeys-random. This policy evicts keys at random, with no consideration for access frequency or recency. The driver location cache has 50,000 entries, each 300 bytes. Session data has 12,000 entries, each 2KB. Random eviction hits the larger population (driver locations) more often purely by probability.
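
The "purely by probability" claim is easy to check. This sketch uses the entry counts and sizes quoted above and assumes every key is equally likely to be picked, which is what random eviction means; the variable names are illustrative.

```python
driver_keys, driver_bytes = 50_000, 300
session_keys, session_bytes = 12_000, 2_048  # 2KB per session entry

total_keys = driver_keys + session_keys
driver_share = driver_keys / total_keys      # chance an eviction hits a driver key
session_share = session_keys / total_keys

# Average memory reclaimed per random eviction
bytes_per_eviction = driver_share * driver_bytes + session_share * session_bytes

print(f"driver locations: {driver_share:.1%} of evictions")
print(f"sessions:         {session_share:.1%} of evictions")
print(f"average bytes freed per eviction: {bytes_per_eviction:.0f}")
```

About four out of five evictions land on a driver location entry, and each of those frees only 300 bytes, so Redis has to evict many hot keys to reclaim the space that a few cold sessions occupy.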

The Fix: Benchmarking Eviction Policies

Run a controlled benchmark against your actual workload to measure hit rate under each policy.

# Benchmark setup: same dataset, same workload, different policies
# Dataset: 50,000 driver locations + 12,000 sessions + 500 surge entries
# Workload: 80% driver reads, 15% session reads, 5% surge reads
# Memory limit: 512MB in this run; pick a limit near 60% of your loaded dataset's used_memory so eviction is actually forced

# Test 1: allkeys-random
CONFIG SET maxmemory-policy allkeys-random
# Result: 57% hit rate, surge evicted 41% of the time

# Test 2: allkeys-lru
CONFIG SET maxmemory-policy allkeys-lru
# Result: 78% hit rate, surge evicted 12% of the time

# Test 3: allkeys-lfu
CONFIG SET maxmemory-policy allkeys-lfu
# Result: 89% hit rate, surge evicted 2% of the time

# Test 4: volatile-lfu (all keys have TTL)
CONFIG SET maxmemory-policy volatile-lfu
# Result: 91% hit rate, surge evicted 0.3% of the time

The volatile-lfu policy wins because it combines two advantages: only TTL-bearing keys are candidates (protecting critical keys if you need permanent ones), and frequency-based eviction keeps hot keys alive. The 91% hit rate vs 57% with random eviction translates directly to database load. At 8,400 requests per minute, the difference between 91% and 57% hit rates is 2,856 additional database queries per minute.
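
The 2,856 figure follows directly from the miss rates. A minimal sketch of the arithmetic, with an illustrative function name and hit rates given as whole percentages so the integer math stays exact:

```python
def db_queries_per_minute(requests_per_minute: int, hit_rate_pct: int) -> float:
    """Every cache miss becomes one database query."""
    return requests_per_minute * (100 - hit_rate_pct) / 100


random_policy = db_queries_per_minute(8_400, 57)
lfu_policy = db_queries_per_minute(8_400, 91)
extra = random_policy - lfu_policy

print(random_policy, lfu_policy, extra)
```

Random eviction sends nearly five times as many queries to the database as volatile-lfu at the same request rate.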

# PRODUCTION CONFIG: volatile-lfu with monitoring
maxmemory 2gb
maxmemory-policy volatile-lfu
maxmemory-samples 10

# LFU tuning
lfu-log-factor 10
lfu-decay-time 1

# lfu-log-factor 10: counter saturates at ~1M accesses (good for high-traffic keys)
# lfu-decay-time 1: counter is decremented by 1 for each minute since the key was last accessed
# This means a key accessed 100 times/sec will maintain a high counter
# A key accessed once and never again will decay to the bottom of the ranking within 10 minutes
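
To see how these two settings interact, here is a small simulation of the LFU counter. It follows the probabilistic increment and per-period decay as understood from the Redis source; the initial counter value of 5 and the function names should be treated as assumptions of this sketch, not a reference implementation.

```python
import random

LFU_INIT_VAL = 5  # assumed starting counter for a newly created key


def lfu_log_incr(counter: int, log_factor: int, rng: random.Random) -> int:
    """Probabilistic increment: the higher the counter, the rarer the bump."""
    if counter >= 255:
        return 255  # the counter is a single byte
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * log_factor + 1)
    return counter + 1 if rng.random() < p else counter


def lfu_decay(counter: int, idle_minutes: int, decay_time: int) -> int:
    """Subtract one per elapsed decay period, never going below zero."""
    periods = idle_minutes // decay_time if decay_time > 0 else 0
    return max(counter - periods, 0)


def counter_after(hits: int, log_factor: int = 10, seed: int = 42) -> int:
    """Counter value after a number of accesses, with a fixed seed for repeatability."""
    rng = random.Random(seed)
    counter = LFU_INIT_VAL
    for _ in range(hits):
        counter = lfu_log_incr(counter, log_factor, rng)
    return counter


hot = counter_after(100_000)   # a key hit many times
cold = counter_after(1)        # a key hit once
print(hot, cold, lfu_decay(cold, idle_minutes=10, decay_time=1))
```

The hot key's counter grows roughly with the square root of its hit count and stays far above the cold key's, while the cold key drops to zero after a few idle minutes and becomes the preferred eviction victim.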

Locust Test: Reproducing Thundering Herd

This test proves the thundering herd exists and then proves XFetch eliminates it. Run it against both the unprotected and protected surge endpoints.

# locustfile_thundering_herd.py
from locust import HttpUser, task, between, events, LoadTestShape
import random
import time
import logging

logger = logging.getLogger(__name__)

class ThunderingHerdUser(HttpUser):
    wait_time = between(0.01, 0.05)  # aggressive, 20-100 req/sec per user

    @task
    def get_surge_unprotected(self):
        """Hit the endpoint WITHOUT stampede protection"""
        zone = "zone-47"
        start = time.time()
        with self.client.get(
            f"/api/v1/surge/unprotected/{zone}",
            name="/surge/unprotected [herd test]",
            catch_response=True
        ) as resp:
            elapsed = time.time() - start
            if resp.status_code == 200:
                if elapsed > 1.0:
                    resp.failure(f"Stampede detected: {elapsed:.2f}s response time")
            elif resp.status_code == 503:
                resp.failure("Service unavailable (connection pool exhausted)")

    @task
    def get_surge_xfetch(self):
        """Hit the endpoint WITH XFetch + SETNX protection"""
        zone = "zone-47"
        start = time.time()
        with self.client.get(
            f"/api/v1/surge/protected/{zone}",
            name="/surge/protected [herd test]",
            catch_response=True
        ) as resp:
            elapsed = time.time() - start
            if resp.status_code == 200:
                if elapsed > 1.0:
                    resp.failure(f"Unexpected slow response: {elapsed:.2f}s")

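class HerdSpike(LoadTestShape):
    """
    Simulates a thundering herd: warm up at 50 users, ramp to 500 users
    over the next 10 seconds, hold for 2 minutes, then drop. The spike
    coincides with cache expiration cycles.
    """
    # Each "duration" is the cumulative run time (seconds) at which the stage ends
    stages = [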
        {"duration": 10, "users": 50, "spawn_rate": 10},    # warm up
        {"duration": 20, "users": 500, "spawn_rate": 200},   # spike
        {"duration": 140, "users": 500, "spawn_rate": 10},   # sustained load
        {"duration": 160, "users": 50, "spawn_rate": 50},    # cool down
    ]

    def tick(self):
        run_time = self.get_run_time()
        for stage in self.stages:
            if run_time < stage["duration"]:
                return (stage["users"], stage["spawn_rate"])
        return None

Run with: locust -f locustfile_thundering_herd.py --run-time 3m

Before and After: Thundering Herd

| Metric | Unprotected | XFetch + SETNX |
| --- | --- | --- |
| DB queries per TTL cycle | 28 | 1 |
| p50 latency (during expiry) | 890ms | 3ms |
| p95 latency (during expiry) | 4,200ms | 12ms |
| p99 latency (during expiry) | 8,100ms | 45ms |
| Connection pool exhaustion events | 4 per minute | 0 |
| Error rate (during spike) | 12% | 0.1% |
| Max concurrent DB queries | 847 | 1 |

The unprotected endpoint shows a sawtooth pattern in latency: low for 29 seconds, then a 1-second spike at every TTL boundary. The XFetch-protected endpoint shows flat latency because recomputation happens before the key expires, and the SETNX lock ensures only one request does the work.

Summary of Redis Configurations

| Failure Mode | Config That Causes It | Config That Prevents It |
| --- | --- | --- |
| Thundering herd | Fixed TTL, no app-level protection | XFetch + SETNX (application layer) |
| Stale reads | Event-only invalidation, no TTL fallback | Hybrid invalidation + read-repair |
| Cache poisoning | No validation on write path | Write-side + read-side invariant checks |
| Memory pressure | allkeys-lru | volatile-lfu with mandatory TTLs |
| Eviction mismatch | allkeys-random | volatile-lfu with tuned lfu-log-factor |

Monitor these Redis metrics in production:

  • evicted_keys: a cumulative counter, so watch its rate of increase, which should be near zero under normal load
  • keyspace_hits / (keyspace_hits + keyspace_misses): cache hit ratio, target 90%+
  • used_memory / maxmemory: memory utilization, alert at 85%
  • instantaneous_ops_per_sec: baseline your normal operations/sec, alert on 3x spikes
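
As a rough sketch of how the ratios in this list can be derived, the snippet below parses the key:value text that the Redis INFO command returns and applies the thresholds above. The sample text is made up, and the function names are illustrative.

```python
def parse_info(text: str) -> dict:
    """Parse 'name:value' lines from INFO output, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        fields[name] = value
    return fields


def hit_ratio(fields: dict) -> float:
    hits = int(fields["keyspace_hits"])
    misses = int(fields["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0


def memory_utilization(fields: dict) -> float:
    maxmemory = int(fields["maxmemory"])
    if maxmemory == 0:
        return 0.0  # maxmemory 0 means no limit is configured
    return int(fields["used_memory"]) / maxmemory


SAMPLE_INFO = """\
# Memory
used_memory:1932735283
maxmemory:2147483648
# Stats
keyspace_hits:9100
keyspace_misses:900
evicted_keys:12
"""

fields = parse_info(SAMPLE_INFO)
ratio = hit_ratio(fields)
utilization = memory_utilization(fields)
print(f"hit ratio {ratio:.2%}, memory {utilization:.2%}")
print("hit ratio alert:", ratio < 0.90, "| memory alert:", utilization >= 0.85)
```

In this made-up sample the hit ratio clears the 90% target, but memory sits at about 90% of maxmemory and would trip the 85% alert.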