surviving the spike

Externalizing State to Redis

Chapter 9 of 66

The Symptom

After removing session affinity (CH3-S1), requests are distributed evenly across pods. But the driver search endpoint now returns inconsistent results. One request shows 12 nearby drivers. The next request, 200ms later from the same rider, shows 4. The third shows 9. Each pod has a different view of driver locations because drivers send location updates to whatever pod the load balancer selects, and the ConcurrentHashMap is pod-local.

The Cause

The driver location cache is a ConcurrentHashMap<String, DriverLocation> in each pod. With 6 pods and round-robin load balancing, each pod receives approximately 16.7% (1/6) of driver location updates. Any single update is visible on only one pod, so a pod’s view of “nearby drivers” is built from a random 16.7% sample of the update stream: some drivers are missing and others appear at stale positions.
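
The 1/6 split can be checked with a few lines of arithmetic. This is a rough sketch that assumes strict round-robin and one update per driver, using the 500 drivers and 6 pods from the baseline test in this section; the class and method names are illustrative and not part of the service.

```java
import java.util.Arrays;

final class RoundRobinSplit {

    /** Counts how many of the given updates land on each pod under strict round-robin. */
    static int[] updatesPerPod(int updates, int pods) {
        int[] counts = new int[pods];
        for (int i = 0; i < updates; i++) {
            counts[i % pods]++; // update i goes to pod (i mod pods)
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = updatesPerPod(500, 6);
        System.out.println(Arrays.toString(counts)); // [84, 84, 83, 83, 83, 83]
    }
}
```

Each pod ends up with 83 or 84 of the 500 updates, which is where the 16.7% figure comes from.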

// BOTTLENECK: Pod-local driver location cache
@Service
public class DriverLocationService {

    // Each pod has its own copy, sees only 1/N of driver updates
    private final ConcurrentHashMap<String, DriverLocation> driverLocations
        = new ConcurrentHashMap<>();

    public void updateLocation(String driverId, double lat, double lng) {
        driverLocations.put(driverId, new DriverLocation(driverId, lat, lng,
            Instant.now()));
    }

    public List<DriverLocation> findNearby(double lat, double lng,
            double radiusKm) {
        // Naive distance calculation over pod-local data only
        return driverLocations.values().stream()
            .filter(d -> haversine(lat, lng, d.lat(), d.lng()) <= radiusKm)
            .sorted(Comparator.comparingDouble(d ->
                haversine(lat, lng, d.lat(), d.lng())))
            .limit(20)
            .toList();
    }
}
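
The listing above calls a haversine helper that is not shown. A minimal sketch of such a helper follows, assuming coordinates in degrees, a result in kilometers, and a mean Earth radius of 6371 km; the class name is hypothetical.

```java
final class Haversine {

    private static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius (assumption)

    /** Great-circle distance in kilometers between two (lat, lng) points given in degrees. */
    static double haversine(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        // Clamp to 1.0 so rounding error near antipodal points cannot produce NaN.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // One degree of longitude on the equator is roughly 111.19 km.
        System.out.println(haversine(0.0, 0.0, 0.0, 1.0));
    }
}
```

Note that findNearby evaluates this function twice per driver, once in the filter and once in the sort, which is part of why the comment calls the calculation naive.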

The Baseline

Locust test measuring driver search consistency across pods:

# load-tests/driver_consistency_locustfile.py
import logging

from locust import HttpUser, task, between

class DriverSearchConsistencyUser(HttpUser):
    wait_time = between(0.5, 1)

    @task
    def search_drivers(self):
        with self.client.get(
            "/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 5},
            name="/api/drivers/nearby",
            catch_response=True
        ) as response:
            if response.status_code == 200:
                drivers = response.json()
                # Log the driver count for offline analysis
                # (assumes the endpoint returns a JSON array of drivers)
                logging.info("drivers_returned=%d", len(drivers))
                response.success()
            else:
                response.failure(f"Unexpected status {response.status_code}")

Results with pod-local state (6 pods, 500 active drivers):

Driver search results distribution:
  Min drivers returned: 2
  Max drivers returned: 18
  Mean: 8.3
  Std deviation: 4.7
  Expected (all drivers visible): ~15

  Coefficient of variation: 56%  ← Results vary wildly between requests
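
The 56% figure is the standard deviation divided by the mean (4.7 / 8.3 ≈ 0.566). The sketch below shows how that ratio could be computed from raw driver counts. The sample values are hypothetical, not the measured data, and the class name is illustrative.

```java
final class ConsistencyStats {

    /** Coefficient of variation: population standard deviation divided by the mean. */
    static double coefficientOfVariation(double[] samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        double mean = sum / samples.length;

        double squaredDiffs = 0.0;
        for (double s : samples) {
            squaredDiffs += (s - mean) * (s - mean);
        }
        double stdDev = Math.sqrt(squaredDiffs / samples.length);

        return stdDev / mean;
    }

    public static void main(String[] args) {
        // Hypothetical driver counts from eight consecutive searches (not measured data).
        double[] driverCounts = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(coefficientOfVariation(driverCounts)); // 0.4
    }
}
```

A value of 0 would mean every search returned the same number of drivers.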

The Fix

Driver Location: Redis GeoSet

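Redis stores geospatial data in a sorted set (the "GeoSet" here) and supports radius queries natively. GEOADD inserts a member with a longitude and latitude, in that order, in O(log(N)) per member. GEOSEARCH (Redis 6.2+) returns members within a radius; its cost is driven mainly by how many members fall in the searched area.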

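// SCALED: Externalized driver locations in Redis GeoSet
import java.time.Duration;

import org.springframework.data.geo.Distance;
import org.springframework.data.geo.Point;
import org.springframework.data.redis.connection.RedisGeoCommands.GeoSearchCommandArgs;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.data.redis.domain.geo.GeoReference;
import org.springframework.data.redis.domain.geo.Metrics;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Service
public class DriverLocationService {

    private final ReactiveRedisTemplate<String, String> redisTemplate;
    private static final String GEO_KEY = "driver:locations";
    private static final Duration LOCATION_TTL = Duration.ofSeconds(30);

    public DriverLocationService(
            ReactiveRedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public Mono<Void> updateLocation(String driverId, double lat, double lng) {
        // Redis geo commands take longitude first: Point(x = lng, y = lat)
        return redisTemplate.opsForGeo()
            .add(GEO_KEY, new Point(lng, lat), driverId)
            .then(
                // Marker key with a 30s TTL that records recent activity.
                // Geo members have no TTL of their own, so this key expiring
                // does not remove the driver from driver:locations; stale
                // members need a separate cleanup or a filter on this key.
                redisTemplate.opsForValue()
                    .set("driver:active:" + driverId, "1", LOCATION_TTL)
            )
            .then();
    }

    public Flux<DriverLocation> findNearby(double lat, double lng,
            double radiusKm) {
        return redisTemplate.opsForGeo()
            .search(GEO_KEY,
                GeoReference.fromCoordinate(lng, lat),
                new Distance(radiusKm, Metrics.KILOMETERS),
                GeoSearchCommandArgs.newGeoSearchArgs()
                    .includeCoordinates()
                    .includeDistance()
                    .sortAscending()
                    .limit(20)
            )
            // Assumes this version of DriverLocation is (id, lat, lng, distanceKm)
            .map(result -> new DriverLocation(
                result.getContent().getName(),
                result.getContent().getPoint().getY(),   // latitude
                result.getContent().getPoint().getX(),   // longitude
                result.getDistance().getValue()
            ));
    }
}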

Redis command trace for a driver location update:

GEOADD driver:locations -74.0060 40.7128 "driver-5678"
SET driver:active:driver-5678 1 EX 30
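
The second command only creates a marker key. The member added by GEOADD has no TTL of its own, so the marker helps only if something consults it. One possible use is to drop search hits whose marker has expired. The sketch below illustrates that idea in plain Java, with an in-memory map standing in for the driver:active:* keys; the class and method names are hypothetical and this is not the service's actual code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ActiveDriverFilter {

    private static final Duration LOCATION_TTL = Duration.ofSeconds(30);

    // Stand-in for the driver:active:<id> keys: driver id -> time of last update.
    private final Map<String, Instant> lastSeen = new HashMap<>();

    /** Equivalent of SET driver:active:<id> 1 EX 30 at time {@code now}. */
    void recordUpdate(String driverId, Instant now) {
        lastSeen.put(driverId, now);
    }

    /** True while the marker would still exist, i.e. less than 30s since the last update. */
    boolean isActive(String driverId, Instant now) {
        Instant seen = lastSeen.get(driverId);
        return seen != null
                && Duration.between(seen, now).compareTo(LOCATION_TTL) < 0;
    }

    /** Drops search hits whose marker has expired, preserving the original order. */
    List<String> keepActive(List<String> searchHits, Instant now) {
        return searchHits.stream()
                .filter(id -> isActive(id, now))
                .collect(Collectors.toList());
    }
}
```

Against real Redis, the same check would cost one extra lookup per search hit, so a periodic cleanup that removes stale members from driver:locations may be the cheaper option.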

Redis command trace for a nearby driver search:

GEOSEARCH driver:locations FROMLONLAT -74.0060 40.7128 BYRADIUS 5 km
    ASC COUNT 20 WITHCOORD WITHDIST

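The GEOSEARCH command executes in O(N+log(M)), where N is the number of members in the grid-aligned bounding box around the search area and M is the number of members inside the radius. COUNT 20 trims the reply, but without the ANY option Redis still examines every match before sorting. With 5,000 active drivers and a 5km radius returning 20 results, this typically completes in under 1ms.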

Surge Pricing: Redis Hash

The surge pricing multiplier is computed every 30 seconds from supply (available drivers) and demand (pending ride requests) per zone.
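
Every pod runs the same 30-second schedule, so the service in the listing that follows guards the recalculation with a Redis key written with set-if-absent and a 25-second expiry. That key acts as a lease: the first pod to write it does the work, and the others skip the cycle. Here is an in-memory sketch of the lease semantics, with a hypothetical LeaseLock class standing in for the Redis key.

```java
import java.time.Duration;
import java.time.Instant;

final class LeaseLock {

    private final Duration ttl;
    private Instant expiresAt = Instant.MIN; // no lease held initially

    LeaseLock(Duration ttl) {
        this.ttl = ttl;
    }

    /** Emulates SET key value NX EX ttl: succeeds only if no unexpired lease exists. */
    synchronized boolean tryAcquire(Instant now) {
        if (now.isBefore(expiresAt)) {
            return false; // another caller still holds the lease
        }
        expiresAt = now.plus(ttl);
        return true;
    }
}
```

The lease is never released explicitly. It simply expires 5 seconds before the next scheduled run, so any pod can win the following cycle, even if the previous holder crashed.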

// SCALED: Surge multiplier stored in Redis Hash
@Service
public class SurgePricingService {

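    private final ReactiveRedisTemplate<String, String> redisTemplate;
    private static final String SURGE_KEY = "surge:multipliers";

    // The final field must be initialized; Spring injects the template here
    public SurgePricingService(
            ReactiveRedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }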

    public Mono<Double> getMultiplier(String zoneId) {
        return redisTemplate.opsForHash()
            .get(SURGE_KEY, zoneId)
            .map(value -> Double.parseDouble((String) value))
            .defaultIfEmpty(1.0);  // No surge data = base fare
    }

    @Scheduled(fixedRate = 30_000)
    public void recalculateSurge() {
        // Only one pod should recalculate; use Redis lock
        redisTemplate.opsForValue()
            .setIfAbsent("surge:lock", "1", Duration.ofSeconds(25))
            .filter(acquired -> acquired)
            .flatMap(acquired -> calculateAllZones())
            .subscribe();
    }

    private Mono<Void> calculateAllZones() {
        return Flux.fromIterable(ZONE_IDS)
            .flatMap(zoneId -> {
                Mono<Long> supply = redisTemplate.opsForGeo()
                    .search("driver:locations",
                        GeoReference.fromCoordinate(
                            ZONE_CENTERS.get(zoneId).lng(),
                            ZONE_CENTERS.get(zoneId).lat()),
                        new Distance(3, Metrics.KILOMETERS))
                    .count();

                Mono<Long> demand = redisTemplate.opsForValue()
                    .get("demand:zone:" + zoneId)
                    .map(Long::parseLong)
                    .defaultIfEmpty(0L);

                return Mono.zip(supply, demand)
                    .map(tuple -> {
                        long drivers = tuple.getT1();
                        long riders = tuple.getT2();
                        if (drivers == 0) return 3.0;  // Max surge
                        double ratio = (double) riders / drivers;
                        return Math.min(3.0, Math.max(1.0,
                            1.0 + (ratio - 1.0) * 0.5));
                    })
                    .flatMap(multiplier ->
                        redisTemplate.opsForHash()
                            .put(SURGE_KEY, zoneId, String.valueOf(multiplier)));
            })
            .then();
    }
}
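
The multiplier mapping inside calculateAllZones is easier to reason about in isolation. The sketch below extracts the same expression into a pure function and runs a few worked cases. The class name and the sample supply and demand numbers are illustrative.

```java
final class SurgeFormula {

    /** Same mapping as in calculateAllZones(): clamp 1.0 + (ratio - 1.0) * 0.5 to [1.0, 3.0]. */
    static double multiplier(long drivers, long riders) {
        if (drivers == 0) {
            return 3.0; // no supply: maximum surge
        }
        double ratio = (double) riders / drivers;
        return Math.min(3.0, Math.max(1.0, 1.0 + (ratio - 1.0) * 0.5));
    }

    public static void main(String[] args) {
        System.out.println(multiplier(10, 5));  // 1.0 (demand below supply, floor applies)
        System.out.println(multiplier(10, 10)); // 1.0
        System.out.println(multiplier(10, 30)); // 2.0
        System.out.println(multiplier(10, 60)); // 3.0 (capped)
        System.out.println(multiplier(0, 4));   // 3.0
    }
}
```

The cap is reached at a demand-to-supply ratio of 5. A zone with no drivers gets the maximum multiplier even when it has no pending requests.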

HTTP Session: Spring Session with Redis

// SCALED: Spring Session externalized to Redis
@Configuration
@EnableRedisWebSession(maxInactiveIntervalInSeconds = 1800)
public class SessionConfig {

    @Bean
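    public ReactiveRedisConnectionFactory redisConnectionFactory() {
        // Connect through Sentinel (not to port 26379 as a standalone server)
        // so the client discovers the current master and follows failover
        RedisSentinelConfiguration config = new RedisSentinelConfiguration()
            .master("mymaster")
            .sentinel("redis-sentinel-0", 26379)
            .sentinel("redis-sentinel-1", 26379)
            .sentinel("redis-sentinel-2", 26379);
        return new LettuceConnectionFactory(config);
    }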

    @Bean
    public RedisSerializer<Object> springSessionDefaultRedisSerializer() {
        // JSON serialization for debuggability
        return new GenericJackson2JsonRedisSerializer();
    }
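}

# application.yml
spring:
  session:
    # spring.session.store-type was removed in Spring Boot 3; the spring.data.redis.*
    # prefix used below is the Boot 3 form, so the property is omitted here
    redis:
      namespace: ride-hailing:sessions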
  data:
    redis:
      sentinel:
        master: mymaster
        nodes: redis-sentinel-0:26379,redis-sentinel-1:26379,redis-sentinel-2:26379

Kubernetes Manifest for Redis Sentinel

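# kubernetes/redis-sentinel.yaml
# This manifest starts redis-server only. Replication between the three pods
# and the sentinel processes that listen on 26379 need additional configuration.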
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.2-alpine
          ports:
            - containerPort: 6379
            - containerPort: 26379
          command: ["redis-server"]
          args:
            [
              "--appendonly",
              "yes",
              "--maxmemory",
              "1gb",
              "--maxmemory-policy",
              "volatile-lfu",
            ]
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1.5Gi"
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 5Gi

The Proof

Locust test at 300 users, 6 pods, comparing local state vs Redis-externalized state:

With pod-local state (ConcurrentHashMap):
  /api/drivers/nearby    p99: 2,100ms   RPS: 6.14   Fail: 0.0%
  /api/fares/estimate    p99: 4,200ms   RPS: 4.09   Fail: 0.2%
  Driver search variance: 56% (inconsistent results)

  Scaling: 2 pods → 8 pods = 3.1x throughput (sublinear)

With Redis-externalized state:
  /api/drivers/nearby    p99:   180ms   RPS: 28.4   Fail: 0.0%
  /api/fares/estimate    p99:   420ms   RPS: 18.2   Fail: 0.0%
  Driver search variance: 0% (consistent results)

  Scaling: 2 pods → 8 pods = 7.2x throughput (near-linear)

Delta (latency at 6 pods, scaling from 2 → 8 pods):
  /api/drivers/nearby  p99:  2,100ms → 180ms  (11.7x improvement)
  /api/fares/estimate  p99:  4,200ms → 420ms  (10x improvement)
  Throughput scaling:   3.1x → 7.2x            (near-linear scaling achieved)

Per-access latency increased (roughly 0.1μs for a local map lookup vs 1-3ms for a Redis round trip), but end-to-end p99 latency fell by 10x or more. The cost per access went up. The cost per correct result went down. That is the trade, and for the ride-hailing platform, it is the correct one.

Scaling is now near-linear. Adding pods increases throughput almost proportionally. The state problem is solved. Chapter 4 addresses the next bottleneck: connection pools and thread pools.