Externalizing State to Redis
The Symptom
After removing session affinity (CH3-S1), requests are distributed evenly across pods. But the driver search endpoint now returns inconsistent results. One request shows 12 nearby drivers. The next request, 200ms later from the same rider, shows 4. The third shows 9. Each pod has a different view of driver locations because drivers send location updates to whatever pod the load balancer selects, and the ConcurrentHashMap is pod-local.
The Cause
The driver location cache is a ConcurrentHashMap<String, DriverLocation> in each pod. With 6 pods and round-robin load balancing, each pod receives approximately 16.7% of driver location updates. A pod’s view of “nearby drivers” is a random 16.7% sample of the actual nearby drivers.
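The 1/N split can be checked with a small stand-alone simulation. This is only a sketch under simplifying assumptions: strict round-robin routing and exactly one update per driver. The class and method names (`PodLocalViewSimulation`, `driversSeenPerPod`) are hypothetical and not part of the platform's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: models each pod's cache as an in-memory set.
final class PodLocalViewSimulation {

    /** Routes one update per driver round-robin and returns how many drivers each pod knows about. */
    static int[] driversSeenPerPod(int podCount, int driverCount) {
        List<Set<String>> podCaches = new ArrayList<>();
        for (int pod = 0; pod < podCount; pod++) {
            podCaches.add(new HashSet<>());
        }
        for (int driver = 0; driver < driverCount; driver++) {
            int targetPod = driver % podCount; // strict round-robin (assumption)
            podCaches.get(targetPod).add("driver-" + driver);
        }
        int[] seen = new int[podCount];
        for (int pod = 0; pod < podCount; pod++) {
            seen[pod] = podCaches.get(pod).size();
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] seen = driversSeenPerPod(6, 500);
        for (int pod = 0; pod < seen.length; pod++) {
            System.out.printf("pod-%d sees %d of 500 drivers (%.1f%%)%n",
                    pod, seen[pod], 100.0 * seen[pod] / 500);
        }
    }
}
```

In a running system drivers send repeated updates, so each pod gradually accumulates more drivers than this, but the extra entries are stale positions rather than a complete, current view.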
// BOTTLENECK: Pod-local driver location cache
@Service
public class DriverLocationService {

    // Each pod has its own copy, sees only 1/N of driver updates
    private final ConcurrentHashMap<String, DriverLocation> driverLocations
        = new ConcurrentHashMap<>();

    public void updateLocation(String driverId, double lat, double lng) {
        driverLocations.put(driverId, new DriverLocation(driverId, lat, lng,
            Instant.now()));
    }

    public List<DriverLocation> findNearby(double lat, double lng,
                                           double radiusKm) {
        // Naive distance calculation over pod-local data only
        return driverLocations.values().stream()
            .filter(d -> haversine(lat, lng, d.lat(), d.lng()) <= radiusKm)
            .sorted(Comparator.comparingDouble(d ->
                haversine(lat, lng, d.lat(), d.lng())))
            .limit(20)
            .toList();
    }
}
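The listing above calls `haversine(...)` without showing it. A minimal sketch of such a helper follows, assuming a spherical Earth and coordinates in degrees. The class name `GeoMath` and the radius constant are assumptions for illustration, not the platform's actual implementation.

```java
// Hypothetical helper; the chapter's service code does not show its own version.
final class GeoMath {

    // Mean Earth radius in kilometres (spherical approximation).
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in kilometres between two points given as degrees of latitude/longitude. */
    static double haversine(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLng / 2), 2);
        // Clamp to 1.0 so rounding error cannot push asin outside its domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // Distance from the search point used in the load test to a second point in Manhattan.
        System.out.printf("%.2f km%n", haversine(40.7128, -74.0060, 40.7580, -73.9855));
    }
}
```

Note that `findNearby` evaluates this function twice per driver (once to filter, once to sort), which is part of why the pod-local version is described as naive.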
The Baseline
Locust test measuring driver search consistency across pods:
# load-tests/driver_consistency_locustfile.py
from locust import HttpUser, task, between


class DriverSearchConsistencyUser(HttpUser):
    wait_time = between(0.5, 1)

    @task
    def search_drivers(self):
        with self.client.get(
            "/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 5},
            name="/api/drivers/nearby",
            catch_response=True
        ) as response:
            if response.status_code == 200:
                drivers = response.json()
                # Tag with driver count for analysis
                response.success()
Results with pod-local state (6 pods, 500 active drivers):
Driver search results distribution:
  Min drivers returned: 2
  Max drivers returned: 18
  Mean: 8.3
  Std deviation: 4.7
  Expected (all drivers visible): ~15
  Coefficient of variation: 56% ← Results vary wildly between requests
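The 56% figure above is the standard deviation divided by the mean (4.7 / 8.3, a little over 0.56). The sketch below shows how that ratio could be computed from raw per-request driver counts. It uses the population standard deviation, and the class name `SearchConsistencyStats` is hypothetical. The sample values 12, 4 and 9 are the three responses described in The Symptom, not the full load-test data set.

```java
// Hypothetical helper for analysing driver counts collected during a load test.
final class SearchConsistencyStats {

    static double mean(int[] counts) {
        double sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum / counts.length;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double stdDev(int[] counts) {
        double m = mean(counts);
        double squaredDiffs = 0;
        for (int c : counts) {
            squaredDiffs += (c - m) * (c - m);
        }
        return Math.sqrt(squaredDiffs / counts.length);
    }

    /** Standard deviation relative to the mean; 0 means every request returned the same count. */
    static double coefficientOfVariation(int[] counts) {
        return stdDev(counts) / mean(counts);
    }

    public static void main(String[] args) {
        int[] driversReturned = {12, 4, 9}; // three consecutive responses from The Symptom
        System.out.printf("CV = %.1f%%%n", 100 * coefficientOfVariation(driversReturned));
    }
}
```

A value of 0 is the target: every request from the same location should see the same set of drivers.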
The Fix
Driver Location: Redis GeoSet
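Redis stores geospatial data in a sorted set (the "GeoSet" used here) and supports radius queries natively. GEOADD inserts a member with its longitude and latitude. GEOSEARCH returns the members within a radius. GEOADD is O(log(N)) per member added; GEOSEARCH costs O(N+log(M)), which depends on how many members fall in the search area rather than only on the size of the set.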
// SCALED: Externalized driver locations in Redis GeoSet
@Service
public class DriverLocationService {

    private final ReactiveRedisTemplate<String, String> redisTemplate;
    private static final String GEO_KEY = "driver:locations";
    private static final Duration LOCATION_TTL = Duration.ofSeconds(30);

    public DriverLocationService(
            ReactiveRedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public Mono<Void> updateLocation(String driverId, double lat, double lng) {
        return redisTemplate.opsForGeo()
            .add(GEO_KEY, new Point(lng, lat), driverId)
            .then(
                // Per-driver liveness marker that expires after LOCATION_TTL.
                // Geo set members have no TTL of their own, so this key
                // expiring does not remove the driver from GEO_KEY; stale
                // members still need a separate ZREM or a read-time filter.
                redisTemplate.opsForValue()
                    .set("driver:active:" + driverId, "1", LOCATION_TTL)
            )
            .then();
    }

    public Flux<DriverLocation> findNearby(double lat, double lng,
                                           double radiusKm) {
        return redisTemplate.opsForGeo()
            .search(GEO_KEY,
                GeoReference.fromCoordinate(lng, lat),
                new Distance(radiusKm, Metrics.KILOMETERS),
                GeoSearchCommandArgs.newGeoSearchArgs()
                    .includeCoordinates()
                    .includeDistance()
                    .sortAscending()
                    .limit(20)
            )
            // Point is (x = longitude, y = latitude). The fourth field carries
            // the distance here; the pod-local version stored a timestamp.
            .map(result -> new DriverLocation(
                result.getContent().getName(),
                result.getContent().getPoint().getY(),
                result.getContent().getPoint().getX(),
                result.getDistance().getValue()
            ));
    }
}
Redis command trace for a driver location update:
GEOADD driver:locations -74.0060 40.7128 "driver-5678"
SET driver:active:driver-5678 1 EX 30
Redis command trace for a nearby driver search:
GEOSEARCH driver:locations FROMLONLAT -74.0060 40.7128 BYRADIUS 5 km
    ASC COUNT 20 WITHCOORD WITHDIST
The GEOSEARCH command executes in O(N+log(M)), where N is the number of members in the grid-aligned bounding box around the search area and M is the number of members inside the search shape. With 5,000 active drivers and a 5km radius returning 20 results, this typically completes in under 1ms.
Surge Pricing: Redis Hash
The surge pricing multiplier is computed every 30 seconds from supply (available drivers) and demand (pending ride requests) per zone.
// SCALED: Surge multiplier stored in Redis Hash
@Service
public class SurgePricingService {

    private final ReactiveRedisTemplate<String, String> redisTemplate;
    private static final String SURGE_KEY = "surge:multipliers";

    public SurgePricingService(
            ReactiveRedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public Mono<Double> getMultiplier(String zoneId) {
        return redisTemplate.opsForHash()
            .get(SURGE_KEY, zoneId)
            .map(value -> Double.parseDouble((String) value))
            .defaultIfEmpty(1.0); // No surge data = base fare
    }

    @Scheduled(fixedRate = 30_000)
    public void recalculateSurge() {
        // Only one pod should recalculate; use Redis lock.
        // The 25s TTL releases the lock before the next 30s tick.
        redisTemplate.opsForValue()
            .setIfAbsent("surge:lock", "1", Duration.ofSeconds(25))
            .filter(acquired -> acquired)
            .flatMap(acquired -> calculateAllZones())
            .subscribe();
    }

    private Mono<Void> calculateAllZones() {
        return Flux.fromIterable(ZONE_IDS)
            .flatMap(zoneId -> {
                // Supply: drivers within 3km of the zone center
                Mono<Long> supply = redisTemplate.opsForGeo()
                    .search("driver:locations",
                        GeoReference.fromCoordinate(
                            ZONE_CENTERS.get(zoneId).lng(),
                            ZONE_CENTERS.get(zoneId).lat()),
                        new Distance(3, Metrics.KILOMETERS))
                    .count();
                // Demand: pending ride requests counted per zone
                Mono<Long> demand = redisTemplate.opsForValue()
                    .get("demand:zone:" + zoneId)
                    .map(Long::parseLong)
                    .defaultIfEmpty(0L);
                return Mono.zip(supply, demand)
                    .map(tuple -> {
                        long drivers = tuple.getT1();
                        long riders = tuple.getT2();
                        if (drivers == 0) return 3.0; // Max surge
                        double ratio = (double) riders / drivers;
                        return Math.min(3.0, Math.max(1.0,
                            1.0 + (ratio - 1.0) * 0.5));
                    })
                    .flatMap(multiplier ->
                        redisTemplate.opsForHash()
                            .put(SURGE_KEY, zoneId, String.valueOf(multiplier)));
            })
            .then();
    }
}
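The multiplier arithmetic inside `calculateAllZones()` is easier to follow when pulled out as a pure function. The sketch below mirrors the formula in the listing above (floor of 1.0, cap of 3.0, half a point of surge per unit of demand/supply ratio above 1). The class name `SurgeFormula` and its constants are hypothetical names introduced for illustration.

```java
// Hypothetical extraction of the formula used in calculateAllZones().
final class SurgeFormula {

    static final double MIN_MULTIPLIER = 1.0; // base fare
    static final double MAX_MULTIPLIER = 3.0; // surge cap

    /** Returns the surge multiplier for a zone given available drivers and pending riders. */
    static double multiplier(long drivers, long riders) {
        if (drivers == 0) {
            return MAX_MULTIPLIER; // no supply at all: maximum surge, as in the listing
        }
        double ratio = (double) riders / drivers;
        double raw = 1.0 + (ratio - 1.0) * 0.5;
        return Math.min(MAX_MULTIPLIER, Math.max(MIN_MULTIPLIER, raw));
    }

    public static void main(String[] args) {
        long[][] scenarios = {{10, 5}, {10, 10}, {10, 30}, {10, 100}, {0, 0}};
        for (long[] s : scenarios) {
            System.out.printf("drivers=%d riders=%d -> %.2fx%n",
                    s[0], s[1], multiplier(s[0], s[1]));
        }
    }
}
```

One consequence of the formula as written: a zone with zero drivers gets the 3.0 cap even when there are also zero riders, so an empty zone shows maximum surge until a driver appears in it.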
HTTP Session: Spring Session with Redis
// SCALED: Spring Session externalized to Redis
@Configuration
@EnableRedisWebSession(maxInactiveIntervalInSeconds = 1800)
public class SessionConfig {

    @Bean
    public ReactiveRedisConnectionFactory redisConnectionFactory() {
        // Port 26379 is the Sentinel port, so connect through a Sentinel
        // configuration rather than treating it as a standalone Redis node.
        RedisSentinelConfiguration config = new RedisSentinelConfiguration()
            .master("mymaster")
            .sentinel("redis-sentinel-0", 26379)
            .sentinel("redis-sentinel-1", 26379)
            .sentinel("redis-sentinel-2", 26379);
        return new LettuceConnectionFactory(config);
    }

    @Bean
    public RedisSerializer<Object> springSessionDefaultRedisSerializer() {
        // JSON serialization for debuggability
        return new GenericJackson2JsonRedisSerializer();
    }
}
# application.yml
spring:
  session:
    store-type: redis
    redis:
      namespace: ride-hailing:sessions
  data:
    redis:
      sentinel:
        master: mymaster
        nodes: redis-sentinel-0:26379,redis-sentinel-1:26379,redis-sentinel-2:26379
Kubernetes Manifest for Redis Sentinel
# kubernetes/redis-sentinel.yaml
# This manifest starts plain redis-server in each pod. Replication
# (replicaof) and the Sentinel processes listening on 26379 need
# additional configuration that is not shown here.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.2-alpine
          ports:
            - containerPort: 6379
            - containerPort: 26379
          command: ["redis-server"]
          args:
            [
              "--appendonly",
              "yes",
              "--maxmemory",
              "1gb",
              "--maxmemory-policy",
              "volatile-lfu",
            ]
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1.5Gi"
          volumeMounts:
            - name: redis-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 5Gi
The Proof
Locust test at 300 users, 6 pods, comparing local state vs Redis-externalized state:
With pod-local state (ConcurrentHashMap):
/api/drivers/nearby p99: 2,100ms RPS: 6.14 Fail: 0.0%
/api/fares/estimate p99: 4,200ms RPS: 4.09 Fail: 0.2%
Driver search variance: 56% (inconsistent results)
Scaling: 2 pods → 8 pods = 3.1x throughput (sublinear)
With Redis-externalized state:
/api/drivers/nearby p99: 180ms RPS: 28.4 Fail: 0.0%
/api/fares/estimate p99: 420ms RPS: 18.2 Fail: 0.0%
Driver search variance: 0% (consistent results)
Scaling: 2 pods → 8 pods = 7.2x throughput (near-linear)
Delta (at 8 pods):
/api/drivers/nearby p99: 2,100ms → 180ms (11.7x improvement)
/api/fares/estimate p99: 4,200ms → 420ms (10x improvement)
Throughput scaling: 3.1x → 7.2x (near-linear scaling achieved)
The per-access latency increased (0.1μs local vs 1-3ms Redis), but p99 latency at the endpoint level dropped by roughly 10x. The cost per access went up. The cost per correct result went down. That is the trade, and for the ride-hailing platform, it is the correct one.
Scaling is now near-linear: adding pods increases throughput almost proportionally. The state problem is solved. Chapter 4 addresses the next bottleneck: connection pools and thread pools.