Session Affinity and Its Cost

The Symptom

The ride-hailing platform runs on 6 pods behind an Nginx ingress with cookie-based session affinity. The Grafana dashboard shows CPU usage across pods: pod-1 at 78%, pod-2 at 71%, pod-3 at 62%, pod-4 at 45%, pod-5 at 38%, pod-6 at 22%. The load distribution is not uniform. It is not even close.

During the Friday evening peak, pod-1 hits 95% CPU and starts dropping requests. The HPA cannot help: it scales on average CPU, and the average across all pods is 53%, below the scale-up threshold of 70%. The hottest pod is drowning while the cluster-wide average says everything is fine.
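The arithmetic behind that blind spot can be sketched in a few lines. The snippet applies the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), to the dashboard readings above. It assumes the dashboard percentages are the same utilization figure the HPA evaluates, and the function name is illustrative rather than part of any API.

```python
import math

def hpa_desired_replicas(pod_cpu_percent, target_percent):
    """Apply the HPA replica formula to per-pod CPU utilization readings."""
    current = len(pod_cpu_percent)
    average = sum(pod_cpu_percent) / current
    desired = math.ceil(current * average / target_percent)
    return desired, average

# Dashboard readings for pod-1 through pod-6, in percent.
dashboard = [78, 71, 62, 45, 38, 22]
desired, average = hpa_desired_replicas(dashboard, target_percent=70)

print(f"average CPU: {average:.1f}%")
print(f"hottest pod: {max(dashboard)}%")
print(f"desired replicas: {desired} (currently {len(dashboard)})")
```

Under these assumptions the formula yields 5 replicas for a 6-pod deployment, before tolerance, stabilization windows, and minReplicas are taken into account. The average gives the autoscaler no reason to add capacity, even with one pod close to saturation.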

The Cause

Session affinity routes all requests from the same client to the same pod. The resulting load is uniform only when all clients generate equal load. Real users are not equal.

The ride-hailing platform has power users: riders who take 5-10 trips per day, generating 50+ API calls per session for fare estimates, driver searches, and trip history views. It also has casual users who open the app once, check a fare, and close it. Session affinity assigns both to pods using a hash of the session cookie. Power users land on whatever pod the hash function selects, and they stay there.

With 6 pods and a realistic power-law user distribution:

Pod 1: 8 power users, 200 casual users  → 78% CPU
Pod 2: 6 power users, 210 casual users  → 71% CPU
Pod 3: 4 power users, 215 casual users  → 62% CPU
Pod 4: 2 power users, 220 casual users  → 45% CPU
Pod 5: 1 power user,  205 casual users  → 38% CPU
Pod 6: 0 power users, 195 casual users  → 22% CPU

The hash function does not know that a power user generates 10x the load of a casual user. It distributes session cookies uniformly, not load uniformly.
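The gap between an even session count and an even load can be explored with a small simulation. This is a sketch of the hashing model described above, not the ingress controller's actual algorithm: the MD5-modulo assignment, the session ID format, and the 10x weight are all assumptions made for illustration. The user totals (21 power users, 1,245 casual users) are the sums of the table above.

```python
import hashlib
from collections import Counter

PODS = 6
POWER_USERS = 21      # total power users in the table above
CASUAL_USERS = 1245   # total casual users in the table above
POWER_WEIGHT = 10     # assumed load of a power user relative to a casual user

def pod_for(session_id: str) -> int:
    """Stand-in for cookie affinity: stable hash of the session ID, modulo pod count."""
    digest = hashlib.md5(session_id.encode()).hexdigest()
    return int(digest, 16) % PODS

users = [(f"power-{i}", POWER_WEIGHT) for i in range(POWER_USERS)]
users += [(f"casual-{i}", 1) for i in range(CASUAL_USERS)]

sessions = Counter()  # sessions pinned to each pod
load = Counter()      # weighted load pinned to each pod
for session_id, weight in users:
    pod = pod_for(session_id)
    sessions[pod] += 1
    load[pod] += weight

for pod in range(PODS):
    print(f"pod-{pod + 1}: {sessions[pod]:4d} sessions, load {load[pod]:4d}")
```

The exact figures depend on the session IDs, so none are quoted here. With only a handful of heavy sessions, the load column tends to spread further than the session column, because each power user moves a pod's load by ten units while moving its session count by one.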

Failover Gaps

When pod-1 crashes under the CPU pressure, its 8 power users and 200 casual users lose their sessions. The session cookie points to a pod that no longer exists. The Ingress detects the failed pod and redistributes those users, but:

Their HTTP sessions are gone. They must re-authenticate.
The driver location cache on pod-1 is gone. Drivers that were only sending updates to pod-1 are invisible until they send their next update to another pod.
The surge pricing multiplier computed by pod-1, which was handling the heaviest load and therefore had the best demand signal, is lost.

The redistribution makes the problem worse. Pod-1’s 208 users are spread across pods 2-6. Pod-2, already at 71%, receives roughly 40 additional users and climbs to 80%. The cascade continues.

The Baseline

Locust test with session affinity enabled, observing per-pod latency:

# load-tests/affinity_locustfile.py
from locust import HttpUser, task, between
import random

class AffinityRiderUser(HttpUser):
    wait_time = between(0.5, 2)

    def on_start(self):
        self.rider_id = f"rider-{random.randint(1, 10000)}"
        # Flag roughly 20% of simulated users as power users
        self.is_power_user = random.random() < 0.2

    @task(3)
    def search_drivers(self):
        self.client.get(
            "/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 5},
            headers={"X-User-Id": self.rider_id},
            name="/api/drivers/nearby"
        )

    @task(2)
    def request_fare(self):
        # Task weights are fixed when the class is defined, so they cannot
        # depend on self.is_power_user. Skew the load inside the task instead:
        # a power user requests five estimates per invocation, a casual user one.
        repeats = 5 if self.is_power_user else 1
        for _ in range(repeats):
            self.client.post(
                "/api/fares/estimate",
                json={
                    "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                    "dropoff_lat": 40.7580, "dropoff_lng": -73.9855
                },
                headers={"X-User-Id": self.rider_id},
                name="/api/fares/estimate"
            )

Results with session affinity at 300 users:

With session affinity (cookie-based):
  Aggregate p99: 2,800ms
  Per-pod p99 range: 1,200ms (pod-6) to 4,800ms (pod-1)
  Failure rate: 2.4% (concentrated on pod-1)

  Pod CPU distribution:
  pod-1: 92%  pod-2: 78%  pod-3: 65%
  pod-4: 48%  pod-5: 35%  pod-6: 24%

The aggregate p99 of 2,800ms hides the fact that riders assigned to pod-1 experience p99 of 4,800ms while riders on pod-6 experience 1,200ms. The quality of service depends on which pod the hash function assigned them to. That is not a scaling strategy. That is a lottery.
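One way to surface that gap is to compute the percentile per pod instead of over the pooled samples. The sketch below assumes each latency sample is tagged with the pod that served it, which the service would have to expose itself (for example through a response header; that mechanism is hypothetical here). The sample values are synthetic, picked to echo the shape of the figures above, and are not measured data.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def p99_by_pod(samples):
    """Group (pod_name, latency_ms) pairs by pod and return each pod's p99."""
    by_pod = defaultdict(list)
    for pod, latency_ms in samples:
        by_pod[pod].append(latency_ms)
    return {pod: percentile(vals, 99) for pod, vals in by_pod.items()}

# Synthetic samples: a hot pod with a heavy tail and a cool pod, 100 requests each.
samples = [("pod-1", ms) for ms in [1500] * 90 + [2800] * 8 + [4800] * 2]
samples += [("pod-6", ms) for ms in [600] * 98 + [1200] * 2]

aggregate_p99 = percentile([ms for _, ms in samples], 99)
per_pod_p99 = p99_by_pod(samples)

print("aggregate p99:", aggregate_p99)
print("per-pod p99:", per_pod_p99)
```

On this synthetic input the pooled p99 is 2800 while the per-pod values are 4800 and 1200. The pooled number describes neither group of riders.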

The Fix

Remove session affinity. Make every pod capable of handling any request from any user. This requires externalizing all three state types identified in the parent chapter.

# kubernetes/rider-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rider-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: rider-api
  template:
    metadata:
      labels:
        app: rider-api
    spec:
      containers:
        - name: rider-api
          image: ride-hailing/rider-api:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
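          env:
            # Port 26379 is the Sentinel port, not a Redis data endpoint, so
            # configure Sentinel discovery instead of a plain host/port pair.
            # "mymaster" is an assumed master name; use the one from your
            # Sentinel configuration.
            - name: SPRING_DATA_REDIS_SENTINEL_MASTER
              value: "mymaster"
            - name: SPRING_DATA_REDIS_SENTINEL_NODES
              value: "redis-sentinel:26379"
            # Spring Boot 3 no longer reads this property and selects Redis
            # when spring-session-data-redis is on the classpath.
            - name: SPRING_SESSION_STORE_TYPE
              value: "redis"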
---
apiVersion: v1
kind: Service
metadata:
  name: rider-api
spec:
  selector:
    app: rider-api
  ports:
    - port: 80
      targetPort: 8080
  # sessionAffinity is omitted, so it defaults to None and no client is pinned to a pod

The Ingress without affinity annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rider-api-ingress
  # No nginx.ingress.kubernetes.io/affinity annotation
spec:
  rules:
    - host: api.ridehailing.example
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: rider-api
                port:
                  number: 80

The Proof

After removing session affinity and externalizing state to Redis (detailed in CH3-S2):

Without session affinity (round-robin):
  Aggregate p99: 890ms
  Per-pod p99 range: 820ms to 940ms (uniform)
  Failure rate: 0.0%

  Pod CPU distribution:
  pod-1: 52%  pod-2: 50%  pod-3: 51%
  pod-4: 49%  pod-5: 51%  pod-6: 50%

Delta:
  p99:      2,800ms → 890ms   (3.1x improvement)
  Fail:     2.4% → 0.0%
  CPU skew: 92%-24% range → 52%-49% range

The load distribution is now uniform to within 3 percentage points. No pod is overloaded. No pod is idle. HPA can use average CPU as a scaling signal because average CPU now represents every pod accurately. The 3.1x p99 improvement came from removing hot-spot pods, not from any code optimization.
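How well the average represents each pod can be expressed as the largest gap between any single pod and the mean. The following is a minimal sketch using the two CPU distributions reported above; the helper name is illustrative.

```python
def mean_and_max_deviation(cpu_percent):
    """Return the mean and the largest absolute gap between any pod and that mean."""
    mean = sum(cpu_percent) / len(cpu_percent)
    return mean, max(abs(c - mean) for c in cpu_percent)

with_affinity = [92, 78, 65, 48, 35, 24]     # pod-1 .. pod-6, cookie affinity
without_affinity = [52, 50, 51, 49, 51, 50]  # pod-1 .. pod-6, no affinity

before_mean, before_gap = mean_and_max_deviation(with_affinity)
after_mean, after_gap = mean_and_max_deviation(without_affinity)

print(f"with affinity:    mean {before_mean:.1f}%, worst pod off by {before_gap:.1f} points")
print(f"without affinity: mean {after_mean:.1f}%, worst pod off by {after_gap:.1f} points")
```

With affinity the mean is 57% while one pod sits 35 points away from it. Without affinity the mean is 50.5% and no pod is more than 1.5 points away, so a threshold on the average is effectively a threshold on every pod.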