Kubernetes Scaling: HPA, VPA, KEDA, and the Metrics That Drive Them

The Symptom

Friday evening, 6:47 PM. The rider API’s p99 latency crosses 4 seconds. The on-call engineer checks the Kubernetes dashboard. CPU utilization across all pods: 18%. Memory: 42%. The Horizontal Pod Autoscaler shows 3/3 replicas. It has not scaled. According to HPA, the service is idle.

The service is not idle. It is drowning. 5,200 requests per second are hitting 3 pods. Each request queries PostgreSQL and reads from Redis. Spring WebFlux handles the I/O without blocking threads. The event loop is backed up, connection pools are exhausted, and every request is queuing. But CPU stays at 18% because the pods are not computing. Their requests are waiting on I/O.

The HPA is configured to scale at 70% CPU. The CPU will never reach 70%. The service will hit 10-second timeouts and start dropping requests before CPU hits 30%.

Meanwhile, in the analytics pipeline, the Kafka consumer group for trip events has a lag of 847,000 messages. The 2 consumer pods process 200 events/second each. The producers push 1,400 events/second during surge. The consumers fall further behind every minute. There is no autoscaler watching Kafka lag.

Two scaling failures. Two different causes. One platform.

Figure: HPA autoscaling timeline showing pod count rising from 3 to 8 during a traffic spike, holding steady, then scaling back down after a 5-minute cooldown period.

This timeline shows how a properly configured HPA responds to a traffic spike. At baseline, 3 pods handle normal load. When traffic surges and CPU crosses the 70% threshold, HPA scales to 8 pods within 2 minutes. The pods hold at 8 while traffic remains elevated, then scale back to 3 after the 5-minute cooldown once traffic normalizes. The key insight: this only works when the scaling metric actually reflects load. A CPU-based HPA would never trigger for the I/O-bound rider API.

The Cause

Kubernetes HPA scales pods based on metrics. The default metric is CPU. For CPU-bound workloads (video encoding, machine learning inference, image processing), CPU-based HPA works. The CPU rises with load, HPA adds pods, the CPU drops.

For I/O-bound workloads, CPU is the wrong signal. A Spring WebFlux service uses Netty’s event loop with a small thread pool (Reactor Netty defaults to one worker thread per available core, with a minimum of four). These threads are not supposed to block. They dispatch I/O operations and move to the next request. When PostgreSQL takes 50ms to respond, the thread is not waiting. It is handling other requests. CPU stays low even as load climbs.

The rider API is I/O-bound. 85% of each request’s wall-clock time is spent waiting for PostgreSQL (35ms), Redis (8ms), and network serialization (12ms). The actual CPU work per request is under 2ms. At 5,200 RPS with 3 pods, each pod handles ~1,733 RPS. If every request used the full 2ms, that would be 1,733 * 2ms = 3.5 CPU-seconds per second, more than a 2-core pod can supply, so 2ms is only an upper bound. The observed 18% of a 2-core pod is about 0.36 CPU-seconds per second, or roughly 0.2ms of CPU per request. Everything else is waiting, which is why CPU stays flat while latency climbs.
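
The CPU budget can be checked with a short sketch. It uses only figures quoted in this chapter (5,200 RPS, 3 pods, 2 cores per pod, 18% observed CPU, the 2ms upper bound, and the 35/8/12ms waits). The function names are illustrative, and reading 18% as a share of the pod's 2 cores is an assumption.

```python
# Figures quoted in the text. Treating 18% as a share of 2 cores is an assumption.
TOTAL_RPS = 5_200
PODS = 3
CORES_PER_POD = 2
OBSERVED_CPU = 0.18
CPU_UPPER_BOUND_MS = 2.0
WAIT_MS = {"postgresql": 35, "redis": 8, "network": 12}
WAIT_SHARE = 0.85  # share of wall-clock time spent waiting


def per_pod_rps(total_rps: float, pods: int) -> float:
    """Requests per second each pod receives under even load balancing."""
    return total_rps / pods


def cpu_seconds_per_second(rps: float, cpu_ms_per_request: float) -> float:
    """CPU time demanded per wall-clock second at a given per-request cost."""
    return rps * cpu_ms_per_request / 1000.0


def implied_cpu_ms_per_request(rps: float, cores: int, utilization: float) -> float:
    """CPU cost per request implied by the observed utilization."""
    return cores * utilization / rps * 1000.0


def wall_clock_ms(wait_ms: dict, wait_share: float) -> float:
    """Total request time if the listed waits make up wait_share of it."""
    return sum(wait_ms.values()) / wait_share


if __name__ == "__main__":
    rps = per_pod_rps(TOTAL_RPS, PODS)
    print(f"per-pod RPS: {rps:.0f}")
    print(f"CPU demand at 2ms/request: {cpu_seconds_per_second(rps, CPU_UPPER_BOUND_MS):.2f} cores")
    print(f"implied CPU per request: {implied_cpu_ms_per_request(rps, CORES_PER_POD, OBSERVED_CPU):.2f} ms")
    print(f"wall-clock per request: {wall_clock_ms(WAIT_MS, WAIT_SHARE):.1f} ms")
```

The gap between the implied CPU cost and the wall-clock time per request is the reason a CPU threshold never fires for this service.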

The correct scaling metric for the rider API is request rate. When each pod handles more than 500 RPS, response latency degrades. HPA should scale to keep RPS per pod below 500.

The analytics consumer has a different problem. It does not receive HTTP requests. It pulls from Kafka. An HPA driven only by resource metrics cannot see Kafka consumer lag. The correct scaling mechanism is KEDA (Kubernetes Event-Driven Autoscaling), which reads Kafka consumer group lag directly and scales pods to match.

The surge pricing calculator has a third pattern. It runs 2 replicas that each need significant memory for the pricing model’s in-memory graph. During Friday peak, the graph grows as more zones activate surge pricing. Memory pressure causes GC pauses that spike latency. This workload needs VPA (Vertical Pod Autoscaler) to increase memory per pod, not HPA to add more pods.

Three services. Three scaling strategies.

The Baseline

Current state with CPU-based HPA only:

Service                  Replicas   HPA Metric   Problem
rider-api                3          CPU (70%)    Never scales; CPU at 18% during overload
trip-analytics-consumer  2          None         Kafka lag grows unbounded during surge
surge-pricing-calc       2          CPU (70%)    OOMKilled when graph grows; CPU is 55%

Performance during Friday 6-9 PM peak:

Metric                          rider-api    analytics    surge-pricing
Request/event rate              5,200 RPS    1,400 eps    800 RPS
p99 latency                     4,200ms      N/A          1,800ms
Error rate                      3.2%         0%           0.8%
Kafka consumer lag              N/A          847,000      N/A
OOMKilled events (last 7 days)  0            0            4

Target state:

Service                  Scaling    Metric                  Target
rider-api                HPA        requests/sec/pod        500 RPS/pod
trip-analytics-consumer  KEDA       Kafka consumer lag      < 1,000 msgs
surge-pricing-calc       VPA        Memory recommendation   Auto-adjust limits

The Fix

Custom metrics HPA for the rider API

Step 1: Expose request rate from Prometheus. Spring Boot Actuator with Micrometer already exports http_server_requests_seconds_count. The prometheus-adapter translates this into a Kubernetes custom metric:

# SCALED: prometheus-adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_server_requests_seconds_count{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_seconds_count$"
        as: "${1}_per_second"
      # Sum across the uri/method/status label sets so each pod yields one series
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
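
The naming rule in that configuration is a regular-expression rewrite, and a minimal sketch can show which metric name the HPA will see. This mimics only the name mapping, not the adapter itself. The function name is illustrative, and Python writes the capture group as \1 where the adapter configuration writes ${1}.

```python
import re
from typing import Optional

# Same pattern as the `matches` field in the adapter rule.
NAME_PATTERN = re.compile(r"^(.*)_seconds_count$")


def custom_metric_name(series_name: str) -> Optional[str]:
    """Return the custom metric name for a Prometheus series, or None if the rule does not apply."""
    if not NAME_PATTERN.match(series_name):
        return None
    return NAME_PATTERN.sub(r"\1_per_second", series_name)


if __name__ == "__main__":
    print(custom_metric_name("http_server_requests_seconds_count"))
    print(custom_metric_name("jvm_memory_used_bytes"))
```

The first call yields http_server_requests_per_second, which is the name the HPA manifest has to reference.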

Step 2: Configure HPA to use the custom metric:

# SCALED: HPA for rider-api using request rate
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rider-api-hpa
  namespace: ridehailing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rider-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

The behavior block matters. Scale-up is aggressive: after a 30-second stabilization window, double the pods every 60 seconds if needed. Scale-down is conservative: remove at most 10% of pods per minute, with a 5-minute stabilization window. Friday evening surges spike fast and drop gradually. Aggressive scale-down during a brief dip would leave the service under-provisioned when traffic returns and force another cold-start cycle.
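
A simplified model of these policies shows how the replica count moves from one 60-second policy period to the next. It is a sketch under stated assumptions, not the controller's implementation: it ignores stabilization windows, the HPA tolerance band, and pod readiness, and the rounding (ceiling on the way up, floor on the way down) is an assumption. The load figures are hypothetical.

```python
import math


def desired_replicas(total_rps, target_per_pod=500, min_replicas=3, max_replicas=50):
    """Replicas needed to keep average RPS per pod at or below the target."""
    raw = math.ceil(total_rps / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


def next_replicas(current, desired, up_percent=100, down_percent=10):
    """Apply the per-period Percent policies to one scaling step."""
    if desired > current:
        # Integer ceiling of current * (1 + up_percent / 100)
        limit = -(-current * (100 + up_percent) // 100)
        return min(desired, limit)
    if desired < current:
        # Integer floor of current * (1 - down_percent / 100)
        limit = current * (100 - down_percent) // 100
        return max(desired, limit)
    return current


def scaling_path(start, total_rps):
    """Replica count after each policy period until the desired count is reached."""
    want = desired_replicas(total_rps)
    path = [start]
    while path[-1] != want:
        path.append(next_replicas(path[-1], want))
    return path


if __name__ == "__main__":
    print(scaling_path(3, 10_000))   # scale-up under a hypothetical 10,000 RPS
    print(scaling_path(20, 1_000))   # scale-down once load falls to 1,000 RPS
```

Under this model the scale-up reaches its target in three periods, while the scale-down from 20 pods back to 3 takes twelve, which is the asymmetry the behavior block is designed to produce.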

KEDA for Kafka consumers

# SCALED: KEDA ScaledObject for trip analytics consumer
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: trip-analytics-consumer
  namespace: ridehailing
spec:
  scaleTargetRef:
    name: trip-analytics-consumer
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 120
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-headless.kafka:9092
        consumerGroup: trip-analytics
        topic: trip-events
        lagThreshold: "1000"
        offsetResetPolicy: earliest

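lagThreshold is the target lag per replica. KEDA exposes the consumer group's total lag to an HPA that it manages, and that HPA adds replicas once the lag per replica exceeds 1,000 messages. The formula: desiredReplicas = ceil(currentLag / lagThreshold). At 847,000 lag with a threshold of 1,000, KEDA targets 847 replicas, capped at maxReplicaCount of 20. Scale-down toward minReplicaCount follows the HPA's scale-down stabilization window (300 seconds by default). cooldownPeriod applies only when scaling to zero, so with minReplicaCount: 2 the 120-second value has no effect here.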
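
The replica formula and the resulting backlog drain time can be sketched in a few lines. The function names are illustrative. The sketch assumes each consumer keeps processing 200 events/second as replicas are added and that the topic has at least 20 partitions. The partition count is an assumption, and it matters because KEDA's Kafka scaler by default does not scale beyond the number of partitions.

```python
import math


def keda_desired_replicas(total_lag, lag_threshold, min_replicas, max_replicas):
    """ceil(total_lag / lag_threshold), clamped to the ScaledObject bounds."""
    raw = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(max_replicas, raw))


def drain_seconds(total_lag, consumers, per_consumer_eps, produce_eps):
    """Seconds to clear the backlog, or infinity if consumers cannot keep up."""
    net_eps = consumers * per_consumer_eps - produce_eps
    if net_eps <= 0:
        return math.inf
    return total_lag / net_eps


if __name__ == "__main__":
    replicas = keda_desired_replicas(847_000, 1_000, min_replicas=2, max_replicas=20)
    print(f"desired replicas: {replicas}")
    print(f"drain time with 2 consumers: {drain_seconds(847_000, 2, 200, 1_400)}")
    print(f"drain time with {replicas} consumers: {drain_seconds(847_000, replicas, 200, 1_400):.0f} s")
```

With 2 consumers the backlog never drains, because 400 events/second of capacity faces 1,400 events/second of production. At the 20-replica cap the same backlog clears in a little over five minutes under these assumptions.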

VPA for the surge pricing calculator

# SCALED: VPA for surge pricing calculator
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: surge-pricing-vpa
  namespace: ridehailing
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: surge-pricing-calc
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: surge-pricing-calc
        minAllowed:
          memory: "512Mi"
          cpu: "250m"
        maxAllowed:
          memory: "4Gi"
          cpu: "2"
        controlledResources: ["memory"]

controlledResources: ["memory"] tells VPA to only adjust memory, not CPU. The surge pricing calculator is memory-bound, not CPU-bound. VPA watches the pod’s actual memory consumption, generates recommendations, and evicts pods so they are recreated with a higher memory request (the limit is scaled in proportion by default). This eviction is the tradeoff: in Auto mode, VPA restarts the pod to change its resources.
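
The minAllowed and maxAllowed bounds cap whatever VPA recommends. A minimal sketch of that clamping, using the bounds from the manifest above, follows. The parser handles only the Ki/Mi/Gi suffixes, the function names are illustrative, and the 6Gi and 256Mi inputs are hypothetical.

```python
# Binary suffixes only; a simplification of Kubernetes quantity parsing.
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1.8Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def clamp_recommendation(recommended: str, min_allowed: str, max_allowed: str) -> int:
    """Bytes VPA would apply after enforcing the container policy bounds."""
    low = parse_memory(min_allowed)
    high = parse_memory(max_allowed)
    return max(low, min(high, parse_memory(recommended)))


if __name__ == "__main__":
    for rec in ("1.8Gi", "6Gi", "256Mi"):
        applied = clamp_recommendation(rec, "512Mi", "4Gi")
        print(f"{rec} -> {applied / 1024 ** 3:.2f} GiB")
```

A recommendation inside the bounds passes through unchanged. One above 4Gi is capped, which means a graph that outgrows maxAllowed would still end in OOMKilled pods.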

Locust load test: HPA scaling validation

# SCALED: Locust test for HPA scaling validation
from locust import HttpUser, task, between


class RiderApiUser(HttpUser):
    # Each simulated user pauses 0.1-0.5 s (mean 0.3 s) between requests.
    wait_time = between(0.1, 0.5)

    @task(5)
    def get_fare_estimate(self):
        self.client.get("/api/rides/fare-estimate", params={
            "pickup_lat": 40.7128,
            "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589,
            "dropoff_lng": -73.9851
        })

    @task(3)
    def get_nearby_drivers(self):
        self.client.get("/api/drivers/nearby", params={
            "lat": 40.7128,
            "lng": -74.0060,
            "radius_km": 2
        })

    @task(1)
    def request_ride(self):
        self.client.post("/api/rides/request", json={
            "rider_id": "rider-12345",
            "pickup_lat": 40.7128,
            "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589,
            "dropoff_lng": -73.9851,
            "ride_type": "standard"
        })

Run with a ramp from 100 to 10,000 RPS:

locust -f locust_hpa_test.py \
  --host=https://rider-api.ridehailing.internal \
  --users 20000 \
  --spawn-rate 200 \
  --run-time 600s \
  --headless \
  --csv=hpa_scaling_test

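Locust spawns 200 users per second until 20,000 users are active, so the ramp takes 100 seconds. The task weights (5:3:1, summing to 9) set the request mix, not the request rate. Each user waits a mean of 0.3 seconds between requests, so its rate is bounded by 1 / (0.3s + response time), at most about 3.3 RPS. The aggregate rate therefore depends on response time and load-generator capacity. This run ramps from low hundreds to approximately 10,000 RPS over those 100 seconds.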
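
The load shape can be estimated before running the test. The sketch below assumes Locust's closed-loop model, in which a user sends its next request only after the previous response arrives and the wait time has elapsed. The function names are illustrative and the 185ms response time is taken from the results table as an example input.

```python
def ramp_seconds(users: int, spawn_rate: int) -> float:
    """Time for Locust to reach the full user count."""
    return users / spawn_rate


def task_mix(weights: dict) -> dict:
    """Share of requests each task receives, derived from its @task weight."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def max_rps_per_user(mean_wait_s: float, response_s: float = 0.0) -> float:
    """Upper bound on one user's request rate in a closed-loop test."""
    return 1.0 / (mean_wait_s + response_s)


if __name__ == "__main__":
    print(f"ramp: {ramp_seconds(20_000, 200):.0f} s")
    mix = task_mix({"fare_estimate": 5, "nearby_drivers": 3, "request_ride": 1})
    for name, share in mix.items():
        print(f"{name}: {share:.0%}")
    print(f"per-user ceiling, zero latency: {max_rps_per_user(0.3):.2f} RPS")
    print(f"per-user ceiling, 185ms latency: {max_rps_per_user(0.3, 0.185):.2f} RPS")
```

Because the per-user rate falls as response time rises, a slow service reduces the load the test offers. The achieved request rate should be read from the Locust statistics rather than inferred from the user count.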

The Proof

After deploying custom metrics HPA, KEDA, and VPA:

Metric                     Before        After          Delta
rider-api p99 (peak)       4,200ms       185ms          -96%
rider-api error rate       3.2%          0.02%          -99%
rider-api pods (peak)      3             24             +700%
analytics lag (peak)       847,000       980            -99.9%
analytics consumers (peak) 2             15             +650%
surge-pricing OOMKills     4/week        0/week         -100%
surge-pricing memory       512Mi fixed   1.8Gi (auto)   VPA adjusted

HPA scaled the rider API from 3 to 24 pods during the Locust ramp. The first scale event triggered at T+45s when per-pod RPS crossed 500. By T+90s, HPA had reached 12 pods. The full 24 pods were running by T+150s. Each scale event takes 30-45 seconds: 10 seconds for the metric to propagate, 5 seconds for HPA to decide, 15-30 seconds for pod startup (image pull from local cache, JVM startup, readiness probe pass).

KEDA scaled the analytics consumers from 2 to 15 as Friday evening trip volume tripled. The lag peaked at 980 messages (under the 1,000 threshold) and the consumer group kept pace with the 1,400 events/second production rate.

VPA recommended 1.8Gi for the surge pricing calculator based on the observed memory consumption during the first Friday after deployment. It evicted and recreated the pods with the new limits during a low-traffic window (Tuesday 3 AM). No more OOMKilled events.

The Locust test runs in CI on every PR that modifies the rider API. The expected pod count at each load level is part of the test assertion. Because HPA keys on request rate rather than latency, a code change that doubles per-request latency would not by itself change the pod count, so catching that regression requires asserting on p99 latency alongside pod count.
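
A pod-count assertion of this kind could look like the sketch below. It is not the project's CI code. The helper names, the tolerance of one pod, and the observed values are hypothetical, and the expected count is derived from the 500 RPS/pod target and the 3-50 replica bounds in the HPA manifest.

```python
import math


def expected_pods(total_rps, target_per_pod=500, min_replicas=3, max_replicas=50):
    """Pod count the HPA should settle at for a steady request rate."""
    raw = math.ceil(total_rps / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


def check_pod_counts(observed: dict, tolerance: int = 1) -> list:
    """Return one failure message per load level whose pod count is off target."""
    failures = []
    for rps, pods in sorted(observed.items()):
        want = expected_pods(rps)
        if abs(pods - want) > tolerance:
            failures.append(f"{rps} RPS: expected ~{want} pods, observed {pods}")
    return failures


if __name__ == "__main__":
    # Hypothetical observations: {steady request rate: pods counted after settling}
    observed = {1_000: 3, 5_000: 10, 10_000: 20}
    failures = check_pod_counts(observed)
    assert not failures, "\n".join(failures)
    print("pod counts match the HPA target at every load level")
```

A check like this catches a misconfigured target or a broken metrics pipeline. It says nothing about how fast each pod answers, which is why it needs a latency assertion next to it.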

CH13-S1 covers HPA and VPA mechanics in depth: the scaling algorithm, why CPU metrics lie for WebFlux, prometheus-adapter configuration, and scaling speed. CH13-S2 covers KEDA architecture, Kafka and Prometheus triggers, scale-to-zero for batch workloads, and the combined HPA+KEDA scaling timeline.