HPA, VPA, and Why CPU-Based Scaling Fails for I/O-Bound Services
The Symptom
The rider API deployment has CPU-based HPA configured with a target of 70%. During Friday evening surge, the Grafana dashboard shows a flat line at 18% CPU across all 3 pods. The HPA controller evaluates every 15 seconds, computes desiredReplicas = ceil(3 * (18 / 70)) = ceil(0.77) = 1, and decides the deployment is over-provisioned. It wants to scale down.
The service is handling 5,200 RPS across 3 pods. Connection pool exhaustion on PostgreSQL causes request queuing. The p99 climbs from 150ms to 4,200ms over 12 minutes. Riders see spinning loading screens. The HPA does nothing.
The on-call engineer manually scales to 12 pods with kubectl scale deployment rider-api --replicas=12. Latency drops to 200ms within 90 seconds. The engineer adds a TODO to fix the autoscaling configuration. The TODO stays open for 3 months, surviving through two more Friday evening incidents.
The Cause
HPA uses a simple formula:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
For CPU-based scaling with a 70% target and current CPU at 18%:
desiredReplicas = ceil(3 * (18 / 70)) = ceil(0.77) = 1
HPA wants to scale down to 1 replica. The minReplicas floor of 3 prevents that. But HPA will never scale up because CPU will never approach 70%.
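The arithmetic above can be checked with a minimal Python sketch of the documented formula. The function name and the clamping step are illustrative, and the maxReplicas value of 50 is borrowed from the manifest later in this article. The real controller also applies a tolerance band and stabilization windows, which this sketch leaves out.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=3, max_replicas=50):
    """Return (raw, clamped) replica counts from the basic HPA formula."""
    raw = math.ceil(current_replicas * (current_value / target_value))
    clamped = max(min_replicas, min(max_replicas, raw))
    return raw, clamped

# CPU at 18% against a 70% target on 3 pods: the formula asks for 1,
# and the minReplicas floor holds the deployment at 3.
print(desired_replicas(3, 18, 70))   # (1, 3)

# CPU would have to climb past the target before the formula asks for more pods.
print(desired_replicas(3, 80, 70))   # (4, 4)
```

For an I/O-bound service the second case never happens, which is the whole problem.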
Why does CPU stay low? The rider API uses Spring WebFlux with Netty. By default, Reactor Netty sizes the event loop pool at max(Runtime.getRuntime().availableProcessors(), 4) threads, so a 2-core pod gets 4 event loop threads. These threads never block. They accept a request, dispatch the PostgreSQL query asynchronously, and immediately handle the next request. The CPU work per request is approximately:
JSON deserialization:      0.3ms
Route matching:            0.1ms
Request validation:        0.2ms
Response serialization:    0.4ms
Netty frame encoding:      0.2ms
Total CPU time/request:    1.2ms
The remaining 53ms of a typical request (55ms total wall clock) is I/O wait: PostgreSQL query (35ms), Redis lookup (8ms), network write (10ms). The event loop thread is free during that time, handling other requests.
At roughly 1,733 RPS per pod (5,200 / 3), those per-request estimates would add up to 1,733 * 1.2ms ≈ 2,080ms, or 2.08 CPU-seconds per second. That is more than a 2-core pod can deliver, so the figures above are best read as rough upper bounds, not measurements. Kubernetes computes CPU utilization as actual usage divided by the pod’s resources.requests.cpu (1000m here), and the metrics pipeline reports ~18% on average, which means the real CPU cost per request is far below the estimate. Either way the conclusion holds: each request spends nearly all of its wall-clock time waiting on I/O, so CPU utilization stays well under the 70% target no matter how saturated the connection pools are.
The correct metric is request throughput. When RPS per pod exceeds the capacity of the connection pools and event loop, latency degrades. For the rider API, that threshold is approximately 500 RPS per pod with current pool sizes (PostgreSQL: 20 connections, Redis: 50 connections).
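One way to sanity-check the ~500 RPS figure is Little's law: a pool of N connections, each held for W seconds per request, sustains at most N / W requests per second. The sketch below assumes one PostgreSQL query and one Redis lookup per request, with the connection held for the durations in the latency breakdown above (35ms and 8ms). Both assumptions are illustrative, not measured.

```python
def pool_ceiling_rps(connections, hold_time_ms):
    """Upper bound on requests/second for a pool, by Little's law (N / W)."""
    return connections / (hold_time_ms / 1000.0)

# Pool sizes from the article; hold times assumed equal to the I/O latencies.
postgres_rps = pool_ceiling_rps(20, 35)
redis_rps = pool_ceiling_rps(50, 8)
bottleneck = min(postgres_rps, redis_rps)

print(f"PostgreSQL pool ceiling: {postgres_rps:.0f} RPS")
print(f"Redis pool ceiling:      {redis_rps:.0f} RPS")
print(f"Per-pod ceiling:         {bottleneck:.0f} RPS")
```

Under those assumptions the PostgreSQL pool saturates near 571 RPS per pod, so a 500 RPS target leaves roughly 12% headroom before requests start queuing for a connection.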
The Baseline
HPA scaling algorithm behavior with CPU vs custom metrics:
Scenario               CPU-Based HPA        RPS-Based HPA
500 RPS (3 pods)       CPU 5%, no scale     167 RPS/pod, no scale
2000 RPS (3 pods)      CPU 12%, no scale    667 RPS/pod, scale to 4
5000 RPS (3 pods)      CPU 18%, no scale    1667 RPS/pod, scale to 10
10000 RPS (3 pods)     CPU 22%, no scale    3333 RPS/pod, scale to 20
10000 RPS (20 pods)    N/A                  500 RPS/pod, stable
The CPU column shows why I/O-bound load is invisible to CPU-based HPA. Even at 10,000 RPS (roughly 2x the Friday peak), CPU barely reaches 22%. The service would return 503s before CPU triggered a scale event.
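The RPS column follows from the same HPA formula. With a per-pod average as the metric, currentReplicas cancels out and the result reduces to total load divided by the per-pod target. The sketch below assumes the 500 RPS per-pod target and a floor of 3 replicas; the function name is hypothetical.

```python
import math

TARGET_RPS_PER_POD = 500  # assumed per-pod target, from the threshold above

def replicas_for_rps(total_rps, target_per_pod=TARGET_RPS_PER_POD, min_replicas=3):
    """Replica count an average-per-pod RPS target would ask for."""
    # ceil(current * (total / current) / target) simplifies to ceil(total / target)
    return max(min_replicas, math.ceil(total_rps / target_per_pod))

for total in (500, 2000, 5000, 10000):
    print(f"{total:>6} RPS -> {replicas_for_rps(total)} pods")
```

The output matches the scale targets in the table: 3, 4, 10, and 20 pods.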
The Fix
prometheus-adapter: bridging Prometheus metrics to Kubernetes HPA
Spring Boot Actuator exports http_server_requests_seconds_count to Prometheus. The prometheus-adapter converts this Prometheus metric into a Kubernetes custom metric that HPA can query:
# SCALED: prometheus-adapter ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_server_requests_seconds_count{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_seconds_count$"
          as: "${1}_per_second"
        # Sum across uri/method/status labels so each pod yields a single series
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
The metricsQuery computes rate() over a 2-minute window. This smooths out per-second spikes. A 1-minute window is too noisy (a 5-second burst of 2,000 RPS would cause unnecessary scaling). A 5-minute window is too slow (a sustained increase from 500 to 2,000 RPS would take 5 minutes to register fully).
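The effect of the window length can be seen with an idealized average. The sketch below assumes a hypothetical baseline of 400 RPS with one 5-second burst of 2,000 RPS falling entirely inside the window. Real rate() values are computed from scraped samples, so treat the numbers as an approximation.

```python
def windowed_rate(baseline_rps, burst_rps, burst_seconds, window_seconds):
    """Average request rate over a window that contains one full burst."""
    steady_seconds = window_seconds - burst_seconds
    total_requests = baseline_rps * steady_seconds + burst_rps * burst_seconds
    return total_requests / window_seconds

# Compare 1-, 2-, and 5-minute windows for the same burst.
for window in (60, 120, 300):
    rate = windowed_rate(400, 2000, 5, window)
    print(f"{window:>3}s window: {rate:.1f} RPS")
```

In this scenario the 1-minute window reads about 533 RPS, above the 500 RPS target, while the 2-minute window reads about 467 RPS and stays below it.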
Verify the custom metric is available:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ridehailing/pods/*/http_server_requests_per_second" | jq .
HPA manifest with custom metrics
# SCALED: HPA for rider-api on request rate
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rider-api-hpa
  namespace: ridehailing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rider-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
      selectPolicy: Min
Scale-up uses selectPolicy: Max: the larger of “double current count” or “add 5 pods.” At low replica counts (3 pods), scaling can add up to 5 instead of just 3 (100% of 3). At high replica counts (20 pods), scaling can add up to 20 (100%) instead of just 5.
Scale-down is conservative: a 300-second stabilization window, then at most 10% of pods removed per minute. Only one scale-down policy is defined, so selectPolicy: Min simply applies it; if more policies are added later, the one that removes the fewest pods wins. A traffic dip during a bathroom break at a concert does not mean the surge is over.
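The scale-up rule can be sketched as a small function. This is a simplified model of one 60-second period under the two policies in the manifest. It ignores maxReplicas, the stabilization window, and the metric itself, and the function name is hypothetical.

```python
import math

def max_replicas_after_scale_up(current, percent=100, pods=5):
    """Largest replica count allowed after one period with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))  # Percent policy
    by_pods = current + pods                               # Pods policy
    return max(by_percent, by_pods)

for current in (3, 5, 20):
    print(f"{current:>2} pods -> up to {max_replicas_after_scale_up(current)}")
```

The Pods policy wins below 5 replicas and the Percent policy wins above, so the ceiling moves from 3 to 8 pods in the first minute and from 20 to 40 later in the ramp.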
VPA for the surge pricing calculator
The surge pricing calculator loads a zone graph into memory. Each zone has pricing coefficients, demand multipliers, and historical baselines. During normal hours, the graph has ~200 active zones consuming 380Mi. During Friday peak, 800+ zones activate, and the graph grows to 1.4Gi.
The deployment has resources.limits.memory: 512Mi. When the graph grows past 512Mi, the JVM’s garbage collector thrashes, then the kernel OOMKills the pod. The pod restarts, reloads the graph (which has already grown), and gets OOMKilled again. A restart loop.
# BOTTLENECK: Fixed memory limits for variable workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: surge-pricing-calc
spec:
  template:
    spec:
      containers:
        - name: surge-pricing-calc
          resources:
            requests:
              memory: "512Mi"
            limits:
              memory: "512Mi"  # OOMKilled during peak
VPA fixes this by observing actual memory consumption and adjusting the container’s memory request, scaling the limit in proportion:
# SCALED: VPA for surge pricing calculator
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: surge-pricing-vpa
  namespace: ridehailing
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: surge-pricing-calc
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: surge-pricing-calc
        minAllowed:
          memory: "512Mi"
          cpu: "250m"
        maxAllowed:
          memory: "4Gi"
          cpu: "2"
        controlledResources: ["memory"]
VPA in “Auto” mode evicts pods and recreates them with updated resource requests. This means a brief disruption. For the surge pricing calculator running 2 replicas, VPA evicts one at a time, so at least 1 replica serves traffic during the adjustment.
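The minAllowed and maxAllowed bounds in the VPA manifest act as a clamp on whatever VPA recommends. A minimal sketch, working in Mi for simplicity; the three recommendation values are hypothetical inputs, not observed output from the cluster.

```python
def clamp_recommendation(recommended_mi, min_allowed_mi=512, max_allowed_mi=4096):
    """Bound a memory recommendation (in Mi) by minAllowed and maxAllowed."""
    return max(min_allowed_mi, min(max_allowed_mi, recommended_mi))

print(clamp_recommendation(300))    # 512: raised to minAllowed
print(clamp_recommendation(1843))   # 1843: within bounds (about 1.8Gi)
print(clamp_recommendation(6000))   # 4096: capped at maxAllowed (4Gi)
```

The 4Gi ceiling matters as much as the floor: it keeps a runaway recommendation from requesting more memory than a node can schedule.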
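Do not run VPA in “Auto” mode on the same deployment as HPA. They conflict: HPA wants to add pods, VPA wants to resize pods, and the two controllers can work against each other, especially when both react to CPU or memory. Alongside HPA, run VPA in “Off” mode, which only publishes recommendations, or “Initial” mode, which applies them only when a pod is created. In “Off” mode, read the recommendations manually:
kubectl describe vpa surge-pricing-vpa -n ridehailing | grep -A 20 "Recommendation"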
Scaling speed: the hidden cost
The time from HPA detecting the need to scale to the new pod serving traffic:
Step                       Duration    Cumulative (worst case)
Metric scrape interval     15s         15s
HPA evaluation interval    15s         30s
Stabilization window       30s         60s
Pod scheduling             2-5s        65s
Image pull (cached)        1-3s        68s
Image pull (uncached)      15-45s      110s
JVM startup                8-12s       122s
Spring context init        5-8s        130s
Readiness probe passes     10-30s      160s
The two image pull rows are alternatives; the cumulative figures from the uncached row onward assume the uncached pull. Worst case: 160 seconds from metric breach to new pod serving traffic. During those 160 seconds, the existing pods absorb the excess load. This is why minReplicas: 3 is not optional. Running fewer than 3 pods means a sudden spike has zero headroom while HPA ramps up.
Optimize each step:
# SCALED: Runtime stage of a multi-stage build (the "build" stage is defined earlier)
FROM eclipse-temurin:21-jre-alpine AS runtime
COPY --from=build /app/target/rider-api.jar /app/app.jar
# Extract Spring Boot layers. This only speeds up image pulls if the extracted
# directories are then copied into the image as separate layers.
RUN java -Djarmode=layertools -jar /app/app.jar extract
ENTRYPOINT ["java", \
    "-XX:+UseG1GC", \
    "-XX:MaxRAMPercentage=75.0", \
    "-XX:+TieredCompilation", \
    "-XX:TieredStopAtLevel=1", \
    "-Dspring.main.lazy-initialization=true", \
    "-jar", "/app/app.jar"]
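-XX:TieredStopAtLevel=1 restricts the JIT to the C1 compiler, reducing JVM startup from 12s to 6s. The setting applies for the lifetime of the process, not just startup: C2 never runs, so hot paths get C1-level optimization only and peak throughput is lower than with full tiered compilation. For an I/O-bound service that trade is often acceptable. -Dspring.main.lazy-initialization=true defers bean creation until first use, cutting Spring context initialization from 8s to 3s; the cost moves to the first requests that touch each bean.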
With these optimizations, the scaling timeline drops:
Step                        Duration    Cumulative
Metric + HPA + stabilize    60s         60s
Pod scheduling              2s          62s
Image pull (cached)         1s          63s
JVM startup (optimized)     6s          69s
Spring context (lazy)       3s          72s
Readiness probe             10s         82s
82 seconds. Still not instant. The minReplicas floor and the scale-up aggressiveness in the HPA behavior block exist to cover this gap.
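The cumulative column is a running sum of the step durations, which makes it easy to re-check when any single step changes. A minimal sketch using the optimized figures from the table above:

```python
from itertools import accumulate

# (step, seconds) pairs from the optimized timeline
steps = [
    ("Metric + HPA + stabilize", 60),
    ("Pod scheduling", 2),
    ("Image pull (cached)", 1),
    ("JVM startup (optimized)", 6),
    ("Spring context (lazy)", 3),
    ("Readiness probe", 10),
]

# Running total after each step
cumulative = list(accumulate(seconds for _, seconds in steps))
for (name, seconds), total in zip(steps, cumulative):
    print(f"{name:<26} {seconds:>3}s {total:>4}s")

time_to_ready = cumulative[-1]
print(f"Time to first new pod serving traffic: {time_to_ready}s")
```

The first row accounts for 60 of the 82 seconds, so pod startup tuning alone cannot shrink the gap much further.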
The Proof
After switching from CPU-based HPA to request-rate HPA with prometheus-adapter:
Metric                        CPU-based HPA       RPS-based HPA    Delta
First scale event (5k RPS)    Never               T+45s            Fixed
Pods at peak (10k RPS)        3 (never scaled)    24               +700%
p99 at peak                   4,200ms             185ms            -96%
Error rate at peak            3.2%                0.02%            -99%
Manual interventions/month    3                   0                -100%
VPA results for the surge pricing calculator:
Metric                   Fixed limits    VPA Auto        Delta
Memory limit             512Mi           1.8Gi (auto)    +260%
OOMKilled events/week    4               0               -100%
p99 during surge         1,800ms         220ms           -88%
The HPA now reacts to actual service pressure instead of a metric that does not correlate with load. The engineer who added the TODO three months ago closes the ticket.