Deployment Latency: Connection Draining, Health Checks, and Warm-Up
The content platform deploys 4 times per day. Each deployment triggers a rolling update that replaces pods one at a time. During this transition, P99 latency spikes from 30ms to 500ms+. Three factors cause deployment latency: connection draining (existing requests on dying pods), cold JVM startup (no JIT compilation), and empty connection pools (cold downstream connections).
This section eliminates all three.
Connection Draining: Finishing In-Flight Requests
When a pod is terminated, it must finish processing in-flight requests before shutting down. Without graceful draining, the application exits as soon as it receives SIGTERM and aborts active requests. If the process is still running when the grace period ends (terminationGracePeriodSeconds, default 30 seconds), Kubernetes sends SIGKILL:
Deployment timeline without proper draining:
t=0: Kubernetes sends SIGTERM to old pod
t=0: Kubernetes removes pod from Service endpoints
t=0: Load balancer still has old endpoints cached (stale for 1-5s)
t=0-5s: New requests still arrive at dying pod
t=0-5s: Pod immediately stops accepting → 502 errors from proxy
t=30s: Kubernetes sends SIGKILL (default grace period)
Deployment timeline WITH proper draining:
t=0: Kubernetes sends SIGTERM to old pod
t=0: Pod stops accepting NEW connections (readiness = false)
t=0: Kubernetes removes pod from Service endpoints
t=0-5s: Stale load balancer routes drain naturally (short-lived requests finish)
t=0-30s: In-flight long requests complete normally
t=30s: Pod exits cleanly (all requests finished)
Spring Boot Graceful Shutdown
// application.yml: Enable graceful shutdown
// server:
//   shutdown: graceful
// spring:
//   lifecycle:
//     timeout-per-shutdown-phase: 30s
// Programmatic graceful shutdown with connection draining.
// @Component alone registers the handler; declaring it again through a
// @Bean method would define the same bean twice.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import jakarta.annotation.PreDestroy; // javax.annotation.PreDestroy on Spring Boot 2
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class GracefulShutdownHandler {
    private static final Logger log = LoggerFactory.getLogger(GracefulShutdownHandler.class);

    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final AtomicInteger activeRequests = new AtomicInteger(0);
    private final CountDownLatch drainComplete = new CountDownLatch(1);

    public boolean isShuttingDown() {
        return shuttingDown.get();
    }

    public void incrementActive() {
        activeRequests.incrementAndGet();
    }

    public void decrementActive() {
        int remaining = activeRequests.decrementAndGet();
        if (shuttingDown.get() && remaining == 0) {
            drainComplete.countDown();
        }
    }

    @PreDestroy
    public void shutdown() {
        log.info("SIGTERM received. Starting graceful drain. Active requests: {}",
                activeRequests.get());
        shuttingDown.set(true);
        // If nothing is in flight, no request will ever release the latch,
        // so release it here instead of waiting out the full timeout.
        if (activeRequests.get() == 0) {
            drainComplete.countDown();
        }
        // Wait for in-flight requests to complete (max 25s, leave 5s for cleanup)
        try {
            boolean drained = drainComplete.await(25, TimeUnit.SECONDS);
            if (drained) {
                log.info("All requests drained successfully");
            } else {
                log.warn("Drain timeout. {} requests still active",
                        activeRequests.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.warn("Drain interrupted");
        }
    }
}
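// Filter that tracks active requests and rejects new ones during shutdown:
import java.io.IOException;
import jakarta.servlet.Filter; // javax.servlet.* on Spring Boot 2
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.core.Ordered;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;

@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class DrainFilter implements Filter {
    private final GracefulShutdownHandler handler;

    public DrainFilter(GracefulShutdownHandler handler) {
        this.handler = handler;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (handler.isShuttingDown()) {
            // Reject new work and ask the client to close the connection
            HttpServletResponse response = (HttpServletResponse) res;
            response.setStatus(503);
            response.setHeader("Connection", "close");
            response.getWriter().write("Service shutting down");
            return;
        }
        handler.incrementActive();
        try {
            chain.doFilter(req, res);
        } finally {
            // Always decrement, even if the request threw
            handler.decrementActive();
        }
    }
}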
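The counter-and-latch idea used by the handler and filter can be tried without Spring or a servlet container. The following is a minimal sketch, assuming a hypothetical DrainTracker class that is not part of the service code. It shows two cases: a drain with one request in flight waits for that request, and a drain with nothing in flight returns at once.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, framework-free illustration of the counter-and-latch drain pattern.
class DrainTracker {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final AtomicInteger active = new AtomicInteger(0);
    private final CountDownLatch drained = new CountDownLatch(1);

    // Returns false when the request must be rejected because a drain has started.
    boolean tryBegin() {
        if (shuttingDown.get()) {
            return false;
        }
        active.incrementAndGet();
        return true;
    }

    // Marks one request as finished; the last one out releases the latch.
    void end() {
        if (active.decrementAndGet() == 0 && shuttingDown.get()) {
            drained.countDown();
        }
    }

    // Starts the drain and waits up to timeoutMs; true means every request finished.
    boolean drain(long timeoutMs) {
        shuttingDown.set(true);
        if (active.get() == 0) {
            drained.countDown(); // nothing in flight: release immediately
        }
        try {
            return drained.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DrainTracker tracker = new DrainTracker();
        tracker.tryBegin(); // one request in flight
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(100); // simulated request work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            tracker.end();
        });
        worker.start();
        System.out.println("drained=" + tracker.drain(2000));      // drained=true
        System.out.println("accepts new=" + tracker.tryBegin());   // accepts new=false
    }
}
```

The check in tryBegin and the increment are two separate steps, so a request arriving at the exact moment the drain starts can still slip through. In this sketch that is accepted as a trade-off, on the assumption that the pod has already been removed from the Service endpoints by the time the drain begins.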
Kubernetes Configuration for Zero-Downtime Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: article-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never reduce below desired replicas
      maxSurge: 1         # Create new pod before killing old
  template:
    spec:
      terminationGracePeriodSeconds: 45   # Must be > preStop sleep (5s) + drain timeout (25s)
      containers:
        - name: article-service
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]
                # Kubernetes runs this hook BEFORE it sends SIGTERM to the container.
                # The 5s sleep gives the endpoint controller time to remove
                # this pod from the Service, so the load balancer stops sending traffic.
                # Without this sleep: race condition where traffic arrives
                # after the application starts shutting down but before endpoint removal.
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 2
            failureThreshold: 1   # Remove from endpoints on first failure
            successThreshold: 2   # Require 2 successes before adding back
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
The preStop sleep is critical. Without it, there is a race condition:
Race condition without preStop sleep:
t=0.000s: Kubernetes sends SIGTERM
t=0.001s: Application begins shutdown, stops accepting requests
t=0.050s: Kubernetes endpoint controller updates Service endpoints
t=0.050s: kube-proxy updates iptables rules
t=0.100s: Nginx upstream config refreshed (if using DNS-based discovery)
Between t=0.001s and t=0.100s: requests routed to dying pod → 503 errors
With preStop sleep(5):
t=0.000s: Pod marked Terminating, preStop hook runs (SIGTERM not yet sent)
t=0.000-5.0s: Pod still accepts requests normally (sleep running)
t=0.050s: Kubernetes removes pod from endpoints (happens during sleep)
t=0.100s: All load balancers updated (no more new traffic to this pod)
t=5.000s: preStop sleep finishes, Kubernetes sends SIGTERM, application shutdown handler runs
t=5.001s: Application stops accepting new requests, drains existing
t=5.001-30s: In-flight requests complete
t=30s: Pod exits
Errors: 0
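The timing above only works if the whole sequence fits inside terminationGracePeriodSeconds, because the grace period starts counting when the pod is marked Terminating and therefore includes the preStop hook. A minimal sketch of that arithmetic, assuming a hypothetical GracePeriodBudget helper and the values used in this section (5s preStop sleep, 25s drain timeout, 5s cleanup margin):

```java
// Hypothetical helper: checks that the termination budget fits inside the grace period.
class GracePeriodBudget {
    // Seconds the pod needs between being marked Terminating and process exit.
    static int requiredSeconds(int preStopSleep, int drainTimeout, int cleanupMargin) {
        return preStopSleep + drainTimeout + cleanupMargin;
    }

    // True when the budget finishes strictly before Kubernetes would send SIGKILL.
    static boolean fits(int gracePeriod, int preStopSleep, int drainTimeout, int cleanupMargin) {
        return requiredSeconds(preStopSleep, drainTimeout, cleanupMargin) < gracePeriod;
    }

    public static void main(String[] args) {
        System.out.println(requiredSeconds(5, 25, 5)); // 35
        System.out.println(fits(45, 5, 25, 5));        // true: the configured 45s is enough
        System.out.println(fits(30, 5, 25, 5));        // false: the default 30s is too short
    }
}
```

A check like this could run in a CI step against the manifest values, so a later change to the drain timeout does not silently exceed the grace period.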
JVM Warm-Up: The First 60 Seconds
A freshly started JVM runs code in interpreter mode. The JIT compiler needs thousands of method invocations before it compiles hot paths. During this warm-up period, latency is 5-20x higher than steady state:
Content platform article service latency after fresh start:
t=0-5s: P50 = 180ms (interpreter mode, class loading)
t=5-15s: P50 = 85ms (C1 compiled, basic optimizations)
t=15-45s: P50 = 32ms (C2 compiling hot paths)
t=45-90s: P50 = 18ms (C2 complete, inlining stabilized)
t=90s+: P50 = 14ms (steady state, all optimizations applied)
Latency ratio: cold/warm = 180/14 ≈ 12.9x worse at startup
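The same ratio can be computed for every phase, which shows how quickly the penalty shrinks as compilation progresses. A small sketch using the P50 values listed above (the WarmUpRatios class name is illustrative only):

```java
import java.util.Locale;

// Illustrative only: slowdown of each warm-up phase relative to steady state.
class WarmUpRatios {
    static double ratio(double phaseP50Ms, double steadyP50Ms) {
        return phaseP50Ms / steadyP50Ms;
    }

    public static void main(String[] args) {
        String[] phases = {"t=0-5s", "t=5-15s", "t=15-45s", "t=45-90s"};
        double[] p50Ms = {180, 85, 32, 18};
        double steadyMs = 14;
        for (int i = 0; i < phases.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%s: %.1fx steady state",
                    phases[i], ratio(p50Ms[i], steadyMs)));
        }
        // Prints 12.9x, 6.1x, 2.3x and 1.3x for the four phases.
    }
}
```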
Warm-Up Strategy: Synthetic Load Before Accepting Traffic
// JVM warm-up: exercise hot paths with synthetic requests
// Run AFTER application context is ready, BEFORE readiness probe passes
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class JvmWarmer {
    private static final Logger log = LoggerFactory.getLogger(JvmWarmer.class);

    private final ArticleRepository articleRepository;
    private final SearchClient searchClient;
    private final ArticleRenderingService renderingService;
    // HealthController (see Health Check Optimization) exposes markReady()
    private final HealthController healthController;

    // Constructor injection: required because the fields are final
    public JvmWarmer(ArticleRepository articleRepository,
                     SearchClient searchClient,
                     ArticleRenderingService renderingService,
                     HealthController healthController) {
        this.articleRepository = articleRepository;
        this.searchClient = searchClient;
        this.renderingService = renderingService;
        this.healthController = healthController;
    }

    @EventListener(ApplicationReadyEvent.class)
    public void warmJvm() {
        log.info("Starting JVM warm-up (exercising hot paths)");
        long start = System.nanoTime();
        // Phase 1: Warm class loading and basic JIT (C1)
        warmPhase1_classLoading();
        // Phase 2: Warm hot paths to trigger C2 compilation
        warmPhase2_hotPaths();
        // Phase 3: Warm connection pools (covered in CH24-S2)
        warmPhase3_connections();
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        log.info("JVM warm-up completed in {}ms. Marking ready.", elapsed);
        healthController.markReady();
    }

    private void warmPhase1_classLoading() {
        // Load all classes in the request path
        // This prevents class loading latency during real requests
        for (int i = 0; i < 100; i++) {
            try {
                articleRepository.findById("warmup-" + i);
            } catch (Exception ignored) {
                // Expected: warmup articles do not exist
            }
        }
    }

    private void warmPhase2_hotPaths() {
        // Execute the full rendering path enough times to trigger C2
        // Without tiered compilation: C2 at 10,000 invocations (-XX:CompileThreshold)
        // With tiered compilation (the default): C1 at ~200, C2 at ~5,000
        int iterations = 5000;
        List<String> sampleArticleIds = articleRepository.findRecentIds(10);
        if (sampleArticleIds.isEmpty()) {
            // Avoids a division by zero in the modulo below
            log.warn("No sample articles found. Skipping hot-path warm-up.");
            return;
        }
        for (int i = 0; i < iterations; i++) {
            String articleId = sampleArticleIds.get(i % sampleArticleIds.size());
            try {
                // Exercise the full request path
                renderingService.renderArticle(articleId, "warmup-user");
            } catch (Exception ignored) {
                // Some downstream calls may fail; that is acceptable
            }
        }
    }

    private void warmPhase3_connections() {
        // Already covered in ConnectionPoolWarmer (CH24-S2)
        // Ensure search, recommendation, analytics, image connections are warm
    }
}
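A fixed iteration count is a guess about when the JIT has finished its work. One way to check the guess is to run the workload in batches and watch for the batch time to stop falling. The sketch below is a heuristic under stated assumptions: WarmUpProbe, its method names, and the 10% tolerance are illustrative, and a single run on a noisy machine can settle early or late.

```java
// Illustrative heuristic: detect when repeated batches of a workload stop getting faster.
class WarmUpProbe {
    // Runs the workload in batches and returns the wall-clock time of each batch in nanoseconds.
    static long[] measure(Runnable workload, int batches, int iterationsPerBatch) {
        long[] nanos = new long[batches];
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < iterationsPerBatch; i++) {
                workload.run();
            }
            nanos[b] = System.nanoTime() - start;
        }
        return nanos;
    }

    // Index of the first batch whose time is within `tolerance` (0.10 = 10%)
    // of the previous batch, or -1 if the timings never settle.
    static int stabilizedAt(long[] batchNanos, double tolerance) {
        for (int b = 1; b < batchNanos.length; b++) {
            long prev = batchNanos[b - 1];
            if (prev > 0 && Math.abs(batchNanos[b] - prev) <= tolerance * prev) {
                return b;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the JIT from discarding the work
        Runnable workload = () -> {
            for (int i = 0; i < 1_000; i++) {
                sink[0] += Integer.toString(i).hashCode();
            }
        };
        long[] timings = measure(workload, 20, 500);
        System.out.println("settled at batch " + stabilizedAt(timings, 0.10)
                + " (sink=" + sink[0] + ")");
    }
}
```

In the warmer, a check like this could replace the fixed loop bound: keep running batches of renderArticle calls until the timings settle or a maximum batch count is reached.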
JVM Flags for Faster Warm-Up
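# JVM startup flags for the content platform article service.
# Comments are kept above the command: a comment line between backslash
# continuations would cut the command short.
#
# -XX:+TieredCompilation        Tiered compilation (default in modern JVMs)
# -XX:CompileThreshold=5000     Compile threshold (default: 10000). Only applies when tiered
#                               compilation is disabled; with tiered compilation on, the C2
#                               trigger is -XX:Tier4InvocationThreshold (default: 5000)
# -XX:CICompilerCount=4         Total number of JIT compiler threads (speeds up background compilation)
# -XX:SharedArchiveFile=...     Class data sharing archive (reduces class loading time)
# -XX:+AlwaysPreTouch           Pre-touch memory pages (avoid page faults during request processing)
#
# -XX:SharedClassListFile is only needed when creating the archive, not at runtime.
java \
  -XX:+TieredCompilation \
  -XX:CompileThreshold=5000 \
  -XX:CICompilerCount=4 \
  -XX:SharedArchiveFile=app-cds.jsa \
  -XX:+AlwaysPreTouch \
  -jar article-service.jar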
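Because some JIT flags are ignored depending on other settings, it can help to print the values the running JVM actually uses. The following is a sketch that reads flags through the HotSpot diagnostic MXBean; the JitFlagReport class is hypothetical, and on a JVM that does not expose this bean or a given flag the method falls back to "unavailable".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports the effective value of JIT-related HotSpot flags.
class JitFlagReport {
    // Returns the flag's current value, or "unavailable" if this JVM does not expose it.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown flag, or a JVM without the HotSpot diagnostic bean
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        String[] flags = {
            "TieredCompilation",
            "Tier3InvocationThreshold",
            "Tier4InvocationThreshold",
            "CICompilerCount"
        };
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

Logging this once at startup makes it visible whether a flag set in JAVA_OPTS took effect in the container.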
Class Data Sharing (CDS) for Startup Time
# Step 1: Generate class list during warm-up run
java -XX:DumpLoadedClassList=classlist.txt \
-jar article-service.jar --warmup-mode
# Step 2: Create shared archive from class list
java -Xshare:dump \
-XX:SharedClassListFile=classlist.txt \
-XX:SharedArchiveFile=app-cds.jsa \
-jar article-service.jar
# Step 3: Use shared archive in production
java -Xshare:on \
-XX:SharedArchiveFile=app-cds.jsa \
-jar article-service.jar
# Impact on content platform startup:
# Without CDS: class loading = 4.2s, total startup = 12s
# With CDS: class loading = 0.8s, total startup = 8.6s
# Savings: 3.4s (28% faster startup)
Measuring Deployment Latency
# Locust script that continuously measures latency during deployment
# Run this alongside `kubectl rollout restart deployment/article-service`
from locust import HttpUser, task, between, events
import time
import csv
import os


class DeploymentLatencyMonitor(HttpUser):
    """Measures P99 latency during rolling deployment"""
    wait_time = between(0.01, 0.05)  # High frequency for accurate percentiles
    host = "http://content-platform.example.com"
    latency_log = []

    @task
    def fetch_article(self):
        start = time.perf_counter()
        response = self.client.get("/api/articles/12345",
                                   name="GET /api/articles/:id")
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.latency_log.append({
            "timestamp": time.time(),
            "latency_ms": elapsed_ms,
            "status": response.status_code
        })
        # Alert on deployment spike
        if elapsed_ms > 100:
            print(f"SPIKE: {elapsed_ms:.1f}ms at {time.strftime('%H:%M:%S')}")


@events.quitting.add_listener
def on_quitting(environment, **kwargs):
    """Save latency data for analysis"""
    with open("deployment_latency.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "latency_ms", "status"])
        writer.writeheader()
        writer.writerows(DeploymentLatencyMonitor.latency_log)
    print(f"Saved {len(DeploymentLatencyMonitor.latency_log)} measurements")
# Results during rolling deployment (3 pods, 1 at a time):
#
# WITHOUT warm-up and proper draining:
# Pre-deploy P99: 30ms
# During deploy P99: 520ms (cold JVM + connection pool miss)
# Duration of spike: 90s (30s per pod * 3 pods)
# 502 errors: 12 (race condition, no preStop sleep)
#
# WITH full optimization (drain + preStop + JVM warm + pool warm):
# Pre-deploy P99: 30ms
# During deploy P99: 42ms (slight increase from reduced capacity)
# Duration of spike: 0s (no spike; new pods are warm before receiving traffic)
# 502 errors: 0
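The P99 figures above come from the recorded latency samples. A minimal sketch of one common way to compute a percentile (the nearest-rank method) is shown here; the LatencyPercentiles class is illustrative, and other tools may interpolate between samples and report slightly different values.

```java
import java.util.Arrays;

// Illustrative nearest-rank percentile over latency samples in milliseconds.
class LatencyPercentiles {
    // Smallest sample such that at least p percent of all samples are <= it.
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 98 fast requests and 2 slow ones out of 100
        double[] samples = new double[100];
        Arrays.fill(samples, 30.0);
        samples[10] = 520.0;
        samples[60] = 520.0;
        System.out.println("P50 = " + percentile(samples, 50)); // P50 = 30.0
        System.out.println("P99 = " + percentile(samples, 99)); // P99 = 520.0
    }
}
```

The example shows why P99 is the right metric for deployments: two slow requests in a hundred leave the median untouched but move P99 all the way to the slow value.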
Rolling Deployment Strategy
# Optimized deployment for zero-latency-spike rolling updates:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: article-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Always maintain 3 ready pods
      maxSurge: 1         # Create 4th pod, warm it, then kill 1 old pod
  template:
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: article-service
          resources:
            requests:
              cpu: "2"        # Ensure warm-up has CPU for JIT compilation
              memory: "2Gi"
            limits:
              cpu: "4"        # Allow burst during warm-up
              memory: "2Gi"
          env:
            # Note: -XX:CompileThreshold has no effect while tiered compilation is enabled
            - name: JAVA_OPTS
              value: >-
                -XX:+TieredCompilation
                -XX:CompileThreshold=5000
                -XX:CICompilerCount=4
                -XX:+AlwaysPreTouch
                -Xshare:on
                -XX:SharedArchiveFile=/app/app-cds.jsa
          startupProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 2
            failureThreshold: 30   # Allow up to 65s for startup
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0   # Start checking immediately after startup probe passes
            periodSeconds: 2
            failureThreshold: 1
            successThreshold: 2      # Must pass twice (prevent flapping)
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]
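The "up to 65s" in the startup probe comment follows from the probe settings: the initial delay plus one period per allowed failure. A short sketch of that calculation for all three probes, assuming a hypothetical ProbeBudget helper; it is an approximation that ignores the per-probe timeoutSeconds.

```java
// Hypothetical helper: approximate time a probe tolerates before Kubernetes acts on failure.
class ProbeBudget {
    static int maxSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    public static void main(String[] args) {
        System.out.println("startup:   " + maxSeconds(5, 2, 30) + "s"); // startup:   65s
        System.out.println("readiness: " + maxSeconds(0, 2, 1) + "s");  // readiness: 2s
        System.out.println("liveness:  " + maxSeconds(0, 10, 3) + "s"); // liveness:  30s
    }
}
```

The startup budget has to exceed the full warm-up time, otherwise Kubernetes restarts the container before it ever becomes ready.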
Timeline: Optimized Deployment
t=0s: kubectl rollout triggered. New pod (v2) created.
t=5s: v2 JVM starts, class loading begins.
t=8s: v2 application context ready. JVM warm-up starts.
t=8-20s: v2 exercises hot paths (5000 iterations).
t=20-22s: v2 warms connection pools to downstream services.
t=22s: v2 readiness probe passes. Added to Service endpoints.
t=23s: Load balancer routes traffic to v2. Pod serves at full speed.
t=23s: v1-pod1 marked Terminating. preStop sleep(5) begins.
t=28s: v1-pod1 receives SIGTERM, stops accepting requests. Drains in-flight.
t=28-53s: v1-pod1 finishes remaining requests.
t=53s: v1-pod1 exits. Process repeats for v1-pod2, v1-pod3.
Total deployment time: ~90s (3 pods)
User-visible impact: 0ms latency spike, 0 errors
Capacity during deployment: never below 3 ready pods
Health Check Optimization
Health checks must distinguish three states: starting (not ready), running (ready), and draining (no longer ready):
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthController {
    private final AtomicBoolean started = new AtomicBoolean(false);
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Liveness: Is the process alive? Should Kubernetes restart it?
    @GetMapping("/health/live")
    public ResponseEntity<String> liveness() {
        if (!started.get()) {
            return ResponseEntity.status(503).body("starting");
        }
        return ResponseEntity.ok("alive");
    }

    // Readiness: Should traffic be sent to this pod?
    @GetMapping("/health/ready")
    public ResponseEntity<Map<String, Object>> readiness() {
        if (draining.get()) {
            return ResponseEntity.status(503).body(Map.of(
                    "status", "draining",
                    "message", "Pod is shutting down"
            ));
        }
        if (!ready.get()) {
            return ResponseEntity.status(503).body(Map.of(
                    "status", "warming",
                    "message", "JVM warm-up in progress"
            ));
        }
        return ResponseEntity.ok(Map.of(
                "status", "ready",
                "jit_compiled", getCompiledMethodCount(),
                "connections_warm", getWarmConnectionCount()
        ));
    }

    // Called after JVM warm-up and connection pool warm-up complete
    public void markReady() { ready.set(true); }

    public void markStarted() { started.set(true); }

    public void markDraining() { draining.set(true); ready.set(false); }

    private int getCompiledMethodCount() {
        // Rough proxy only: cumulative JIT compilation time (ms) divided by 10,
        // not an actual count of compiled methods
        CompilationMXBean compilation = ManagementFactory.getCompilationMXBean();
        return (int) (compilation.getTotalCompilationTime() / 10);
    }

    private int getWarmConnectionCount() {
        // Return number of established connections in pool
        return 24; // Placeholder: from ConnectionPoolWarmer metrics
    }
}
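The three states form a one-way progression, which a single state variable can express more directly than three independent flags. The following is an alternative sketch, not the controller used above: PodLifecycle and its method names are hypothetical. A compare-and-set makes the starting-to-ready transition happen at most once, and a pod that is already draining can never be marked ready again.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical alternative: one atomic state instead of three boolean flags.
class PodLifecycle {
    enum State { STARTING, READY, DRAINING }

    private final AtomicReference<State> state = new AtomicReference<>(State.STARTING);

    // STARTING -> READY. Returns false if the pod is not in STARTING,
    // for example because it is already draining.
    boolean markReady() {
        return state.compareAndSet(State.STARTING, State.READY);
    }

    // Any state -> DRAINING. Draining is terminal.
    void markDraining() {
        state.set(State.DRAINING);
    }

    State current() {
        return state.get();
    }

    // HTTP status a readiness endpoint would return for the current state.
    int readinessStatus() {
        return state.get() == State.READY ? 200 : 503;
    }

    public static void main(String[] args) {
        PodLifecycle pod = new PodLifecycle();
        System.out.println(pod.current() + " -> " + pod.readinessStatus()); // STARTING -> 503
        pod.markReady();
        System.out.println(pod.current() + " -> " + pod.readinessStatus()); // READY -> 200
        pod.markDraining();
        System.out.println(pod.current() + " -> " + pod.readinessStatus()); // DRAINING -> 503
        System.out.println("ready again? " + pod.markReady());              // ready again? false
    }
}
```

This guards against a late warm-up callback flipping a terminating pod back to ready, which is possible when readiness and draining are tracked as separate flags.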
Summary: The Deployment Latency Checklist
Before deployment (zero-downtime requirements):
✓ preStop sleep(5) configured (prevents race condition)
✓ terminationGracePeriodSeconds > drain timeout + preStop sleep
✓ maxUnavailable: 0 (never reduce ready replicas)
✓ maxSurge: 1 (new pod ready before old pod dies)
✓ Graceful shutdown drains in-flight requests
During startup (eliminate cold-start penalty):
✓ DNS prefetched for all downstream services
✓ Connection pools warmed with health check requests
✓ JVM hot paths exercised (5000+ iterations)
✓ CDS archive loaded (3.4s startup savings)
✓ Readiness probe gates on warm-up completion
Steady state (maintain low latency):
✓ Connection max-lifetime rotates connections (DNS rebalancing)
✓ Stale connection detection enabled (validateAfterInactivity)
✓ Passive health checks detect backend failures in < 1ms
✓ Response buffering protects backends from slow clients
Result: P99 latency remains at 30-42ms throughout deployment.
No 502 errors. No cold-start spikes visible to users.
The content platform deploys 4 times daily with zero user-visible impact. The engineering cost was a 22-second startup delay (JVM warm-up + connection warm-up) that is completely hidden behind the readiness probe. Users never see a cold JVM.