Algorithms and Health Checks That Actually Work
The Symptom
The incident postmortem reads: “Health check returned 200 for 7 minutes while the service was unable to serve requests.” The team stares at the timeline. Pod-3’s PostgreSQL connection pool exhausted at 18:42. Every request to pod-3 started failing at 18:42. The load balancer continued routing 430 RPS to pod-3 until 18:49 when the on-call engineer manually removed it. 180,600 failed requests. Customer-facing error rate: 8.3%.
The health check endpoint:
// BOTTLENECK: Health check that lies
@GetMapping("/health")
public ResponseEntity<Map<String, String>> health() {
    return ResponseEntity.ok(Map.of("status", "UP"));
}
This endpoint checks nothing. It confirms the JVM is running and the HTTP server is listening. It does not confirm the service can do its job. The load balancer asked “are you alive?” and the pod answered “yes” while dropping every request.
The Cause
Health checks serve two audiences with conflicting needs.
The load balancer needs to know: “Can this pod serve requests right now?” If the answer is no, stop routing traffic to it. This is the readiness check. It should verify every external dependency the pod needs to handle a request: database connectivity, cache availability, downstream service reachability.
The container runtime needs to know: “Is this pod in a recoverable state?” If the answer is no, restart it. This is the liveness check. It should verify only the pod’s internal state: is the JVM responsive, is the main thread alive, is the process deadlocked. It must not check external dependencies because restarting a pod does not fix a broken database.
The rider API depends on PostgreSQL (primary data store) and Redis (session cache, rate limiting). A request cannot be served without both. The readiness check must verify both:
// SCALED: Readiness check that verifies real capabilities
@GetMapping("/health/ready")
public Mono<ResponseEntity<Map<String, Object>>> readiness() {
    Mono<HealthStatus> dbCheck = checkPostgres();
    Mono<HealthStatus> redisCheck = checkRedis();
    Mono<HealthStatus> poolCheck = checkConnectionPool();
    return Mono.zip(dbCheck, redisCheck, poolCheck)
        .map(tuple -> {
            Map<String, Object> details = new LinkedHashMap<>();
            details.put("postgres", tuple.getT1());
            details.put("redis", tuple.getT2());
            details.put("connectionPool", tuple.getT3());
            boolean healthy = tuple.getT1().isUp()
                && tuple.getT2().isUp()
                && tuple.getT3().isUp();
            details.put("status", healthy ? "UP" : "DOWN");
            return healthy
                ? ResponseEntity.ok(details)
                : ResponseEntity.status(503).body(details);
        });
}
private Mono<HealthStatus> checkPostgres() {
    return Mono.fromCallable(() -> {
        try (Connection conn = dataSource.getConnection()) {
            try (PreparedStatement stmt = conn.prepareStatement("SELECT 1")) {
                stmt.setQueryTimeout(3);
                stmt.executeQuery();
                return HealthStatus.up("postgres");
            }
        }
    })
    .subscribeOn(Schedulers.boundedElastic())
    .timeout(Duration.ofSeconds(3))
    .onErrorResume(e -> Mono.just(
        HealthStatus.down("postgres", e.getMessage())
    ));
}
private Mono<HealthStatus> checkRedis() {
    return redisConnectionFactory.getReactiveConnection()
        .ping()
        .map(pong -> HealthStatus.up("redis"))
        .timeout(Duration.ofSeconds(2))
        .onErrorResume(e -> Mono.just(
            HealthStatus.down("redis", e.getMessage())
        ));
}
private Mono<HealthStatus> checkConnectionPool() {
    // Read the pool gauges at subscription time so each probe sees fresh values.
    return Mono.fromSupplier(() -> {
        HikariPoolMXBean pool = ((HikariDataSource) dataSource)
            .getHikariPoolMXBean();
        int active = pool.getActiveConnections();
        int total = pool.getTotalConnections();
        int pending = pool.getThreadsAwaitingConnection();
        String detail = String.format("active=%d total=%d pending=%d",
            active, total, pending);
        // Unhealthy when 5 or more threads are waiting, or every connection is in use.
        boolean healthy = pending < 5 && active < total;
        return healthy
            ? HealthStatus.up("connectionPool", detail)
            : HealthStatus.down("connectionPool", detail);
    });
}
// Supporting HealthStatus record
public record HealthStatus(String component, String status, String detail) {
    public boolean isUp() { return "UP".equals(status); }
    public static HealthStatus up(String component) {
        return new HealthStatus(component, "UP", "");
    }
    public static HealthStatus up(String component, String detail) {
        return new HealthStatus(component, "UP", detail);
    }
    public static HealthStatus down(String component, String detail) {
        return new HealthStatus(component, "DOWN", detail);
    }
}
The connection pool check deserves attention. threadsAwaitingConnection is the number of threads blocked waiting for a database connection; once it reaches 5, the check treats the pool as under pressure. When activeConnections equals totalConnections, every connection the pool currently holds is in use, which means exhaustion once the pool has grown to its maximum size. The readiness check catches pool exhaustion before request timeouts do.
The liveness check is minimal:
// SCALED: Liveness check - only JVM responsiveness
@GetMapping("/health/live")
public ResponseEntity<Map<String, String>> liveness() {
    return ResponseEntity.ok(Map.of("status", "UP"));
}
If the JVM can execute this handler and return a response, it is alive. If it cannot (deadlock, full GC loop, out of file descriptors), the HTTP server will not respond, the liveness probe will time out, and Kubernetes will restart the pod.
The Baseline
Comparison of health check approaches:
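| Check Type | Detects DB Failure | Detects Redis Failure | False Restarts | Cost per check |
| --- | --- | --- | --- | --- |
| TCP connect | No | No | No | ~0 |
| HTTP 200 | No | No | No | ~0 |
| Shallow /health | No | No | No | 0.5ms |
| Deep readiness | Yes | Yes | No | 5ms |
| Deep liveness | Yes | Yes | Yes | 5ms |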
Deep liveness checks (checking dependencies in the liveness probe) cause false restarts. When PostgreSQL goes down, the liveness probe fails on all pods. Kubernetes restarts all pods simultaneously. The pods come back, attempt to connect to the still-down database, fail the liveness check again, and restart again. The result is a restart loop that amplifies the outage.
The Fix
Kubernetes probe configuration
# SCALED: Probe configuration for the rider API
spec:
  containers:
    - name: rider-api
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
        successThreshold: 1
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 15
        timeoutSeconds: 3
        failureThreshold: 5
        successThreshold: 1
      startupProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 30
        successThreshold: 1
The startupProbe runs during pod startup only. It gives the JVM up to 155 seconds (5 + 30*5) to start. During this time, neither the liveness nor readiness probe runs. Without a startup probe, the liveness probe’s initialDelaySeconds: 30 might be too short for a cold start with class loading and Spring context initialization. The startup probe prevents liveness from killing a pod that is still booting.
Timing decisions:
Readiness periodSeconds: 10. Check every 10 seconds. A failing pod is marked unready and removed from rotation within about 30 seconds (3 failures * 10 seconds). Recovery takes at most 10 seconds (1 success). 30 seconds of downtime per pod is acceptable because other pods absorb the traffic.
Readiness failureThreshold: 3. Three consecutive failures before removal. A single failed check (network blip, momentary connection pool spike) does not remove the pod. Three consecutive failures over 30 seconds indicate a real problem.
Liveness failureThreshold: 5. Five consecutive failures before restart, at 15-second intervals: 75 seconds. This is intentionally high. Restarting is destructive. All in-flight requests are dropped. The pod’s warm caches are lost. If the JVM is truly hung, 75 seconds is a reasonable detection time. If the JVM is experiencing a long GC pause (which G1 can have during a mixed collection), 75 seconds gives it time to recover without an unnecessary restart.
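Readiness timeoutSeconds: 5. The deep health check queries PostgreSQL and Redis, and the handler itself caps those calls at 3 seconds and 2 seconds, so a slow dependency normally produces a 503 well inside the probe's 5-second limit. The probe timeout is the backstop for a handler that hangs outright. Even the inner limits are generous for “SELECT 1” and Redis PING. If these operations take seconds, the service cannot serve real requests within SLA anyway.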
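The threshold arithmetic in the timing decisions above can be sketched in a few lines. This is a simplified model, not the kubelet's implementation: ProbeTracker and its method names are hypothetical, and it only counts consecutive failures and successes the way failureThreshold and successThreshold describe.

```java
/** Hypothetical model of Kubernetes probe thresholds; an illustration, not kubelet code. */
final class ProbeTracker {
    private final int failureThreshold;
    private final int successThreshold;
    private int consecutiveFailures = 0;
    private int consecutiveSuccesses = 0;
    private boolean passing = true;

    ProbeTracker(int failureThreshold, int successThreshold) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
    }

    /** Records one probe result and returns whether the probe is still considered passing. */
    boolean record(boolean success) {
        if (success) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (!passing && consecutiveSuccesses >= successThreshold) {
                passing = true;
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (passing && consecutiveFailures >= failureThreshold) {
                passing = false;
            }
        }
        return passing;
    }

    /** Approximate time to flip state: consecutive failures needed times the probe period. */
    static int detectionSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    /** Startup budget: initial delay plus every allowed failed attempt. */
    static int startupBudgetSeconds(int initialDelaySeconds, int failureThreshold, int periodSeconds) {
        return initialDelaySeconds + failureThreshold * periodSeconds;
    }
}
```

With the settings above, detectionSeconds(3, 10) is 30 for readiness, detectionSeconds(5, 15) is 75 for liveness, and startupBudgetSeconds(5, 30, 5) is 155. A single failed probe followed by a success resets the failure count, which is why one network blip does not remove the pod.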
Algorithm comparison for the ride-hailing platform
Testing with 12 pods, roughly 5,000 RPS in total (about 430 RPS per pod when evenly spread), one pod injected with 200ms artificial latency:
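| Algorithm | p50 latency | p99 latency | Max latency | Degraded pod RPS (share of traffic) |
| --- | --- | --- | --- | --- |
| Round-robin | 95ms | 1,450ms | 3,200ms | 430 (8.3%) |
| Weighted RR | 90ms | 980ms | 2,100ms | 215 (4.1%) |
| Least connections | 92ms | 210ms | 480ms | 85 (1.6%) |
| Power of 2 choices | 93ms | 240ms | 520ms | 110 (2.1%) |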
Round-robin sends equal traffic to the degraded pod. The degraded pod’s 200ms additional latency affects 8.3% of all requests, pulling the p99 to 1,450ms.
Least connections detects the degraded pod’s higher connection count within seconds. Connection count rises because requests take longer: at 430 RPS with 200ms additional latency, the pod has ~86 extra concurrent connections compared to healthy pods. The load balancer shifts traffic away.
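The ~86 figure follows from Little's law: average requests in flight equal the arrival rate multiplied by the time each request spends in the system. A small sketch of that arithmetic, using a hypothetical helper name:

```java
/** Hypothetical helper illustrating Little's law for a single pod. */
final class InFlightEstimate {
    private InFlightEstimate() {}

    /** Concurrency = arrival rate (requests/second) x time in system (seconds). */
    static double inFlight(double requestsPerSecond, double latencyMillis) {
        return requestsPerSecond * latencyMillis / 1000.0;
    }
}
```

InFlightEstimate.inFlight(430, 200) evaluates to 86: the extra open connections that the added 200ms creates, and the signal a least-connections balancer reacts to.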
Power of two choices performs similarly to least connections but avoids the thundering-herd effect. When the degraded pod recovers, least connections might briefly stampede all traffic to it (it has the fewest connections). Power of two choices randomizes the selection, spreading the recovery load.
For the ride-hailing platform, least connections is the correct default. The thundering herd risk is low with 12+ pods because recovery traffic distributes across the fleet, not to a single pod.
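The difference between the two selection rules is easiest to see side by side. The sketch below is a minimal illustration under stated assumptions: Backend, Balancers, and the in-flight counters are hypothetical names, and a real load balancer tracks connection counts itself instead of receiving them from the caller.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical backend handle; inFlight counts requests currently open to the pod. */
final class Backend {
    final String name;
    final AtomicInteger inFlight = new AtomicInteger();

    Backend(String name, int initialInFlight) {
        this.name = name;
        this.inFlight.set(initialInFlight);
    }
}

final class Balancers {
    private Balancers() {}

    /** Least connections: scan every backend and take the one with the fewest in-flight requests. */
    static Backend leastConnections(List<Backend> backends) {
        Backend best = backends.get(0);
        for (Backend candidate : backends) {
            if (candidate.inFlight.get() < best.inFlight.get()) {
                best = candidate;
            }
        }
        return best;
    }

    /** Power of two choices: sample two distinct backends at random, keep the less loaded one. */
    static Backend powerOfTwoChoices(List<Backend> backends, Random rng) {
        int n = backends.size();
        if (n == 1) {
            return backends.get(0);
        }
        int i = rng.nextInt(n);
        int j = rng.nextInt(n - 1);
        if (j >= i) {
            j++; // shift past i so the two samples are always distinct
        }
        Backend a = backends.get(i);
        Backend b = backends.get(j);
        return a.inFlight.get() <= b.inFlight.get() ? a : b;
    }
}
```

Given three pods with made-up in-flight counts of 8, 7, and 94, leastConnections always returns the pod holding 7, while powerOfTwoChoices spreads picks across the two lightly loaded pods and never selects the one holding 94, because whichever pod it is paired with has fewer connections. That spreading is what softens a recovery stampede: the single global minimum does not receive every new request.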
The health check that lied: a postmortem
The timeline:
18:42:00 PostgreSQL connection pool exhausted on pod-3
18:42:01 All new requests to pod-3 start timing out (5s DB timeout)
18:42:01 Health check still returns 200 (does not check DB)
18:42:10 First health check runs: 200 OK
18:42:20 Second health check: 200 OK
18:42:30 Third health check: 200 OK
... (health check returns 200 for 7 minutes)
18:49:00 On-call engineer runs: kubectl delete pod rider-api-pod-3
18:49:05 New pod starts, connects to PostgreSQL successfully
18:49:35 New pod passes readiness, starts receiving traffic
With the deep readiness check, the timeline would have been:
18:42:00 PostgreSQL connection pool exhausted on pod-3
18:42:10 Readiness check: DB check fails (pool exhausted), returns 503
18:42:20 Readiness check: 503 (failure 2)
18:42:30 Readiness check: 503 (failure 3) → pod removed from endpoints
18:42:31 Traffic stops routing to pod-3
18:42:31 Pod-3's connection pool starts recovering (no new requests)
18:43:00 Connection pool recovers, readiness check returns 200
18:43:00 Pod-3 re-added to endpoints, resumes serving traffic
Total downtime: 30 seconds of failed traffic to pod-3, affecting about 12,900 requests (430 RPS * 30 seconds). With auto-recovery, the pod comes back without manual intervention.
The Proof
After deploying deep readiness checks, startup probes, and least-connections balancing:
| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| DB outage detection time | 7 min (manual) | 30s (auto) | -93% |
| Requests affected by DB issue | 180,600 | 12,900 | -93% |
| GC pause blast radius | 430 requests | 22 requests | -95% |
| False pod restarts/month | 0 | 0 | No change |
| Health check latency overhead | 0.5ms | 5ms | +4.5ms |
The 5ms health check overhead is the cost of querying PostgreSQL and Redis every 10 seconds. At 12 pods, that is 1.2 health check queries per second to PostgreSQL. The database handles 50,000 queries per second. The overhead is 0.002%.
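The overhead estimate is simple enough to recompute when the fleet size or probe period changes. A minimal sketch of the arithmetic, with hypothetical helper names and the figures quoted above as inputs:

```java
/** Hypothetical helper for estimating readiness-probe load on the database. */
final class HealthCheckOverhead {
    private HealthCheckOverhead() {}

    /** Each pod issues one health check query per probe period. */
    static double checksPerSecond(int pods, int periodSeconds) {
        return (double) pods / periodSeconds;
    }

    /** Share of database capacity consumed by health checks, as a percentage. */
    static double overheadPercent(double checksPerSecond, double dbCapacityQps) {
        return checksPerSecond / dbCapacityQps * 100.0;
    }
}
```

With 12 pods and a 10-second period, checksPerSecond gives 1.2, and overheadPercent(1.2, 50000) gives about 0.0024, the 0.002% quoted above.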