The Lies Averages Tell and What Percentiles Reveal
The Symptom
The ride-hailing platform’s monitoring dashboard shows a single number for the fare calculation endpoint: average latency 120ms. The line is flat. The color is green. The on-call engineer glances at it during the morning check and moves on.
Meanwhile, the support queue fills with riders reporting 5-second load times when requesting a fare during Friday evening surge. The dashboard says 120ms. The riders say 5 seconds. Both are telling the truth.
The Cause
Average latency is an arithmetic mean. It folds a 50ms request and a 10,000ms request into a single number, weighting each equally. When the distribution is bimodal, with two clusters of requests at very different latencies, the mean lands between the clusters and describes neither.
The fare calculation endpoint has two code paths:
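```java
// BOTTLENECK: Two paths with vastly different latencies
import java.time.Duration;

import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class FareController {

    private final ReactiveRedisTemplate<String, FareEstimate> redisTemplate;
    private final FareCalculationService fareService;

    // Constructor injection; the final fields must be assigned here.
    public FareController(ReactiveRedisTemplate<String, FareEstimate> redisTemplate,
                          FareCalculationService fareService) {
        this.redisTemplate = redisTemplate;
        this.fareService = fareService;
    }

    @PostMapping("/api/fares/estimate")
    public Mono<FareEstimate> estimateFare(@RequestBody FareRequest request) {
        String cacheKey = "fare:" + request.gridCell();
        return redisTemplate.opsForValue().get(cacheKey)        // Path A: 15ms
            .switchIfEmpty(
                fareService.calculateWithSurge(request)         // Path B: 800-4200ms
                    .flatMap(fare ->
                        redisTemplate.opsForValue()
                            .set(cacheKey, fare, Duration.ofSeconds(60))
                            .thenReturn(fare)
                    )
            );
    }
}
```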
Path A hits the Redis cache. It completes in 15ms. During normal hours, 95% of requests take Path A.
Path B misses the cache. It queries PostgreSQL for the base fare, fetches driver locations from the Redis GeoSet, computes the surge multiplier from supply/demand ratio, and writes the result back to Redis. Under low load, Path B takes 800ms. Under high load when the connection pool is contended, Path B takes 4,200ms.
During Friday evening surge, the cache hit rate drops from 95% to 60% because surge multipliers change rapidly and cached values expire. Now 40% of requests take the 800-4,200ms path. The average climbs from 120ms to 450ms, but the distribution is bimodal: most requests are still 15ms, and a meaningful fraction are 2,000ms+.
The average of 450ms describes neither the fast requests nor the slow ones. It is the temperature at which nobody is comfortable.
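To make the gap between the mean and the percentiles concrete, here is a minimal sketch over a synthetic sample. The numbers are assumptions chosen for illustration (600 cache hits at 15ms, 300 misses at 800ms, 100 misses at 2,000ms), not measurements from the platform, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Synthetic illustration only: the sample below is an assumed 60/40 split,
// not data captured from the fare endpoint.
final class MeanVersusPercentiles {

    // Arithmetic mean of all latencies, in milliseconds.
    static double mean(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(Double.NaN);
    }

    // Nearest-rank percentile. Expects an ascending-sorted array and a
    // percent between 1 and 100.
    static long percentile(long[] sortedMs, int percent) {
        int rank = (int) Math.ceil(percent * sortedMs.length / 100.0);
        return sortedMs[Math.max(rank, 1) - 1];
    }

    // 1,000 requests: 60% cache hits, 40% cache misses at two speeds.
    static long[] syntheticSurgeSample() {
        long[] sample = new long[1000];
        Arrays.fill(sample, 0, 600, 15L);      // Path A
        Arrays.fill(sample, 600, 900, 800L);   // Path B, low contention
        Arrays.fill(sample, 900, 1000, 2000L); // Path B, high contention
        return sample;                         // already in ascending order
    }

    public static void main(String[] args) {
        long[] sample = syntheticSurgeSample();
        // prints: mean=449ms p50=15ms p95=2000ms p99=2000ms
        System.out.printf("mean=%.0fms p50=%dms p95=%dms p99=%dms%n",
                mean(sample), percentile(sample, 50),
                percentile(sample, 95), percentile(sample, 99));
    }
}
```

No request in this sample took anywhere near 449ms. The mean is a value that none of the 1,000 requests experienced, while p50 and p95 each land on a latency that real requests saw.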
The Baseline
The distribution matters. Here is the actual latency distribution of the fare calculation endpoint during Friday evening load, captured by Prometheus:
```promql
# Latency distribution buckets for fare estimation
histogram_quantile(0.50, sum(rate(http_server_requests_seconds_bucket{uri="/api/fares/estimate"}[5m])) by (le))
# Result: 0.018 (18ms)

histogram_quantile(0.90, sum(rate(http_server_requests_seconds_bucket{uri="/api/fares/estimate"}[5m])) by (le))
# Result: 0.210 (210ms)

histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{uri="/api/fares/estimate"}[5m])) by (le))
# Result: 0.890 (890ms)

histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{uri="/api/fares/estimate"}[5m])) by (le))
# Result: 4.200 (4,200ms)
```
The jump from p90 (210ms) to p95 (890ms) is the boundary between cache hits and cache misses. The jump from p95 (890ms) to p99 (4,200ms) is the boundary between “cache miss under moderate load” and “cache miss under connection pool exhaustion.”
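It helps to remember that `histogram_quantile` does not see individual requests. It works from cumulative bucket counts and interpolates linearly inside the bucket that contains the requested rank. The sketch below reimplements that estimation in plain Java to show the idea; it is a simplified illustration, not Prometheus source code, and the bucket counts are hypothetical.

```java
// Simplified sketch of quantile estimation from cumulative histogram buckets.
// Class name and sample counts are hypothetical.
final class HistogramQuantileSketch {

    // upperBounds:      finite bucket boundaries in seconds, ascending (the `le` labels)
    // cumulativeCounts: number of observations <= each boundary
    // total:            count in the +Inf bucket, i.e. all observations
    static double estimate(double q, double[] upperBounds,
                           long[] cumulativeCounts, long total) {
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                long below = (i == 0) ? 0L : cumulativeCounts[i - 1];
                long inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) {
                    return upperBounds[i];
                }
                // Linear interpolation: assume observations are spread
                // evenly between the bucket's lower and upper bound.
                return lower + (upperBounds[i] - lower) * ((rank - below) / inBucket);
            }
        }
        // The rank falls in the +Inf bucket: the best available answer is
        // the highest finite boundary.
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {0.05, 0.1, 0.2, 0.5, 1.0, 5.0};
        long[] counts = {600, 700, 880, 930, 960, 1000}; // hypothetical
        System.out.println("p50 ~ " + estimate(0.50, bounds, counts, 1000));
        System.out.println("p95 ~ " + estimate(0.95, bounds, counts, 1000));
        System.out.println("p99 ~ " + estimate(0.99, bounds, counts, 1000));
    }
}
```

With these counts the p99 estimate falls inside the 1s to 5s bucket, so its precision is limited by how wide that bucket is. If the tail matters, the bucket boundaries around it deserve as much care as the query.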
At 10,000 fare requests per hour during Friday evening, a p99 of 4,200ms means that 1% of requests, about 100 per hour, take 4.2 seconds or longer. That is 100 riders who see a spinner, question whether the app is working, and consider the competitor’s app sitting right next to it on their home screen.
The Fix
Track percentiles, not averages. Every metric in this book uses p50, p95, and p99. Spring Boot Actuator collects its metrics through Micrometer, and this Micrometer configuration makes percentile tracking possible:
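```java
// SCALED: Micrometer configuration for percentile tracking
import java.time.Duration;

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
// Package as of Spring Boot 2.x/3.x
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCustomizer() {
        return registry -> registry.config()
            .meterFilter(new MeterFilter() {
                @Override
                public DistributionStatisticConfig configure(
                        Meter.Id id,
                        DistributionStatisticConfig config) {
                    if (id.getName().startsWith("http.server.requests")) {
                        return DistributionStatisticConfig.builder()
                            .percentiles(0.5, 0.95, 0.99)
                            .percentilesHistogram(true)
                            // For timers, Micrometer expects SLO boundaries
                            // in nanoseconds, not seconds.
                            .serviceLevelObjectives(
                                Duration.ofMillis(50).toNanos(),
                                Duration.ofMillis(100).toNanos(),
                                Duration.ofMillis(200).toNanos(),
                                Duration.ofMillis(500).toNanos(),
                                Duration.ofSeconds(1).toNanos(),
                                Duration.ofSeconds(5).toNanos()
                            )
                            .build()
                            .merge(config);
                    }
                    return config;
                }
            });
    }
}
```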
The SLO buckets (50ms, 100ms, 200ms, 500ms, 1s, 5s) enable Prometheus to calculate the percentage of requests that meet each threshold. The Grafana panel that matters most shows a single line: “percentage of fare estimate requests completing under 500ms.” When that line dips below 99.9%, something is wrong.
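The arithmetic behind that panel is a ratio of two cumulative bucket counts: requests at or below the 500ms boundary divided by all requests. A minimal sketch of the check, with hypothetical names and sample counts, and with the 99.9% target passed in as a parameter:

```java
// Sketch of SLO attainment computed from histogram bucket counts.
// Names and sample numbers are hypothetical.
final class SloAttainmentSketch {

    // Fraction of requests that completed at or below the threshold bucket.
    static double fractionWithin(long countAtOrBelowThreshold, long totalCount) {
        if (totalCount == 0) {
            return 1.0; // no traffic: treated as compliant (a policy choice)
        }
        return (double) countAtOrBelowThreshold / totalCount;
    }

    static boolean meetsTarget(long countAtOrBelowThreshold, long totalCount,
                               double target) {
        return fractionWithin(countAtOrBelowThreshold, totalCount) >= target;
    }

    public static void main(String[] args) {
        // Hypothetical window: 930 of 1,000 requests finished within 500ms.
        System.out.println(fractionWithin(930, 1000));     // 0.93
        System.out.println(meetsTarget(930, 1000, 0.999)); // false
    }
}
```

Because the ratio only needs the counts at one boundary, it stays exact where a percentile estimate has to interpolate, provided the threshold matches one of the configured SLO buckets.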
The Proof
With percentile tracking enabled, the monitoring dashboard now shows three lines instead of one:
| Metric | Before (avg only) | After (percentiles) |
|---|---|---|
| Dashboard value | 120ms (avg) | p50: 18ms, p95: 890ms, p99: 4,200ms |
| Alerts triggered | None | p99 > 2,000ms fires during Friday surge |
| Time to detect surge issue | Never (discovered via support tickets) | 15 seconds (Prometheus scrape interval) |
The average is still 120ms. Nothing changed about the system’s behavior. What changed is the team’s ability to see the problem before riders report it.
Coordinated Omission
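There is a subtler lie hiding in load test results. Locust, by default, measures the time from request start to response. Each simulated user sends its next request only after the previous one has returned and its wait time has elapsed. If a request takes 5 seconds and Locust’s wait time is 1-3 seconds, that user’s next request starts about 5 seconds late. The 5-second response time is recorded once, but the requests the user would have sent during those 5 seconds are never issued, so the latency they would have experienced never reaches the results.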
This is coordinated omission. The load test coordinates with the slow system by backing off when it should be piling on. Real users do not back off. When the fare estimate takes 5 seconds, the rider taps the button again. The system receives more load when it is already struggling.
Gil Tene named this problem. Its consequence is that naive load test results undercount tail latency. Locust handles this better than many tools because each simulated user operates independently, so a slow response stalls only the user that sent it, but the effect still exists. The mitigation: always run Locust with enough users that the system is saturated, and watch the failure rate column in the output. A 0% failure rate at high load usually means you are not pushing hard enough.
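One way to see how much the omission hides is to backfill the samples a stalled user never sent. The sketch below applies the same style of correction that Tene's HdrHistogram library offers through `recordValueWithExpectedInterval`, reimplemented here in plain Java so it stays self-contained. The 1,000ms expected interval and the sample run are assumptions for illustration, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of a coordinated-omission correction by backfilling missed samples.
final class CoordinatedOmissionSketch {

    // For every response slower than the expected interval between requests,
    // add the latencies the skipped requests would have seen:
    // latency - interval, latency - 2 * interval, ... down to one interval.
    static List<Long> backfill(List<Long> recordedMs, long expectedIntervalMs) {
        List<Long> corrected = new ArrayList<>();
        for (long latency : recordedMs) {
            corrected.add(latency);
            if (expectedIntervalMs <= 0) {
                continue; // no correction without a positive interval
            }
            for (long missed = latency - expectedIntervalMs;
                 missed >= expectedIntervalMs;
                 missed -= expectedIntervalMs) {
                corrected.add(missed);
            }
        }
        return corrected;
    }

    // Nearest-rank percentile over a copy of the values; percent is 1..100.
    static long percentile(List<Long> values, int percent) {
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(percent * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }

    // Assumed run: 99 fast responses and a single 5-second stall.
    static List<Long> sampleRun() {
        List<Long> run = new ArrayList<>(Collections.nCopies(99, 15L));
        run.add(5000L);
        return run;
    }

    public static void main(String[] args) {
        List<Long> raw = sampleRun();
        List<Long> corrected = backfill(raw, 1000L);
        System.out.println("raw p99       = " + percentile(raw, 99) + "ms");       // 15ms
        System.out.println("corrected p99 = " + percentile(corrected, 99) + "ms"); // 4000ms
    }
}
```

In this run the raw p99 is 15ms because the stall appears as one sample out of 100. After backfilling the four requests that were never sent, the p99 is 4,000ms, which is much closer to what a rider tapping the button during the stall would have experienced.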
For this book, every Locust test runs at load levels that produce measurable p99 degradation. If the p99 does not move, the test is not stressing the right bottleneck.