surviving the spike

Measuring Before Optimizing: Latency Percentiles, Throughput, and the Lies Averages Tell

Chapter 1 of 66

Measuring Before Optimizing

Before you tune a connection pool, add a cache, or split a monolith, you need a number. Not a guess. Not a feeling. A number that describes exactly how your system behaves under load right now, before you change anything.

This book is about a ride-hailing platform. Riders open an app, request a ride, get matched with a driver, see a fare, take the trip, and view their history. Behind the screen: driver location ingestion at thousands of updates per second, a matching algorithm, a fare calculator that accounts for surge pricing, a trip history service, and a real-time location stream. Every chapter uses this system. Every Locust script targets it. Every failure scenario happens inside it.

Three opinions run through every chapter. Here they are, stated up front.

Redis is the definitive caching layer. Not Memcached, not Hazelcast, not Caffeine alone. Redis, used correctly, handles HTTP cache coordination, query result caching, computed aggregate storage, rate limiting state, and cross-instance pub/sub. Every caching chapter in this book uses Redis. When a different tool fits a narrow case better, the narrow case is stated, then the book returns to Redis.

Reactive over blocking for I/O-bound services. Spring WebFlux with Project Reactor is the correct default for services that spend most of their time waiting on network I/O. Spring MVC with thread-per-request is not wrong in absolute terms, but when the bottleneck is I/O, blocking 200 threads to wait for database responses is waste. The math is shown throughout.

Measure first, always. An opinion without a number is a preference. A number without a baseline is noise. Every optimization in this book starts with a Locust test, establishes the current state, applies the fix, and re-runs the test. The delta is the proof.

Why Averages Lie

The fare calculation endpoint averages 120ms. The team reports this in standup. The product manager is satisfied. The SRE who got paged at 3am because riders in surge-pricing zones experienced 8-second load times is not satisfied.

The average of 120ms is real. It is also useless for understanding user experience under load. Here is why.

Consider 1,000 requests to the fare calculation endpoint during a Friday evening surge:

  • 950 requests complete in 80ms (cache hit, surge multiplier already computed)
  • 40 requests complete in 400ms (cache miss, recalculate surge from driver locations)
  • 10 requests complete in 4,200ms (cache miss + connection pool exhaustion + PostgreSQL cold query)

The average: $(950 \times 80 + 40 \times 400 + 10 \times 4200) / 1000 = 134{,}000 / 1000 = 134\text{ms}$

The average says 134ms, close to the 120ms the team reports. The 10 riders who waited 4.2 seconds, watched their app spinner, and switched to a competitor are not represented by that number.

Percentiles Reveal What Averages Hide

Percentiles sort every request by duration and tell you: “X% of requests were faster than this value.”

  • p50 (median): 80ms. Half of all requests complete in 80ms or less. This is the typical experience.
  • p95: 400ms. 95% of requests complete in 400ms or less. This is the experience of the unlucky 5%.
  • p99: 4,200ms. 99% of requests complete in 4,200ms or less. This is the experience of the 1% who will write negative reviews and churn.

At 10,000 requests per minute, p99 = 4,200ms means 100 riders per minute are waiting over 4 seconds. That is 6,000 riders per hour during peak. The average says everything is fine. The p99 says you have a problem.
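The arithmetic above can be checked in a few lines of Python. This is a minimal sketch, not part of the book's load-tests/ directory: the `percentile` helper is a hypothetical name, and it uses an upper-rank convention (take the sample at or just above the requested rank). The convention matters here because the three tiers sit exactly on the 95% and 99% boundaries; a nearest-rank or interpolating definition would report lower values for p95 and p99 on this same data.

```python
import math
from statistics import mean

# The 1,000-request sample from the Friday evening surge, in milliseconds.
durations_ms = [80] * 950 + [400] * 40 + [4200] * 10


def percentile(samples, p):
    """Upper-rank percentile: the sample at or just above rank p (0 < p < 1).

    Illustrative helper; monitoring tools each use their own convention.
    """
    ordered = sorted(samples)
    index = math.ceil(p * (len(ordered) - 1))
    return ordered[index]


average_ms = mean(durations_ms)
p50, p95, p99 = (percentile(durations_ms, p) for p in (0.50, 0.95, 0.99))

# Share of requests in the slowest tier, scaled to 10,000 requests per minute.
slow_share = sum(1 for d in durations_ms if d >= 4200) / len(durations_ms)
slow_per_minute = round(slow_share * 10_000)

print(f"average={average_ms:.0f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
print(f"riders waiting 4.2s or more: {slow_per_minute}/min, {slow_per_minute * 60}/hour")
```

The single average collapses three very different experiences into one figure; the percentiles keep them apart.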

For the ride-hailing platform, these percentiles map to different failure modes:

Percentile  Typical cause                                           Who feels it
p50         Normal request path, cache hits                         Nobody complains
p95         Cache misses, moderate database load                    Some riders notice delay
p99         Connection pool exhaustion, GC pauses, lock contention  Riders abandon, support tickets spike
p99.9       Cascading failures, timeout storms                      Outage territory

Throughput and Latency Are Not Friends

Throughput measures how many requests the system handles per second. Latency measures how long each request takes. Optimizing one often degrades the other.

The rider API handles 2,000 requests per second with a p99 of 200ms. A developer adds an in-memory cache for fare estimates and throughput jumps to 3,500 RPS. The p99 stays at 200ms. Good trade.

A different developer enables eager fetching on the trip history entity to “reduce round trips to the database.” Throughput drops to 1,200 RPS because each query now returns 10x the data. The p99 climbs to 1,800ms because the larger result sets saturate the connection pool. Bad trade, made without measurement.

The relationship is nonlinear. As throughput approaches the system’s maximum capacity, latency does not increase linearly. It follows a hockey stick curve. At 50% capacity, p99 is 200ms. At 80% capacity, p99 is 600ms. At 95% capacity, p99 is 4,000ms. The system has not crashed. The metrics dashboard still shows green for average latency. But 1% of users are experiencing a broken product.
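One way to see why the curve bends is a textbook single-server queueing model (M/M/1), where mean response time is $S / (1 - \rho)$ for service time $S$ and utilization $\rho$. This is an illustrative assumption, not a model of the ride-hailing platform: the 100ms service time is hypothetical, the formula gives mean rather than p99 latency, and real services have many workers and non-exponential service times. The shape is what carries over.

```python
def mm1_response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


SERVICE_TIME_MS = 100  # hypothetical time to serve one request with no queueing

for utilization in (0.50, 0.80, 0.95, 0.99):
    latency = mm1_response_time_ms(SERVICE_TIME_MS, utilization)
    print(f"{utilization:.0%} capacity -> {latency:,.0f}ms mean response time")
```

In this model, going from 50% to 80% utilization multiplies response time by 2.5, and going from 80% to 95% multiplies it by another 4. Each additional point of utilization costs more than the last, which is the hockey stick.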

This is why every chapter in this book establishes a baseline before changing anything. Without a baseline, you cannot distinguish a good trade from a bad one.
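The two trades described in this section can be scored mechanically once a baseline exists. A minimal sketch using the figures quoted above; the `Measurement` type and `describe_trade` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    rps: float     # sustained throughput, requests per second
    p99_ms: float  # 99th percentile latency, milliseconds


def describe_trade(baseline: Measurement, candidate: Measurement) -> str:
    """Report relative change in throughput and p99 against the baseline."""
    rps_change = (candidate.rps - baseline.rps) / baseline.rps
    p99_change = (candidate.p99_ms - baseline.p99_ms) / baseline.p99_ms
    return f"throughput {rps_change:+.0%}, p99 {p99_change:+.0%}"


baseline = Measurement(rps=2000, p99_ms=200)
with_cache = Measurement(rps=3500, p99_ms=200)
with_eager_fetch = Measurement(rps=1200, p99_ms=1800)

print("in-memory cache:", describe_trade(baseline, with_cache))
print("eager fetching: ", describe_trade(baseline, with_eager_fetch))
```

Neither line can be computed without the first `Measurement`. That is the practical meaning of "baseline".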

The Locust Baseline

Locust is a Python load testing framework that simulates user behavior. Every chapter in this book includes a Locust scenario. The scenarios live in a load-tests/ directory and target the ride-hailing platform’s endpoints.

The baseline scenario models three user types:

# load-tests/baseline_locustfile.py
from locust import HttpUser, task, between, tag

class RiderUser(HttpUser):
    weight = 6  # 6 of every 10 simulated users are riders (user share, not request share)
    wait_time = between(1, 3)

    @tag("rider")
    @task(3)
    def search_drivers(self):
        """Rider searches for available drivers near their location."""
        self.client.get(
            "/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 5},
            name="/api/drivers/nearby"
        )

    @tag("rider")
    @task(2)
    def request_fare_estimate(self):
        """Rider requests a fare estimate for a trip."""
        self.client.post(
            "/api/fares/estimate",
            json={
                "pickup_lat": 40.7128,
                "pickup_lng": -74.0060,
                "dropoff_lat": 40.7580,
                "dropoff_lng": -73.9855
            },
            name="/api/fares/estimate"
        )

    @tag("rider")
    @task(1)
    def view_trip_history(self):
        """Rider views their past trips."""
        self.client.get(
            "/api/trips/history",
            headers={"X-User-Id": "rider-1234"},
            name="/api/trips/history"
        )


class DriverUser(HttpUser):
    weight = 3  # 3 of every 10 simulated users are drivers
    wait_time = between(2, 5)

    @tag("driver")
    @task(5)
    def update_location(self):
        """Driver sends location update every few seconds."""
        self.client.post(
            "/api/drivers/location",
            json={
                "driver_id": "driver-5678",
                "lat": 40.7128 + (self.environment.runner.user_count * 0.0001),
                "lng": -74.0060,
                "timestamp": "2026-05-23T19:00:00Z"
            },
            name="/api/drivers/location"
        )

    @tag("driver")
    @task(1)
    def check_ride_requests(self):
        """Driver checks for incoming ride requests."""
        self.client.get(
            "/api/drivers/requests",
            headers={"X-Driver-Id": "driver-5678"},
            name="/api/drivers/requests"
        )


class AdminUser(HttpUser):
    weight = 1  # 1 of every 10 simulated users is admin/analytics
    wait_time = between(5, 15)

    @tag("admin")
    @task(1)
    def view_zone_stats(self):
        """Admin views zone-level statistics."""
        self.client.get(
            "/api/admin/zones/stats",
            name="/api/admin/zones/stats"
        )

Run the baseline:

locust -f load-tests/baseline_locustfile.py \
    --host=http://localhost:8080 \
    --users 200 \
    --spawn-rate 10 \
    --run-time 5m \
    --headless \
    --csv=load-tests/results/baseline

The baseline output for the unoptimized ride-hailing platform (all latency columns in milliseconds):

Name                      # reqs   Avg   Med   Min    Max    p95    p99    RPS  Fail%
/api/drivers/nearby         1842   145    98    12   4210    420   2100   6.14   0.0%
/api/fares/estimate         1228   210   130    18   8400    890   4200   4.09   0.2%
/api/trips/history           614   680   450    45  12000   2800   8400   2.05   1.1%
/api/drivers/location       1535    45    32     8    980    180    650   5.12   0.0%
/api/drivers/requests        307    62    40    10   1200    250    800   1.02   0.0%
/api/admin/zones/stats       102   890   600    80  15000   4200  12000   0.34   2.9%
Aggregated                  5628   224   110     8  15000   1200   4800  18.76   0.4%

These numbers are the starting point. Every chapter that follows will target a specific row in this table, show why the number is bad, fix it, and re-run to show the delta.

The trip history endpoint at p99 = 8,400ms is the first target. The admin zone stats endpoint at p99 = 12,000ms is the worst, but admin users make up only 10% of simulated users and under 2% of requests (102 of 5,628). The fare estimate at p99 = 4,200ms affects riders directly and is the highest-impact problem.
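Ranking the rows does not have to be done by eye. The `--csv` flag in the run command writes a stats file alongside the console output, and a short script can sort it. This is a sketch under assumptions: the function name is hypothetical, and the column names (`Name`, `99%`, `Requests/s`) are those written by recent Locust 2.x releases, so check the header of your own file before relying on them.

```python
import csv
import io


def rank_by_p99(stats_csv_text: str) -> list[tuple[str, float, float]]:
    """Return (endpoint, p99_ms, requests_per_second) sorted by p99, worst first.

    Assumes the "Name", "99%", and "Requests/s" columns of Locust's
    <prefix>_stats.csv; verify against the header of your own file.
    """
    rows = csv.DictReader(io.StringIO(stats_csv_text))
    endpoints = [
        (row["Name"], float(row["99%"]), float(row["Requests/s"]))
        for row in rows
        if row["Name"] != "Aggregated"  # skip the summary row
    ]
    return sorted(endpoints, key=lambda item: item[1], reverse=True)


# Trimmed sample using three rows from the baseline table.
sample = """Name,Requests/s,99%
/api/fares/estimate,4.09,4200
/api/trips/history,2.05,8400
/api/admin/zones/stats,0.34,12000
Aggregated,18.76,4800
"""

for name, p99_ms, rps in rank_by_p99(sample):
    print(f"{name:<26} p99={p99_ms:>7.0f}ms  rps={rps:.2f}")
```

Against a real run, pass in the file contents, for example `rank_by_p99(Path("load-tests/results/baseline_stats.csv").read_text())`. Sorting by p99 alone puts the admin endpoint first; the request rate printed next to it is what lets you weigh severity against how many users are affected.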

Prometheus and Grafana for Continuous Measurement

Locust measures from the outside. Prometheus measures from the inside. Both are necessary.

The ride-hailing platform exposes metrics via Spring Boot Actuator and Micrometer:

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,info
  metrics:
    tags:
      application: ride-hailing
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
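        http.server.requests: 50ms,100ms,200ms,500ms,1s,5s

// SCALED: Custom metric for fare calculation duration
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;
import reactor.core.publisher.Mono;

@Component
public class FareCalculationMetrics {

    private final Timer fareCalculationTimer;

    public FareCalculationMetrics(MeterRegistry registry) {
        this.fareCalculationTimer = Timer.builder("fare.calculation.duration")
            .description("Time spent calculating fare including surge pricing")
            .publishPercentiles(0.5, 0.95, 0.99)
            .publishPercentileHistogram()
            .register(registry);
    }

    public <T> Mono<T> timed(Mono<T> operation) {
        // Defer so the clock starts at subscription time, not when the Mono is assembled.
        return Mono.defer(() -> {
            Timer.Sample sample = Timer.start();
            // doOnTerminate fires on completion or error; a cancelled subscription is not recorded.
            return operation.doOnTerminate(() -> sample.stop(fareCalculationTimer));
        });
    }
}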

The Prometheus query that tells you whether your service is meeting its latency target:

histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket{
    application="ride-hailing",
    uri="/api/fares/estimate"
  }[5m])) by (le)
)

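This number, refreshed on every Prometheus scrape (every 15 seconds in this setup), is the closest thing to ground truth. It is an estimate interpolated from histogram buckets, so its precision depends on the bucket boundaries. Not the average. Not what the developer thinks happens. What actually happens, measured at the 99th percentile, over a 5-minute window.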
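It helps to know how `histogram_quantile` arrives at its answer, because it never sees individual request durations. The sketch below is a simplified Python rendering of the approach, assuming cumulative bucket counts and linear interpolation inside the bucket that contains the target rank. The function name and the bucket counts are hypothetical, and the boundaries mirror only the `slo` values from the configuration; a real Micrometer histogram publishes more buckets.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound,
    ending with the math.inf bucket that counts every observation.
    """
    total = buckets[-1][1]
    if not 0 < q <= 1 or total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls past the last finite boundary: the best available
                # answer is that boundary itself.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for 1,000 requests.
buckets = [(0.05, 100), (0.1, 600), (0.2, 900), (0.5, 980), (1.0, 985),
           (5.0, 1000), (math.inf, 1000)]

print(f"estimated p99: {estimate_quantile(0.99, buckets):.2f}s")
```

With these counts the 990th request falls in the 1s to 5s bucket, one third of the way through its 15 observations, so the estimate is about 2.33s. If those 15 requests had all taken 4.2s, the estimate would be unchanged: the histogram only knows they landed between 1s and 5s. Adding a boundary near the latency you care about tightens the estimate.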

The Grafana dashboard for the baseline has four panels:

  1. Request rate (RPS) per endpoint, stacked area chart
  2. p99 latency per endpoint, line chart with SLO threshold at 500ms
  3. Error rate per endpoint, line chart with threshold at 0.1%
  4. Connection pool utilization, HikariCP active connections vs max

This dashboard is referenced in every subsequent chapter. When a chapter claims an optimization reduced p99 from 4,200ms to 180ms, the dashboard is the evidence.

What Comes Next

The baseline is established. Under 200 simulated users, the ride-hailing platform sustains 18.76 RPS aggregate with a p99 of 4,800ms. The fare estimate endpoint is slow. The trip history endpoint is slower. The admin zone stats endpoint is catastrophic.

Every chapter that follows picks one of these problems, explains the mechanical cause, and fixes it with code. The Locust baseline re-runs after each fix. The delta is the proof.

Chapter 2 traces a single request through every layer of the system to understand where the milliseconds go. Start there.