surviving the spike

Defining SLOs for the Ride-Hailing Platform

Chapter 50 of 66

The Symptom

The engineering team defines SLOs for every service. The surge pricing service gets a 99.9% availability SLO. The driver analytics dashboard gets a 99.9% latency SLO. The internal admin API gets a 99.95% availability SLO.

Three months later, the error budget dashboard shows all SLOs at 100% budget remaining. Not because the services are perfect, but because nobody is measuring the SLIs correctly. The surge pricing SLO measures internal health checks, not actual surge calculations. The analytics dashboard SLO counts page loads, including the blank loading state. The admin API has 3 requests per hour, making the SLI statistically meaningless.

The team spent time defining SLOs that measure nothing useful.
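
To see why 3 requests per hour makes the admin API SLI statistically meaningless, it helps to work through the arithmetic. The sketch below is only illustrative: the function names are hypothetical, and it assumes a constant 3 requests per hour over a 30-day window with the 99.95% target mentioned above.

```python
# Illustrative arithmetic only; function names are hypothetical.

def requests_in_window(requests_per_hour: float, days: int = 30) -> float:
    """Total requests seen over the window at a constant hourly rate."""
    return requests_per_hour * 24 * days


def allowed_failures(total_requests: float, slo_target: float) -> float:
    """Number of failed requests the error budget tolerates."""
    return total_requests * (1 - slo_target)


total = requests_in_window(3)             # admin API: 3 requests per hour
budget = allowed_failures(total, 0.9995)  # admin API target: 99.95%

print(f"requests in 30 days: {total:.0f}")
print(f"allowed failures:    {budget:.2f}")

# One failure in an hour that only saw 3 requests drops that hour to 2/3.
hourly_ratio_after_one_failure = 2 / 3
print(f"hourly SLI after one failure: {hourly_ratio_after_one_failure:.3f}")
```

With roughly one allowed failure per month, the SLI sits at 100% budget until a single failed request consumes almost all of it, so it carries very little signal.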

The Cause

SLOs fail when they measure the wrong thing. Three common mistakes:

  1. Measuring infrastructure instead of user experience: “The database is available” is not an SLI. “The user can complete a ride request” is.
  2. Measuring all traffic equally: Health checks, readiness probes, and admin endpoints should not be in the SLI calculation. They dilute the signal.
  3. Setting targets without understanding the current baseline: Choosing 99.99% because it sounds good, when the service currently runs at 99.5%.
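
The dilution in mistake 2 can be made concrete with a few numbers. This is a minimal sketch with made-up counts (1,000 rider requests, 50 failures, 9,000 health checks that always succeed); none of these figures come from the platform itself.

```python
# Hypothetical counts, chosen only to illustrate dilution.

def availability(good: int, total: int) -> float:
    """Availability SLI as good events divided by total events."""
    return good / total


rider_total = 1_000    # rider-facing requests
rider_failed = 50      # rider-facing requests that returned 5xx
probe_total = 9_000    # health checks, assumed to always succeed

rider_good = rider_total - rider_failed

# SLI computed over rider-facing traffic only.
rider_only = availability(rider_good, rider_total)

# SLI computed over all traffic, health checks included.
diluted = availability(rider_good + probe_total, rider_total + probe_total)

print(f"rider-only availability: {rider_only:.3f}")
print(f"diluted availability:    {diluted:.3f}")
```

In this example the rider-only SLI is 0.950 while the diluted SLI reads 0.995, so the same incident looks ten times smaller once health checks are counted.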

A meaningful SLI answers: “Did the user get what they wanted, fast enough, and correctly?” For the ride-hailing platform, the user wants:

  • To request a ride and get a driver (availability)
  • To get a fare estimate in under half a second (latency)
  • To be charged the correct amount (correctness)

The Baseline

Current SLI measurements:

Service              SLI Type      What It Measures            Problem
Rider API            Availability  All HTTP 200s               Includes health checks
Surge Pricing        Availability  Internal health endpoint    Not measuring surge calcs
Driver Analytics     Latency       Page load (empty state)     Not measuring data load
Admin API            Availability  All responses               3 req/hour, meaningless
Fare Service         None          Nothing                     Not measured

Target SLI definitions:

Service              SLI Type      What It Measures                    Excludes
Rider API            Latency       Ride request < 500ms                Health checks, admin
Rider API            Availability  Ride request non-5xx                Health checks, admin
Fare Service         Correctness   Fare within expected range          Test requests
Fare Service         Latency       Fare estimate < 200ms               Test requests

The Fix

SLI Selection: The Three Proportions

Every SLI is a proportion: good events divided by total events.

SLI Type       Good Event                              Total Event
Latency        Request completed in < threshold         All requests
Availability   Request completed without server error   All requests
Correctness    Fare within ±5% of expected value        All fare calculations
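
As a rough illustration of the three proportions, the sketch below computes each SLI from a handful of in-memory events. The `RequestEvent` type, the function names, and the sample data are all hypothetical; in production these ratios come from metrics, not Python lists.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class RequestEvent:
    duration_s: float  # request duration in seconds
    status: int        # HTTP status code


def ratio(good: int, total: int) -> Optional[float]:
    """Good events divided by total events; None when there is no traffic."""
    return good / total if total else None


def latency_sli(events: Sequence[RequestEvent], threshold_s: float = 0.5) -> Optional[float]:
    # Inclusive comparison, mirroring how a histogram bucket le="0.5" counts.
    good = sum(1 for e in events if e.duration_s <= threshold_s)
    return ratio(good, len(events))


def availability_sli(events: Sequence[RequestEvent]) -> Optional[float]:
    # Good means "no server error": anything that is not a 5xx.
    good = sum(1 for e in events if e.status < 500)
    return ratio(good, len(events))


def correctness_sli(fares: Sequence[Tuple[float, float]], tolerance: float = 0.05) -> Optional[float]:
    # Each fare is (charged, expected); good means within ±5% of expected.
    good = sum(1 for charged, expected in fares
               if abs(charged - expected) <= tolerance * expected)
    return ratio(good, len(fares))


events = [
    RequestEvent(0.12, 200), RequestEvent(0.48, 200), RequestEvent(0.75, 200),
    RequestEvent(0.90, 200), RequestEvent(0.30, 503),
]
fares = [(10.0, 10.2), (25.0, 20.0), (8.0, 8.0), (15.6, 15.0)]

print(latency_sli(events), availability_sli(events), correctness_sli(fares))
```

Returning `None` for an empty window is a design choice in this sketch: a window with no traffic has no SLI, which is different from an SLI of 0 or 1.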

Prometheus Recording Rules

# SCALED: Recording rules for ride-hailing SLIs
groups:
  - name: ride_hailing_slis
    interval: 30s
    rules:
      # ============================
      # LATENCY SLI: Rider API
      # ============================
      # Good: requests completed within 500ms (needs a histogram bucket at le="0.5")
      # Total: all rider-facing requests (health checks excluded by the uri filter)

      - record: sli:rider_api:latency:good_total5m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[5m]))

      - record: sli:rider_api:latency:total5m
        expr: |
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[5m]))

      - record: sli:rider_api:latency:ratio5m
        expr: |
          sli:rider_api:latency:good_total5m
          /
          sli:rider_api:latency:total5m

      # 30-minute window
      - record: sli:rider_api:latency:ratio30m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[30m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[30m]))

      # 1-hour window
      - record: sli:rider_api:latency:ratio1h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[1h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[1h]))

      # 6-hour window
      - record: sli:rider_api:latency:ratio6h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[6h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[6h]))

      # ============================
      # AVAILABILITY SLI: Rider API
      # ============================
      - record: sli:rider_api:availability:ratio5m
        expr: |
          1 - (
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*"
            }[5m]))
          )

      - record: sli:rider_api:availability:ratio1h
        expr: |
          1 - (
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*",
              status=~"5.."
            }[1h]))
            /
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*"
            }[1h]))
          )

      # ============================
      # CORRECTNESS SLI: Fare Service
      # ============================
      - record: sli:fare:correctness:ratio5m
        expr: |
          sum(rate(fare_calculation_accurate_total{
            service="fare-service"
          }[5m]))
          /
          sum(rate(fare_calculation_total{
            service="fare-service"
          }[5m]))

The uri=~"/api/rides/.*" filter excludes health checks (/health), readiness probes (/ready), and admin endpoints (/admin/*). Only rider-facing traffic counts toward the SLO.
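
Prometheus anchors `=~` patterns at both ends, which is why the filter cannot accidentally match `/health`. The sketch below mimics that behavior with Python's `re.fullmatch` so the filter can be checked against sample paths. The helper name and the sample paths are hypothetical, and Prometheus uses RE2 syntax, so this is only an approximation that holds for simple patterns like this one.

```python
import re

# Same pattern as the recording rules' uri matcher.
SLI_URI_PATTERN = re.compile(r"/api/rides/.*")


def counts_toward_sli(uri: str) -> bool:
    """True when the uri would be selected by uri=~"/api/rides/.*"."""
    # fullmatch mirrors Prometheus, which anchors the regex at both ends.
    return SLI_URI_PATTERN.fullmatch(uri) is not None


sample_uris = [
    "/api/rides/request",
    "/api/rides/123/status",
    "/health",
    "/ready",
    "/admin/users",
    "/api/rides",               # no trailing slash, so it does not match
    "/internal/api/rides/123",  # prefix before /api, so it does not match
]

for uri in sample_uris:
    print(f"{uri:28} {'in SLI' if counts_toward_sli(uri) else 'excluded'}")
```

One consequence worth noticing: a route served at exactly `/api/rides` (no trailing segment) would be excluded too, so the pattern should be checked against the service's real route list.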

Why Multiple Windows Matter

A single-window SLI is vulnerable to edge effects. Computed over only 5 minutes, a brief spike looks catastrophic. Computed over only 6 hours, a brief spike is invisible, and a real degradation takes hours to surface.
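
The edge effect is easy to reproduce with a toy model. The sketch below assumes uniform traffic, a single 2-minute spike with a 50% error rate, and no errors otherwise; the function name and the spike parameters are hypothetical.

```python
def window_ratio(window_s: float, spike_s: float, spike_error_rate: float) -> float:
    """SLI ratio over a window that contains one error spike.

    Assumes uniform traffic and zero errors outside the spike.
    """
    bad_fraction = min(spike_s, window_s) * spike_error_rate / window_s
    return 1 - bad_fraction


SPIKE_S = 120           # hypothetical 2-minute spike
SPIKE_ERROR_RATE = 0.5  # half of the requests fail during the spike

for label, window_s in [("5m", 300), ("30m", 1800), ("1h", 3600), ("6h", 21600)]:
    ratio = window_ratio(window_s, SPIKE_S, SPIKE_ERROR_RATE)
    print(f"{label:>3}: {ratio:.4f}")
```

Under these assumptions the 5m window reads 0.80 while the 6h window still reads about 0.997, so the same spike looks like a disaster in one window and like background noise in the other.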

Multiple windows serve different purposes:

Window    Purpose                                                      Used By
5m        Short-term validation (is it still happening right now?)     Fast burn short window
30m       Recent trend confirmation                                    Slow burn short window
1h        Sustained impact detection                                   Fast burn long window
6h        Gradual degradation detection                                Slow burn long window

The 5m and 30m windows are short validation windows. They confirm the problem is current, not historical. The 1h and 6h windows are long detection windows. They confirm the problem is significant, not a blip. Alerting rules pair one long window with one short window (CH17-S2).
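
A minimal sketch of how a long window and a short window might be paired. The burn-rate thresholds used here (14.4 for the 1h/5m pair, 6 for the 6h/30m pair) are an assumption taken from commonly cited multi-window burn-rate guidance, not from this chapter, and the function names are hypothetical.

```python
def burn_rate(sli_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return (1 - sli_ratio) / (1 - slo_target)


def should_alert(long_ratio: float, short_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return (burn_rate(long_ratio, slo_target) >= threshold
            and burn_rate(short_ratio, slo_target) >= threshold)


SLO = 0.999
FAST_BURN = 14.4  # assumed threshold for the 1h + 5m pair
SLOW_BURN = 6.0   # assumed threshold for the 6h + 30m pair

# Sustained problem: both 1h and 5m are bad, so the fast-burn alert fires.
print(should_alert(long_ratio=0.98, short_ratio=0.97, slo_target=SLO, threshold=FAST_BURN))

# Brief blip: 5m is bad but 1h is fine, so nothing fires.
print(should_alert(long_ratio=0.9995, short_ratio=0.90, slo_target=SLO, threshold=FAST_BURN))

# Already recovered: 1h is still bad but 5m is clean, so nothing fires.
print(should_alert(long_ratio=0.98, short_ratio=1.0, slo_target=SLO, threshold=FAST_BURN))
```

Requiring both windows is the point of the pairing: the long window filters out blips, and the short window stops the alert from firing after the problem has ended.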

Error Budget Calculation

SLO Target    Error Budget Rate    30-Day Budget (time)    30-Day Budget (requests at 100 RPS)
99.9%         0.1%                 43.2 minutes            259,200 slow/failed requests
99.95%        0.05%                21.6 minutes            129,600 slow/failed requests
99.99%        0.01%                4.32 minutes            25,920 slow/failed requests
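
The table values follow directly from the target and the window length. The sketch below reproduces them; the function names are hypothetical, and the 100 RPS figure is the same assumption the table uses.

```python
def budget_minutes(slo_target: float, days: int = 30) -> float:
    """Allowed bad minutes in the window for a time-based SLO."""
    return (1 - slo_target) * days * 24 * 60


def budget_requests(slo_target: float, rps: float, days: int = 30) -> float:
    """Allowed slow or failed requests in the window at a constant request rate."""
    return (1 - slo_target) * rps * days * 24 * 60 * 60


for target in (0.999, 0.9995, 0.9999):
    minutes = budget_minutes(target)
    requests = budget_requests(target, rps=100)
    print(f"{target:.2%}  {minutes:6.2f} min  {requests:10,.0f} requests")
```

The same function gives the budget for looser targets: `budget_minutes(0.99)` is 432 minutes, or 7.2 hours per 30 days.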

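A 30-day error budget query. It assumes a sli:rider_api:latency:ratio30d recording rule, defined like the shorter windows above but over [30d]: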

# SCALED: Error budget remaining for rider API latency SLO
1 - (
  (1 - sli:rider_api:latency:ratio30d)    # actual error rate over 30 days
  /
  0.001                                    # allowed error rate (1 - 0.999)
)

If the result is 0.75, 75% of the error budget remains. If it drops below 0, the SLO is violated.
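
The same formula, worked through in Python with two made-up 30-day SLI values. The function name and the sample ratios are hypothetical; only the 99.9% target comes from the query above.

```python
def budget_remaining(sli_ratio: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative means violated)."""
    actual_error_rate = 1 - sli_ratio
    allowed_error_rate = 1 - slo_target
    return 1 - actual_error_rate / allowed_error_rate


# 0.025% of requests were slow: a quarter of the 0.1% budget is spent.
healthy = budget_remaining(sli_ratio=0.99975, slo_target=0.999)

# 0.15% of requests were slow: the budget is overspent by half.
violated = budget_remaining(sli_ratio=0.9985, slo_target=0.999)

print(f"healthy:  {healthy:.2f}")
print(f"violated: {violated:.2f}")
```

The first case leaves about 0.75 of the budget, and the second comes out at about -0.5, which is the "below 0" condition that marks the SLO as violated.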

Vanity Metrics vs Meaningful SLOs

SLO                                     Meaningful?    Why?
Surge pricing 99.9% availability        No             Riders can book without surge pricing
Driver analytics 99.9% latency          No             Drivers don't need real-time analytics
Rider API 99.9% latency                 Yes            Directly affects ride request experience
Fare service 99.99% correctness         Yes            Wrong fares lose trust and revenue

The surge pricing service can be down for 10 minutes and riders still book rides. They just do not see surge pricing. That is a degraded experience, not an outage. The rider API going down for 10 minutes means nobody can request a ride. That is an outage.

Prioritizing SLOs

Priority    Service              SLO                           Engineering Investment
1           Rider API            99.9% latency < 500ms         High (auto-scaling, caching, circuit breakers)
2           Rider API            99.95% availability           High (multi-AZ, graceful degradation)
3           Fare Service         99.99% correctness            Medium (validation, reconciliation)
4           Driver API           99.5% latency < 1s            Low (batch-tolerant users)
5           Analytics Dashboard  99% availability              Minimal (internal tool)

Lower-priority services get looser SLOs and less engineering investment. The analytics dashboard at 99% availability gets 7.2 hours of allowed downtime per month. That is generous enough to deploy during business hours without worrying about SLO violations.

Error Budget as an Engineering Lever

The error budget is not just a measurement. It is a decision-making tool:

Budget Remaining    Action
> 75%               Ship features freely, take risks
50-75%              Normal development, monitor trends
25-50%              Slow feature work, prioritize reliability
< 25%               Feature freeze, all hands on reliability
0% (violated)       Postmortem required, mandatory reliability sprint

When the rider API has 90% budget remaining, the team ships a risky database migration without hesitation. When budget drops to 30%, the team postpones the migration and investigates the burn rate. When budget hits 0%, feature development stops until reliability is restored.
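
The policy table can be encoded as a small lookup so that dashboards and deploy tooling agree on the current state. This is a sketch, not an established implementation: the function name is hypothetical, and how the exact boundary values (25%, 50%, 75%) are assigned is an assumption, since the table does not say which band they belong to.

```python
def budget_action(remaining: float) -> str:
    """Map remaining error budget (1.0 = 100%) to the action from the policy table."""
    if remaining <= 0:
        return "Postmortem required, mandatory reliability sprint"
    if remaining < 0.25:
        return "Feature freeze, all hands on reliability"
    if remaining < 0.50:   # assumption: exactly 25% falls in this band
        return "Slow feature work, prioritize reliability"
    if remaining <= 0.75:  # assumption: exactly 50% and 75% fall in this band
        return "Normal development, monitor trends"
    return "Ship features freely, take risks"


for remaining in (0.90, 0.60, 0.30, 0.15, 0.0, -0.2):
    print(f"{remaining:>5.0%}  {budget_action(remaining)}")
```

Writing the boundaries down in code forces the team to decide them once, instead of arguing about whether 25% counts as a freeze in the middle of an incident.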

This converts “how reliable should we be?” from a philosophical debate into a data-driven discussion. Product managers see the budget gauge. They understand that shipping a risky feature when the budget is at 30% means accepting the possibility of a feature freeze if the budget burns below 25%.

The Proof

After defining correct SLIs, validate them against real traffic:

# SCALED: Validate SLI accuracy

# Step 1: Check SLI ratio for the last hour
sli:rider_api:latency:ratio1h

# Expected: 0.997-0.999 for a healthy system
# If you see 1.0, the SLI might not be measuring real traffic
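# If you see < 0.99, either the SLO target is too aggressive or the service has issues

# Step 2: Verify the SLI excludes health checks
# List every uri value the SLI selector matches.
# /health, /ready, and /admin/* must not appear in the result.
count by (uri) (
  http_server_requests_seconds_count{
    service="rider-api",
    uri=~"/api/rides/.*"
  }
)

# Step 3: Measure the dilution the filter prevents
# Health check rate relative to rider-facing rate. This is not expected
# to be 0: it is the share that would leak into an unfiltered denominator.
sum(rate(http_server_requests_seconds_count{
  service="rider-api",
  uri="/health"
}[5m]))
/
sum(rate(http_server_requests_seconds_count{
  service="rider-api",
  uri=~"/api/rides/.*"
}[5m]))

If the Step 3 ratio is above 0.01, health checks would inflate an unfiltered denominator by more than 1% and dilute the SLI. The filter is working correctly when the Step 2 result lists only /api/rides/ paths, which means health checks contribute 0% to the SLI calculation.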

Run Locust for 1 hour and verify the SLI tracks reality:

# SCALED: Locust for SLI validation
from locust import HttpUser, task, between

class SLIValidationUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(10)
    def ride_request(self):
        """Rider-facing traffic: should be in SLI"""
        self.client.post("/api/rides/request",
            json={
                "rider_id": "validation-rider",
                "pickup": {"lat": 40.7128, "lng": -74.0060},
                "dropoff": {"lat": 40.7589, "lng": -73.9851}
            },
            name="/api/rides/request"
        )

    @task(1)
    def health_check(self):
        """Infrastructure traffic: should NOT be in SLI"""
        self.client.get("/health", name="/health")

After 1 hour, compare:

  • Total requests to /api/rides/request in Locust: ~36,000
  • Requests implied by sli:rider_api:latency:total5m (a per-second rate, so average it over the hour and multiply by 3,600): ~36,000
  • Total requests to /health in Locust: ~3,600
  • Health check contribution to SLI: 0%
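
Because sli:rider_api:latency:total5m is a per-second rate, comparing it with Locust's request count needs a conversion. The sketch below shows one way to do that comparison. The function names, the 5% tolerance, and the sample rates are hypothetical; the average rate would come from something like `avg_over_time(sli:rider_api:latency:total5m[1h])`.

```python
def requests_from_rate(avg_rate_per_s: float, window_s: int = 3600) -> float:
    """Convert an average per-second rate into a request count for the window."""
    return avg_rate_per_s * window_s


def counts_agree(expected: float, observed: float, tolerance: float = 0.05) -> bool:
    """True when observed is within the relative tolerance of expected."""
    return abs(observed - expected) <= tolerance * expected


locust_ride_requests = 36_000  # from the Locust report for /api/rides/request

# Hypothetical average SLI rate when the filter works: about 10 req/s.
filtered = requests_from_rate(9.9)
print(counts_agree(locust_ride_requests, filtered))

# Hypothetical average rate if health checks leaked in: about 11 req/s.
leaky = requests_from_rate(11.0)
print(counts_agree(locust_ride_requests, leaky))
```

A small mismatch is expected because of scrape timing and rate extrapolation, which is why the check uses a tolerance. A gap close to 10% would match the health check share in this test and suggests the filter is not applied.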

The SLI measures what users experience, nothing more.