surviving the spike

SLOs, Error Budgets, and Escaping Alert Fatigue

Chapter 49 of 66

The Symptom

The on-call engineer’s phone buzzed 14 times last night. Twelve of those alerts were “p99 latency > 500ms” on the rider API. Each spike lasted 5-15 seconds, coinciding with deployments rolling through Kubernetes pods. The other two alerts were “error rate > 1%” triggered by a single failed health check that Kubernetes retried successfully.

None of the 14 alerts required action. The engineer checked each one, confirmed it was transient, and went back to sleep. This has happened every night for three weeks.

Last Tuesday, a real incident happened during a surge event. The rider API’s p99 climbed to 1.2 seconds and stayed there for 45 minutes. The on-call engineer’s phone buzzed. They assumed it was another deployment blip and silenced it. The incident was detected 38 minutes later when riders started calling support.

Alert fatigue killed the alert. The signal drowned in noise.

The Cause

Threshold-based alerting fires when a metric crosses a line. “Alert when p99 > 500ms” is a threshold alert. It treats a 5-second spike during a rolling deployment the same as a 45-minute degradation during a surge event. Both cross the threshold. Both fire the same alert. One requires action. The other does not.

The problem is not the threshold value. The problem is that threshold alerts have no concept of duration, magnitude, or user impact. A 5-second spike that affects 10 requests is noise. A 45-minute degradation that affects 50,000 requests is a real incident.

SLO-based alerting with burn rates solves this by asking a different question. Instead of “is the metric above a line?” it asks “at the current rate of failure, will we exhaust our error budget before the end of the SLO window?” A 5-second spike does not consume meaningful error budget. A 45-minute degradation does. The alert fires for the second case, not the first.

The Baseline

SLIs for the Ride-Hailing Platform

SLIs (Service Level Indicators) are the raw measurements:

SLI             Definition                                      Measurement
Latency         Proportion of requests < 500ms                  http_server_requests_seconds_bucket{le="0.5"}
Availability    Proportion of non-5xx responses                 1 - (5xx count / total count)
Correctness     Proportion of fare calcs within expected range  fare_calculation_accurate_total / fare_calculation_total

SLOs

SLOs (Service Level Objectives) set targets:

SLO                        Target         Error Budget (30 days)
Rider API latency          99.9% < 500ms  0.1% = 43.2 minutes
Rider API availability     99.95%         0.05% = 21.6 minutes
Fare correctness           99.99%         0.01% = 4.3 minutes

The error budget is the acceptable amount of failure. A 99.9% latency SLO means 0.1% of requests are allowed to exceed 500ms. Over 30 days at 100 RPS, that is 259,200 slow requests out of 259.2 million. Translated to time, it is the equivalent of 43.2 minutes in which every request is slow.
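The budget arithmetic can be reproduced with a short Python sketch. The function name `error_budget` and its parameters are illustrative rather than part of any library, and the 100 RPS default is the steady traffic level this chapter assumes.

```python
def error_budget(slo_target: float, window_days: int = 30, rps: float = 100.0):
    """Return (allowed_bad_requests, allowed_bad_minutes) for one SLO window.

    Assumes constant traffic at `rps` for the whole window (an assumption
    taken from the chapter's example, not a property of real traffic).
    """
    budget_fraction = 1.0 - slo_target
    window_seconds = window_days * 24 * 60 * 60
    total_requests = rps * window_seconds
    allowed_bad_requests = total_requests * budget_fraction
    allowed_bad_minutes = (window_seconds / 60) * budget_fraction
    return allowed_bad_requests, allowed_bad_minutes


# The three SLO targets from the table above
for name, target in [
    ("Rider API latency", 0.999),
    ("Rider API availability", 0.9995),
    ("Fare correctness", 0.9999),
]:
    bad_requests, bad_minutes = error_budget(target)
    print(f"{name}: {bad_requests:,.0f} requests or {bad_minutes:.1f} minutes")
```

The request count scales with traffic while the minute count depends only on the target and the window, which is why the table can quote minutes without stating a request rate.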

Current Alerting

Alert                              Fires/Week    Actionable?
p99 > 500ms                        12            2 (17%)
Error rate > 1%                    8             1 (13%)
CPU > 80%                          5             0 (0%)
Memory > 70%                       3             0 (0%)
Total                              28            3 (11%)

89% of alerts are noise. The on-call engineer is paged 25 times per week for nothing.

Target: alerts that fire only when user experience is meaningfully degraded. Fewer than 3 false positives per week.

The Fix

Prometheus Recording Rules for SLIs

# SCALED: Prometheus recording rules for SLO tracking
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Latency SLI: proportion of requests faster than 500ms
      - record: sli:rider_api:latency:success_rate5m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[5m]))

      - record: sli:rider_api:latency:success_rate30m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[30m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[30m]))

      - record: sli:rider_api:latency:success_rate1h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[1h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[1h]))

      - record: sli:rider_api:latency:success_rate6h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[6h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[6h]))

      # Availability SLI: proportion of non-5xx responses
      - record: sli:rider_api:availability:success_rate5m
        expr: |
          1 - (
            sum(rate(http_server_requests_seconds_count{
              service="rider-api", uri=~"/api/rides/.*", status=~"5.."
            }[5m]))
            /
            sum(rate(http_server_requests_seconds_count{
              service="rider-api", uri=~"/api/rides/.*"
            }[5m]))
          )

These rules pre-compute the latency SLI over 5m, 30m, 1h, and 6h windows, and the availability SLI over a 5m window. They are evaluated every 30 seconds, so alerting rules and dashboards can reference the recorded series instead of re-running expensive range queries.
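The four latency rules differ only in their range window, so one way to keep them in sync is to generate them. The following is a minimal sketch under that assumption: `latency_rule`, `WINDOWS`, and `SELECTOR` are hypothetical names, and writing the result out as a rule file (for example with a YAML library) is left out.

```python
# Windows used by the latency recording rules above
WINDOWS = ["5m", "30m", "1h", "6h"]

# Label selector shared by every rule in this chapter's example
SELECTOR = 'service="rider-api", uri=~"/api/rides/.*"'


def latency_rule(window: str) -> dict:
    """Build one recording rule as a dict with 'record' and 'expr' keys."""
    expr = (
        f'sum(rate(http_server_requests_seconds_bucket{{{SELECTOR}, le="0.5"}}[{window}]))\n'
        "/\n"
        f"sum(rate(http_server_requests_seconds_count{{{SELECTOR}}}[{window}]))"
    )
    return {"record": f"sli:rider_api:latency:success_rate{window}", "expr": expr}


rules = [latency_rule(window) for window in WINDOWS]

for rule in rules:
    print(rule["record"])
    print(rule["expr"])
    print()
```

Adding a window then becomes a one-line change to `WINDOWS`, and the rule name and both range selectors cannot drift apart.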

Figure: Error budget consumption over 30 days, showing normal burn rate, a surge incident causing 14x burn rate at day 15, and budget exhaustion triggering a deploy freeze.

The chart above illustrates how error budgets work in practice. Under normal operation, the budget depletes gradually at roughly 1x burn rate. At day 15, a surge event combined with connection pool exhaustion causes a 14x burn rate, and the budget drops from 55% to near zero in just a few days. The horizontal threshold lines show where alerts fire: a 2x burn rate generates a ticket for investigation, while a 14x burn rate pages the on-call engineer immediately. (The alerting rules in the next section use slightly different thresholds: 1x for the ticket and 14.4x for the page.) Once the budget is exhausted, all non-essential deploys are frozen until reliability work restores the budget.

Burn Rate Alerting Rules

# SCALED: Burn rate alerting for rider API latency SLO (99.9%)
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x burn rate over 1 hour, validated against 5 minutes
      # At 14.4x, the 30-day budget would be exhausted in ~50 hours
      # Consuming ~2% of the monthly budget per hour
      # Action: PAGE
      - alert: RiderAPILatencyBudgetFastBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate1h) > (14.4 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate5m) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: rider-api-latency
          burn_rate: fast
        annotations:
          summary: "Rider API latency SLO fast burn: error budget consumption is critical"
          description: |
            Current error rate: {{ $value | humanizePercentage }}
            SLO target: 99.9%
            Burn rate: >14.4x (budget consumed in ~50 hours at this rate)

      # Slow burn: 1x burn rate over 6 hours, validated against 30 minutes
      # Sustained at 1x, the budget is exactly exhausted at the end of the 30-day window
      # Action: TICKET
      - alert: RiderAPILatencyBudgetSlowBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate6h) > (1 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate30m) > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: rider-api-latency
          burn_rate: slow
        annotations:
          summary: "Rider API latency SLO slow burn: error budget being consumed steadily"
          description: |
            Current error rate: {{ $value | humanizePercentage }}
            SLO target: 99.9%
            Burn rate: ~1x (budget on track to exhaust before window ends)

The fast burn alert checks: “Is the 1-hour error rate 14.4 times the allowed rate, AND is the 5-minute rate also elevated?” Both conditions must be true. A 5-second spike elevates the 5-minute window but not the 1-hour window. It does not fire. A 45-minute degradation elevates both windows. It fires.

The slow burn alert checks: “Is the 6-hour error rate above the allowed rate, AND is the 30-minute rate also elevated?” This catches gradual degradations that would exhaust the budget over days, such as a slow memory leak that adds 10ms of latency per hour until requests start crossing the 500ms threshold. Threshold alerting would not fire until the leak is severe. Burn rate alerting files a ticket when the trend becomes dangerous.
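The two-window condition can be checked offline with a small simulation. This is a sketch under stated assumptions, not how Prometheus evaluates rules: traffic is constant, each list entry is the fraction of slow requests in one second, the `for: 2m` clause is ignored, and the 100% and 5% slow fractions are made-up inputs. The helper names `make_timeline` and `first_fast_burn` are hypothetical.

```python
from itertools import accumulate

BUDGET = 0.001                   # 1 - 0.999 latency SLO
THRESHOLD = 14.4 * BUDGET        # fast-burn threshold as an error rate
LONG_WINDOW_S = 3600             # 1h window
SHORT_WINDOW_S = 300             # 5m window


def make_timeline(total_s, start_s, duration_s, bad_fraction):
    """Per-second slow-request fraction: healthy except during one incident."""
    return [
        bad_fraction if start_s <= t < start_s + duration_s else 0.0
        for t in range(total_s)
    ]


def first_fast_burn(timeline):
    """Return the first second at which both windows exceed the threshold."""
    prefix = [0.0, *accumulate(timeline)]

    def rate(now, window_s):
        # Mean slow fraction over the trailing window [now - window_s, now)
        return (prefix[now] - prefix[now - window_s]) / window_s

    for now in range(LONG_WINDOW_S, len(timeline) + 1):
        if rate(now, LONG_WINDOW_S) > THRESHOLD and rate(now, SHORT_WINDOW_S) > THRESHOLD:
            return now
    return None


# One healthy hour, then the incident, in a two-hour timeline
spike = make_timeline(7200, start_s=3600, duration_s=5, bad_fraction=1.0)
degradation = make_timeline(7200, start_s=3600, duration_s=45 * 60, bad_fraction=0.05)

print("5-second spike fires at:", first_fast_burn(spike))
fired = first_fast_burn(degradation)
print("45-minute degradation fires after", round((fired - 3600) / 60, 1), "minutes")
```

With these inputs the spike lifts the 5-minute window above 1.44% but leaves the 1-hour window near 0.14%, so nothing fires. The degradation fires roughly 17 minutes in, once enough slow requests have accumulated in the 1-hour window.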

Grafana SLO Dashboard

{
  "panels": [
    {
      "title": "Error Budget Remaining (Latency SLO 99.9%)",
      "type": "gauge",
      "targets": [
        {
          "expr": "1 - ((1 - sli:rider_api:latency:success_rate30d) / 0.001)",
          "legendFormat": "Budget Remaining"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              { "color": "red", "value": 0 },
              { "color": "yellow", "value": 0.25 },
              { "color": "green", "value": 0.5 }
            ]
          },
          "unit": "percentunit",
          "max": 1,
          "min": 0
        }
      }
    },
    {
      "title": "Burn Rate (1h window)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "(1 - sli:rider_api:latency:success_rate1h) / 0.001",
          "legendFormat": "Burn Rate"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": {
            "thresholdsStyle": { "mode": "line+area" }
          },
          "thresholds": {
            "steps": [
              { "color": "transparent", "value": 0 },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 14.4 }
            ]
          }
        }
      }
    }
  ]
}

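The gauge shows remaining error budget as a percentage. Green above 50%, yellow between 25-50%, red below 25%. Its query reads sli:rider_api:latency:success_rate30d, a 30-day recording rule that is not among the rules listed earlier and needs to be defined the same way with a [30d] range. The burn rate chart shows the current consumption rate with threshold lines at 1x (exact budget pace) and 14.4x (fast burn page threshold).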
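The two panel expressions are simple enough to check by hand. The sketch below mirrors them in Python, assuming the 99.9% target; the function names are illustrative, and `gauge_color` only approximates how the threshold steps map values to colors.

```python
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(sli: float, slo_target: float = SLO_TARGET) -> float:
    """Observed error rate divided by the allowed error rate."""
    return (1.0 - sli) / (1.0 - slo_target)


def budget_remaining(sli_30d: float, slo_target: float = SLO_TARGET) -> float:
    """Fraction of the 30-day budget left; negative once the budget is overspent."""
    return 1.0 - burn_rate(sli_30d, slo_target)


def gauge_color(remaining: float) -> str:
    """Each threshold step applies from its value upward."""
    if remaining >= 0.5:
        return "green"
    if remaining >= 0.25:
        return "yellow"
    return "red"


def hours_to_exhaustion(rate: float) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    return WINDOW_HOURS / rate


remaining = budget_remaining(0.9996)  # 0.04% of requests slow over 30 days
print(f"remaining: {remaining:.0%} ({gauge_color(remaining)})")
print(f"hours to exhaustion at 14.4x: {hours_to_exhaustion(14.4):.0f}")
```

A 30-day SLI of 99.96% leaves about 60% of the budget, and a sustained 14.4x burn rate empties a full budget in about 50 hours, matching the comment in the fast-burn rule.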

The Alert That Should Not Have Paged

Tuesday, 3:17 AM. Kubernetes rolls out a new version of the rider API. Rolling deployment: old pods drain connections, new pods warm up. For 12 seconds, 30% of requests hit pods that are starting up. Cold JIT compilation. Cold connection pools. p99 spikes to 1.8 seconds.

Threshold alert: p99 > 500ms fires immediately. On-call is paged.

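Burn rate analysis: 12 seconds of elevated latency at 100 RPS, with 30% of requests hitting cold pods, affects ~360 requests. Error budget for the month is 259,200 requests. This event consumed 0.14% of the budget. Assuming the rest of the hour is healthy, the 1-hour error rate peaks at about 0.1% (360 of 360,000 requests), far below the 1.44% fast-burn threshold, so the fast-burn alert does not fire. The 6-hour error rate stays under 0.02%, below the 0.1% slow-burn threshold, so the slow-burn alert does not fire either. The on-call engineer sleeps through the deployment.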
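The same analysis as a worked calculation, for readers who want to plug in their own deployment numbers. It is a sketch that assumes constant traffic and no other slow requests in the window; `blip_impact` is a hypothetical helper name.

```python
RPS = 100
BUDGET_FRACTION = 0.001  # 99.9% latency SLO


def blip_impact(duration_s: float, affected_share: float, rps: float = RPS) -> dict:
    """Estimate what a short latency blip costs in budget and window error rates."""
    slow_requests = duration_s * rps * affected_share
    monthly_budget = rps * 30 * 24 * 3600 * BUDGET_FRACTION
    return {
        "slow_requests": slow_requests,
        "budget_share": slow_requests / monthly_budget,
        # Error rate of a trailing window that fully contains the blip
        "error_rate_1h": slow_requests / (rps * 3600),
        "error_rate_6h": slow_requests / (rps * 6 * 3600),
    }


impact = blip_impact(duration_s=12, affected_share=0.30)
fast_burn_threshold = 14.4 * BUDGET_FRACTION  # 1.44%
slow_burn_threshold = 1 * BUDGET_FRACTION     # 0.1%

print(f"slow requests: {impact['slow_requests']:.0f}")
print(f"share of monthly budget: {impact['budget_share']:.2%}")
print(f"1h error rate {impact['error_rate_1h']:.3%} vs fast-burn {fast_burn_threshold:.2%}")
print(f"6h error rate {impact['error_rate_6h']:.3%} vs slow-burn {slow_burn_threshold:.2%}")
```

Both long-window error rates sit below their thresholds, so neither alert can fire regardless of what the short windows show.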

The Proof

Locust: Simulating an SLO Violation

# SCALED: Locust simulating a sustained SLO violation
import random
import time

from locust import HttpUser, task, between


class SLOViolationUser(HttpUser):
    wait_time = between(0.1, 0.5)
    # Shared by all simulated users so the delay starts once, test-wide
    start_time = None

    def on_start(self):
        if SLOViolationUser.start_time is None:
            SLOViolationUser.start_time = time.time()

    @task
    def request_ride(self):
        elapsed = time.time() - SLOViolationUser.start_time

        params = {
            "pickup_lat": 40.7128, "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
        }

        # After 5 minutes, add artificial delay to 2% of requests
        # 2% error rate vs 0.1% budget = 20x burn rate
        # The 5m window crosses 14.4x after ~3.6 minutes of delay; the 1h
        # window lags behind and decides when the fast-burn alert fires
        if elapsed > 300 and random.random() < 0.02:
            params["simulate_delay_ms"] = 2000

        self.client.get("/api/rides/fare-estimate",
            params=params,
            name="/api/rides/fare-estimate"
        )

Run for 30 minutes with 200 users. The class does not set a host, so also pass --host with the base URL of the rider API under test:

locust -f slo_violation.py --users 200 --spawn-rate 50 --run-time 30m --headless

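Minutes 0-5: Normal traffic. Burn rate near 0.
Minutes 5-30: 2% of requests exceed 500ms, a sustained burn rate of ~20x (2% error rate vs 0.1% budget). The script keeps injecting the delay until the run ends.
Around minute 9: The 5-minute window crosses 14.4x (1.44%), once 3.6 of its 5 minutes are degraded.
Around minute 18: The 1-hour window crosses 14.4x, assuming the rider API carried no other traffic in the preceding hour. After t minutes of delay the window holds 5 + t minutes of traffic with an error rate of 0.02 × t / (5 + t), which passes 1.44% at t ≈ 12.9.
Around minute 20: Fast-burn alert fires, after both conditions have held for the 2-minute for: period.

If the service had steady traffic for the hour before the test, the 1-hour window needs 14.4 / 20 × 60 ≈ 43 minutes of 20x burn to cross the threshold, so the run has to last at least 50 minutes in that case.

Budget consumed: 25 minutes at 20x burn rate is 20 × 25 / 43,200 ≈ 1.2% of the 30-day budget, if the test's request rate matches the service's average. With an otherwise untouched budget, the gauge would show roughly 98.8% remaining.

The threshold alert would have fired within the first minutes and stayed firing for the entire degradation, generating continuous noise. The burn-rate alert fired once, with a meaningful severity and a clear description: “budget consumption is critical at current rate.” One actionable alert vs. continuous noise.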