surviving the spike

Burn Rate Alerting and Escaping Alert Fatigue

10 min read Chapter 51 of 66

The Symptom

The on-call rotation is a punishment. Engineers dread their turn. Three engineers have requested transfers out of the team in the last six months, citing “unsustainable on-call burden.” The PagerDuty statistics tell the story:

Month        Pages    Actionable    False Positive Rate
January      127      11            91.3%
February     143      8             94.4%
March        118      14            88.1%

Across the quarter, about 91% of pages were false positives. For every real incident, the on-call engineer is woken up roughly 10 times for nothing. The median time to acknowledge an alert climbed from 14 minutes in January to 22 minutes in February and 31 minutes in March. Response time is increasing because trust in the alerting system is collapsing.
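
The percentages in the table follow directly from the page counts. A minimal sketch of that arithmetic, using only the numbers above (the function and variable names are illustrative):

```python
# Page counts copied from the PagerDuty table: month -> (pages, actionable).
pages = {"January": (127, 11), "February": (143, 8), "March": (118, 14)}


def false_positive_rate(total: int, actionable: int) -> float:
    """Fraction of pages that required no action."""
    return (total - actionable) / total


for month, (total, actionable) in pages.items():
    print(f"{month}: {false_positive_rate(total, actionable):.1%}")

# Aggregate over the quarter.
total_pages = sum(total for total, _ in pages.values())
total_actionable = sum(actionable for _, actionable in pages.values())
quarter_rate = false_positive_rate(total_pages, total_actionable)
noise_per_real_page = (total_pages - total_actionable) / total_actionable
print(f"Quarter: {quarter_rate:.1%}, "
      f"{noise_per_real_page:.1f} noise pages per actionable page")
```

Over the three months that is 388 pages for 33 actionable incidents.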

The Cause

Every alert in the current system is a threshold alert:

# BOTTLENECK: Threshold alerting rules
groups:
  - name: rider_api_alerts
    rules:
      - alert: RiderAPIHighLatency
        expr: histogram_quantile(0.99,
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api"
          }[5m])) by (le)
        ) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Rider API p99 latency > 500ms"

      - alert: RiderAPIHighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", status=~"5.."
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api"
          }[5m]))
          > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Rider API error rate > 1%"

These rules fire during:

  • Rolling deployments (new pods warming up, 5-15 seconds of elevated latency)
  • Garbage collection pauses (G1 mixed collections, 200-500ms pauses every 30 minutes)
  • Kubernetes node rebalancing (pod migration, brief connection drops)
  • Network blips (cloud provider maintenance, 2-3 second packet loss)
  • Single failed health checks (retry succeeds, no user impact)

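Each event crosses the threshold for 1-5 minutes, because the [5m] rate window keeps even a few seconds of bad samples in the computed value until they age out. The for: 1m clause is too short to filter them out. Increasing it to for: 10m would filter the noise but would also delay detection of real incidents by 10 minutes.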

Threshold alerting cannot distinguish between “brief transient spike” and “sustained degradation.” Both cross the threshold. The only difference is duration and impact.
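
The trade-off in the for: clause can be seen in a small simulation. This is a simplified model, not Prometheus itself: it assumes one rule evaluation per minute and treats the alert as firing once the expression has been true for at least the for: duration. The two breach series are invented for illustration.

```python
from typing import Optional, Sequence


def first_firing_minute(breaches: Sequence[bool], for_minutes: int) -> Optional[int]:
    """Minute at which an alert with `for: <for_minutes>m` fires, or None.

    `breaches` holds one boolean per one-minute evaluation (assumed interval):
    True when the expression is over the threshold.
    """
    streak_start = None
    for minute, breached in enumerate(breaches):
        if not breached:
            streak_start = None  # the pending state resets as soon as the expr is false
            continue
        if streak_start is None:
            streak_start = minute
        if minute - streak_start >= for_minutes:
            return minute
    return None


# Hypothetical series: a deployment blip that stays over the threshold for
# 3 minutes, and a real incident that stays over it for 30.
transient = [False] + [True] * 3 + [False] * 10
sustained = [False] + [True] * 30

for label, series in (("transient", transient), ("sustained", sustained)):
    for for_minutes in (1, 10):
        fired = first_firing_minute(series, for_minutes)
        print(f"{label:9s} for: {for_minutes:2d}m -> fires at minute {fired}")
```

With for: 1m both series page. With for: 10m the blip is ignored, but the real incident is reported ten minutes after it starts. No single duration separates the two cases without paying that delay.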

The Baseline

Alert Type         Pros                                Cons
Threshold          Simple to write                     Cannot distinguish transient from sustained
                   Easy to understand                  No concept of error budget
                   Fires fast                          Fires on every deployment
                                                       False positive rate > 80%

Burn Rate          Budget-aware                        Requires SLO definition
                   Duration-sensitive                  More complex PromQL
                   Severity-tiered                     Requires recording rules
                   False positive rate ~11% (April)

The Fix

Burn Rate: The Concept

Burn rate measures how fast the error budget is being consumed:

Burn Rate = Actual Error Rate / Allowed Error Rate

For a 99.9% SLO (0.1% allowed error rate):

Scenario                     Error Rate    Burn Rate    Time to Exhaust 30-Day Budget
Normal operation              0.02%         0.2x         150 days (well within budget)
Rolling deployment spike      0.5%          5x           6 days
Moderate degradation          1.0%          10x          3 days
Severe incident               1.44%         14.4x        ~50 hours
Total outage                  100%          1000x        ~43 minutes

A burn rate of 1x means the budget will be exactly exhausted at the end of the 30-day window. A burn rate of 14.4x means the budget will be gone in roughly 50 hours of sustained failure. The alerting window determines how quickly you detect it.
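
The table can be reproduced with two divisions. A minimal sketch, assuming the 99.9% SLO and 30-day window used throughout this chapter (the constant and function names are illustrative):

```python
ERROR_BUDGET = 0.001      # 99.9% SLO -> 0.1% of requests may fail
WINDOW_HOURS = 30 * 24    # 30-day SLO window = 720 hours


def burn_rate(error_rate: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / ERROR_BUDGET


def hours_to_exhaust(error_rate: float) -> float:
    """Hours until a full 30-day budget is gone at a constant error rate."""
    return WINDOW_HOURS / burn_rate(error_rate)


scenarios = {
    "Normal operation": 0.0002,
    "Rolling deployment spike": 0.005,
    "Moderate degradation": 0.01,
    "Severe incident": 0.0144,
    "Total outage": 1.0,
}

for name, rate in scenarios.items():
    hours = hours_to_exhaust(rate)
    print(f"{name:26s} burn={burn_rate(rate):7.1f}x  "
          f"exhausted in {hours:8.2f} h ({hours / 24:.2f} days)")
```

The "time to exhaust" column assumes the budget starts full and the error rate stays constant.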

Multi-Window Alerting

The key insight: pair a long window with a short window. The long window ensures the problem is significant (not a blip). The short window ensures the problem is still happening (not already resolved).

Alert Tier    Burn Rate    Long Window    Short Window    Action        Detects
Fast burn     14.4x        1 hour         5 minutes       PagerDuty     Severe incidents
Slow burn     1x           3 days         6 hours         Slack ticket  Gradual degradation

Fast burn: “Have we been burning budget at 14.4x for the last hour, AND is it still happening in the last 5 minutes?” This catches severe incidents within minutes while ignoring 12-second deployment blips.

Slow burn: “Have we been burning budget at 1x for the last 3 days, AND is it still happening in the last 6 hours?” This catches slow degradation that threshold alerting would miss entirely. A memory leak that adds 5ms per hour. A slow connection pool exhaustion. A gradual increase in backend latency from a growing table.
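
The two-window condition is a plain logical AND. A minimal sketch of the decision, assuming the 0.1% error budget from the 99.9% SLO; the error rates in the examples are invented to show the three interesting cases:

```python
ERROR_BUDGET = 0.001  # 99.9% SLO


def should_alert(long_error_rate: float, short_error_rate: float,
                 burn_threshold: float) -> bool:
    """True only if BOTH windows exceed burn_threshold times the budget."""
    limit = burn_threshold * ERROR_BUDGET
    return long_error_rate > limit and short_error_rate > limit


FAST_BURN = 14.4

# Deployment blip: the 5m window spikes, the 1h window barely moves.
print("blip:     ", should_alert(long_error_rate=0.0004, short_error_rate=0.02,
                                 burn_threshold=FAST_BURN))
# Incident already resolved: the 1h window is still high, the 5m window is clean.
print("resolved: ", should_alert(long_error_rate=0.02, short_error_rate=0.0001,
                                 burn_threshold=FAST_BURN))
# Ongoing incident: both windows are over the limit.
print("ongoing:  ", should_alert(long_error_rate=0.02, short_error_rate=0.03,
                                 burn_threshold=FAST_BURN))
```

Only the ongoing incident pages. The long window rejects the blip, and the short window lets the alert clear soon after recovery instead of staying active until the long window drains.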

Prometheus Alerting Rules

# SCALED: Multi-window, multi-burn-rate alerting
groups:
  - name: slo_burn_rate_alerts
    rules:
      # =====================
      # FAST BURN: PAGE
      # =====================
      # 14.4x burn rate, 1h long window, 5m short window
      # At this rate: ~2% of monthly budget consumed per hour
      - alert: RiderAPILatencyFastBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate1h) > (14.4 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate5m) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          burn_rate: fast  # the Alertmanager routes match on this label
          team: rider-platform
          slo: rider-api-latency
        annotations:
          summary: "FAST BURN: Rider API latency budget being consumed at >14.4x"
          runbook: "https://wiki.internal/runbooks/rider-api-latency"

      # =====================
      # SLOW BURN: TICKET
      # =====================
      # 1x burn rate. The design calls for a 3d long window and a 6h short window;
      # this rule approximates it with a 6h long window and a 30m short window.
      - alert: RiderAPILatencySlowBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate6h) > (1 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate30m) > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          burn_rate: slow  # the Alertmanager routes match on this label
          team: rider-platform
          slo: rider-api-latency
        annotations:
          summary: "SLOW BURN: Rider API latency budget consumption trending toward exhaustion"
          runbook: "https://wiki.internal/runbooks/rider-api-latency-slow"

      # =====================
      # AVAILABILITY: FAST BURN
      # =====================
      # The 0.0005 factor corresponds to a 99.95% availability target (0.05% error budget)
      - alert: RiderAPIAvailabilityFastBurn
        expr: |
          (1 - sli:rider_api:availability:success_rate1h) > (14.4 * 0.0005)
          and
          (1 - sli:rider_api:availability:success_rate5m) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          burn_rate: fast  # the Alertmanager routes match on this label
          team: rider-platform
          slo: rider-api-availability
        annotations:
          summary: "FAST BURN: Rider API availability budget being consumed at >14.4x"
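
The alerting rules read series such as sli:rider_api:latency:success_rate1h, which must come from recording rules that are not shown. One possible way to generate them is sketched below. It rests on assumptions that need checking against the real metrics: the latency SLI is taken as the share of requests finishing within 500ms, the histogram is assumed to have a bucket boundary at le="0.5", and the availability SLI is taken as the share of non-5xx responses. The group name is hypothetical.

```python
# Windows referenced by the alerting rules and the dashboard in this chapter.
WINDOWS = ["5m", "30m", "1h", "6h", "30d"]
SELECTOR = 'service="rider-api"'


def latency_expr(window: str) -> str:
    """Share of requests at or under 500ms (assumes an le="0.5" bucket exists)."""
    return (
        f'sum(rate(http_server_requests_seconds_bucket{{{SELECTOR}, le="0.5"}}[{window}]))'
        f' / sum(rate(http_server_requests_seconds_count{{{SELECTOR}}}[{window}]))'
    )


def availability_expr(window: str) -> str:
    """Share of responses that are not 5xx."""
    return (
        f'sum(rate(http_server_requests_seconds_count{{{SELECTOR}, status!~"5.."}}[{window}]))'
        f' / sum(rate(http_server_requests_seconds_count{{{SELECTOR}}}[{window}]))'
    )


def render_rules() -> str:
    """Render a Prometheus rule group as YAML text, one record per SLI and window."""
    lines = ["groups:", "  - name: rider_api_sli_recording", "    rules:"]
    for sli, build in (("latency", latency_expr), ("availability", availability_expr)):
        for window in WINDOWS:
            lines.append(f"      - record: sli:rider_api:{sli}:success_rate{window}")
            lines.append("        expr: |")
            lines.append(f"          {build(window)}")
    return "\n".join(lines)


print(render_rules())
```

Evaluating rate() over a 30d range on every rule evaluation can be expensive, so the 30d record in particular may need a cheaper formulation in a real deployment.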

Grafana Dashboard

{
  "dashboard": {
    "title": "Rider API SLO Dashboard",
    "panels": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 6, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "clamp_min(1 - ((1 - sli:rider_api:latency:success_rate30d) / 0.001), 0)",
            "legendFormat": "Latency Budget"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "orange", "value": 0.1 },
                { "color": "yellow", "value": 0.25 },
                { "color": "green", "value": 0.5 }
              ]
            },
            "unit": "percentunit"
          }
        }
      },
      {
        "title": "Burn Rate Over Time",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 6, "y": 0 },
        "targets": [
          {
            "expr": "(1 - sli:rider_api:latency:success_rate1h) / 0.001",
            "legendFormat": "1h Burn Rate"
          },
          {
            "expr": "(1 - sli:rider_api:latency:success_rate6h) / 0.001",
            "legendFormat": "6h Burn Rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "thresholdsStyle": { "mode": "line" }
            },
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 14.4 }
              ]
            }
          }
        }
      },
      {
        "title": "Time Until Budget Exhaustion",
        "type": "stat",
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 0 },
        "targets": [
          {
            "expr": "clamp_min(1 - (1 - sli:rider_api:latency:success_rate30d) / 0.001, 0) * 720 / ((1 - sli:rider_api:latency:success_rate1h) / 0.001)",
            "legendFormat": "Hours Remaining"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "h",
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "yellow", "value": 168 },
                { "color": "green", "value": 336 }
              ]
            }
          }
        }
      }
    ]
  }
}

Three panels. The gauge shows the percentage of budget remaining. The time series shows burn rate with horizontal threshold lines at 1x and 14.4x. The stat panel shows estimated hours until budget exhaustion at the current rate. With an untouched budget and a burn rate below 1x, the stat panel shows “> 720h” (more than the 30-day window).
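
The arithmetic behind the gauge and a time-to-exhaustion projection can be checked outside Grafana. The sketch below assumes the 0.1% budget and 720-hour window, and projects exhaustion as remaining budget divided by current burn rate, which is one reasonable definition rather than the only one. The sample success rates are invented.

```python
ERROR_BUDGET = 0.001   # 99.9% SLO
WINDOW_HOURS = 720     # 30 days


def budget_remaining(success_rate_30d: float) -> float:
    """Fraction of the 30-day error budget still unspent, floored at 0."""
    return max(1 - (1 - success_rate_30d) / ERROR_BUDGET, 0.0)


def hours_until_exhaustion(success_rate_30d: float, success_rate_1h: float) -> float:
    """Hours left if the last hour's burn rate continues unchanged."""
    burn = (1 - success_rate_1h) / ERROR_BUDGET
    if burn <= 0:
        return float("inf")  # no errors in the last hour: the budget never runs out
    return budget_remaining(success_rate_30d) * WINDOW_HOURS / burn


# Hypothetical readings: half the budget spent, currently burning at 3x.
remaining = budget_remaining(0.9995)
hours = hours_until_exhaustion(0.9995, 0.997)
print(f"budget remaining: {remaining:.0%}, exhausted in about {hours:.0f} h")
```

Half a budget at a 3x burn lasts about 120 hours, or five days.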

Alertmanager Routing

# SCALED: Alertmanager routing by burn rate severity
route:
  receiver: default-slack
  group_by: ["slo"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Fast burn alerts: page the on-call engineer
    - matchers:
        - severity = critical
        - burn_rate =~ "fast|critical"
      receiver: pagerduty-rider-platform
      group_wait: 10s
      repeat_interval: 1h
      continue: true

    # Also send fast burn to Slack for visibility
    - matchers:
        - severity = critical
      receiver: slack-incidents
      group_wait: 10s

    # Slow burn alerts: create a ticket, notify Slack
    - matchers:
        - severity = warning
        - burn_rate = slow
      receiver: slack-slo-warnings
      group_wait: 5m
      repeat_interval: 24h

receivers:
  - name: default-slack
    slack_configs:
      - channel: "#rider-platform-alerts"
        title: "{{ .GroupLabels.slo }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"

  - name: pagerduty-rider-platform
    pagerduty_configs:
      - service_key_file: /etc/alertmanager/pagerduty-key
        severity: critical
        description: "{{ .CommonAnnotations.summary }}"

  - name: slack-incidents
    slack_configs:
      - channel: "#incidents"
        title: "SLO VIOLATION: {{ .GroupLabels.slo }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.runbook }}\n{{ end }}"
        color: danger

  - name: slack-slo-warnings
    slack_configs:
      - channel: "#rider-platform-slo"
        title: "Slow Burn: {{ .GroupLabels.slo }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
        color: warning

Fast burn goes to PagerDuty and the #incidents Slack channel. The on-call engineer is paged. Slow burn goes to #rider-platform-slo as a ticket-level notification. Nobody is woken up for a slow burn. The team reviews slow burn alerts during business hours and investigates the trend.
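
The routing tree is easier to reason about when traced by hand. The sketch below is a simplified model of the child routes above, not Alertmanager's actual implementation: routes are tried in order, a match selects that receiver, and evaluation stops unless continue is set. Regex matchers are treated as fully anchored. Because two routes match on a burn_rate label, the alert rules have to set that label for the routing to work as described.

```python
import re

# Simplified copy of the child routes; the default receiver applies when none match.
ROUTES = [
    {"match": {"severity": "critical", "burn_rate": re.compile(r"fast|critical")},
     "receiver": "pagerduty-rider-platform", "continue": True},
    {"match": {"severity": "critical"},
     "receiver": "slack-incidents", "continue": False},
    {"match": {"severity": "warning", "burn_rate": "slow"},
     "receiver": "slack-slo-warnings", "continue": False},
]
DEFAULT_RECEIVER = "default-slack"


def matches(labels: dict, matchers: dict) -> bool:
    """All matchers must hold; a missing label is treated as the empty string."""
    for name, expected in matchers.items():
        value = labels.get(name, "")
        if isinstance(expected, re.Pattern):
            if not expected.fullmatch(value):
                return False
        elif value != expected:
            return False
    return True


def receivers_for(labels: dict) -> list:
    """Receivers notified for an alert with the given labels."""
    chosen = []
    for route in ROUTES:
        if matches(labels, route["match"]):
            chosen.append(route["receiver"])
            if not route["continue"]:
                break
    return chosen or [DEFAULT_RECEIVER]


print(receivers_for({"severity": "critical", "burn_rate": "fast"}))
print(receivers_for({"severity": "warning", "burn_rate": "slow"}))
print(receivers_for({"severity": "critical"}))  # no burn_rate label: Slack only, no page
```

The last case is the one to watch for. A critical alert without a burn_rate label skips the PagerDuty route and reaches only #incidents.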

The Alert That Should Have Paged But Didn’t

Thursday. A new database index is deployed that improves 99% of queries but makes 0.3% of queries 200ms slower due to a different query plan for edge-case zone lookups. The threshold alert does not fire because p99 stays at 460ms (under the 500ms threshold). The slow 0.3% of requests sits above the 99th percentile, in the tail that a p99 threshold never sees.

Burn rate analysis: 0.3% error rate vs 0.1% budget = 3x burn rate. Not enough for fast burn (14.4x threshold). But after 2 days of sustained 3x burn, the 6-hour window shows a consistent 1x+ burn rate. The slow burn alert fires a ticket to #rider-platform-slo.

The team investigates, finds the query plan regression, adds a query hint to force the original plan. Budget consumed: ~20% over 2 days. Threshold alerting would not have noticed until the budget was exhausted and riders started complaining.
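
Why a p99 threshold misses this regression can be shown with a synthetic sample. The latency values below are invented to match the story (most requests fast, the worst healthy requests at 460ms, 0.3% pushed 200ms higher), and the percentile uses the nearest-rank method rather than Prometheus's bucket interpolation.

```python
import math


def percentile(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile of an already sorted list."""
    rank = max(math.ceil(pct * len(sorted_values) / 100), 1)
    return sorted_values[rank - 1]


# 10,000 hypothetical requests: 90% at 300ms, 9.7% at 460ms, 0.3% at 660ms.
latencies_ms = sorted([300] * 9_000 + [460] * 970 + [660] * 30)

p99 = percentile(latencies_ms, 99)
slow_fraction = sum(1 for value in latencies_ms if value > 500) / len(latencies_ms)
burn = slow_fraction / 0.001  # against the 0.1% budget of a 99.9% SLO

print(f"p99 = {p99} ms (threshold alert fires above 500 ms)")
print(f"requests over 500 ms: {slow_fraction:.1%}, burn rate = {burn:.1f}x")
```

The 99th-percentile request is still a 460ms request, so the threshold rule stays silent, while the share of requests over 500ms is three times what the budget allows.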

The Proof

Locust: Inducing a Slow Burn

# SCALED: Locust inducing a slow burn scenario
from locust import HttpUser, task, between
import random

class SlowBurnUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def fare_estimate(self):
        params = {
            "pickup_lat": 40.7128, "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
        }

        # Add 50ms delay to 0.5% of requests
        # This pushes them from ~450ms to ~500ms, crossing the SLO threshold
        # 0.5% failure rate vs 0.1% budget = 5x burn rate
        # Slow enough that threshold alerting ignores it
        # Fast enough that slow-burn alert fires within 6 hours
        if random.random() < 0.005:
            params["simulate_delay_ms"] = 50

        self.client.get("/api/rides/fare-estimate",
            params=params,
            name="/api/rides/fare-estimate"
        )

Run for 8 hours with 100 users:

locust -f slow_burn.py --users 100 --spawn-rate 20 --run-time 8h --headless

Expected timeline:

Time        Burn Rate (6h)    Alert Status
0-30min     ~5x               No alert (for: 30m not met)
30min-1h    ~5x               No alert (30m window stabilizing)
1h-1.5h     ~5x               Slow burn pending (expression true from ~1h, for: 30m counting down)
~1.5h       ~5x               SLOW BURN ALERT FIRES → Slack ticket
1.5h-8h     ~5x               Alert continues (repeat_interval: 24h, no re-alert)

The threshold alert (p99 > 500ms) never fires because p99 stays at 460ms. Only 0.5% of requests cross 500ms. The burn rate alert catches it because 0.5% failure rate against a 0.1% budget is a 5x burn rate, which exceeds the 1x slow-burn threshold.

After the run, check the error budget:

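clamp_min(1 - ((1 - sli:rider_api:latency:success_rate30d) / 0.001), 0)

Expected: ~0.94, assuming the budget was full before the run and traffic volume is comparable to normal production. Eight hours at a 5x burn rate consumes 8 × 5 / 720 ≈ 5.6% of the 30-day budget, so the dashboard gauge stays green. At the current rate the remaining budget would last roughly 136 hours (about 5.7 days). The 30-day series is the right input here: the same expression over the 6-hour window would evaluate to 1 - 5 = -4.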

Before burn rate alerting:

Month        Pages    Actionable    False Positive Rate    Median Acknowledge Time
March        118      14            88.1%                  31 minutes

After burn rate alerting:

Month        Pages    Actionable    False Positive Rate    Median Acknowledge Time
April        9        8             11.1%                  4 minutes

From 118 pages to 9. From 88% false positives to 11%. From 31-minute acknowledge time to 4 minutes. The on-call engineer trusts the alert. When the phone buzzes, they know it matters.