Chaos Engineering: Breaking the Ride-Hailing Platform on Purpose

The Symptom

The ride-hailing platform has circuit breakers, bulkheads, fallback chains, feature flags, and a write-ahead log. The team is confident. The architecture diagram shows resilience at every layer. The design documents describe graceful degradation for every failure mode.

Then the surge pricing service goes down in production. The circuit breaker opens. The fallback returns cached multipliers. Everything works according to plan for 5 minutes. Then the cached multipliers expire (5-minute TTL). The fallback returns the default 1.0x. Riders in a 3.2x surge zone now book at base fare. Revenue loss: $34,000 in 6 minutes before an engineer notices and restarts the surge pricing service.

The circuit breaker worked. The bulkhead worked. The fallback chain worked. But nobody tested what happens when the surge pricing service stays down longer than the cache TTL. The resilience patterns were correct in isolation. They were untested as a system.
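The gap is easy to reproduce on paper. The following is a minimal sketch, not the platform's actual fallback code: the function and class names are hypothetical, and only the 5-minute TTL, the 3.2x surge, and the 1.0x default come from the incident above.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MULTIPLIER = 1.0
CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL from the incident description


@dataclass
class CachedMultiplier:
    value: float
    cached_at: float  # seconds since the start of the simulation


def surge_multiplier(now: float, service_up: bool, live_value: float,
                     cache: Optional[CachedMultiplier]) -> float:
    """Return the multiplier a booking would see at time `now` (seconds)."""
    if service_up:
        return live_value  # normal path: ask the surge pricing service
    if cache is not None and now - cache.cached_at < CACHE_TTL_SECONDS:
        return cache.value  # first fallback: cached multiplier, still fresh
    return DEFAULT_MULTIPLIER  # last fallback: base fare


# The surge service fails at t=0 with a fresh 3.2x entry in the cache.
cache = CachedMultiplier(value=3.2, cached_at=0.0)
timeline = {
    minute: surge_multiplier(minute * 60, service_up=False,
                             live_value=3.2, cache=cache)
    for minute in (0, 4, 5, 11)
}
print(timeline)
```

Every branch is correct on its own. The problem only appears when the outage outlasts the TTL, which is exactly the case a unit test of each branch never exercises.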

Chaos engineering closes this gap. You break things on purpose, under controlled conditions, with load testing running alongside, and you observe whether the system behaves as designed.

The Cause

Resilience patterns are code. Code has bugs. The bugs in resilience code only manifest during failures. And failures in production are not controlled experiments. When a real incident happens, the team is diagnosing, not observing. They are fixing, not measuring. They never learn whether the circuit breaker opened at the right threshold, whether the bulkhead was sized correctly, or whether the fallback chain reached level 4.

Chaos engineering provides:

Steady state hypothesis: Define what “working” looks like in numbers. p99 < 500ms, error rate < 0.1%, bookings completing.
Controlled failure injection: Kill a service, add latency, fill memory, drop pods. One variable at a time.
Observation under load: Locust runs alongside the experiment. Real traffic patterns, real concurrency.
Automated verification: The tool checks whether the steady state held or was violated.
Documented results: Each experiment produces a report. What broke, what held, what needs fixing.
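A steady state hypothesis is just a set of numeric checks. Here is a minimal sketch of that idea in Python, using the three thresholds listed above. The `Snapshot` type and `steady_state_violations` function are hypothetical names for illustration; in the actual experiments, Chaos Toolkit probes perform this check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    p99_seconds: float       # p99 latency of the booking endpoint
    error_rate: float        # failed / total requests, between 0.0 and 1.0
    bookings_per_min: float  # completed bookings per minute


def steady_state_violations(s: Snapshot) -> List[str]:
    """Return a list of violated conditions; an empty list means steady state held."""
    violations = []
    if s.p99_seconds >= 0.5:
        violations.append(f"p99 {s.p99_seconds * 1000:.0f}ms is not below 500ms")
    if s.error_rate >= 0.001:
        violations.append(f"error rate {s.error_rate:.2%} is not below 0.1%")
    if s.bookings_per_min <= 0:
        violations.append("no bookings are completing")
    return violations


# Illustrative snapshots, not measured data.
healthy = Snapshot(p99_seconds=0.21, error_rate=0.0002, bookings_per_min=1200)
spiking = Snapshot(p99_seconds=0.85, error_rate=0.0002, bookings_per_min=1150)

print(steady_state_violations(healthy))
print(steady_state_violations(spiking))
```

Writing the hypothesis as numbers first makes the later YAML probes a translation exercise rather than a design discussion.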

The ride-hailing platform needs four experiments that test the four major failure modes from CH18 and CH19:

Experiment              What Breaks               What Should Happen
Kill surge pricing      Surge pricing service     Circuit breaker → cached → default 1.0x
PG latency injection    PostgreSQL connections    Circuit breaker → Redis fallback
Redis maxmemory         Redis eviction policy     Cache degradation, no crash
Kill 50% pods           Kubernetes capacity       HPA scales, traffic redistributes

The Baseline

Current resilience verification: manual. An engineer reads the configuration and says “this should work.” The circuit breaker has never been tested under real load with a real failure. The bulkhead sizes were calculated on paper, never validated. The WAL replay was tested with 10 entries, never with 12,000.

Resilience Pattern       Tested in Dev?   Tested Under Load?   Tested in Staging?
Circuit breaker          Yes (unit)       No                   No
Bulkhead                 Yes (unit)       No                   No
Fallback chain           Yes (unit)       No                   No
WAL replay               Yes (10 entries) No                   No
Feature flags            Yes (manual)     No                   No
Caffeine fallback        Yes (unit)       No                   No
Kafka queue fallback     No               No                   No

Seven resilience patterns. Zero validated under production-like conditions.

The Fix

Chaos Toolkit Setup

# SCALED: Install Chaos Toolkit with Kubernetes and Prometheus extensions
pip install chaostoolkit \
            chaostoolkit-kubernetes \
            chaostoolkit-prometheus \
            chaostoolkit-reporting

Project structure:

[Figure: Chaos engineering project structure showing experiments, steady-state probes, load tests, and results directories]

The chaos engineering workspace separates concerns into four directories. The experiments folder (red) holds YAML definitions that deliberately break components: killing surge pricing pods, injecting PostgreSQL latency, exhausting Redis memory, and terminating half the pod fleet. The steady-state folder (green) defines the health probes and hypotheses that validate the system is behaving correctly before and after each experiment. The locust folder contains load test scripts that simulate realistic traffic patterns during chaos, and the results directory captures output for post-mortem analysis.

Experiment Framework

Every experiment follows the same structure:

# SCALED: Experiment template
version: 1.0.0
title: "[Experiment Name]"
description: "[What we are testing]"
tags:
  - "resilience"
  - "ride-hailing"

# What "working" looks like
steady-state-hypothesis:
  title: "Ride booking continues within SLO"
  probes:
    - type: probe
      name: "p99-latency-within-slo"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          query: >
            histogram_quantile(0.99,
              rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book"}[1m]))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5] # p99 < 500ms

    - type: probe
      name: "error-rate-below-threshold"
      provider:
        type: http
        url: "http://locust:8089/stats/requests"
      tolerance:
        type: jsonpath
        path: "$.stats[-1].current_fail_per_sec"
        expect:
          type: range
          range: [0, 5] # < 5 failures/sec

    - type: probe
      name: "bookings-completing"
      provider:
        type: http
        url: "http://rider-api:8080/actuator/health"
      tolerance: 200 # expect HTTP 200 from the health endpoint

# What we break
method: [] # Filled per experiment

# How we clean up
rollbacks: [] # Filled per experiment
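One caveat on the latency probe: a Prometheus range query returns a JSON matrix, not a single number, so a numeric range check needs a scalar extracted first. The following is a minimal sketch of such a helper, assuming the standard Prometheus HTTP API response shape. `latest_value` is a hypothetical function, not part of chaostoolkit-prometheus, and the sample payload is illustrative rather than measured.

```python
from typing import Any, Dict


def latest_value(response: Dict[str, Any]) -> float:
    """Return the most recent sample of the first series in a range-query response."""
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response.get('status')}")
    series = response.get("data", {}).get("result", [])
    if not series or not series[0].get("values"):
        raise ValueError("Prometheus returned no samples")
    _timestamp, raw = series[0]["values"][-1]
    return float(raw)  # Prometheus encodes sample values as strings


# Illustrative payload in the shape returned by /api/v1/query_range.
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {}, "values": [[1700000000, "0.212"], [1700000015, "0.231"]]}
        ],
    },
}

p99 = latest_value(sample)
within_slo = 0 <= p99 <= 0.5  # same bounds as the range tolerance above
print(p99, within_slo)
```

A helper like this could back a custom Python probe if the built-in tolerance does not accept the raw query result in your Chaos Toolkit version. Raising on an empty result is deliberate: a probe with no data should fail loudly rather than pass silently.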

Locust Running Alongside Chaos

# SCALED: Locust load test for chaos experiments
from locust import HttpUser, task, between

class ChaosRideUser(HttpUser):
    wait_time = between(0.1, 0.3)
    host = "http://rider-api:8080"

    @task(10)
    def book_ride(self):
        with self.client.post("/api/rides/book", json={
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128, "pickupLng": -74.0060,
            "dropoffLat": 40.7580, "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }, catch_response=True, name="/api/rides/book") as resp:
            if resp.status_code == 200:
                data = resp.json()
                # A degraded booking still counts as a success for the rider.
                resp.success()
                if data.get("degradedFeatures"):
                    # Record it as a separate stat so fallback usage is visible.
                    self.environment.events.request.fire(
                        request_type="DEGRADED",
                        name="degraded_booking",
                        response_time=resp.elapsed.total_seconds() * 1000,
                        response_length=len(resp.content),
                        exception=None,
                        context={}
                    )
            else:
                resp.failure(f"Status {resp.status_code}")

    @task(3)
    def fare_estimate(self):
        self.client.get("/api/fares/estimate?zoneId=manhattan-midtown")

    @task(2)
    def trip_history(self):
        self.client.get("/api/trips/history?riderId=rider-1")

    @task(1)
    def health_check(self):
        self.client.get("/actuator/health")

Start Locust before the chaos experiment. Let it run for 2 minutes to establish baseline metrics. Then trigger the experiment. Locust keeps running throughout, providing real-time verification of the steady state hypothesis.

# Start Locust in the background
locust -f chaos/locust/chaos-load-test.py \
  --headless --users 500 --spawn-rate 50 \
  --run-time 10m --host http://rider-api:8080 &

# Wait for baseline
sleep 120

# Run chaos experiment
chaos run chaos/experiments/kill-surge-pricing.yaml \
  --journal-path chaos/results/kill-surge-$(date +%Y%m%d-%H%M%S).json
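The journal file is plain JSON, so the verdict can be extracted without opening the full report. Here is a minimal sketch, assuming the journal exposes `status`, `deviated`, and `steady_states.before/after.steady_state_met` as current Chaos Toolkit releases do. Verify the field names against your own journal before relying on it. The function names and the sample journal are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_journal(journal: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a Chaos Toolkit journal to the fields a game day review needs."""
    steady = journal.get("steady_states") or {}
    before = steady.get("before") or {}
    after = steady.get("after") or {}
    return {
        "title": (journal.get("experiment") or {}).get("title"),
        "status": journal.get("status"),
        "deviated": journal.get("deviated"),
        "steady_before": before.get("steady_state_met"),
        "steady_after": after.get("steady_state_met"),
    }


def load_and_summarize(path: str) -> Dict[str, Any]:
    """Read a journal written via --journal-path and summarize it."""
    return summarize_journal(json.loads(Path(path).read_text()))


# Illustrative journal fragment, not real experiment output.
example = {
    "experiment": {"title": "Kill surge pricing"},
    "status": "completed",
    "deviated": False,
    "steady_states": {
        "before": {"steady_state_met": True, "probes": []},
        "after": {"steady_state_met": True, "probes": []},
    },
}
summary = summarize_journal(example)
print(summary)
```

Running `load_and_summarize` over every file in chaos/results gives a one-line verdict per experiment for the results review.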

Game Day Planning

The quarterly chaos session:

Game Day Agenda (2 hours):

1. Pre-flight (15 min)
   - Confirm staging environment matches production config
   - Start Locust with production traffic profile
   - Verify Grafana dashboards are visible to all participants
   - Confirm rollback procedures are documented

2. Experiment Execution (90 min)
   - Experiment 1: Kill surge pricing (20 min)
   - Experiment 2: PG latency injection (20 min)
   - Experiment 3: Redis maxmemory (25 min)
   - Experiment 4: Kill 50% pods (25 min)

3. Results Review (15 min)
   - Which steady states held?
   - Which were violated?
   - Action items for fixes

Each experiment:
   5 min  - Describe hypothesis to the team
   5 min  - Execute experiment
   5 min  - Observe dashboards and Locust stats
   5 min  - Discuss results and document findings

The Proof

The four experiments validate the entire resilience stack from CH18 and CH19. Results are covered in detail in CH20-S1 (Chaos Toolkit setup and hypothesis definition) and CH20-S2 (the four experiments with full results).

Summary of what the chaos experiments found:

Experiment          Steady State    Finding
Kill surge          HELD            Circuit breaker opened in 10s, cached fallback worked
PG latency          RESTORED        p99 spike to 850ms, circuit breaker + Redis restored it
Redis maxmemory     VIOLATED        Cache hit rate drop exposed missing Redis instance isolation
Kill 50% pods       VIOLATED        minReplicas=3 was insufficient, fixed to 6

Two passes. Two failures. Two fixes identified. Two re-runs confirming the fixes. This is the value of chaos engineering: finding the gaps before production finds them for you.