Chaos Engineering: Breaking the Ride-Hailing Platform on Purpose
The Symptom
The ride-hailing platform has circuit breakers, bulkheads, fallback chains, feature flags, and a write-ahead log. The team is confident. The architecture diagram shows resilience at every layer. The design documents describe graceful degradation for every failure mode.
Then the surge pricing service goes down in production. The circuit breaker opens. The fallback returns cached multipliers. Everything works according to plan for 5 minutes. Then the cached multipliers expire (5-minute TTL). The fallback returns the default 1.0x. Riders in a 3.2x surge zone now book at base fare. Revenue loss: $34,000 in 6 minutes before an engineer notices and restarts the surge pricing service.
The circuit breaker worked. The bulkhead worked. The fallback chain worked. But nobody tested what happens when the surge pricing service stays down longer than the cache TTL. The resilience patterns were correct in isolation. They were untested as a system.
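The gap can be reproduced on paper. The following is a minimal sketch, assuming a fallback chain of cached multiplier (5-minute TTL) followed by the 1.0x default, and assuming the cache was last refreshed at the moment the outage began. The function and constant names are illustrative, not the platform's actual code.

```python
DEFAULT_MULTIPLIER = 1.0
CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL, as in the incident above


def multiplier_during_outage(seconds_since_outage, last_live_multiplier,
                             cache_ttl=CACHE_TTL_SECONDS):
    """Return the multiplier the fallback chain serves while the service is down.

    Assumption: the cached value was written when the outage started.
    """
    if seconds_since_outage < cache_ttl:
        return last_live_multiplier  # cached fallback still valid
    return DEFAULT_MULTIPLIER        # cache expired, surge is silently lost


# Walk through an outage in a 3.2x surge zone, minute by minute.
for minute in (1, 4, 5, 8):
    print(minute, multiplier_during_outage(minute * 60, 3.2))
```

Every layer behaves as specified, yet the output changes the moment the outage outlasts the TTL. That is the kind of interaction a unit test of any single pattern would likely miss.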
Chaos engineering closes this gap. You break things on purpose, under controlled conditions, with load testing running alongside, and you observe whether the system behaves as designed.
The Cause
Resilience patterns are code. Code has bugs. The bugs in resilience code only manifest during failures. And failures in production are not controlled experiments. When a real incident happens, the team is diagnosing, not observing. They are fixing, not measuring. They never learn whether the circuit breaker opened at the right threshold, whether the bulkhead was sized correctly, or whether the fallback chain reached level 4.
Chaos engineering provides:
- Steady state hypothesis: Define what “working” looks like in numbers. p99 < 500ms, error rate < 0.1%, bookings completing.
- Controlled failure injection: Kill a service, add latency, fill memory, drop pods. One variable at a time.
- Observation under load: Locust runs alongside the experiment. Real traffic patterns, real concurrency.
- Automated verification: The tool checks whether the steady state held or was violated.
- Documented results: Each experiment produces a report. What broke, what held, what needs fixing.
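To make "working in numbers" concrete, a steady state can be written as a small set of thresholds plus a check that returns every violation. This is a sketch only: the class and function names are hypothetical, and the "at least one booking per minute" floor is an assumed reading of "bookings completing".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    max_p99_seconds: float = 0.5    # p99 < 500ms
    max_error_rate: float = 0.001   # error rate < 0.1%
    min_bookings_per_min: int = 1   # assumption: at least one booking completes


def violations(state, p99_seconds, error_rate, bookings_per_min):
    """Return human-readable violations; an empty list means the steady state held."""
    problems = []
    if p99_seconds >= state.max_p99_seconds:
        problems.append(
            f"p99 {p99_seconds * 1000:.0f}ms >= {state.max_p99_seconds * 1000:.0f}ms")
    if error_rate >= state.max_error_rate:
        problems.append(
            f"error rate {error_rate:.2%} >= {state.max_error_rate:.2%}")
    if bookings_per_min < state.min_bookings_per_min:
        problems.append("bookings are not completing")
    return problems


# Example measurement taken during an experiment (values are made up).
print(violations(SteadyState(), p99_seconds=0.85, error_rate=0.0004,
                 bookings_per_min=120))
```

Returning all violations rather than stopping at the first keeps the experiment report complete when one failure degrades several metrics at once.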
The ride-hailing platform needs four experiments that test the four major failure modes from CH18 and CH19:
| Experiment | What Breaks | What Should Happen |
| --- | --- | --- |
| Kill surge pricing | Surge pricing service | Circuit breaker → cached → default 1.0x |
| PG latency injection | PostgreSQL connections | Circuit breaker → Redis fallback |
| Redis maxmemory | Redis eviction policy | Cache degradation, no crash |
| Kill 50% pods | Kubernetes capacity | HPA scales, traffic redistributes |
The Baseline
Current resilience verification: manual. An engineer reads the configuration and says “this should work.” The circuit breaker has never been tested under real load with a real failure. The bulkhead sizes were calculated on paper, never validated. The WAL replay was tested with 10 entries, never with 12,000.
| Resilience Pattern | Tested in Dev? | Tested Under Load? | Tested in Staging? |
| --- | --- | --- | --- |
| Circuit breaker | Yes (unit) | No | No |
| Bulkhead | Yes (unit) | No | No |
| Fallback chain | Yes (unit) | No | No |
| WAL replay | Yes (10 entries) | No | No |
| Feature flags | Yes (manual) | No | No |
| Caffeine fallback | Yes (unit) | No | No |
| Kafka queue fallback | No | No | No |
Seven resilience patterns. Zero validated under production-like conditions.
The Fix
Chaos Toolkit Setup
# SCALED: Install Chaos Toolkit with Kubernetes and Prometheus extensions
pip install chaostoolkit \
    chaostoolkit-kubernetes \
    chaostoolkit-prometheus \
    chaostoolkit-reporting
Project structure:
The chaos engineering workspace separates concerns into four directories. The experiments folder holds YAML definitions that deliberately break components: killing surge pricing pods, injecting PostgreSQL latency, exhausting Redis memory, and terminating half the pod fleet. The steady-state folder defines the health probes and hypotheses that validate the system is behaving correctly before and after each experiment. The locust folder contains load test scripts that simulate realistic traffic patterns during chaos, and the results directory captures output for post-mortem analysis.
Experiment Framework
Every experiment follows the same structure:
# SCALED: Experiment template
version: 1.0.0
title: "[Experiment Name]"
description: "[What we are testing]"
tags:
  - "resilience"
  - "ride-hailing"

# What "working" looks like
steady-state-hypothesis:
  title: "Ride booking continues within SLO"
  probes:
    - type: probe
      name: "p99-latency-within-slo"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          query: >
            histogram_quantile(0.99,
            rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book"}[1m]))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]  # p99 < 500ms
    - type: probe
      name: "error-rate-below-threshold"
      provider:
        type: http
        url: "http://locust:8089/stats/requests"
      tolerance:
        type: jsonpath
        path: "$.stats[-1].current_fail_per_sec"
        expect:
          type: range
          range: [0, 5]  # < 5 failures/sec
    - type: probe
      name: "bookings-completing"
      provider:
        type: http
        url: "http://rider-api:8080/actuator/health"
      tolerance:
        status: 200

# What we break
method: []  # Filled per experiment

# How we clean up
rollbacks: []  # Filled per experiment
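As an illustration of how the empty method and rollbacks sections might be filled for the kill-surge-pricing experiment, the sketch below builds both as Python dictionaries and prints them as JSON, which Chaos Toolkit also accepts as an experiment format. It uses the generic process provider to call kubectl. The deployment name `surge-pricing`, the `staging` namespace, the replica count, and the 360-second outage are all assumptions chosen so that the outage outlasts the 5-minute cache TTL.

```python
import json


def kill_surge_sections(outage_seconds=360, replicas=3):
    """Build hypothetical method and rollbacks sections for the template.

    Scaling to zero (rather than deleting pods) keeps the service down,
    because a Deployment would otherwise recreate deleted pods immediately.
    """
    scale = "scale deployment surge-pricing --namespace staging --replicas={}"
    return {
        "method": [{
            "type": "action",
            "name": "scale-surge-pricing-to-zero",
            "provider": {
                "type": "process",
                "path": "kubectl",
                "arguments": scale.format(0),
            },
            # Hold the outage past the cache TTL before re-checking steady state.
            "pauses": {"after": outage_seconds},
        }],
        "rollbacks": [{
            "type": "action",
            "name": "restore-surge-pricing",
            "provider": {
                "type": "process",
                "path": "kubectl",
                "arguments": scale.format(replicas),
            },
        }],
    }


print(json.dumps(kill_surge_sections(), indent=2))
```

Because the steady-state hypothesis is evaluated again after the method and before the rollbacks, the pause means the second check runs while the service is still down and the cache has already expired.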
Locust Running Alongside Chaos
# SCALED: Locust load test for chaos experiments
from locust import HttpUser, task, between


class ChaosRideUser(HttpUser):
    wait_time = between(0.1, 0.3)
    host = "http://rider-api:8080"

    @task(10)
    def book_ride(self):
        with self.client.post("/api/rides/book", json={
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128, "pickupLng": -74.0060,
            "dropoffLat": 40.7580, "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }, catch_response=True, name="/api/rides/book") as resp:
            if resp.status_code == 200:
                # A 200 is a success even when the platform is degraded.
                resp.success()
                if resp.json().get("degradedFeatures"):
                    # Record degraded bookings as their own stats row.
                    self.environment.events.request.fire(
                        request_type="DEGRADED",
                        name="degraded_booking",
                        response_time=resp.elapsed.total_seconds() * 1000,
                        response_length=len(resp.content),
                        exception=None,
                        context={}
                    )
            else:
                resp.failure(f"Status {resp.status_code}")

    @task(3)
    def fare_estimate(self):
        self.client.get("/api/fares/estimate?zoneId=manhattan-midtown")

    @task(2)
    def trip_history(self):
        self.client.get("/api/trips/history?riderId=rider-1")

    @task(1)
    def health_check(self):
        self.client.get("/actuator/health")
Start Locust before the chaos experiment. Let it run for 2 minutes to establish baseline metrics. Then trigger the experiment. Locust keeps running throughout, providing real-time verification of the steady state hypothesis.
# Start Locust in the background.
# --autostart begins the run immediately but keeps the web UI up, which the
# error-rate probe needs (/stats/requests is not served in --headless mode).
locust -f chaos/locust/chaos-load-test.py \
    --autostart --users 500 --spawn-rate 50 \
    --run-time 10m --host http://rider-api:8080 &

# Wait for baseline
sleep 120

# Run chaos experiment
chaos run chaos/experiments/kill-surge-pricing.yaml \
    --journal-path chaos/results/kill-surge-$(date +%Y%m%d-%H%M%S).json
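The journal written by `chaos run` is plain JSON, so a short script can turn it into a one-line verdict for the results folder. The sketch below reads the `status`, `deviated`, and `steady_states` fields of the journal. The function name is hypothetical, and the sample journal is a hand-written stand-in, not real experiment output.

```python
import json


def summarize_journal(journal):
    """Reduce a Chaos Toolkit journal (already parsed) to the key verdict fields."""
    states = journal.get("steady_states") or {}
    before = (states.get("before") or {}).get("steady_state_met")
    after = (states.get("after") or {}).get("steady_state_met")
    return {
        "title": (journal.get("experiment") or {}).get("title"),
        "status": journal.get("status"),
        "deviated": journal.get("deviated", False),
        "held_before": before,
        "held_after": after,
    }


# Hand-written stand-in for a journal; in practice use json.load() on the
# file passed to --journal-path.
sample = {
    "experiment": {"title": "Kill surge pricing"},
    "status": "deviated",
    "deviated": True,
    "steady_states": {
        "before": {"steady_state_met": True},
        "after": {"steady_state_met": False},
    },
}
print(json.dumps(summarize_journal(sample), indent=2))
```

A steady state that held before the method but not after it is the signal that the injected failure, rather than a pre-existing problem, caused the deviation.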
Game Day Planning
The quarterly chaos session:
Game Day Agenda (2 hours):
1. Pre-flight (15 min)
   - Confirm staging environment matches production config
   - Start Locust with production traffic profile
   - Verify Grafana dashboards are visible to all participants
   - Confirm rollback procedures are documented
2. Experiment Execution (90 min)
   - Experiment 1: Kill surge pricing (20 min)
   - Experiment 2: PG latency injection (20 min)
   - Experiment 3: Redis maxmemory (25 min)
   - Experiment 4: Kill 50% pods (25 min)
3. Results Review (15 min)
   - Which steady states held?
   - Which were violated?
   - Action items for fixes

Each experiment:
   5 min - Describe hypothesis to the team
   5 min - Execute experiment
   5 min - Observe dashboards and Locust stats
   5 min - Discuss results and document findings
The Proof
The four experiments validate the entire resilience stack from CH18 and CH19. Results are covered in detail in CH20-S1 (Chaos Toolkit setup and hypothesis definition) and CH20-S2 (the four experiments with full results).
Summary of what the chaos experiments found:
| Experiment | Steady State | Finding |
| --- | --- | --- |
| Kill surge | HELD | Circuit breaker opened in 10s, cached fallback worked |
| PG latency | RESTORED | p99 spike to 850ms, circuit breaker + Redis restored it |
| Redis maxmemory | VIOLATED | Cache hit rate drop exposed missing Redis instance isolation |
| Kill 50% pods | VIOLATED | minReplicas=3 was insufficient, fixed to 6 |
Two passes. Two failures. Two fixes identified. Two re-runs confirming the fixes. This is the value of chaos engineering: finding the gaps before production finds them for you.
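The minReplicas finding lends itself to a quick capacity check. The source does not say how the value 6 was derived, so the sketch below shows only one plausible line of reasoning: assume the load needs 3 healthy pods and that killing 50% rounds up, then find the smallest replica count whose survivors still cover the load. Both assumptions and both function names are illustrative.

```python
import math


def survivors(replicas, kill_fraction):
    """Pods left after killing a fraction of the fleet (kills round up)."""
    return replicas - math.ceil(replicas * kill_fraction)


def min_replicas_for(pods_needed, kill_fraction, limit=1000):
    """Smallest replica count whose survivors still meet pods_needed."""
    for replicas in range(pods_needed, limit + 1):
        if survivors(replicas, kill_fraction) >= pods_needed:
            return replicas
    raise ValueError("no replica count within limit survives this kill fraction")


# Assumed: 3 pods are needed to carry the load, and half the fleet is killed.
print(survivors(3, 0.5), min_replicas_for(3, 0.5))
```

Under these assumptions a fleet of 3 is left with a single pod, while 6 replicas leave exactly the 3 that the load requires until the HPA catches up.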