Breaking the Platform: Four Experiments with Results
The Symptom
The chaos toolkit is set up. The steady state hypothesis is defined. Locust is running 500 users against staging. The team gathers for their first game day. Four experiments. Two hours. The goal: find out whether the resilience patterns from CH18 and CH19 actually work.
Two experiments pass. Two fail. The failures are not in the code. They are in the configuration.
The Cause
Resilience patterns have two layers of correctness. The first layer is functional: does the circuit breaker open when failures exceed the threshold? Unit tests cover this. The second layer is operational: is the threshold correct? Is the timeout long enough? Is the bulkhead large enough? Is the fallback fast enough? Only chaos experiments under real load answer these questions.
Experiment 1: Kill Surge Pricing Service
Hypothesis: Ride bookings continue at base fare with zero user-facing errors.
# SCALED: Experiment 1 - Kill surge pricing
version: 1.0.0
title: "Experiment 1: Kill Surge Pricing Service"
description: "Kill all surge pricing pods. Verify circuit breaker opens and rides book at cached/default fare."
steady-state-hypothesis:
  title: "Bookings within SLO"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.1"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.1]
method:
  - type: action
    name: "kill-all-surge-pricing-pods"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=surge-pricing"
        ns: "ride-hailing"
        qty: 3
        rand: false
        grace_period: 0
    pauses:
      after: 60
rollbacks:
  - type: action
    name: "restart-surge-pricing"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_restart
      arguments:
        name: "surge-pricing"
        ns: "ride-hailing"
Result: PASSED. Steady state maintained.
Timeline:
T+0s Surge pricing pods killed (all 3 replicas)
T+2s First connection timeout from rider API to surge pricing
T+4s Circuit breaker failure rate: 30% (6 of 20 calls)
T+8s Circuit breaker failure rate: 55% (11 of 20 calls)
T+10s Circuit breaker state: CLOSED → OPEN
T+10s All surge pricing calls return cached multiplier instantly
T+10s Bulkhead releases all 20 connections
T+60s Experiment ends, steady state re-checked
Metrics during 60-second experiment:
p99 latency (booking): 380ms (within 500ms SLO)
Error rate (booking): 0.04% (within 0.1% SLO)
Surge fallback calls: 28,400 (cached: 27,800, default: 600)
Circuit breaker state: OPEN for 50s of 60s
Booking throughput: 4,920 RPS (99% of normal)
The 10-second window between pod kill and circuit breaker opening is the vulnerability window. During those 10 seconds, 20 concurrent calls (bulkhead limit) timed out at 2 seconds each. The remaining 480 concurrent calls were unaffected because the bulkhead isolated them.
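The vulnerability window exists because the breaker has to observe enough failed calls before it trips. The sketch below is a minimal count-based sliding-window breaker, not the library used in the platform. The window of 20 matches the "of 20 calls" in the timeline; the 60% threshold is an assumption, chosen only because the timeline shows the breaker still closed at 55%.

```python
from collections import deque


class CountBasedBreaker:
    """Opens when the failure rate over the last `window` calls
    reaches `threshold`."""

    def __init__(self, window: int = 20, threshold: float = 0.6):
        # window=20 mirrors the timeline; threshold=0.6 is assumed.
        self.results = deque(maxlen=window)
        self.threshold = threshold
        self.state = "CLOSED"

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(1 for ok in self.results if not ok) / len(self.results)

    def record(self, ok: bool) -> str:
        if self.state == "OPEN":
            return self.state  # calls are short-circuited to the fallback
        self.results.append(ok)
        # Evaluate only on a full window so one early failure cannot trip it.
        window_full = len(self.results) == self.results.maxlen
        if window_full and self.failure_rate() >= self.threshold:
            self.state = "OPEN"
        return self.state


breaker = CountBasedBreaker()
for _ in range(20):               # healthy traffic fills the window
    breaker.record(True)

failures_needed = 0
while breaker.state == "CLOSED":  # after the kill, every call times out
    breaker.record(False)
    failures_needed += 1

print(failures_needed, breaker.state)
```

Under these assumed settings the breaker needs 12 failed calls before it opens. Each of those calls has to wait out its timeout first, which is why a shorter timeout or a smaller window shrinks the vulnerability window.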
The 600 calls that returned the default 1.0x multiplier were for zones with no cached data. Those zones had not seen traffic recently, so no multiplier was cached. In production, this means riders in low-traffic zones get no-surge pricing during a surge pricing outage. Acceptable.
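The cached-versus-default split can be pictured as a two-step lookup. This is a sketch of the fallback behavior described above, assuming the cache is a simple zone-to-multiplier mapping; the zone names and the function name are hypothetical.

```python
DEFAULT_MULTIPLIER = 1.0  # the no-surge default described above


def fallback_multiplier(zone_id: str, cache: dict) -> tuple:
    """Return (multiplier, source) for a zone while the breaker is open."""
    cached = cache.get(zone_id)
    if cached is not None:
        return cached, "cached"
    # No recent traffic in this zone, so nothing was cached.
    return DEFAULT_MULTIPLIER, "default"


cache = {"downtown": 1.8}  # hypothetical zone and value
print(fallback_multiplier("downtown", cache))
print(fallback_multiplier("suburb-12", cache))
```

Returning the source alongside the value makes it cheap to emit the cached/default counts reported in the metrics.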
Experiment 2: 500ms PostgreSQL Latency
Hypothesis: p99 increases but stays below SLO. Circuit breaker engages and Redis fallback restores performance.
# SCALED: Experiment 2 - PostgreSQL latency injection
version: 1.0.0
title: "Experiment 2: 500ms PostgreSQL Latency"
description: "Inject 500ms network latency to PostgreSQL. Verify circuit breaker and Redis fallback maintain booking SLO."
steady-state-hypothesis:
  title: "Bookings within SLO"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.5"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
method:
  # netem on the root qdisc delays all egress from the pod kubectl selects;
  # the container needs the tc binary and the NET_ADMIN capability.
  - type: action
    name: "inject-pg-latency"
    provider:
      type: process
      path: "kubectl"
      arguments:
        ["exec", "-n", "ride-hailing", "deploy/rider-api", "--",
         "tc", "qdisc", "add", "dev", "eth0", "root", "netem",
         "delay", "500ms", "50ms", "distribution", "normal"]
    pauses:
      after: 90
rollbacks:
  - type: action
    name: "remove-pg-latency"
    provider:
      type: process
      path: "kubectl"
      arguments:
        ["exec", "-n", "ride-hailing", "deploy/rider-api", "--",
         "tc", "qdisc", "del", "dev", "eth0", "root"]
Result: PASSED after initial violation. Steady state restored.
Timeline:
T+0s 500ms latency injected on rider API network interface
T+0s All PG queries add 500ms ± 50ms
T+1s p99 booking latency: 200ms → 720ms
T+5s p99 booking latency: 850ms (SLO violated briefly)
T+8s Connection pool saturation reaches 60%
T+12s R2DBC circuit breaker opens (query timeout threshold exceeded)
T+12s Fare calculation falls back to Redis cached rules
T+12s Trip persistence falls back to WAL + Redis
T+15s p99 booking latency: 850ms → 180ms (Redis path faster)
T+90s Experiment ends
Metrics during 90-second experiment:
p99 latency peak: 850ms (T+5s to T+15s, 10-second violation)
p99 latency after CB: 180ms (T+15s onward)
Error rate: 0.12% (brief spike during transition)
WAL entries written: 4,200
Redis cache hits: 98.3%
Booking throughput: 4,650 RPS (93% of normal)
The p99 briefly hit 850ms, violating the 500ms SLO for 10 seconds. Chaos Toolkit evaluates the steady state hypothesis only before and after the method, and the probe looks back 1 minute, so the final check at T+90s covered T+30s to T+90s and never saw the spike. It passed because p99 had settled at 180ms by then.
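How much of the experiment the final check actually sees is easy to work out. The sketch below replays the p99 values from the timeline against two lookback windows; the sample list and the helper name are illustrative, not part of the experiment tooling.

```python
SLO_SECONDS = 0.5

# (seconds since injection, p99 in seconds), taken from the timeline above.
p99_samples = [(0, 0.200), (1, 0.720), (5, 0.850), (15, 0.180), (90, 0.180)]


def violations(samples, end, lookback, slo=SLO_SECONDS):
    """Return the samples inside [end - lookback, end] that exceed the SLO."""
    start = end - lookback
    return [(t, v) for t, v in samples if start <= t <= end and v > slo]


# Final check as configured: 1-minute lookback evaluated at T+90s.
as_configured = violations(p99_samples, end=90, lookback=60)
# Same check stretched over the whole 90-second experiment.
whole_run = violations(p99_samples, end=90, lookback=90)

print(as_configured)
print(whole_run)
```

The configured window reports no violations while the full-run window reports two. If a brief SLO breach should count as a failure, the probe's lookback has to cover the whole pause, or a second probe has to track the peak.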
The Redis fallback path was faster than the normal PostgreSQL path because Redis round-trip was 2ms vs PostgreSQL’s normal 15ms. The injected latency pushed PostgreSQL to 515ms, triggering the circuit breaker, and the system switched to a faster path.
After the experiment, the WAL replay flushed 4,200 entries to PostgreSQL in 3.1 seconds.
Experiment 3: Redis at maxmemory
Hypothesis: Eviction kicks in, cache hit rate drops, but the system does not crash.
# SCALED: Experiment 3 - Redis maxmemory
version: 1.0.0
title: "Experiment 3: Redis maxmemory Exhaustion"
description: "Fill Redis to maxmemory. Verify eviction policy handles it and system degrades gracefully."
steady-state-hypothesis:
  title: "Bookings within SLO"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.5"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
method:
  # Fill Redis with junk data until maxmemory
  - type: action
    name: "fill-redis-memory"
    provider:
      type: process
      path: "python3"
      arguments:
        - "-c"
        - |
          import redis
          r = redis.Redis(host='redis', port=6379)
          # With an eviction policy such as allkeys-lfu, SET does not fail at
          # maxmemory; Redis evicts instead. Stop once evictions begin.
          evicted_before = r.info('stats')['evicted_keys']
          i = 0
          while r.info('stats')['evicted_keys'] == evicted_before:
              try:
                  r.set(f'junk:{i}', 'x' * 10240)
                  i += 1
              except redis.exceptions.ResponseError:
                  break  # noeviction policy: maxmemory reached
          print(f'Filled {i} keys')
    pauses:
      after: 120
rollbacks:
  - type: action
    name: "flush-junk-keys"
    provider:
      type: process
      path: "redis-cli"
      arguments:
        ["-h", "redis", "EVAL",
         "for _,k in ipairs(redis.call('keys','junk:*')) do redis.call('del',k) end",
         "0"]
Result: FAILED. Both SLO probes stayed within tolerance, but eviction removed operational data.
Timeline:
T+0s Redis filled to maxmemory
T+0s allkeys-lfu eviction starts
T+2s Feature flag hash partially evicted
T+2s surge_pricing_enabled flag evicted → defaults to true (correct)
T+5s Surge pricing cache keys evicted
T+5s Cache hit rate: 95% → 72%
T+10s Rate limiter keys evicted
T+10s Rate limiting stops working (no keys = no limits)
T+15s p99 booking latency: 200ms → 450ms (more PG queries due to cache misses)
T+30s p99 stabilizes at 450ms
T+120s Experiment ends
Metrics during 120-second experiment:
p99 latency: 450ms (within 500ms SLO, barely)
Error rate: 0.08%
Cache hit rate: 72% (down from 95%)
Rate limiter functional: NO (keys evicted)
Feature flags intact: PARTIAL (some evicted, defaulted correctly)
The p99 stayed under 500ms, so the latency probe passed. But the experiment revealed a critical problem: the allkeys-lfu eviction policy evicted rate limiter keys and feature flag keys. These are not cache keys. They are operational data. Evicting them changes system behavior.
Rate limiter keys being evicted means rate limiting stops working during memory pressure. An attacker or traffic spike during a memory event would face no rate limits.
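Why eviction silently disables rate limiting becomes clear from how a counter-based limiter reads its state. The sketch below is a minimal fixed-window limiter with a plain dict standing in for Redis; the class name and key layout are hypothetical and not taken from the platform's code.

```python
class FixedWindowLimiter:
    """Fixed-window limiter whose counters live in an evictable store."""

    def __init__(self, store: dict, limit: int):
        self.store = store  # stands in for Redis
        self.limit = limit

    def allow(self, client_id: str, window: int) -> bool:
        key = f"ratelimit:{client_id}:{window}"  # hypothetical key layout
        # A missing key looks exactly like a client with no traffic yet.
        count = self.store.get(key, 0) + 1
        self.store[key] = count
        return count <= self.limit


store = {}
limiter = FixedWindowLimiter(store, limit=3)

before = [limiter.allow("client-a", window=0) for _ in range(5)]
store.clear()  # simulate the eviction policy dropping the counter mid-window
after_eviction = [limiter.allow("client-a", window=0) for _ in range(3)]

print(before)
print(after_eviction)
```

The client is blocked on its fourth and fifth requests, then gets a fresh allowance in the same window once the counter is evicted. Under sustained memory pressure the counter can be evicted repeatedly, so the limit is never reached.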
The fix: Separate Redis instances.
# SCALED: Separate Redis instances by data criticality
# Before: one Redis for everything
# After: three Redis instances
# Redis 1: Operational data (rate limiting, feature flags)
# maxmemory-policy: noeviction
# Data: rate limiter counters, feature flag hash
# Size: small (< 100MB)
# Redis 2: Cache data (surge multipliers, fare rules, driver locations)
# maxmemory-policy: allkeys-lfu
# Data: all cache keys
# Size: large (1-4GB)
# Redis 3: Session data (active trips, user state)
# maxmemory-policy: volatile-lru
# Data: trip state, user sessions
# Size: medium (500MB-1GB)
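One way to keep application code honest about this split is to route keys by prefix. The sketch below assumes hypothetical key prefixes and a third instance named `redis-session`; only `redis-operational` and `redis-cache` appear in the deployments that follow.

```python
# Hypothetical key prefixes mapped to the three instances planned above.
ROUTES = {
    "ratelimit:": "redis-operational",
    "flags:": "redis-operational",
    "surge:": "redis-cache",
    "fare-rules:": "redis-cache",
    "driver-loc:": "redis-cache",
    "trip:": "redis-session",
    "session:": "redis-session",
}


def instance_for(key: str) -> str:
    """Return the Redis instance that should hold `key`."""
    for prefix, instance in ROUTES.items():
        if key.startswith(prefix):
            return instance
    # Fail loudly so new key families get an explicit eviction decision.
    raise KeyError(f"no Redis instance configured for key {key!r}")


print(instance_for("ratelimit:client-a:0"))
print(instance_for("surge:zone-17"))
```

Raising on an unknown prefix is a deliberate choice in this sketch: a new kind of key should not land in the evictable cache by default.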
# SCALED: Kubernetes Redis deployments
# redis-operational.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-operational
  namespace: ride-hailing
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: redis-operational
  template:
    metadata:
      labels:
        app: redis-operational
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args:
            - "--maxmemory"
            - "100mb"
            - "--maxmemory-policy"
            - "noeviction"
          resources:
            limits:
              memory: 150Mi
---
# redis-cache.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: ride-hailing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args:
            - "--maxmemory"
            - "2gb"
            - "--maxmemory-policy"
            - "allkeys-lfu"
          resources:
            limits:
              memory: 2500Mi
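// SCALED: Separate Redis connections per instance
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.data.redis.serializer.RedisSerializationContext;

@Configuration
public class RedisConfig {
    // Factories are beans so Spring initializes and shuts them down.
    @Bean
    public LettuceConnectionFactory operationalRedisFactory() {
        return new LettuceConnectionFactory(
            new RedisStandaloneConfiguration("redis-operational", 6379));
    }

    @Bean
    public LettuceConnectionFactory cacheRedisFactory() {
        return new LettuceConnectionFactory(
            new RedisStandaloneConfiguration("redis-cache", 6379));
    }

    @Bean("operationalRedis")
    public ReactiveRedisTemplate<String, String> operationalRedis() {
        return new ReactiveRedisTemplate<>(
            operationalRedisFactory(), RedisSerializationContext.string());
    }

    @Bean("cacheRedis")
    public ReactiveRedisTemplate<String, String> cacheRedis() {
        return new ReactiveRedisTemplate<>(
            cacheRedisFactory(), RedisSerializationContext.string());
    }
}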
Re-run after fix:
Experiment 3 (re-run with separated Redis):
Cache Redis filled to maxmemory
Eviction applies only to cache keys
Rate limiter: FUNCTIONAL (operational Redis untouched)
Feature flags: INTACT (operational Redis untouched)
p99 latency: 380ms (cache misses hit PG, but less impact)
Cache hit rate: 68% (similar, but critical data unaffected)
Result: PASSED
Experiment 4: Kill 50% of Rider API Pods
Hypothesis: HPA scales up, traffic redistributes, error rate stays below 0.5% after 30 seconds.
# SCALED: Experiment 4 - Kill 50% of pods
version: 1.0.0
title: "Experiment 4: Kill 50% of Rider API Pods"
description: "Kill half of rider API pods. Verify HPA scales up and traffic redistributes."
steady-state-hypothesis:
  title: "Bookings within SLO after recovery"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.5"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
method:
  - type: action
    name: "kill-half-rider-api-pods"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=rider-api"
        ns: "ride-hailing"
        qty: 3  # Kill 3 of 6 pods
        rand: true
        grace_period: 0
    pauses:
      after: 120
rollbacks:
  - type: action
    name: "ensure-rider-api-scaled"
    provider:
      type: process
      path: "kubectl"
      arguments:
        ["scale", "deployment/rider-api", "--replicas=6", "-n", "ride-hailing"]
Result: FAILED. Error rate exceeded threshold.
Timeline (initial run with minReplicas=3):
T+0s 3 of 6 rider API pods killed
T+0s 50% of in-flight requests fail (connection reset)
T+1s Kubernetes removes killed pods from service endpoints
T+1s Remaining 3 pods receive 100% of traffic (2x normal load)
T+5s 3 pods at 85% CPU (normal: 42% per pod)
T+10s HPA detects CPU > 70%, begins scaling
T+15s New pod scheduled, pulling image
T+25s New pod starting, JVM initializing
T+35s New pod ready, begins receiving traffic
T+40s HPA scales to 5 pods, load balances
T+60s HPA scales to 6 pods, back to normal
T+120s Experiment ends
Metrics:
Error rate T+0 to T+25: 8% (connection resets + pod overload)
Error rate T+25 to T+60: 1.2% (recovering)
Error rate T+60 to T+120: 0.08% (normal)
p99 peak: 1,200ms (T+5s to T+35s)
Recovery time: 60 seconds
8% error rate for 25 seconds. The HPA took 35 seconds to bring a new pod online (10s detection + 5s scheduling + 10s image pull + 10s JVM startup). During that time, 3 pods handled the full load at 85% CPU, causing queuing and timeouts.
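The 85% figure follows almost directly from the baseline. The sketch below works through the arithmetic under two simplifying assumptions that are not measured in the experiment: CPU scales linearly with traffic, and load spreads evenly across the surviving pods. The function names are illustrative.

```python
def cpu_after_loss(baseline_cpu: float, total_pods: int, killed: int) -> float:
    """Per-pod CPU (%) once the survivors absorb the killed pods' traffic."""
    survivors = total_pods - killed
    if survivors <= 0:
        raise ValueError("no pods left to serve traffic")
    return baseline_cpu * total_pods / survivors


def max_baseline_cpu(target_cpu: float, total_pods: int, killed: int) -> float:
    """Highest steady-state CPU (%) that keeps survivors at or under target."""
    return target_cpu * (total_pods - killed) / total_pods


after_kill = cpu_after_loss(42, total_pods=6, killed=3)
headroom = max_baseline_cpu(70, total_pods=6, killed=3)

print(after_kill)  # 84.0
print(headroom)    # 35.0
```

A 42% baseline doubles to 84% when half the pods disappear, close to the 85% observed. To stay under the HPA's 70% target through the same loss, the steady-state utilization would need to be about 35% or lower.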
The fix: Increase minReplicas from 3 to 6.
# SCALED: HPA with higher minimum replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rider-api
  namespace: ride-hailing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rider-api
  minReplicas: 6  # Was 3. Losing 50% still leaves 3.
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
Re-run after fix:
Experiment 4 (re-run with minReplicas=6):
T+0s 3 of 6 rider API pods killed
T+0s Remaining 3 pods receive 100% of traffic
T+1s 3 pods at 85% CPU
T+5s Error rate: 0.3% (brief connection resets only)
T+10s HPA detects overload, scales to 9
T+25s 3 new pods ready, load balanced
T+30s Error rate: 0.02%
Error rate T+0 to T+5: 0.3% (within 0.5% threshold)
Error rate T+5 to T+120: 0.02%
p99 peak: 420ms (within 500ms SLO)
Recovery time: 5 seconds to acceptable, 25 seconds to full
Result: PASSED
With 6 minimum replicas, losing 3 still leaves 3 pods. Each handles 2x load at 85% CPU. The HPA brings new pods online within 25 seconds. The error rate during the initial window dropped from 8% to 0.3%.
Game Day Format
Quarterly Chaos Game Day Checklist:
Pre-Game (1 week before):
□ Staging environment mirrors production config
□ All experiments reviewed and updated
□ Locust scripts updated for current traffic patterns
□ Rollback procedures verified
□ On-call rotation notified (do not staff the game day with whoever is currently on call)
Game Day Execution (2 hours):
□ Start Locust: 500 users, production traffic profile
□ Wait 5 minutes for baseline metrics
□ Run Experiment 1 → Record results → 5 min cooldown
□ Run Experiment 2 → Record results → 5 min cooldown
□ Run Experiment 3 → Record results → 5 min cooldown
□ Run Experiment 4 → Record results → 5 min cooldown
□ Review all results as a team
Post-Game (1 week after):
□ File tickets for any steady state violations
□ Update resilience configurations based on findings
□ Re-run failed experiments to verify fixes
□ Update experiment parameters for next quarter
□ Share results in engineering all-hands
CI Integration
# SCALED: Nightly chaos experiment in CI
# .gitlab-ci.yml
chaos-test-surge-pricing:
  stage: chaos
  image: chaostoolkit/chaostoolkit:latest
  variables:
    KUBECONFIG: /etc/kubernetes/config
  before_script:
    # chaostoolkit-reporting provides `chaos report`; locust runs inside this job
    - pip install chaostoolkit-kubernetes chaostoolkit-prometheus chaostoolkit-reporting locust
    - |
      locust -f chaos/locust/chaos-load-test.py \
        --headless --users 100 --spawn-rate 20 \
        --host http://rider-api.staging:8080 &
    - sleep 60  # Wait for baseline
  script:
    - chaos run chaos/experiments/kill-surge-pricing.yaml
      --journal-path chaos/results/ci-$(date +%Y%m%d).json
  after_script:
    - chaos report --export-format=html
      chaos/results/ci-*.json chaos/results/report.html
  artifacts:
    paths:
      - chaos/results/
    when: always
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: always
  allow_failure: false  # Block deployment if chaos test fails
The nightly CI job runs the surge pricing kill experiment against staging. If the steady state is violated, the pipeline fails. No deployment until the resilience pattern is fixed.
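When a nightly run fails, the journal file written by `chaos run` is the first place to look. The sketch below summarizes a journal that has already been loaded with `json.load`. It assumes the journal carries top-level `status` and `deviated` fields and per-probe `tolerance_met` flags under `steady_states`; check those field names against the Chaos Toolkit version in use. The sample journal is made up for illustration.

```python
def summarize_journal(journal: dict) -> dict:
    """Report whether a run passed and which steady state probes failed."""
    failed_probes = []
    steady_states = journal.get("steady_states") or {}
    for phase in ("before", "after"):
        state = steady_states.get(phase) or {}
        for probe in state.get("probes", []):
            if not probe.get("tolerance_met", False):
                name = probe.get("activity", {}).get("name", "unknown")
                failed_probes.append((phase, name))
    passed = (journal.get("status") == "completed"
              and not journal.get("deviated", False))
    return {"passed": passed, "failed_probes": failed_probes}


# Made-up journal for illustration; a real one comes from --journal-path.
sample = {
    "status": "failed",
    "deviated": True,
    "steady_states": {
        "before": {"probes": [
            {"activity": {"name": "p99-under-500ms"}, "tolerance_met": True},
        ]},
        "after": {"probes": [
            {"activity": {"name": "p99-under-500ms"}, "tolerance_met": True},
            {"activity": {"name": "error-rate-under-0.1"}, "tolerance_met": False},
        ]},
    },
}

print(summarize_journal(sample))
```

A summary like this can be posted to the team channel so the failing probe is visible without opening the HTML report.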
Start with one experiment in CI. Add more as confidence grows. The goal is not to run all four experiments nightly from day one. The goal is to catch regressions. A configuration change that accidentally disables the circuit breaker will be caught by the nightly chaos test before it reaches production.
Chaos Test Coverage Over Time:
Month 1: Surge pricing kill (nightly)
Month 2: + PG latency injection (weekly)
Month 3: + Redis maxmemory (weekly)
Month 4: + Pod kill (weekly)
Month 6: All four experiments nightly
Month 12: Custom experiments for new dependencies
Each experiment takes roughly 3 minutes to run (baseline + injection + observation + rollback), so four experiments take about 12 minutes. Adding 12 minutes to a nightly CI pipeline is a small cost for the confidence that resilience patterns actually work.
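The 12-minute estimate can be sanity-checked from the experiment definitions. The sketch below adds the 60-second baseline wait from the CI job to each experiment's `pauses.after` value. It ignores probe evaluation and rollback time, and every experiment name except `kill-surge-pricing` is a hypothetical label.

```python
BASELINE_SECONDS = 60  # the `sleep 60` in the CI job

# Pause after fault injection, from each experiment's `pauses.after`.
PAUSES = {
    "kill-surge-pricing": 60,
    "pg-latency": 90,
    "redis-maxmemory": 120,
    "kill-rider-api-pods": 120,
}


def fixed_runtime(pauses: dict, baseline: int = BASELINE_SECONDS) -> int:
    """Seconds spent waiting, excluding probes and rollbacks."""
    return sum(baseline + pause for pause in pauses.values())


total_seconds = fixed_runtime(PAUSES)
print(total_seconds, total_seconds / 60)
```

The fixed waits come to 630 seconds, or 10.5 minutes, which leaves about 90 seconds of the 12-minute budget for probe queries and rollbacks.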