Breaking the Platform: Four Experiments with Results

The Symptom

Chaos Toolkit is set up. The steady state hypothesis is defined. Locust is running 500 users against staging. The team gathers for their first game day. Four experiments. Two hours. The goal: find out whether the resilience patterns from CH18 and CH19 actually work.

Two experiments pass. Two fail. The failures are not in the code. They are in the configuration.

The Cause

Resilience patterns have two layers of correctness. The first layer is functional: does the circuit breaker open when failures exceed the threshold? Unit tests cover this. The second layer is operational: is the threshold correct? Is the timeout long enough? Is the bulkhead large enough? Is the fallback fast enough? Only chaos experiments under real load answer these questions.

Experiment 1: Kill Surge Pricing Service

Hypothesis: Ride bookings continue at base fare with zero user-facing errors.

# SCALED: Experiment 1 - Kill surge pricing
version: 1.0.0
title: "Experiment 1: Kill Surge Pricing Service"
description: "Kill all surge pricing pods. Verify circuit breaker opens and rides book at cached/default fare."

steady-state-hypothesis:
  title: "Bookings within SLO"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
              sum(rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.1"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.1]

method:
  - type: action
    name: "kill-all-surge-pricing-pods"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=surge-pricing"
        ns: "ride-hailing"
        qty: 3
        rand: false
        grace_period: 0
    pauses:
      after: 60

rollbacks:
  - type: action
    name: "restart-surge-pricing"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_restart
      arguments:
        name: "surge-pricing"
        ns: "ride-hailing"

Result: PASSED. Steady state maintained.

Timeline:
T+0s     Surge pricing pods killed (all 3 replicas)
T+2s     First connection timeout from rider API to surge pricing
T+4s     Circuit breaker failure rate: 30% (6 of 20 calls)
T+8s     Circuit breaker failure rate: 55% (11 of 20 calls)
T+10s    Circuit breaker state: CLOSED → OPEN
T+10s    All surge pricing calls return cached multiplier instantly
T+10s    Bulkhead releases all 20 connections
T+60s    Experiment ends, steady state re-checked

Metrics during 60-second experiment:
p99 latency (booking):    380ms (within 500ms SLO)
Error rate (booking):     0.04% (within 0.1% SLO)
Surge fallback calls:     28,400 (cached: 27,800, default: 600)
Circuit breaker state:    OPEN for 50s of 60s
Booking throughput:       4,920 RPS (99% of normal)

The 10 seconds between the pod kill and the circuit breaker opening are the vulnerability window. During that window, at most 20 calls at a time (the bulkhead limit) were waiting on surge pricing, and each timed out after 2 seconds. The remaining 480 concurrent requests were unaffected because the bulkhead kept them from queuing behind the dead dependency.
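The opening behavior in the timeline can be reproduced with a small simulation. This is a sketch, not the platform's breaker: it assumes a count-based sliding window of 20 calls (inferred from the "of 20 calls" figures), and the 60% threshold is an assumed value chosen only because the timeline shows the breaker still closed at 55%. The class name is hypothetical.

```python
from collections import deque


class SlidingWindowBreaker:
    """Count-based sliding-window breaker (hypothetical, for illustration)."""

    def __init__(self, window_size=20, threshold_pct=60):
        self.window = deque(maxlen=window_size)  # True marks a failed call
        self.threshold_pct = threshold_pct       # assumed value, not from the text
        self.state = "CLOSED"

    def record(self, failed):
        """Record one call outcome; open the breaker once the rate is too high."""
        self.window.append(failed)
        failures = sum(self.window)
        window_full = len(self.window) == self.window.maxlen
        # Integer comparison avoids floating-point edge cases at the threshold.
        if window_full and failures * 100 >= self.threshold_pct * len(self.window):
            self.state = "OPEN"
        return self.state


breaker = SlidingWindowBreaker()
for _ in range(20):                 # healthy traffic fills the window
    breaker.record(False)
states = [breaker.record(True) for _ in range(15)]  # dependency is down
print("breaker opened after", states.index("OPEN") + 1, "failed calls")
```

With a 2-second timeout per call, the number of failed calls needed to cross the threshold is what sets the length of the vulnerability window. A smaller window or a shorter timeout shrinks it.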

The 600 calls that returned the default 1.0x multiplier were for zones with no cached data. Those zones had not seen traffic recently, so no multiplier was cached. In production, this means riders in low-traffic zones get no-surge pricing during a surge pricing outage. Acceptable.
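The cached-then-default fallback described here reduces to a lookup with a default. A minimal sketch, assuming the cache is a plain mapping from zone to multiplier; the zone names and the function name are hypothetical.

```python
DEFAULT_MULTIPLIER = 1.0  # base fare, no surge


def fallback_multiplier(zone_id, cached_multipliers):
    """Return the last cached multiplier for a zone, or the default if none exists."""
    return cached_multipliers.get(zone_id, DEFAULT_MULTIPLIER)


# Hypothetical zones: two with recent traffic, one without any cached entry.
cache = {"downtown": 1.8, "airport": 2.3}
print(fallback_multiplier("downtown", cache))   # cached value is reused
print(fallback_multiplier("suburb-17", cache))  # no entry, default applies
```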

Experiment 2: 500ms PostgreSQL Latency

Hypothesis: p99 increases but stays below SLO. Circuit breaker engages and Redis fallback restores performance.

# SCALED: Experiment 2 - PostgreSQL latency injection
version: 1.0.0
title: "Experiment 2: 500ms PostgreSQL Latency"
description: "Inject 500ms network latency to PostgreSQL. Verify circuit breaker and Redis fallback maintain booking SLO."

steady-state-hypothesis:
  title: "Bookings within SLO"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
              sum(rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.5"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]

method:
  - type: action
    name: "inject-pg-latency"
    provider:
      type: process
      path: "kubectl"
      arguments:
        [
          "exec",
          "-n",
          "ride-hailing",
          "deploy/rider-api",
          "--",
          "tc",
          "qdisc",
          "add",
          "dev",
          "eth0",
          "root",
          "netem",
          "delay",
          "500ms",
          "50ms",
          "distribution",
          "normal",
        ]
    pauses:
      after: 90

rollbacks:
  - type: action
    name: "remove-pg-latency"
    provider:
      type: process
      path: "kubectl"
      arguments:
        [
          "exec",
          "-n",
          "ride-hailing",
          "deploy/rider-api",
          "--",
          "tc",
          "qdisc",
          "del",
          "dev",
          "eth0",
          "root",
        ]

Result: PASSED after initial violation. Steady state restored.

Timeline:
T+0s     500ms latency injected on rider API network interface
T+0s     All PG queries add 500ms ± 50ms
T+1s     p99 booking latency: 200ms → 720ms
T+5s     p99 booking latency: 850ms (SLO violated briefly)
T+8s     Connection pool saturation reaches 60%
T+12s    R2DBC circuit breaker opens (query timeout threshold exceeded)
T+12s    Fare calculation falls back to Redis cached rules
T+12s    Trip persistence falls back to WAL + Redis
T+15s    p99 booking latency: 850ms → 180ms (Redis path faster)
T+90s    Experiment ends

Metrics during 90-second experiment:
p99 latency peak:         850ms (T+5s to T+15s, 10-second violation)
p99 latency after CB:     180ms (T+15s onward)
Error rate:               0.12% (brief spike during transition)
WAL entries written:      4,200
Redis cache hits:         98.3%
Booking throughput:       4,650 RPS (93% of normal)

The p99 briefly hit 850ms, violating the 500ms SLO for 10 seconds. The experiment still passed because the steady state hypothesis is checked after the 90-second pause, not during the injection. A 1-minute rate window evaluated at T+90s covers T+30s to T+90s, all of it after the circuit breaker opened. The final steady state check passed because p99 at T+90s was 180ms.
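Which seconds a 1-minute window covers at a given evaluation time can be worked out directly. The sketch below assumes a PromQL `[1m]` range selector evaluated at a single instant and uses the spike interval from the metrics above; the function names are hypothetical.

```python
def range_selector_window(eval_time_s, lookback_s=60):
    """Return the (start, end) seconds a [lookback] selector covers at eval time."""
    return (eval_time_s - lookback_s, eval_time_s)


def overlaps(a, b):
    """True when two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


spike = (5, 15)                            # SLO violation, seconds after injection
final_check = range_selector_window(90)    # check after the 90-second pause
early_check = range_selector_window(30)    # a check 30 seconds in, for comparison

print(final_check, "sees spike:", overlaps(final_check, spike))
print(early_check, "sees spike:", overlaps(early_check, spike))
```

A check that only runs after the pause cannot see a short violation. Catching one needs a probe that runs during the method or a query over the full experiment duration.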

The Redis fallback path was faster than the normal PostgreSQL path because Redis round-trip was 2ms vs PostgreSQL’s normal 15ms. The injected latency pushed PostgreSQL to 515ms, triggering the circuit breaker, and the system switched to a faster path.

After the experiment, the WAL replay flushed 4,200 entries to PostgreSQL in 3.1 seconds.

Experiment 3: Redis at maxmemory

Hypothesis: Eviction kicks in, cache hit rate drops, but the system does not crash.

# SCALED: Experiment 3 - Redis maxmemory
version: 1.0.0
title: "Experiment 3: Redis maxmemory Exhaustion"
description: "Fill Redis to maxmemory. Verify eviction policy handles it and system degrades gracefully."

steady-state-hypothesis:
  title: "Bookings within SLO"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
              sum(rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.5"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]

method:
  # Fill Redis with junk data until maxmemory
  - type: action
    name: "fill-redis-memory"
    provider:
      type: process
      path: "python3"
      arguments:
        - "-c"
        - |
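          import redis
          r = redis.Redis(host='redis', port=6379)
          i = 0
          while True:
              try:
                  r.set(f'junk:{i}', 'x' * 10240)
                  i += 1
              except redis.exceptions.ResponseError:
                  break  # noeviction policy: OOM error at maxmemory
              # With an eviction policy such as allkeys-lfu, SET never fails,
              # so also stop once used memory is close to maxmemory.
              if i % 100 == 0:
                  mem = r.info('memory')
                  if mem['maxmemory'] and mem['used_memory'] >= 0.95 * mem['maxmemory']:
                      break
          print(f'Filled {i} keys')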
    pauses:
      after: 120

rollbacks:
  - type: action
    name: "flush-junk-keys"
    provider:
      type: process
      path: "redis-cli"
      arguments:
        [
          "-h",
          "redis",
          "EVAL",
          "for _,k in ipairs(redis.call('keys','junk:*')) do redis.call('del',k) end",
          "0",
        ]

Result: FAILED. Both probes stayed within tolerance, but eviction removed operational data.

Timeline:
T+0s     Redis filled to maxmemory
T+0s     allkeys-lfu eviction starts
T+2s     Feature flag keys start to be evicted
T+2s     surge_pricing_enabled flag evicted → defaults to true (correct)
T+5s     Surge pricing cache keys evicted
T+5s     Cache hit rate: 95% → 72%
T+10s    Rate limiter keys evicted
T+10s    Rate limiting stops working (no keys = no limits)
T+15s    p99 booking latency: 200ms → 450ms (more PG queries due to cache misses)
T+30s    p99 stabilizes at 450ms
T+120s   Experiment ends

Metrics during 120-second experiment:
p99 latency:              450ms (within 500ms SLO, barely)
Error rate:               0.08%
Cache hit rate:           72% (down from 95%)
Rate limiter functional:  NO (keys evicted)
Feature flags intact:     PARTIAL (some evicted, defaulted correctly)

The p99 stayed under 500ms, so the latency probe passed. But the experiment revealed a critical problem: the allkeys-lfu eviction policy evicted rate limiter keys and feature flag keys. These are not cache keys. They are operational data. Evicting them changes system behavior.

Rate limiter keys being evicted means rate limiting stops working during memory pressure. An attacker or traffic spike during a memory event would face no rate limits.
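Why eviction disables the limiter is easy to see in a toy model. The sketch below assumes a fixed-window counter per client and uses a dict as a stand-in for Redis; the key format and class name are hypothetical, and the platform's actual limiter may use a different algorithm.

```python
class FixedWindowLimiter:
    """Fixed-window counter per client; the dict stands in for Redis."""

    def __init__(self, store, limit):
        self.store = store
        self.limit = limit

    def allow(self, client_id):
        key = f"ratelimit:{client_id}"          # hypothetical key format
        count = self.store.get(key, 0) + 1      # a missing key looks like a fresh window
        self.store[key] = count
        return count <= self.limit


store = {}
limiter = FixedWindowLimiter(store, limit=3)
before_eviction = [limiter.allow("client-a") for _ in range(5)]
store.clear()                                   # simulate eviction of the counter key
after_eviction = limiter.allow("client-a")

print(before_eviction)   # the 4th and 5th requests are rejected
print(after_eviction)    # the counter is gone, so the client is allowed again
```

Each eviction resets the client's budget. Under sustained memory pressure the counter never survives long enough to reach the limit.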

The fix: Separate Redis instances.

# SCALED: Separate Redis instances by data criticality
# Before: one Redis for everything
# After: three Redis instances

# Redis 1: Operational data (rate limiting, feature flags)
# maxmemory-policy: noeviction
# Data: rate limiter counters, feature flag hash
# Size: small (< 100MB)

# Redis 2: Cache data (surge multipliers, fare rules, driver locations)
# maxmemory-policy: allkeys-lfu
# Data: all cache keys
# Size: large (1-4GB)

# Redis 3: Session data (active trips, user state)
# maxmemory-policy: volatile-lru
# Data: trip state, user sessions
# Size: medium (500MB-1GB)

# SCALED: Kubernetes Redis deployments
# redis-operational.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-operational
  namespace: ride-hailing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-operational
  template:
    metadata:
      labels:
        app: redis-operational
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args:
            - "--maxmemory"
            - "100mb"
            - "--maxmemory-policy"
            - "noeviction"
          resources:
            limits:
              memory: 150Mi

---
# redis-cache.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: ride-hailing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args:
            - "--maxmemory"
            - "2gb"
            - "--maxmemory-policy"
            - "allkeys-lfu"
          resources:
            limits:
              memory: 2500Mi

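// SCALED: Separate Redis connections per instance
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.data.redis.serializer.RedisSerializationContext;

@Configuration
public class RedisConfig {

    @Bean("operationalRedis")
    public ReactiveRedisTemplate<String, String> operationalRedis() {
        return new ReactiveRedisTemplate<>(
            connectionFactory("redis-operational"),
            RedisSerializationContext.string());
    }

    @Bean("cacheRedis")
    public ReactiveRedisTemplate<String, String> cacheRedis() {
        return new ReactiveRedisTemplate<>(
            connectionFactory("redis-cache"),
            RedisSerializationContext.string());
    }

    // Each template gets its own connection factory. The factory is
    // initialized manually because it is not registered as a Spring bean.
    private static LettuceConnectionFactory connectionFactory(String host) {
        LettuceConnectionFactory factory = new LettuceConnectionFactory(
            new RedisStandaloneConfiguration(host, 6379));
        factory.afterPropertiesSet();
        return factory;
    }
}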
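Once the instances are split, every key has to land on the right one. A minimal routing sketch, assuming keys carry a type prefix; the prefixes below are hypothetical because the platform's key naming scheme is not shown.

```python
# Hypothetical key prefixes mapped to the Redis instances defined above.
INSTANCE_BY_PREFIX = {
    "ratelimit:": "redis-operational",
    "flags:": "redis-operational",
    "surge:": "redis-cache",
    "fare-rules:": "redis-cache",
}


def instance_for(key):
    """Return the Redis instance name that should hold this key."""
    for prefix, instance in INSTANCE_BY_PREFIX.items():
        if key.startswith(prefix):
            return instance
    # Fail loudly so unclassified data never lands in an evictable cache by default.
    raise KeyError(f"no Redis instance configured for key {key!r}")


print(instance_for("ratelimit:client-a"))
print(instance_for("surge:zone-12"))
```

Raising on an unknown prefix is a deliberate choice in this sketch: a new key type has to be classified before it can be written anywhere.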

Re-run after fix:

Experiment 3 (re-run with separated Redis):
Cache Redis filled to maxmemory
Eviction applies only to cache keys
Rate limiter: FUNCTIONAL (operational Redis untouched)
Feature flags: INTACT (operational Redis untouched)
p99 latency: 380ms (cache misses hit PG, but less impact)
Cache hit rate: 68% (similar, but critical data unaffected)
Result: PASSED

Experiment 4: Kill 50% of Rider API Pods

Hypothesis: HPA scales up, traffic redistributes, error rate stays below 0.5% after 30 seconds.

# SCALED: Experiment 4 - Kill 50% of pods
version: 1.0.0
title: "Experiment 4: Kill 50% of Rider API Pods"
description: "Kill half of rider API pods. Verify HPA scales up and traffic redistributes."

steady-state-hypothesis:
  title: "Bookings within SLO after recovery"
  probes:
    - type: probe
      name: "p99-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
              sum(rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.5"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]

method:
  - type: action
    name: "kill-half-rider-api-pods"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=rider-api"
        ns: "ride-hailing"
        qty: 3 # Kill 3 of 6 pods
        rand: true
        grace_period: 0
    pauses:
      after: 120

rollbacks:
  - type: action
    name: "ensure-rider-api-scaled"
    provider:
      type: process
      path: "kubectl"
      arguments:
        ["scale", "deployment/rider-api", "--replicas=6", "-n", "ride-hailing"]

Result: FAILED. Error rate exceeded threshold.

Timeline (initial run with minReplicas=3):
T+0s     3 of 6 rider API pods killed
T+0s     50% of in-flight requests fail (connection reset)
T+1s     Kubernetes removes killed pods from service endpoints
T+1s     Remaining 3 pods receive 100% of traffic (2x normal load)
T+5s     3 pods at 85% CPU (normal: 42% per pod)
T+10s    HPA detects CPU > 70%, begins scaling
T+15s    New pod scheduled, pulling image
T+25s    New pod starting, JVM initializing
T+35s    New pod ready, begins receiving traffic
T+40s    HPA scales to 5 pods, load balances
T+60s    HPA scales to 6 pods, back to normal
T+120s   Experiment ends

Metrics:
Error rate T+0 to T+25:   8% (connection resets + pod overload)
Error rate T+25 to T+60:  1.2% (recovering)
Error rate T+60 to T+120: 0.08% (normal)
p99 peak:                 1,200ms (T+5s to T+35s)
Recovery time:            60 seconds

8% error rate for 25 seconds. The HPA took 35 seconds to bring a new pod online (10s detection + 15s scheduling and image pull + 10s JVM startup). During that time, 3 pods handled the full load at 85% CPU, causing queuing and timeouts.
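The numbers in the timeline follow from simple arithmetic. The sketch below assumes load spreads evenly across pods and CPU scales linearly with load, which real services only approximate; the function name is hypothetical and the timestamps are copied from the timeline above.

```python
def cpu_after_pod_loss(baseline_cpu_pct, total_pods, pods_lost):
    """Estimate per-pod CPU when survivors absorb the full load evenly."""
    surviving = total_pods - pods_lost
    if surviving <= 0:
        raise ValueError("no pods left to serve traffic")
    return baseline_cpu_pct * total_pods / surviving


# 42% per pod across 6 pods, 3 killed: survivors carry twice the load.
print(cpu_after_pod_loss(42, 6, 3))

# Seconds after the kill at which each step happened, from the timeline.
timeline = [
    ("pods killed", 0),
    ("HPA detects CPU > 70%", 10),
    ("new pod scheduled", 15),
    ("JVM initializing", 25),
    ("new pod ready", 35),
]
stages = [(f"{a} -> {b}", t1 - t0)
          for (a, t0), (b, t1) in zip(timeline, timeline[1:])]
for name, seconds in stages:
    print(f"{name}: {seconds}s")
print("time to first new pod:", sum(s for _, s in stages), "seconds")
```

The estimate lands within a point of the observed 85%. The stage breakdown shows that detection is only part of the delay; image pull and JVM startup account for most of the rest.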

The fix: Increase minReplicas from 3 to 6.

# SCALED: HPA with higher minimum replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rider-api
  namespace: ride-hailing
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rider-api
  minReplicas: 6 # Was 3. Losing 50% still leaves 3.
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60

Re-run after fix:

Experiment 4 (re-run with minReplicas=6):
T+0s     3 of 6 rider API pods killed
T+0s     Remaining 3 pods receive 100% of traffic
T+1s     3 pods at 85% CPU
T+5s     Error rate: 0.3% (brief connection resets only)
T+10s    HPA detects overload, scales to 9
T+25s    3 new pods ready, load balanced
T+30s    Error rate: 0.02%

Error rate T+0 to T+5:    0.3% (within 0.5% threshold)
Error rate T+5 to T+120:  0.02%
p99 peak:                 420ms (within 500ms SLO)
Recovery time:            5 seconds to acceptable, 25 seconds to full
Result: PASSED

With 6 minimum replicas, losing 3 still leaves 3 pods. Each handles 2x load at 85% CPU. The HPA brings new pods online within 25 seconds. The error rate during the initial window dropped from 8% to 0.3%.

Game Day Format

Quarterly Chaos Game Day Checklist:

Pre-Game (1 week before):
□ Staging environment mirrors production config
□ All experiments reviewed and updated
□ Locust scripts updated for current traffic patterns
□ Rollback procedures verified
□ On-call rotation informed (game day participants should not be on call during the session)

Game Day Execution (2 hours):
□ Start Locust: 500 users, production traffic profile
□ Wait 5 minutes for baseline metrics
□ Run Experiment 1 → Record results → 5 min cooldown
□ Run Experiment 2 → Record results → 5 min cooldown
□ Run Experiment 3 → Record results → 5 min cooldown
□ Run Experiment 4 → Record results → 5 min cooldown
□ Review all results as a team

Post-Game (1 week after):
□ File tickets for any steady state violations
□ Update resilience configurations based on findings
□ Re-run failed experiments to verify fixes
□ Update experiment parameters for next quarter
□ Share results in engineering all-hands

CI Integration

# SCALED: Nightly chaos experiment in CI
# .gitlab-ci.yml
chaos-test-surge-pricing:
  stage: chaos
  image: chaostoolkit/chaostoolkit:latest
  services:
    - name: locust/locust:latest
      alias: locust
  variables:
    KUBECONFIG: /etc/kubernetes/config
  before_script:
    - pip install chaostoolkit-kubernetes chaostoolkit-prometheus chaostoolkit-reporting locust
    - |
      locust -f chaos/locust/chaos-load-test.py \
        --headless --users 100 --spawn-rate 20 \
        --host http://rider-api.staging:8080 &
    - sleep 60 # Wait for baseline
  script:
    - chaos run chaos/experiments/kill-surge-pricing.yaml
      --journal-path chaos/results/ci-$(date +%Y%m%d).json
  after_script:
    - chaos report --export-format=html
      chaos/results/ci-*.json chaos/results/report.html
  artifacts:
    paths:
      - chaos/results/
    when: always
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: always
  allow_failure: false # Block deployment if chaos test fails

The nightly CI job runs the surge pricing kill experiment against staging. If the steady state is violated, the pipeline fails. No deployment until the resilience pattern is fixed.
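The journal written by `--journal-path` can also be inspected after the run, for example to track results over time. The sketch below assumes the journal contains `status`, `deviated`, and `steady_states.after.steady_state_met` fields; verify the field names against the Chaos Toolkit version in use. The sample journals are made up for illustration.

```python
def journal_passed(journal):
    """True when the run completed and the steady state held after the method."""
    after = (journal.get("steady_states") or {}).get("after") or {}
    return (
        journal.get("status") == "completed"
        and not journal.get("deviated", False)
        and bool(after.get("steady_state_met", False))
    )


# Hypothetical journals; a real one would come from json.load() on the journal file.
passing = {
    "status": "completed",
    "deviated": False,
    "steady_states": {
        "before": {"steady_state_met": True},
        "after": {"steady_state_met": True},
    },
}
deviating = {
    "status": "deviated",
    "deviated": True,
    "steady_states": {
        "before": {"steady_state_met": True},
        "after": {"steady_state_met": False},
    },
}

print(journal_passed(passing), journal_passed(deviating))
```

A missing or truncated journal is treated as a failure here, so an aborted run cannot be mistaken for a pass.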

Start with one experiment in CI. Add more as confidence grows. The goal is not to run all four experiments nightly. The goal is to catch regressions. A configuration change that accidentally disables the circuit breaker will be caught by the nightly chaos test before it reaches production.

Chaos Test Coverage Over Time:

Month 1:  Surge pricing kill (nightly)
Month 2:  + PG latency injection (weekly)
Month 3:  + Redis maxmemory (weekly)
Month 4:  + Pod kill (weekly)
Month 6:  All four experiments nightly
Month 12: Custom experiments for new dependencies

Each experiment takes 3 minutes to run (baseline + injection + observation + rollback). Four experiments take 12 minutes. Adding 12 minutes to a nightly CI pipeline is a small cost for the confidence that resilience patterns actually work.