surviving the spike

Degraded Mode Design: What the System Does When Half of It Is Gone

Chapter 55 of 66

The Symptom

Wednesday. 3:47 AM. PostgreSQL primary fails over to the replica. The failover takes 12 seconds. During those 12 seconds, the rider API returns 500 on every request. 12 seconds. 4,800 failed ride requests. 4,800 people who opened the app, tapped “Request Ride,” and saw an error.

The PostgreSQL failover completed successfully. The database was unavailable for 12 seconds. The engineering response: “The failover worked as designed.”

The product response: “4,800 people could not get a ride.”

The rider API did not need PostgreSQL to book a ride. It needed driver locations (cached in Redis), surge multipliers (cached in Redis), and fare rates (cached in Redis). The only thing PostgreSQL provided in the hot path was writing the trip record. That write could have been deferred by 12 seconds. Nobody would have noticed.

The system returned 500 because it treated PostgreSQL as a hard dependency. Every component was equally critical. When any component failed, everything failed.

The Cause

Most services are built with an implicit assumption: all dependencies are available. The happy path is the only path. Error handling is catch (Exception e) { throw new ServiceException(e); }, which propagates every failure to the caller as a 500.

A ride-hailing platform has dozens of dependencies with vastly different criticality levels. Losing driver matching is catastrophic. Losing trip history for 30 seconds is invisible. Losing analytics for 5 minutes is irrelevant. But the code treats them identically.

Degraded mode design starts with a question: what is the minimum viable ride? A rider needs to request a ride, get matched with a driver, see an estimated fare, and start the trip. Everything else is enhancement.

Feature                Criticality     What Happens If It Is Gone
Ride booking           Critical        Platform is useless
Driver matching        Critical        Cannot assign rides
Fare calculation       Important       Show estimated fare
Surge pricing          Important       Book at base fare
Payment processing     Important       Charge after trip
Trip history           Deferrable      Show "Loading..." for 30s
Driver ETA             Deferrable      Show "Driver on the way"
Analytics              Expendable      Nobody notices
Promotions             Expendable      Full price rides

Critical features must work. Important features have fallbacks. Deferrable features can disappear temporarily. Expendable features can be shut off indefinitely.
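One way to keep these tiers out of tribal knowledge is to encode them next to the code that handles failures. The sketch below is illustrative only: the `Criticality`, `Feature`, and `DegradationPolicy` names are hypothetical and not part of the rider API shown in this chapter. The feature-to-tier mapping is copied from the table above.

```java
// Hypothetical names; the tiers and features mirror the criticality table.
enum Criticality { CRITICAL, IMPORTANT, DEFERRABLE, EXPENDABLE }

enum Feature {
    RIDE_BOOKING(Criticality.CRITICAL),
    DRIVER_MATCHING(Criticality.CRITICAL),
    FARE_CALCULATION(Criticality.IMPORTANT),
    SURGE_PRICING(Criticality.IMPORTANT),
    PAYMENT_PROCESSING(Criticality.IMPORTANT),
    TRIP_HISTORY(Criticality.DEFERRABLE),
    DRIVER_ETA(Criticality.DEFERRABLE),
    ANALYTICS(Criticality.EXPENDABLE),
    PROMOTIONS(Criticality.EXPENDABLE);

    final Criticality criticality;

    Feature(Criticality criticality) {
        this.criticality = criticality;
    }
}

final class DegradationPolicy {
    enum Action { FAIL_REQUEST, USE_FALLBACK, HIDE_TEMPORARILY, SWITCH_OFF }

    // Maps a failing feature to the reaction its tier allows.
    static Action onFailure(Feature feature) {
        switch (feature.criticality) {
            case CRITICAL:   return Action.FAIL_REQUEST;     // must work, no fallback
            case IMPORTANT:  return Action.USE_FALLBACK;     // degrade, keep booking
            case DEFERRABLE: return Action.HIDE_TEMPORARILY; // disappear for a while
            default:         return Action.SWITCH_OFF;       // expendable
        }
    }
}
```

With a table like this in code, adding a new dependency forces the question "which tier is it?" at review time instead of during an incident.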

Layered feature criticality diagram showing three concentric rings: core features (green, always on), optional features (yellow, simplified under load), and non-critical features (red, disabled under load)

The concentric ring diagram shows how features are organized by criticality. The core ring — booking, fare calculation, and driver matching — remains fully operational under any conditions. The middle ring (surge pricing display, ETA refinement, ride history) switches to simplified mode under load, returning cached or estimated values. The outer ring (analytics, promotions, recommendations) is disabled entirely when the system is under stress. This layered approach ensures the core booking flow is protected at all times, shedding non-essential work to preserve capacity for what matters.

The implementation requires three mechanisms:

  1. Feature flags to disable non-critical features instantly
  2. Fallback chains so each feature degrades through multiple levels before failing
  3. Kill switches for instant deactivation during incidents

The Baseline

The rider API before degraded mode:

// BOTTLENECK: Every dependency is critical
@RestController
public class RideController {

    @PostMapping("/api/rides/book")
    public Mono<RideBooking> bookRide(@RequestBody RideRequest request) {
        return fareService.calculateExactFare(request)       // needs PG
            .flatMap(fare ->
                surgeService.getMultiplier(request.getZoneId()) // needs surge svc
                    .map(m -> fare.withSurge(m)))
            .flatMap(fare ->
                matchingService.findDriver(request, fare))    // needs matching svc
            .flatMap(match ->
                tripRepository.save(new Trip(request, match))) // needs PG
            .flatMap(trip ->
                analyticsService.trackBooking(trip)            // needs analytics
                    .thenReturn(trip))                         // pass the trip on, not the analytics result
            .map(trip -> new RideBooking(trip));
        // If ANY of these fail, the rider gets 500
    }
}

Five chained calls. Five failure points. If analytics is down, the rider cannot book a ride. Analytics has zero impact on the ride experience. But it is chained into the booking pipeline, and its failure propagates to the response.

Load test with PostgreSQL down:

Locust: 500 users, PostgreSQL unavailable

Metric          Value
Error rate      100%
Booking success 0%
p50 latency     2,100ms (timeout waiting for PG)

100% failure because the trip save is in the critical path. The rider cannot book a ride because the system cannot write a database record.

The Fix

Redesign the booking pipeline with criticality awareness:

// SCALED: Criticality-aware booking pipeline
@RestController
public class RideController {

    private final FeatureFlagService featureFlags;
    private final FareService fareService;
    private final SurgePricingClient surgeClient;
    private final DriverMatchingClient matchingClient;
    private final TripRepository tripRepository;
    private final RedisTripCache tripCache;
    private final AnalyticsService analyticsService;

    @PostMapping("/api/rides/book")
    public Mono<RideBooking> bookRide(@RequestBody RideRequest request) {
        List<String> degradedFeatures = new ArrayList<>();

        return getFare(request, degradedFeatures)              // Important: fallback chain
            .flatMap(fare ->
                matchingClient.findDriver(request, fare))      // Critical: circuit breaker
            .flatMap(match ->
                persistTrip(request, match, degradedFeatures)) // Important: Redis fallback
            .doOnNext(trip ->
                trackAsync(trip))                              // Expendable: fire-and-forget
            .map(trip -> new RideBooking(trip, degradedFeatures));
    }

    private Mono<FareEstimate> getFare(RideRequest req, List<String> degraded) {
        return fareService.calculateExactFare(req)
            .onErrorResume(ex -> {
                degraded.add("fare_calculation");
                return fareService.calculateEstimatedFare(req);
            })
            .onErrorResume(ex -> {
                degraded.add("fare_estimation");
                return Mono.just(FareEstimate.baseFare(req));
            });
    }

    private Mono<Trip> persistTrip(RideRequest req, MatchResult match,
                                    List<String> degraded) {
        return tripRepository.save(new Trip(req, match))
            .onErrorResume(ex -> {
                degraded.add("trip_persistence");
                return tripCache.saveTemporary(new Trip(req, match));
            });
    }

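    private void trackAsync(Trip trip) {
        // defer() also captures exceptions thrown while building the Mono,
        // so an analytics bug cannot surface inside the booking chain
        Mono.defer(() -> analyticsService.trackBooking(trip))
            .subscribe(
                result -> {},
                error -> Metrics.counter("analytics.fire_forget.error").increment()
            );
    }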
}

Analytics is now fire-and-forget. It cannot block the booking. Fare calculation degrades through three levels: exact → estimated → base fare. Trip persistence falls back to Redis when PostgreSQL is down.
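The exact → estimated → base fare chain is an instance of a more general pattern. The sketch below shows its synchronous shape in plain Java, without Reactor, purely to make the control flow visible. `FallbackChain` is a hypothetical helper, not something the rider API uses, and the fare values in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper: tries each level in order and records which ones failed.
final class FallbackChain<T> {
    private final List<String> names = new ArrayList<>();
    private final List<Supplier<T>> sources = new ArrayList<>();

    // Registers a level; the name is what ends up in the degraded list on failure.
    FallbackChain<T> level(String name, Supplier<T> source) {
        names.add(name);
        sources.add(source);
        return this;
    }

    // Returns the first value a level produces, or lastResort if every level throws.
    T resolve(T lastResort, List<String> degraded) {
        for (int i = 0; i < sources.size(); i++) {
            try {
                return sources.get(i).get();
            } catch (RuntimeException ex) {
                degraded.add(names.get(i)); // remember what was skipped
            }
        }
        return lastResort; // a constant that cannot fail
    }
}
```

A call such as `new FallbackChain<Double>().level("fare_calculation", exact).level("fare_estimation", estimated).resolve(baseFare, degraded)` mirrors `getFare` above: the last level is a value, not a call, so the chain always ends in something that cannot throw.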

The response includes which features are degraded:

{
  "rideId": "ride-abc123",
  "driverId": "driver-456",
  "estimatedFare": 24.5,
  "currency": "USD",
  "degradedFeatures": ["fare_calculation", "trip_persistence"],
  "message": "Your ride is confirmed. Exact fare will be calculated after the trip."
}

The frontend reads degradedFeatures and adjusts the UI. Instead of showing “$24.50” as the fare, it shows “~$24.50 (estimated)”. The rider gets a ride. The exact fare is calculated when PostgreSQL recovers.
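The chapter does not show the frontend code, so the following is only a guess at the display rule, written in Java to match the rest of the examples (an Android client could look similar). `FareLabel` is a hypothetical name; the two flag strings are the ones `getFare` adds to the list.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical client-side helper: decides how to render the fare.
final class FareLabel {
    static String format(double amount, List<String> degradedFeatures) {
        // Locale.US keeps the decimal point stable regardless of device settings.
        String price = String.format(Locale.US, "$%.2f", amount);
        boolean estimated = degradedFeatures.contains("fare_calculation")
                || degradedFeatures.contains("fare_estimation");
        return estimated ? "~" + price + " (estimated)" : price;
    }
}
```

Note that "trip_persistence" on its own does not change the label: the fare is still exact, only the trip record is temporarily held in Redis.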

Feature Flags via Redis

// SCALED: Redis-backed feature flags
@Service
public class FeatureFlagService {

    private final ReactiveRedisTemplate<String, String> redis;
    private static final String FLAGS_KEY = "feature_flags";

    public Mono<Boolean> isEnabled(String feature) {
        return redis.opsForHash()
            .get(FLAGS_KEY, feature)
            .map(val -> "true".equals(val))
            .defaultIfEmpty(true) // Flag not set in Redis: default to enabled
            .onErrorReturn(true); // Redis failure: fail open, all features enabled
    }

    public Mono<Map<String, Boolean>> getAllFlags() {
        return redis.<String, String>opsForHash()
            .entries(FLAGS_KEY)
            .collectMap(
                Map.Entry::getKey,
                entry -> "true".equals(entry.getValue())
            );
    }
}

# Kill switch: disable surge pricing in production
redis-cli HSET feature_flags surge_pricing_enabled false

# Re-enable after fix
redis-cli HSET feature_flags surge_pricing_enabled true

The kill switch takes effect on the next request. No deployment. No restart. One Redis command, and surge pricing is disabled across all pods within 1 second.
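The flag service resolves three distinct cases: the flag is set, the flag was never set, and Redis itself is unreachable. The sketch below isolates that decision in plain Java so the defaults are easy to test. `FailOpenFlags` is a hypothetical class, and the injected `lookup` function merely stands in for the Redis `HGET`; it is not how `FeatureFlagService` is wired.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the fail-open rule used by the flag service.
final class FailOpenFlags {
    // Stand-in for the Redis hash lookup; returns empty when the field is absent.
    private final Function<String, Optional<String>> lookup;

    FailOpenFlags(Function<String, Optional<String>> lookup) {
        this.lookup = lookup;
    }

    boolean isEnabled(String feature) {
        try {
            return lookup.apply(feature)
                    .map("true"::equals) // only the literal "true" enables the feature
                    .orElse(true);       // flag never set: treat as enabled
        } catch (RuntimeException ex) {
            return true;                 // store unreachable: fail open
        }
    }
}
```

Failing open has a consequence worth keeping in mind: if surge pricing was switched off and Redis then becomes unreachable, this rule turns it back on. For a feature where that is unacceptable, failing closed (returning false on error) may be the safer default.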

WebFilter Kill Switch

// SCALED: WebFilter that checks feature flags before processing
@Component
@Order(1)
public class FeatureFlagFilter implements WebFilter {

    private final FeatureFlagService featureFlags;

    private static final Map<String, String> PATH_TO_FLAG = Map.of(
        "/api/trips/history", "trip_history_enabled",
        "/api/analytics", "analytics_enabled",
        "/api/promotions", "promotions_enabled",
        "/api/surge", "surge_pricing_enabled"
    );

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String path = exchange.getRequest().getPath().value();

        return PATH_TO_FLAG.entrySet().stream()
            .filter(e -> path.startsWith(e.getKey()))
            .findFirst()
            .map(entry -> featureFlags.isEnabled(entry.getValue())
                .flatMap(enabled -> {
                    if (!enabled) {
                        exchange.getResponse().setStatusCode(HttpStatus.SERVICE_UNAVAILABLE);
                        exchange.getResponse().getHeaders()
                            .add("X-Feature-Disabled", entry.getValue());
                        return exchange.getResponse().setComplete();
                    }
                    return chain.filter(exchange);
                }))
            .orElseGet(() -> chain.filter(exchange)); // build the pass-through chain only when no flag applies
    }
}

Flagged endpoints are disabled at the filter level. The request never reaches the controller, so a disabled endpoint costs one Redis flag lookup and nothing else: no database queries, no downstream calls. The response includes X-Feature-Disabled so the frontend knows to show a “temporarily unavailable” message instead of an error.
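The filter picks the first map entry whose key is a prefix of the request path. That is fine while no prefix contains another, but `Map.of` has no defined iteration order, so overlapping prefixes would match unpredictably. A minimal sketch of a longest-prefix lookup follows; `FlagRouter` is a hypothetical name, and the "/api/trips" entry in the usage note is invented to show the overlap.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: resolves a request path to the most specific flag.
final class FlagRouter {
    private final Map<String, String> prefixToFlag;

    FlagRouter(Map<String, String> prefixToFlag) {
        this.prefixToFlag = Map.copyOf(prefixToFlag);
    }

    // Returns the flag of the longest matching prefix, or empty if none matches.
    Optional<String> flagFor(String path) {
        return prefixToFlag.entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .max(Comparator.comparingInt(
                        (Map.Entry<String, String> e) -> e.getKey().length()))
                .map(Map.Entry::getValue);
    }
}
```

With both "/api/trips" and "/api/trips/history" registered, a request to "/api/trips/history/42" resolves to the history flag regardless of map ordering, and a request to "/api/rides/book" resolves to nothing and passes through.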

The Proof

Load test: PostgreSQL killed at T+60s, stays down for 120 seconds.

# SCALED: Locust test for degraded mode
from locust import HttpUser, task, between

class DegradedModeTest(HttpUser):
    wait_time = between(0.05, 0.2)

    @task(10)
    def book_ride(self):
        with self.client.post("/api/rides/book", json={
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128, "pickupLng": -74.0060,
            "dropoffLat": 40.7580, "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }, catch_response=True) as resp:
            if resp.status_code == 200:
                # A degraded booking still counts as a success:
                # the rider got a ride
                resp.success()
            else:
                resp.failure(f"Status {resp.status_code}")

    @task(3)
    def fare_estimate(self):
        self.client.get("/api/fares/estimate?zoneId=manhattan-midtown")

    @task(1)
    def trip_history(self):
        self.client.get("/api/trips/history?riderId=rider-1")

Results:

Locust: 500 users, PostgreSQL down for 120s at T+60s

                    Before PG Fail   Without Degraded   With Degraded Mode
p50 booking         120ms            timeout            145ms
p95 booking         280ms            timeout            310ms
Error rate          0.03%            100%               0.2%
Booking throughput  4,980 RPS        0 RPS              4,230 RPS (85%)
Trip history        works            fails              "Loading..."
Analytics           works            fails              disabled (kill switch)
Fare source         PostgreSQL       N/A                Redis cache / base fare

With degraded mode, the system sustained 85% of normal throughput (4,230 of 4,980 RPS) during a complete PostgreSQL outage. The 15% drop comes from the additional Redis round-trips for the fallback path and the estimated fare calculation being slightly slower than the cached exact fare.

The 0.2% error rate came from new riders with no cached data in Redis. Their first booking attempt failed because no fallback data existed. The second attempt succeeded after the system populated a base fare estimate.

When PostgreSQL recovered at T+180s, the temporary trip records in Redis were flushed to PostgreSQL by a reconciliation job. No data was lost. The riders never knew the database was down.
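The chapter does not show the reconciliation job, so the following is only a sketch of one plausible shape, under two assumptions: temporary trips are keyed by trip id, and the write to PostgreSQL is idempotent (for example an upsert on that id). `TripReconciler` and `PrimaryStore` are hypothetical names, and a plain `Map` stands in for the Redis-backed temporary store.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of a reconciliation pass over temporarily stored trips.
final class TripReconciler {

    // Stand-in for an idempotent write to the primary database.
    interface PrimaryStore {
        void upsert(String tripId, String tripJson);
    }

    // Copies each temporary record to the primary store and deletes it from the
    // temporary map only after the write succeeded. Returns the number flushed.
    static int flush(Map<String, String> temporary, PrimaryStore primary) {
        int flushed = 0;
        Iterator<Map.Entry<String, String>> it = temporary.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            try {
                primary.upsert(entry.getKey(), entry.getValue());
                it.remove(); // delete only after a confirmed write
                flushed++;
            } catch (RuntimeException ex) {
                // Leave the record in place; the next run retries it.
            }
        }
        return flushed;
    }
}
```

The ordering is what protects the "no data was lost" claim: a record leaves the temporary store only after the primary write succeeds, and because the write is idempotent, a crash between the two steps causes a harmless repeat rather than a duplicate trip.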