Feature Criticality and Graceful Degradation Chains
The Symptom
The on-call engineer gets paged at 4 AM. The surge pricing service is returning 500s. The runbook says “restart the surge pricing pods.” The engineer restarts them. The pods come up. The surge pricing service works. 22 minutes later, it fails again. The engineer restarts again. This cycle repeats three times before someone senior wakes up and asks: “Why does surge pricing failing prevent ride bookings?”
It should not. But nobody ranked the features by criticality. Nobody built fallback chains. Nobody asked: “If this component disappears, what should the rider experience?”
The Cause
Every feature in the ride-hailing platform is treated as equally important. The booking pipeline calls surge pricing, fare calculation, driver matching, trip persistence, analytics, promotions, and ETA estimation in sequence. Any failure in any step returns 500 to the rider.
This is wrong. The platform exists to connect riders with drivers. Everything else is optimization. If surge pricing is down, book at base fare. If fare calculation is down, show an estimate. If analytics is down, nobody cares until the morning standup.
The criticality matrix forces engineering decisions about what matters:
Criticality   Feature              Must Work?   Fallback Acceptable?   Can Disable?
Critical      Ride booking         Yes          No                     No
Critical      Driver matching      Yes          Queue for retry        No
Important     Fare calculation     No           Estimated fare         No
Important     Surge pricing        No           Cached / base fare     Yes
Important     Payment processing   No           Charge after trip      No
Deferrable    Trip history         No           Show later             Yes
Deferrable    Driver ETA           No           "On the way"           Yes
Expendable    Analytics            No           Drop silently          Yes
Expendable    Promotions           No           Full price             Yes
Critical features need redundancy, circuit breakers, and fast failover. Important features need fallback chains. Deferrable features need graceful hiding. Expendable features need kill switches.
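One way to keep the matrix from drifting away from the code is to encode it as data. The sketch below is illustrative only: the names Criticality, Feature, CriticalityMatrix, and the snake_case feature keys are assumptions, not part of the platform's codebase. It copies the rows of the table above and derives the order in which features could be shed under load, least critical first.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical tiers, declared from most to least critical.
enum Criticality {
    CRITICAL("redundancy, circuit breakers, fast failover"),
    IMPORTANT("fallback chain"),
    DEFERRABLE("graceful hiding"),
    EXPENDABLE("kill switch");

    final String protection;

    Criticality(String protection) {
        this.protection = protection;
    }
}

// One row of the criticality matrix.
record Feature(String name, Criticality criticality, String fallback, boolean canDisable) {}

final class CriticalityMatrix {

    // Rows copied from the table; the keys are illustrative.
    static final List<Feature> FEATURES = List.of(
        new Feature("ride_booking", Criticality.CRITICAL, "none", false),
        new Feature("driver_matching", Criticality.CRITICAL, "queue for retry", false),
        new Feature("fare_calculation", Criticality.IMPORTANT, "estimated fare", false),
        new Feature("surge_pricing", Criticality.IMPORTANT, "cached / base fare", true),
        new Feature("payment_processing", Criticality.IMPORTANT, "charge after trip", false),
        new Feature("trip_history", Criticality.DEFERRABLE, "show later", true),
        new Feature("driver_eta", Criticality.DEFERRABLE, "\"On the way\"", true),
        new Feature("analytics", Criticality.EXPENDABLE, "drop silently", true),
        new Feature("promotions", Criticality.EXPENDABLE, "full price", true));

    // Features that may be switched off, least critical tier first.
    // The sort is stable, so table order is kept within a tier.
    static List<String> shedOrder() {
        return FEATURES.stream()
            .filter(Feature::canDisable)
            .sorted(Comparator.comparing(Feature::criticality).reversed())
            .map(Feature::name)
            .toList();
    }
}
```

With the rows above, shedOrder() lists analytics and promotions first and surge pricing last, which matches the intent of the matrix: expendable features go before anything a rider would notice.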
The Baseline
The fare calculation path without a fallback chain:
// BOTTLENECK: Single path, no fallback
@Service
public class FareService {

    private final PricingRuleRepository pricingRules; // PostgreSQL
    private final SurgePricingClient surgeClient;     // External service

    public Mono<FareEstimate> calculateFare(RideRequest request) {
        return pricingRules.findByZone(request.getPickupZoneId()) // PG query
            .switchIfEmpty(Mono.error(
                new FareException("No pricing rules for zone")))
            .flatMap(rules ->
                surgeClient.getMultiplier(request.getZoneId())
                    .map(multiplier -> computeFare(request, rules, multiplier)));
    }
}
If PostgreSQL is slow, the fare calculation is slow. If PostgreSQL is down, the fare calculation fails. If the surge pricing service is down, the fare calculation fails. Two single points of failure in one method.
The Fix
Fallback Chain for Fare Calculation
// SCALED: Four-level fallback chain
@Service
public class FareService {

    private final PricingRuleRepository pricingRules;
    private final ReactiveRedisTemplate<String, String> redis;
    private final SurgePricingClient surgeClient;
    private final ObjectMapper objectMapper;

    private static final String FARE_CACHE_PREFIX = "fare:rules:";
    private static final String ZONE_BASE_RATES = "fare:base_rates";

    public Mono<FareEstimate> calculateFare(RideRequest request,
                                            List<String> degraded) {
        // Level 1: Exact fare from PostgreSQL + live surge
        return calculateExactFare(request)
            .onErrorResume(ex -> {
                degraded.add("exact_fare");
                // Level 2: Cached rules from Redis + live surge
                return calculateFromCachedRules(request);
            })
            .onErrorResume(ex -> {
                degraded.add("cached_fare");
                // Level 3: Base fare from Redis (no surge)
                return calculateBaseFare(request);
            })
            .onErrorResume(ex -> {
                degraded.add("base_fare");
                // Level 4: Fixed fare with post-trip calculation
                return Mono.just(FareEstimate.deferred(request,
                    "Fare will be calculated after your trip"));
            });
    }

    // Each level turns an empty result into an error. onErrorResume does not
    // fire for a Mono that completes empty, so without this a missing row or
    // a Redis cache miss would end the chain with no fare at all.
    private Mono<FareEstimate> calculateExactFare(RideRequest request) {
        return pricingRules.findByZone(request.getPickupZoneId())
            .switchIfEmpty(Mono.error(
                new FareException("No pricing rules for zone")))
            .flatMap(rules -> {
                cachePricingRules(request.getPickupZoneId(), rules);
                return surgeClient.getMultiplier(request.getZoneId())
                    .map(m -> computeFare(request, rules, m));
            });
    }

    private Mono<FareEstimate> calculateFromCachedRules(RideRequest request) {
        return redis.opsForValue()
            .get(FARE_CACHE_PREFIX + request.getPickupZoneId())
            .switchIfEmpty(Mono.error(
                new FareException("No cached pricing rules for zone")))
            .flatMap(json -> {
                PricingRules rules = deserialize(json);
                return surgeClient.getMultiplier(request.getZoneId())
                    .map(m -> computeFare(request, rules, m))
                    // Surge unavailable: keep the cached rules, drop the multiplier
                    .onErrorResume(ex ->
                        Mono.just(computeFare(request, rules, BigDecimal.ONE)));
            });
    }

    private Mono<FareEstimate> calculateBaseFare(RideRequest request) {
        return redis.opsForHash()
            .get(ZONE_BASE_RATES, request.getPickupZoneId())
            .switchIfEmpty(Mono.error(
                new FareException("No base rate for zone")))
            .map(rate -> FareEstimate.estimated(request,
                new BigDecimal(rate.toString()), BigDecimal.ONE));
    }

    // Fire-and-forget cache write. A failed write is swallowed so that it
    // never affects the fare that Level 1 is about to return.
    private void cachePricingRules(String zoneId, PricingRules rules) {
        Mono.defer(() -> redis.opsForValue()
                .set(FARE_CACHE_PREFIX + zoneId,
                    serialize(rules), Duration.ofHours(1)))
            .onErrorResume(ex -> Mono.empty())
            .subscribe();
    }

    private PricingRules deserialize(String json) {
        try {
            return objectMapper.readValue(json, PricingRules.class);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }

    private String serialize(PricingRules rules) {
        try {
            return objectMapper.writeValueAsString(rules);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }
}
The fallback chain:
Level 1: Exact fare (PG rules + live surge)
↓ PG fails or surge fails
Level 2: Cached fare (Redis rules + live surge, or Redis rules + no surge)
↓ Redis cache miss
Level 3: Base fare (Redis base rate for zone, no surge)
↓ Redis fails entirely
Level 4: Deferred fare ("Fare calculated after trip")
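The ordering above can also be sketched without Reactor, which makes the fall-through rule easier to see. This is a minimal synchronous model, not the service's implementation: FallbackChain and its method names are hypothetical. The point it illustrates is that both a thrown exception and an empty result have to count as "this level failed", and that every skipped level is recorded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: tries levels in order and records the ones that were skipped.
final class FallbackChain<T> {

    private record Level<V>(String name, Supplier<Optional<V>> attempt) {}

    private final List<Level<T>> levels = new ArrayList<>();

    FallbackChain<T> level(String name, Supplier<Optional<T>> attempt) {
        levels.add(new Level<>(name, attempt));
        return this;
    }

    // Returns the first non-empty result. A level that throws or returns
    // Optional.empty() is added to 'degraded' and the next level is tried.
    // 'lastResort' must not fail; it plays the role of the deferred fare.
    T resolve(Supplier<T> lastResort, List<String> degraded) {
        for (Level<T> level : levels) {
            try {
                Optional<T> result = level.attempt().get();
                if (result.isPresent()) {
                    return result.get();
                }
            } catch (RuntimeException ex) {
                // Treated the same as an empty result: fall through.
            }
            degraded.add(level.name());
        }
        return lastResort.get();
    }
}
```

A caller would register the exact, cached, and base-rate lookups as levels and pass the deferred fare as the last resort. The degraded list then carries the same labels that calculateFare records.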
Each level produces a FareEstimate with a source field indicating how the fare was calculated. The frontend adjusts the display:
// SCALED: FareEstimate with degradation tracking
public record FareEstimate(
    BigDecimal amount,
    BigDecimal surgeMultiplier,
    String currency,
    FareSource source,
    String message
) {
    public enum FareSource {
        EXACT,      // PG + live surge
        CACHED,     // Redis rules + surge
        ESTIMATED,  // Redis base rate
        DEFERRED    // Calculate after trip
    }

    // Deferred fares carry no amount or multiplier; the frontend must
    // handle the null fields and show the message instead.
    public static FareEstimate deferred(RideRequest request, String message) {
        return new FareEstimate(null, null, "USD", FareSource.DEFERRED, message);
    }

    public static FareEstimate estimated(RideRequest request,
                                         BigDecimal baseRate,
                                         BigDecimal surge) {
        // calculateDistance is a helper that is not shown in this listing
        BigDecimal distance = calculateDistance(request);
        return new FareEstimate(
            baseRate.multiply(distance).multiply(surge),
            surge, "USD", FareSource.ESTIMATED,
            "Estimated fare based on zone base rate");
    }
}
Redis Feature Flags with Kill Switches
// SCALED: Feature flag service with health tracking
@Service
public class FeatureFlagService {

    private final ReactiveRedisTemplate<String, String> redis;
    private final MeterRegistry meterRegistry;

    // Micrometer gauges keep only a weak reference to the object they observe
    // and ignore repeat registrations under the same name and tags. Holding one
    // AtomicInteger per flag keeps the gauge alive and lets its value change.
    private final Map<String, AtomicInteger> flagGauges = new ConcurrentHashMap<>();

    private static final String FLAGS_KEY = "feature_flags";
    private static final Map<String, Boolean> DEFAULTS = Map.of(
        "surge_pricing_enabled", true,
        "trip_history_enabled", true,
        "analytics_enabled", true,
        "promotions_enabled", true,
        "driver_eta_enabled", true,
        "exact_fare_enabled", true
    );

    public Mono<Boolean> isEnabled(String feature) {
        return redis.opsForHash()
            .get(FLAGS_KEY, feature)
            .map(val -> "true".equals(val))
            // Flag never set: use the default
            .defaultIfEmpty(DEFAULTS.getOrDefault(feature, true))
            // Redis unreachable: fail open to the default
            .onErrorReturn(DEFAULTS.getOrDefault(feature, true))
            .doOnNext(enabled -> recordStatus(feature, enabled));
    }

    private void recordStatus(String feature, boolean enabled) {
        flagGauges.computeIfAbsent(feature, f -> meterRegistry.gauge(
                "feature.flag.status",
                Tags.of("feature", f),
                new AtomicInteger()))
            .set(enabled ? 1 : 0);
    }

    public Mono<Void> disable(String feature) {
        return redis.opsForHash()
            .put(FLAGS_KEY, feature, "false")
            .doOnSuccess(v -> meterRegistry.counter(
                    "feature.flag.changed",
                    Tags.of("feature", feature, "action", "disable"))
                .increment())
            .then();
    }

    public Mono<Void> enable(String feature) {
        return redis.opsForHash()
            .put(FLAGS_KEY, feature, "true")
            .doOnSuccess(v -> meterRegistry.counter(
                    "feature.flag.changed",
                    Tags.of("feature", feature, "action", "enable"))
                .increment())
            .then();
    }
}
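The lookup rule inside isEnabled is worth isolating, because it decides what happens when Redis itself is the thing that is down. The sketch below models it in plain Java under stated assumptions: FailOpenFlags is a hypothetical class, and the store function stands in for the Redis hash read.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical model of the flag lookup: stored value, else default, else "enabled".
final class FailOpenFlags {

    // Stand-in for the Redis hash lookup; may throw if the store is unreachable.
    private final Function<String, Optional<String>> store;
    private final Map<String, Boolean> defaults;

    FailOpenFlags(Function<String, Optional<String>> store, Map<String, Boolean> defaults) {
        this.store = store;
        this.defaults = defaults;
    }

    boolean isEnabled(String feature) {
        boolean fallback = defaults.getOrDefault(feature, true);
        try {
            // Only the literal string "true" enables a stored flag.
            return store.apply(feature).map("true"::equals).orElse(fallback);
        } catch (RuntimeException storeUnavailable) {
            // Fail open: an unreachable store never disables a feature by itself.
            return fallback;
        }
    }
}
```

One consequence of failing open with every default set to true: if the flag store becomes unreachable, a feature that an operator switched off reads as enabled again until the store recovers. Whether that trade-off is acceptable depends on the feature, so the defaults deserve the same scrutiny as the criticality matrix.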
WebFilter Kill Switch
// SCALED: Kill switch filter for deferrable/expendable features
@Component
@Order(1)
public class KillSwitchFilter implements WebFilter {

    private final FeatureFlagService featureFlags;
    private final MeterRegistry meterRegistry;

    // Path prefix -> flag, criticality tier, and the body returned while disabled
    private static final Map<String, KillSwitchConfig> KILL_SWITCHES = Map.of(
        "/api/trips/history", new KillSwitchConfig(
            "trip_history_enabled", "deferrable",
            "{\"message\":\"Trip history is temporarily unavailable\"}"),
        "/api/analytics", new KillSwitchConfig(
            "analytics_enabled", "expendable", ""),
        "/api/promotions", new KillSwitchConfig(
            "promotions_enabled", "expendable",
            "{\"promotions\":[]}"),
        "/api/drivers/eta", new KillSwitchConfig(
            "driver_eta_enabled", "deferrable",
            "{\"eta\":null,\"message\":\"ETA temporarily unavailable\"}")
    );

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String path = exchange.getRequest().getPath().value();
        return KILL_SWITCHES.entrySet().stream()
            .filter(e -> path.startsWith(e.getKey()))
            .findFirst()
            .map(entry -> featureFlags.isEnabled(entry.getValue().flag())
                .flatMap(enabled -> {
                    if (!enabled) {
                        meterRegistry.counter("killswitch.activated",
                                Tags.of("feature", entry.getValue().flag(),
                                    "criticality", entry.getValue().criticality()))
                            .increment();
                        // Short-circuit: answer here, never reach the handler
                        ServerHttpResponse response = exchange.getResponse();
                        response.setStatusCode(HttpStatus.SERVICE_UNAVAILABLE);
                        response.getHeaders().setContentType(MediaType.APPLICATION_JSON);
                        response.getHeaders().add("X-Degraded", entry.getValue().flag());
                        String body = entry.getValue().responseBody();
                        if (body.isEmpty()) {
                            return response.setComplete();
                        }
                        DataBuffer buffer = response.bufferFactory()
                            .wrap(body.getBytes(StandardCharsets.UTF_8));
                        return response.writeWith(Mono.just(buffer));
                    }
                    return chain.filter(exchange);
                }))
            // No kill switch covers this path: build the pass-through only when needed
            .orElseGet(() -> chain.filter(exchange));
    }

    record KillSwitchConfig(String flag, String criticality, String responseBody) {}
}
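The filter above picks the first entry whose key is a prefix of the request path. That is unambiguous for the four prefixes listed, but Map.of does not define an iteration order, so overlapping prefixes would be resolved arbitrarily, and a bare startsWith also matches sibling paths such as /api/analyticsfoo. If the table grows, a segment-aware, longest-prefix lookup is one possible refinement. The sketch below is an assumption-laden illustration: KillSwitchRoutes and the promo_banners_enabled flag are hypothetical.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical resolver: maps a request path to the flag of the most specific prefix.
final class KillSwitchRoutes {

    private final Map<String, String> flagByPrefix;

    KillSwitchRoutes(Map<String, String> flagByPrefix) {
        this.flagByPrefix = Map.copyOf(flagByPrefix);
    }

    Optional<String> flagFor(String path) {
        return flagByPrefix.entrySet().stream()
            .filter(e -> matches(path, e.getKey()))
            // Longest matching prefix wins, independent of map iteration order.
            .max(Comparator.comparingInt(
                (Map.Entry<String, String> e) -> e.getKey().length()))
            .map(Map.Entry::getValue);
    }

    // Match the prefix itself or anything below it, but not sibling paths.
    private static boolean matches(String path, String prefix) {
        return path.equals(prefix) || path.startsWith(prefix + "/");
    }
}
```

The lookup is a pure function of the path, so it can be unit tested without a running WebFlux context.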
Response Contract with Degraded Field
// SCALED: API response with degradation transparency
public record RideBookingResponse(
    String rideId,
    String driverId,
    FareEstimate fare,
    List<String> degradedFeatures,
    Map<String, String> degradedMessages
) {
    public static RideBookingResponse from(Trip trip, List<String> degraded) {
        Map<String, String> messages = new LinkedHashMap<>();
        for (String feature : degraded) {
            messages.put(feature, DEGRADED_MESSAGES.getOrDefault(feature,
                "This feature is temporarily in degraded mode"));
        }
        return new RideBookingResponse(
            trip.getRideId(),
            trip.getDriverId(),
            trip.getFare(),
            degraded,
            messages
        );
    }

    private static final Map<String, String> DEGRADED_MESSAGES = Map.of(
        "exact_fare", "Showing estimated fare. Exact fare calculated after trip.",
        "surge_pricing", "Surge pricing unavailable. Booking at standard rate.",
        "trip_persistence", "Trip saved temporarily. Full receipt available soon.",
        "driver_eta", "Driver ETA temporarily unavailable."
    );
}
Grafana Degraded Mode Dashboard
# SCALED: Grafana dashboard for degraded mode monitoring
# Panels:
#   Panel 1: Feature Flag Status (Stat panel, red/green)
#     Query: feature_flag_status{feature=~".*"}
#     Threshold: 0 = red (disabled), 1 = green (enabled)
#   Panel 2: Kill Switch Activations (Time series)
#     Query: rate(killswitch_activated_total[5m])
#     Group by: feature
#   Panel 3: Fallback Chain Usage (Pie chart)
#     Query: sum by (source) (fare_estimate_total)
#     Shows distribution: exact vs cached vs estimated vs deferred
#   Panel 4: Degraded Response Rate (Time series)
#     Query: sum(rate(http_server_requests_seconds_count{degraded="true"}[5m]))
#            / sum(rate(http_server_requests_seconds_count[5m])) * 100
#     Alert if > 20% of responses are degraded for > 5 minutes

# Alert: More than 20% degraded responses for 5 minutes
- alert: HighDegradationRate
  expr: |
    sum(rate(http_server_requests_seconds_count{degraded="true"}[5m]))
      / sum(rate(http_server_requests_seconds_count[5m])) * 100
      > 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "{{ $value | humanize }}% of responses are degraded"
    description: "Check which features are in degraded mode and investigate root cause"
The Proof
Scenario: surge pricing killed, PostgreSQL at 50% capacity, analytics service down.
Feature               Without Degraded Design   With Degraded Mode Design
Surge pricing         500 errors                Base fare (1.0x)
Fare calculation      Slow (PG at 50%)          Cached rules from Redis
Trip persistence      Works (PG at 50%)         Works (PG at 50%)
Trip history          Slow                      Kill-switched (503)
Analytics             500 errors                Kill-switched (silent)
Promotions            Works                     Kill-switched (full price)
Booking error rate    34%                       0.3%
Booking throughput    3,200 RPS                 4,600 RPS (92%)
p50 booking latency   1,400 ms                  155 ms
92% throughput with three services degraded or down. The 8% reduction comes from Redis-based fare lookups being slightly slower than the hot PostgreSQL cache under normal conditions.
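The table does not state the healthy baseline, but it can be derived if the 92% figure is read as a share of normal throughput. That reading is an assumption taken from the phrase "under normal conditions" above. The short calculation below, with the hypothetical class name ThroughputMath, works it out in integer arithmetic.

```java
// Hypothetical helper for the back-of-the-envelope numbers in the table.
final class ThroughputMath {

    // Baseline implied by "observedRps is percentOfBaseline % of normal".
    static long impliedBaselineRps(long observedRps, long percentOfBaseline) {
        return observedRps * 100 / percentOfBaseline;
    }

    // Share of the baseline that a given throughput represents, in percent.
    static long percentOfBaseline(long observedRps, long baselineRps) {
        return observedRps * 100 / baselineRps;
    }

    public static void main(String[] args) {
        long baseline = impliedBaselineRps(4_600, 92);           // 4,600 RPS at 92%
        long withoutDesign = percentOfBaseline(3_200, baseline); // 3,200 RPS row
        System.out.println("baseline=" + baseline + " RPS, without design=" + withoutDesign + "%");
    }
}
```

Under that assumption the healthy baseline is 5,000 RPS, so the system without degraded mode design is running at 64% of normal while also failing 34% of bookings.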
The 0.3% error rate comes from edge cases: new zones with no cached pricing rules, new riders with no historical data. Those hit Level 4 of the fallback chain (deferred fare), which succeeds but produces a response the frontend has to handle differently.
Kill switches for trip history, analytics, and promotions freed up 15% of the rider API’s capacity. Those features were consuming thread pool slots, Redis connections, and PostgreSQL queries that the critical booking path now uses instead.