SLOs, Error Budgets, and Resilience Decision Automation
The circuit breaker opens when the failure rate crosses a threshold. The alert fires when the circuit breaker opens. The on-call engineer investigates. This is reactive: the system has already degraded, and a human is responding.
SLOs (Service Level Objectives) add a predictive layer. Instead of reacting to the circuit breaker opening, the SLO tracks how much failure budget remains. When the budget is running low, automated actions can prevent the circuit breaker from opening at all.
Defining SLOs for the Payment Service
# PRODUCTION - SLO definitions
slos:
  payment-service:
    # Availability SLO: 99.9% of payment requests succeed
    availability:
      target: 0.999
      window: 30d # Rolling 30-day window
      good_events: |
        http_server_requests_seconds_count{
          uri="/payments",
          status=~"2.."
        }
      total_events: |
        http_server_requests_seconds_count{
          uri="/payments"
        }
    # Latency SLO: 99% of payments complete within 500ms
    latency:
      target: 0.99
      window: 30d
      good_events: |
        http_server_requests_seconds_bucket{
          uri="/payments",
          le="0.5"
        }
      total_events: |
        http_server_requests_seconds_count{
          uri="/payments"
        }
    # Fraud check SLO: 95% of payments get real-time fraud scores
    # (not fallback)
    fraud_check_quality:
      target: 0.95
      window: 30d
      good_events: |
        fraud_check_total{result="real_time"}
      total_events: |
        fraud_check_total
The availability SLO (99.9%) allows roughly 43 minutes of full downtime per 30-day window (0.1% of 30 days is 43.2 minutes) or, equivalently, one failed request in every thousand. The fraud check quality SLO (95%) is deliberately lower: fallback fraud scores are acceptable 5% of the time. This SLO acknowledges that some degradation is expected and budgets for it.
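The 43-minute figure follows directly from the target and the window. A minimal Java sketch of that arithmetic is shown here; the class and method names are illustrative and not part of the payment service, and the time-based reading assumes traffic is spread evenly across the window (the SLOs above are counted per request, not per minute).

```java
import java.time.Duration;

// Illustrative helper (hypothetical name): converts an SLO target into allowed "bad" time.
final class SloBudget {
    private SloBudget() {}

    /** Time within the window that may be bad for a given target, e.g. 0.999. */
    static Duration allowedBadTime(double target, Duration window) {
        if (target <= 0.0 || target > 1.0) {
            throw new IllegalArgumentException("target must be in (0, 1]");
        }
        double badFraction = 1.0 - target; // 0.001 for a 99.9% target
        return Duration.ofMillis(Math.round(window.toMillis() * badFraction));
    }

    public static void main(String[] args) {
        Duration window = Duration.ofDays(30);
        // Availability SLO (99.9%) and fraud check quality SLO (95%)
        System.out.println(allowedBadTime(0.999, window));
        System.out.println(allowedBadTime(0.95, window));
    }
}
```

For the 95% fraud check target the same calculation gives 36 hours, which shows how much more degradation that SLO budgets for.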
Error Budget Calculation
# Remaining error budget (as a ratio of the total budget)
# 1.0 = full budget remaining, 0.0 = budget exhausted
# Availability error budget
1 - (
  (1 - (
    sum(rate(http_server_requests_seconds_count{uri="/payments",status=~"2.."}[30d]))
    /
    sum(rate(http_server_requests_seconds_count{uri="/payments"}[30d]))
  ))
  /
  (1 - 0.999)
)
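When the error budget is at 0.5 (50% remaining), the service has consumed half of its allowed failure budget for the 30-day window. When it reaches 0, the budget is exhausted: the service sits exactly at its target, and any further failures in the window push the value below 0 and violate the SLO.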
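The same ratio can be checked by hand from raw event counts. The sketch below mirrors the query in plain Java; the class name and the choice to treat "no traffic" as a full budget are assumptions made for illustration.

```java
// Illustrative helper (hypothetical name): remaining error budget from event counts.
final class ErrorBudget {
    private ErrorBudget() {}

    /**
     * Fraction of the error budget still unspent.
     * 1.0 = untouched, 0.0 = exhausted, negative = SLO violated.
     */
    static double remaining(long goodEvents, long totalEvents, double target) {
        if (target <= 0.0 || target >= 1.0) {
            throw new IllegalArgumentException("target must be in (0, 1)");
        }
        if (totalEvents <= 0) {
            return 1.0; // assumption: no traffic means nothing has been spent
        }
        double errorRatio = 1.0 - (double) goodEvents / totalEvents;
        double allowedErrorRatio = 1.0 - target;
        return 1.0 - errorRatio / allowedErrorRatio;
    }
}
```

With a 99.9% target, 500 failures in 1,000,000 requests leave about half the budget, while 2,000 failures give about -1.0, meaning the budget has been spent twice over.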
Automated Decisions Based on Budget
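// PRODUCTION - Error budget-aware resilience tuning
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.micrometer.prometheus.PrometheusMeterRegistry; // io.micrometer.prometheusmetrics in Micrometer 1.13+
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ErrorBudgetController {
    private static final Logger log = LoggerFactory.getLogger(ErrorBudgetController.class);
    private final PrometheusMeterRegistry registry;
    private final CircuitBreakerRegistry cbRegistry;

    public ErrorBudgetController(PrometheusMeterRegistry registry, CircuitBreakerRegistry cbRegistry) {
        this.registry = registry;
        this.cbRegistry = cbRegistry;
    }

    @Scheduled(fixedRate = 60_000) // Check every minute
    public void adjustResilienceBasedOnBudget() {
        double budget = calculateRemainingBudget();
        if (budget < 0.2) {
            // Less than 20% budget remaining: tighten resilience
            tightenResilience();
        } else if (budget > 0.8) {
            // More than 80% budget remaining: relax resilience
            // (allow more risk for better quality)
            relaxResilience();
        }
    }

    private void tightenResilience() {
        // Reduce failure rate threshold from 50% to 30%: the breaker opens sooner,
        // serving fallbacks instead of errors, which preserves error budget
        applyFailureRateThreshold(30);
        log.warn("Error budget low. Tightening resilience parameters.");
    }

    private void relaxResilience() {
        // If budget is ample, allow more real calls through
        // (better fraud detection quality at the cost of error budget)
        applyFailureRateThreshold(50); // assumed baseline, taken from the 50% mentioned above
    }

    // Resilience4j configs are immutable, so each breaker is rebuilt and swapped in the registry.
    // The new breaker starts CLOSED with empty metrics; callers must resolve breakers by name to see it.
    private void applyFailureRateThreshold(float percent) {
        cbRegistry.getAllCircuitBreakers().forEach(cb -> {
            if (cb.getCircuitBreakerConfig().getFailureRateThreshold() == percent) return; // already applied
            CircuitBreakerConfig config = CircuitBreakerConfig.from(cb.getCircuitBreakerConfig())
                    .failureRateThreshold(percent).build();
            cbRegistry.replace(cb.getName(), CircuitBreaker.of(cb.getName(), config));
        });
    }

    // Placeholder: evaluate the error budget query above against Prometheus and return the result.
    private double calculateRemainingBudget() {
        throw new UnsupportedOperationException("wire this to a Prometheus query");
    }
}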
This is an advanced pattern. The tradeoff is explicit: when the error budget is low, the system becomes more conservative (more fallbacks, fewer real fraud checks) to avoid violating the SLO. When the budget is ample, the system becomes more aggressive (fewer fallbacks, more real fraud checks) to provide better service quality.
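One detail of the controller is worth isolating: between 20% and 80% of budget remaining, it changes nothing. That dead band keeps the settings from flipping back and forth when the budget hovers near a single cutoff. The sketch below expresses the decision as a pure function so it can be tested without a scheduler; the class name is hypothetical, and the 30% and 50% values are the ones mentioned in the controller's comments.

```java
// Illustrative policy (hypothetical name): maps remaining budget to a breaker threshold.
final class BudgetPolicy {
    static final float TIGHT_THRESHOLD = 30f;   // percent, used when budget is low
    static final float RELAXED_THRESHOLD = 50f; // percent, used when budget is ample

    private BudgetPolicy() {}

    /**
     * Returns the failure rate threshold to apply for the given remaining budget.
     * Inside the 0.2 to 0.8 band the current threshold is kept unchanged.
     */
    static float failureRateThreshold(double remainingBudget, float current) {
        if (remainingBudget < 0.2) {
            return TIGHT_THRESHOLD;
        }
        if (remainingBudget > 0.8) {
            return RELAXED_THRESHOLD;
        }
        return current; // dead band: no change
    }
}
```

A budget that drops to 0.1 tightens the breaker to 30%; if it then recovers to 0.5, the breaker stays at 30% until the budget climbs past 0.8.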
SLO Dashboard
The SLO dashboard shows three rows:
Row 1: SLO status. One gauge per SLO showing the current achievement vs. target. Green if meeting the SLO. Amber if within 10% of violation. Red if violated.
Row 2: Error budget remaining. One time series per SLO showing the remaining error budget over time. A steep downward slope indicates rapid budget consumption (a high burn rate). A flat line indicates stable operation.
Row 3: Budget burn alerts. The alert fires when the burn rate predicts SLO violation before the window ends:
# PRODUCTION - Burn rate alert
- alert: ErrorBudgetBurnRateHigh
  expr: >
    (
      1 - (
        sum(rate(http_server_requests_seconds_count{
          uri="/payments", status=~"2.."}[1h]))
        /
        sum(rate(http_server_requests_seconds_count{
          uri="/payments"}[1h]))
      )
    )
    /
    (1 - 0.999)
    > 14.4
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Payment service burning error budget 14.4x faster than allowed"
    description: >
      At the current error rate, the 30-day error budget will be
      exhausted within about 50 hours. Immediate investigation required.
The burn rate of 14.4 means the error budget is being consumed 14.4 times faster than the sustainable rate. At this burn rate, the entire 30-day budget is exhausted in ~50 hours. The alert fires after 5 minutes of sustained high burn rate, giving the operations team time to investigate before the budget is significantly depleted.
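The relationship between burn rate, window length, and time to exhaustion is simple enough to verify directly. The following Java sketch reproduces the numbers above; the class and method names are illustrative only.

```java
// Illustrative helper (hypothetical name): burn rate arithmetic for a rolling SLO window.
final class BurnRate {
    private BurnRate() {}

    /** Burn rate = observed error ratio divided by the error ratio the SLO allows. */
    static double of(double errorRatio, double target) {
        return errorRatio / (1.0 - target);
    }

    /** Hours until a full budget is spent if the burn rate stays constant. */
    static double hoursToExhaustion(double burnRate, int windowDays) {
        return windowDays * 24.0 / burnRate;
    }

    /** Share of the whole budget spent after burning at burnRate for the given hours. */
    static double budgetSpent(double burnRate, double hours, int windowDays) {
        return burnRate * hours / (windowDays * 24.0);
    }
}
```

For a 30-day window, a burn rate of 14.4 exhausts the budget in 720 / 14.4 = 50 hours and spends 2% of it every hour. With a 99.9% target, that burn rate corresponds to an error ratio of about 1.44%.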
The combination of resilience patterns (circuit breakers, fallbacks, retries) and SLO-based observability creates a closed loop: patterns contain failures, metrics measure the impact, SLOs quantify the acceptable impact, and alerts fire when the impact approaches the boundary. No human judgment is needed to determine whether the current failure rate is “bad”: the SLO defines “bad” and the error budget tracks how close the system is to that boundary.