Advanced Chaos Experiments and Failure Injection Patterns
Latency injection is the safest starting point. After validating that latency-based resilience works, the next experiments target failure modes that latency injection cannot simulate: network partitions, resource exhaustion, and cascading multi-dependency failures.
Network Partition Simulation
A network partition differs from a service crash. In a crash, connections are refused immediately (TCP RST). In a partition, packets are silently dropped. The client sends a request and waits for a response that never arrives. The TCP stack retransmits at the kernel level (invisible to the application), and eventually the connection times out. The time between sending the request and the timeout exception can be 30 seconds or more, depending on TCP retransmission settings.
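The difference between the two failure signatures can be reproduced on a single machine with plain JDK sockets. The sketch below is an illustration only: the class name, method names, and timeout values are assumptions, and a listening socket that never accepts or replies stands in for a partitioned (or paused) dependency.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

class CrashVersusPartition {

    /** Classifies what a client observes when it calls host:port. */
    static String probe(InetAddress host, int port,
                        int connectTimeoutMillis, int readTimeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            socket.setSoTimeout(readTimeoutMillis); // bound the wait for a reply
            socket.getOutputStream()
                  .write("PING\n".getBytes(StandardCharsets.US_ASCII));
            int firstByte = socket.getInputStream().read(); // blocks until data or timeout
            return firstByte == -1 ? "CLOSED" : "REPLIED";
        } catch (ConnectException e) {
            return "REFUSED";   // crash-like: the failure is reported at once
        } catch (SocketTimeoutException e) {
            return "TIMED_OUT"; // partition-like: silence until the timeout fires
        } catch (IOException e) {
            return "ERROR";
        }
    }

    /** Crash-like case: the port was in use a moment ago, now nothing listens. */
    static String probeCrashedService() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        int port;
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            port = server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return probe(loopback, port, 3_000, 500);
    }

    /** Partition-like case: the socket is open but never accepts or replies. */
    static String probeSilentService() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            return probe(loopback, server.getLocalPort(), 3_000, 500);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("crashed service: " + probeCrashedService());
        System.out.println("silent service:  " + probeSilentService());
    }
}
```

On a typical Linux host the crashed probe fails almost immediately, while the silent probe returns only after the 500 ms read timeout. Without `setSoTimeout`, that second call would block until the operating system gave up on the connection.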
// PRODUCTION - Network partition simulation with Testcontainers
@Test
void networkPartition_circuitBreakerOpensBeforeTcpTimeout() {
    // Approximate a network partition by pausing the fraud container.
    // Its processes are frozen, so requests get no response, but
    // nothing is refused or reset: established connections stay open.
    fraudService.getDockerClient()
        .pauseContainerCmd(fraudService.getContainerId())
        .exec();
    try {
        long start = System.nanoTime();
        // Send 50 requests: enough to exceed the breaker's minimum
        // number of calls, after which the remaining calls fail fast
        int requestCount = 0;
        while (requestCount < 50) {
            restTemplate.postForEntity("/payments",
                samplePayment(), PaymentResponse.class);
            requestCount++;
        }
        long elapsed = Duration.ofNanos(
            System.nanoTime() - start).toMillis();
        CircuitBreaker cb = cbRegistry.circuitBreaker("fraudDetection");
        // The circuit breaker should open because of application-level
        // TimeLimiter timeouts, not TCP timeouts
        assertThat(cb.getState())
            .isEqualTo(CircuitBreaker.State.OPEN);
        // Total time should be bounded by the TimeLimiter, not the TCP timeout.
        // With 20 minimum calls and a 2s TimeLimiter: ~40s worst case
        assertThat(elapsed).isLessThan(60_000);
    } finally {
        // Always unpause to allow clean container shutdown
        fraudService.getDockerClient()
            .unpauseContainerCmd(fraudService.getContainerId())
            .exec();
    }
}
This experiment validates that the TimeLimiter cuts off calls to a partitioned dependency before the TCP stack’s retransmission timeout fires. Without a TimeLimiter (or an explicit read timeout on the HTTP client), each request waits for the TCP timeout (often 30+ seconds) and holds a thread for the entire duration.
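The idea of bounding the wait at the application level can be sketched with JDK classes alone. This is not Resilience4j's TimeLimiter API. It is a simplified stand-in, and `callWithTimeLimit`, `unresponsiveFraudCheck`, and the 200 ms limit are hypothetical names and values chosen for the illustration.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class BoundedCall {

    /**
     * Runs the call on the executor and waits at most `limit` for a result.
     * Returns the fallback if the call is late, fails, or the wait is interrupted.
     */
    static <T> T callWithTimeLimit(ExecutorService executor, Supplier<T> call,
                                   Duration limit, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call, executor);
        try {
            return future.get(limit.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller moves on. Note that cancelling a CompletableFuture
            // does not interrupt the worker thread that is still blocked.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        }
    }

    /** Hypothetical stand-in for a dependency that never answers. */
    static String unresponsiveFraudCheck() {
        try {
            Thread.sleep(30_000); // simulates waiting on a silent connection
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "SCORED";
    }

    /** Calls the unresponsive dependency with a 200 ms limit. */
    static String demo() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return callWithTimeLimit(executor, BoundedCall::unresponsiveFraudCheck,
                    Duration.ofMillis(200), "FALLBACK");
        } finally {
            executor.shutdownNow(); // interrupts the still-sleeping worker
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        String result = demo();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " after ~" + elapsedMillis + " ms");
    }
}
```

The caller gets its answer after roughly the time limit rather than after the dependency's 30 seconds. The worker thread stays blocked until something interrupts it, which is why an application-level time limit is usually paired with a client-side read timeout.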
Resource Exhaustion Experiments
Connection Pool Exhaustion
# PRODUCTION - Chaos Monkey assault: kill connections
chaos:
  monkey:
    assaults:
      memory-active: false
      kill-application-active: false
      latency-active: false
      exceptions-active: true
      exception:
        type: java.net.SocketException
        arguments:
          - className: java.lang.String
            value: "Connection reset"
This assault throws SocketException("Connection reset") on every Nth service call, simulating a load balancer or firewall resetting connections. The experiment validates that:
- The HTTP client retries on connection reset (transient error)
- The circuit breaker counts connection resets as failures
- The connection pool recovers (broken connections are evicted, new ones are created)
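The first expectation in the list above, retrying on a connection reset, might look like the following in plain Java. This is a minimal sketch and not the platform's actual retry configuration: `callWithRetry`, the attempt limit, and the linear backoff are assumptions made for the example.

```java
import java.net.SocketException;
import java.util.concurrent.Callable;

class ConnectionResetRetry {

    /**
     * Retries only on SocketException (for example "Connection reset").
     * Any other exception propagates immediately. Waits a little longer
     * before each new attempt.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts,
                               long backoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketException lastReset = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (SocketException e) {
                lastReset = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                }
            }
        }
        throw lastReset; // attempts exhausted: surface the last reset
    }

    /**
     * Simulates a call that is reset `resetsBeforeSuccess` times.
     * Returns the number of attempts used, or -1 if retries ran out.
     */
    static int attemptsUsed(int resetsBeforeSuccess, int maxAttempts) {
        int[] attempts = {0};
        try {
            callWithRetry(() -> {
                attempts[0]++;
                if (attempts[0] <= resetsBeforeSuccess) {
                    throw new SocketException("Connection reset");
                }
                return "OK";
            }, maxAttempts, 10);
            return attempts[0];
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("two resets, three attempts allowed: "
                + attemptsUsed(2, 3));
        System.out.println("three resets, three attempts allowed: "
                + attemptsUsed(3, 3));
    }
}
```

Retrying is only safe when the request is idempotent or carries an idempotency key. A payment POST that was reset after the server processed it could otherwise be applied twice.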
Memory Pressure
# PRODUCTION - Memory pressure experiment
chaos:
  monkey:
    assaults:
      memory-active: true
      memory-fill-target-fraction: 0.8 # Fill 80% of heap
      memory-milliseconds-hold-filled-memory: 30000 # Hold for 30s
      memory-caching-enabled: false
Memory pressure causes GC pauses, which cause request latency spikes. This experiment validates that the TimeLimiter catches GC-induced delays and the circuit breaker treats them as slow calls.
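How slow calls push a breaker open can be shown with a deliberately small count-based model. This is not how Resilience4j implements its circuit breaker. It only illustrates the principle, using the 20 minimum calls and 2 s limit mentioned in the partition test, plus an assumed 50% slow-call rate threshold.

```java
class SlowCallBreakerSketch {

    private final int minimumCalls;
    private final double slowCallRateThreshold; // fraction, e.g. 0.5 (assumed)
    private final long slowCallThresholdMillis;
    private int calls;
    private int slowCalls;
    private boolean open;

    SlowCallBreakerSketch(int minimumCalls, double slowCallRateThreshold,
                          long slowCallThresholdMillis) {
        this.minimumCalls = minimumCalls;
        this.slowCallRateThreshold = slowCallRateThreshold;
        this.slowCallThresholdMillis = slowCallThresholdMillis;
    }

    /** Records one completed call and opens the breaker if the slow rate is too high. */
    void record(long durationMillis) {
        if (open) {
            return; // an open breaker no longer lets calls through
        }
        calls++;
        if (durationMillis >= slowCallThresholdMillis) {
            slowCalls++;
        }
        boolean enoughCalls = calls >= minimumCalls;
        boolean tooSlow = (double) slowCalls / calls >= slowCallRateThreshold;
        if (enoughCalls && tooSlow) {
            open = true;
        }
    }

    boolean isOpen() {
        return open;
    }

    /** Worst case for sequential calls that all run into the time limit. */
    static long worstCaseMillisToOpen(int minimumCalls, long timeLimitMillis) {
        return minimumCalls * timeLimitMillis;
    }

    /** Feeds `callCount` calls of equal duration into a fresh breaker. */
    static boolean opensAfter(int callCount, long durationMillis) {
        SlowCallBreakerSketch breaker = new SlowCallBreakerSketch(20, 0.5, 2_000);
        for (int i = 0; i < callCount; i++) {
            breaker.record(durationMillis);
        }
        return breaker.isOpen();
    }

    public static void main(String[] args) {
        System.out.println("20 calls stalled by GC open the breaker: "
                + opensAfter(20, 2_500));
        System.out.println("20 healthy calls open the breaker: "
                + opensAfter(20, 50));
        System.out.println("worst case to open (ms): "
                + worstCaseMillisToOpen(20, 2_000));
    }
}
```

In this model, the breaker cannot react before the minimum number of calls has been recorded. With sequential requests that each hit a 2 s limit, that takes 20 × 2 s = 40 s, which matches the worst-case estimate in the partition test.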
Multi-Dependency Failure
The payment service depends on fraud detection, balance service, payment gateway, and notification service. A single dependency failure is the common case. Multiple simultaneous failures are rare but devastating.
// PRODUCTION - Experiment: two dependencies fail simultaneously
@Test
void twoDependenciesDown_paymentStillProcessed() {
    // Fraud detection: returning 503
    fraudWireMock().register(
        WireMock.post("/fraud/score")
            .willReturn(WireMock.serviceUnavailable()));
    // Balance service: not responding (paused container)
    balanceService.getDockerClient()
        .pauseContainerCmd(balanceService.getContainerId())
        .exec();
    try {
        // Send enough requests for both circuit breakers to open
        for (int i = 0; i < 25; i++) {
            restTemplate.postForEntity("/payments",
                samplePayment(), PaymentResponse.class);
        }
        // Both circuit breakers should be open
        assertThat(cbRegistry.circuitBreaker("fraudDetection").getState())
            .isEqualTo(CircuitBreaker.State.OPEN);
        assertThat(cbRegistry.circuitBreaker("balanceCheck").getState())
            .isEqualTo(CircuitBreaker.State.OPEN);
        // Payment should still process (with fallbacks for both)
        ResponseEntity<PaymentResponse> response =
            restTemplate.postForEntity("/payments",
                samplePayment(), PaymentResponse.class);
        // Payment processed with degraded fraud check AND cached balance
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
    } finally {
        // Always unpause to allow clean container shutdown
        balanceService.getDockerClient()
            .unpauseContainerCmd(balanceService.getContainerId())
            .exec();
    }
}
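What "degraded fraud check and cached balance" could mean inside the payment service is sketched below. The fallback rules are assumptions made for illustration, not the platform's actual policy: when fraud scoring is unavailable, only payments up to a hypothetical conservative limit pass, and when the balance service is unavailable, the last cached balance is used.

```java
import java.util.function.Supplier;

class DegradedPaymentSketch {

    /** Hypothetical limit applied while fraud scoring is unavailable. */
    static final long DEGRADED_LIMIT_CENTS = 10_000;

    static final class Outcome {
        final boolean approved;
        final boolean fraudCheckDegraded;
        final boolean balanceFromCache;

        Outcome(boolean approved, boolean fraudCheckDegraded,
                boolean balanceFromCache) {
            this.approved = approved;
            this.fraudCheckDegraded = fraudCheckDegraded;
            this.balanceFromCache = balanceFromCache;
        }
    }

    /** Returns a supplier that always fails, like a call through an open breaker. */
    static <T> Supplier<T> unavailable() {
        return () -> {
            throw new IllegalStateException("dependency unavailable");
        };
    }

    static Outcome process(long amountCents, Supplier<Integer> fraudScore,
                           Supplier<Long> liveBalanceCents,
                           long cachedBalanceCents) {
        boolean fraudDegraded = false;
        int score;
        try {
            score = fraudScore.get();
        } catch (RuntimeException e) {
            // Fallback rule (assumed): small payments pass, large ones do not
            fraudDegraded = true;
            score = amountCents <= DEGRADED_LIMIT_CENTS ? 0 : 100;
        }

        boolean balanceFromCache = false;
        long balanceCents;
        try {
            balanceCents = liveBalanceCents.get();
        } catch (RuntimeException e) {
            // Fallback rule (assumed): trust the last known balance
            balanceFromCache = true;
            balanceCents = cachedBalanceCents;
        }

        boolean approved = score < 50 && balanceCents >= amountCents;
        return new Outcome(approved, fraudDegraded, balanceFromCache);
    }

    public static void main(String[] args) {
        Outcome small = process(5_000, unavailable(), unavailable(), 20_000);
        Outcome large = process(50_000, unavailable(), unavailable(), 80_000);
        System.out.println("small payment approved: " + small.approved);
        System.out.println("large payment approved: " + large.approved);
    }
}
```

Each fallback sets a flag, so the service can report that the payment was approved under degraded checks instead of hiding the fact behind a plain success response.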
Experiment Prioritization
Not all chaos experiments provide equal value. Prioritize by:
- Most likely failure mode. Latency spikes are more common than network partitions. Test latency first.
- Highest impact failure. A payment gateway failure stops revenue. A notification failure delays emails. Test the payment gateway earlier.
- Least understood failure. If the team has never seen a connection pool exhaustion, that is where the unknown risks are.
- Recently changed code. New dependencies, new resilience configurations, or refactored retry logic should be validated with chaos experiments before reaching production.
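One way to turn the four criteria above into an ordered backlog is a simple additive score. The 1-to-5 ratings, the bonus for recent changes, and the example values below are placeholders invented for this sketch. A team would substitute its own estimates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExperimentBacklog {

    static final class Experiment {
        final String name;
        final int likelihood;   // 1 (rare) to 5 (frequent)
        final int impact;       // 1 (minor) to 5 (stops revenue)
        final int uncertainty;  // 1 (well understood) to 5 (never observed)
        final boolean recentlyChanged;

        Experiment(String name, int likelihood, int impact, int uncertainty,
                   boolean recentlyChanged) {
            this.name = name;
            this.likelihood = likelihood;
            this.impact = impact;
            this.uncertainty = uncertainty;
            this.recentlyChanged = recentlyChanged;
        }

        /** Additive score; recent changes add a fixed bonus (assumed weight). */
        int score() {
            return likelihood + impact + uncertainty + (recentlyChanged ? 2 : 0);
        }
    }

    /** Returns experiment names ordered from highest to lowest score. */
    static List<String> rank(List<Experiment> experiments) {
        List<Experiment> sorted = new ArrayList<>(experiments);
        sorted.sort((a, b) -> Integer.compare(b.score(), a.score()));
        List<String> names = new ArrayList<>();
        for (Experiment e : sorted) {
            names.add(e.name);
        }
        return names;
    }

    /** Ranks three experiments using illustrative, made-up ratings. */
    static List<String> sampleRanking() {
        return rank(Arrays.asList(
                new Experiment("balance-partition", 1, 3, 4, false),
                new Experiment("fraud-latency", 5, 3, 2, false),
                new Experiment("gateway-reset", 2, 5, 2, false)));
    }

    public static void main(String[] args) {
        System.out.println(sampleRanking());
    }
}
```

The exact weights matter less than using the same scale for every candidate, so that the ordering can be discussed and revised as findings come in.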
The first five experiments for the transaction platform:
- Fraud detection latency injection (most frequent failure)
- Payment gateway connection reset (highest business impact)
- Balance service network partition (validates TimeLimiter)
- Dual-dependency failure: fraud + balance (compound failure)
- Memory pressure during peak load (GC-induced latency)
Each experiment generates findings that improve the resilience configuration. After five rounds of experiment-fix-retest, the transaction platform’s resilience behavior is well-characterized for the most likely production failure scenarios.