Bulkhead Sizing and Dynamic Adjustment
A bulkhead that is too small rejects legitimate requests during normal operation. A bulkhead that is too large provides no protection during degradation. Correct sizing requires knowing your traffic rate and your dependency’s latency profile.
The Sizing Formula
max_concurrent_calls = request_rate * p99_latency * safety_factor
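This is Little’s Law applied to the bulkhead. Under normal conditions, the average number of concurrent calls to a dependency equals the request rate multiplied by the average per-call latency. Using p99 latency instead of the average deliberately overestimates concurrency, and the safety factor adds further headroom on top.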
For fraud detection:
- Request rate: 100 requests/second
- Normal p99 latency: 120ms (0.12 seconds)
- Safety factor: 1.5 (buffer for traffic spikes)
- Bulkhead size: 100 * 0.12 * 1.5 = 18, rounded up to 20
Under normal conditions, approximately 12 of the 20 permits are in use at any time. There is room for traffic spikes up to 167 requests/second before the bulkhead fills.
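The fraud detection arithmetic above can be reproduced with a short sketch. The function names here are illustrative, not part of any library, and latency is passed in milliseconds only to keep the floating-point arithmetic exact.

```python
import math


def bulkhead_size(rate_rps: float, p99_ms: float, safety_factor: float) -> int:
    """Sizing formula: request rate * p99 latency * safety factor, rounded up."""
    return math.ceil(rate_rps * p99_ms * safety_factor / 1000)


def permits_in_use(rate_rps: float, latency_ms: float) -> float:
    """Little's Law: concurrency = arrival rate * time spent per call."""
    return rate_rps * latency_ms / 1000


def max_rate_before_full(permits: int, latency_ms: float) -> float:
    """Request rate at which every permit is occupied."""
    return permits * 1000 / latency_ms


formula_result = bulkhead_size(100, 120, 1.5)   # 18 before rounding to a round number
chosen_size = 20                                # the rounded value used in the text
in_use = permits_in_use(100, 120)               # 12.0 permits busy at p99 latency
headroom_rps = max_rate_before_full(chosen_size, 120)  # about 166.7 rps
```

Rounding 18 up to 20 is a judgment call rather than part of the formula; the extra two permits are what raise the tolerated spike from 150 to roughly 167 requests/second.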
When fraud detection degrades to 5-second response times:
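- Concurrent calls needed: 100 * 5.0 = 500
- Bulkhead size: 20
- Excess demand: 500 - 20 = 480 concurrent calls that cannot obtain a permit
- Throughput through the bulkhead: 20 / 5.0 = 4 calls/second, so about 96 of every 100 requests per second are rejected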
That rejection is the point. The bulkhead caps the damage at 20 threads regardless of how slow the dependency becomes. Without the bulkhead, all 200 Tomcat threads could be consumed by fraud detection calls.
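To make the capping behaviour concrete, here is a minimal sketch of a semaphore-based bulkhead that rejects instead of queueing. The `Bulkhead` class is hypothetical and written for illustration; it is not the API of any particular resilience library.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects when full."""

    def __init__(self, max_concurrent_calls: int) -> None:
        self._permits = threading.Semaphore(max_concurrent_calls)

    def try_acquire(self) -> bool:
        # Non-blocking: a caller that finds no free permit fails fast
        # instead of tying up a request thread while it waits.
        return self._permits.acquire(blocking=False)

    def release(self) -> None:
        self._permits.release()


# Degraded scenario from the text: 100 rps at 5 s per call wants 500 slots.
bulkhead = Bulkhead(20)
attempts = 500
admitted = sum(1 for _ in range(attempts) if bulkhead.try_acquire())
rejected = attempts - admitted

# Steady state: 20 permits, each held for 5 s, complete 4 calls per second.
degraded_throughput_rps = 20 / 5.0
rejected_rps = 100 - degraded_throughput_rps
```

However slow the dependency gets, `admitted` never exceeds the bulkhead size, which is the property that protects the rest of the thread pool.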
Impact of Incorrect Sizing
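Too small (5 permits for 100 rps with 120ms p99): Normal concurrent calls needed: 12. Bulkhead capacity: 5. Five permits sustain at most 5 / 0.12 ≈ 42 calls per second when calls take 120ms, so up to 58 of every 100 requests are rejected under normal conditions. Most calls finish faster than p99, so the real rejection rate is lower, but it is still a substantial error rate during normal operation. Users see failures even though the dependency is healthy.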
Too large (100 permits for 100 rps with 120ms p99): When fraud detection degrades to 5-second response times, 100 threads are consumed instead of 20. The bulkhead has 80 more permits than needed for protection, which means 80 more threads can be blocked by slow calls. The bulkhead still provides some protection (100 threads instead of 200), but it provides less protection than a correctly sized one.
The safety factor accounts for natural traffic variation. A factor of 1.5 means the bulkhead tolerates a 50% traffic spike without rejecting requests. A factor of 2.0 tolerates a 100% spike but provides weaker isolation. For the transaction platform, 1.5 is appropriate because payment traffic follows predictable patterns and a 50% spike is an extreme scenario.
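The trade-off between undersizing and oversizing can be checked numerically. The sketch below is an illustration under a pessimistic assumption: every call is treated as taking the full p99 latency, as the sizing formula does. The `evaluate` function and `SizingReport` type are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingReport:
    sustainable_rps: float      # highest rate the permits can carry
    rejected_rps_normal: float  # requests/second rejected while healthy
    threads_at_risk: int        # threads a degraded dependency can hold


def evaluate(permits: int, rate_rps: float = 100, p99_ms: float = 120,
             total_threads: int = 200) -> SizingReport:
    sustainable = permits * 1000 / p99_ms
    rejected = max(0.0, rate_rps - sustainable)
    # When the dependency degrades, every permit ends up held by a slow call.
    at_risk = min(permits, total_threads)
    return SizingReport(sustainable, rejected, at_risk)


too_small = evaluate(5)     # rejects traffic even when healthy
sized = evaluate(20)        # no normal rejections, 20 threads at risk
too_large = evaluate(100)   # no normal rejections, 100 threads at risk
```

Under these assumptions the 5-permit bulkhead sustains only about 42 requests/second, while the 100-permit bulkhead exposes half of the 200-thread pool to a single slow dependency.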
Per-Dependency Sizing
| Dependency | Rate (rps) | p99 Latency | Safety Factor | Bulkhead Size (formula -> chosen) |
|---|---|---|---|---|
| Fraud Detection | 100 | 120ms | 1.5x | 18 -> 20 |
| Balance Service | 100 | 80ms | 1.5x | 12 -> 15 |
| Payment Gateway | 100 | 800ms | 1.5x | 120 |
| Notification | 100 | 500ms | 1.0x | 50 -> 10* |
| Audit Log | 100 | 50ms | 1.5x | 8 -> 10 |
*Notification is intentionally undersized relative to the formula. It is non-critical path, and we want to limit its resource consumption even during normal operation. If notifications slow down, we would rather reject early and queue for later delivery than consume 50 threads on email dispatch.
The payment gateway has the largest bulkhead (120) because it has the highest normal latency. This is unavoidable: if each call legitimately takes 800ms, you need 80 concurrent threads under normal conditions to maintain 100 rps throughput. The safety factor adds another 40. If the payment gateway degrades, the bulkhead still limits consumption to 120 threads, leaving 80 for fraud detection, balance checks, and other operations.
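A useful sanity check on the table is that the chosen bulkhead sizes, added together, stay within the shared thread pool, so that even if every dependency degrades at once some threads remain free. The sketch below encodes the table values; the dictionary layout and the check itself are illustrative assumptions, not a prescribed procedure.

```python
import math

TOTAL_THREADS = 200  # Tomcat pool size referenced in the text

# name: (rate_rps, p99_ms, safety_factor, chosen_size)
DEPENDENCIES = {
    "fraud_detection": (100, 120, 1.5, 20),
    "balance_service": (100, 80, 1.5, 15),
    "payment_gateway": (100, 800, 1.5, 120),
    "notification": (100, 500, 1.0, 10),  # deliberately below the formula
    "audit_log": (100, 50, 1.5, 10),
}


def formula_size(rate_rps: float, p99_ms: float, safety_factor: float) -> int:
    """rate * p99 latency * safety factor, rounded up to whole permits."""
    return math.ceil(rate_rps * p99_ms * safety_factor / 1000)


formula_sizes = {
    name: formula_size(rate, p99, safety)
    for name, (rate, p99, safety, _chosen) in DEPENDENCIES.items()
}
total_chosen = sum(chosen for *_rest, chosen in DEPENDENCIES.values())
spare_threads = TOTAL_THREADS - total_chosen
```

With the values from the table, the chosen sizes sum to 175, leaving 25 threads that no bulkhead can claim.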