surviving the spike

Multi-Region: The Complexity Tax and the Conditions That Justify It

Chapter 61 of 66

The Symptom

The ride-hailing platform serves 14 million requests per day from US-East. Performance is solid. p99 is 180ms for riders in New York, 210ms for riders in Los Angeles, 240ms for riders in Chicago. Then the company launches in Berlin, London, and Paris. European riders hit US-East. p99 for a fare estimate in Berlin: 420ms. For a ride booking: 680ms. For driver location updates: 310ms, which means a rider sees their driver’s position with a visible lag.

The product team files a bug: “European ride experience feels sluggish.” The infrastructure team measures the round-trip time from Frankfurt to us-east-1: 85ms on a good day, 120ms when transatlantic cables are congested. Every API call pays this penalty. Calls that chain (fare estimate, then surge check, then booking) pay it three times.
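The cost of chaining can be put in numbers. A minimal sketch, assuming each chained call costs exactly one round trip and using only the Frankfurt measurements quoted above (the function name is illustrative):

```python
# Sketch: network time alone when a client chains API calls sequentially.
# RTT figures are the Frankfurt -> us-east-1 measurements quoted above.

RTT_GOOD_MS = 85        # round trip on a good day
RTT_CONGESTED_MS = 120  # round trip when transatlantic cables are congested


def chained_penalty_ms(rtt_ms: float, calls: int) -> float:
    """Network time for `calls` sequential requests, one round trip each."""
    return rtt_ms * calls


# fare estimate -> surge check -> booking
for label, rtt in (("good day", RTT_GOOD_MS), ("congested", RTT_CONGESTED_MS)):
    print(f"{label}: {chained_penalty_ms(rtt, 3):.0f}ms of network time")
# good day: 255ms of network time
# congested: 360ms of network time
```

On a congested day the chain spends more than the 300ms budget on the network before any server-side work happens.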

The CTO says: “We need multi-region.” The principal engineer says: “Tell me what problem you are solving, how much it will cost, and whether you have tried the alternatives.”

The Cause

Physics. Light through fiber from Frankfurt to Virginia takes at least 37ms one way, 74ms round trip. Add the TLS handshake, DNS resolution, and the application’s own processing chain, and European riders face 150-250ms of overhead per API call that local riders never pay. No amount of code optimization eliminates transatlantic latency.
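The physical floor can be checked with a short calculation. This is a sketch, not a measurement: the refractive index of 1.468 is an assumed typical value for single-mode fiber, and the path length is derived from the 37ms figure rather than taken from any cable map.

```python
# Sketch: latency floor set by the speed of light in optical fiber.

C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.468   # ASSUMPTION: typical single-mode fiber


def fiber_speed_km_per_s() -> float:
    """Light travels slower in glass: c divided by the refractive index."""
    return C_VACUUM_KM_PER_S / FIBER_REFRACTIVE_INDEX


def one_way_ms(path_km: float) -> float:
    """Propagation delay over `path_km` of fiber, in milliseconds."""
    return path_km / fiber_speed_km_per_s() * 1000.0


def implied_path_km(one_way_delay_ms: float) -> float:
    """Fiber path length that would produce the given one-way delay."""
    return one_way_delay_ms / 1000.0 * fiber_speed_km_per_s()


path_km = implied_path_km(37.0)
print(f"37ms one way implies about {path_km:.0f} km of fiber")
print(f"round-trip floor: {2 * one_way_ms(path_km):.0f}ms")
```

Under that assumption, 37ms corresponds to a fiber path in the 7,500 km range, which is why the 74ms round trip is a floor that no server-side tuning can lower.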

But multi-region is not a feature toggle. It is a fundamental change to the system’s data model, deployment pipeline, incident response, and operational cost. The complexity tax includes:

Category             Single-Region          Multi-Region
Deployments          1 target               2+ targets, coordinated
Database migrations  1 execution            Coordinated with replication lag
Incident response    1 region to debug      Replication lag? Split brain? Regional?
Data consistency     Strong (single PG)     Eventual (async replication)
Monitoring           1 set of dashboards    Per-region + cross-region
On-call              1 region’s alerts      Per-region alerts + replication alerts
Cost                 $X                     $2.4X (not 2X, overhead is real)

Three conditions must ALL be true to justify multi-region:

  1. Regulatory: Data residency laws require user data to stay in-region (GDPR, LGPD, PIPL)
  2. Latency: Single-region latency exceeds the SLO for more than 20% of users
  3. Funding: The business can sustain 2-3x infrastructure and operational cost

The ride-hailing platform meets all three. GDPR requires EU rider data to reside in the EU. 28% of riders are now European, all exceeding the 300ms p99 SLO. Revenue from EU operations justifies the cost.
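The three-condition test reduces to a single boolean AND. A minimal sketch, with class and field names invented for illustration:

```python
from dataclasses import dataclass

LATENCY_SHARE_THRESHOLD = 0.20  # "more than 20% of users"


@dataclass(frozen=True)
class MultiRegionCase:
    data_residency_required: bool   # condition 1: regulatory
    share_of_users_over_slo: float  # condition 2: fraction of users, 0.0-1.0
    can_fund_2_to_3x_cost: bool     # condition 3: funding


def justifies_multi_region(case: MultiRegionCase) -> bool:
    """All three conditions must hold; any single failure means stay put."""
    return (
        case.data_residency_required
        and case.share_of_users_over_slo > LATENCY_SHARE_THRESHOLD
        and case.can_fund_2_to_3x_cost
    )


# The ride-hailing platform: GDPR applies, 28% of riders over SLO, funded.
ride_hailing = MultiRegionCase(True, 0.28, True)
print(justifies_multi_region(ride_hailing))  # True
```

A system that fails even one check, for example latency pain with no residency requirement, falls back to the cheaper alternatives the principal engineer asked about.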

The Baseline

All traffic routed to US-East. European riders pay transatlantic latency on every request.

// BOTTLENECK: All API requests route to us-east-1 regardless of caller location
@RestController
@RequestMapping("/api/rides")
public class RideController {

    @PostMapping("/book")
    public Mono<RideResponse> bookRide(@RequestBody RideRequest request) {
        // European rider hits us-east-1
        // Network latency: 85-120ms round trip (Frankfurt to us-east-1)
        // Fare estimate call: +85-120ms (calls surge pricing)
        // Total: 250-400ms before application logic starts
        return fareService.estimate(request)
            .flatMap(fare -> surgeService.getMultiplier(request.getZoneId())
                .map(multiplier -> fare.apply(multiplier)))
            .flatMap(finalFare -> tripRepository.save(
                new Trip(request, finalFare)));
    }
}

Locust baseline from an EU load generator:

# BOTTLENECK: Locust test from EU load generator hitting US-East
from locust import HttpUser, task, between

class EURiderUser(HttpUser):
    wait_time = between(0.5, 1.0)
    host = "https://api.us-east.ridehailing.com"

    @task(5)
    def book_ride(self):
        self.client.post("/api/rides/book", json={
            "riderId": "eu-rider-1",
            "pickupLat": 52.5200, "pickupLng": 13.4050,
            "dropoffLat": 52.5300, "dropoffLng": 13.4000,
            "zoneId": "berlin-mitte"
        })

    @task(3)
    def fare_estimate(self):
        self.client.get(
            "/api/fares/estimate?zoneId=berlin-mitte")

    @task(2)
    def driver_location(self):
        self.client.get(
            "/api/drivers/nearby?lat=52.52&lng=13.405&radius=2000")

EU → US-East Results (500 EU users, 5 min):
  /api/rides/book      p50=420ms  p99=680ms  errors=0.02%
  /api/fares/estimate  p50=310ms  p99=520ms  errors=0.01%
  /api/drivers/nearby  p50=280ms  p99=460ms  errors=0.01%

SLO: p99 < 300ms for fare estimate
Status: VIOLATED for 28% of users (all EU)

The Fix

Architecture: US-East Primary, EU-West Secondary

Figure: Multi-region architecture with DNS geo-routing, US-East primary and EU-West secondary regions, async PostgreSQL replication, and cross-region Kafka event bus

The multi-region architecture routes riders to the nearest region via DNS geo-routing (Route 53). Each region runs the full application stack (Rider API, Driver API, Surge Pricing, and Fare Service) with its own PostgreSQL instance and regional Redis cache. The US-East PostgreSQL instance is the primary for shared data and replicates it asynchronously to EU-West, which serves that data as a read-only copy while writing its own trips locally. A cross-region Kafka event bus carries domain events between regions, giving eventual consistency for operations that span both deployments.

DNS Geo-Routing

// SCALED: Region-aware API gateway with DNS geo-routing
@Configuration
public class RegionConfig {

    @Value("${app.region}")
    private String currentRegion;  // "us-east" or "eu-west"

    @Bean
    public RegionRouter regionRouter() {
        return new RegionRouter(currentRegion);
    }
}

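// Registered through the @Bean method in RegionConfig, so no @Component here:
// component scanning could not supply the String constructor argument.
public class RegionRouter {
    private final String region;

    public RegionRouter(String region) {
        this.region = region;
    }

    // Used by RideController to tag each trip with the region that stored it
    public String getCurrentRegion() {
        return region;
    }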

    public boolean isLocalRegion(String zoneId) {
        // Berlin, London, Paris zones route to eu-west
        // NYC, LA, Chicago zones route to us-east
        return ZoneRegistry.regionFor(zoneId).equals(region);
    }

    public String getLocalDatabaseUrl() {
        return switch (region) {
            case "us-east" -> "jdbc:postgresql://pg-us-east:5432/rides";
            case "eu-west" -> "jdbc:postgresql://pg-eu-west:5432/rides";
            default -> throw new IllegalStateException(
                "Unknown region: " + region);
        };
    }
}
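The listing calls ZoneRegistry.regionFor without showing it. One way such a lookup could behave is sketched below in Python; every zone ID except "berlin-mitte" is invented for illustration, and a real registry would more likely be loaded from configuration than hard-coded.

```python
# Hypothetical zone registry. Only "berlin-mitte" appears in the chapter;
# the other zone IDs are placeholders.
ZONE_TO_REGION = {
    "berlin-mitte": "eu-west",
    "london-soho": "eu-west",
    "paris-marais": "eu-west",
    "nyc-midtown": "us-east",
    "la-downtown": "us-east",
    "chicago-loop": "us-east",
}


def region_for(zone_id: str) -> str:
    """Return the home region of a zone, failing loudly on unknown zones."""
    try:
        return ZONE_TO_REGION[zone_id]
    except KeyError:
        raise ValueError(f"Unknown zone: {zone_id}") from None


def is_local_region(current_region: str, zone_id: str) -> bool:
    """Mirror of RegionRouter.isLocalRegion: does this zone belong here?"""
    return region_for(zone_id) == current_region


print(is_local_region("eu-west", "berlin-mitte"))  # True
print(is_local_region("us-east", "berlin-mitte"))  # False
```

Failing on an unknown zone, rather than defaulting to one region, keeps a misconfigured zone from silently sending EU data to US-East.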

Region-Aware Ride Booking

// SCALED: Ride booking writes to regional database
@RestController
@RequestMapping("/api/rides")
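public class RideController {

    private final RegionRouter regionRouter;
    private final FareService fareService;
    private final SurgeService surgeService;  // type name assumed; the field was missing
    private final TripRepository tripRepository;

    public RideController(RegionRouter regionRouter, FareService fareService,
                          SurgeService surgeService, TripRepository tripRepository) {
        this.regionRouter = regionRouter;
        this.fareService = fareService;
        this.surgeService = surgeService;
        this.tripRepository = tripRepository;
    }
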
    @PostMapping("/book")
    public Mono<RideResponse> bookRide(@RequestBody RideRequest request) {
        // EU rider hits eu-west, no transatlantic hop
        // Fare estimate: local call, 15ms
        // Surge check: local call, 8ms
        // DB write: local PG, 5ms
        return fareService.estimate(request)
            .flatMap(fare -> surgeService
                .getMultiplier(request.getZoneId())
                .map(multiplier -> fare.apply(multiplier)))
            .flatMap(finalFare -> {
                Trip trip = new Trip(request, finalFare);
                trip.setRegion(regionRouter.getCurrentRegion());
                return tripRepository.save(trip);
            });
    }
}

PostgreSQL Logical Replication

-- SCALED: US-East primary publishes changes
-- On US-East (primary)
CREATE PUBLICATION ride_platform_pub
    FOR TABLE user_profiles, fare_config, surge_zones
    WITH (publish = 'insert, update, delete');

-- Trip data is NOT replicated globally.
-- A trip in Berlin stays in EU-West.
-- A trip in NYC stays in US-East.

-- On EU-West (replica)
CREATE SUBSCRIPTION ride_platform_sub
    CONNECTION 'host=pg-us-east.internal port=5432
                dbname=rides user=replicator
                password=<secure> sslmode=require'
    PUBLICATION ride_platform_pub
    WITH (copy_data = true, streaming = true);
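The placement rule in the SQL above can be stated as a small function. This is a sketch under two assumptions: the table name "trips" is hypothetical (the chapter never names the trip table), and only two regions exist.

```python
# Tables named in ride_platform_pub; published one way, from us-east.
REPLICATED_TABLES = frozenset({"user_profiles", "fare_config", "surge_zones"})
PRIMARY_REGION = "us-east"
ALL_REGIONS = ("us-east", "eu-west")


def regions_holding(table: str, written_in: str) -> tuple[str, ...]:
    """Regions where a row is visible after it is written in `written_in`."""
    if written_in not in ALL_REGIONS:
        raise ValueError(f"Unknown region: {written_in}")
    if table in REPLICATED_TABLES:
        # The publication lives on the primary, so changes made elsewhere
        # would never propagate. Treat them as an error.
        if written_in != PRIMARY_REGION:
            raise ValueError(f"{table} must be written in {PRIMARY_REGION}")
        return ALL_REGIONS
    # Everything else (e.g. trips) stays in the region that wrote it.
    return (written_in,)


print(regions_holding("fare_config", "us-east"))  # ('us-east', 'eu-west')
print(regions_holding("trips", "eu-west"))        # ('eu-west',)
```

Keeping trip rows out of the publication is also what keeps EU rider trips inside the EU, which is the residency requirement that justified the second region.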

Redis: Regional, Not Replicated

// SCALED: Regional Redis configuration
@Configuration
public class RedisRegionalConfig {

    @Value("${app.region}")
    private String region;

    @Bean
    @Primary
    public ReactiveRedisConnectionFactory regionalRedis() {
        // Each region has its own Redis
        // Driver locations are regional (Berlin drivers
        // are irrelevant to NYC riders)
        String host = switch (region) {
            case "us-east" -> "redis-us-east.internal";
            case "eu-west" -> "redis-eu-west.internal";
            default -> throw new IllegalStateException(
                "Unknown region");
        };
        return new LettuceConnectionFactory(host, 6379);
    }
}

Cross-Region Cache Invalidation via Kafka

// SCALED: When a user updates their profile in US-East,
// EU-West invalidates its cached copy
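@Component  // must be a Spring bean for the listener to be registered
@KafkaListener(
    topics = "profile-updates",
    groupId = "${app.region}-profile-consumer")
public class ProfileCacheInvalidator {

    private final ReactiveRedisTemplate<String, String> redis;

    public ProfileCacheInvalidator(ReactiveRedisTemplate<String, String> redis) {
        this.redis = redis;
    }
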
    @KafkaHandler
    public void onProfileUpdate(ProfileUpdateEvent event) {
        // Invalidate cached profile in this region
        // Next read will fetch from local PG replica
        redis.delete("profile:" + event.getUserId())
            .subscribe();
    }
}
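The invalidation flow is easier to follow with the moving parts reduced to dictionaries. In this sketch a dict stands in for the regional Redis and another for the local PostgreSQL copy; the class and method names are illustrative. It assumes the replicated row has already reached the local database when the event is handled.

```python
class RegionalProfileCache:
    """Cache-aside profile cache for one region."""

    def __init__(self, region: str, database: dict):
        self.region = region
        self._db = database  # stand-in for the local PG copy
        self._cache = {}     # stand-in for the regional Redis

    def get(self, user_id: str) -> str:
        key = f"profile:{user_id}"
        if key not in self._cache:
            # Cache miss: read from the local database and remember it.
            self._cache[key] = self._db[user_id]
        return self._cache[key]

    def on_profile_update(self, user_id: str) -> None:
        # Same effect as redis.delete("profile:" + userId) in the listener.
        self._cache.pop(f"profile:{user_id}", None)


eu_db = {"u1": "Anna"}
eu_cache = RegionalProfileCache("eu-west", eu_db)

print(eu_cache.get("u1"))         # Anna (now cached)
eu_db["u1"] = "Anna B."           # replicated change arrives from us-east
print(eu_cache.get("u1"))         # Anna (stale: cache not yet invalidated)
eu_cache.on_profile_update("u1")  # Kafka event handled
print(eu_cache.get("u1"))         # Anna B.
```

If the event can arrive before the replicated row, the next read would re-cache the old value, so the ordering assumption above is one to verify rather than take for granted.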

The Proof

Locust test with EU traffic routed to EU-West instead of US-East:

# SCALED: Locust test from EU load generator hitting EU-West
from locust import HttpUser, task, between

class EURiderLocalUser(HttpUser):
    wait_time = between(0.5, 1.0)
    host = "https://api.eu-west.ridehailing.com"

    @task(5)
    def book_ride(self):
        self.client.post("/api/rides/book", json={
            "riderId": "eu-rider-1",
            "pickupLat": 52.5200, "pickupLng": 13.4050,
            "dropoffLat": 52.5300, "dropoffLng": 13.4000,
            "zoneId": "berlin-mitte"
        })

    @task(3)
    def fare_estimate(self):
        self.client.get(
            "/api/fares/estimate?zoneId=berlin-mitte")

    @task(2)
    def driver_location(self):
        self.client.get(
            "/api/drivers/nearby?lat=52.52&lng=13.405&radius=2000")

EU → EU-West Results (500 EU users, 5 min):
  /api/rides/book      p50=45ms   p99=120ms  errors=0.01%
  /api/fares/estimate  p50=28ms   p99=85ms   errors=0.01%
  /api/drivers/nearby  p50=22ms   p99=70ms   errors=0.01%

Before (EU → US-East):  p99 fare estimate = 520ms
After  (EU → EU-West):  p99 fare estimate =  85ms
Improvement: 83.7% latency reduction for EU riders

SLO: p99 < 300ms for fare estimate
Status: MET for all regions
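The same before-and-after arithmetic applies to every endpoint, not just the fare estimate. A short sketch using the p99 figures from the two Locust runs above:

```python
# p99 latencies in ms, copied from the two result blocks above.
BEFORE_P99_MS = {
    "/api/rides/book": 680,
    "/api/fares/estimate": 520,
    "/api/drivers/nearby": 460,
}
AFTER_P99_MS = {
    "/api/rides/book": 120,
    "/api/fares/estimate": 85,
    "/api/drivers/nearby": 70,
}


def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction as a percentage of the original value."""
    return (before_ms - after_ms) / before_ms * 100.0


for endpoint, before in BEFORE_P99_MS.items():
    after = AFTER_P99_MS[endpoint]
    print(f"{endpoint}: {reduction_pct(before, after):.1f}% lower p99")
# /api/rides/book: 82.4% lower p99
# /api/fares/estimate: 83.7% lower p99
# /api/drivers/nearby: 84.8% lower p99
```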

The latency improvement is real. The cost is also real. Sections CH21-S1 and CH21-S2 cover the data replication details and the decision framework for whether your system justifies paying this complexity tax.