Read Replicas and Connection Routing
Read Replicas and Connection Routing
The Symptom
The trip history endpoint serves 4,200 RPS during evening surge. Every request hits the primary PostgreSQL instance. Read queries compete with write transactions for CPU, buffer pool pages, and disk I/O. The primary’s CPU sits at 91% and connection pool acquisition time spikes to 1,800ms at p99. Write latency for trip completion (the revenue-critical path) degrades from 12ms to 340ms because reads are starving writes.
The Cause
The ride-hailing platform’s read/write ratio is 19:1. For every trip creation, there are 19 trip history lookups, driver earnings checks, and dashboard polls. All 20 queries hit the same PostgreSQL instance. The primary does double duty: accepting WAL writes for trip events and serving full table scans for history queries. Read replicas exist for this exact pattern. Route the 95% of read traffic elsewhere. Let the primary do what only the primary can do: write.
The Baseline
# load-tests/replica_comparison_locustfile.py
from locust import HttpUser, task, between, LoadTestShape
class TripUser(HttpUser):
wait_time = between(0.3, 0.8)
@task(19)
def get_trip_history(self):
rider_id = f"rider-{self.environment.runner.user_count % 50000}"
self.client.get(
f"/api/trips/history?riderId={rider_id}&page=0&size=20",
name="/api/trips/history [READ]"
)
@task(1)
def complete_trip(self):
self.client.post(
"/api/trips/complete",
json={
"tripId": f"trip-{self.environment.runner.user_count}-{id(self)}",
"fare": 18.75,
"distance": 6.2
},
name="/api/trips/complete [WRITE]"
)
class ReplicaTestShape(LoadTestShape):
stages = [
{"duration": 60, "users": 200, "spawn_rate": 20},
{"duration": 120, "users": 400, "spawn_rate": 20},
{"duration": 180, "users": 600, "spawn_rate": 20},
{"duration": 240, "users": 800, "spawn_rate": 20},
{"duration": 300, "users": 1000, "spawn_rate": 20},
]
def tick(self):
run_time = self.get_run_time()
for stage in self.stages:
if run_time < stage["duration"]:
return (stage["users"], stage["spawn_rate"])
return None
Single primary, no replicas, 50M rows in the trips table:
| Users | Read RPS | Write RPS | p99 Read (ms) | p99 Write (ms) | PG CPU |
|---|---|---|---|---|---|
| 200 | 350 | 19 | 180 | 22 | 41% |
| 400 | 620 | 33 | 540 | 85 | 68% |
| 600 | 780 | 41 | 1,400 | 210 | 84% |
| 800 | 810 | 43 | 2,600 | 340 | 91% |
| 1000 | 790 | 38 | 5,200 | 890 | 95% |
Write p99 tripled from 22ms to 890ms. Not because writes got slower. Because reads consumed all available CPU and I/O, forcing writes to wait.
The Fix
Step 1: DataSource Routing
Spring’s AbstractRoutingDataSource selects a DataSource based on a lookup key. The key is determined per-transaction: readOnly = true routes to a replica, everything else routes to primary.
// SCALED: Enum for routing targets
public enum DataSourceType {
PRIMARY,
REPLICA
}
// SCALED: ThreadLocal holds the routing key for the current transaction
public class DataSourceContext {
private static final ThreadLocal<DataSourceType> CONTEXT =
ThreadLocal.withInitial(() -> DataSourceType.PRIMARY);
public static void set(DataSourceType type) {
CONTEXT.set(type);
}
public static DataSourceType get() {
return CONTEXT.get();
}
public static void clear() {
CONTEXT.remove();
}
}
// SCALED: Routes to primary or replica based on ThreadLocal
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
@Override
protected Object determineCurrentLookupKey() {
return DataSourceContext.get();
}
}
Step 2: AOP Aspect for Automatic Routing
An aspect intercepts @Transactional(readOnly = true) methods and sets the routing context before the transaction begins.
// SCALED: AOP sets routing key before transaction opens connection
@Aspect
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class ReadOnlyRoutingAspect {
@Around("@annotation(transactional)")
public Object route(ProceedingJoinPoint joinPoint,
Transactional transactional) throws Throwable {
if (transactional.readOnly()) {
DataSourceContext.set(DataSourceType.REPLICA);
} else {
DataSourceContext.set(DataSourceType.PRIMARY);
}
try {
return joinPoint.proceed();
} finally {
DataSourceContext.clear();
}
}
}
Step 3: DataSource Configuration
Two HikariCP pools: one for the primary, one for replicas. The replica pool uses round-robin across two replica hosts.
// SCALED: Separate connection pools for primary and replicas
@Configuration
public class DataSourceConfig {
@Bean
public DataSource dataSource(
@Qualifier("primaryDataSource") DataSource primary,
@Qualifier("replicaDataSource") DataSource replica) {
var router = new ReadWriteRoutingDataSource();
var dataSources = Map.<Object, Object>of(
DataSourceType.PRIMARY, primary,
DataSourceType.REPLICA, replica
);
router.setTargetDataSources(dataSources);
router.setDefaultTargetDataSource(primary);
return router;
}
@Bean("primaryDataSource")
@ConfigurationProperties("spring.datasource.primary")
public DataSource primaryDataSource() {
return DataSourceBuilder.create()
.type(HikariDataSource.class)
.build();
}
@Bean("replicaDataSource")
@ConfigurationProperties("spring.datasource.replica")
public DataSource replicaDataSource() {
return DataSourceBuilder.create()
.type(HikariDataSource.class)
.build();
}
}
# application.yml
spring:
datasource:
primary:
jdbc-url: jdbc:postgresql://pg-primary:5432/ridehail
username: app
password: ${DB_PASSWORD}
hikari:
maximum-pool-size: 20
connection-timeout: 5000
replica:
jdbc-url: jdbc:postgresql://pg-replica-1:5432/ridehail,pg-replica-2:5432/ridehail
username: app_readonly
password: ${DB_READONLY_PASSWORD}
hikari:
maximum-pool-size: 40
connection-timeout: 5000
read-only: true
The replica pool is larger (40 vs 20) because it handles 19x the traffic. The read-only: true on HikariCP sets the JDBC connection to read-only mode, which PostgreSQL uses to reject any accidental writes.
Step 4: Replication Lag Monitoring
Streaming replication has lag. Under normal load, lag is sub-millisecond. Under write bursts (surge pricing recalculation, batch trip imports), lag can hit seconds. Monitor it.
-- Run on the primary: check replication lag per replica
SELECT
client_addr,
state,
sent_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
replay_lag
FROM pg_stat_replication;
Expose it as a Prometheus metric:
// SCALED: Custom gauge for replication lag per replica
@Component
public class ReplicationLagMonitor {
private final JdbcTemplate primaryJdbc;
private final MeterRegistry registry;
@Scheduled(fixedRate = 5000)
public void checkReplicationLag() {
primaryJdbc.query(
"""
SELECT client_addr,
EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
FROM pg_stat_replication
""",
rs -> {
while (rs.next()) {
Gauge.builder("pg.replication.lag.seconds")
.tag("replica", rs.getString("client_addr"))
.register(registry)
.set(rs.getDouble("lag_seconds"));
}
}
);
}
}
Alert when lag exceeds 500ms:
# prometheus/alerts.yml
- alert: PostgresReplicationLagHigh
expr: pg_replication_lag_seconds > 0.5
for: 30s
labels:
severity: warning
annotations:
summary: "Replication lag on {{ $labels.replica }} is {{ $value }}s"
Step 5: Read-Your-Writes Consistency
A rider completes a trip and immediately opens trip history. The trip was written to the primary 50ms ago. The replica has not replayed it yet. The rider sees stale data. They tap refresh. Still stale. They contact support.
Fix: route reads to the primary for users who have written in the last N seconds.
// SCALED: Track recent writers, route their reads to primary
@Component
public class RecentWriterTracker {
private final RedisTemplate<String, String> redis;
private static final Duration CONSISTENCY_WINDOW = Duration.ofSeconds(5);
public void markWriter(String userId) {
String key = "recent-writer:" + userId;
redis.opsForValue().set(key, "1", CONSISTENCY_WINDOW);
}
public boolean isRecentWriter(String userId) {
return Boolean.TRUE.equals(redis.hasKey("recent-writer:" + userId));
}
}
// SCALED: Modified routing aspect checks for recent writers
@Aspect
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class ReadOnlyRoutingAspect {
private final RecentWriterTracker recentWriterTracker;
@Around("@annotation(transactional)")
public Object route(ProceedingJoinPoint joinPoint,
Transactional transactional) throws Throwable {
if (transactional.readOnly()) {
String userId = SecurityContextHolder.getContext()
.getAuthentication().getName();
// Recent writers read from primary for consistency
if (recentWriterTracker.isRecentWriter(userId)) {
DataSourceContext.set(DataSourceType.PRIMARY); // BOTTLENECK: bypasses replica
} else {
DataSourceContext.set(DataSourceType.REPLICA); // SCALED: normal read path
}
} else {
DataSourceContext.set(DataSourceType.PRIMARY);
}
try {
return joinPoint.proceed();
} finally {
DataSourceContext.clear();
}
}
}
// SCALED: Trip completion marks the rider as a recent writer
@Service
public class TripCompletionService {
private final RecentWriterTracker recentWriterTracker;
private final TripRepository tripRepository;
@Transactional
public Trip completeTrip(TripCompletionRequest request) {
Trip trip = tripRepository.findById(request.getTripId())
.orElseThrow();
trip.complete(request.getFare(), request.getDistance());
Trip saved = tripRepository.save(trip);
recentWriterTracker.markWriter(trip.getRiderId());
return saved;
}
}
The 5-second consistency window means ~2% of read traffic routes to primary (riders who completed a trip in the last 5 seconds). The other 98% goes to replicas. The primary is not overloaded. The rider sees their completed trip immediately.
The Proof
Same Locust test, now with 1 primary + 2 replicas + read-your-writes routing:
| Users | Read RPS | Write RPS | p99 Read (ms) | p99 Write (ms) | Primary CPU | Replica CPU |
|---|---|---|---|---|---|---|
| 200 | 380 | 20 | 32 | 14 | 8% | 18% |
| 400 | 750 | 40 | 38 | 16 | 11% | 24% |
| 600 | 1,120 | 59 | 42 | 18 | 14% | 31% |
| 800 | 1,480 | 78 | 48 | 19 | 16% | 38% |
| 1000 | 1,840 | 97 | 55 | 21 | 19% | 44% |
| Metric | Single Primary | Primary + 2 Replicas | Improvement |
|---|---|---|---|
| Max read RPS | 810 | 1,840 | 2.3x |
| p99 read at 800 users | 2,600ms | 48ms | 54x |
| p99 write at 800 users | 340ms | 19ms | 18x |
| Primary CPU at 800 users | 91% | 16% | 5.7x reduction |
| Error rate at 1000 users | timeouts | 0% | Eliminated |
Write latency improved 18x without changing any write code. Removing read load from the primary freed CPU and I/O for writes. The replicas each run at 38% CPU at 800 users, meaning each replica has headroom for 2x more read traffic before needing a third replica.
The zero-code-change claim is real for the repository layer. The TripRepository interface has no idea it talks to replicas. The routing is entirely in the DataSource layer and the AOP aspect. Add a new @Transactional(readOnly = true) service method and it automatically routes to replicas.