Database Scaling: Read Replicas, Sharding, and When PostgreSQL Is Not the Problem
Database Scaling: Read Replicas, Sharding, and When PostgreSQL Is Not the Problem
The Symptom
The trip history endpoint is slow. At 500 RPS, p99 latency is 3,400ms. The dashboard shows PostgreSQL CPU at 92%. The on-call engineer opens a ticket: “Need to shard the database.” The team spends three weeks evaluating sharding strategies. The fix turns out to be a composite index that takes 30 seconds to create.
This chapter covers the full spectrum of PostgreSQL scaling: read replicas for read-heavy workloads, sharding for write-heavy workloads, and the discipline to run EXPLAIN ANALYZE before doing either.
The staircase shows the correct order for database scaling decisions. Steps 1 through 3—adding indexes, optimizing queries with EXPLAIN ANALYZE, and tuning connection pools—are free and take hours, not days. Steps 4 and 5—read replicas and sharding—cost real money and introduce operational complexity like replication lag and cross-shard joins. Most teams jump to step 4 or 5 when their actual problem is a missing composite index that takes 30 seconds to create. Run the free steps first. Measure after each one. Most platforms never need sharding.
The Cause
The ride-hailing platform has a common read/write ratio: 95% reads, 5% writes. Every rider checking trip history, every driver viewing earnings, every ops dashboard polling for active rides. All reads hitting a single PostgreSQL primary. The primary handles writes (trip creation, status updates, payments) and reads (trip history, driver earnings, analytics) on the same instance.
The CPU is high not because of write contention. The CPU is high because 10,000 concurrent read queries compete with write transactions for CPU cycles, buffer pool pages, and I/O bandwidth.
The Baseline
# load-tests/trip_history_locustfile.py
from locust import HttpUser, task, between, LoadTestShape
class TripHistoryUser(HttpUser):
wait_time = between(0.5, 1.0)
@task(9)
def get_trip_history(self):
rider_id = f"rider-{self.environment.runner.user_count % 10000}"
self.client.get(
f"/api/trips/history?riderId={rider_id}&page=0&size=20",
name="/api/trips/history"
)
@task(1)
def complete_trip(self):
self.client.post(
"/api/trips/complete",
json={
"tripId": f"trip-{self.environment.runner.user_count}",
"fare": 24.50,
"distance": 8.3
},
name="/api/trips/complete"
)
class RampLoad(LoadTestShape):
stages = [
{"duration": 60, "users": 100, "spawn_rate": 20},
{"duration": 120, "users": 300, "spawn_rate": 20},
{"duration": 180, "users": 500, "spawn_rate": 20},
{"duration": 240, "users": 800, "spawn_rate": 20},
{"duration": 300, "users": 1000, "spawn_rate": 20},
]
def tick(self):
run_time = self.get_run_time()
for stage in self.stages:
if run_time < stage["duration"]:
return (stage["users"], stage["spawn_rate"])
return None
Single PostgreSQL instance, 50M trips, no read replicas:
| Users | RPS | p50 (ms) | p99 (ms) | PG CPU | Errors |
|---|---|---|---|---|---|
| 100 | 180 | 45 | 210 | 28% | 0% |
| 300 | 490 | 120 | 890 | 61% | 0% |
| 500 | 680 | 340 | 2,100 | 85% | 0.2% |
| 800 | 720 | 1,200 | 3,400 | 92% | 4.1% |
| 1000 | 710 | 2,800 | 8,100 | 95% | 12.3% |
RPS flatlines at ~720. The database cannot push more throughput. Every read query after that point sits in queue.
The Fix: Three Layers
Three interventions, applied in order of complexity and operational cost:
Layer 1: Read Replicas
Route the 95% of read traffic to replicas. The primary handles only writes. Implementation: Spring’s AbstractRoutingDataSource with @Transactional(readOnly = true) driving the routing decision.
// SCALED: Read/write routing based on transaction type
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
@Override
protected Object determineCurrentLookupKey() {
return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
? DataSourceType.REPLICA
: DataSourceType.PRIMARY;
}
}
// SCALED: Service layer declares read-only intent
@Service
public class TripHistoryService {
@Transactional(readOnly = true)
public Page<TripSummary> getTripHistory(String riderId, Pageable pageable) {
// This query routes to a replica automatically
return tripRepository.findByRiderIdOrderByCompletedAtDesc(riderId, pageable);
}
@Transactional
public Trip completeTrip(TripCompletionRequest request) {
// This routes to primary automatically
Trip trip = tripRepository.findById(request.getTripId())
.orElseThrow();
trip.complete(request.getFare(), request.getDistance());
return tripRepository.save(trip);
}
}
Section CH8-S1 covers the full implementation: DataSource configuration, AOP-based routing, replication lag monitoring, and read-your-writes consistency.
Layer 2: Index Optimization
Before sharding, run EXPLAIN ANALYZE. The trip history query:
-- BOTTLENECK: Sequential scan on 50M rows
EXPLAIN ANALYZE
SELECT * FROM trips
WHERE rider_id = 'rider-4821'
ORDER BY completed_at DESC
LIMIT 20;
-- Seq Scan on trips (cost=0.00..1284567.00 rows=50000000 width=248)
-- Filter: (rider_id = 'rider-4821'::text)
-- Rows Removed by Filter: 49999482
-- Planning Time: 0.15 ms
-- Execution Time: 2,143.82 ms
The fix:
-- SCALED: Composite index for the exact query pattern
CREATE INDEX CONCURRENTLY idx_trips_rider_completed
ON trips (rider_id, completed_at DESC);
After the index:
-- Index Scan Backward using idx_trips_rider_completed on trips
-- Index Cond: (rider_id = 'rider-4821'::text)
-- Planning Time: 0.12 ms
-- Execution Time: 3.87 ms
From 2,143ms to 3.87ms. No sharding. No new infrastructure. One DDL statement.
Layer 3: Sharding (When You Actually Need It)
Sharding is for when a single PostgreSQL instance cannot handle the write throughput. Not the read throughput (replicas solve that). Not the query latency (indexes solve that). The write throughput.
For the ride-hailing platform, the decision point: when trip creation rate exceeds what a single primary can handle with WAL writes, index updates, and replication to replicas. For most platforms, that threshold is 10,000-50,000 writes/second on properly tuned hardware.
Section CH8-S2 covers sharding strategies, the ShardResolver implementation, and why the trip history problem was never a sharding problem.
The Proof
After read replicas + composite index, same Locust test:
| Users | RPS | p50 (ms) | p99 (ms) | PG Primary CPU | Errors |
|---|---|---|---|---|---|
| 100 | 195 | 4 | 18 | 6% | 0% |
| 300 | 580 | 5 | 22 | 9% | 0% |
| 500 | 960 | 6 | 28 | 12% | 0% |
| 800 | 1,520 | 7 | 35 | 15% | 0% |
| 1000 | 1,890 | 9 | 42 | 18% | 0% |
| Metric | Before | After | Factor |
|---|---|---|---|
| Max RPS | 720 | 1,890 | 2.6x |
| p99 at 500 users | 2,100ms | 28ms | 75x |
| PG CPU at 800 users | 92% | 15% | 6x reduction |
| Error rate at 1000 users | 12.3% | 0% | Eliminated |
The 75x p99 improvement came from two changes: routing reads to replicas (removing load from primary) and adding a composite index (removing sequential scans from every query). Neither required sharding. Neither required new hardware beyond two replica instances.
The Decision Framework
Before proposing database scaling, walk through this checklist:
- Run EXPLAIN ANALYZE on the slow queries. If any show sequential scans on tables with more than 1M rows, add indexes first.
- Check connection pool metrics. If
hikaricp_connections_acquire_secondsp99 > 50ms, the pool is undersized (Chapter 4). - Check cache hit rates. If the same query runs thousands of times per minute with identical parameters, add caching (Chapter 6).
- Add read replicas if read RPS exceeds what a single instance can serve after indexes and caching are in place.
- Shard only when write throughput on a single primary is the bottleneck, after all of the above.
Most teams jump to step 5. The ride-hailing platform’s trip history problem was step 1.