Measuring and Monitoring Replication Lag
The Symptom
Secondary reads are returning data that is 45 seconds old despite maxStalenessSeconds: 30. The driver should exclude secondaries with more than 30 seconds of lag, but it does not. Users report seeing stale dashboard data.
The Cause
The driver’s staleness estimation is approximate. It relies on periodic heartbeats (every 10 seconds) to check each secondary’s oplog position. Between heartbeats, the lag can grow without the driver knowing. The driver also uses a different lag calculation than rs.printSecondaryReplicationInfo():
- Driver: Compares the secondary’s last reported oplog timestamp with the primary’s last reported oplog timestamp from the most recent heartbeat.
- rs.printSecondaryReplicationInfo(): Compares the secondary’s oplog tail with the primary’s oplog tail at the moment the command runs.
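If the primary’s write rate spikes between heartbeats, the actual lag may exceed the driver’s estimate. The 90-second minimum for maxStalenessSeconds exists to absorb this imprecision. The application configured 30 seconds, which is below that minimum. Drivers that follow the Max Staleness specification do not round such a value up to 90; they report a configuration error during server selection. A 30-second bound therefore cannot be in effect, so check how the setting actually reaches the driver.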
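The driver-side estimate can be sketched as below. This is a minimal illustration, assuming the formula given in the drivers' Max Staleness specification for the case where a primary is known; the class and method names are hypothetical and are not part of any driver API.

```java
// Hypothetical sketch of the driver's staleness estimate (not driver API).
final class StalenessEstimate {

    /**
     * lastUpdateTime: when the driver last received a heartbeat reply from the server.
     * lastWriteDate:  the most recent write time the server reported in that reply.
     * All values are in milliseconds.
     */
    static long estimateStalenessMs(long secondaryLastUpdateTimeMs,
                                    long secondaryLastWriteDateMs,
                                    long primaryLastUpdateTimeMs,
                                    long primaryLastWriteDateMs,
                                    long heartbeatFrequencyMs) {
        long secondaryIdle = secondaryLastUpdateTimeMs - secondaryLastWriteDateMs;
        long primaryIdle = primaryLastUpdateTimeMs - primaryLastWriteDateMs;
        // The heartbeat interval is added because the secondary may have
        // fallen further behind since it was last checked.
        return secondaryIdle - primaryIdle + heartbeatFrequencyMs;
    }

    /** A secondary is eligible when its estimated staleness is within the bound. */
    static boolean isEligible(long stalenessMs, long maxStalenessSeconds) {
        return stalenessMs <= maxStalenessSeconds * 1000;
    }

    public static void main(String[] args) {
        long staleness = estimateStalenessMs(100_000, 60_000, 101_000, 99_000, 10_000);
        System.out.println("estimated staleness ms: " + staleness);
        System.out.println("eligible at 90s bound: " + isEligible(staleness, 90));
    }
}
```

With a 10-second heartbeat, the estimate always carries up to one heartbeat interval of uncertainty, which is why it can disagree with a measurement taken on the server at the moment a command runs.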
// Check replication lag from the primary
rs.printSecondaryReplicationInfo()
// Output:
// source: secondary1:27017
// syncedTo: Sat Jun 15 2024 12:00:45 GMT+0000
// 45 secs (0 hrs) behind the primary
// source: secondary2:27017
// syncedTo: Sat Jun 15 2024 12:01:15 GMT+0000
// 15 secs (0 hrs) behind the primary
Secondary1 has 45 seconds of lag. The root cause: an index build on secondary1 is consuming disk I/O, slowing oplog application.
The Benchmark
| Lag cause | Typical lag | Detection method |
|---|---|---|
| Normal replication | 0-2 seconds | rs.printSecondaryReplicationInfo() |
| Slow disk (IOPS exhaustion) | 5-60 seconds | iostat, disk metrics |
| Index build | 30 seconds to hours | db.currentOp() |
| Long-running write batch | 10-120 seconds | db.currentOp(), slow query log |
| Network saturation | 5-30 seconds | Network metrics |
| Secondary reading cold data | 5-45 seconds | WiredTiger cache metrics |
The Fix
Step 1: Monitor replication lag with Prometheus.
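// FAST: Export replication lag as a Prometheus gauge
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.bson.Document;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

@Component
public class ReplicationLagMonitor {
    private static final int PRIMARY = 1;
    private static final int SECONDARY = 2;

    private final MongoClient client;
    private final Gauge replicationLag;

    public ReplicationLagMonitor(MongoClient client, MeterRegistry registry) {
        this.client = client;
        this.replicationLag = Gauge.builder("mongodb.replication.lag.seconds",
                this, ReplicationLagMonitor::measureLag)
            .description("Lag of the most-delayed secondary, in seconds")
            .tag("member", "slowest_secondary")
            .register(registry);
    }

    // Public so read-routing code can call it.
    // Returns the lag of the most-delayed secondary, or -1 if it cannot be determined.
    public double measureLag() {
        try {
            // runCommand uses ReadPreference.primary() by default, so the member
            // flagged "self" would be the primary and its lag would always be 0.
            // Compare every secondary against the primary instead.
            Document replStatus = client.getDatabase("admin")
                .runCommand(new Document("replSetGetStatus", 1));
            List<Document> members = replStatus.getList("members", Document.class);

            Date primaryOptime = null;
            for (Document member : members) {
                if (member.getInteger("state", -1) == PRIMARY) {
                    primaryOptime = member.get("optimeDate", Date.class);
                }
            }
            if (primaryOptime == null) {
                return -1; // No primary visible: lag is unknown, not zero
            }

            double maxLag = 0;
            for (Document member : members) {
                Date optime = member.get("optimeDate", Date.class);
                if (member.getInteger("state", -1) == SECONDARY && optime != null) {
                    double lag = (primaryOptime.getTime() - optime.getTime()) / 1000.0;
                    maxLag = Math.max(maxLag, lag);
                }
            }
            return maxLag;
        } catch (Exception e) {
            return -1; // Error state
        }
    }
}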
Step 2: Set up Grafana alerts for lag thresholds.
Alert on replication lag exceeding the staleness bounds configured in the application:
| Alert | Threshold | Severity | Action |
|---|---|---|---|
| Lag warning | > 10 seconds for 2 minutes | Warning | Investigate secondary I/O |
| Lag critical | > 30 seconds for 2 minutes | Critical | Route reads to primary |
| Lag emergency | > 300 seconds for 5 minutes | Page | Secondary may need resync |
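The "threshold for a duration" semantics in the table can be sketched in plain Java, for example to unit-test alert rules before encoding them in Grafana. This is a hypothetical illustration, not Grafana's evaluation engine; the class name, the enum, and the use of caller-supplied timestamps in seconds are all assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical evaluator mirroring the alert table: a severity fires only
// after lag has stayed above its threshold for the full hold period.
final class LagAlertEvaluator {

    enum Severity {
        OK(0, 0), WARNING(10, 120), CRITICAL(30, 120), PAGE(300, 300);

        final double thresholdSeconds;
        final long holdSeconds;

        Severity(double thresholdSeconds, long holdSeconds) {
            this.thresholdSeconds = thresholdSeconds;
            this.holdSeconds = holdSeconds;
        }
    }

    // When each threshold was first breached in the current unbroken run.
    private final Map<Severity, Long> breachedSince = new EnumMap<>(Severity.class);

    /** Records one lag sample and returns the highest severity currently firing. */
    Severity record(long nowSeconds, double lagSeconds) {
        Severity result = Severity.OK;
        for (Severity s : Severity.values()) {
            if (s == Severity.OK) {
                continue;
            }
            if (lagSeconds > s.thresholdSeconds) {
                long since = breachedSince.computeIfAbsent(s, k -> nowSeconds);
                if (nowSeconds - since >= s.holdSeconds) {
                    result = s; // values() is ordered, so later severities win
                }
            } else {
                breachedSince.remove(s); // run broken: restart the hold timer
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LagAlertEvaluator evaluator = new LagAlertEvaluator();
        System.out.println(evaluator.record(0, 45));   // OK: breach just started
        System.out.println(evaluator.record(120, 45)); // CRITICAL: above 30s for 2 minutes
        System.out.println(evaluator.record(130, 5));  // OK: lag recovered
    }
}
```

The hold period keeps a single slow heartbeat or a short write burst from paging anyone, at the cost of a detection delay equal to the hold period.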
Step 3: Diagnose the root cause.
// Check what's running on the lagging secondary
db.currentOp({ $or: [
  { "command.createIndexes": { $exists: true } },
  { "waitingForLock": true },
  { "secs_running": { $gt: 30 } }
]})
// Check oplog window (how much history the oplog holds)
rs.printReplicationInfo()
// configured oplog size: 10240MB
// log length start to end: 172800 secs (48 hrs)
// oplog first event time: Thu Jun 13 2024 12:00:00 GMT+0000
// oplog last event time: Sat Jun 15 2024 12:00:00 GMT+0000
// If oplog window < replication lag, the secondary cannot catch up
// and requires a full resync
Step 4: Application-level fallback when lag is excessive.
// FAST: Fall back to primary when secondaries are lagging
public MongoCollection<Document> getReadCollection(boolean toleratesStale) {
    if (!toleratesStale) {
        return database.getCollection("readings")
            .withReadPreference(ReadPreference.primary());
    }
    // Check if secondaries are healthy
    double lag = replicationLagMonitor.measureLag();
    if (lag > 30 || lag < 0) {
        // Secondaries are too far behind or unreachable
        return database.getCollection("readings")
            .withReadPreference(ReadPreference.primary());
    }
    return database.getCollection("readings")
        .withReadPreference(ReadPreference.secondaryPreferred(
            90, TimeUnit.SECONDS));
}
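Calling measureLag() on every read would issue one replSetGetStatus command per request. A minimal sketch of a time-based cache that keeps the command on a fixed interval is shown here. The CachedLag class, its constructor parameters, and the injected clock are hypothetical and exist only for illustration; in the application the supplier would wrap the monitor's measurement.

```java
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

// Hypothetical cache: refreshes the lag measurement at most once per TTL.
final class CachedLag {
    private final DoubleSupplier source; // e.g. the monitor's lag measurement
    private final LongSupplier clockMs;  // injected so tests can control time
    private final long ttlMs;

    private double cachedValue;
    private long fetchedAtMs;
    private boolean hasValue;

    CachedLag(DoubleSupplier source, LongSupplier clockMs, long ttlMs) {
        this.source = source;
        this.clockMs = clockMs;
        this.ttlMs = ttlMs;
    }

    /** Returns the cached lag, refreshing it only when the TTL has elapsed. */
    synchronized double get() {
        long now = clockMs.getAsLong();
        if (!hasValue || now - fetchedAtMs >= ttlMs) {
            cachedValue = source.getAsDouble();
            fetchedAtMs = now;
            hasValue = true;
        }
        return cachedValue;
    }

    /** Same rule as the fallback above: negative means unknown, so use the primary. */
    static boolean shouldReadFromPrimary(double lagSeconds, double maxLagSeconds) {
        return lagSeconds < 0 || lagSeconds > maxLagSeconds;
    }

    public static void main(String[] args) {
        // Stand-in supplier returning a fixed lag; a real one would query the cluster.
        CachedLag lag = new CachedLag(() -> 12.0, System::currentTimeMillis, 10_000);
        System.out.println("read from primary: " + shouldReadFromPrimary(lag.get(), 30));
    }
}
```

The trade-off is that routing decisions can be up to one TTL out of date, so the TTL should stay well below the lag threshold that triggers the fallback.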
The Proof
After implementing monitoring and fallback:
| Metric | Before | After |
|---|---|---|
| Stale data incidents/week | 12 | 0 |
| Undetected lag > 30s | Common | Alerted within 2 min |
| Index build lag impact | 45s undetected | Automatic fallback to primary |
| Mean time to diagnose lag | 15 minutes | 2 minutes (dashboard) |
The Trade-off
Application-level lag monitoring adds a replSetGetStatus command every measurement interval (typically 10-30 seconds). The command is lightweight, but on a busy cluster it still increases load slightly. To stay on that interval, the Step 4 fallback should consult a cached measurement instead of running the command on every read.
The fallback-to-primary strategy defeats the purpose of secondary reads during lag events. If the lag is caused by heavy write load on the primary (which is being replicated), falling back to primary adds more read load to the already-stressed primary. Monitor the cause: if lag is due to secondary-specific issues (index build, slow disk), falling back to primary is correct. If lag is due to the primary being overloaded, neither option is good; the system is at capacity.
The oplog window must be larger than the longest expected lag. If the oplog holds 48 hours of history and a secondary goes down for 49 hours, it cannot catch up by replaying the oplog and requires a full resync (hours for large datasets). Size the oplog to cover the longest maintenance window plus a safety margin.
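The sizing rule can be worked through with the numbers from the rs.printReplicationInfo() output in Step 3 (10240 MB covering 48 hours). The sketch below assumes a steady write rate, which real workloads rarely have; the 72-hour maintenance window and the 1.25 safety factor are illustrative assumptions, and the class and method names are hypothetical.

```java
// Hypothetical oplog sizing arithmetic, assuming a roughly constant write rate.
final class OplogSizing {

    /** Average oplog growth, derived from the current size and the window it covers. */
    static double churnMbPerHour(double oplogSizeMb, double windowHours) {
        return oplogSizeMb / windowHours;
    }

    /** Size needed to keep a secondary recoverable after the longest expected outage. */
    static double requiredSizeMb(double churnMbPerHour,
                                 double longestOutageHours,
                                 double safetyFactor) {
        return churnMbPerHour * longestOutageHours * safetyFactor;
    }

    public static void main(String[] args) {
        double churn = churnMbPerHour(10_240, 48);           // from Step 3 output
        double required = requiredSizeMb(churn, 72, 1.25);   // assumed window and margin
        System.out.println("churn MB/hour: " + Math.round(churn));
        System.out.println("required oplog MB: " + Math.round(required));
    }
}
```

Under these assumptions the oplog would need to be about 19200 MB, close to double its current size. Because write bursts shrink the window, measure the churn during peak load rather than on a quiet day.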