Measuring and Monitoring Replication Lag

The Symptom

Secondary reads are returning data that is 45 seconds old despite maxStalenessSeconds: 30. The driver should exclude secondaries with more than 30 seconds of lag, but it does not. Users report seeing stale dashboard data.

The Cause

The driver’s staleness estimation is approximate. It relies on periodic heartbeats (every 10 seconds by default, controlled by heartbeatFrequencyMS) to learn each member’s last write time. Between heartbeats, the lag can grow without the driver knowing. The driver also uses a different lag calculation than rs.printSecondaryReplicationInfo():

Driver: Compares the secondary’s last reported oplog timestamp with the primary’s last reported oplog timestamp from the most recent heartbeat.
rs.printSecondaryReplicationInfo(): Compares the secondary’s oplog tail with the primary’s oplog tail at the moment the command runs.

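If the primary’s write rate spikes between heartbeats, the actual lag may exceed the driver’s estimate. The 90-second minimum for maxStalenessSeconds exists to absorb this imprecision. The application configured 30 seconds, which is below that minimum: drivers that follow the server selection specification reject such a value with an error instead of rounding it up, so a 30-second bound cannot be enforced through maxStalenessSeconds at all.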
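
To make the driver-side estimate concrete, the sketch below reproduces the arithmetic that the Max Staleness part of the server selection specification describes for the case where a primary is known. The class, method, and parameter names are illustrative and are not part of any driver API, and the timestamps in main are invented for the example.

```java
// Illustrative only: mirrors the staleness arithmetic drivers apply when a
// primary is known. The names are hypothetical, not a driver API.
final class StalenessEstimate {

    /**
     * All arguments are in milliseconds.
     * lastUpdate = when the driver last received a heartbeat reply from the member.
     * lastWrite  = the lastWriteDate the member reported in that reply.
     */
    static long estimateMillis(long secondaryLastUpdate, long secondaryLastWrite,
                               long primaryLastUpdate, long primaryLastWrite,
                               long heartbeatFrequencyMs) {
        return (secondaryLastUpdate - secondaryLastWrite)
             - (primaryLastUpdate - primaryLastWrite)
             + heartbeatFrequencyMs;
    }

    // A secondary stays eligible while its estimate does not exceed the bound.
    static boolean isEligible(long estimateMillis, long maxStalenessSeconds) {
        return estimateMillis <= maxStalenessSeconds * 1000;
    }

    public static void main(String[] args) {
        // Primary checked at t=100s and had just written; secondary checked
        // at t=105s and last applied a write at t=90s; 10s heartbeat.
        long estimate = estimateMillis(105_000, 90_000, 100_000, 100_000, 10_000);
        System.out.println(estimate);                  // 25000
        System.out.println(isEligible(estimate, 90));  // true
    }
}
```

Note that the heartbeat interval is added to the estimate as a pessimistic allowance, and that every input is only as fresh as the last heartbeat. This is why the estimate can disagree with what rs.printSecondaryReplicationInfo() reports at a given instant.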

// Check replication lag from the primary
rs.printSecondaryReplicationInfo()

// Output:
// source: secondary1:27017
//   syncedTo: Sat Jun 15 2024 12:00:45 GMT+0000
//   45 secs (0 hrs) behind the primary
// source: secondary2:27017
//   syncedTo: Sat Jun 15 2024 12:01:15 GMT+0000
//   15 secs (0 hrs) behind the primary

Secondary1 has 45 seconds of lag. The root cause: an index build on secondary1 is consuming disk I/O, slowing oplog application.

The Benchmark

Lag cause	Typical lag	Detection method
Normal replication	0-2 seconds	rs.printSecondaryReplicationInfo()
Slow disk (IOPS exhaustion)	5-60 seconds	iostat, disk metrics
Foreground index build	30 seconds to hours	db.currentOp()
Long-running write batch	10-120 seconds	db.currentOp(), slow query log
Network saturation	5-30 seconds	Network metrics
Secondary reading cold data	5-45 seconds	WiredTiger cache metrics

The Fix

Step 1: Monitor replication lag with Prometheus.

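// FAST: Export replication lag as a Micrometer gauge (scraped by Prometheus)
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.bson.Document;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

@Component
public class ReplicationLagMonitor {

    private final MongoClient client;
    private final Gauge replicationLag;

    public ReplicationLagMonitor(MongoClient client, MeterRegistry registry) {
        this.client = client;
        this.replicationLag = Gauge.builder("mongodb.replication.lag.seconds",
                this, ReplicationLagMonitor::measureLag)
            .description("Worst replication lag across secondaries, in seconds")
            .tag("member", "worst")
            .register(registry);
    }

    // Public so that callers outside this class (see Step 4) can read the lag.
    // Returns the worst lag across SECONDARY members, or -1 if it cannot be measured.
    public double measureLag() {
        try {
            // Unless the client connects directly to a secondary, this command is
            // routed to the primary, so "self" would be the primary (lag always 0).
            // Compare every secondary against the primary instead.
            Document replStatus = client.getDatabase("admin")
                .runCommand(new Document("replSetGetStatus", 1));
            List<Document> members = replStatus.getList("members", Document.class);

            Date primaryOptime = null;
            for (Document member : members) {
                if (member.getInteger("state", -1) == 1) {  // PRIMARY
                    primaryOptime = member.getDate("optimeDate");
                }
            }
            if (primaryOptime == null) {
                return -1;  // No primary visible: treat as an error state
            }

            double maxLag = 0;
            for (Document member : members) {
                Date optime = member.getDate("optimeDate");
                if (member.getInteger("state", -1) == 2 && optime != null) {  // SECONDARY
                    maxLag = Math.max(maxLag,
                        (primaryOptime.getTime() - optime.getTime()) / 1000.0);
                }
            }
            return maxLag;
        } catch (Exception e) {
            return -1;  // Error state
        }
    }
}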

Step 2: Set up Grafana alerts for lag thresholds.

Alert on replication lag exceeding the staleness bounds configured in the application:

Alert	Threshold	Severity	Action
Lag warning	> 10 seconds for 2 minutes	Warning	Investigate secondary I/O
Lag critical	> 30 seconds for 2 minutes	Critical	Route reads to primary
Lag emergency	> 300 seconds for 5 minutes	Page	Secondary may need resync
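
The "for N minutes" condition in each row means the threshold must be exceeded continuously, not just once. The following sketch expresses that rule in code to make the semantics explicit. It is a simplified model under the assumption of evenly spaced samples, not a description of how Grafana evaluates alerts internally, and the class and method names are hypothetical.

```java
import java.util.List;

// Simplified model of the alert table: a rule fires only when every sample
// in the trailing window exceeds the threshold. Names are hypothetical.
final class LagAlertRule {

    enum Severity { OK, WARNING, CRITICAL, PAGE }

    // True when the most recent samples covering forSeconds all exceed the threshold.
    static boolean sustainedAbove(List<Double> samples, int sampleIntervalSeconds,
                                  double thresholdSeconds, int forSeconds) {
        int needed = (int) Math.ceil((double) forSeconds / sampleIntervalSeconds);
        if (needed <= 0 || samples.size() < needed) {
            return false;  // Not enough history to decide
        }
        for (double lag : samples.subList(samples.size() - needed, samples.size())) {
            if (lag <= thresholdSeconds) {
                return false;
            }
        }
        return true;
    }

    // Checks the most severe rule first so that one result is returned.
    static Severity evaluate(List<Double> samples, int sampleIntervalSeconds) {
        if (sustainedAbove(samples, sampleIntervalSeconds, 300, 300)) return Severity.PAGE;
        if (sustainedAbove(samples, sampleIntervalSeconds, 30, 120)) return Severity.CRITICAL;
        if (sustainedAbove(samples, sampleIntervalSeconds, 10, 120)) return Severity.WARNING;
        return Severity.OK;
    }
}
```

With 30-second samples, four consecutive readings of 45 seconds evaluate to CRITICAL, while a single dip to 5 seconds inside the window resets the rule to OK. Requiring a sustained breach keeps brief spikes from paging anyone.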

Step 3: Diagnose the root cause.

// Check what's running on the lagging secondary
db.currentOp({ $or: [
  { "command.createIndexes": { $exists: true } },
  { "waitingForLock": true },
  { "secs_running": { $gt: 30 } }
]})

// Check oplog window (how much history the oplog holds)
rs.printReplicationInfo()
// configured oplog size: 10240MB
// log length start to end: 172800 secs (48 hrs)
// oplog first event time: Thu Jun 13 2024 12:00:00 GMT+0000
// oplog last event time: Sat Jun 15 2024 12:00:00 GMT+0000

// If oplog window < replication lag, the secondary cannot catch up
// and requires a full resync
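
The comparison between oplog window and lag can be written as a small check. The sketch below is an illustration only: it takes the first and last oplog event times as plain inputs (here, the values from the sample output above) and does not query MongoDB. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative check of "oplog window vs. lag". Inputs are supplied by the
// caller; nothing here talks to MongoDB. Names are hypothetical.
final class OplogWindowCheck {

    // The oplog window is the span between its oldest and newest entries.
    static Duration window(Instant firstEvent, Instant lastEvent) {
        return Duration.between(firstEvent, lastEvent);
    }

    // A secondary can catch up by replay only while its lag is inside the window.
    static boolean canCatchUp(Duration oplogWindow, Duration lag) {
        return lag.compareTo(oplogWindow) < 0;
    }

    public static void main(String[] args) {
        Duration w = window(Instant.parse("2024-06-13T12:00:00Z"),
                            Instant.parse("2024-06-15T12:00:00Z"));
        System.out.println(w.getSeconds());                        // 172800
        System.out.println(canCatchUp(w, Duration.ofSeconds(45))); // true
        System.out.println(canCatchUp(w, Duration.ofHours(49)));   // false
    }
}
```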

Step 4: Application-level fallback when lag is excessive.

// FAST: Fall back to primary when secondaries are lagging
public MongoCollection<Document> getReadCollection(boolean toleratesStale) {
    if (!toleratesStale) {
        return database.getCollection("readings")
            .withReadPreference(ReadPreference.primary());
    }

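    // Check if secondaries are healthy. measureLag() runs replSetGetStatus,
    // so on a hot path prefer a value sampled on a schedule over a call per read.
    double lag = replicationLagMonitor.measureLag();
    if (lag > 30 || lag < 0) {
        // Secondaries are too far behind, or the lag could not be measured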
        return database.getCollection("readings")
            .withReadPreference(ReadPreference.primary());
    }

    return database.getCollection("readings")
        .withReadPreference(ReadPreference.secondaryPreferred(
            90, TimeUnit.SECONDS));
}

The Proof

After implementing monitoring and fallback:

Metric	Before	After
Stale data incidents/week	12	0
Undetected lag > 30s	Common	Alerted within 2 min
Index build lag impact	45s undetected	Automatic fallback to primary
Mean time to diagnose lag	15 minutes	2 minutes (dashboard)

The Trade-off

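Application-level lag monitoring adds one replSetGetStatus command per measurement interval (typically 10-30 seconds). The command is lightweight, but it must be run against the admin database, so the monitoring user needs a role that permits it (for example, the built-in clusterMonitor role). On a busy cluster, each additional monitoring query adds a small amount of load.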
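
One way to keep the cost at one command per interval, rather than one per read, is to sample the lag on a schedule and let request paths read the cached value. The sketch below is a minimal version under that assumption. CachedLag is a hypothetical helper, and the probe is any function that returns the lag in seconds (negative on failure), such as a method reference to the lag measurement shown earlier.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

// Hypothetical helper: samples a lag probe on a fixed schedule and serves
// the most recent value to callers without running the probe again.
final class CachedLag implements AutoCloseable {

    private final DoubleSupplier probe;
    private final ScheduledExecutorService scheduler;
    private volatile double lastLagSeconds = -1;  // -1 means "unknown"

    CachedLag(DoubleSupplier probe, long intervalSeconds) {
        this.probe = probe;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "replication-lag-sampler");
            t.setDaemon(true);  // Do not keep the JVM alive for sampling
            return t;
        });
        scheduler.scheduleAtFixedRate(this::refresh, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    // Runs the probe once. Failures are recorded as -1 and never rethrown,
    // because an uncaught exception would cancel the scheduled task.
    void refresh() {
        try {
            lastLagSeconds = probe.getAsDouble();
        } catch (RuntimeException e) {
            lastLagSeconds = -1;
        }
    }

    // Cheap read for request paths: returns the last sampled value.
    double lagSeconds() {
        return lastLagSeconds;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would construct it once, for example new CachedLag(replicationLagMonitor::measureLag, 15), and consult lagSeconds() when choosing a read preference. The cached value can be up to one interval old, so the interval should be short relative to the lag threshold being enforced.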

The fallback-to-primary strategy defeats the purpose of secondary reads during lag events. If the lag is caused by heavy write load on the primary (which is being replicated), falling back to primary adds more read load to the already-stressed primary. Monitor the cause: if lag is due to secondary-specific issues (index build, slow disk), falling back to primary is correct. If lag is due to the primary being overloaded, neither option is good; the system is at capacity.

The oplog window must be larger than the longest expected lag. If the oplog holds 48 hours of history and a secondary goes down for 49 hours, it cannot catch up by replaying the oplog and requires a full resync (hours for large datasets). Size the oplog to cover the longest maintenance window plus a safety margin.
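
As a rough worked example of that sizing rule, the sketch below infers an oplog churn rate from the sample figures earlier in this section (10240 MB covering 48 hours) and scales it to a target window. It assumes the write rate stays roughly constant, which is rarely true during bulk loads, so treat the result as a starting point. The 72-hour window and 25% margin are made-up inputs, and the class and method names are hypothetical.

```java
// Back-of-the-envelope oplog sizing. Assumes a steady write rate; the names
// and the example inputs in main are illustrative.
final class OplogSizing {

    // Oplog churn in MB per hour, inferred from the current size and window.
    static double churnMbPerHour(double oplogSizeMb, double windowHours) {
        return oplogSizeMb / windowHours;
    }

    // Size needed to cover the target window plus a fractional safety margin.
    static double requiredSizeMb(double churnMbPerHour, double targetWindowHours,
                                 double safetyMargin) {
        return churnMbPerHour * targetWindowHours * (1 + safetyMargin);
    }

    public static void main(String[] args) {
        double churn = churnMbPerHour(10240, 48);          // about 213 MB per hour
        double required = requiredSizeMb(churn, 72, 0.25); // about 19200 MB
        System.out.printf("churn=%.1f MB/h, required=%.0f MB%n", churn, required);
    }
}
```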