unbound mongodb at scale

Measuring and Monitoring Replication Lag

5 min read Chapter 57 of 72

The Symptom

Secondary reads are returning data that is 45 seconds old despite maxStalenessSeconds: 30. The driver should exclude secondaries with more than 30 seconds of lag, but it does not. Users report seeing stale dashboard data.

The Cause

The driver’s staleness estimation is approximate. It relies on periodic heartbeats (every 10 seconds) to check each secondary’s oplog position. Between heartbeats, the lag can grow without the driver knowing. The driver also uses a different lag calculation than rs.printSecondaryReplicationInfo():

  • Driver: Compares the secondary’s last reported oplog timestamp with the primary’s last reported oplog timestamp from the most recent heartbeat.
  • rs.printSecondaryReplicationInfo(): Compares the secondary’s oplog tail with the primary’s oplog tail at the moment the command runs.

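If the primary’s write rate spikes between heartbeats, the actual lag may exceed the driver’s estimate. The minimum maxStalenessSeconds of 90 seconds accounts for this imprecision, but the application configured 30 seconds. That is below the floor: the driver cannot enforce a bound tighter than 90 seconds, and drivers that implement the Max Staleness specification report smaller values as a configuration error rather than rounding them up.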
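To illustrate why the driver's figure is only an estimate, here is a minimal sketch of the arithmetic the Max Staleness specification describes for a secondary when a primary is known: the difference between the two members' heartbeat-time snapshots, padded by one heartbeat interval. The class, method, and parameter names are hypothetical, and all values are in milliseconds.

```java
// Sketch of the driver-side staleness estimate (names are illustrative, not a driver API).
class StalenessEstimate {

    /**
     * lastUpdateTime: when the driver last received a heartbeat reply from that member.
     * lastWriteDate:  the member's most recent write time as reported in that reply.
     * The result is padded by one heartbeat interval because the snapshot may be that old.
     */
    public static long estimateStalenessMs(long secondaryLastUpdateTime,
                                           long secondaryLastWriteDate,
                                           long primaryLastUpdateTime,
                                           long primaryLastWriteDate,
                                           long heartbeatFrequencyMs) {
        long secondaryBehind = secondaryLastUpdateTime - secondaryLastWriteDate;
        long primaryBehind = primaryLastUpdateTime - primaryLastWriteDate;
        return secondaryBehind - primaryBehind + heartbeatFrequencyMs;
    }
}
```

With invented sample values, where both members were last checked at t = 100,000 ms, the primary's last write was at 99,000 ms, and the secondary's last write was at 80,000 ms, a 10,000 ms heartbeat interval yields 20,000 - 1,000 + 10,000 = 29,000 ms. Anything that happens after those heartbeats is invisible to the estimate until the next one arrives.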

// Check replication lag from the primary
rs.printSecondaryReplicationInfo()

// Output:
// source: secondary1:27017
//   syncedTo: Sat Jun 15 2024 12:00:45 GMT+0000
//   45 secs (0 hrs) behind the primary
// source: secondary2:27017
//   syncedTo: Sat Jun 15 2024 12:01:15 GMT+0000
//   15 secs (0 hrs) behind the primary

Secondary1 has 45 seconds of lag. The root cause: an index build on secondary1 is consuming disk I/O, slowing oplog application.

The Benchmark

| Lag cause | Typical lag | Detection method |
|---|---|---|
| Normal replication | 0-2 seconds | rs.printSecondaryReplicationInfo() |
| Slow disk (IOPS exhaustion) | 5-60 seconds | iostat, disk metrics |
| Foreground index build | 30 seconds to hours | db.currentOp() |
| Long-running write batch | 10-120 seconds | db.currentOp(), slow query log |
| Network saturation | 5-30 seconds | Network metrics |
| Secondary reading cold data | 5-45 seconds | WiredTiger cache metrics |

The Fix

Step 1: Monitor replication lag with Prometheus.

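// FAST: Export replication lag as a Prometheus gauge (via Micrometer)
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.bson.Document;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

@Component
public class ReplicationLagMonitor {

    private final MongoClient client;
    private final Gauge replicationLag;

    public ReplicationLagMonitor(MongoClient client, MeterRegistry registry) {
        this.client = client;
        this.replicationLag = Gauge.builder("mongodb.replication.lag.seconds",
                this, ReplicationLagMonitor::measureLag)
            .description("Replication lag of the slowest secondary, in seconds")
            .tag("member", "slowest_secondary")
            .register(registry);
    }

    // Public so that read-routing code can reuse the measurement.
    // Returns the lag of the slowest secondary, or -1 if it cannot be measured.
    public double measureLag() {
        try {
            // runCommand targets the primary by default, so the member flagged
            // "self" would be the primary itself; compare every secondary instead.
            Document replStatus = client.getDatabase("admin")
                .runCommand(new Document("replSetGetStatus", 1));
            List<Document> members = replStatus.getList("members", Document.class);

            Date primaryOptime = null;
            for (Document member : members) {
                if (member.getInteger("state", -1) == 1) {  // PRIMARY
                    primaryOptime = member.getDate("optimeDate");
                }
            }
            if (primaryOptime == null) {
                return -1;  // No primary visible: lag is unknown, not zero
            }

            double maxLag = 0;
            for (Document member : members) {
                Date optime = member.getDate("optimeDate");
                if (member.getInteger("state", -1) == 2 && optime != null) {  // SECONDARY
                    maxLag = Math.max(maxLag,
                        (primaryOptime.getTime() - optime.getTime()) / 1000.0);
                }
            }
            return maxLag;
        } catch (Exception e) {
            return -1;  // Error state
        }
    }
}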

Step 2: Set up Grafana alerts for lag thresholds.

Alert on replication lag exceeding the staleness bounds configured in the application:

| Alert | Threshold | Severity | Action |
|---|---|---|---|
| Lag warning | > 10 seconds for 2 minutes | Warning | Investigate secondary I/O |
| Lag critical | > 30 seconds for 2 minutes | Critical | Route reads to primary |
| Lag emergency | > 300 seconds for 5 minutes | Page | Secondary may need resync |
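The "for N minutes" part of each threshold means the lag must stay above the limit for the whole window, not just spike once. As a minimal sketch of that evaluation logic (in Grafana it would normally live in the alert rule itself), the hypothetical class below classifies a series of evenly spaced lag samples, newest last, using the thresholds from the table. The class name, the sampling-interval parameter, and the severity enum are assumptions made for illustration.

```java
import java.util.List;

// Illustrative evaluation of sustained-lag thresholds; not a Grafana or Prometheus API.
class LagAlertRules {

    public enum Severity { NONE, WARNING, CRITICAL, PAGE }

    // True when the newest samples spanning windowSeconds all exceed thresholdSeconds.
    static boolean sustainedAbove(List<Double> samples, int intervalSeconds,
                                  double thresholdSeconds, int windowSeconds) {
        int needed = windowSeconds / intervalSeconds;  // samples required to cover the window
        if (needed == 0 || samples.size() < needed) {
            return false;  // not enough history to claim the condition was sustained
        }
        for (int i = samples.size() - needed; i < samples.size(); i++) {
            if (samples.get(i) <= thresholdSeconds) {
                return false;
            }
        }
        return true;
    }

    // Checks the most severe rule first so a single result is returned.
    public static Severity classify(List<Double> samples, int intervalSeconds) {
        if (sustainedAbove(samples, intervalSeconds, 300, 300)) return Severity.PAGE;
        if (sustainedAbove(samples, intervalSeconds, 30, 120)) return Severity.CRITICAL;
        if (sustainedAbove(samples, intervalSeconds, 10, 120)) return Severity.WARNING;
        return Severity.NONE;
    }
}
```

With samples taken every 30 seconds, four consecutive readings of 45 seconds classify as CRITICAL, while a single low reading inside the window resets the result to NONE.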

Step 3: Diagnose the root cause.

// Check what's running on the lagging secondary
db.currentOp({ $or: [
  { "command.createIndexes": { $exists: true } },
  { "waitingForLock": true },
  { "secs_running": { $gt: 30 } }
]})

// Check oplog window (how much history the oplog holds)
rs.printReplicationInfo()
// configured oplog size: 10240MB
// log length start to end: 172800 secs (48 hrs)
// oplog first event time: Thu Jun 13 2024 12:00:00 GMT+0000
// oplog last event time: Sat Jun 15 2024 12:00:00 GMT+0000

// If oplog window < replication lag, the secondary cannot catch up
// and requires a full resync
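
The comparison in the comment above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not a MongoDB API: the class and method names are hypothetical, and the timestamps would come from rs.printReplicationInfo() on the sync source and from the secondary's last applied optime.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative check of oplog window versus secondary position.
class OplogWindowCheck {

    // Length of history the oplog currently holds.
    public static Duration oplogWindow(Instant firstEvent, Instant lastEvent) {
        return Duration.between(firstEvent, lastEvent);
    }

    // A secondary can resume from the oplog only if the point it stopped at
    // has not yet been overwritten, i.e. it is not older than the first event.
    public static boolean canCatchUp(Instant oplogFirstEvent, Instant secondaryOptime) {
        return !secondaryOptime.isBefore(oplogFirstEvent);
    }
}
```

Using the sample output above, the window from 2024-06-13T12:00:00Z to 2024-06-15T12:00:00Z is 172800 seconds, so a secondary that is 45 seconds behind is well inside it, while one whose last applied entry predates 2024-06-13T12:00:00Z is not.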

Step 4: Application-level fallback when lag is excessive.

// FAST: Fall back to primary when secondaries are lagging
public MongoCollection<Document> getReadCollection(boolean toleratesStale) {
    if (!toleratesStale) {
        return database.getCollection("readings")
            .withReadPreference(ReadPreference.primary());
    }

    // Check if secondaries are healthy
    double lag = replicationLagMonitor.measureLag();
    if (lag > 30 || lag < 0) {
        // Secondaries are too far behind or unreachable
        return database.getCollection("readings")
            .withReadPreference(ReadPreference.primary());
    }

    return database.getCollection("readings")
        .withReadPreference(ReadPreference.secondaryPreferred(
            90, TimeUnit.SECONDS));
}
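
Calling measureLag() on every read would issue one replSetGetStatus command per request. One way to avoid that, sketched here under assumptions, is to cache the last measurement for a short time. The CachedLag class, its TTL parameter, and the injected clock are hypothetical names introduced for illustration; the lag source would be something like replicationLagMonitor::measureLag.

```java
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

// Illustrative time-based cache around a lag measurement.
class CachedLag {

    private final DoubleSupplier source;     // e.g. replicationLagMonitor::measureLag
    private final LongSupplier clockMillis;  // e.g. System::currentTimeMillis
    private final long ttlMillis;

    private double cached;
    private long fetchedAt;
    private boolean hasValue;

    public CachedLag(DoubleSupplier source, long ttlMillis, LongSupplier clockMillis) {
        this.source = source;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    // Returns the cached lag, refreshing it once the TTL has elapsed.
    public synchronized double get() {
        long now = clockMillis.getAsLong();
        if (!hasValue || now - fetchedAt >= ttlMillis) {
            cached = source.getAsDouble();
            fetchedAt = now;
            hasValue = true;
        }
        return cached;
    }

    // Negative lag is the monitor's error state, so treat it like excessive lag.
    public boolean shouldReadFromPrimary(double maxLagSeconds) {
        double lag = get();
        return lag < 0 || lag > maxLagSeconds;
    }
}
```

A TTL of 10 to 30 seconds would match the measurement interval discussed in the trade-offs, at the cost of routing decisions that can be that stale.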

The Proof

After implementing monitoring and fallback:

| Metric | Before | After |
|---|---|---|
| Stale data incidents/week | 12 | 0 |
| Undetected lag > 30s | Common | Alerted within 2 min |
| Index build lag impact | 45s undetected | Automatic fallback to primary |
| Mean time to diagnose lag | 15 minutes | 2 minutes (dashboard) |

The Trade-off

Application-level lag monitoring adds a replSetGetStatus command every measurement interval (typically 10-30 seconds). This command is lightweight but runs on the admin database. On a busy cluster, adding monitoring queries increases load slightly.

The fallback-to-primary strategy defeats the purpose of secondary reads during lag events. If the lag is caused by heavy write load on the primary (which is being replicated), falling back to primary adds more read load to the already-stressed primary. Monitor the cause: if lag is due to secondary-specific issues (index build, slow disk), falling back to primary is correct. If lag is due to the primary being overloaded, neither option is good; the system is at capacity.

The oplog window must be larger than the longest expected lag. If the oplog holds 48 hours of history and a secondary goes down for 49 hours, it cannot catch up by replaying the oplog and requires a full resync (hours for large datasets). Size the oplog to cover the longest maintenance window plus a safety margin.
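
As a rough sizing aid, the estimate can be written as a proportion: if the current oplog size holds a known number of hours, scale it to the target window plus a margin. The sketch below assumes the write rate stays roughly what it was when the current window was observed, which is a simplification; the class and parameter names are hypothetical.

```java
// Back-of-the-envelope oplog sizing; assumes a steady write rate.
class OplogSizing {

    /**
     * currentSizeMb / currentWindowHours: what rs.printReplicationInfo() reports today.
     * targetHours: longest outage or maintenance window to survive.
     * safetyMargin: fractional headroom, e.g. 0.25 for 25%.
     */
    public static long requiredOplogMb(double currentSizeMb, double currentWindowHours,
                                       double targetHours, double safetyMargin) {
        if (currentWindowHours <= 0) {
            throw new IllegalArgumentException("currentWindowHours must be positive");
        }
        double requiredHours = targetHours * (1.0 + safetyMargin);
        return (long) Math.ceil(currentSizeMb * requiredHours / currentWindowHours);
    }
}
```

For example, if 10240 MB currently covers 48 hours and the goal is a hypothetical 72-hour maintenance window with a 25% margin, the estimate is 10240 × 90 / 48 = 19200 MB. A write burst shrinks the real window, so the result is best treated as a lower bound.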