unbound mongodb at scale

Oplog Tuning: The Replication Backbone

3 min read Chapter 61 of 72


The oplog (operations log) is a capped collection (local.oplog.rs) on each replica set member that records every write operation that changes data. Secondaries replicate by reading the oplog of their sync source, which is usually the primary, and applying the entries locally. The oplog has a fixed configured size; when it fills, the oldest entries are overwritten.

The oplog window is the time span between the oldest and newest entry in the oplog. If a secondary falls behind by more than the oplog window, it cannot catch up and requires a full initial sync (copying the entire dataset from scratch).

[Figure: Oplog diagram. The capped collection is drawn as a circular buffer: new writes enter at the tail and the oldest entries are overwritten at the head. Oplog window = tail timestamp - head timestamp. A secondary reads from the oplog at its current position; if that position is older than the head, an initial sync is required.]

Oplog Window Calculation

// Check oplog size and window
rs.printReplicationInfo()

// Output:
// configured oplog size: 5120MB
// log length start to end: 43200 secs (12 hrs)
// oplog first event time: Sat Jun 15 2024 00:00:00 GMT+0000
// oplog last event time: Sat Jun 15 2024 12:00:00 GMT+0000
// now: Sat Jun 15 2024 12:05:00 GMT+0000

The oplog window is 12 hours. At the current write rate, a secondary can be offline for up to about 12 hours and still catch up by replaying the oplog. If it is offline for 13 hours, the entries it needs have been overwritten, and it must perform an initial sync.
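The catch-up rule can be sketched as a small check. This is a minimal illustration rather than server or driver code: `canCatchUp` and `oplogWindowHours` are hypothetical helper names, and the timestamps are plain JavaScript `Date` values based on the sample output above.

```javascript
// Hypothetical helper: oplog window in hours from the first and last event times.
function oplogWindowHours(firstEventTime, lastEventTime) {
  return (lastEventTime.getTime() - firstEventTime.getTime()) / 3600000;
}

// Hypothetical helper: a secondary can resume replication only if the last
// entry it applied is still present in the sync source's oplog.
function canCatchUp(oplogFirstEventTime, secondaryLastAppliedTime) {
  return secondaryLastAppliedTime.getTime() >= oplogFirstEventTime.getTime();
}

const first = new Date("2024-06-15T00:00:00Z");
const last = new Date("2024-06-15T12:00:00Z");
console.log(oplogWindowHours(first, last)); // 12

// Secondary stopped at 01:00, after the oldest retained entry
console.log(canCatchUp(first, new Date("2024-06-15T01:00:00Z"))); // true
// Secondary stopped at 23:00 the previous day; its position has been overwritten
console.log(canCatchUp(first, new Date("2024-06-14T23:00:00Z"))); // false
```

The comparison uses the oldest retained entry, not the window length, because the window moves forward as new writes arrive.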

Oplog Size vs Write Rate

Oplog size and write rate together determine the window. For a given size, the write rate (operations per second times average entry size) sets how fast the oplog fills:

Oplog size | Write rate | Oplog window
5 GB | 1,000 ops/s (avg 500 bytes/op) | ~2.8 hours
5 GB | 10,000 ops/s (avg 500 bytes/op) | ~17 minutes
50 GB | 10,000 ops/s (avg 500 bytes/op) | ~2.8 hours
50 GB | 50,000 ops/s (avg 500 bytes/op) | ~33 minutes
200 GB | 50,000 ops/s (avg 500 bytes/op) | ~2.2 hours

The telemetry platform at 50,000 writes/second with an average oplog entry of 500 bytes consumes approximately 25 MB/second of oplog space. A 50 GB oplog lasts 33 minutes. A 200 GB oplog lasts 2.2 hours.
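The arithmetic behind these figures can be reproduced with a short function. This is a rough sketch under stated assumptions: `estimateOplogWindowSeconds` is a hypothetical helper, it uses decimal units (1 GB = 1e9 bytes) because that is what the table's numbers imply, and it assumes a steady write rate with no overhead beyond the average entry size.

```javascript
// Hypothetical helper: estimate the oplog window from size and write load.
// Assumes 1 GB = 1e9 bytes and a constant write rate.
function estimateOplogWindowSeconds(oplogSizeGB, opsPerSecond, avgEntryBytes) {
  const bytesPerSecond = opsPerSecond * avgEntryBytes;
  return (oplogSizeGB * 1e9) / bytesPerSecond;
}

// 50 GB oplog at 50,000 ops/s and 500 bytes per entry (25 MB/s)
const windowSec = estimateOplogWindowSeconds(50, 50000, 500);
console.log((windowSec / 60).toFixed(1) + " minutes"); // 33.3 minutes

// 200 GB oplog at the same write rate
const largeSec = estimateOplogWindowSeconds(200, 50000, 500);
console.log((largeSec / 3600).toFixed(1) + " hours"); // 2.2 hours
```

Real workloads are bursty, so the window observed with rs.printReplicationInfo() during peak traffic is likely to be shorter than this average-based estimate.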

// Oplog entry size depends on the operation
// Insert: full document is stored in oplog
// Update: only the changed fields; operators such as $inc are logged as the resulting values so entries stay idempotent
// Delete: only the _id

// Check average oplog entry size ($bsonSize requires MongoDB 4.4+)
var oplog = db.getSiblingDB("local").oplog.rs;
var sample = oplog.aggregate([
  { $sample: { size: 1000 } },
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $group: { _id: null, avgSize: { $avg: "$size" } } }
]).toArray();
printjson(sample[0]);
// { "_id": null, "avgSize": 487 }

When the Oplog Window is Too Small

The telemetry platform needs to perform rolling maintenance: take a secondary offline, upgrade it, and bring it back. The upgrade takes 45 minutes. The oplog window is 33 minutes. The secondary cannot catch up after the upgrade.
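Working backwards from the maintenance duration gives a minimum oplog size. The sketch below is illustrative only: `requiredOplogSizeMB` is a hypothetical helper, `safetyFactor` is an assumed headroom multiplier rather than a MongoDB setting, and the calculation treats 1 MB as 1e6 bytes with a steady write rate.

```javascript
// Hypothetical helper: smallest oplog (in MB) that covers a planned outage.
// safetyFactor is an assumed headroom multiplier, not a MongoDB setting.
function requiredOplogSizeMB(downtimeMinutes, opsPerSecond, avgEntryBytes, safetyFactor) {
  const bytesNeeded = downtimeMinutes * 60 * opsPerSecond * avgEntryBytes;
  return Math.ceil((bytesNeeded * safetyFactor) / 1e6);
}

// 45-minute upgrade at 50,000 ops/s and 500 bytes per entry
console.log(requiredOplogSizeMB(45, 50000, 500, 1)); // 67500 (bare minimum)
console.log(requiredOplogSizeMB(45, 50000, 500, 2)); // 135000 (with 2x headroom)

// The chosen size could then be applied on each member from mongosh, for example:
// db.adminCommand({ replSetResizeOplog: 1, size: 135000 })
```

The `replSetResizeOplog` command takes its size in megabytes and is run per member; the 2x headroom here is an arbitrary example, and the right margin depends on how much the upgrade time and write rate can vary.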

The result: an initial sync. For a 2 TB dataset on NVMe storage, initial sync takes 4-8 hours. During this time, the member cannot serve reads and cannot be elected primary. The replica set operates with reduced redundancy.

import com.mongodb.client.MongoClient;
import org.bson.Document;

// Detect members that are too stale to catch up (RECOVERING) or running an initial sync (STARTUP2).
// Assumes `client` is a connected MongoClient allowed to run replSetGetStatus.
Document status = client.getDatabase("admin")
    .runCommand(new Document("replSetGetStatus", 1));
for (Document member : status.getList("members", Document.class)) {
    int state = member.getInteger("state");
    // State 3 = RECOVERING (includes members that have fallen off the oplog)
    // State 5 = STARTUP2 (initial sync in progress)
    if (state == 3 || state == 5) {
        System.out.printf("%s is %s%n",
            member.getString("name"), member.getString("stateStr"));
    }
}