Oplog Tuning: The Replication Backbone
The oplog (operations log) is a capped collection on each replica set member that records every write operation. Secondaries replicate data by reading the oplog of their sync source, which is usually the primary. The oplog has a fixed configured size; when it fills, the oldest entries are overwritten.
The oplog window is the time span between the oldest and newest entry in the oplog. If a secondary falls behind by more than the oplog window, it cannot catch up and requires a full initial sync (copying the entire dataset from scratch).
Oplog Window Calculation
// Check oplog size and window
rs.printReplicationInfo()
// Output:
// configured oplog size: 5120MB
// log length start to end: 43200 secs (12 hrs)
// oplog first event time: Sat Jun 15 2024 00:00:00 GMT+0000
// oplog last event time: Sat Jun 15 2024 12:00:00 GMT+0000
// now: Sat Jun 15 2024 12:05:00 GMT+0000
The oplog window is 12 hours. At the current write rate, a secondary can be offline for up to 12 hours and still catch up by replaying the oplog. If it is offline for 13 hours, the entries it needs have been overwritten, and it must perform an initial sync. The window is not a fixed guarantee: it shrinks whenever the write rate rises.
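The catch-up rule above can be sketched in a few lines of plain JavaScript. The helper names `oplogWindowSeconds` and `canCatchUp` are hypothetical, and the timestamps are the ones from the sample output; in mongosh, `db.getReplicationInfo()` reports the same span, so this is only an illustration of the arithmetic.

```javascript
// Hypothetical helper: oplog window in seconds, from the first and last event times.
function oplogWindowSeconds(firstEvent, lastEvent) {
  return (lastEvent.getTime() - firstEvent.getTime()) / 1000;
}

// Hypothetical helper: a member that stopped replicating `downtimeSeconds` ago can
// resume only if its last applied entry is still inside the window.
function canCatchUp(windowSeconds, downtimeSeconds) {
  return downtimeSeconds < windowSeconds;
}

// Values copied from the rs.printReplicationInfo() output above.
const first = new Date("2024-06-15T00:00:00Z");
const last = new Date("2024-06-15T12:00:00Z");
const windowSecs = oplogWindowSeconds(first, last);

console.log(windowSecs);                         // 43200
console.log(canCatchUp(windowSecs, 11 * 3600));  // true
console.log(canCatchUp(windowSecs, 13 * 3600));  // false
```

The check assumes the write rate stays constant while the member is down; a burst of writes during the outage shortens the real window.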
Oplog Size vs Write Rate
The oplog size determines the window. The write rate determines how fast the oplog fills:
| Oplog size | Write rate | Oplog window |
|---|---|---|
| 5 GB | 1,000 ops/s (avg 500 bytes/op) | ~2.8 hours |
| 5 GB | 10,000 ops/s (avg 500 bytes/op) | ~17 minutes |
| 50 GB | 10,000 ops/s (avg 500 bytes/op) | ~2.8 hours |
| 50 GB | 50,000 ops/s (avg 500 bytes/op) | ~33 minutes |
| 200 GB | 50,000 ops/s (avg 500 bytes/op) | ~2.2 hours |
The telemetry platform at 50,000 writes/second with an average oplog entry of 500 bytes consumes approximately 25 MB/second of oplog space. A 50 GB oplog lasts 33 minutes. A 200 GB oplog lasts 2.2 hours.
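The arithmetic behind the table can be reproduced with a short sketch. It assumes decimal gigabytes (1 GB = 10^9 bytes) and a constant average entry size; the function name `windowFromWriteRate` is made up for this example.

```javascript
const GB = 1e9; // decimal gigabytes, an assumption that matches the table's rounding

// Hypothetical helper: seconds of history an oplog holds at a steady write rate.
function windowFromWriteRate(oplogBytes, opsPerSecond, avgEntryBytes) {
  const bytesPerSecond = opsPerSecond * avgEntryBytes;
  return oplogBytes / bytesPerSecond;
}

// [oplog size in GB, write rate in ops/s], one pair per table row.
const rows = [
  [5, 1000],
  [5, 10000],
  [50, 10000],
  [50, 50000],
  [200, 50000],
];

for (const [sizeGb, ops] of rows) {
  const secs = windowFromWriteRate(sizeGb * GB, ops, 500);
  console.log(`${sizeGb} GB at ${ops} ops/s: ${(secs / 3600).toFixed(1)} hours`);
}
```

Real workloads rarely have a flat write rate, so a window computed this way is best read as an estimate for the peak rate rather than a guarantee.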
// Oplog entry size depends on the operation
// Insert: the full document is stored in the oplog
// Update: only the changed fields (operators such as $inc are logged as the resulting values)
// Delete: only the _id
// Check average oplog entry size ($bsonSize requires MongoDB 4.4 or later)
var oplog = db.getSiblingDB("local").oplog.rs;
var sample = oplog.aggregate([
  { $sample: { size: 1000 } },
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $group: { _id: null, avgSize: { $avg: "$size" } } }
]).toArray();
// Assigning to a variable suppresses the shell's automatic output, so print explicitly
printjson(sample[0]);
// { "_id": null, "avgSize": 487 }
When the Oplog Window is Too Small
The telemetry platform needs to perform rolling maintenance: take a secondary offline, upgrade it, and bring it back. The upgrade takes 45 minutes. The oplog window is 33 minutes. The secondary cannot catch up after the upgrade.
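Working backwards from the maintenance duration gives the oplog size that would be needed. The sketch below uses the figures from this section (50,000 writes/second, 500 bytes per entry, 45 minutes offline); the function name and the safety factor of 2 are assumptions chosen for illustration, not a recommendation from the source.

```javascript
// Hypothetical helper: oplog bytes needed to cover a planned outage.
// safetyFactor pads the result for write bursts and overruns (assumed value).
function requiredOplogBytes(opsPerSecond, avgEntryBytes, downtimeSeconds, safetyFactor) {
  return opsPerSecond * avgEntryBytes * downtimeSeconds * safetyFactor;
}

const downtime = 45 * 60; // the 45-minute upgrade, in seconds

const minimum = requiredOplogBytes(50000, 500, downtime, 1);
const padded = requiredOplogBytes(50000, 500, downtime, 2);

console.log(minimum / 1e9); // 67.5 (GB, bare minimum)
console.log(padded / 1e9);  // 135  (GB, with 2x headroom)
```

A 50 GB oplog falls short of even the 67.5 GB minimum, which is why the upgrade ends in an initial sync. On WiredTiger deployments the oplog can be resized without a restart using the `replSetResizeOplog` admin command, which takes the new size in megabytes.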
The result: the secondary is too stale to resume replication and must be rebuilt with an initial sync. For a 2 TB dataset on NVMe storage, initial sync takes 4-8 hours. During this time, the member cannot serve reads and cannot be elected primary, so the replica set operates with reduced redundancy.
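// Detect members that are stale (RECOVERING) or running an initial sync (STARTUP2)
// Requires: import org.bson.Document;
// The driver does not classify such members as REPLICA_SET_SECONDARY,
// so read the member states from replSetGetStatus instead.
Document status = client.getDatabase("admin")
        .runCommand(new Document("replSetGetStatus", 1));
for (Document member : status.getList("members", Document.class)) {
    int state = ((Number) member.get("state")).intValue();
    // State 3 = RECOVERING (for example, too stale to catch up)
    // State 5 = STARTUP2 (initial sync in progress)
    if (state == 3 || state == 5) {
        System.out.printf("%s is in state %s%n",
                member.getString("name"), member.getString("stateStr"));
    }
}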
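The same check can be sketched over a plain object shaped like a trimmed `replSetGetStatus` reply, which also makes it possible to warn before a member falls off the oplog. MongoDB reports state 3 as RECOVERING and state 5 as STARTUP2, the state a member holds while running an initial sync. The host names, the `checkMembers` helper, and the 50% warning threshold are all assumptions made for this example.

```javascript
// Sample data shaped like a trimmed replSetGetStatus reply; host names are made up.
const status = {
  members: [
    { name: "db1:27017", state: 1, stateStr: "PRIMARY", optimeDate: new Date("2024-06-15T12:00:00Z") },
    { name: "db2:27017", state: 2, stateStr: "SECONDARY", optimeDate: new Date("2024-06-15T11:40:00Z") },
    { name: "db3:27017", state: 5, stateStr: "STARTUP2", optimeDate: new Date(0) },
  ],
};

// Hypothetical helper: flag members that are resyncing or stale, and secondaries
// whose lag exceeds warnFraction of the oplog window.
function checkMembers(status, windowSeconds, warnFraction) {
  const primary = status.members.find((m) => m.state === 1);
  const findings = [];
  for (const m of status.members) {
    if (m.state === 3 || m.state === 5) {
      findings.push({ name: m.name, issue: m.stateStr });
    } else if (m.state === 2 && primary) {
      const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000;
      if (lagSeconds > windowSeconds * warnFraction) {
        findings.push({ name: m.name, issue: `lag ${lagSeconds}s` });
      }
    }
  }
  return findings;
}

// 33-minute window from this section, warning at half the window (assumed threshold).
const findings = checkMembers(status, 33 * 60, 0.5);
console.log(findings);
```

With the sample data, `db2` is flagged because its 20-minute lag exceeds half of the 33-minute window, and `db3` is flagged because it is in STARTUP2.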