unbound mongodb at scale

Oplog Interaction with Index Builds and Initial Sync

Chapter 63 of 72

The Symptom

The team creates a new compound index on the 800 GB readings collection. The index build starts on all data-bearing replica set members simultaneously (the default behavior since MongoDB 4.4). During the build, secondary2 falls behind: its replication lag grows from 2 seconds to 180 seconds over 30 minutes. After 45 minutes, secondary2 enters the RECOVERING state because the oplog window was exceeded.

The Cause

Since MongoDB 4.4, index builds run on all data-bearing members simultaneously, but at different speeds. The primary may complete the index build in 20 minutes, but a secondary with slower storage takes 45 minutes. During the index build, the secondary’s oplog application pauses for the build phase. Writes that arrive during this pause accumulate in the oplog. If the index build duration exceeds the oplog window, the secondary cannot catch up, because the oldest entries it still needs have already been overwritten on its sync source.

The compounding factor is I/O contention. The index build is itself replicated through the oplog, and each member performs the build when it applies that entry. The collection scan and sort for the build consume disk I/O, which slows the secondary’s ability to apply other oplog entries at the same time.

// Check index build progress (run on each member; currentOp reports only the node you are connected to)
db.currentOp({ "command.createIndexes": { $exists: true } })

// Output shows:
// {
//   "desc": "IndexBuildsCoordinatorMongod",
//   "command": { "createIndexes": "readings" },
//   "progress": { "done": 450000000, "total": 800000000 },
//   "msg": "Index Build: scanning collection"
// }
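The progress counters in the sample output above can be turned into a rough completion estimate. The following is a minimal sketch in plain JavaScript (it should run in mongosh or Node), not part of the original tooling: `summarizeIndexBuild` is a hypothetical helper, the 27-minute elapsed time is an illustrative value (in practice it could come from the `secs_running` field of the same currentOp entry), and the extrapolation assumes a constant scan rate and covers only the collection-scan phase.

```javascript
// Hypothetical helper: summarize one index-build entry from db.currentOp().
// `op` is expected to have the shape shown in the sample output.
function summarizeIndexBuild(op, elapsedSeconds) {
  const { done, total } = op.progress;
  const fraction = total > 0 ? done / total : 0;
  // Linear extrapolation: assumes the scan rate stays constant.
  const remainingSeconds =
    fraction > 0 ? (elapsedSeconds * (1 - fraction)) / fraction : Infinity;
  return {
    percentDone: Math.round(fraction * 1000) / 10,
    remainingMinutes: Math.round(remainingSeconds / 60),
  };
}

// Sample entry copied from the output above; 27 minutes elapsed is assumed.
const sampleOp = {
  command: { createIndexes: "readings" },
  progress: { done: 450000000, total: 800000000 },
};
const buildSummary = summarizeIndexBuild(sampleOp, 27 * 60);
console.log(buildSummary);
```

Running the same check on each member shows which one is slowest, which is the member that determines the oplog window requirement.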

The Benchmark

| Collection size | Index build time (NVMe) | Index build time (SSD) | Min oplog window needed |
|---|---|---|---|
| 100 GB | 8 minutes | 15 minutes | 25 minutes |
| 500 GB | 40 minutes | 75 minutes | 2 hours |
| 800 GB | 65 minutes | 120 minutes | 3 hours |
| 2 TB | 160 minutes | 300 minutes | 7 hours |

The minimum oplog window should be at least 2x the expected index build time on the slowest member.
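As a small illustration of the rule, the required window is driven by the slowest member, not the primary. This is a sketch only: `requiredOplogWindowMinutes` is a hypothetical helper, the 20 and 45 minute figures are the ones from The Cause, and the 30 minute figure for secondary1 is an assumed value added for the example.

```javascript
// Hypothetical helper: apply the 2x rule to per-member build-time estimates.
function requiredOplogWindowMinutes(buildMinutesByMember, safetyFactor = 2) {
  // The slowest member decides how long the oplog must cover.
  const slowest = Math.max(...Object.values(buildMinutesByMember));
  return slowest * safetyFactor;
}

// secondary1's estimate is an assumption for illustration.
const buildEstimates = { primary: 20, secondary1: 30, secondary2: 45 };
const requiredMinutes = requiredOplogWindowMinutes(buildEstimates);
console.log(`Need at least ${requiredMinutes} minutes of oplog window`);
```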

The Fix

Step 1: Verify oplog window before starting an index build.

// FAST: Pre-flight check before index creation
public boolean canBuildIndex(long estimatedBuildMinutes) {
    double oplogWindowSeconds = oplogWindowMonitor.measureWindow();
    double oplogWindowMinutes = oplogWindowSeconds / 60.0;

    // Need at least 2x the build time as oplog window
    double requiredWindow = estimatedBuildMinutes * 2.0;

    if (oplogWindowMinutes < requiredWindow) {
        logger.error(
            "Oplog window ({} min) is less than the required {} min (2x the {} min build estimate). " +
            "Resize oplog before building index.",
            oplogWindowMinutes, requiredWindow, estimatedBuildMinutes);
        return false;
    }

    logger.info("Oplog window ({} min) is sufficient for index build ({} min estimate)",
        oplogWindowMinutes, estimatedBuildMinutes);
    return true;
}
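The same pre-flight check can be run by hand from the shell. The sketch below assumes the document returned by `db.getReplicationInfo()`, whose `timeDiff` field is the time span in seconds between the oldest and newest oplog entries. `checkOplogWindow` is a hypothetical helper and the sample values are illustrative, not measurements from this cluster.

```javascript
// Hypothetical pre-flight check over the document returned by
// db.getReplicationInfo(); timeDiff is the oplog window in seconds.
function checkOplogWindow(replInfo, estimatedBuildMinutes, safetyFactor = 2) {
  const windowMinutes = replInfo.timeDiff / 60;
  const requiredMinutes = estimatedBuildMinutes * safetyFactor;
  return {
    windowMinutes,
    requiredMinutes,
    ok: windowMinutes >= requiredMinutes,
  };
}

// In mongosh this would be: checkOplogWindow(db.getReplicationInfo(), 90)
// Illustrative sample: a 3-hour oplog window.
const sampleReplInfo = { logSizeMB: 51200, usedMB: 51100, timeDiff: 10800 };
const preflight = checkOplogWindow(sampleReplInfo, 90);
console.log(preflight);
```

The window should be checked on the member that secondaries sync from, since that is the oplog they read.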

Step 2: Build indexes during low-traffic periods.

Lower write rates mean slower oplog consumption. For a fixed oplog size and similar entry sizes, building the index during a period with 10,000 ops/s instead of 50,000 ops/s gives 5x more oplog runway.
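The runway arithmetic can be sketched as follows. This is a simplified model, not a measurement: `oplogRunwayHours` is a hypothetical helper, and the 100 GB oplog size and 500-byte average entry size are assumptions chosen for illustration.

```javascript
// Hypothetical model: hours of history a fixed-size oplog can hold,
// assuming a steady write rate and a constant average entry size.
function oplogRunwayHours(oplogBytes, opsPerSecond, avgEntryBytes) {
  const secondsOfHistory = oplogBytes / (opsPerSecond * avgEntryBytes);
  return secondsOfHistory / 3600;
}

const assumedOplogBytes = 100e9; // assumed 100 GB oplog
const assumedEntryBytes = 500;   // assumed average oplog entry size

const peakRunway = oplogRunwayHours(assumedOplogBytes, 50000, assumedEntryBytes);
const quietRunway = oplogRunwayHours(assumedOplogBytes, 10000, assumedEntryBytes);
console.log(`peak: ${peakRunway.toFixed(1)} h, quiet: ${quietRunway.toFixed(1)} h`);
console.log(`ratio: ${(quietRunway / peakRunway).toFixed(1)}x`);
```

Whatever oplog size and entry size are plugged in, the ratio between the two runways stays at 5x as long as the entry size does not change with the traffic level.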

// FAST: Schedule index build during maintenance window
public void buildIndexSafely(MongoCollection<Document> collection,
        Bson keys, IndexOptions options) {
    // Verify oplog window
    if (!canBuildIndex(90)) {  // Estimate 90 minutes for 800 GB
        throw new IllegalStateException("Oplog window insufficient for index build");
    }

    // Verify replication lag is low
    double lag = replicationLagMonitor.measureLag();
    if (lag > 5) {
        throw new IllegalStateException(
            "Replication lag is " + lag + "s. Wait for secondaries to catch up.");
    }

    // Build the index; createIndex blocks until the build has finished
    String indexName = collection.createIndex(keys, options);
    logger.info("Index build completed: {}", indexName);
}
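The lag check can also be done from the shell before starting the build. The sketch below works on the document returned by `rs.status()`, using each member's `stateStr` and `optimeDate` fields. `maxSecondaryLagSeconds` is a hypothetical helper and the member names and timestamps in the sample are made up for illustration.

```javascript
// Hypothetical helper: worst secondary lag in seconds from an rs.status() document.
function maxSecondaryLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no primary found in replica set status");
  }
  const lags = status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => (primary.optimeDate - m.optimeDate) / 1000);
  return lags.length > 0 ? Math.max(...lags) : 0;
}

// In mongosh this would be: maxSecondaryLagSeconds(rs.status())
// Illustrative sample with invented host names and times.
const sampleStatus = {
  members: [
    { name: "primary:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T00:00:10Z") },
    { name: "secondary1:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:08Z") },
    { name: "secondary2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:07Z") },
  ],
};
const worstLag = maxSecondaryLagSeconds(sampleStatus);
console.log(`Worst secondary lag: ${worstLag}s`);
```

Members in other states, such as RECOVERING, are ignored by this sketch and would need separate handling before a build is started.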

Step 3: Handle initial sync oplog requirements.

Initial sync copies the entire dataset from a sync source (primary or secondary) to the new member. During the copy, the source continues to receive writes. These writes are recorded in the oplog. After the data copy completes, the new member applies the oplog entries that accumulated during the copy to catch up to the current state.

If the data copy takes 6 hours and the oplog window is 3 hours, the oplog entries from the first 3 hours of the copy are overwritten before the new member can apply them. The initial sync fails and restarts.

// Estimate initial sync time (dataSize covers the current database only; sum across databases for the full dataset)
var dataSize = db.stats().dataSize;  // uncompressed data size in bytes
var copySpeed = 100 * 1000 * 1000;   // ~100 MB/s typical initial sync speed
var estimatedSeconds = dataSize / copySpeed;
var estimatedHours = estimatedSeconds / 3600;
print("Estimated initial sync time: " + estimatedHours.toFixed(1) + " hours");
print("Required oplog window: " + (estimatedHours * 2).toFixed(1) + " hours");

// For a 2 TB (2 x 10^12 bytes) dataset:
// Estimated initial sync time: 5.6 hours
// Required oplog window: 11.1 hours

Step 4: Reduce write rate during initial sync if oplog is tight.

// FAST: Throttle ingestion during initial sync
@Component
public class AdaptiveIngestionThrottle {

    private final AtomicBoolean syncInProgress = new AtomicBoolean(false);

    public void setSyncInProgress(boolean inProgress) {
        syncInProgress.set(inProgress);
    }

    public int getBatchSize() {
        // Normal: 100 documents per batch
        // During sync: 25 documents per batch (a quarter of the normal batch size)
        return syncInProgress.get() ? 25 : 100;
    }

    public long getBatchDelayMs() {
        // Normal: 0ms between batches
        // During sync: 50ms between batches
        return syncInProgress.get() ? 50 : 0;
    }
}
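How much these two settings reduce the ingestion rate depends on how long each batch write takes, which the throttle itself does not know. The sketch below is a rough single-writer model: `docsPerSecond` is a hypothetical helper and the 10 ms per-batch write time is an assumption for illustration, so the real reduction should be measured rather than taken from this calculation.

```javascript
// Hypothetical model: documents per second for one writer that sends a batch,
// waits for it to complete (batchWriteMs), then sleeps for batchDelayMs.
function docsPerSecond(batchSize, batchDelayMs, batchWriteMs) {
  return (batchSize * 1000) / (batchWriteMs + batchDelayMs);
}

const assumedBatchWriteMs = 10; // assumed time for one batch write

const normalRate = docsPerSecond(100, 0, assumedBatchWriteMs);
const throttledRate = docsPerSecond(25, 50, assumedBatchWriteMs);
console.log(`normal: ${normalRate.toFixed(0)} docs/s`);
console.log(`throttled: ${throttledRate.toFixed(0)} docs/s`);
console.log(`throttled share: ${((throttledRate / normalRate) * 100).toFixed(1)}%`);
```

Under these assumed numbers the delay between batches reduces throughput more than the smaller batch size does, so the two settings are worth tuning together against the measured oplog window.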

The Proof

After implementing pre-flight checks and maintenance window scheduling:

| Scenario | Before | After |
|---|---|---|
| Index build on 800 GB | Secondary entered RECOVERING | Completed in 65 min, 45 s max lag |
| Initial sync (2 TB) | Failed twice, succeeded third time | Succeeded first attempt |
| Unplanned initial syncs/quarter | 3 | 0 |
| Time spent on initial sync recovery | 18 hours/quarter | 0 |

The Trade-off

Pre-flight checks prevent index builds when the oplog window is insufficient. This means the team cannot build indexes immediately when they are needed. They must either resize the oplog first (which takes seconds but consumes disk space) or wait for a low-traffic window.

Throttling ingestion during initial sync reduces data freshness. For 6 hours during the sync, the telemetry platform ingests data at 25% of normal rate. Sensor readings queue in the application layer or in Kafka. After the sync completes, the backlog must be processed, which creates a temporary spike.

The fundamental constraint: large datasets and high write rates require large oplogs. A 2 TB dataset with 50,000 writes/second needs a 270+ GB oplog to safely handle maintenance. No configuration setting works around this; it is a capacity planning requirement. Budget 10-15% of disk for the oplog when planning storage for write-heavy MongoDB workloads.
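The sizing arithmetic can be sketched as write rate times average entry size times the window to cover. The text does not state the assumptions behind the 270 GB figure; one combination that reproduces it is a 500-byte average oplog entry and a 3-hour window, and both values are assumptions used here for illustration. `requiredOplogGB` is a hypothetical helper.

```javascript
// Hypothetical sizing helper: oplog size in GB (10^9 bytes) needed to hold
// windowHours of writes at a steady rate and average entry size.
function requiredOplogGB(writesPerSecond, avgEntryBytes, windowHours) {
  return (writesPerSecond * avgEntryBytes * windowHours * 3600) / 1e9;
}

// Assumed: 500-byte average entry, 3-hour window.
const oplogGB = requiredOplogGB(50000, 500, 3);
const shareOfDataset = oplogGB / 2000; // relative to a 2 TB (2000 GB) dataset
console.log(`Oplog size: ${oplogGB} GB`);
console.log(`Share of dataset size: ${(shareOfDataset * 100).toFixed(1)}%`);
```

The average entry size varies with document size and update patterns, so it is better derived from the observed oplog growth than assumed.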