unbound mongodb at scale

Diagnosing Migration-Induced Latency Spikes

4 min read Chapter 50 of 72

The Symptom

The telemetry platform’s read latency shows periodic spikes: p99 jumps from 15ms to 120ms every 15-20 minutes, then returns to baseline. The spikes last 30-90 seconds. They are not correlated with traffic patterns or deployment events.

The Cause

Each spike coincides with a chunk migration. When the balancer moves a chunk from shard 2 to shard 4:

  1. Shard 2 reads the chunk’s documents from WiredTiger, evicting cached query data.
  2. Shard 4 writes the incoming documents, consuming write I/O bandwidth.
  3. After the migration completes, shard 2 performs range deletion (cleanup), which holds locks and generates I/O.

The range deletion phase (step 3) is the worst offender. MongoDB deletes migrated documents in batches, but each batch acquires a lock and generates write I/O that competes with normal operations.

// Identify active migrations
db.adminCommand({ currentOp: true, desc: /moveChunk|migrat/ })

// Check range deletion queue
db.adminCommand({ currentOp: true, desc: "RangeDeleter" })

// Illustrative output (exact fields vary by server version):
// {
//   "desc": "RangeDeleter",
//   "active": true,
//   "ns": "telemetry.readings",
//   "range": { "min": {"sensorId": "sensor-02000"}, "max": {"sensorId": "sensor-03000"} },
//   "numDocsDeleted": 45000,
//   "totalDocs": 180000
// }

Correlate migration timestamps with latency spikes:

// Migration history with timing
use config
db.changelog.find({
  what: { $in: ["moveChunk.start", "moveChunk.commit"] },
  time: { $gte: new Date(Date.now() - 3600000) }  // last hour
}).sort({ time: 1 }).forEach(function(entry) {
  print(entry.time.toISOString() + " " + entry.what + " " + 
        entry.details.from + " -> " + entry.details.to);
});
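
The correlation itself can be sketched in plain JavaScript once the changelog entries and spike timestamps are in hand. This is a minimal sketch, not part of the original tooling: the function names, the sample entries, the spike timestamps, and the 60-second grace period (meant to cover range deletion after the commit) are all hypothetical.

```javascript
// Pair moveChunk.start / moveChunk.commit entries into time windows.
// Assumes entries are sorted by time and shaped like config.changelog documents.
function migrationWindows(entries) {
  const open = new Map(); // "ns|from|to" -> start time of the in-flight migration
  const windows = [];
  for (const e of entries) {
    const key = `${e.ns}|${e.details.from}|${e.details.to}`;
    if (e.what === "moveChunk.start") {
      open.set(key, e.time);
    } else if (e.what === "moveChunk.commit" && open.has(key)) {
      windows.push({
        from: e.details.from,
        to: e.details.to,
        start: open.get(key),
        end: e.time
      });
      open.delete(key);
    }
  }
  return windows;
}

// Keep the spikes that fall between a migration's start and its commit plus graceMs.
function spikesDuringMigrations(spikes, windows, graceMs) {
  return spikes.filter(t =>
    windows.some(w =>
      t.getTime() >= w.start.getTime() &&
      t.getTime() <= w.end.getTime() + graceMs));
}

// Hypothetical sample data for illustration only.
const sampleEntries = [
  { what: "moveChunk.start", ns: "telemetry.readings",
    time: new Date("2024-01-01T10:00:00Z"), details: { from: "shard2", to: "shard4" } },
  { what: "moveChunk.commit", ns: "telemetry.readings",
    time: new Date("2024-01-01T10:00:40Z"), details: { from: "shard2", to: "shard4" } }
];
const sampleSpikes = [
  new Date("2024-01-01T10:00:55Z"),
  new Date("2024-01-01T10:30:00Z")
];

const windows = migrationWindows(sampleEntries);
const matched = spikesDuringMigrations(sampleSpikes, windows, 60000);
// Only the 10:00:55 spike falls inside the padded window.
console.log(matched.map(t => t.toISOString()));
```

If most spikes match a window and few migrations lack a matching spike, that supports the migration hypothesis; unmatched spikes point to another cause.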

The Benchmark

Measure the read latency impact during migrations:

// SLOW: Default migration settings, no balancer window
// Migrations happen during peak traffic hours
// k6 results during migration:
//   p50: 8ms -> 25ms (3.1x increase)
//   p99: 15ms -> 120ms (8x increase)
//   Duration: 30-90 seconds per migration

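// Monitor per-shard latency during migration: locate the source shard in the
// driver's cluster description. sourceShardAddress is a ServerAddress. This
// assumes a direct connection to the shard's replica set, because a client
// connected through mongos sees only the mongos hosts.
MongoClient client = MongoClients.create(connectionString);
ServerDescription sourceDesc = client.getClusterDescription()
    .getServerDescriptions().stream()
    .filter(s -> s.getAddress().equals(sourceShardAddress))
    .findFirst().orElseThrow();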
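
| Metric | No migration | During migration (source) | During migration (target) |
|---|---|---|---|
| Read p50 | 5 ms | 18 ms | 7 ms |
| Read p99 | 15 ms | 120 ms | 22 ms |
| Write p50 | 8 ms | 12 ms | 35 ms |
| WiredTiger cache evictions/s | 200 | 1,400 | 850 |
| Disk read IOPS | 3,000 | 8,500 | 3,200 |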

The source shard suffers most on reads (cache eviction), and the target shard suffers most on writes (inserting the incoming documents and updating their indexes).

The Fix

Step 1: Set a balancer window to avoid peak traffic.

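// Run migrations only during off-peak hours (2 AM - 5 AM).
// The window is evaluated in the local time zone of the config server primary,
// so these times are UTC only if the config servers run in UTC.
use config
db.settings.updateOne(
  { _id: "balancer" },
  { $set: {
    activeWindow: { start: "02:00", stop: "05:00" }
  }},
  { upsert: true }
)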
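
To sanity-check a window before applying it, the following sketch tests whether a given clock time falls inside an activeWindow document. It mirrors the intent of the setting rather than the server's actual implementation, and the helper names are hypothetical. It also handles windows that wrap past midnight, such as 23:00 to 02:00.

```javascript
// Convert "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// Return true if the clock time hhmm lies inside window { start, stop }.
// The start is inclusive and the stop is exclusive (an assumption of this sketch).
function inActiveWindow(hhmm, window) {
  const t = toMinutes(hhmm);
  const start = toMinutes(window.start);
  const stop = toMinutes(window.stop);
  if (start <= stop) {
    return t >= start && t < stop;   // same-day window, e.g. 02:00-05:00
  }
  return t >= start || t < stop;     // window wraps midnight, e.g. 23:00-02:00
}

const offPeak = { start: "02:00", stop: "05:00" };
console.log(inActiveWindow("03:30", offPeak)); // true
console.log(inActiveWindow("12:00", offPeak)); // false
```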

Step 2: Tune range deletion to reduce lock contention.

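// These are mongod parameters: run them against the primary of each shard,
// not through mongos. Values set with setParameter do not survive a restart.

// Increase delay between range deletion batches
// Default: 20ms. Increase to 100ms to reduce I/O pressure.
db.adminCommand({
  setParameter: 1,
  rangeDeleterBatchDelayMS: 100
})

// Reduce batch size for range deletions
// Default: effectively unlimited on recent versions (check the documentation
// for your server version). Set to 1000 to limit per-batch I/O.
db.adminCommand({
  setParameter: 1,
  rangeDeleterBatchSize: 1000
})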
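
The cost of these settings can be estimated up front. The sketch below computes only the time spent waiting between batches, so it is a lower bound on total deletion time, not a prediction. The function name is hypothetical, and the 180,000-document range is taken from the sample RangeDeleter output earlier in the chapter.

```javascript
// Lower bound on range deletion time contributed by the inter-batch delay alone.
// Ignores the time each batch takes to execute, so real durations will be longer.
function rangeDeletionDelayFloorMs(totalDocs, batchSize, batchDelayMs) {
  const batches = Math.ceil(totalDocs / batchSize);
  // A delay separates consecutive batches, so there is one fewer delay than batches.
  return Math.max(batches - 1, 0) * batchDelayMs;
}

// 180,000 documents in batches of 1,000 with a 100 ms delay: 179 delays.
const tunedFloorMs = rangeDeletionDelayFloorMs(180000, 1000, 100);
console.log(tunedFloorMs); // 17900

// The same batch size with the default 20 ms delay.
const defaultDelayFloorMs = rangeDeletionDelayFloorMs(180000, 1000, 20);
console.log(defaultDelayFloorMs); // 3580
```

Raising the delay from 20 ms to 100 ms multiplies this floor by five, which is the price paid for spreading the deletion I/O over a longer period.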

Step 3: Monitor migrations in the Java application.

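// FAST: Command listener that counts slow commands per server so they can be
// correlated with migration windows
import com.mongodb.event.CommandListener;
import com.mongodb.event.CommandSucceededEvent;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.TimeUnit;

public class MigrationAwareCommandListener implements CommandListener {
    private static final long SLOW_THRESHOLD_MS = 50;
    private final MeterRegistry registry;

    public MigrationAwareCommandListener(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        if (event.getElapsedTime(TimeUnit.MILLISECONDS) > SLOW_THRESHOLD_MS) {
            // Tag by command and server so slow commands can be lined up
            // against migration windows for a specific shard
            registry.counter("mongo.slow_commands",
                "command", event.getCommandName(),
                "server", event.getConnectionDescription()
                    .getServerAddress().toString()
            ).increment();
        }
    }
}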

The Proof

After applying the balancer window and range deletion tuning:

| Metric | Before (default) | After (tuned) |
|---|---|---|
| Migration-induced p99 spike | 120 ms | 45 ms |
| Spike frequency during peak | Every 15-20 min | Zero (migrations only 2-5 AM) |
| Range deletion duration | 30 s | 90 s (slower but less impactful) |
| Cache evictions during migration | 1,400/s | 800/s |

The Trade-off

Restricting the balancer window means that chunk imbalance accumulates during peak hours. If the cluster receives 500 GB of new data during the 21-hour peak window, the balancer has only 3 hours to redistribute. For rapidly growing collections, this may not be enough time to rebalance.
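
A rough way to check whether the window is long enough is to compute the sustained migration throughput it would require. The sketch below is a back-of-the-envelope estimate under stated assumptions: the function name and the fractionToMove parameter are hypothetical, 1 GB is taken as 1,000 MB, and the fraction of new data that actually needs to move depends on the shard key and write distribution, which the chapter does not specify.

```javascript
// Sustained throughput (MB/s) needed to move the given data within the window.
// fractionToMove is the assumed share of new data that lands on the wrong shard.
function requiredMigrationMBps(newDataGB, windowHours, fractionToMove) {
  const megabytes = newDataGB * 1000 * fractionToMove;
  const seconds = windowHours * 3600;
  return megabytes / seconds;
}

// Worst case: all 500 GB must be redistributed within the 3-hour window.
const worstCaseMBps = requiredMigrationMBps(500, 3, 1.0);
console.log(worstCaseMBps.toFixed(1)); // "46.3"

// If only a quarter of the new data needs to move (an arbitrary example value).
const quarterMBps = requiredMigrationMBps(500, 3, 0.25);
console.log(quarterMBps.toFixed(1)); // "11.6"
```

Comparing this figure with the migration throughput observed on the cluster gives an early warning that the window needs widening.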

Increasing the range deletion batch delay reduces the I/O impact but extends the time that orphaned documents remain on the source shard. During this window, a read that skips shard filtering, such as a secondary read with read concern "available", may return orphaned copies of documents that now live on the target shard. For the telemetry platform, where readings are append-only and queries use readPreference: primary, this is acceptable.