Tier Transition Operations and Verification

The Symptom

An ISM policy transitions an index from hot to warm. The force merge action begins and runs for 45 minutes on a 60GB index. During the merge, search latency on that index doubles. After the merge completes, the allocation change moves shards to warm nodes. Shard relocation takes another 30 minutes. During relocation, some search requests return partial results because one shard is in transit.

The Internals

A tier transition involves three sequential operations, each with distinct performance implications:

Force merge. Compacts all segments into a single segment per shard. This rewrites the entire index, consuming CPU and disk I/O on the current node. Searches keep reading the old segments while the merged segment is written, so both sets occupy disk and compete for page cache until the old segments are dropped.
Allocation change. Updates index.routing.allocation.require.temp from hot to warm. OpenSearch begins relocating shards to warm-tier nodes. During relocation, each shard exists on both the source and target node. Searches are served from the source until relocation completes.
Replica adjustment. Reduces replica count (typically from 1 to 0 for cold tier). OpenSearch deletes the surplus replica shards, freeing storage on the nodes that held them.

The ISM policy executes these as ordered actions within a state. If any action fails, the policy retries according to the retry configuration and eventually marks the index as failed.
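As an illustration of ordered actions with per-action retry, the sketch below holds an example ISM policy body and builds (without sending) the request that would register it. The policy id, the 7d age condition, the retry values, and the cluster URL are all hypothetical; the action names follow the ISM plugin's documented policy format, but check them against the version you run.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds, but does not send, a request that registers an ISM policy.
final class WarmTransitionPolicy {

    // Hypothetical policy id; pick one that fits your naming scheme.
    static final String POLICY_ID = "hot_to_warm_example";

    // Actions inside the "warm" state run in array order, each with its own retry block.
    static final String POLICY_JSON = """
        {
          "policy": {
            "description": "Example: merge, move to warm, then set replicas",
            "default_state": "hot",
            "states": [
              {
                "name": "hot",
                "actions": [],
                "transitions": [
                  { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
                ]
              },
              {
                "name": "warm",
                "actions": [
                  {
                    "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
                    "force_merge": { "max_num_segments": 1 }
                  },
                  {
                    "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
                    "allocation": { "require": { "temp": "warm" }, "wait_for": true }
                  },
                  {
                    "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
                    "replica_count": { "number_of_replicas": 1 }
                  }
                ],
                "transitions": []
              }
            ]
          }
        }
        """;

    // clusterUrl is an assumption, e.g. "http://localhost:9200".
    static HttpRequest putPolicyRequest(String clusterUrl) {
        return HttpRequest.newBuilder()
            .uri(URI.create(clusterUrl + "/_plugins/_ism/policies/" + POLICY_ID))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(POLICY_JSON))
            .build();
    }
}
```

Sending the request with java.net.http.HttpClient (plus whatever authentication the cluster requires) would register the policy; attaching it to indices is a separate step.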

The Implementation

Manual Tier Transition with Verification

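// HARDENED: Manual tier transition with step-by-step verification
// Used for large indices where ISM automatic transition is too risky
// Assumes the opensearch-java client; TransitionException is an
// application-defined RuntimeException.

import java.io.IOException;
import java.util.Map;

import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.HealthStatus;
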
public class TierTransitionManager {

    private final OpenSearchClient client;

    public TierTransitionManager(OpenSearchClient client) {
        this.client = client;
    }

    public void transitionToWarm(String indexName) throws Exception {
        // Step 1: Block writes
        client.indices().putSettings(ps -> ps
            .index(indexName)
            .settings(s -> s
                .blocksWrite(true)
            )
        );

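        // Step 2: Force merge to 1 segment per shard. The call blocks until the
        // merge finishes, so the client's request timeout must exceed the
        // expected merge time.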
        client.indices().forcemerge(fm -> fm
            .index(indexName)
            .maxNumSegments(1)
        );

        // Step 3: Verify merge completed
        verifySegmentCount(indexName, 1);

        // Step 4: Move to warm tier
        client.indices().putSettings(ps -> ps
            .index(indexName)
            .settings(s -> s
                .putAll(Map.of(
                    "index.routing.allocation.require.temp",
                    JsonData.of("warm")
                ))
            )
        );

        // Step 5: Wait for shard relocation to complete
        waitForGreenHealth(indexName);

        // Step 6: Set the warm-tier replica count (1 here; a lower value saves
        // storage at the cost of redundancy)
        client.indices().putSettings(ps -> ps
            .index(indexName)
            .settings(s -> s
                .numberOfReplicas("1")
            )
        );

        waitForGreenHealth(indexName);
    }

    private void verifySegmentCount(String indexName, int expectedMax)
            throws IOException {
        var segments = client.indices().segments(s -> s.index(indexName));

        for (var indexEntry : segments.indices().entrySet()) {
            for (var shardEntry : indexEntry.getValue().shards().entrySet()) {
                for (var shardSegments : shardEntry.getValue()) {
                    int segmentCount = shardSegments.segments().size();
                    if (segmentCount > expectedMax) {
                        throw new TransitionException(
                            "Shard " + shardEntry.getKey() +
                            " has " + segmentCount +
                            " segments, expected max " + expectedMax);
                    }
                }
            }
        }
    }

    private void waitForGreenHealth(String indexName) throws Exception {
        int maxAttempts = 60;
        for (int i = 0; i < maxAttempts; i++) {
            var health = client.cluster().health(h -> h
                .index(indexName)
                .waitForStatus(HealthStatus.Green)
                .timeout(t -> t.time("10s"))
            );

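            // Green alone is not enough: an index stays green while its
            // shards are relocating, so also require zero relocating shards.
            if (health.status() == HealthStatus.Green
                    && health.relocatingShards() == 0) {
                return;
            }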

            if (health.relocatingShards() > 0) {
                Thread.sleep(10_000);
                continue;
            }

            Thread.sleep(5_000);
        }

        throw new TransitionException(
            "Index " + indexName + " did not reach green health within timeout");
    }
}
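
The wait loop above is bounded by an attempt count: 60 attempts of up to 10s in the health call plus up to 10s of sleep cap the wait at roughly 20 minutes, which is shorter than the relocation times quoted in this section. One alternative is to bound the wait by a wall-clock deadline instead. The following is a minimal sketch using only the JDK; Poller and pollUntil are hypothetical names, not part of any OpenSearch client.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class Poller {

    /**
     * Calls check repeatedly, sleeping interval between calls, until it
     * returns true or timeout has elapsed. Returns true if the condition
     * was met, false on timeout or if the thread was interrupted.
     */
    static boolean pollUntil(BooleanSupplier check, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Never sleep past the deadline, and always sleep at least 1 ms.
            long sleepMillis = Math.min(interval.toMillis(),
                                        Math.max(1, remainingNanos / 1_000_000));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

A caller could pass a check that issues the health request and returns true only when the index is green with no relocating shards, together with a timeout sized from the index's expected relocation time.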

Shrink Operation for Oversized Indices

// Shrink reduces shard count. Useful when an index was created with
// too many shards and is transitioning to a read-only warm/cold tier.

public void shrinkIndex(String sourceIndex, String targetIndex,
        int targetShards) throws Exception {

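    // Prerequisite: index must be write-blocked and a copy of every shard
    // must sit on one node. Replicas are dropped first because a replica
    // cannot share a node with its primary, which would leave the index
    // yellow and stall the health wait below.
    client.indices().putSettings(ps -> ps
        .index(sourceIndex)
        .settings(s -> s
            .blocksWrite(true)
            .numberOfReplicas("0")
            .putAll(Map.of(
                "index.routing.allocation.require._name",
                JsonData.of(selectShrinkNode())
            ))
        )
    );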

    waitForGreenHealth(sourceIndex);

    // Shrink: creates a new index with fewer shards using hard links
    client.indices().shrink(sh -> sh
        .index(sourceIndex)
        .target(targetIndex)
        .settings(s -> s
            .numberOfShards(String.valueOf(targetShards))
            .numberOfReplicas("1")
            .putAll(Map.of(
                "index.routing.allocation.require._name",
                JsonData.of(""),  // Clear the node constraint
                "index.routing.allocation.require.temp",
                JsonData.of("warm")
            ))
        )
    );

    waitForGreenHealth(targetIndex);
}
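
Shrink only accepts a target shard count that is a factor of the source shard count, so it can help to validate targetShards before touching any settings. The helper below is a small sketch with hypothetical names (ShrinkTargets, validTargets, isValidTarget) and uses only the JDK.

```java
import java.util.ArrayList;
import java.util.List;

final class ShrinkTargets {

    /** Lists every shard count smaller than sourceShards that divides it evenly. */
    static List<Integer> validTargets(int sourceShards) {
        if (sourceShards < 1) {
            throw new IllegalArgumentException("sourceShards must be >= 1");
        }
        List<Integer> targets = new ArrayList<>();
        for (int n = 1; n < sourceShards; n++) {
            if (sourceShards % n == 0) {
                targets.add(n);
            }
        }
        return targets;
    }

    /** True if targetShards is a smaller, positive factor of sourceShards. */
    static boolean isValidTarget(int sourceShards, int targetShards) {
        return targetShards >= 1
            && targetShards < sourceShards
            && sourceShards % targetShards == 0;
    }
}
```

For example, a 12-shard index can shrink to 1, 2, 3, 4, or 6 shards, while a 5-shard index can only shrink to 1.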

The Measurement

Tier transition timing for a 50GB index (2 shards, 1 replica):

Operation	Duration	Cluster Impact
Block writes	< 1s	None
Force merge (to 1 segment)	22 min	+40% CPU on source node
Shard relocation (hot → warm)	18 min	Network: ~50MB/s per shard
Replica adjustment	8 min	Storage freed immediately
Total	~48 min	Degraded on source node

Schedule tier transitions during low-traffic windows. The force merge phase is the most disruptive, consuming significant CPU and disk I/O on the source node for the duration.
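
To size a transition window ahead of time, a back-of-envelope estimate of relocation time can be useful. The sketch below assumes transfers are purely throughput-bound and that a fixed number of shard copies move in parallel; both are simplifications, and RelocationEstimate, its parameter names, and the concurrency value of 2 are assumptions rather than measured facts.

```java
final class RelocationEstimate {

    /**
     * Estimated seconds to move every shard copy of an index.
     *
     * primaryStoreMb       total size of the primary shards, in MB
     * replicas             replica count (each replica is a full extra copy)
     * mbPerSecPerShard     sustained transfer rate of one relocating shard
     * concurrentTransfers  how many shard copies relocate at the same time
     */
    static double seconds(double primaryStoreMb, int replicas,
                          double mbPerSecPerShard, int concurrentTransfers) {
        double totalMb = primaryStoreMb * (1 + replicas);
        return totalMb / (mbPerSecPerShard * concurrentTransfers);
    }
}
```

With 50,000 MB of primaries, 1 replica, 50 MB/s per shard, and 2 concurrent transfers, the estimate is 100,000 / 100 = 1,000 seconds, or about 17 minutes. Treat this as an order-of-magnitude figure only: the table does not state whether its 50GB includes replicas or how many shards moved in parallel.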

The Decision Rule

Execute tier transitions during off-peak hours. A force merge can saturate disk I/O on the source node for its full duration, degrading search latency on co-located shards.

Verify each step before proceeding to the next. A failed force merge followed by an allocation change moves unmerged data to warm nodes, wasting warm-tier disk space on deleted documents that were never reclaimed.
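
The verify-before-proceeding rule can be expressed as a small runner that pairs each action with its check and stops at the first check that fails. This is a sketch using only the JDK; VerifiedSteps and Step are hypothetical names, and the actions and checks would wrap the client calls used for each transition step.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

final class VerifiedSteps {

    /** One transition step: a name, the action to run, and a check that it took effect. */
    record Step(String name, Runnable action, BooleanSupplier verified) {}

    /**
     * Runs steps in order. Returns the name of the first step whose
     * verification fails (later steps are not run), or empty if all pass.
     */
    static Optional<String> run(List<Step> steps) {
        for (Step step : steps) {
            step.action().run();
            if (!step.verified().getAsBoolean()) {
                return Optional.of(step.name());
            }
        }
        return Optional.empty();
    }
}
```

If the force merge check fails, the allocation step never runs, so unmerged data stays on the hot tier where the merge can be retried.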

Use the shrink operation when transitioning indices with more shards than the target tier requires. Cold-tier indices rarely need more than one shard. Reducing shard count before cold migration reduces per-shard overhead on cold nodes.