The Coordinating Node Tax

The Symptom

The documentation platform’s search latency doubles when the shard count increases from 3 to 6, even though the total data volume stays the same. The data nodes are not saturated. CPU and I/O are well within capacity. The coordinating node’s CPU spikes during queries, and its heap usage grows proportionally with shard count.

The Internals

Every search query in OpenSearch executes in two phases:

Query phase. The coordinating node sends the query to every relevant shard (or every shard if no routing is specified). Each shard executes the query against its local Lucene index, scores the matching documents, and returns a priority queue of its top N document IDs and scores, where N = from + size. It does not return the full documents yet.

Fetch phase. The coordinating node merges all shard-level priority queues into a global ranking, selects the final top N documents, and requests the full _source content from the shards that hold those documents.
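The reduce step between the two phases can be modelled in a few lines of plain Java. This is a minimal sketch, not OpenSearch's internal code: CoordinatorMerge, ShardHit, and mergeTopN are hypothetical names, and breaking score ties by shard and document ID is an assumption made only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class CoordinatorMerge {
    // Hypothetical model of one query-phase entry: an ID and a score, no _source.
    record ShardHit(int shard, String docId, double score) {}

    // Highest score first; ties broken by shard, then doc ID (an assumption).
    static final Comparator<ShardHit> RANKING =
        Comparator.comparingDouble(ShardHit::score).reversed()
            .thenComparingInt(ShardHit::shard)
            .thenComparing(ShardHit::docId);

    // Merges the per-shard lists and returns the global window [from, from + size).
    static List<ShardHit> mergeTopN(List<List<ShardHit>> perShard, int from, int size) {
        PriorityQueue<ShardHit> queue = new PriorityQueue<>(RANKING);
        for (List<ShardHit> shardHits : perShard) {
            queue.addAll(shardHits); // every shard entry is buffered on the coordinator
        }
        List<ShardHit> page = new ArrayList<>();
        for (int rank = 0; rank < from + size && !queue.isEmpty(); rank++) {
            ShardHit hit = queue.poll();
            if (rank >= from) {
                page.add(hit); // only these IDs go on to the fetch phase
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<List<ShardHit>> perShard = List.of(
            List.of(new ShardHit(0, "a", 3.2), new ShardHit(0, "b", 1.1)),
            List.of(new ShardHit(1, "c", 2.7), new ShardHit(1, "d", 2.5)),
            List.of(new ShardHit(2, "e", 4.0), new ShardHit(2, "f", 0.4)));
        System.out.println(mergeTopN(perShard, 0, 2));
    }
}
```

The point of the model is that the queue holds every entry from every shard before a single full document is fetched, which is where the coordinating node's memory goes.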

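The coordinating node’s merge work at the end of the query phase is proportional to the number of entries it receives from the shards:

$$\text{merge\_cost} = (\text{from} + \text{size}) \times \text{number\_of\_shards}$$

With from at its default of 0, this reduces to size × number_of_shards.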

For a size=10 query across 3 shards, the coordinating node merges 30 results. For the same query across 12 shards, it merges 120 results. Each merge step is a cheap heap operation, but the memory allocation and garbage collection needed to buffer every shard’s response add real overhead.

For deep pagination, the cost becomes severe. A request for page 100 with size=10 (i.e., from=990, size=10) requires each shard to collect and return its top 1,000 hits. The coordinating node merges $1000 \times \text{number\_of\_shards}$ entries. On a 12-shard index, that is 12,000 document IDs and scores held and sorted in memory.
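The arithmetic above is easy to check directly. The following sketch assumes nothing beyond the counts already given in the text; MergeCost and its method names are hypothetical helpers for illustration.

```java
final class MergeCost {
    // Entries each shard must return: everything up to the end of the requested page.
    static long perShardEntries(int from, int size) {
        return (long) from + size;
    }

    // Entries the coordinating node buffers while building the global ranking.
    static long coordinatorEntries(int from, int size, int shards) {
        return perShardEntries(from, size) * shards;
    }

    public static void main(String[] args) {
        System.out.println(coordinatorEntries(0, 10, 3));    // 30
        System.out.println(coordinatorEntries(0, 10, 12));   // 120
        System.out.println(coordinatorEntries(990, 10, 12)); // 12000
    }
}
```

Doubling either the page depth or the shard count doubles the number of entries buffered, so the two effects multiply on a heavily sharded index.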

// FRAGILE: Deep pagination with large shard count
// from=990, size=10 on a 12-shard index:
// Each shard returns 1,000 results = 12,000 results merged on coordinating node

SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    .from(990)
    .size(10)
    .query(q -> q.match(m -> m.field("body").query(userQuery)))
);

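// HARDENED: Use search_after for deep pagination
// Each shard returns only `size` results, regardless of page depth.
// lastSortScore and lastDocId are the sort values of the previous page's last hit;
// omit searchAfter when requesting the first page.

SearchRequest nextPageRequest = SearchRequest.of(s -> s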
    .index("docs-v1")
    .size(10)
    .searchAfter(List.of(
        FieldValue.of(lastSortScore),
        FieldValue.of(lastDocId)
    ))
    .sort(SortOptions.of(so -> so.score(sc -> sc.order(SortOrder.Desc))))
    .sort(SortOptions.of(so -> so.field(f -> f.field("_id").order(SortOrder.Asc))))
    .query(q -> q.match(m -> m.field("body").query(userQuery)))
);

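The search_after parameter tells each shard to collect only hits that sort after the given sort values, so no shard has to build and return a queue that covers the earlier pages. The coordinating node merges only size results per shard instead of (from + size) results per shard. The sort must end in a unique tiebreaker (here _id) so that hits with equal scores are neither skipped nor repeated across pages.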
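The shard-side effect of a cursor can be illustrated with a small in-memory model. This is a sketch of the semantics only, under the assumption that the sort is score descending with _id ascending as in the request above; SearchAfterSketch, Hit, and shardPage are hypothetical names, and a real shard filters during collection rather than sorting a list.

```java
import java.util.Comparator;
import java.util.List;

final class SearchAfterSketch {
    record Hit(String id, double score) {}

    // Same ordering as the hardened request: score descending, then _id ascending.
    static final Comparator<Hit> ORDER =
        Comparator.comparingDouble(Hit::score).reversed().thenComparing(Hit::id);

    // Returns up to `size` hits that sort strictly after the cursor.
    // A null cursor stands for the first page.
    static List<Hit> shardPage(List<Hit> shardHits, Hit cursor, int size) {
        return shardHits.stream()
            .filter(h -> cursor == null || ORDER.compare(h, cursor) > 0)
            .sorted(ORDER)
            .limit(size)
            .toList();
    }

    public static void main(String[] args) {
        List<Hit> shard = List.of(new Hit("d1", 2.0), new Hit("d2", 2.0),
                                  new Hit("d3", 1.5), new Hit("d4", 0.9));
        List<Hit> page1 = shardPage(shard, null, 2);
        Hit cursor = page1.get(page1.size() - 1); // sort values of the last hit seen
        List<Hit> page2 = shardPage(shard, cursor, 2);
        System.out.println(page1 + " then " + page2);
    }
}
```

The two hits with a score of 2.0 show why the tiebreaker matters: with the score alone as the cursor, d2 could not be distinguished from d1 and would be either repeated or dropped.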

The Implementation

Dedicated Coordinating Nodes

For the documentation platform under high query load, dedicated coordinating nodes separate the merge work from data node indexing and search execution:

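# Kubernetes deployment for coordinating-only nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opensearch-coordinator
spec:
  replicas: 2
  selector:
    matchLabels:
      app: opensearch-coordinator # illustrative label; must match the template labels
  template:
    metadata:
      labels:
        app: opensearch-coordinator
    spec:
      containers:
        - name: opensearch
          image: opensearchproject/opensearch:2.12.0
          env:
            # No roles = coordinating only. Confirm with GET _cat/nodes that the empty
            # value took effect; node.roles: [] in opensearch.yml is the equivalent setting.
            - name: node.roles
              value: ""
            - name: OPENSEARCH_JAVA_OPTS
              value: "-Xms4g -Xmx4g"
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"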

Coordinating nodes need moderate heap (they hold merge buffers, not segment data) and moderate CPU (merge is CPU-bound). They do not need large disks.

Monitoring Coordinating Node Load

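// Search counters for one node, with lifetime averages derived from them.
// Note: indices.search counters track shard-level query and fetch operations
// executed on the node, so a coordinating-only node that holds no shards reports
// zeros here. Read these from data nodes, and watch coordinator CPU and heap separately.
public record CoordinatorMetrics(
    long searchQueryTotal,
    long searchQueryTimeMillis,
    long searchFetchTotal,
    long searchFetchTimeMillis,
    double avgQueryLatency,
    double avgFetchLatency
) {
    public static CoordinatorMetrics fromNodeStats(
            NodesStatsResponse stats, String nodeId) {

        var node = stats.nodes().get(nodeId);
        if (node == null) {
            throw new IllegalArgumentException("No stats for node " + nodeId);
        }
        var search = node.indices().search();

        long queryTotal = search.queryTotal();
        long queryTime = search.queryTimeInMillis();
        long fetchTotal = search.fetchTotal();
        long fetchTime = search.fetchTimeInMillis();

        return new CoordinatorMetrics(
            queryTotal, queryTime,
            fetchTotal, fetchTime,
            // Averages cover the node's whole uptime, not a recent window
            queryTotal > 0 ? (double) queryTime / queryTotal : 0,
            fetchTotal > 0 ? (double) fetchTime / fetchTotal : 0
        );
    }
}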
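Because the counters in the record above are cumulative, dividing one by the other gives an average over the node's entire uptime. A common alternative is to sample twice and divide the deltas. The sketch below shows that calculation on plain values; IntervalLatency and Snapshot are hypothetical names, and the sample numbers are made up for illustration.

```java
final class IntervalLatency {
    // Cumulative counters as sampled at one point in time.
    record Snapshot(long queryTotal, long queryTimeMillis) {}

    // Average latency of the operations completed between two snapshots.
    static double avgMillis(Snapshot earlier, Snapshot later) {
        long ops = later.queryTotal() - earlier.queryTotal();
        long millis = later.queryTimeMillis() - earlier.queryTimeMillis();
        if (ops <= 0 || millis < 0) {
            return 0.0; // no traffic in the window, or counters reset by a restart
        }
        return (double) millis / ops;
    }

    public static void main(String[] args) {
        Snapshot t0 = new Snapshot(1_000, 12_000);
        Snapshot t1 = new Snapshot(1_500, 21_000);
        System.out.println(avgMillis(t0, t1)); // 18.0
    }
}
```

In this made-up sample the lifetime average at t1 is 14 ms (21,000 / 1,500), while the most recent 500 operations averaged 18 ms, which is the figure an alert should see.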

The Measurement

Compare query latency with different shard counts on the same data volume:

Shards	Data per Shard	Query Latency (p50)	Query Latency (p99)	Coordinating Node CPU
1	5GB	8ms	25ms	5%
3	1.7GB	12ms	40ms	12%
6	850MB	18ms	65ms	22%
12	425MB	28ms	110ms	38%

The query latency increases with shard count even though each shard processes less data. The coordinating node’s merge overhead dominates at small per-shard data volumes.

The Decision Rule

Deploy dedicated coordinating nodes when query volume exceeds 500 queries per second or when data nodes show CPU contention between indexing and query coordination. Below 500 QPS on a well-sized cluster, any data node can serve as a coordinating node without measurable impact.

Never use from/size pagination beyond page 10 (i.e., from >= 100 at size=10). Use search_after for any pagination beyond the first few pages. Use the scroll API only for export operations that need to iterate through all matching documents, never for user-facing pagination.

When increasing shard count to improve search parallelism, measure the actual query latency improvement against the coordinating node merge overhead. If per-shard query time is already under 10ms, adding more shards makes total query time worse, not better.
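The first two rules can be written down as code so that they are applied consistently, for example in a request-building layer. This is a minimal sketch that encodes only the thresholds stated above; DecisionRule, its constants, and its method names are hypothetical, and the thresholds should be tuned against your own measurements.

```java
final class DecisionRule {
    enum Pagination { FROM_SIZE, SEARCH_AFTER, SCROLL }

    // Thresholds taken from the rule above.
    static final int DEDICATED_COORDINATOR_QPS = 500;
    static final int MAX_FROM_SIZE_PAGE = 10;

    // Dedicated coordinating nodes pay off above the QPS threshold
    // or when data nodes show CPU contention.
    static boolean needsDedicatedCoordinators(double qps, boolean dataNodeCpuContention) {
        return qps > DEDICATED_COORDINATOR_QPS || dataNodeCpuContention;
    }

    // page is 1-based; fullExport marks jobs that iterate over every matching document.
    static Pagination choosePagination(int page, boolean fullExport) {
        if (fullExport) {
            return Pagination.SCROLL;
        }
        return page > MAX_FROM_SIZE_PAGE ? Pagination.SEARCH_AFTER : Pagination.FROM_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(needsDedicatedCoordinators(650, false)); // true
        System.out.println(choosePagination(3, false));             // FROM_SIZE
        System.out.println(choosePagination(100, false));           // SEARCH_AFTER
    }
}
```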