Read Path Internals: Query Execution, Shard Fan-out, and the Coordinating Node Tax
Read Path Internals
A search request takes 15ms at low load and 200ms under concurrent search pressure. The query itself is simple: a match query on the body field filtered by tenant_id. The index has 3 shards, 1 replica. No aggregations. No highlighting. The hardware is not saturated.
The problem is in the read path, and understanding the read path requires understanding every step from the moment the query arrives at the coordinating node to the moment the results leave.
The Query Phase
When a coordinating node receives a search request, it enters the query phase:
-
Route to shards. The coordinating node determines which shards to query. With routing specified, it queries only the shard(s) mapped to that routing value. Without routing, it queries all primary or replica shards.
-
Forward to data nodes. The query is sent to one copy (primary or replica) of each targeted shard. OpenSearch uses adaptive replica selection to choose the copy with the lowest estimated latency.
-
Shard-level execution. On each data node, the shard-level query executes against every Lucene segment in the shard. Each segment is a self-contained inverted index. The query is run against each segment independently, and results are merged into a shard-level priority queue of size
from + size. -
Return partial results. Each shard returns its top
from + sizedocument IDs and scores to the coordinating node. -
Global merge. The coordinating node merges all shard-level results into a single priority queue and selects the final top
sizedocuments.
The Fetch Phase
After the global merge, the coordinating node knows which documents to return but has only their IDs and scores. The fetch phase retrieves the actual document content:
- Identify document locations. The coordinating node groups the selected document IDs by shard.
- Fetch _source. For each shard, the coordinating node requests the
_sourcecontent of the selected documents. - Apply post-processing. Highlighting, field filtering, and script fields are computed on the fetched documents.
- Return response. The coordinating node assembles the final response and returns it to the client.
Caching Layers
OpenSearch has three caching layers, each operating at a different granularity:
Node query cache (filter cache). Caches the results of filter clauses as bitsets. A filter for tenant_id: "acme" that matches 50,000 documents stores a bitset of those 50,000 document IDs. Subsequent queries with the same filter reuse the bitset without re-evaluating the filter. This cache operates per-segment: when a segment is merged away, its cache entries are invalidated.
// HARDENED: Structure queries to maximize filter cache hits
// Filters in bool.filter are cacheable. Filters in bool.must are not.
SearchRequest request = SearchRequest.of(s -> s
.index("docs-v1")
.query(q -> q
.bool(b -> b
.filter(f -> f.term(t -> t.field("tenant_id").value(tenantId)))
.filter(f -> f.term(t -> t.field("version").value(version)))
.must(mu -> mu.match(m -> m.field("body").query(userQuery)))
)
)
);
The tenant_id and version filters are cached as bitsets after their first execution. The match query in the must clause is scored and not cached. On subsequent searches for the same tenant and version with different query terms, the filter cache eliminates the tenant and version filtering cost entirely.
Shard request cache. Caches the complete response for a search request at the shard level. Keyed by the request body. Only caches requests with size: 0 (aggregation-only queries) by default. Can be enabled for search requests with request_cache: true, but invalidates on every refresh, making it useful only for indices that change infrequently.
Filesystem cache. The operating system’s page cache keeps frequently-accessed segment files in memory. On a data node with 64GB of RAM and 32GB allocated to the JVM heap, the remaining 32GB is available for the OS page cache. This is the most important cache in the system. Segment files accessed from page cache complete in microseconds; segment files read from SSD complete in hundreds of microseconds; segment files read from spinning disk complete in milliseconds.
_source Filtering
By default, OpenSearch returns the entire _source document. For the documentation platform, a document page’s body field can be 50KB of markdown. Returning 10 results means transferring 500KB of data over the network. If the application only needs the title and URL for the search results page, _source filtering reduces the transfer to a few kilobytes:
// HARDENED: Fetch only the fields needed for search results display
SearchRequest request = SearchRequest.of(s -> s
.index("docs-v1")
.source(src -> src
.filter(f -> f
.includes("title", "slug", "version", "content_type")
)
)
.query(q -> q.match(m -> m.field("body").query(userQuery)))
);
// FRAGILE: Fetching full _source including the 50KB body field
// for 10 results = 500KB response for a search results list page
// that only displays titles
SearchRequest request = SearchRequest.of(s -> s
.index("docs-v1")
.query(q -> q.match(m -> m.field("body").query(userQuery)))
// No _source filtering - full documents returned
);
The read path diagram shows the complete journey of a search request. The coordinating node fans the query out to the relevant shards. Each shard searches its Lucene segments independently, producing a shard-level top-N list. The coordinating node merges these lists, identifies the globally top-ranked documents, and issues fetch requests to the shards that hold those documents. The final response is assembled from the fetched sources and returned to the client.
The Decision Rule
Structure queries to maximize filter cache utilization. Place non-scoring predicates (tenant ID, version, content type) in bool.filter clauses, not bool.must clauses. The filter cache eliminates repeated evaluation of these predicates across searches.
Allocate at least 50% of data node RAM to the OS page cache (by not allocating it to the JVM heap). For a 64GB node, set the JVM heap to 31GB (staying below the compressed oops threshold) and leave 33GB for page cache. The page cache has more impact on query latency than any JVM tuning.
Use _source filtering for search results pages that do not need full document content. The network transfer reduction is significant when documents are large, and the reduction compounds across concurrent searches.