search at depth

How Documents Become Searchable

Chapter 2 of 60

The Symptom

A documentation platform tenant reports that newly published API reference pages do not appear in search results for up to 30 seconds after publishing. The developer who built the indexing integration insists the documents are being indexed because the API returns 201 Created. The documents are indexed. They are not yet searchable. These are different things.

The Internals

When a document is submitted to OpenSearch, it does not immediately become part of the searchable inverted index. The document passes through a pipeline with distinct phases, each with different durability and visibility guarantees.

Phase 1: Coordinate and Route. The coordinating node receives the index request, determines which shard owns the document (using hash(routing_value) % number_of_primary_shards, where the routing value defaults to the document ID), and forwards the request to the primary shard’s node.

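Phase 2: Write to Translog. The primary shard appends the operation to the transaction log (translog) on disk. This is a sequential, append-only write, so it is fast. The translog provides durability: if the node crashes before the next flush, the translog is replayed on recovery. The document is now durable but not searchable. (Internally, the engine applies the operation to Lucene's in-memory buffer just before appending it to the translog; both steps complete before the request is acknowledged, so the guarantees described here are unchanged.)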

Phase 3: Write to In-Memory Buffer. The document is analyzed (tokenized, normalized, filtered) and the resulting terms are added to an in-memory indexing buffer. This buffer is a partial Lucene segment that exists only in heap memory. The document is now durable and in memory, but still not searchable.

Phase 4: Refresh. The refresh operation writes the in-memory buffer out as a new Lucene segment on the filesystem (it lands in the OS page cache and is not necessarily fsync’d to disk). Once the segment is written, a new searcher is opened and the document becomes visible to queries. By default, this happens every 1 second (index.refresh_interval).

Phase 5: Flush. The flush operation calls fsync on all unflushed segments, ensuring they are written to durable storage, and then truncates the translog. This is the point at which the translog is no longer needed for recovery.
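
The routing step in Phase 1 can be sketched in a few lines. This is a minimal illustration, not OpenSearch's implementation: the class and method names are hypothetical, and CRC32 is used as a stand-in hash only because it ships with the JDK (OpenSearch applies its own Murmur3-based hash to the routing value).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of hash-based shard routing. CRC32 is a stand-in hash (assumption);
// it is NOT the hash function OpenSearch uses internally.
final class ShardRoutingSketch {

    // Maps a routing value to a shard number in [0, numberOfPrimaryShards).
    static int shardFor(String routingValue, int numberOfPrimaryShards) {
        if (numberOfPrimaryShards <= 0) {
            throw new IllegalArgumentException("numberOfPrimaryShards must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(routingValue.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the remainder is non-negative.
        return (int) (crc.getValue() % numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 5;
        // The same routing value always lands on the same shard.
        for (String tenant : new String[] {"tenant-a", "tenant-b", "tenant-a"}) {
            System.out.println(tenant + " -> shard " + shardFor(tenant, shards));
        }
    }
}
```

Because the result depends on the number of primary shards, that number cannot change without re-routing every document. It also explains why routing by tenant ID places all of a tenant's pages on a single shard.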

The 30-second search delay the tenant reported was caused by a refresh_interval of 30s, set during a bulk import and never reverted.
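
The gap between "indexed" and "searchable" is easier to see in a toy model. The sketch below is a deliberately simplified, hypothetical stand-in for one shard (plain Java lists instead of Lucene structures); it only models which phase makes a document visible and which phase empties the translog.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one shard's write path. Not OpenSearch code.
final class ShardWritePathSketch {
    private final List<String> translog = new ArrayList<>();        // durable, not searchable
    private final List<String> indexingBuffer = new ArrayList<>();  // in memory, not searchable
    private final List<List<String>> segments = new ArrayList<>();  // visible to queries

    // Phases 2 and 3: log the operation and buffer the document.
    void index(String docId) {
        translog.add(docId);
        indexingBuffer.add(docId);
    }

    // Phase 4: the buffered documents become one new searchable segment.
    void refresh() {
        if (indexingBuffer.isEmpty()) {
            return; // nothing new to expose
        }
        segments.add(new ArrayList<>(indexingBuffer));
        indexingBuffer.clear();
    }

    // Phase 5: segments are treated as committed, so the translog can be discarded.
    // Simplification: the model writes out any buffered documents first.
    void flush() {
        refresh();
        translog.clear();
    }

    boolean isSearchable(String docId) {
        return segments.stream().anyMatch(segment -> segment.contains(docId));
    }

    int segmentCount() {
        return segments.size();
    }

    int translogOperations() {
        return translog.size();
    }

    public static void main(String[] args) {
        ShardWritePathSketch shard = new ShardWritePathSketch();
        String id = "tenant-a:get-users";
        shard.index(id);
        System.out.println("after index:   searchable=" + shard.isSearchable(id));
        shard.refresh();
        System.out.println("after refresh: searchable=" + shard.isSearchable(id));
        shard.flush();
        System.out.println("after flush:   translog ops=" + shard.translogOperations());
    }
}
```

In this model, calling refresh() after every index() yields one segment per document, while a single refresh() after many index() calls yields one segment. That is the same trade-off the refresh interval controls on a real shard.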

The Implementation

The OpenSearch Java client provides two paths for indexing documents: single-document and bulk. For the documentation platform, single-document indexing is used for real-time page updates, and bulk indexing is used for initial tenant onboarding.

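// HARDENED: Single document indexing with explicit refresh control

import java.io.IOException;
import java.util.List;

import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.Refresh;
import org.opensearch.client.opensearch._types.Result;
import org.opensearch.client.opensearch.core.IndexRequest;
import org.opensearch.client.opensearch.core.IndexResponse;
import org.springframework.stereotype.Repository;

// IndexingException is assumed to be an application-defined unchecked exception.
@Repository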
public class DocumentSearchRepository {

    private final OpenSearchClient client;

    public DocumentSearchRepository(OpenSearchClient client) {
        this.client = client;
    }

    public void indexDocument(DocPage page) throws IOException {
        IndexRequest<DocPage> request = IndexRequest.of(r -> r
            .index("docs-v1")
            .id(page.tenantId() + ":" + page.slug())
            .routing(page.tenantId())
            .document(page)
            .refresh(Refresh.False)  // Do not force refresh on every write
        );

        IndexResponse response = client.index(request);

        if (response.result() != Result.Created && response.result() != Result.Updated) {
            throw new IndexingException(
                "Unexpected index result: " + response.result() +
                " for document " + page.slug()
            );
        }
    }

    public record DocPage(
        String tenantId,
        String title,
        String body,
        String slug,
        String apiMethod,
        String version,
        String contentType,
        List<String> codeSnippets
    ) {}
}

// FRAGILE: Forcing refresh on every single write
// This creates a new segment per document, destroying query performance
// under any meaningful write load.

IndexRequest<DocPage> request = IndexRequest.of(r -> r
    .index("docs-v1")
    .id(page.tenantId() + ":" + page.slug())
    .routing(page.tenantId())
    .document(page)
    .refresh(Refresh.True)  // New segment on every write
);

Using Refresh.True on every index operation forces OpenSearch to refresh the target shard, and therefore write a new Lucene segment, after each document. On the documentation platform, a tenant publishing 500 API reference pages in a batch creates up to 500 small segments on that tenant's shard. Each subsequent search query must visit every one of those segments and merge the results. Query latency degrades from 15ms to 400ms until the merge policy consolidates the segments, which itself consumes CPU and I/O.

The correct approach: set refresh_interval to an appropriate value for your read latency requirements (1 second for near-real-time, 5-30 seconds for write-heavy workloads) and let OpenSearch batch refreshes.
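
The incident in this chapter came from a bulk-import interval that was never reverted, so the revert is worth making structural. The sketch below shows one way to do that with try/finally. RefreshIntervalSetter is a hypothetical abstraction introduced for illustration; in a real system it would wrap the index settings update call (index.refresh_interval), which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkImportRefreshSketch {

    // Hypothetical seam over the index settings update call (assumption).
    interface RefreshIntervalSetter {
        void setRefreshInterval(String index, String interval);
    }

    // Relaxes the refresh interval for the duration of the import and always restores it.
    static void runBulkImport(RefreshIntervalSetter settings, String index,
                              String bulkInterval, String normalInterval,
                              Runnable importJob) {
        settings.setRefreshInterval(index, bulkInterval);
        try {
            importJob.run();
        } finally {
            // Runs whether the import succeeds or throws.
            settings.setRefreshInterval(index, normalInterval);
        }
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        RefreshIntervalSetter recording = (index, interval) -> calls.add(index + "=" + interval);
        try {
            runBulkImport(recording, "docs-v1", "30s", "1s", () -> {
                throw new IllegalStateException("simulated import failure");
            });
        } catch (IllegalStateException e) {
            System.out.println("import failed: " + e.getMessage());
        }
        // The recorded calls show the interval was restored despite the failure.
        System.out.println(calls);
    }
}
```

The finally block does not protect against the process being killed mid-import, so a periodic check that compares the live setting against the expected value is a reasonable complement.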

The Measurement

The indexing path is observable through the _nodes/stats API:

GET _nodes/stats/indices/indexing,refresh,flush,translog

Key metrics to export to Prometheus:

| Metric | What it tells you |
| --- | --- |
| indices.indexing.index_total | Total documents indexed |
| indices.indexing.index_time_in_millis | Time spent in analysis and indexing |
| indices.refresh.total | Number of refresh operations |
| indices.refresh.total_time_in_millis | Time spent creating new segments |
| indices.flush.total | Number of flush operations |
| indices.translog.operations | Operations currently held in the translog |
| indices.translog.size_in_bytes | Translog size on disk |

A growing translog size with stable flush.total indicates flushes are not keeping up with writes. A high refresh.total_time_in_millis relative to refresh.total indicates segments are large and refresh is expensive.
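
Most of these counters are cumulative, so both signals come from the difference between two scrapes rather than from a single reading. A minimal sketch of that arithmetic follows; the Sample record and method names are hypothetical, and the values are assumed to be read from the stats response by the caller.

```java
final class IndexingStatsSketch {

    // One scrape of the relevant values. Translog size is a gauge; the rest are counters.
    record Sample(long refreshTotal, long refreshTimeMillis, long flushTotal, long translogBytes) {}

    // Average cost of a refresh between two scrapes, in milliseconds; 0 if none ran.
    static double avgRefreshMillis(Sample earlier, Sample later) {
        long refreshes = later.refreshTotal() - earlier.refreshTotal();
        if (refreshes <= 0) {
            return 0.0;
        }
        return (double) (later.refreshTimeMillis() - earlier.refreshTimeMillis()) / refreshes;
    }

    // True when the translog grew while no flush completed in the interval.
    static boolean flushFallingBehind(Sample earlier, Sample later) {
        return later.translogBytes() > earlier.translogBytes()
            && later.flushTotal() == earlier.flushTotal();
    }

    public static void main(String[] args) {
        Sample t0 = new Sample(100, 1_000, 7, 50_000_000L);
        Sample t1 = new Sample(110, 1_300, 7, 80_000_000L);
        System.out.println("avg refresh ms: " + avgRefreshMillis(t0, t1));
        System.out.println("flush falling behind: " + flushFallingBehind(t0, t1));
    }
}
```

A single interval is a noisy signal. An alert would more plausibly require the condition to hold across several consecutive scrapes.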

The Decision Rule

Use Refresh.False (the default) when the application can tolerate the configured refresh_interval delay between indexing and search visibility. This covers the majority of documentation platform operations: page updates, new version publishes, bulk imports.

Use Refresh.WaitFor when the application must confirm that an indexed document is searchable before returning a response to the user, but you do not want to force a refresh that affects all other pending documents. The request blocks until the next scheduled refresh makes the document visible, so its latency is bounded by refresh_interval. This is appropriate for a documentation platform’s “publish and verify” workflow, where the publisher needs to see their page in search results as soon as the publish call returns.

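Never use Refresh.True in a loop or in bulk import paths. In a loop, every call forces a refresh and writes a new small segment; on a bulk request the refresh is paid once per request rather than once per document, but it still adds up across many batches. That cost makes it unsuitable for any indexing path that processes more than a handful of documents per minute.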
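
The rule can be written down as a small function, which makes the absence of the third option explicit. This is only a sketch of the decision logic: the Policy enum is a local, hypothetical mirror of the three refresh options, not the client's own type.

```java
final class RefreshPolicySketch {

    // Local mirror of the three refresh options (assumption: illustration only).
    enum Policy { FALSE, WAIT_FOR, TRUE }

    // Picks a policy for one indexing request. TRUE is never returned on purpose.
    static Policy choose(boolean callerMustSeeDocumentInSearch) {
        return callerMustSeeDocumentInSearch ? Policy.WAIT_FOR : Policy.FALSE;
    }

    public static void main(String[] args) {
        System.out.println("background page update: " + choose(false));
        System.out.println("publish and verify:     " + choose(true));
    }
}
```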