Scoring

A search for “retry policy” returns 47 documents from the documentation index. The user sees the first five. Those five are the entirety of their search experience. The remaining 42 are invisible. Scoring determines the five they see.

The scoring algorithm is not a ranking secret. It is BM25, published in a research paper, implemented in Lucene, and applied by default in every OpenSearch index. Understanding BM25 is the difference between guessing at relevance and engineering it.

From TF-IDF to BM25

Term Frequency (TF)

A document that mentions “retry” 12 times is probably more about retries than a document that mentions it once. Term frequency counts how many times a term appears in a document. In the raw TF model, a document with tf=12 scores 12 times higher than a document with tf=1.

This is wrong. The 12th mention of “retry” adds less information than the first. The first mention establishes that the document is about retries. The twelfth mention is repetition. Raw TF scores reward verbose documents over concise ones.

Inverse Document Frequency (IDF)

A term that appears in every document is not useful for distinguishing between documents. The word “the” appears in every documentation page. A search for “the” should not surface every document equally. IDF measures how rare a term is across the corpus:

$$\text{IDF}(t) = \log\left(\frac{N}{df(t)}\right)$$

Where $N$ is the total number of documents and $df(t)$ is the number of documents containing term $t$. A term appearing in 10 out of 1,000,000 documents has a high IDF. A term appearing in 900,000 out of 1,000,000 has a low IDF.
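The two corpus examples above can be checked numerically. This is a minimal sketch of the IDF formula as written in this section, assuming the natural logarithm; the class and method names are illustrative and not part of any library.

```java
final class IdfExample {
    /** IDF(t) = log(N / df(t)): N is the corpus size, df the number of documents containing the term. */
    static double idf(long totalDocs, long docFreq) {
        if (docFreq <= 0 || docFreq > totalDocs) {
            throw new IllegalArgumentException("docFreq must be in [1, totalDocs]");
        }
        return Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        // Rare term: 10 of 1,000,000 documents.
        System.out.println("rare term idf   = " + idf(1_000_000, 10));
        // Common term: 900,000 of 1,000,000 documents.
        System.out.println("common term idf = " + idf(1_000_000, 900_000));
    }
}
```

Under these assumptions the rare term comes out near 11.5 and the common term near 0.1, a gap of roughly two orders of magnitude.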

BM25: The Saturation Fix

BM25 (Best Matching 25) replaces raw term frequency with a saturating function:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Where:

$f(q_i, D)$ is the term frequency of query term $q_i$ in document $D$
$|D|$ is the document length (in terms)
$\text{avgdl}$ is the average document length across the index
$k_1$ controls term frequency saturation (default: 1.2)
$b$ controls length normalization (default: 0.75)
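The formula can be transcribed almost symbol for symbol. The sketch below is an illustration under stated assumptions, not Lucene's implementation: it uses the simple $\log(N/df)$ IDF from earlier in this section, and the corpus statistics in `main` are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class Bm25Score {
    static final double K1 = 1.2;   // default term frequency saturation
    static final double B = 0.75;   // default length normalization

    /** IDF(t) = log(N / df(t)), the form given earlier in this section. */
    static double idf(long totalDocs, long docFreq) {
        return Math.log((double) totalDocs / docFreq);
    }

    /** One query term's contribution: IDF times the saturated, length-normalized term frequency. */
    static double termScore(double idf, double freq, double docLen, double avgDocLen) {
        double lengthNorm = 1 - B + B * (docLen / avgDocLen);
        return idf * (freq * (K1 + 1)) / (freq + K1 * lengthNorm);
    }

    /** Sums termScore over the query terms; terms absent from the document contribute nothing. */
    static double score(List<String> queryTerms, Map<String, Integer> termFreqs,
                        Map<String, Long> docFreqs, long totalDocs,
                        double docLen, double avgDocLen) {
        double total = 0.0;
        for (String term : queryTerms) {
            int freq = termFreqs.getOrDefault(term, 0);
            if (freq == 0) {
                continue;
            }
            total += termScore(idf(totalDocs, docFreqs.get(term)), freq, docLen, avgDocLen);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 1,000 documents, average length 100 terms.
        double s = score(
            List.of("retry", "policy"),
            Map.of("retry", 3, "policy", 1),       // term frequencies in this document
            Map.of("retry", 100L, "policy", 50L),  // document frequencies in the corpus
            1_000, 100, 100);
        System.out.println("score = " + s);
    }
}
```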

Two parameters matter:

k1 (default 1.2) controls how quickly term frequency saturates. At k1=0, term frequency is ignored entirely: documents with tf=1 and tf=100 score the same. At k1=10, the score keeps climbing steeply through roughly the first 10 occurrences. The default of 1.2 saturates quickly: for a document of average length, tf=3 scores about 14% higher than tf=2 (a TF component of 1.571 versus 1.375), where raw TF would score it 50% higher. No amount of repetition lifts the TF component above k1 + 1 = 2.2.
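The saturation behavior is easiest to see by tabulating the TF component for a few values of k1. This sketch assumes a document of exactly average length, so the length term in the denominator equals 1; the class name is illustrative.

```java
import java.util.Locale;

final class K1Saturation {
    /** TF component of BM25 when |D| = avgdl: f * (k1 + 1) / (f + k1). */
    static double tfFactor(double freq, double k1) {
        return freq * (k1 + 1) / (freq + k1);
    }

    public static void main(String[] args) {
        double[] k1Values = {0.0, 1.2, 10.0};
        int[] freqs = {1, 2, 3, 12, 100};
        for (double k1 : k1Values) {
            StringBuilder row = new StringBuilder("k1=" + k1 + ":");
            for (int f : freqs) {
                row.append(String.format(Locale.ROOT, "  tf=%d -> %.3f", f, tfFactor(f, k1)));
            }
            System.out.println(row);
        }
    }
}
```

With k1=0 every column is 1. With k1=1.2 the values flatten toward the ceiling of k1 + 1 = 2.2. With k1=10 the twelfth occurrence still adds a visible increment.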

b (default 0.75) controls document length normalization. At b=0, document length is ignored entirely. At b=1, term frequency is fully normalized by document length: a short document with one mention of “retry” scores the same as a document four times as long with four mentions. The default of 0.75 provides moderate normalization.
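The effect of b can be sketched by scoring a short document and a document four times as long with four times the mentions. The lengths and the corpus average below are hypothetical; only the TF component is computed, since IDF is identical for both documents.

```java
import java.util.Locale;

final class LengthNormalization {
    static final double K1 = 1.2;

    /** TF component of BM25, including the document length term controlled by b. */
    static double tfFactor(double freq, double docLen, double avgDocLen, double b) {
        double lengthNorm = 1 - b + b * (docLen / avgDocLen);
        return freq * (K1 + 1) / (freq + K1 * lengthNorm);
    }

    public static void main(String[] args) {
        double avgDocLen = 400;  // hypothetical corpus average, in terms
        double[] bValues = {0.0, 0.75, 1.0};
        for (double b : bValues) {
            double shortDoc = tfFactor(1, 200, avgDocLen, b);  // 200 terms, 1 mention
            double longDoc = tfFactor(4, 800, avgDocLen, b);   // 800 terms, 4 mentions
            System.out.printf(Locale.ROOT, "b=%.2f  short=%.3f  long=%.3f%n", b, shortDoc, longDoc);
        }
    }
}
```

At b=1 the two values coincide, because the mention density is the same in both documents. At b=0 the long document wins on raw count alone, and b=0.75 sits between the two.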

For the documentation search platform, the defaults are almost always correct. A page explaining retry policies in detail (long, many mentions) should score similarly to a concise reference page (short, few mentions) when both are genuinely about retry policies. The b=0.75 normalization achieves this.

The Explain API

The explain API decomposes a score into its BM25 components for a specific document and query:

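// Imports assume the opensearch-java client.
import org.opensearch.client.opensearch.core.ExplainResponse;
import org.opensearch.client.opensearch.core.explain.ExplanationDetail;

// Ask the index how one specific document scores against the query.
ExplainResponse<DocPage> explanation = client.explain(e -> e
        .index("docs-v1")
        .id("tenant-acme:retry-policy-guide")
        .query(q -> q
            .match(m -> m
                .field("body")
                .query("retry policy")
            )
        ),
        DocPage.class
);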

if (explanation.matched()) {
    System.out.println("Score: " + explanation.explanation().value());
    printExplanation(explanation.explanation(), 0);
}

// Walks the explanation tree depth-first, indenting each nested level by two spaces.
private static void printExplanation(ExplanationDetail detail, int indent) {
    System.out.println(" ".repeat(indent) + detail.value() + " " + detail.description());
    for (ExplanationDetail child : detail.details()) {
        printExplanation(child, indent + 2);
    }
}

The output reveals the full scoring breakdown:

1.9896 sum of:
  1.0334 weight(body:retry in 0) [BM25], result of:
    1.0334 score(freq=4.0), computed as:
      1.3862 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
        (N=10000, n=2500)
      0.7455 tf, computed as freq / (freq + k1 * (1 - b + b * dl/avgdl))
        (freq=4.0, k1=1.2, b=0.75, dl=450, avgdl=380)
  0.9563 weight(body:policy in 0) [BM25], result of:
    0.9563 score(freq=2.0), computed as:
      1.6093 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
        (N=10000, n=2000)
      0.5942 tf, computed as freq / (freq + k1 * (1 - b + b * dl/avgdl))
        (freq=2.0, k1=1.2, b=0.75, dl=450, avgdl=380)

Reading this output: the term “retry” appears in 2,500 of 10,000 documents (moderately common), appears 4 times in this document, and contributes 1.0334, the product of its idf (1.3862) and its tf (0.7455). The term “policy” is slightly rarer (2,000 of 10,000), appears twice, and contributes 0.9563. The total score is their sum, 1.9896; the component values shown are rounded to four decimal places.
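Each idf and tf line in the explain output prints the formula it used, so the components can be recomputed from the statistics listed beneath them. The sketch below assumes the natural logarithm and that a term's weight is simply idf times tf; some Lucene-based versions also print a separate boost factor, in which case the weight is multiplied by it as well. The class name is illustrative.

```java
final class ExplainCheck {
    /** idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long totalDocs, long docFreq) {
        return Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** tf as printed in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl)). */
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        // Statistics copied from the explain output above.
        double retry = idf(10_000, 2_500) * tf(4, 1.2, 0.75, 450, 380);
        double policy = idf(10_000, 2_000) * tf(2, 1.2, 0.75, 450, 380);
        System.out.println("retry  = " + retry);
        System.out.println("policy = " + policy);
        System.out.println("total  = " + (retry + policy));
    }
}
```

Recomputing components this way is a quick sanity check when a ranking looks wrong: if the recomputed values match, the surprise lies in the corpus statistics rather than in the scoring code.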

Shard-Level Scoring and Its Traps

By default, each shard computes IDF from its own local term statistics. In a 3-shard index, if shard 0 has 100 documents containing “retry” and shard 1 has 5,000 documents containing “retry,” the IDF for “retry” is different on each shard. The same document content, placed on different shards, gets different scores.

This matters for small indices or highly skewed data distributions. On the documentation platform, a tenant with 500 documents spread across 3 shards means some shards may have 150 documents and others 200. The per-shard IDF can vary enough to produce noticeably inconsistent rankings.
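The size of the skew can be estimated by computing IDF once per shard and once over the pooled statistics. In this sketch the per-shard document frequencies for “retry” come from the example above, while the shard sizes (10,000 documents each) are a hypothetical assumption. The IDF form is the one printed by the explain output.

```java
final class ShardIdfSkew {
    /** idf as printed by the explain output: log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** IDF over statistics pooled from all shards, as if the index were a single shard. */
    static double globalIdf(long[] shardDocs, long[] shardDocFreqs) {
        long totalDocs = 0;
        long totalDocFreq = 0;
        for (int i = 0; i < shardDocs.length; i++) {
            totalDocs += shardDocs[i];
            totalDocFreq += shardDocFreqs[i];
        }
        return idf(totalDocs, totalDocFreq);
    }

    public static void main(String[] args) {
        long[] shardDocs = {10_000, 10_000};  // hypothetical shard sizes
        long[] shardDocFreqs = {100, 5_000};  // documents containing "retry" on each shard
        for (int i = 0; i < shardDocs.length; i++) {
            System.out.println("shard " + i + " idf = " + idf(shardDocs[i], shardDocFreqs[i]));
        }
        System.out.println("global idf  = " + globalIdf(shardDocs, shardDocFreqs));
    }
}
```

Under these assumptions the local IDF on shard 0 is several times the local IDF on shard 1, so identical text would score very differently depending on where it was routed. The pooled value falls between the two.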

Two solutions:

// Option 1: Use DFS_QUERY_THEN_FETCH for accurate global scoring
// Adds a pre-query round trip to collect global term statistics from all shards
// Cost: one extra network round trip per query

SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    .searchType(SearchType.DfsQueryThenFetch)
    .query(q -> q.match(m -> m.field("body").query("retry policy")))
);

/*
Option 2: Set number_of_shards to 1 for small indices
This eliminates the shard-level IDF problem entirely.
Appropriate when the index fits on a single shard (under 30-50GB).
*/

For the documentation platform, most tenants have fewer than 100,000 documents. A single shard handles this volume. The IDF skew problem disappears. For the few tenants with millions of documents, DFS_QUERY_THEN_FETCH adds 2-5ms to query latency but produces consistent scoring.

Figure: BM25 scoring formula annotated with component values from a real explain API output, showing how TF saturation and document length normalization produce the final score.

The BM25 score is the sum of per-term contributions. Each term’s contribution is the product of its IDF (how rare is this term?) and its saturated TF (how often does this document mention it, with diminishing returns). The document length normalization via parameter $b$ penalizes long documents proportionally, preventing verbose pages from dominating results purely through repetition.

The Decision Rule

Leave BM25 parameters at their defaults (k1=1.2, b=0.75) unless you have measured a specific scoring pathology with the explain API and confirmed that parameter adjustment improves NDCG on your query test set (built in chapter 8).

Use DFS_QUERY_THEN_FETCH when the index has multiple shards with fewer than 10,000 documents per shard, which makes per-shard IDF statistics unreliable. Use the default QUERY_THEN_FETCH when each shard has sufficient documents for stable statistics (above 50,000 documents per shard). Between those two thresholds, compare rankings under both search types on the query test set before choosing.

Never tune BM25 parameters based on a single query. The explain API reveals the scoring breakdown for one query against one document. Parameter changes affect every query across the entire index. Test with the full query test set.