Relevance Tuning

A developer increases the title field boost from 3 to 5. The search results “feel better” for a handful of test queries. This is not relevance tuning. This is adjusting a knob and hoping. Relevance tuning means: change the boost, run the evaluation, compare the NDCG score before and after, and decide based on the number.

NDCG: The Metric That Matters

Normalized Discounted Cumulative Gain (NDCG) measures ranking quality by comparing the actual result order to the ideal result order. It accounts for two things:

Relevance grade: a document rated 3 (perfect) contributes more to the score than a document rated 1 (marginal)
Position discount: a relevant document at position 1 contributes more than the same document at position 5

The formula:

$$\text{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$\text{NDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}$$

Where $rel_i$ is the relevance grade of the document at position $i$, and IDCG is the DCG of the ideal ranking (documents sorted by relevance grade, highest first).

NDCG ranges from 0 to 1. An NDCG@5 of 1.0 means the top 5 results appear in the ideal order according to the relevance grades. An NDCG@5 of 0.6 indicates a mediocre ranking: relevant documents appear below irrelevant ones or are missing from the top 5 altogether.
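
As a worked illustration of the formula, suppose the top 5 results for a query carry the grades 3, 1, 2, 0, 0, and that these are the only judged documents for that query. The grades are made up for this example and are not taken from the test set.

$$\text{DCG@5} = \frac{2^3 - 1}{\log_2 2} + \frac{2^1 - 1}{\log_2 3} + \frac{2^2 - 1}{\log_2 4} + 0 + 0 \approx 7 + 0.631 + 1.5 = 9.131$$

$$\text{IDCG@5} = \frac{2^3 - 1}{\log_2 2} + \frac{2^2 - 1}{\log_2 3} + \frac{2^1 - 1}{\log_2 4} \approx 7 + 1.893 + 0.5 = 9.393$$

$$\text{NDCG@5} = \frac{9.131}{9.393} \approx 0.972$$

The only flaw in this ranking is that the grade-2 document sits below the grade-1 document, which costs roughly 3% of the ideal score. The same swap at positions 1 and 2 would cost more, because the position discount is steepest at the top.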

The OpenSearch Ranking Evaluation API

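OpenSearch provides a built-in Ranking Evaluation API (the _rank_eval endpoint) for computing ranking metrics. The evaluator below takes the client-side route instead: it runs each test query through the search API, computes NDCG@5 itself, and records the top results per query: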

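// HARDENED: Automated relevance evaluation with NDCG@5 computed client-side

import java.io.IOException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.query_dsl.Query;
import org.opensearch.client.opensearch._types.query_dsl.TextQueryType;
import org.opensearch.client.opensearch.core.SearchResponse;
import org.opensearch.client.opensearch.core.search.Hit;

// DocPage and QueryTestSetLoader are project classes, assumed to live in the same package
public class RelevanceEvaluator {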

    private final OpenSearchClient client;

    public RelevanceEvaluator(OpenSearchClient client) {
        this.client = client;
    }

    public record EvaluationResult(
        double overallNdcg,
        Map<String, Double> perQueryNdcg,
        Map<String, List<String>> perQueryTopResults
    ) {}

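    // Field weights used when no query template is supplied
    private static final List<String> DEFAULT_FIELDS =
        List.of("title^3", "body", "code_snippets^0.5");

    /**
     * Runs every test query against the index and computes NDCG@5 per query.
     * If queryTemplate is a multi_match query, its field list (boosts included)
     * replaces the default weights; pass null to evaluate the defaults.
     */
    public EvaluationResult evaluate(String index,
            List<QueryTestSetLoader.RelevanceJudgment> testSet, Query queryTemplate)
            throws IOException {

        List<String> fields = (queryTemplate != null && queryTemplate.isMultiMatch())
            ? queryTemplate.multiMatch().fields()
            : DEFAULT_FIELDS;

        Map<String, Double> perQueryNdcg = new LinkedHashMap<>();
        Map<String, List<String>> perQueryTopResults = new LinkedHashMap<>();

        for (var judgment : testSet) {
            // Build the query with tenant filter
            Query query = Query.of(q -> q
                .bool(b -> b
                    .filter(f -> f.term(t -> t
                        .field("tenant_id")
                        .value(judgment.filters().get("tenant_id"))
                    ))
                    .must(mu -> mu.multiMatch(mm -> mm
                        .query(judgment.query())
                        .fields(fields)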
                        .type(TextQueryType.CrossFields)
                    ))
                )
            );

            // Execute the search
            SearchResponse<DocPage> response = client.search(s -> s
                    .index(index)
                    .query(query)
                    .size(10),
                DocPage.class
            );

            // Compute NDCG@5
            List<String> topResults = response.hits().hits().stream()
                .map(Hit::id)
                .limit(5)
                .toList();

            Map<String, Integer> grades = new HashMap<>();
            for (var jd : judgment.judgments()) {
                grades.put(jd.documentId(), jd.relevanceGrade());
            }

            double ndcg = computeNdcg(topResults, grades, 5);

            perQueryNdcg.put(judgment.queryId(), ndcg);
            perQueryTopResults.put(judgment.queryId(), topResults);
        }

        double overallNdcg = perQueryNdcg.values().stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(0.0);

        return new EvaluationResult(overallNdcg, perQueryNdcg, perQueryTopResults);
    }

    private double computeNdcg(List<String> results,
            Map<String, Integer> grades, int k) {

        double dcg = 0.0;
        double idcg = 0.0;

        // Compute DCG
        for (int i = 0; i < Math.min(results.size(), k); i++) {
            int grade = grades.getOrDefault(results.get(i), 0);
            dcg += (Math.pow(2, grade) - 1) / (Math.log(i + 2) / Math.log(2));
        }

        // Compute IDCG (ideal ordering)
        List<Integer> idealGrades = grades.values().stream()
            .sorted(Comparator.reverseOrder())
            .limit(k)
            .toList();

        for (int i = 0; i < idealGrades.size(); i++) {
            idcg += (Math.pow(2, idealGrades.get(i)) - 1) / (Math.log(i + 2) / Math.log(2));
        }

        return idcg == 0 ? 0 : dcg / idcg;
    }
}

Field Weight Tuning

With the evaluation framework in place, field weight changes become experiments with measurable outcomes:

@Test
void fieldWeightExperiment() throws Exception {
    // Index the test corpus
    indexTestCorpus(client, "docs-v1");

    var evaluator = new RelevanceEvaluator(client);
    var testSet = new QueryTestSetLoader().loadTestSet();

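    // Baseline: title^3, body^1, code_snippets^0.5 (no template, so the evaluator's own weights apply)
    EvaluationResult baseline = evaluator.evaluate("docs-v1", testSet, null);

    // Experiment: title^5, body^1, code_snippets^0.5
    // The template carries only the field weights. The evaluator is expected to
    // substitute each test query's text, so the query string here is a placeholder.
    Query experimentTemplate = Query.of(q -> q.multiMatch(mm -> mm
        .query("")
        .fields("title^5", "body", "code_snippets^0.5")));
    EvaluationResult experiment =
        evaluator.evaluate("docs-v1", testSet, experimentTemplate);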

    System.out.printf("Baseline NDCG@5: %.4f%n", baseline.overallNdcg());
    System.out.printf("Experiment NDCG@5: %.4f%n", experiment.overallNdcg());
    System.out.printf("Delta: %+.4f%n",
        experiment.overallNdcg() - baseline.overallNdcg());

    // Print the per-query breakdown for queries whose score moved by more than 0.01
    for (var entry : baseline.perQueryNdcg().entrySet()) {
        String queryId = entry.getKey();
        double baselineScore = entry.getValue();
        double experimentScore = experiment.perQueryNdcg().get(queryId);
        if (Math.abs(experimentScore - baselineScore) > 0.01) {
            System.out.printf("  %s: %.4f -> %.4f (%+.4f)%n",
                queryId, baselineScore, experimentScore,
                experimentScore - baselineScore);
        }
    }
}

A typical experiment output:

Baseline NDCG@5: 0.7340
Experiment NDCG@5: 0.7185
Delta: -0.0155

  Q001: 0.8500 -> 0.9200 (+0.0700)   // Method name: improved
  Q002: 0.7800 -> 0.6500 (-0.1300)   // Concept: degraded
  Q005: 0.7300 -> 0.6100 (-0.1200)   // How-to: degraded

The title boost increase from 3 to 5 improves method name queries (where the exact method name appears in the title) but degrades concept and how-to queries (where the query terms are distributed across title and body). The overall NDCG drops. The change is rejected.

Relevance Evaluation in CI

@Test
void relevanceRegressionCheck() throws Exception {
    indexTestCorpus(client, "docs-v1");

    var evaluator = new RelevanceEvaluator(client);
    var testSet = new QueryTestSetLoader().loadTestSet();

    EvaluationResult result = evaluator.evaluate("docs-v1", testSet, null);

    // Minimum acceptable NDCG thresholds per category
    Map<String, Double> thresholds = Map.of(
        "method_name", 0.80,
        "concept", 0.70,
        "error_message", 0.65,
        "config_key", 0.75,
        "how_to", 0.60
    );

    for (var entry : thresholds.entrySet()) {
        double categoryNdcg = computeCategoryNdcg(result, entry.getKey());
        assertThat(categoryNdcg)
            .as("NDCG@5 for category '%s' below threshold", entry.getKey())
            .isGreaterThanOrEqualTo(entry.getValue());
    }
}
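
The regression check calls computeCategoryNdcg, which is not shown in this section. A minimal sketch of what such a helper could look like follows, assuming the category of each query is available as a map from query ID to category label. That map, the class name CategoryNdcg, and the method signature are hypothetical and chosen only for illustration.

```java
import java.util.Map;

/** Hypothetical helper: averages per-query NDCG scores within one category. */
final class CategoryNdcg {

    private CategoryNdcg() {}

    /**
     * @param perQueryNdcg    query ID -> NDCG@5, as produced by the evaluator
     * @param queryCategories query ID -> category label (assumed to come from the test set)
     * @param category        the category to aggregate
     * @return mean NDCG of the category's queries, or 0.0 if the category has no queries
     */
    static double compute(Map<String, Double> perQueryNdcg,
                          Map<String, String> queryCategories,
                          String category) {
        return perQueryNdcg.entrySet().stream()
            // Keep only the queries that belong to the requested category
            .filter(e -> category.equals(queryCategories.get(e.getKey())))
            .mapToDouble(e -> e.getValue())
            .average()
            .orElse(0.0);
    }
}
```

Returning 0.0 for a category with no queries is a deliberate choice in this sketch: a category that silently lost its test queries fails its threshold instead of passing unnoticed.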

Figure: Relevance evaluation pipeline, showing the flow from the query test set through search execution, NDCG computation, and regression detection.

The relevance evaluation pipeline runs in CI on every pull request that modifies query construction, analyzer configuration, or field mappings. The test set is loaded from the fixture, each query is executed against a Testcontainers OpenSearch instance with the full production mapping, and NDCG@5 is computed per query and per category. If any category falls below its threshold, the build fails and the developer sees exactly which queries regressed.

The Decision Rule

Accept a relevance change only when it improves overall NDCG@5 without degrading any category below its minimum threshold. A change that improves one category at the expense of another is not an improvement; it is a trade-off that must be justified with a specific product decision.
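
The rule can be written down as a small predicate. The following is a sketch under stated assumptions, not part of the evaluator above: the class name RelevanceDecision, the method accept, and the per-category score map are hypothetical names introduced for illustration.

```java
import java.util.Map;

/** Hypothetical encoding of the decision rule for a relevance change. */
final class RelevanceDecision {

    private RelevanceDecision() {}

    /**
     * Accepts a change only if overall NDCG@5 improves and every category
     * with a threshold stays at or above it.
     *
     * @param baselineOverall      overall NDCG@5 before the change
     * @param experimentOverall    overall NDCG@5 after the change
     * @param experimentByCategory category -> NDCG@5 after the change
     * @param thresholds           category -> minimum acceptable NDCG@5
     */
    static boolean accept(double baselineOverall,
                          double experimentOverall,
                          Map<String, Double> experimentByCategory,
                          Map<String, Double> thresholds) {
        // The change must improve the overall score, not merely match it
        if (experimentOverall <= baselineOverall) {
            return false;
        }
        for (var entry : thresholds.entrySet()) {
            // A category with no score is treated as 0.0 and therefore fails
            double score = experimentByCategory.getOrDefault(entry.getKey(), 0.0);
            if (score < entry.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

With the numbers from the title-boost experiment (0.7340 before, 0.7185 after), the first check already rejects the change, whatever the per-category scores are.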

Set minimum NDCG thresholds conservatively at first (0.05 below the current score) and tighten them as the evaluation framework matures. The purpose of the threshold is regression prevention, not perfection.
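
One way to seed the initial thresholds is to derive them from the current per-category scores. The sketch below assumes those scores are already available as a map; ThresholdSeeder and seed are hypothetical names, and the margin of 0.05 is the starting value suggested above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: derives starting thresholds from current category scores. */
final class ThresholdSeeder {

    private ThresholdSeeder() {}

    /**
     * @param currentByCategory category -> current NDCG@5
     * @param margin            how far below the current score the threshold sits (e.g. 0.05)
     * @return category -> threshold, never below 0.0, sorted by category name
     */
    static Map<String, Double> seed(Map<String, Double> currentByCategory, double margin) {
        Map<String, Double> thresholds = new TreeMap<>();
        currentByCategory.forEach((category, score) ->
            // Clamp at 0.0 so a very low score cannot yield a negative threshold
            thresholds.put(category, Math.max(0.0, score - margin)));
        return thresholds;
    }
}
```

Tightening the thresholds later then amounts to re-running the seeding with a smaller margin once the scores have stabilized.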

Run relevance evaluation on every pull request that touches query construction, analysis, or mapping. The Testcontainers integration test adds 30-60 seconds to the build. The alternative is discovering relevance regressions in production.