Building a Search Quality Dashboard

The Symptom

The team deploys a synonym list update on Tuesday. Search relevance for technical queries improves by 0.04 NDCG. On Thursday, a teammate deploys a mapping change that accidentally removes the code_snippets field from the multi_match query. Relevance for code-related queries drops by 0.12 NDCG. Nobody notices because the only relevance metric is a monthly manual evaluation.

The Internals

Search quality is a time-series metric, not a one-time evaluation. Every change to the mapping, analyzer, query template, or synonym list potentially affects relevance. Without continuous measurement, regressions hide behind feature launches.

The search quality pipeline:

Query test set. A fixed set of queries with graded relevance judgments (from Chapter 8).
Automated evaluation. Run the test set against the current index, compute NDCG@5 per category.
Historical storage. Store each evaluation result with a timestamp and the deployment version.
Regression detection. Compare the current NDCG@5 for each category with the snapshot from the previous deployment. Alert on any drop greater than 0.02.
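The evaluator from Chapter 8 is not reproduced in this section. As a reminder of what the pipeline computes, here is a minimal sketch of NDCG@k. It assumes judgments are held as a map from document ID to an integer grade and uses the common exponential gain (2^grade - 1) with a log2 discount. The class name NdcgSketch and its method signatures are illustrative, and the actual RelevanceEvaluator may use a different gain function or judgment type.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class NdcgSketch {

    /** DCG over the first k grades: gain 2^grade - 1, discount log2(rank + 1). */
    static double dcg(List<Integer> grades, int k) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, grades.size()); i++) {
            double gain = Math.pow(2, grades.get(i)) - 1;
            double discount = Math.log(i + 2) / Math.log(2); // rank is i + 1
            sum += gain / discount;
        }
        return sum;
    }

    /** NDCG@k for one query; returns 0.0 when no document has a positive grade. */
    static double ndcg(List<String> returnedIds,
            Map<String, Integer> judgments, int k) {
        // Grades in the order the engine returned them; unjudged documents count as 0.
        List<Integer> actual = returnedIds.stream()
            .map(id -> judgments.getOrDefault(id, 0))
            .toList();
        // Ideal ordering: every judged grade, highest first.
        List<Integer> ideal = judgments.values().stream()
            .sorted(Comparator.reverseOrder())
            .toList();
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcg(actual, k) / idealDcg;
    }
}
```

A ranking that returns the judged documents in descending grade order scores 1.0. Any other ordering scores lower, which is what makes the metric sensitive to a field silently dropping out of a query.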

The Implementation

Automated NDCG Tracker

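import java.io.IOException;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.opensearch.client.opensearch.OpenSearchClient;

// SearchService, RelevanceEvaluator, QueryTestCase, and Hit are application
// types defined elsewhere in the project.
public class NdcgTracker {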

    private final SearchService searchService;
    private final RelevanceEvaluator evaluator;
    private final OpenSearchClient client;

    public NdcgTracker(SearchService searchService,
            RelevanceEvaluator evaluator,
            OpenSearchClient client) {
        this.searchService = searchService;
        this.evaluator = evaluator;
        this.client = client;
    }

    public record NdcgSnapshot(
        Instant timestamp,
        String deploymentVersion,
        double overallNdcg,
        Map<String, Double> categoryNdcg,
        int queryCount,
        int failedQueries
    ) {}

    public NdcgSnapshot evaluate(String deploymentVersion,
            List<QueryTestCase> testSet) throws Exception {

        Map<String, List<Double>> categoryScores = new LinkedHashMap<>();
        int failedQueries = 0;

        for (QueryTestCase testCase : testSet) {
            try {
                var results = searchService.search(
                    testCase.tenantId(), testCase.query(), 0);
                List<String> returnedSlugs = results.hits().stream()
                    .map(Hit::id)
                    .toList();

                double ndcg = evaluator.computeNdcg(
                    returnedSlugs, testCase.judgments(), 5);

                categoryScores
                    .computeIfAbsent(testCase.category(), k -> new ArrayList<>())
                    .add(ndcg);
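            } catch (Exception e) {
                // A failed query is left out of the averages, so it cannot
                // lower NDCG. Check failedQueries before trusting a snapshot.
                failedQueries++;
            }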
        }

        Map<String, Double> categoryAverages = categoryScores.entrySet().stream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> e.getValue().stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0)
            ));

        double overallNdcg = categoryAverages.values().stream()
            .mapToDouble(Double::doubleValue).average().orElse(0);

        NdcgSnapshot snapshot = new NdcgSnapshot(
            Instant.now(),
            deploymentVersion,
            overallNdcg,
            categoryAverages,
            testSet.size(),
            failedQueries
        );

        // Store in the search-quality-metrics index
        storeSnapshot(snapshot);

        return snapshot;
    }

    private void storeSnapshot(NdcgSnapshot snapshot) throws IOException {
        client.index(i -> i
            .index("search-quality-metrics")
            .document(snapshot)
        );
    }
}

Regression Detector

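import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.SortOrder;

public class RegressionDetector {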

    private final OpenSearchClient client;
    private static final double REGRESSION_THRESHOLD = 0.02;

    public RegressionDetector(OpenSearchClient client) {
        this.client = client;
    }

    public record RegressionAlert(
        String category,
        double previousNdcg,
        double currentNdcg,
        double delta,
        String previousVersion,
        String currentVersion
    ) {}

    public List<RegressionAlert> detectRegressions(
            NdcgTracker.NdcgSnapshot current) throws IOException {

        // Fetch the previous snapshot
        var response = client.search(s -> s
            .index("search-quality-metrics")
            .query(q -> q.range(r -> r
                .field("timestamp")
                .lt(JsonData.of(current.timestamp().toString()))
            ))
            .sort(so -> so.field(f -> f
                .field("timestamp")
                .order(SortOrder.Desc)
            ))
            .size(1),
            NdcgTracker.NdcgSnapshot.class
        );

        if (response.hits().hits().isEmpty()) {
            return List.of();  // No previous snapshot to compare
        }

        var previous = response.hits().hits().get(0).source();
        List<RegressionAlert> alerts = new ArrayList<>();

        for (var entry : current.categoryNdcg().entrySet()) {
            String category = entry.getKey();
            double currentNdcg = entry.getValue();
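            // A category with no previous score defaults to 0.0, so a newly
            // added category can never register as a regression.
            double previousNdcg = previous.categoryNdcg()
                .getOrDefault(category, 0.0);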
            double delta = currentNdcg - previousNdcg;

            if (delta < -REGRESSION_THRESHOLD) {
                alerts.add(new RegressionAlert(
                    category, previousNdcg, currentNdcg, delta,
                    previous.deploymentVersion(),
                    current.deploymentVersion()
                ));
            }
        }

        return alerts;
    }
}
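The detector above iterates over the categories of the current snapshot, so a category that disappears entirely (for example, because every query in it failed) is not reported. The following sketch separates the comparison from the OpenSearch lookup so it can be unit tested, and flags vanished categories as well. The class name CategoryComparison, the Finding record, and the category keys are hypothetical and not part of the chapter's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class CategoryComparison {

    /** One reportable difference between two snapshots. */
    record Finding(String category, double previous, double current, String reason) {}

    /**
     * Compares per-category NDCG maps. Categories are visited in sorted
     * order so the output is deterministic. Categories that exist only in
     * the current snapshot are skipped because they have no baseline.
     */
    static List<Finding> compare(Map<String, Double> previous,
            Map<String, Double> current, double threshold) {
        List<Finding> findings = new ArrayList<>();
        for (String category : new TreeSet<>(previous.keySet())) {
            double before = previous.get(category);
            Double after = current.get(category);
            if (after == null) {
                // No score at all is worse than a low score: report it.
                findings.add(new Finding(category, before, Double.NaN, "missing"));
            } else if (before - after > threshold) {
                findings.add(new Finding(category, before, after, "regression"));
            }
        }
        return findings;
    }
}
```

Called with a previous snapshot of {method_name=0.89, concept=0.78, error_message=0.72}, a current snapshot of {method_name=0.76, concept=0.78}, and a threshold of 0.02, this returns two findings: error_message as missing and method_name as a regression.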

The Measurement

Search quality tracking over 30 days (NDCG@5, overall and per query category):

Week	Overall NDCG	Method Name	Concept	Error Message	Event
1	0.77	0.89	0.71	0.72	Baseline
2	0.79	0.89	0.75	0.72	Synonym update
3	0.82	0.89	0.78	0.72	Hybrid search launch
4	0.78	0.76	0.78	0.72	Mapping change (regression)

The regression in week 4 affected only the “method name” category, which dropped from 0.89 to 0.76, while overall NDCG dropped by just 0.04 (0.82 to 0.78). Without per-category tracking, the regression is diluted: an overall drop of 0.04 is easy to dismiss as noise or to have offset by an unrelated improvement shipped in the same week, whereas a category-specific drop of 0.13 is clearly a problem.

The Decision Rule

Run NDCG evaluation on every deployment that changes mappings, analyzers, query templates, or synonym lists. Store results in a time-series index for historical comparison.

Alert on per-category regression, not just overall NDCG. A mapping change that improves concept queries (+0.02) while destroying method name queries (-0.10) has a net negative impact on user experience despite a modest overall NDCG change.
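The dilution is easy to check by hand. The sketch below averages per-category deltas the same way NdcgTracker averages category scores (an unweighted mean) and compares the result with the worst single category. The class name CategoryDilution is illustrative, and the example assumes a third category that did not change, which the text above does not state.

```java
import java.util.Map;

final class CategoryDilution {

    /** Unweighted mean of the per-category changes. */
    static double overallDelta(Map<String, Double> categoryDeltas) {
        return categoryDeltas.values().stream()
            .mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Largest single-category drop as a negative number, or 0.0 if nothing dropped. */
    static double worstCategoryDelta(Map<String, Double> categoryDeltas) {
        double min = categoryDeltas.values().stream()
            .mapToDouble(Double::doubleValue).min().orElse(0.0);
        return Math.min(0.0, min);
    }
}
```

With deltas of -0.10 for method name, +0.02 for concept, and 0.00 for the assumed unchanged category, overallDelta returns roughly -0.027 while worstCategoryDelta returns -0.10. The more categories the test set has, the smaller a single-category failure looks in the overall number.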

Include NDCG evaluation in the CI pipeline as a deployment gate. A deployment that reduces any category’s NDCG by more than 0.02 should require explicit approval before proceeding.
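One way to express the gate is a small decision function that a CI step turns into an exit code. This is a sketch under assumptions: the DeploymentGate class, the Decision enum, and the idea of passing approval as a boolean are all hypothetical, and how approval is actually recorded (a pipeline variable, a pull request label, a manual job) depends on the CI system in use.

```java
import java.util.List;

final class DeploymentGate {

    enum Decision { PASS, BLOCKED, APPROVED_OVERRIDE }

    /**
     * Decides whether a deployment may proceed.
     *
     * @param regressedCategories categories whose NDCG dropped past the threshold
     * @param explicitlyApproved  true if a reviewer has signed off on the regression
     */
    static Decision decide(List<String> regressedCategories,
            boolean explicitlyApproved) {
        if (regressedCategories.isEmpty()) {
            return Decision.PASS;
        }
        return explicitlyApproved ? Decision.APPROVED_OVERRIDE : Decision.BLOCKED;
    }

    /** Process exit code for a CI step: non-zero stops the pipeline. */
    static int exitCode(Decision decision) {
        return decision == Decision.BLOCKED ? 1 : 0;
    }
}
```

A CI step would collect the category names from the RegressionAlert list, call decide, log the decision, and pass exitCode to System.exit. Keeping APPROVED_OVERRIDE distinct from PASS means the override shows up in logs and can be audited later.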