Building the Query Test Set for Relevance Evaluation

The Symptom

A developer changes the title field boost from 3 to 5. The search results “look better” for the three queries they tested manually. The change is deployed. Support tickets arrive: method name searches that used to return the API reference now return a conceptual guide with the method name in the title. The three manually tested queries improved. Forty other query patterns regressed.

Manual spot-checking is not relevance evaluation. It is confirmation bias with a browser tab.

The Internals

A query test set (also called a judgment list or relevance assessment) is a dataset mapping queries to their expected results, graded by relevance. It serves the same purpose as a unit test suite: it codifies expected behavior and detects regressions when the system changes.

The test set must cover the distribution of actual user queries. For the documentation platform, analysis of search logs reveals five query categories:

Exact method names (25% of queries): getConnection, HttpClient.Builder, setRetryPolicy
Concept searches (30%): “connection pooling,” “retry policy,” “SSL configuration”
Error messages (15%): “Connection refused,” “NullPointerException in UserService”
Configuration keys (15%): “spring.datasource.url,” “server.port,” “logging.level”
How-to questions (15%): “how to configure connection timeout,” “debug slow queries”

Each category exercises different parts of the analysis and scoring pipeline. Method name queries depend on the code analyzer. Concept searches depend on BM25 and field boosting. Error message queries depend on phrase matching. Configuration key queries depend on the whitespace analyzer. How-to queries depend on natural language analysis.

The Implementation

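import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;

/**
 * Query test set fixture for the documentation platform.
 * Stored in src/test/resources/relevance/query-test-set.json
 * and loaded in integration tests.
 */
public class QueryTestSetLoader {

    private static final String RESOURCE = "/relevance/query-test-set.json";

    // Deserializing records requires Jackson 2.12 or later.
    private final ObjectMapper objectMapper = new ObjectMapper();

    /** One query with its optional filters and graded documents. */
    public record RelevanceJudgment(
        String queryId,
        String query,
        String category,
        Map<String, String> filters,
        List<JudgedDocument> judgments
    ) {}

    /** A single graded document for a query. */
    public record JudgedDocument(
        String documentId,
        int relevanceGrade  // 0=irrelevant, 1=marginal, 2=relevant, 3=perfect
    ) {}

    public List<RelevanceJudgment> loadTestSet() throws IOException {
        try (InputStream stream = getClass().getResourceAsStream(RESOURCE)) {
            // getResourceAsStream returns null when the file is not on the classpath.
            if (stream == null) {
                throw new IOException("Test set not found on classpath: " + RESOURCE);
            }
            return objectMapper.readValue(stream,
                new TypeReference<List<RelevanceJudgment>>() {});
        }
    }
}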

The test set file:

[
  {
    "queryId": "Q001",
    "query": "getConnection",
    "category": "method_name",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:api-ref-jdbc-connection", "relevanceGrade": 3 },
      { "documentId": "acme:guide-connection-pooling", "relevanceGrade": 2 },
      { "documentId": "acme:api-ref-datasource", "relevanceGrade": 1 },
      { "documentId": "acme:changelog-v3.2", "relevanceGrade": 0 }
    ]
  },
  {
    "queryId": "Q002",
    "query": "retry policy configuration",
    "category": "concept",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:guide-retry-policies", "relevanceGrade": 3 },
      { "documentId": "acme:api-ref-http-client", "relevanceGrade": 2 },
      { "documentId": "acme:guide-error-handling", "relevanceGrade": 1 },
      { "documentId": "acme:guide-authentication", "relevanceGrade": 0 }
    ]
  },
  {
    "queryId": "Q003",
    "query": "Connection refused port 5432",
    "category": "error_message",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      {
        "documentId": "acme:troubleshooting-db-connection",
        "relevanceGrade": 3
      },
      { "documentId": "acme:guide-postgres-setup", "relevanceGrade": 2 },
      { "documentId": "acme:api-ref-datasource", "relevanceGrade": 1 }
    ]
  },
  {
    "queryId": "Q004",
    "query": "spring.datasource.hikari.maximum-pool-size",
    "category": "config_key",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:ref-config-properties", "relevanceGrade": 3 },
      { "documentId": "acme:guide-connection-pooling", "relevanceGrade": 2 }
    ]
  },
  {
    "queryId": "Q005",
    "query": "how to configure connection timeout",
    "category": "how_to",
    "filters": { "tenant_id": "acme" },
    "judgments": [
      { "documentId": "acme:guide-connection-timeout", "relevanceGrade": 3 },
      { "documentId": "acme:guide-connection-pooling", "relevanceGrade": 2 },
      { "documentId": "acme:api-ref-http-client", "relevanceGrade": 1 }
    ]
  }
]
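
Because the file is edited by hand, it can help to sanity-check it before it is used for evaluation. The following is a minimal sketch, not part of the loader above: the class name TestSetValidator and the specific checks (unique query IDs, at least one judgment per query, grades within 0-3, no document judged twice for the same query) are assumptions chosen for illustration. The records are simplified stand-ins that omit the filters field.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sanity check for a loaded query test set. */
final class TestSetValidator {

    // Simplified stand-ins for the loader's records (filters omitted).
    record JudgedDocument(String documentId, int relevanceGrade) {}
    record RelevanceJudgment(String queryId, String query, String category,
                             List<JudgedDocument> judgments) {}

    /** Returns human-readable problems; an empty list means no problem was found. */
    static List<String> validate(List<RelevanceJudgment> testSet) {
        List<String> problems = new ArrayList<>();
        Set<String> seenQueryIds = new HashSet<>();
        for (RelevanceJudgment judgment : testSet) {
            // Set.add returns false when the ID was already present.
            if (!seenQueryIds.add(judgment.queryId())) {
                problems.add("duplicate queryId: " + judgment.queryId());
            }
            if (judgment.judgments() == null || judgment.judgments().isEmpty()) {
                problems.add(judgment.queryId() + ": no judgments");
                continue;
            }
            Set<String> seenDocuments = new HashSet<>();
            for (JudgedDocument doc : judgment.judgments()) {
                if (doc.relevanceGrade() < 0 || doc.relevanceGrade() > 3) {
                    problems.add(judgment.queryId() + ": grade "
                        + doc.relevanceGrade() + " out of range for " + doc.documentId());
                }
                if (!seenDocuments.add(doc.documentId())) {
                    problems.add(judgment.queryId() + ": "
                        + doc.documentId() + " judged twice");
                }
            }
        }
        return problems;
    }
}
```

An integration test could call validate on the loaded list and fail when the result is non-empty, so that a typo in the JSON surfaces before any relevance numbers are computed.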

Determining Ground Truth

Relevance grades are assigned by people who understand the documentation domain, not by the search system. For the documentation platform, this means:

A developer familiar with the documentation corpus reviews each query
For each query, the reviewer identifies the 3-10 most relevant documents and assigns grades
Grade 3 (perfect): the document directly answers the query
Grade 2 (relevant): the document contains useful information for the query
Grade 1 (marginal): the document is tangentially related
Grade 0 (irrelevant): the document should not appear in results
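
To show how graded judgments can be consumed, here is a sketch of one common graded-relevance metric, nDCG@k. This section does not prescribe a metric, so treat the choice as an assumption, along with the class name NdcgSketch, the linear gain (the grade itself), and the decision to score unjudged documents as grade 0.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical nDCG@k calculation over 0-3 relevance grades. */
final class NdcgSketch {

    /** DCG with linear gain: sum of grade / log2(position + 1), positions 1-based. */
    static double dcg(List<Integer> grades, int k) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, grades.size()); i++) {
            // i is 0-based, so position + 1 equals i + 2.
            sum += grades.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return sum;
    }

    /**
     * Scores a ranked result list against the judged grades for one query.
     * Documents without a judgment are treated as grade 0 (an assumption).
     */
    static double ndcg(List<String> rankedDocumentIds,
                       Map<String, Integer> judgedGrades, int k) {
        List<Integer> actual = rankedDocumentIds.stream()
            .map(id -> judgedGrades.getOrDefault(id, 0))
            .toList();
        // The ideal ranking lists the judged documents from highest to lowest grade.
        List<Integer> ideal = judgedGrades.values().stream()
            .sorted(Comparator.reverseOrder())
            .toList();
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcg(actual, k) / idealDcg;
    }
}
```

A ranking that returns the judged documents in descending grade order scores 1.0; placing a grade 2 document above the grade 3 document lowers the score, which is the kind of ordering change a binary relevant/irrelevant judgment could not detect.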

The initial test set requires a one-time investment of 2-4 hours for 50 queries. It is updated when new document types are added or when user search patterns shift.

The Measurement

Track test set coverage by query category:

Category	Queries in Test Set	% of User Traffic	Coverage
Method name	12	25%	Adequate
Concept	15	30%	Adequate
Error message	8	15%	Adequate
Config key	8	15%	Adequate
How-to	7	15%	Adequate
Total	50	100%

A test set with fewer than 30 queries leaves each 15% category with at most four queries, too few to detect regressions in those minority categories. A test set with more than 100 queries becomes burdensome to grade and maintain. A set of 50 queries, distributed across categories in proportion to user traffic, gives every category at least seven queries while remaining practical to maintain.
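
The proportionality described above can be checked mechanically. The sketch below compares each category's share of the test set with its share of user traffic and reports the gap in percentage points. The class name CoverageReport and the category keys are illustrative assumptions; what counts as an acceptable gap is left to the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical check that test set composition tracks user traffic. */
final class CoverageReport {

    /**
     * For each category in trafficPercent, returns
     * (share of test set in percent) minus (share of traffic in percent).
     * Negative values mean the category is under-represented in the test set.
     */
    static Map<String, Double> shareGap(Map<String, Integer> queriesPerCategory,
                                        Map<String, Double> trafficPercent) {
        int total = queriesPerCategory.values().stream()
            .mapToInt(Integer::intValue)
            .sum();
        Map<String, Double> gaps = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : trafficPercent.entrySet()) {
            int count = queriesPerCategory.getOrDefault(entry.getKey(), 0);
            double testSetPercent = total == 0 ? 0.0 : 100.0 * count / total;
            gaps.put(entry.getKey(), testSetPercent - entry.getValue());
        }
        return gaps;
    }
}
```

With the counts from the table (12, 15, 8, 8, and 7 queries out of 50), every category lands within about one percentage point of its traffic share. Rerunning the check after the search logs are reanalyzed shows when the test set has drifted away from actual usage.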

The Decision Rule

Create the query test set before making any relevance changes. The test set establishes the baseline against which all changes are measured. Without it, relevance tuning is guessing.
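
As an illustration of measuring a change against the baseline, the sketch below counts, per category, the queries whose score dropped after a change. It assumes a per-query score (from whatever metric the team uses) has already been computed before and after the change; the names BaselineComparison and QueryScore and the tolerance parameter are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-category regression count against a stored baseline. */
final class BaselineComparison {

    /** Score of one test set query before (baseline) and after (candidate) a change. */
    record QueryScore(String queryId, String category, double baseline, double candidate) {}

    /**
     * Counts, per category, the queries whose score fell by more than tolerance.
     * Categories without regressions do not appear in the result.
     */
    static Map<String, Long> regressionsByCategory(List<QueryScore> scores, double tolerance) {
        Map<String, Long> counts = new TreeMap<>();
        for (QueryScore score : scores) {
            if (score.baseline() - score.candidate() > tolerance) {
                counts.merge(score.category(), 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

Grouping by category matters for the scenario in The Symptom: a boost change can raise scores for concept queries while method name queries regress, and an overall average can hide that trade-off.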

Grade relevance with domain experts, not with the search system’s output. If the test set is built by running queries and accepting the current top results as “correct,” the test set codifies the current behavior rather than the desired behavior.

Update the test set when a new document type is added to the platform, when user search patterns shift, or when a specific relevance failure surfaces a query pattern that the existing test set does not cover.