Debugging Analysis with the _analyze API
The Symptom
A search for “Spring Boot configuration” returns results. A search for “SpringBoot configuration” returns nothing. Both queries are searching the same field. Both contain the same words. The user is confused. The developer is confused. The inverted index is doing exactly what it was told.
The Internals
The _analyze API is the single most useful debugging tool for search relevance problems. It shows exactly what tokens an analyzer produces from a given input. When a query returns zero results, the first question is always: what tokens does the analyzer produce for the query terms, and do those tokens exist in the index?
Analysis happens at two points in the life of every searchable text field:
- Index-time analysis: when a document is indexed, its text fields are analyzed and the resulting tokens are stored in the inverted index.
- Query-time analysis: when a search query is executed, the query text is analyzed using the same analyzer (by default) and the resulting tokens are used for lookup.
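A match occurs when a query-time token equals an index-time token. If the two passes produce different tokens, whether because the analyzers differ or because the input text differs (as with “SpringBoot” versus “Spring Boot”), no match is possible.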
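The two-pass model can be illustrated without a cluster. The following is a minimal sketch, not OpenSearch code: `ToyInvertedIndex` and its methods are hypothetical, the analyzer is only a rough stand-in (split on non-alphanumeric characters, then lowercase), and the search assumes that every query token must be present in a document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ToyInvertedIndex {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Hypothetical analyzer: split on non-alphanumerics, then lowercase.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Index-time analysis: store each token of the document in the postings map.
    public void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    // Query-time analysis: keep only documents that contain every query token.
    public Set<Integer> searchAll(String query) {
        Set<Integer> result = null;
        for (String token : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(token, Set.of());
            if (result == null) {
                result = new HashSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.index(1, "Spring Boot configuration guide");
        System.out.println(idx.searchAll("Spring Boot configuration")); // prints [1]
        System.out.println(idx.searchAll("SpringBoot configuration"));  // prints []
    }
}
```

The second query fails because its first token is `springboot`, which was never written to the postings map; the index only holds `spring` and `boot`.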
The _analyze API can be called three ways:
    # Using a built-in analyzer
    POST _analyze
    {
      "analyzer": "standard",
      "text": "SpringBoot configuration"
    }

    # Using a specific index's analyzer for a field
    POST docs-v1/_analyze
    {
      "field": "title",
      "text": "SpringBoot configuration"
    }

    # Building an analyzer inline for experimentation
    POST _analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "SpringBoot configuration"
    }
The third form is the most powerful for debugging. It lets you isolate the effect of each component.
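Isolating components amounts to looking at the token list after each stage. The sketch below mimics that idea in plain Java under stated assumptions: `StageTrace`, the whitespace tokenizer, and the two filters are hypothetical stand-ins, not the real OpenSearch components.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class StageTrace {
    // Hypothetical tokenizer: split on whitespace only.
    public static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Hypothetical filter: split each token at lower-to-upper case boundaries.
    public static List<String> splitCamelCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : token.split("(?<=\\p{Lower})(?=\\p{Upper})")) {
                out.add(part);
            }
        }
        return out;
    }

    // Hypothetical filter: lowercase every token.
    public static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Runs the filters in map order and records the token list after each stage.
    public static Map<String, List<String>> trace(
            String text, Map<String, UnaryOperator<List<String>>> filters) {
        Map<String, List<String>> stages = new LinkedHashMap<>();
        List<String> tokens = whitespaceTokenize(text);
        stages.put("tokenizer", tokens);
        for (Map.Entry<String, UnaryOperator<List<String>>> filter : filters.entrySet()) {
            tokens = filter.getValue().apply(tokens);
            stages.put(filter.getKey(), tokens);
        }
        return stages;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<List<String>>> filters = new LinkedHashMap<>();
        filters.put("split_camel_case", StageTrace::splitCamelCase);
        filters.put("lowercase", StageTrace::lowercase);
        trace("SpringBoot configuration", filters)
                .forEach((stage, tokens) -> System.out.println(stage + " -> " + tokens));
        // tokenizer -> [SpringBoot, configuration]
        // split_camel_case -> [Spring, Boot, configuration]
        // lowercase -> [spring, boot, configuration]
    }
}
```

Swapping the two filters in this sketch makes the split a no-op, because the case boundary is gone by the time the splitter runs. Ordering mistakes of this kind are easy to spot when the stages are inspected one at a time.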
The Implementation
A Java utility class for analyzer debugging during development and testing:
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // The imports below assume the opensearch-java client library.
    import org.opensearch.client.opensearch.OpenSearchClient;
    import org.opensearch.client.opensearch.indices.AnalyzeRequest;
    import org.opensearch.client.opensearch.indices.AnalyzeResponse;
    import org.opensearch.client.opensearch.indices.analyze.AnalyzeToken;

    public class AnalyzerDebugger {

        private final OpenSearchClient client;

        public AnalyzerDebugger(OpenSearchClient client) {
            this.client = client;
        }

        /**
         * Analyze text using a specific index and field analyzer,
         * returning the list of tokens for inspection.
         */
        public List<String> analyzeForField(String index, String field, String text)
                throws IOException {
            AnalyzeRequest request = AnalyzeRequest.of(a -> a
                .index(index)
                .field(field)
                .text(text)
            );
            AnalyzeResponse response = client.indices().analyze(request);
            return response.tokens().stream()
                .map(AnalyzeToken::token)
                .toList();
        }

        /** Result of comparing the tokens two analyzers produce for the same text. */
        public record AnalyzerComparison(
            List<String> analyzerATokens,
            List<String> analyzerBTokens,
            List<String> onlyInA,
            List<String> onlyInB,
            List<String> common
        ) {}

        /**
         * Compare tokens produced by two different analyzers on the same text.
         * The analyzers are selected through two fields of the same index.
         */
        public AnalyzerComparison compareAnalyzers(
                String index, String fieldA, String fieldB, String text)
                throws IOException {
            List<String> tokensA = analyzeForField(index, fieldA, text);
            List<String> tokensB = analyzeForField(index, fieldB, text);

            // Sets give constant-time membership checks for the diffs below.
            Set<String> setA = new HashSet<>(tokensA);
            Set<String> setB = new HashSet<>(tokensB);

            // Stream from the lists so the results keep token order; distinct() drops repeats.
            List<String> common = tokensA.stream().filter(setB::contains).distinct().toList();
            List<String> onlyInA = tokensA.stream().filter(t -> !setB.contains(t)).distinct().toList();
            List<String> onlyInB = tokensB.stream().filter(t -> !setA.contains(t)).distinct().toList();

            return new AnalyzerComparison(tokensA, tokensB, onlyInA, onlyInB, common);
        }
    }
Using this in a test to verify analyzer behavior:
    @Test
    void codeAnalyzerDecomposesCamelCase() throws Exception {
        // Create index with code_analyzer (as defined in chapter 2)
        createDocsIndex(client);
        var debugger = new AnalyzerDebugger(client);

        List<String> tokens = debugger.analyzeForField("docs-v1", "title", "HttpClientFactory");

        assertThat(tokens).contains("httpclientfactory"); // preserved original
        assertThat(tokens).contains("http");              // camelCase split
        assertThat(tokens).contains("client");
        assertThat(tokens).contains("factory");
    }

    @Test
    void standardAnalyzerFailsOnCamelCase() throws Exception {
        // Assumes docs-v1 already exists (for example, created in test setup)
        var debugger = new AnalyzerDebugger(client);

        List<String> standard = debugger.analyzeForField("docs-v1", "title.standard",
            "HttpClientFactory");

        // Standard analyzer does NOT decompose camelCase
        assertThat(standard).contains("httpclientfactory");
        assertThat(standard).doesNotContain("http");
        assertThat(standard).doesNotContain("client");
    }
The Measurement
Build a diagnostic report comparing analyzer behavior across representative queries from the documentation platform:
| Query | Standard Tokens | Code Analyzer Tokens | Match Difference |
|---|---|---|---|
| getConnection | getconnection | getconnection, get, connection | +2 tokens, broader match |
| java.sql.Connection | java.sql.connection | java, sql, connection | Dot-split enables component match |
| max_pool_size | max_pool_size | max, pool, size, max_pool_size | Underscore-split enables partial match |
| Spring Boot | spring, boot | spring, boot | Identical for natural language |
The measurement reveals the trade-off: the code analyzer produces more tokens, which means broader recall (more documents match) but potentially lower precision (documents match that should not). This trade-off is managed through field boosting in the query, covered in chapter 9.
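Rows of such a report can be generated rather than written by hand. The following is one possible sketch: `DiagnosticReport` and its methods are hypothetical, the token lists are passed in as plain lists (in practice they might come from a helper such as `analyzeForField`), and the last column is reduced to a simple gained/lost count instead of the prose used in the table.

```java
import java.util.ArrayList;
import java.util.List;

public class DiagnosticReport {
    // Tokens present in `tokens` but absent from `other`, in first-seen order.
    public static List<String> minus(List<String> tokens, List<String> other) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!other.contains(token) && !out.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Formats one table row; the last column counts tokens gained and lost
    // when moving from the standard tokens to the code analyzer tokens.
    public static String row(String query, List<String> standard, List<String> code) {
        int gained = minus(code, standard).size();
        int lost = minus(standard, code).size();
        String diff = (gained == 0 && lost == 0)
                ? "identical"
                : "+" + gained + " / -" + lost + " tokens";
        return "| " + query
                + " | " + String.join(", ", standard)
                + " | " + String.join(", ", code)
                + " | " + diff + " |";
    }

    public static void main(String[] args) {
        // Token lists copied from the first and last rows of the table.
        System.out.println(row("getConnection",
                List.of("getconnection"),
                List.of("getconnection", "get", "connection")));
        // | getConnection | getconnection | getconnection, get, connection | +2 / -0 tokens |
        System.out.println(row("Spring Boot",
                List.of("spring", "boot"),
                List.of("spring", "boot")));
        // | Spring Boot | spring, boot | spring, boot | identical |
    }
}
```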
The Decision Rule
Use the _analyze API as the first debugging step when search returns unexpected results. Before modifying boost weights, before adding synonyms, before restructuring the query DSL, verify that the analyzer produces tokens that make the match possible. If the tokens do not match, no amount of query tuning will fix the problem.
Use the analyzer comparison utility during development to validate that changes to the analysis pipeline do not break existing search behavior. Run it against the query test set (built in chapter 8) as a regression check.
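A regression check of this kind can be as simple as comparing current analyzer output against a stored baseline. The sketch below assumes a baseline held as a map from query text to expected tokens; `AnalyzerRegressionCheck` and `changedQueries` are hypothetical names, and the analyzer is passed in as a function so that the example runs without a cluster. In a real test that function would wrap a call such as `analyzeForField` and handle its checked `IOException`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class AnalyzerRegressionCheck {
    // Returns each query whose current tokens differ from the baseline,
    // mapped to the tokens the analyzer produces now. Token order counts.
    public static Map<String, List<String>> changedQueries(
            Map<String, List<String>> baseline,
            Function<String, List<String>> analyzer) {
        Map<String, List<String>> changed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : baseline.entrySet()) {
            List<String> current = analyzer.apply(entry.getKey());
            if (!current.equals(entry.getValue())) {
                changed.put(entry.getKey(), current);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> baseline = new LinkedHashMap<>();
        baseline.put("getConnection", List.of("getconnection", "get", "connection"));
        baseline.put("Spring Boot", List.of("spring", "boot"));

        // Simulates a pipeline change that dropped camelCase splitting:
        // the stand-in analyzer only lowercases and splits on whitespace.
        Function<String, List<String>> analyzer =
                q -> Arrays.asList(q.toLowerCase(Locale.ROOT).split("\\s+"));

        System.out.println(changedQueries(baseline, analyzer));
        // prints {getConnection=[getconnection]}
    }
}
```

An empty result means the change left every baseline query untouched. A non-empty one lists exactly the queries to re-examine before the new pipeline ships.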