search at depth

Analysis Pipeline: Tokenization, Normalization, Stemming, and Why Search Breaks on Technical Terms

Chapter 4 of 60


A user searches for getConnection in the documentation search engine. Zero results. The document titled “Managing Database Connections with getConnection()” is in the index. It was indexed yesterday. The inverted index contains the document. The query returns nothing.

The problem is not in the query. The problem is in the analysis pipeline. The standard analyzer lowercased getConnection to getconnection and stored it as a single token. It did the same to the document body. But the document body contains the method name inside a sentence: “Call getConnection() to obtain a pooled connection.” The standard analyzer tokenized this into call, getconnection, to, obtain, a, pooled, connection. The search for getconnection should match. And it does, if the user types it exactly.

But users do not reliably search for getconnection. They search for get connection, or get_connection, or they expect getConnection to find a page that only writes HttpClientFactory.getConnection(). Lowercasing covers differences in case and nothing else. The standard analyzer cannot bridge these variations because it was designed for natural language, not for identifiers written in camelCase, snake_case, or dot-separated notation.

The Three Stages of Analysis

Every text field in OpenSearch passes through an analysis pipeline when indexed and (by default) when queried. The pipeline has three stages:

Character Filters operate on the raw character stream before tokenization. They can strip HTML tags, replace characters, or normalize Unicode. Zero or more character filters can be configured.

Tokenizer splits the character stream into individual tokens. Exactly one tokenizer is required. The choice of tokenizer determines the fundamental unit of search: words, whitespace-delimited chunks, n-grams, or path components.

Token Filters modify, add, or remove tokens after tokenization. Lowercasing, stemming, synonym expansion, and stop word removal all happen here. Zero or more token filters can be configured, and they execute in order.

The standard analyzer uses no character filters, the standard tokenizer (which splits on word boundaries defined by Unicode text segmentation), and two token filters: lowercase and stop. The stop filter removes nothing by default in OpenSearch, because no stop-word list is configured.
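The three stages can be sketched in plain Java. This is a toy model, not OpenSearch code: the class name PipelineSketch, the crude tag-stripping regex, the whitespace tokenizer, and the three-word stop list are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model of an analysis pipeline (hypothetical, for illustration only).
class PipelineSketch {

    // Stage 1, character filter: rewrite the raw text before tokenization.
    static String stripTags(String raw) {
        return raw.replaceAll("<[^>]*>", " ");
    }

    // Stage 2, tokenizer: exactly one per analyzer. This one splits on whitespace.
    static List<String> whitespaceTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 3, token filter: lowercase every token.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stage 3, token filter: drop a few stop words (assumed list).
    static List<String> removeStopWords(List<String> tokens) {
        Set<String> stop = Set.of("the", "a", "to");
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    static List<String> analyze(String raw) {
        String cleaned = stripTags(raw);
        List<String> tokens = whitespaceTokenizer(cleaned);
        // Token filters run in the order they are listed.
        List<UnaryOperator<List<String>>> filters =
                List.of(PipelineSketch::lowercase, PipelineSketch::removeStopWords);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [call, getconnection(), method]
        System.out.println(analyze("<p>Call the getConnection() method</p>"));
    }
}
```

The whitespace tokenizer leaves the parentheses attached to getconnection(), which shows how much the tokenizer choice decides what a searchable unit is.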

Why Standard Analysis Fails for Technical Documentation

Consider indexing this documentation page:

The HttpClientFactory.getConnection() method returns a pooled java.sql.Connection from the HikariDataSource. Configure maxPoolSize in application.yml.

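The standard analyzer produces these tokens (the common words the, a, from, and in are also emitted, because stop-word removal is off by default; they are omitted here for brevity):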

httpclientfactory.getconnection
method
returns
pooled
java.sql.connection
hikaridatasource
configure
maxpoolsize
application.yml

Three problems:

  1. Dotted identifiers remain intact. HttpClientFactory.getConnection() becomes httpclientfactory.getconnection. A user searching for getConnection does not match because getconnection is a substring of the stored token, not a separate token.

  2. CamelCase is not decomposed. HikariDataSource becomes hikaridatasource. A search for data source returns zero results.

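  3. File names are split inconsistently. Under the Unicode segmentation rules, a dot between two letters does not break a token, so application.yml is stored whole and a search for application or yml misses it. A dot next to whitespace, or between a letter and a digit, does break. Whether a dotted name splits therefore depends on the characters around each dot.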
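The reason a substring does not match can be sketched with a toy inverted index. This is a simplified model under stated assumptions: the class TermLookupSketch is hypothetical, and a real index stores positions, frequencies, and more. The point it illustrates is that lookup is an exact match on whole terms.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> ids of documents containing that exact term.
class TermLookupSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void index(int docId, List<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact term lookup. There is no substring or prefix matching here.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        // Tokens as listed above for the sample documentation page.
        index.index(1, List.of(
                "httpclientfactory.getconnection", "method", "returns", "pooled",
                "java.sql.connection", "hikaridatasource", "configure",
                "maxpoolsize", "application.yml"));

        System.out.println(index.lookup("httpclientfactory.getconnection")); // [1]
        System.out.println(index.lookup("getconnection"));                   // []
        System.out.println(index.lookup("source"));                          // []
    }
}
```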

The _analyze API reveals exactly what tokens an analyzer produces:

POST _analyze
{
  "analyzer": "standard",
  "text": "HttpClientFactory.getConnection() returns a java.sql.Connection"
}

Response (abbreviated; the type and position fields of each token are omitted):

{
  "tokens": [
    {
      "token": "httpclientfactory.getconnection",
      "start_offset": 0,
      "end_offset": 31
    },
    { "token": "returns", "start_offset": 34, "end_offset": 41 },
    { "token": "a", "start_offset": 42, "end_offset": 43 },
    { "token": "java.sql.connection", "start_offset": 44, "end_offset": 63 }
  ]
}

This is the analysis pipeline working exactly as designed. It is the wrong design for this content.
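The same _analyze call can be issued from Java with the JDK's built-in HTTP client. The sketch below is a minimal example under assumptions: the class name AnalyzeRequestSketch is hypothetical, the cluster is assumed to listen on http://localhost:9200 without authentication or TLS, and the JSON escaping covers only quotes and backslashes, which is enough for single-line sample text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class AnalyzeRequestSketch {

    // Wrap a value as a JSON string, escaping backslashes and double quotes only.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Request body with the same two fields as the example above.
    static String body(String analyzer, String text) {
        return "{\"analyzer\":" + jsonString(analyzer) + ",\"text\":" + jsonString(text) + "}";
    }

    static HttpRequest build(String baseUrl, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text)))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed endpoint; adjust host, port, and security settings for a real cluster.
        HttpRequest request = build("http://localhost:9200", "standard",
                "HttpClientFactory.getConnection() returns a java.sql.Connection");
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        } catch (IOException e) {
            System.out.println("No cluster reachable: " + e.getMessage());
        }
    }
}
```

Running the same text through each candidate analyzer before reindexing is a cheap way to catch tokenization surprises.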

Building a Code-Aware Analyzer

The documentation search platform needs an analyzer that handles three patterns: camelCase decomposition, dot-separated splitting, and underscore splitting. The following custom analyzer addresses all three:

// HARDENED: Custom analyzer for technical documentation
// Decomposes camelCase, splits on dots and underscores, preserves original form

CreateIndexRequest request = CreateIndexRequest.of(idx -> idx
    .index("docs-v1")
    .settings(s -> s
        .analysis(a -> a
            .analyzer("code_analyzer", an -> an
                .custom(c -> c
                    .tokenizer("code_tokenizer")
                    // lowercase runs last: splitOnCaseChange needs the original casing to find word boundaries
                    .filter("camel_case_split", "word_delimiter_filter", "lowercase")
                )
            )
            .tokenizer("code_tokenizer", t -> t
                .definition(d -> d
                    .pattern(p -> p
                        .pattern("[.\\s(){}\\[\\];,<>]")
                    )
                )
            )
            .filter("camel_case_split", f -> f
                .definition(d -> d
                    .wordDelimiterGraph(w -> w
                        .generateWordParts(true)
                        .generateNumberParts(true)
                        .splitOnCaseChange(true)
                        .splitOnNumerics(true)
                        .preserveOriginal(true)
                    )
                )
            )
            .filter("word_delimiter_filter", f -> f
                .definition(d -> d
                    .wordDelimiterGraph(w -> w
                        .generateWordParts(true)
                        .generateNumberParts(true)
                        .splitOnCaseChange(false)
                        .preserveOriginal(false)
                    )
                )
            )
        )
    )
    .mappings(m -> m
        .properties("body", p -> p.text(t -> t.analyzer("code_analyzer")))
        .properties("title", p -> p.text(t -> t
            .analyzer("code_analyzer")
            .fields("standard", f -> f.text(tx -> tx.analyzer("standard")))
            .fields("exact", f -> f.keyword(k -> k.ignoreAbove(512)))
        ))
    )
);

With this analyzer, HttpClientFactory.getConnection() produces:

httpclientfactory, http, client, factory, getconnection, get, connection

A search for getConnection matches. A search for connection matches. A search for factory matches. A search for HttpClientFactory matches via the preserved original. The search behaves the way a developer expects.
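The token list above can be reproduced with a small plain-Java approximation. This is not the Lucene word delimiter implementation: the class CodeAnalyzerSketch and its regular expressions are assumptions, and number handling is left out. It splits on the same delimiter characters as code_tokenizer, breaks each token on underscores and case changes, keeps the original, and lowercases only at the end, because the case boundaries are what the split relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java approximation of the code-aware analyzer (hypothetical sketch).
class CodeAnalyzerSketch {
    // Same character class as the code_tokenizer pattern.
    private static final String DELIMITERS = "[.\\s(){}\\[\\];,<>]";
    // Zero-width boundary: lowercase or digit followed by uppercase (get|Connection),
    // or the end of an acronym followed by a capitalized word (HTTP|Client).
    private static final String CASE_BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split(DELIMITERS)) {
            if (token.isEmpty()) {
                continue; // adjacent delimiters such as "()" produce empty strings
            }
            List<String> parts = new ArrayList<>();
            for (String chunk : token.split("_")) {
                for (String part : chunk.split(CASE_BOUNDARY)) {
                    if (!part.isEmpty()) {
                        parts.add(part);
                    }
                }
            }
            // Preserve the original token, then add its parts if it was split.
            out.add(token.toLowerCase(Locale.ROOT));
            if (parts.size() > 1) {
                for (String part : parts) {
                    out.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [httpclientfactory, http, client, factory, getconnection, get, connection]
        System.out.println(analyze("HttpClientFactory.getConnection()"));
        // [get_connection, get, connection]
        System.out.println(analyze("get_connection"));
        // Already lowercased input has no case boundaries left to split on.
        // [httpclientfactory]
        System.out.println(analyze("httpclientfactory"));
    }
}
```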

Multi-Field Analysis Strategy

The documentation platform uses multiple analysis strategies on the same field through multi-field mappings. The title field in the mapping above has three representations:

Sub-field        Analyzer        Use Case
title            code_analyzer   Full-text search with camelCase and dot decomposition
title.standard   standard        Natural language search, phrase queries
title.exact      (keyword)       Exact title match, aggregations, sorting

A search query can boost these sub-fields differently:

SearchRequest request = SearchRequest.of(s -> s
    .index("docs-v1")
    .query(q -> q
        .bool(b -> b
            .should(sh -> sh.match(m -> m.field("title").query(userQuery).boost(2.0f)))
            .should(sh -> sh.match(m -> m.field("title.standard").query(userQuery).boost(1.0f)))
            .should(sh -> sh.term(t -> t.field("title.exact").value(userQuery).boost(5.0f)))
        )
    )
);

An exact title match scores highest; because title.exact is a keyword field, that clause fires only when the query is the complete title, character for character. A code-aware match scores next. A standard natural-language match provides a fallback. This layered analysis is what lets the documentation search engine rank the getConnection page first when a developer searches for that method name, while still returning conceptually relevant pages when they search for “connection management.”

Figure: Analysis pipeline stages showing raw text flowing through character filters, tokenizer, and token filters to produce searchable terms in the inverted index.

The diagram illustrates how a single input string passes through the three stages of analysis. The character filter cleans the raw input, the tokenizer splits it into initial tokens, and the token filter chain (lowercase, word delimiter, etc.) transforms those tokens into the final terms stored in the inverted index. Different analyzers produce different tokens from the same input, and once stored, those tokens define what queries can match.