search at depth

Log-Based Search Analytics Pipeline

Chapter 51 of 60

The Symptom

The product manager asks: “What do our users search for most?” The team checks the application logs. Search queries are logged as unstructured text mixed with HTTP access logs. Extracting the top queries requires a custom grep pipeline that misses 30% of queries due to inconsistent log formatting.

The Internals

Search logs are the raw material for understanding user behavior. Every search query, its results, and the user’s subsequent actions form a feedback loop that drives search improvement. Without structured search logs, this feedback loop is broken.

The search analytics pipeline:

  1. Structured logging. Every search request produces a structured log entry with the query, results, latency, and user context.
  2. Indexing. Log entries are indexed into a dedicated analytics index with keyword fields for exact aggregation.
  3. Aggregation. Daily and weekly rollups produce top-queries, zero-result-queries, and slow-queries reports.
  4. Action. Each report maps to a specific improvement action: add synonyms, create missing content, optimize slow queries.

The Implementation

Search Event Schema

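import com.fasterxml.jackson.annotation.JsonProperty;
import java.time.Instant;
import java.util.List;
import java.util.Map;

// One entry per search request; @JsonProperty gives snake_case field names in the index.
public record SearchEvent(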
    @JsonProperty("event_id") String eventId,
    @JsonProperty("timestamp") Instant timestamp,
    @JsonProperty("tenant_id") String tenantId,
    @JsonProperty("user_id") String userId,
    @JsonProperty("query") String query,
    @JsonProperty("query_normalized") String queryNormalized,
    @JsonProperty("result_count") int resultCount,
    @JsonProperty("latency_ms") long latencyMs,
    @JsonProperty("page") int page,
    @JsonProperty("results_shown") List<String> resultsShown,
    @JsonProperty("filters_applied") Map<String, String> filtersApplied,
    @JsonProperty("search_type") String searchType  // lexical, semantic, hybrid
) {}

// One entry per result click; search_event_id links the click back to its SearchEvent.
public record ClickEvent(
    @JsonProperty("event_id") String eventId,
    @JsonProperty("timestamp") Instant timestamp,
    @JsonProperty("search_event_id") String searchEventId,
    @JsonProperty("tenant_id") String tenantId,
    @JsonProperty("user_id") String userId,
    @JsonProperty("clicked_doc_slug") String clickedDocSlug,
    @JsonProperty("click_position") int clickPosition
) {}
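
The terms aggregations in the reports run on `query_normalized`, and the term filters run on `tenant_id`, so both need to be mapped as `keyword` rather than left to dynamic mapping (which would map strings as analyzed `text`). The following is a minimal sketch of an index template body for `PUT _index_template/search-events`, held in a Java text block. The class name, the template name, and the individual field types are assumptions inferred from the records above, and the `date` mapping assumes the `Instant` is serialized as an ISO-8601 string.

```java
/** Hypothetical holder for the search-events index template (illustrative only). */
final class SearchEventsIndexTemplate {

    private SearchEventsIndexTemplate() {}

    // Body for: PUT _index_template/search-events
    // Field types are assumptions derived from the SearchEvent record.
    static final String JSON = """
        {
          "index_patterns": ["search-events-*"],
          "template": {
            "mappings": {
              "properties": {
                "event_id": { "type": "keyword" },
                "timestamp": { "type": "date" },
                "tenant_id": { "type": "keyword" },
                "user_id": { "type": "keyword" },
                "query": { "type": "text" },
                "query_normalized": { "type": "keyword" },
                "result_count": { "type": "integer" },
                "latency_ms": { "type": "long" },
                "page": { "type": "integer" },
                "results_shown": { "type": "keyword" },
                "filters_applied": { "type": "object" },
                "search_type": { "type": "keyword" }
              }
            }
          }
        }
        """;
}
```

A matching template for `click-events-*` would follow the same pattern, with `search_event_id` and `clicked_doc_slug` as `keyword` fields.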

Search Event Logger

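import java.io.IOException;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.Refresh;

// Writes search and click events into monthly indices (UTC).
public class SearchEventLogger {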

    private final OpenSearchClient client;

    public SearchEventLogger(OpenSearchClient client) {
        this.client = client;
    }

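    // Refresh.False avoids forcing a refresh per event; events become
    // searchable on the index's normal refresh interval.
    public void logSearch(SearchEvent event) throws IOException {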
        client.index(i -> i
            .index("search-events-" + formatMonth(event.timestamp()))
            .document(event)
            .refresh(Refresh.False)
        );
    }

    public void logClick(ClickEvent event) throws IOException {
        client.index(i -> i
            .index("click-events-" + formatMonth(event.timestamp()))
            .document(event)
            .refresh(Refresh.False)
        );
    }

    private String formatMonth(Instant timestamp) {
        return timestamp.atZone(ZoneOffset.UTC)
            .format(DateTimeFormatter.ofPattern("yyyy-MM"));
    }
}

Analytics Reports

import java.io.IOException;
import java.util.List;
import org.opensearch.client.json.JsonData;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.FieldValue;

// Assumes the opensearch-java 2.x typed client; builder signatures vary between client versions.
public class SearchAnalyticsReporter {

    private final OpenSearchClient client;

    public SearchAnalyticsReporter(OpenSearchClient client) {
        this.client = client;
    }

    public record TopQuery(String query, long count, double avgResultCount,
            double avgLatencyMs) {}

    // Most frequent normalized queries for a tenant over the last `days` days,
    // with the average result count and latency for each query.
    public List<TopQuery> topQueries(String tenantId, int days, int topN)
            throws IOException {

        var response = client.search(s -> s
            .index("search-events-*")
            .size(0)  // aggregations only, no hits
            .query(q -> q.bool(b -> b
                .filter(f -> f.term(t -> t.field("tenant_id").value(FieldValue.of(tenantId))))
                .filter(f -> f.range(r -> r
                    .field("timestamp")
                    .gte(JsonData.of("now-" + days + "d"))
                ))
            ))
            .aggregations("top_queries", a -> a
                .terms(t -> t
                    .field("query_normalized")  // must be mapped as keyword
                    .size(topN)
                )
                .aggregations("avg_result_count", sub -> sub
                    .avg(avg -> avg.field("result_count"))
                )
                .aggregations("avg_latency", sub -> sub
                    .avg(avg -> avg.field("latency_ms"))
                )
            ),
            Void.class
        );

        // key() is assumed to return a String (opensearch-java 2.x); on clients
        // where it returns a FieldValue, use key().stringValue() instead.
        return response.aggregations().get("top_queries")
            .sterms().buckets().array().stream()
            .map(bucket -> new TopQuery(
                bucket.key(),
                bucket.docCount(),
                bucket.aggregations().get("avg_result_count").avg().value(),
                bucket.aggregations().get("avg_latency").avg().value()
            ))
            .toList();
    }

    public record ZeroResultQuery(String query, long count) {}

    // Up to 50 normalized queries that returned no results at least 3 times
    // in the window; the minimum count filters out one-off typos.
    public List<ZeroResultQuery> zeroResultQueries(String tenantId, int days)
            throws IOException {

        var response = client.search(s -> s
            .index("search-events-*")
            .size(0)  // aggregations only, no hits
            .query(q -> q.bool(b -> b
                .filter(f -> f.term(t -> t.field("tenant_id").value(FieldValue.of(tenantId))))
                .filter(f -> f.term(t -> t.field("result_count").value(FieldValue.of(0L))))
                .filter(f -> f.range(r -> r
                    .field("timestamp")
                    .gte(JsonData.of("now-" + days + "d"))
                ))
            ))
            .aggregations("zero_result_queries", a -> a
                .terms(t -> t
                    .field("query_normalized")
                    .size(50)
                    .minDocCount(3)
                )
            ),
            Void.class
        );

        // See the note on key() in topQueries.
        return response.aggregations().get("zero_result_queries")
            .sterms().buckets().array().stream()
            .map(bucket -> new ZeroResultQuery(
                bucket.key(),
                bucket.docCount()
            ))
            .toList();
    }
}

The Measurement

Weekly search analytics for the documentation platform (Tenant: Acme Corp):

| Report | Query | Count | Action |
|---|---|---|---|
| Top queries | “authentication” | 2,340 | Verify ranking quality |
| Top queries | “rate limiting” | 1,890 | Verify ranking quality |
| Zero-result | “webhook retry policy” | 45 | Content gap → create doc |
| Zero-result | “graphql subscription” | 38 | Synonym gap → add synonym |
| Slow queries | “how to implement*” (wildcard) | 12 | Query rewrite → remove wildcard |
| Low CTR | “error handling” (CTR: 8%) | 890 | Ranking problem → tune weights |

Each row maps to a concrete improvement action. Zero-result queries expose content gaps and vocabulary mismatches. Low click-through rate queries expose ranking problems. Slow queries expose query optimization opportunities.
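
The low-CTR report depends on joining click events back to search events through `search_event_id`. The sketch below shows one way to compute it in memory, assuming CTR is defined as the share of searches for a query that received at least one click. The `ClickThroughRate` class and its `Search` projection are hypothetical names for illustration; a production job would more likely page through the indices or precompute the join.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory CTR calculation (illustrative only). */
final class ClickThroughRate {

    /** Minimal projection of a search event: its id and normalized query. */
    record Search(String eventId, String queryNormalized) {}

    private ClickThroughRate() {}

    /**
     * Returns, per normalized query, the fraction of searches that got at
     * least one click. clickedSearchIds holds the search_event_id values
     * taken from the click events.
     */
    static Map<String, Double> byQuery(List<Search> searches,
                                       Set<String> clickedSearchIds) {
        // counts[0] = searches seen, counts[1] = searches with a click
        Map<String, long[]> counts = new LinkedHashMap<>();
        for (Search search : searches) {
            long[] c = counts.computeIfAbsent(
                search.queryNormalized(), key -> new long[2]);
            c[0]++;
            if (clickedSearchIds.contains(search.eventId())) {
                c[1]++;
            }
        }
        Map<String, Double> ctr = new LinkedHashMap<>();
        counts.forEach((query, c) -> ctr.put(query, (double) c[1] / c[0]));
        return ctr;
    }
}
```

Sorting the resulting map by value in ascending order, after dropping queries with too few searches to be meaningful, would give the bottom-CTR list.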

The Decision Rule

Log every search event with a structured schema that includes the normalized query, result count, latency, and tenant. Normalization (lowercase, whitespace trim, stopword removal) ensures that “API Key” and “api key” aggregate into the same bucket.
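
A minimal sketch of the normalization step, covering the three operations named above. The `QueryNormalizer` class and its stopword list are assumptions for illustration; a real deployment would choose stopwords per language and corpus.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical query normalizer (illustrative only). */
final class QueryNormalizer {

    // Illustrative stopword list (an assumption); tune per corpus and language.
    private static final Set<String> STOPWORDS =
        Set.of("a", "an", "the", "to", "of", "in", "for");

    private QueryNormalizer() {}

    /** Lowercases, trims, collapses whitespace, and removes stopwords. */
    static String normalize(String rawQuery) {
        if (rawQuery == null) {
            return "";
        }
        String lowered = rawQuery.toLowerCase(Locale.ROOT).strip();
        if (lowered.isEmpty()) {
            return "";
        }
        String withoutStopwords = Arrays.stream(lowered.split("\\s+"))
            .filter(token -> !STOPWORDS.contains(token))
            .collect(Collectors.joining(" "));
        // If every token was a stopword, keep the lowercased query so the
        // event still aggregates under a non-empty key.
        return withoutStopwords.isEmpty()
            ? lowered.replaceAll("\\s+", " ")
            : withoutStopwords;
    }
}
```

With this sketch, `normalize("API Key")` and `normalize("  api   key ")` both yield `api key`, so the two events land in the same terms bucket.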

Generate weekly reports for the top 50 queries, the top 50 zero-result queries, the top 20 slow queries (latency above the p95 threshold), and the 20 queries with the lowest click-through rate (CTR). Each report should map to a responsible team and a concrete action category (content creation, synonym addition, query optimization, ranking tuning).
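
To make the p95 cutoff for the slow-queries report concrete, here is a small sketch of a nearest-rank percentile over raw latency samples. The `LatencyPercentile` class is a hypothetical helper; against a live index, a server-side percentiles aggregation on `latency_ms` would normally replace it, and that aggregation returns an approximation rather than this exact value.

```java
import java.util.Arrays;

/** Hypothetical nearest-rank percentile helper (illustrative only). */
final class LatencyPercentile {

    private LatencyPercentile() {}

    /**
     * Returns the smallest sample such that at least `percentile` percent
     * of all samples are less than or equal to it (nearest-rank method).
     */
    static long nearestRank(long[] latenciesMs, double percentile) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no latency samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone();  // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

Queries whose latency exceeds `nearestRank(samples, 95)` are the candidates for the slow-queries report.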

Index search events into monthly indices (e.g., search-events-2024-03) with an Index State Management (ISM) policy that deletes indices older than 12 months. Search analytics data grows linearly with traffic and provides diminishing value beyond 12 months.
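
A retention policy along these lines could look like the following sketch, which builds an ISM policy body for `PUT _plugins/_ism/policies/<policy-name>`. The `IsmRetentionPolicy` class, the state names, and the priority value are assumptions for illustration. Note that `min_index_age` is measured from index creation, so 365 days approximates the 12-month rule rather than matching calendar months exactly.

```java
/** Hypothetical builder for an ISM retention policy body (illustrative only). */
final class IsmRetentionPolicy {

    private IsmRetentionPolicy() {}

    /** Returns a policy that deletes matching indices once they reach the given age. */
    static String policyJson(int retentionDays) {
        // State names ("active", "delete") and the priority are assumptions.
        return """
            {
              "policy": {
                "description": "Delete search analytics indices after the retention window",
                "default_state": "active",
                "states": [
                  {
                    "name": "active",
                    "actions": [],
                    "transitions": [
                      { "state_name": "delete", "conditions": { "min_index_age": "%dd" } }
                    ]
                  },
                  {
                    "name": "delete",
                    "actions": [ { "delete": {} } ],
                    "transitions": []
                  }
                ],
                "ism_template": { "index_patterns": ["search-events-*"], "priority": 100 }
              }
            }
            """.formatted(retentionDays);
    }
}
```

Calling `policyJson(365)` produces the 12-month variant; adding `click-events-*` to `index_patterns` would apply the same retention to click events.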