Trace-Based Performance Analysis

The main chapter covered setting up OpenTelemetry and finding a sequential fan-out problem. Section 1 covered instrumenting the code paths that auto-instrumentation misses. This section is about the analysis workflow: how to use trace data systematically to find performance problems, how to configure sampling so you capture the traces that matter, and how to connect traces to metrics so a latency spike in Grafana leads directly to the root cause.

The Jaeger Analysis Workflow

Jaeger stores traces and provides a UI for searching, filtering, and visualizing them. The raw search interface is useful for ad-hoc debugging, but performance analysis requires a structured approach.

Step 1: Identify the slow operation. Start from metrics. Grafana shows that GET /articles/{id} p99 latency jumped from 180ms to 920ms at 14:00. This narrows the search to a specific operation and time window.

Step 2: Search for slow traces. In Jaeger, search for service article-service, operation GET /articles/{id}, min duration 500ms, time range 14:00-14:30. This returns traces that are actually slow, not a random sample that might be fast.

Step 3: Compare slow vs fast traces. Select a slow trace (920ms) and a fast trace from before 14:00 (170ms). Jaeger’s compare view shows the two traces side by side. Look for spans that are present in the slow trace but not the fast one, or spans that are dramatically longer:

Fast trace (170ms):                    Slow trace (920ms):
api-gateway (168ms)                    api-gateway (918ms)
├── article-service (47ms)             ├── article-service (47ms)
│   ├── redis.GET (2ms) HIT            │   ├── redis.GET (2ms) MISS
│   └── postgres.SELECT (8ms)          │   ├── postgres.SELECT (43ms)
├── recommendation (112ms)             │   └── cdn.purge (380ms)        ← new span
│   └── qdrant.search (89ms)           ├── recommendation (482ms)
└── analytics (16ms)                   │   ├── qdrant.search (89ms)
                                       │   └── article-service.batch (390ms) ← slow
                                       └── analytics (16ms)

Difference: cdn.purge (380ms) and article-service.batch (390ms)

Step 4: Drill into the anomalous span. The cdn.purge span was not present in normal requests. Click it. Attributes show cdn.purge.trigger=cache-miss-revalidation, cdn.purge.urls.count=12. Someone deployed a change that triggers CDN cache purge on cache misses. That is the regression.

Step 5: Verify with aggregates. Search for all traces containing a cdn.purge span in the last hour. Check if every slow trace has this span. If the correlation is 100%, you have found the root cause.

Tail Sampling in Practice

The main chapter introduced tail sampling conceptually. This section covers the production configuration for the content platform.

The OpenTelemetry Collector sits between the application SDKs and Jaeger. Applications send all spans to the Collector. The Collector decides which complete traces to keep.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Memory protection: drop spans if buffer is exhausted
  memory_limiter:
    check_interval: 1s
    limit_mib: 3072
    spike_limit_mib: 768

  # Tail sampling: keep interesting traces, drop boring ones
  tail_sampling:
    decision_wait: 15s
    num_traces: 200000
    expected_new_traces_per_sec: 1200
    policies:
      # Policy 1: Keep all error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Policy 2: Keep traces slower than the SLO threshold
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 400

      # Policy 3: Keep traces with specific attributes (deployment canary)
      - name: canary-traces
        type: string_attribute
        string_attribute:
          key: deployment.canary
          values: ["true"]

      # Policy 4: Keep 2% of normal traces for baseline comparison
      - name: baseline-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 2

      # Policy 5: Always keep traces from the search endpoint
      - name: search-traces
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/search", "/q"]

  # Batch spans before sending to Jaeger
  batch:
    timeout: 5s
    send_batch_size: 5000
    send_batch_max_size: 10000

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]

Sizing the decision window. The decision_wait of 15 seconds means the Collector holds all spans for a trace until 15 seconds after the last span arrives. If a trace’s spans arrive over a 12-second window (common for async Kafka processing), a 10-second decision wait would make the decision before all spans arrive. The Collector would see an incomplete trace and potentially drop a slow trace because the slow span had not arrived yet. Set the decision wait to at least 1.5x your maximum expected trace duration.

Sizing the trace buffer. num_traces: 200000 limits how many distinct traces the Collector tracks simultaneously. At 1,200 traces/second with a 15-second window, you need at minimum 18,000 trace slots. The 200,000 limit provides headroom for traffic spikes. Each trace slot consumes roughly 2KB of overhead plus the actual span data. Budget 1-3GB of RAM for the Collector.

Policy evaluation order. The Collector evaluates policies as an OR. If any policy matches, the trace is kept. The order does not affect the outcome but affects evaluation cost. Put cheap policies (status code check) before expensive ones (latency calculation across all spans).

Measuring Sampling Effectiveness

After deploying tail sampling, verify it captures what you need:

// SLOW: Head sampling at 10% randomly
Traces captured per hour: 432,000
Traces with errors: 12 out of estimated 120 (10% sampled)
Traces above SLO: 45 out of estimated 450 (10% sampled)
Coverage of anomalies: 10%

// FAST: Tail sampling with the policies above
Traces captured per hour: 38,400
Traces with errors: 120 out of 120 (100% captured)
Traces above SLO: 450 out of 450 (100% captured)
Baseline normal traces: 8,640 (2% of 432,000)
Coverage of anomalies: 100%
Storage reduction: 91% fewer traces than head sampling at 10%

Tail sampling stores 11x fewer traces than 10% head sampling while capturing every anomaly. The trade-off is operational complexity: the Collector becomes a stateful component that needs memory, monitoring, and redundancy.

Exemplars: Connecting Metrics to Traces

Metrics tell you something is wrong. Traces tell you why. Exemplars are the bridge: they attach a trace ID to a specific metric data point, so you can click from a Grafana chart directly to the trace that produced the outlier value.

The OpenTelemetry Collector’s spanmetrics connector generates metrics from spans and automatically attaches exemplars:

connectors:
  spanmetrics:
    namespace: traces.spanmetrics
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s]
    dimensions:
      - name: http.method
      - name: http.route
      - name: service.name
    exemplars:
      enabled: true
      max_per_metric_stream: 4

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    send_metadata: true

service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]

In Grafana, configure the Prometheus data source with exemplar support and link to Jaeger:

# Grafana Prometheus data source configuration
Exemplars:
  - Label name: trace_id
    URL: http://jaeger:16686/trace/${__value.raw}
    Data source: Jaeger

Now when you hover over a latency histogram in Grafana, outlier data points show a small diamond icon. Click it and Grafana opens the Jaeger trace view for that exact request. The workflow becomes:

1. Grafana alert: p99 latency > 400ms for article-service
2. Open Grafana dashboard, zoom to the spike window
3. See exemplar diamonds on the histogram at the 95th+ percentile
4. Click exemplar → Jaeger opens trace a7f3b2c1
5. Trace shows: postgres.SELECT took 1,200ms
6. Span attributes: db.statement="SELECT * FROM articles WHERE slug = $1"
7. Check pg_stat_statements for that query → sequential scan, missing index on slug
8. Add index, verify with next exemplar trace → postgres.SELECT now takes 3ms

The entire investigation from alert to root cause takes minutes instead of hours. No log correlation. No guessing which request was slow. The exemplar points directly at the evidence.

Building a Trace-Based Performance Regression Detector

Individual trace analysis is reactive. You find problems after users experience them. A trace-based regression detector finds problems before they reach users by comparing span durations across deployments.

The approach: after each deployment, run a synthetic load test and collect traces. Compare span duration distributions between the new version and the previous version. Flag any span whose p95 duration increased by more than 20%.

# compare_trace_durations.py
import requests
from scipy import stats

JAEGER_API = "http://jaeger:16686/api"

def get_span_durations(service, operation, start_time, end_time):
    """Fetch span durations from Jaeger for a service/operation in a time window."""
    params = {
        "service": service,
        "operation": operation,
        "start": start_time,
        "end": end_time,
        "limit": 2000,
    }
    resp = requests.get(f"{JAEGER_API}/traces", params=params)
    durations = []
    for trace in resp.json()["data"]:
        for span in trace["spans"]:
            if span["operationName"] == operation:
                durations.append(span["duration"] / 1000.0)  # microseconds to ms
    return durations

def detect_regression(service, operation, before_window, after_window):
    """Compare span durations before and after a deployment."""
    before = get_span_durations(service, operation, *before_window)
    after = get_span_durations(service, operation, *after_window)

    if len(before) < 50 or len(after) < 50:
        return {"status": "insufficient_data"}

    before_p95 = sorted(before)[int(len(before) * 0.95)]
    after_p95 = sorted(after)[int(len(after) * 0.95)]
    change_pct = ((after_p95 - before_p95) / before_p95) * 100

    # Mann-Whitney U test for statistical significance
    stat, p_value = stats.mannwhitneyu(before, after, alternative="less")

    return {
        "operation": operation,
        "before_p95_ms": round(before_p95, 2),
        "after_p95_ms": round(after_p95, 2),
        "change_pct": round(change_pct, 2),
        "p_value": round(p_value, 4),
        "regression": change_pct > 20 and p_value < 0.05,
    }

Run this in CI after a canary deployment. If any critical-path operation shows a statistically significant p95 regression above 20%, block the rollout.

Deployment: article-service v2.14.0 → v2.15.0

Operation                          Before p95   After p95   Change    Regression?
GET /articles/{id}                 168ms        172ms       +2.4%     No
article.render.markdown-parse      28ms         29ms        +3.6%     No
article.render.syntax-highlight    34ms         78ms        +129.4%   YES (p<0.001)
recommendation.getRelated          118ms        121ms       +2.5%     No

BLOCKED: syntax-highlight regression detected.
Root cause: v2.15.0 upgraded Prism library, new version re-compiles
regex patterns on each invocation instead of caching them.

Trace Data Retention and Cost

Jaeger stores trace data in a backend (Elasticsearch, Cassandra, or ClickHouse). Storage cost is directly proportional to span volume and retention period.

Content platform span budget:

Metric                              Value
Requests per second                 12,000
Average spans per trace             10
Total spans per second              120,000
With tail sampling (keep ~3%)       3,600 spans/sec
Average span size (compressed)      180 bytes
Storage per day                     56 GB
Retention period                    14 days
Total storage                       784 GB

Elasticsearch cluster for Jaeger:
  3 data nodes, 300GB SSD each
  Replication factor: 1 (traces are reproducible, not critical data)
  Index rollover: daily, delete after 14 days

Reduce storage without losing analytical value:

Shorten attribute values. Replace db.statement="SELECT id, title, slug, content, author_id, created_at, updated_at FROM articles WHERE id = $1" with db.statement="SELECT * FROM articles WHERE id = $1". The full query text is rarely needed in the trace. Store the query hash as an attribute and log the full query separately.
Drop low-value spans. Health check endpoints (/health, /ready) generate spans that are never useful for performance analysis. Exclude them at the SDK level:

// Suppress spans for health check endpoints
SdkTracerProvider.builder()
    .addSpanProcessor(
        SpanProcessor.composite(
            FilteringSpanProcessor.create(
                span -> !span.getName().contains("/health"),
                batchSpanProcessor
            )
        )
    )
    .build();

Reduce span event verbosity. Events with large attribute sets consume disproportionate storage. Limit events to 5 per span and 10 attributes per event. The OpenTelemetry SDK enforces these limits via SpanLimits:

SpanLimits limits = SpanLimits.builder()
    .setMaxNumberOfEvents(5)
    .setMaxNumberOfAttributesPerEvent(10)
    .setMaxAttributeValueLength(256)
    .build();

Trade-Offs

Tail sampling reliability. The Collector is a single point of failure for trace completeness. If it crashes, in-flight traces are lost. Run at least two Collector instances behind a load balancer with consistent hashing (route all spans for a given trace ID to the same Collector). This adds operational complexity but ensures no single Collector crash loses traces.

Exemplar cardinality. Each exemplar is a metric sample with a trace ID label. Too many exemplars increase Prometheus storage. The max_per_metric_stream: 4 setting limits exemplars to 4 per histogram bucket per scrape interval. This is enough to find representative traces without exploding storage.

Regression detection sensitivity. A 20% p95 threshold catches large regressions but misses gradual degradation. A 5% threshold generates false positives from normal variance. Calibrate the threshold per operation based on historical variance. Operations with high natural variance (search, recommendation) need higher thresholds than deterministic operations (cache lookups, static content serving).

Retention vs investigation window. 14-day retention covers two sprint cycles. Most performance investigations happen within 48 hours of a deployment. Reduce retention to 7 days to halve storage cost, but accept that slow-burn regressions discovered after a week will have no trace evidence. Store aggregated span metrics (p50, p95, p99 per operation per hour) indefinitely in Prometheus as a fallback.