Skip to main content
surviving the spike

Distributed Tracing with OpenTelemetry: Finding the Slow Thing

8 min read Chapter 46 of 66

Distributed Tracing with OpenTelemetry: Finding the Slow Thing

The Symptom

Friday night. 11:47 PM. A rider reports that requesting a ride took 8 seconds. The rider API metrics show p50 at 180ms, p95 at 320ms, p99 at 480ms. All within SLO. But the Prometheus scrape interval is 15 seconds, and the anomaly happened to a single request that falls outside the p99 window. The request succeeded. The rider got their car. No error was logged.

The metrics tell you the system is healthy. The logs tell you the request completed. Neither tells you why one request took 8 seconds when 99% complete in under 500ms.

You need to see inside the request. Every service it touched, every database query it made, every queue it waited in, and the exact millisecond where 7.5 of those 8 seconds were spent. You need a trace.

The Cause

The ride-hailing platform has six services in the request path for a ride request:

Rider App → API Gateway → Rider API → Fare Service → Driver Matching → Trip Service
                                    → Kafka → Trip Analytics (async)

Metrics give you aggregate behavior per service. Logs give you events per service. Neither connects the dots across services for a single request. When a request is slow, you are left correlating timestamps across six services, hoping the clocks are synchronized and the log formats are consistent. They never are.

Distributed tracing solves this by assigning a unique trace ID to each incoming request and propagating it through every service, every database call, every message queue. Each unit of work within a trace is a span. Spans have a parent-child relationship that forms a tree. The tree shows you exactly where time was spent.

OpenTelemetry is the CNCF standard for instrumentation. It provides:

  1. Traces: A directed acyclic graph of spans representing a single request’s journey
  2. Spans: Individual operations with start time, duration, status, and attributes
  3. Context propagation: Passing trace context (trace ID + span ID) across process boundaries via HTTP headers or Kafka headers
  4. Sampling: Deciding which traces to record (you cannot record all of them at 10,000 RPS)

The Java agent handles auto-instrumentation. Attach it to the JVM, and every Spring WebFlux handler, every Redis command, every SQL query, every Kafka produce/consume gets a span. Zero code changes.

Custom spans fill the gaps. The auto-instrumenter does not know that calculateSurgePricing() is a business-critical operation worth tracing separately. You add those manually.

For the ride-hailing platform, the instrumentation plan is:

Layer          Auto-Instrumented          Custom Instrumented
HTTP           WebFlux handlers           -
Database       R2DBC queries, JDBC        Connection pool wait time
Cache          Lettuce Redis commands     Cache miss vs hit logic
Messaging      Kafka produce/consume      -
Business       -                          Surge pricing, driver matching, fare calc

The auto-instrumented layer tells you “the database query took 7.5 seconds.” The custom layer tells you “the fare calculation took 7.6 seconds because the database query inside it waited for a connection.”

The Baseline

Current observability for the ride request path:

Signal        Coverage                  Finds the 8s request?
Metrics       p50/p95/p99 per service   No (aggregated away)
Logs          Per-service text logs     Maybe (if you grep 6 services)
Traces        None                      N/A

Target:

Signal        Coverage                  Finds the 8s request?
Metrics       p50/p95/p99 per service   No (still aggregated)
Logs          Structured, trace-correlated  Yes (filter by trace ID)
Traces        Full request path          Yes (waterfall shows 7.5s pool wait)

Without tracing, diagnosing the 8-second request requires:

  1. Searching rider API logs for slow requests (hope you log request duration)
  2. Finding a correlation ID (hope you have one)
  3. Searching 5 downstream services for that correlation ID
  4. Manually summing up time spent in each service
  5. Guessing which gap between services is network vs processing

Time to diagnose: 2-4 hours. With tracing: 90 seconds.

The Fix

Adding the OpenTelemetry Java Agent

# SCALED: Dockerfile with OTel Java agent
FROM eclipse-temurin:21-jre-alpine

COPY --from=build /app/target/*.jar /app/app.jar

# Download the OpenTelemetry Java agent
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /app/opentelemetry-javaagent.jar

ENV JAVA_TOOL_OPTIONS="-javaagent:/app/opentelemetry-javaagent.jar"
ENV OTEL_SERVICE_NAME="rider-api"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
ENV OTEL_TRACES_SAMPLER="parentbased_traceidratio"
ENV OTEL_TRACES_SAMPLER_ARG="0.1"

ENTRYPOINT ["java", "-jar", "/app/app.jar"]

One environment variable. No code changes. The agent hooks into Spring WebFlux, Lettuce, R2DBC, and Spring Kafka automatically.

Custom Spans for Business Logic

Auto-instrumentation covers framework code. Business logic needs manual spans:

// SCALED: Custom spans for fare calculation
@Service
public class FareService {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("ride-hailing");

    public Mono<FareEstimate> calculateFare(RideRequest request) {
        return Mono.defer(() -> {
            Span span = tracer.spanBuilder("fare.calculate")
                .setAttribute("rider.id", request.getRiderId())
                .setAttribute("zone.id", request.getPickupZoneId())
                .startSpan();

            return calculateBaseFare(request)
                .flatMap(base -> applySurgePricing(base, request))
                .flatMap(surged -> applyPromotions(surged, request))
                .doOnSuccess(estimate ->
                    span.setAttribute("fare.amount", estimate.getAmount().doubleValue())
                        .setStatus(StatusCode.OK)
                        .end())
                .doOnError(err ->
                    span.setStatus(StatusCode.ERROR, err.getMessage())
                        .recordException(err)
                        .end())
                .contextWrite(Context.current().with(span).asContextElement());
        });
    }

    @WithSpan("fare.surge_pricing")
    private Mono<FareComponents> applySurgePricing(
            FareComponents base,
            @SpanAttribute("zone.id") String zoneId) {
        return surgeService.getMultiplier(zoneId)
            .map(multiplier -> base.withSurge(multiplier));
    }
}

The fare.calculate span wraps the entire fare computation. The fare.surge_pricing span captures surge pricing specifically. When the trace waterfall shows 7.5 seconds inside fare.calculate, you drill into its child spans and find exactly which sub-operation stalled.

Context Propagation Across Kafka

HTTP context propagation happens automatically via the W3C traceparent header. Kafka needs the OTel agent attached to both producer and consumer:

// SCALED: Kafka producer with trace context propagation
@Service
public class RideEventPublisher {

    private final KafkaTemplate<String, RideEvent> kafkaTemplate;

    public Mono<Void> publishRideRequested(RideRequest request, FareEstimate fare) {
        RideEvent event = new RideEvent(
            request.getRideId(),
            "RIDE_REQUESTED",
            request.getRiderId(),
            fare.getAmount()
        );

        // The OTel agent injects W3C traceparent into Kafka headers automatically
        return Mono.fromFuture(
            kafkaTemplate.send("ride-events", request.getRideId(), event)
        ).then();
    }
}

The OpenTelemetry Java agent instruments KafkaTemplate and @KafkaListener automatically. It injects the traceparent header into produced messages and extracts it on the consumer side. The consumer’s spans become children of the producer’s trace. A ride request trace that publishes to Kafka and gets consumed by Trip Analytics appears as a single trace spanning both services.

OTLP Export and the Collector

# SCALED: OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]

Services send spans to the Collector via OTLP gRPC. The Collector batches them and forwards to Grafana Tempo. The Collector sits as a sidecar or daemonset, absorbing the export overhead so the application does not block on network calls to the tracing backend.

Sampling Strategy

At 10,000 RPS across 6 services, recording every span generates roughly 60,000 spans/second. At 1KB per span, that is 60MB/s or 5TB/day. Sampling is not optional.

# Development: record everything
otel.traces.sampler=always_on

# Production: record 10% of traces, follow parent decision for downstream
otel.traces.sampler=parentbased_traceidratio
otel.traces.sampler.arg=0.1

parentbased_traceidratio means: if this service is the root (no incoming trace context), sample at 10%. If this service received a trace context from upstream, follow the upstream’s sampling decision. This ensures a sampled trace is complete across all services, not fragmented.

The Proof

Locust: Generating Traced Traffic

# SCALED: Locust generating traced ride requests
from locust import HttpUser, task, between
import uuid

class TracedRideUser(HttpUser):
    wait_time = between(0.5, 2.0)

    @task
    def request_ride(self):
        trace_id = uuid.uuid4().hex
        self.client.post("/api/rides/request",
            json={
                "rider_id": f"rider-{self.environment.runner.user_count}",
                "pickup": {"lat": 40.7128, "lng": -74.0060},
                "dropoff": {"lat": 40.7589, "lng": -73.9851}
            },
            headers={
                "traceparent": f"00-{trace_id}-{uuid.uuid4().hex[:16]}-01"
            },
            name="/api/rides/request"
        )

Run with 100 concurrent users for 5 minutes. Then query Tempo:

{duration > 2s && resource.service.name = "rider-api"}

The 8-second trace appears in the results. The waterfall shows:

Trace waterfall diagram showing span durations for the ride request, with the R2DBC query span highlighted in red at 7.5 seconds due to connection pool exhaustion

The waterfall visualization makes the bottleneck immediately obvious. The R2DBC query span dominates the trace at 7.5 seconds, shown in red, while every other span completes in milliseconds. The query itself executes in 15ms — the remaining 7,485ms is pure connection pool wait time. Without tracing, this would look like a slow database; with tracing, the span attributes pinpoint pool exhaustion as the root cause.

The R2DBC query span shows 7.5 seconds. The query itself takes 15ms when a connection is available. The remaining 7,485ms is connection pool wait time. The pool has 10 connections, all in use by concurrent requests. This request waited for a connection to become available.

The fix: increase spring.r2dbc.pool.max-size from 10 to 50, and add spring.r2dbc.pool.max-idle-time=30s to prevent connections from going stale.

Before tracing, this would have been a 4-hour investigation. The trace showed the answer in 90 seconds.

Before and After

Metric                      Before Tracing       After Tracing
Time to diagnose outlier    2-4 hours            90 seconds
Confidence in root cause    Low (guessing)       High (exact span)
Ability to find p99.9       None                 Full (tail sampling)
Cross-service visibility    Manual log grep      Automatic waterfall
Connection pool issue       Invisible            Span attribute: pool.wait=7485ms

The investment: one Dockerfile change, one Collector config, four @WithSpan annotations. The return: every slow request in the platform is now diagnosable in under 2 minutes.