OpenTelemetry Instrumentation for the Ride-Hailing Platform
The Symptom
The team adds the OpenTelemetry Java agent to the rider API. Traces appear in Tempo. The waterfall shows Spring WebFlux handler spans, Lettuce Redis spans, R2DBC PostgreSQL spans. But when a ride request takes 3 seconds, the trace shows a 2.8-second gap between the WebFlux handler span and the first database span. Something is taking 2.8 seconds and it is invisible.
The gap is the surge pricing calculation. It runs in-memory, calling no external services for 95% of requests (cached multipliers). The auto-instrumenter does not see it because there is no framework call to hook into. The most critical business logic in the request path is a blind spot.
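A gap like this can be found mechanically instead of by eye. The sketch below is a hypothetical helper (TraceGapFinder is not part of OpenTelemetry or Tempo) that takes a parent span's start and end plus the intervals of its child spans and returns the longest stretch that no child covers. The timestamps in the usage note are illustrative, chosen to match the 3-second request described above.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: finds the longest part of a parent span not covered by any child span. */
final class TraceGapFinder {

    /**
     * @param parentStart parent span start in milliseconds
     * @param parentEnd   parent span end in milliseconds
     * @param children    child spans as {start, end} pairs in milliseconds, in any order
     * @return length of the longest uncovered interval inside the parent, in milliseconds
     */
    static long largestGapMillis(long parentStart, long parentEnd, long[][] children) {
        long[][] sorted = children.clone();
        Arrays.sort(sorted, Comparator.comparingLong(c -> c[0]));

        long cursor = parentStart; // end of the time covered so far
        long largest = 0;
        for (long[] child : sorted) {
            // Clamp each child to the parent's bounds.
            long start = Math.max(child[0], parentStart);
            long end = Math.min(child[1], parentEnd);
            if (start > cursor) {
                largest = Math.max(largest, start - cursor);
            }
            cursor = Math.max(cursor, end);
        }
        // Time after the last child ends also counts as uncovered.
        return Math.max(largest, parentEnd - cursor);
    }
}
```

With a handler span from 0 to 3000 ms and child spans at 20-50 ms, 2850-2950 ms and 2960-2990 ms, largestGapMillis returns 2800. That is the stretch where uninstrumented application logic is running.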
The Cause
Auto-instrumentation covers the I/O boundary: HTTP handlers, database drivers, cache clients, message brokers. It does not cover application logic that happens between those boundaries. The surge pricing engine, the driver matching algorithm, and the fare computation pipeline are all invisible to the auto-instrumenter.
Two solutions exist:
- @WithSpan annotation: add to any method, get a span automatically
- Manual Tracer API: full control over span lifecycle, attributes, events
Use @WithSpan for simple methods where you want timing. Use the manual API when you need to add attributes, record events, or manage the span across reactive operators.
The Baseline
Trace coverage with auto-instrumentation only:
Operation | Instrumented? | Why?
WebFlux handler | Yes | Agent hooks ServerWebExchange
Redis GET surge:zone:123 | Yes | Agent hooks Lettuce client
R2DBC SELECT pricing_rules | Yes | Agent hooks R2DBC driver
Kafka produce ride-events | Yes | Agent hooks KafkaTemplate
Surge pricing calculation | No | Pure application logic
Driver matching algorithm | No | Pure application logic
Fare computation pipeline | No | Pure application logic
Promotion application | No | Pure application logic
Four of eight critical operations are invisible. The trace waterfall has gaps where the most important work happens.
Target: every business-critical operation has a span with relevant attributes.
The Fix
Java Agent Setup
# SCALED: Multi-stage build with OTel agent
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw package -DskipTests
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
# Pin the agent version for reproducibility
ARG OTEL_AGENT_VERSION=2.5.0
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_AGENT_VERSION}/opentelemetry-javaagent.jar opentelemetry-javaagent.jar
ENTRYPOINT ["java", \
"-javaagent:/app/opentelemetry-javaagent.jar", \
"-jar", "app.jar"]
Application properties for the exporter:
# SCALED: application-otel.yml
otel:
  service:
    name: rider-api
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
  traces:
    sampler: parentbased_traceidratio
    # The property name is otel.traces.sampler.arg (env: OTEL_TRACES_SAMPLER_ARG)
    sampler.arg: "0.1"
  resource:
    attributes:
      deployment.environment: production
      service.version: ${APP_VERSION:unknown}
      k8s.namespace.name: ${K8S_NAMESPACE:default}
The agent takes its configuration from OTEL_* environment variables, otel.* system properties, or a properties file referenced by otel.javaagent.configuration-file. The service name, exporter endpoint, and sampling rate are the minimum configuration.
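Property names and environment variable names are two spellings of the same key: uppercase the property name and replace dots and dashes with underscores. A minimal sketch of that rule follows; OtelEnvNames is a hypothetical helper for illustration, not an OpenTelemetry class.

```java
import java.util.Locale;

/** Hypothetical helper: derives the environment variable name for an otel.* property. */
final class OtelEnvNames {

    /** Converts a property name such as otel.service.name to OTEL_SERVICE_NAME. */
    static String toEnvVar(String propertyName) {
        return propertyName
                .toUpperCase(Locale.ROOT)
                .replace('.', '_')
                .replace('-', '_');
    }
}
```

For example, toEnvVar("otel.traces.sampler.arg") yields OTEL_TRACES_SAMPLER_ARG, the name used in the Kubernetes manifest later in this section.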
Auto-Instrumented Spans
With the agent attached, these spans appear automatically:
Span Name | Source Library
POST /api/rides/request | spring-webflux
redis.GET | lettuce
SELECT pricing_rules | r2dbc-postgresql
kafka.produce ride-events | spring-kafka
kafka.consume ride-events | spring-kafka
Each span includes timing, status, and library-specific attributes. The R2DBC span includes the SQL statement (parameterized). The Redis span includes the command and key. The Kafka span includes the topic, partition, and offset.
Custom Spans with @WithSpan
// SCALED: @WithSpan for method-level tracing
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import java.util.Comparator;
import java.util.List;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Service
public class DriverMatchingService {

    // driverLocationCache, haversine(...) and scoreDriver(...) are not shown here.

    @WithSpan("driver.matching.find_nearest")
    public Mono<List<Driver>> findNearestDrivers(
            @SpanAttribute("location.lat") double lat,
            @SpanAttribute("location.lng") double lng,
            @SpanAttribute("radius.km") double radiusKm,
            @SpanAttribute("vehicle.type") String vehicleType) {
        return driverLocationCache.getDriversInRadius(lat, lng, radiusKm)
            .filter(driver -> driver.getVehicleType().equals(vehicleType))
            .filter(Driver::isAvailable)
            // Closest drivers first
            .sort(Comparator.comparingDouble(d ->
                haversine(lat, lng, d.getLat(), d.getLng())))
            .take(10)
            .collectList();
    }

    @WithSpan("driver.matching.score_candidates")
    public Mono<Driver> scoreCandidates(
            @SpanAttribute("candidate.count") int candidateCount,
            List<Driver> candidates,
            RideRequest request) {
        return Flux.fromIterable(candidates)
            .flatMap(driver -> scoreDriver(driver, request))
            // Highest score first, then keep only the winner
            .sort(Comparator.comparingDouble(ScoredDriver::getScore).reversed())
            .next()
            .map(ScoredDriver::getDriver);
    }
}
@SpanAttribute binds method parameters to span attributes. When you search for slow traces in Tempo, you can filter by vehicle.type=SUV or candidate.count > 5 to narrow down the problem.
Manual Tracer for Complex Logic
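// SCALED: Manual span management for fare calculation
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class FareCalculationService {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("ride-hailing");

    public Mono<FareEstimate> calculate(RideRequest request) {
        // defer() starts a fresh span on every subscription, not once at assembly time
        return Mono.defer(() -> {
            Span fareSpan = tracer.spanBuilder("fare.calculate.full")
                .setAttribute("rider.id", request.getRiderId())
                .setAttribute("pickup.zone", request.getPickupZoneId())
                .setAttribute("dropoff.zone", request.getDropoffZoneId())
                .startSpan();
            // The scope closes as soon as this block returns the assembled Mono,
            // before the chain itself runs. Whether downstream spans nest under
            // fareSpan depends on the agent's Reactor context propagation, so
            // check the parent/child links in Tempo.
            try (Scope scope = fareSpan.makeCurrent()) {
                return calculateDistance(request)
                    .flatMap(distance -> {
                        fareSpan.setAttribute("distance.km", distance);
                        return getBaseRate(request.getPickupZoneId());
                    })
                    .flatMap(rate -> {
                        fareSpan.addEvent("base_rate_resolved",
                            Attributes.of(
                                AttributeKey.doubleKey("rate.per_km"), rate));
                        return applySurge(request, rate);
                    })
                    .flatMap(surgedRate -> {
                        fareSpan.addEvent("surge_applied");
                        return applyPromotions(request, surgedRate);
                    })
                    .map(finalFare -> {
                        fareSpan.setAttribute("fare.amount", finalFare.doubleValue());
                        fareSpan.setAttribute("fare.currency", "USD");
                        fareSpan.setStatus(StatusCode.OK);
                        return new FareEstimate(finalFare, request);
                    })
                    .doOnError(err -> {
                        fareSpan.setStatus(StatusCode.ERROR, err.getMessage());
                        fareSpan.recordException(err);
                    })
                    // Runs on completion, error, and cancellation
                    .doFinally(signal -> fareSpan.end());
            }
        });
    }
}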
The manual approach gives you span events (timestamped log entries within the span), dynamic attributes set at different stages, and exception recording. The doFinally ensures the span ends regardless of success or error.
The choice between @WithSpan and manual Tracer:
Criteria | @WithSpan | Manual Tracer
Simple timing | Yes | Overkill
Method parameters as attrs | Yes (@SpanAttribute) | Yes (setAttribute)
Dynamic attributes | No | Yes (set during execution)
Span events | No | Yes (addEvent)
Reactive chain spans | Fragile | Correct (doFinally)
Error recording | Automatic | Manual (recordException)
For the driver matching service, @WithSpan is sufficient: the span only needs timing plus the method parameters, and for a method that returns a Mono or Flux the agent keeps the span open until the publisher completes. For fare calculation, the manual API is required because attributes like fare.amount are only known at the end of the reactive chain, and span events mark the progress through each computation stage.
Context Propagation Across Kafka
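// SCALED: Kafka producer - OTel agent handles context injection
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class RideEventPublisher {

    private final KafkaTemplate<String, RideEvent> kafkaTemplate;

    // Constructor injection; the final field must be assigned
    public RideEventPublisher(KafkaTemplate<String, RideEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public Mono<Void> publishRideRequested(RideRequest request, FareEstimate fare) {
        RideEvent event = new RideEvent(
            request.getRideId(),
            "RIDE_REQUESTED",
            request.getRiderId(),
            fare.getAmount()
        );
        // The OTel agent injects traceparent into Kafka headers automatically
        return Mono.fromFuture(
            kafkaTemplate.send("ride-events", request.getRideId(), event)
        ).then();
    }
}

// SCALED: Kafka consumer - agent extracts context and creates child span
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.stereotype.Component;

@Component
public class TripAnalyticsConsumer {

    // analyticsStore is not shown here.

    @WithSpan("analytics.process_ride_event")
    @KafkaListener(topics = "ride-events", groupId = "trip-analytics")
    public void processRideEvent(
            // RECEIVED_KEY is the Spring Kafka 3.x name (RECEIVED_MESSAGE_KEY in 2.x)
            @Header(KafkaHeaders.RECEIVED_KEY)
            @SpanAttribute("event.ride_id") String key,
            @Payload RideEvent event) {
        // This span belongs to the same trace as the producer's span
        // The trace connects rider-api → kafka → trip-analytics
        analyticsStore.recordRideRequest(event);
    }
}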
The agent handles W3C traceparent injection on the producer and extraction on the consumer. No code needed. The consumer’s analytics.process_ride_event span shares the same trace ID as the producer’s kafka.produce span.
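When a consumer span does not join the producer's trace, it helps to read the traceparent header straight off the Kafka record. The header has four dash-separated fields: version, 32 hex characters of trace ID, 16 hex characters of parent span ID, and 2 hex characters of flags, where the lowest bit means "sampled". The sketch below parses and rebuilds that layout for version 00 only. TraceParent is a hypothetical class for debugging; the agent uses its own propagator, and the sample value is the example from the W3C Trace Context specification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical debugging helper for version-00 W3C traceparent headers. */
final class TraceParent {

    // version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Parses a header value, rejecting anything that does not match the layout. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return new TraceParent(m.group(1), m.group(2), (flags & 0x01) != 0);
    }

    /** Rebuilds the header value, keeping only the sampled flag. */
    String format() {
        return "00-" + traceId + "-" + parentSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

Parsing 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 gives trace ID 4bf92f3577b34da6a3ce929d0e0e4736 with the sampled flag set. If the producer and consumer spans in Tempo carry different trace IDs, compare them against this header value to see which side dropped the context.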
Kubernetes Manifest for OTel Collector Sidecar
# SCALED: OTel Collector as sidecar in Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rider-api
spec:
  # apps/v1 requires a selector that matches the pod template labels;
  # the label app: rider-api is an assumed name
  selector:
    matchLabels:
      app: rider-api
  template:
    metadata:
      labels:
        app: rider-api
    spec:
      containers:
        - name: rider-api
          image: rider-api:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://localhost:4317"
            - name: OTEL_SERVICE_NAME
              value: "rider-api"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
          ports:
            - containerPort: 8080
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.100.0
          args: ["--config=/etc/otel/config.yaml"]
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otel
          ports:
            - containerPort: 4317
            - containerPort: 4318
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-config
The sidecar pattern means the application sends spans to localhost:4317. No cross-network latency. The Collector handles batching, retry, and export to Tempo. If Tempo is unavailable, the Collector buffers spans in memory up to the configured limit. The application never blocks on trace export.
Performance Impact
The OTel Java agent adds overhead. Measure it:
Metric | Without Agent | With Agent | Delta
p50 latency | 142ms | 145ms | +2.1%
p99 latency | 310ms | 318ms | +2.6%
CPU usage (avg) | 34% | 36% | +2 points
Memory (heap) | 412MB | 438MB | +26MB
Throughput (RPS) | 2,840 | 2,790 | -1.8%
The overhead is under 3% for latency and under 2% for throughput. The 26MB heap increase comes from span buffering before export. With parentbased_traceidratio at 0.1, the sampling decision is made when the root span starts: the instrumentation hooks still run on every request, but only the sampled 10% of traces are recorded and exported. The hook cost is roughly fixed. The recording and export cost scales with the sampling rate.
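How a ratio sampler reaches a per-trace decision can be sketched in a few lines. This is a simplified illustration of the idea, not the SDK's sampler: it treats the lower 64 bits of the trace ID as a random number and samples the trace when that number falls below ratio × Long.MAX_VALUE. RatioSamplerSketch is a hypothetical name, and the real implementation may differ in detail.

```java
/** Simplified illustration of trace-ID ratio sampling; not the OpenTelemetry SDK class. */
final class RatioSamplerSketch {

    private final double ratio;
    private final long upperBound;

    RatioSamplerSketch(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0.0, 1.0]");
        }
        this.ratio = ratio;
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** @param traceIdHex 32 lowercase hex characters, as carried in the traceparent header */
    boolean shouldSample(String traceIdHex) {
        if (ratio == 1.0) {
            return true; // sample everything
        }
        // Lower 64 bits of the trace ID, with the sign bit cleared to get a non-negative value.
        long lowBits = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long value = lowBits & Long.MAX_VALUE;
        return value < upperBound;
    }
}
```

Because the decision is a pure function of the trace ID, the same trace gets the same answer wherever it is evaluated. The parentbased_ prefix goes one step further: a service that receives a traceparent header follows the caller's sampled flag instead of deciding again.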
If 3% overhead is unacceptable, disable specific instrumentations:
# Disable auto-instrumentation for low-value spans
otel.instrumentation.lettuce.enabled=false
otel.instrumentation.r2dbc.enabled=false
Disable only after confirming those spans are not needed for diagnosis. Disabling R2DBC instrumentation would have made the connection pool wait time invisible.
The Proof
Deploy the instrumented rider API. Send 100 ride requests:
# SCALED: Verify instrumentation coverage
for i in $(seq 1 100); do
  curl -s -X POST http://rider-api:8080/api/rides/request \
    -H "Content-Type: application/json" \
    -d '{
      "rider_id": "rider-'$i'",
      "pickup": {"lat": 40.7128, "lng": -74.0060},
      "dropoff": {"lat": 40.7589, "lng": -73.9851}
    }' &
done
wait
Query Tempo for traces from the rider API:
{resource.service.name="rider-api" && name="POST /api/rides/request"}
Each trace should contain:
Span | Attributes
POST /api/rides/request | http.method, http.route
fare.calculate.full | rider.id, pickup.zone, fare.amount
fare.surge_pricing | zone.id
r2dbc.query | db.statement
driver.matching.find_nearest | location.lat, location.lng, radius.km
driver.matching.score_candidates | candidate.count
redis.GET | db.system=redis
kafka.produce ride-events | messaging.destination
Eight spans per trace. Four auto-instrumented, four custom. Zero blind spots in the critical path.
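The eight-span check can be automated once the span names of a trace are in hand. A minimal sketch, assuming the names have already been fetched from Tempo (the fetch is not shown, and SpanCoverageCheck is a hypothetical helper):

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: reports which expected spans are absent from a trace. */
final class SpanCoverageCheck {

    // The eight span names listed in the table above.
    static final List<String> EXPECTED_SPANS = List.of(
            "POST /api/rides/request",
            "fare.calculate.full",
            "fare.surge_pricing",
            "r2dbc.query",
            "driver.matching.find_nearest",
            "driver.matching.score_candidates",
            "redis.GET",
            "kafka.produce ride-events");

    /** Returns the expected span names missing from the trace, in table order. */
    static Set<String> missingSpans(Collection<String> actualSpanNames) {
        Set<String> missing = new LinkedHashSet<>(EXPECTED_SPANS);
        missing.removeAll(actualSpanNames);
        return missing;
    }
}
```

An empty result means the trace has no blind spots. A trace captured with auto-instrumentation only would report the four custom spans as missing.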
Before instrumentation: 2.8-second gap in trace waterfall, invisible business logic. After instrumentation: complete span tree, every operation visible, filterable by rider ID, zone, fare amount.