From Confusion to Clarity: Advanced Observability Strategies for Media Workflows at Netflix

Transcript

Netflix’s media encoding pipeline can generate up to 1 million trace spans for a single hour-long episode of Squid Game Season 2, which makes observability a significant challenge. The company moved from a monolithic architecture to a complex, distributed system built on Cosmos, and that move required a fundamental shift in how it approached monitoring and debugging.

Why This Matters

Traditional observability approaches struggle with the scale and complexity of modern distributed systems. Standard tracing and logging become ineffective when a single workflow produces millions of spans and hundreds of microservice calls. Without effective observability, identifying bottlenecks and optimizing performance is very difficult, which leads to increased costs and a degraded user experience. The cost side is substantial: Netflix estimates that encoding a single episode of Squid Game used 122,000 CPU hours.

Key Insights

Up to 1 million trace spans describe the workflow that encodes a single hour-long episode of Squid Game Season 2 (2026).
Request-first tree visualization helps navigate complex, hierarchical microservice calls, addressing “trace explosion.”
Netflix’s Cosmos platform combines microservices, asynchronous workflows, and serverless functions, requiring a custom observability solution.
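
To make the request-first idea concrete, here is a minimal sketch of how a flat list of spans might be grouped into a tree of requests, so that a viewer starts from a handful of requests rather than from every span. The field names (`request_id`, `parent_request_id`, `service`) and the service names are assumptions made for illustration; they are not taken from Netflix's schema or from Cosmos.

```python
from dataclasses import dataclass, field


@dataclass
class RequestNode:
    """One request in the tree, with the number of spans it produced."""
    request_id: str
    service: str
    span_count: int = 0
    children: list = field(default_factory=list)


def build_request_tree(spans):
    """Group spans by request ID and link each request to its parent.

    Each span is a dict with hypothetical keys: request_id,
    parent_request_id (None for a root request), and service.
    """
    nodes = {}
    parent_of = {}
    for span in spans:
        rid = span["request_id"]
        node = nodes.setdefault(rid, RequestNode(rid, span["service"]))
        node.span_count += 1
        parent_of[rid] = span["parent_request_id"]

    roots = []
    for rid, node in nodes.items():
        parent = nodes.get(parent_of[rid])
        if parent is None:
            # No known parent: treat this request as a root
            roots.append(node)
        else:
            parent.children.append(node)
    return roots


def render(node, depth=0):
    """Return indented text lines, one per request."""
    lines = ["  " * depth + f"{node.service} [{node.request_id}] spans={node.span_count}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans: one root request that fans out to two child requests
spans = [
    {"request_id": "r1", "parent_request_id": None, "service": "encode-episode"},
    {"request_id": "r2", "parent_request_id": "r1", "service": "encode-chunk"},
    {"request_id": "r2", "parent_request_id": "r1", "service": "encode-chunk"},
    {"request_id": "r3", "parent_request_id": "r1", "service": "quality-check"},
]
tree_lines = render(build_request_tree(spans)[0])
print("\n".join(tree_lines))
```

Collapsing spans into request nodes keeps the tree size proportional to the number of requests, not the number of spans, which is one plausible way to keep a very large trace navigable.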

Working Example

# Example of a simplified span processor (conceptual)
class SpanProcessor:
    def __init__(self, sinks):
        # Each sink is a callable that accepts one record; in a real system
        # these would wrap writers for stores such as Elasticsearch and Iceberg
        self.sinks = sinks

    def process_span(self, span):
        # Key the record by trace ID and request ID so it can be aggregated later
        record = {
            "trace_id": span.trace_id,
            "request_id": span.request_id,
            # Duration is the time between span start and span end
            "duration": span.end_time - span.start_time,
            # Queue time is assumed to be reported on the span itself
            "queue_time": span.queue_time,
        }

        # Send the record to every configured sink
        # (Elasticsearch and Iceberg writers omitted for brevity)
        for sink in self.sinks:
            sink(record)
        return record
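
The processor above handles one span at a time. As a complementary sketch, the block below shows how per-span values might be rolled up into one summary per (trace ID, request ID) pair before anything is stored. The `Span` dataclass, the `rollup` function, and the sample values are hypothetical, and the summed duration is total span time, which can exceed wall-clock time when spans overlap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span shape mirroring the fields used in the example above."""
    trace_id: str
    request_id: str
    start_time: float
    end_time: float
    queue_time: float


def rollup(spans):
    """Aggregate span count, total duration, and total queue time per request."""
    totals = defaultdict(lambda: {"span_count": 0, "duration": 0.0, "queue_time": 0.0})
    for span in spans:
        entry = totals[(span.trace_id, span.request_id)]
        entry["span_count"] += 1
        entry["duration"] += span.end_time - span.start_time
        entry["queue_time"] += span.queue_time
    return dict(totals)


# Three sample spans: two belong to request r1, one to request r2
spans = [
    Span("t1", "r1", 0.0, 4.0, 1.0),
    Span("t1", "r1", 4.0, 6.0, 0.5),
    Span("t1", "r2", 1.0, 2.0, 0.0),
]
summary = rollup(spans)
for key, entry in summary.items():
    print(key, entry)
```

Storing one summary row per request instead of one row per span is a design choice that trades detail for a much smaller data set to query.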

Practical Applications

Netflix Encoding Pipeline: Aggregated span data enables real-time monitoring of encoding jobs, which helps identify performance bottlenecks and optimize resource allocation.
Pitfall: Relying solely on traditional tracing without high-cardinality metadata and stream processing leads to “trace explosion” and unusable dashboards.

References:

https://www.infoq.com/presentations/stream-pipeline-observability/
