From Confusion to Clarity: Advanced Observability Strategies for Media Workflows at Netflix

These articles are AI-generated summaries. Please check the original sources for full details.

Transcript

Netflix’s media encoding process, handling up to 1 million trace spans for a single hour-long episode of Squid Game Season 2, presented significant observability challenges. The company transitioned from a monolithic architecture to a complex, distributed system based on Cosmos, requiring a fundamental shift in how they approached monitoring and debugging.

Why This Matters

Traditional observability approaches struggle with the scale and complexity of modern distributed systems. Standard tracing and logging become ineffective when a single workflow produces millions of spans and hundreds of microservice calls. Without effective observability, bottlenecks are hard to identify and performance is hard to optimize, which raises costs and degrades the user experience. The cost side is substantial: Netflix estimates that encoding a single episode of Squid Game used 122,000 CPU hours.

Key Insights

  • Encoding a single hour-long episode of Squid Game Season 2 generates up to 1 million trace spans (2026).
  • Request-first tree visualization helps navigate complex, hierarchical microservice calls, addressing “trace explosion.”
  • Netflix’s Cosmos platform combines microservices, asynchronous workflows, and serverless functions, requiring a custom observability solution.
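
The request-first tree idea can be illustrated with a minimal sketch. It assumes each span carries a request ID and the ID of the request that triggered it; the `Span` fields, `RequestNode`, `build_request_tree`, and `render` are hypothetical names chosen for illustration, not Netflix's data model or API. Spans are folded into one node per request, so the tree grows with the number of requests rather than the number of spans.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical span shape; the real schema is not given in the source.
    span_id: str
    request_id: str
    parent_request_id: Optional[str]
    service: str
    duration_ms: float


@dataclass
class RequestNode:
    request_id: str
    service: str
    span_count: int = 0
    total_ms: float = 0.0
    children: list = field(default_factory=list)


def build_request_tree(spans):
    """Fold spans into one node per request, then link child requests to parents."""
    nodes = {}
    parents = {}
    for span in spans:
        node = nodes.setdefault(
            span.request_id, RequestNode(span.request_id, span.service)
        )
        node.span_count += 1
        node.total_ms += span.duration_ms
        parents[span.request_id] = span.parent_request_id

    roots = []
    for request_id, node in nodes.items():
        parent = nodes.get(parents[request_id])
        if parent is None:
            # No known parent request: treat this request as a tree root.
            roots.append(node)
        else:
            parent.children.append(node)
    return roots


def render(node, depth=0):
    """Return indented text lines, one per request in the tree."""
    line = (
        f"{'  ' * depth}{node.service} [{node.request_id}] "
        f"spans={node.span_count} total_ms={node.total_ms:.0f}"
    )
    lines = [line]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


spans = [
    Span("s1", "r1", None, "workflow", 120.0),
    Span("s2", "r2", "r1", "encoder", 40.0),
    Span("s3", "r2", "r1", "encoder", 60.0),
    Span("s4", "r3", "r1", "packager", 15.0),
]
roots = build_request_tree(spans)
print("\n".join(render(roots[0])))
```

Here two encoder spans collapse into a single `r2` node, which is the effect a request-first view relies on to keep deep call hierarchies navigable.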

Working Example

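# Example of a simplified span processor (conceptual)

def store_in_elasticsearch(trace_id, request_id, duration, queue_time):
    # Placeholder: the actual Elasticsearch write is omitted for brevity
    pass


def store_in_iceberg(trace_id, request_id, duration, queue_time):
    # Placeholder: the actual Iceberg write is omitted for brevity
    pass


class SpanProcessor:
    def process_span(self, span):
        # Identify the span by trace ID and request ID so results can be
        # aggregated per trace and per request downstream
        trace_id = span.trace_id
        request_id = span.request_id

        # Duration is derived from the timestamps; queue time is read
        # directly from the span
        duration = span.end_time - span.start_time
        queue_time = span.queue_time

        # Store the derived values in Elasticsearch and Iceberg
        store_in_elasticsearch(trace_id, request_id, duration, queue_time)
        store_in_iceberg(trace_id, request_id, duration, queue_time)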

Practical Applications

  • Netflix Encoding Pipeline: Enables real-time monitoring of encoding jobs, identifying performance bottlenecks and optimizing resource allocation.
  • Pitfall: Relying solely on traditional tracing without high-cardinality metadata and stream processing leads to “trace explosion” and unusable dashboards.
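
To make the pitfall concrete, here is a rough sketch of rolling spans up by metadata keys as they arrive, so a dashboard reads a handful of summaries instead of every raw span. It is an in-memory stand-in for a stream processor, not Netflix's implementation; `StreamingRollup`, `Rollup`, and the `workflow_id` and `service` keys are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Rollup:
    span_count: int = 0
    total_duration: float = 0.0
    total_queue_time: float = 0.0


class StreamingRollup:
    """Keeps one running summary per metadata key instead of storing raw spans."""

    def __init__(self, key_fields):
        # key_fields: names of the span metadata fields to group by (hypothetical)
        self.key_fields = tuple(key_fields)
        self.rollups = defaultdict(Rollup)

    def observe(self, span: dict) -> None:
        # Build the grouping key from the span's metadata
        key = tuple(span.get(name) for name in self.key_fields)
        summary = self.rollups[key]
        summary.span_count += 1
        summary.total_duration += span["end_time"] - span["start_time"]
        summary.total_queue_time += span.get("queue_time", 0.0)


rollup = StreamingRollup(key_fields=("workflow_id", "service"))
incoming = [
    {"workflow_id": "wf-1", "service": "encoder",
     "start_time": 0.0, "end_time": 4.0, "queue_time": 1.0},
    {"workflow_id": "wf-1", "service": "encoder",
     "start_time": 5.0, "end_time": 8.0, "queue_time": 0.5},
    {"workflow_id": "wf-1", "service": "packager",
     "start_time": 8.0, "end_time": 9.0, "queue_time": 0.0},
]
for span in incoming:
    rollup.observe(span)

for key, summary in rollup.rollups.items():
    print(key, summary)
```

Memory use scales with the number of distinct key combinations rather than the number of spans, so the choice of key fields is the main design decision: more metadata in the key gives finer drill-down at the cost of more summaries to keep.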
