Beyond the Green Dot: Advanced LLM Observability Lessons from OpenAI Outages

These articles are AI-generated summaries. Please check the original sources for full details.

OpenAI Outage Postmortem: What Status Pages Don’t Tell You

On April 20, 2026, ChatGPT users lost access to projects mid-session during a 90-minute partial outage that status pages initially missed. A month earlier, in March 2026, Azure-hosted GPT-5.2 endpoints returned HTTP 400 and 429 errors for 20 hours while aggregate availability signals remained green.

Why This Matters

Vendor status pages rely on binary aggregates that mask critical failure modes such as silent latency creep, where p99 latency doubles while the average stays within SLA bands. Technical teams cannot wait for vendor recovery banners; they must instrument their own side of the wire to detect regional skew and model-routing shifts that degrade output quality without tripping global availability alarms.
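The latency-creep failure mode is easy to reproduce with made-up numbers. The sketch below uses hypothetical latency samples (not data from either incident) to show a p99 that doubles while the mean moves by only a few percent; the helper name `mean_and_p99` is an assumption for illustration.

```python
import statistics

def mean_and_p99(latencies_s):
    """Return (mean, p99) for a list of request latencies in seconds."""
    # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(latencies_s, n=100, method="inclusive")[98]
    return statistics.fmean(latencies_s), p99

# Hypothetical samples: 1000 requests each, with 2% of requests in the slow tail.
healthy = [1.0] * 980 + [4.0] * 20
degraded = [0.95] * 980 + [8.0] * 20   # the tail doubles, the bulk gets slightly faster

healthy_mean, healthy_p99 = mean_and_p99(healthy)
degraded_mean, degraded_p99 = mean_and_p99(degraded)

print(f"mean: {healthy_mean:.3f}s -> {degraded_mean:.3f}s")
print(f"p99:  {healthy_p99:.1f}s -> {degraded_p99:.1f}s")
```

An alert keyed on the mean would likely stay quiet here, which is why the instrumentation later in this article records histograms rather than averages.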

Key Insights

  • Regional skew detection is critical because global aggregates can remain healthy while a specific region, such as the EU, is degraded. This failure mode was observed in the March 2026 Azure GPT-5.2 incident.
  • Time-to-first-token (TTFT) provides the earliest hint of provider-side queueing, often spiking before total response latency metrics reveal an incident.
  • Token throughput measured in tokens per second catches ‘slow generation’ failures where the stream produces tokens at half-speed, bypassing standard wall-clock timeouts.
  • Structured-output validation rates identify silent quality drift caused by model-routing fallbacks, which are invisible to traditional latency and availability metrics.
  • Multi-window burn-rate alerts for LLMs prevent false positives from individual slow requests, such as those with long prompts, by paging only when the rate of SLO breaches is consuming the error budget over a 5-minute window.
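The burn-rate idea in the last bullet can be sketched as follows. Burn rate is the observed fraction of bad requests divided by the error budget (1 minus the SLO target). The article only mentions a 5-minute window; the 1-hour long window, the 99% SLO target, the 14.4 threshold, and all names below are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Window:
    total: int  # requests observed in the window
    bad: int    # requests that breached the latency or error SLO

def burn_rate(window: Window, slo_target: float) -> float:
    """Budget consumption speed; 1.0 means the budget lasts exactly the SLO period."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (window.bad / window.total) / error_budget

def should_page(short: Window, long: Window,
                slo_target: float = 0.99, threshold: float = 14.4) -> bool:
    """Page only when both the short and the long window burn faster than the threshold."""
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)

# One slow long-prompt request in 5 minutes: well under the threshold, no page.
one_slow_request = should_page(Window(200, 1), Window(2400, 3))

# A short spike that the long window does not confirm: no page.
brief_spike = should_page(Window(200, 40), Window(2400, 45))

# Sustained breaches visible in both windows: page.
sustained = should_page(Window(200, 40), Window(2400, 400))
```

Requiring both windows to agree is what keeps a single breach from paging, while the short window lets the alert clear quickly once the provider recovers.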

Working Examples

OpenTelemetry instrumentation capturing TTFT, tokens per second, and schema validation failures for OpenAI streaming calls.

import time
from contextlib import contextmanager
from openai import OpenAI
from opentelemetry import metrics, trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm.client")
meter = metrics.get_meter("llm.client")
latency_hist = meter.create_histogram("llm.request.duration", unit="s")
ttft_hist = meter.create_histogram("llm.request.ttft", unit="s")
tps_hist = meter.create_histogram("llm.request.tokens_per_second", unit="tok/s")
err_counter = meter.create_counter("llm.request.errors")
schema_counter = meter.create_counter("llm.request.schema_failures")

@contextmanager
def llm_span(model: str, region: str):
    attrs = {"llm.model": model, "llm.region": region}
    start = time.perf_counter()
    with tracer.start_as_current_span("llm.call", attributes=attrs) as span:
        try:
            yield span, attrs
        except Exception as e:
            err_counter.add(1, attrs)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        finally:
            latency_hist.record(time.perf_counter() - start, attrs)

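def call_streaming(model: str, region: str, prompt: str, validate):
    with llm_span(model, region) as (span, attrs):
        # Start the clock before the request so TTFT includes provider-side queueing.
        start = time.perf_counter()
        stream = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        first_token_at = None
        token_count = 0
        chunks = []
        for chunk in stream:
            # Some chunks arrive with an empty choices list; skip them.
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta and first_token_at is None:
                first_token_at = time.perf_counter()
                ttft_hist.record(first_token_at - start, attrs)
            # Whitespace-separated words are only a rough proxy for tokens.
            token_count += len(delta.split())
            chunks.append(delta)
        # Measure throughput from the first token so queueing time does not dilute it.
        if first_token_at is not None:
            elapsed = time.perf_counter() - first_token_at
            if elapsed > 0:
                tps_hist.record(token_count / elapsed, attrs)
        text = "".join(chunks)
        # Record the validation outcome on every span, not only on failures.
        valid = bool(validate(text))
        span.set_attribute("llm.schema.valid", valid)
        if not valid:
            schema_counter.add(1, attrs)
        return text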
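
The `validate` argument to `call_streaming` is left open in the listing above. One possible shape, assuming the model is asked to return a JSON object with known top-level keys, is sketched below; `make_json_validator` and the key names are hypothetical and not part of any library.

```python
import json

def make_json_validator(required_keys):
    """Build a validate(text) callable that accepts only JSON objects with the given keys."""
    required = set(required_keys)

    def validate(text: str) -> bool:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            # Prose, truncated output, or markdown-wrapped JSON all land here.
            return False
        # Reject valid JSON that is not an object, or that is missing a required key.
        return isinstance(obj, dict) and required.issubset(obj)

    return validate

validate = make_json_validator(["sentiment", "score"])
ok = validate('{"sentiment": "positive", "score": 0.9}')
prose = validate("Sure! Here is the analysis you asked for.")
missing_key = validate('{"sentiment": "positive"}')
not_an_object = validate("[1, 2, 3]")
```

Passing such a callable as `validate` makes the schema-failure counter rise when a routing fallback starts returning prose or incomplete objects, even if latency and HTTP status look normal.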

Practical Applications

  • Per-model p95 latency monitoring to identify when specific variants like gpt-4o degrade while the broader model family remains operational. Pitfall: Using aggregate latency metrics that hide model-specific regressions.
  • Logging the provider-returned ‘system_fingerprint’ and request IDs to correlate mid-day output changes with provider deployment updates. Pitfall: Relying only on the model name, which makes cost and quality reconstruction impossible during postmortems.
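
The fingerprint logging described above could look roughly like this. The `system_fingerprint` field is named in the article. Reading the request ID from a `_request_id` attribute is an assumption about the OpenAI Python SDK's response objects, so the sketch uses `getattr` and degrades to `None` if it is absent. `audit_record` and `FingerprintWatcher` are hypothetical names, and the responses are stand-ins rather than real API output.

```python
import logging
from types import SimpleNamespace

log = logging.getLogger("llm.audit")

def audit_record(response, region: str) -> dict:
    """Collect the fields worth keeping for a postmortem from a completion response."""
    return {
        "model": getattr(response, "model", None),
        "system_fingerprint": getattr(response, "system_fingerprint", None),
        "request_id": getattr(response, "_request_id", None),  # assumed attribute name
        "region": region,
    }

class FingerprintWatcher:
    """Remember the last fingerprint per (model, region) and flag changes."""

    def __init__(self):
        self._last = {}

    def observe(self, record: dict) -> bool:
        fingerprint = record["system_fingerprint"]
        if fingerprint is None:
            return False  # nothing to compare against
        key = (record["model"], record["region"])
        previous = self._last.get(key)
        self._last[key] = fingerprint
        changed = previous is not None and previous != fingerprint
        if changed:
            log.warning("fingerprint changed for %s: %s -> %s (request %s)",
                        key, previous, fingerprint, record["request_id"])
        return changed

# Stand-in responses for illustration only.
watcher = FingerprintWatcher()
before = SimpleNamespace(model="gpt-4o", system_fingerprint="fp_aaa", _request_id="req_1")
after = SimpleNamespace(model="gpt-4o", system_fingerprint="fp_bbb", _request_id="req_2")
first_seen = watcher.observe(audit_record(before, "eu"))
changed = watcher.observe(audit_record(after, "eu"))
```

Keying the watcher on both model and region keeps a regional rollout from being masked by unchanged fingerprints elsewhere.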
