Beyond the Green Dot: Advanced LLM Observability Lessons from OpenAI Outages
These articles are AI-generated summaries. Please check the original sources for full details.
OpenAI Outage Postmortem: What Status Pages Don’t Tell You
On April 20, 2026, ChatGPT users lost access to projects mid-session during a 90-minute partial outage that status pages initially missed. Earlier, in March 2026, Azure-hosted GPT-5.2 endpoints returned HTTP 400 and 429 errors for 20 hours while aggregate availability signals remained green.
Why This Matters
Vendor status pages rely on binary aggregates that mask critical failure modes such as silent latency creep, where p99 latency doubles while the average stays within SLA bands. Technical teams cannot wait for vendor recovery banners; they must instrument ‘their side of the wire’ to detect regional skew and model-routing shifts that degrade output quality without tripping global availability alarms.
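To make the masking effect concrete, here is a minimal sketch in Python. The latency samples, the region names, and the `percentile` helper are illustrative assumptions, not data from either incident. It shows a global mean that barely moves while one region's p99 has quadrupled.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency samples in seconds, grouped by region (assumed data).
latencies = {
    "us": [1.0] * 100,
    "eu": [1.0] * 95 + [4.0] * 5,  # 5% of EU requests are four times slower
}

# The global aggregate blends both regions and hides the tail.
all_samples = [s for samples in latencies.values() for s in samples]
global_mean = mean(all_samples)

# Slicing by region and looking at the tail exposes the degraded region.
per_region_p99 = {region: percentile(s, 99) for region, s in latencies.items()}

print(f"global mean: {global_mean:.3f}s")
print(f"per-region p99: {per_region_p99}")
```

The global mean here rises by only a few percent, which would sit comfortably inside most SLA bands, while the EU p99 is the number an affected user actually experiences.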
Key Insights
- Regional skew detection is critical because global aggregates can remain healthy while a specific region, such as the EU, is degraded, a failure mode observed in the March 2026 Azure GPT-5.2 incident.
- Time-to-first-token (TTFT) provides the earliest hint of provider-side queueing, often spiking before total response latency metrics reveal an incident.
- Token throughput measured in tokens per second catches ‘slow generation’ failures where the stream produces tokens at half-speed, bypassing standard wall-clock timeouts.
- Structured-output validation rates identify silent quality drift caused by model-routing fallbacks, which are invisible to traditional latency and availability metrics.
- Multi-window burn-rate alerts for LLMs prevent false positives from long prompts by paging only when breaches are consuming the error budget quickly across more than one window, including a short 5-minute window, rather than on a single slow request.
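The burn-rate insight can be sketched as follows. This is an illustration under stated assumptions, not the alerting rule from the original article: the `Window` type, the 99% SLO target, the pairing of a long window with the 5-minute window, and the 14.4 threshold (a value commonly paired with 1-hour and 5-minute windows in SLO alerting guidance) are all choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    total: int     # requests observed in the window
    breaches: int  # requests that breached the latency or error SLO


def burn_rate(window: Window, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    if window.total == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed fraction of bad requests
    return (window.breaches / window.total) / budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float = 0.99, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn faster than the threshold."""
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


# A sustained incident: both windows are burning budget quickly.
sustained = should_page(Window(total=1000, breaches=200), Window(total=100, breaches=30))

# A burst of long prompts: the 5-minute window spikes, the long window does not.
burst = should_page(Window(total=1000, breaches=20), Window(total=100, breaches=30))

print(sustained, burst)
```

Requiring both windows to agree is what filters out a handful of slow, long-prompt requests: they can push the short window over the threshold, but they do not move the long window enough to page.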
Working Examples
OpenTelemetry instrumentation capturing TTFT, tokens per second, and schema validation failures for OpenAI streaming calls.
```python
import time
from contextlib import contextmanager

from openai import OpenAI
from opentelemetry import metrics, trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm.client")
meter = metrics.get_meter("llm.client")

latency_hist = meter.create_histogram("llm.request.duration", unit="s")
ttft_hist = meter.create_histogram("llm.request.ttft", unit="s")
tps_hist = meter.create_histogram("llm.request.tokens_per_second", unit="tok/s")
err_counter = meter.create_counter("llm.request.errors")
schema_counter = meter.create_counter("llm.request.schema_failures")


@contextmanager
def llm_span(model: str, region: str):
    """Wrap one LLM call in a span; record errors and total duration by model and region."""
    attrs = {"llm.model": model, "llm.region": region}
    start = time.perf_counter()
    with tracer.start_as_current_span("llm.call", attributes=attrs) as span:
        try:
            yield span, attrs
        except Exception as e:
            err_counter.add(1, attrs)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        finally:
            latency_hist.record(time.perf_counter() - start, attrs)


def call_streaming(model: str, region: str, prompt: str, validate):
    """Stream a completion and record TTFT, throughput, and schema failures."""
    with llm_span(model, region) as (span, attrs):
        # Start the clock before the request is sent so TTFT includes
        # any time the provider spends queueing the request.
        start = time.perf_counter()
        stream = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        first_token_at = None
        token_count = 0
        chunks = []
        for chunk in stream:
            # Some chunks carry no choices; skip them instead of raising IndexError.
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta and first_token_at is None:
                first_token_at = time.perf_counter()
                ttft_hist.record(first_token_at - start, attrs)
            # Whitespace-separated word count: a rough proxy for tokens,
            # not the provider's tokenizer count.
            token_count += len(delta.split())
            chunks.append(delta)

        # Throughput is measured from the first token, so queueing time
        # does not dilute the generation-speed signal.
        elapsed = time.perf_counter() - (first_token_at or start)
        if elapsed > 0:
            tps_hist.record(token_count / elapsed, attrs)

        text = "".join(chunks)
        if not validate(text):
            schema_counter.add(1, attrs)
            span.set_attribute("llm.schema.valid", False)
        return text
```
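The `validate` argument to `call_streaming` is left open in the example above. One possible validator, assuming the prompt asks the model for a JSON object, is sketched below. The `REQUIRED_KEYS` schema and the `validate_json_output` name are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_KEYS = {"summary": str, "severity": str}


def validate_json_output(text: str) -> bool:
    """Return True only if text is a JSON object with every required, correctly typed key."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(key), expected)
               for key, expected in REQUIRED_KEYS.items())


print(validate_json_output('{"summary": "EU latency spike", "severity": "high"}'))
print(validate_json_output("Sure! Here is the summary you asked for..."))
```

Passing it as `call_streaming(model, region, prompt, validate_json_output)` would increment `llm.request.schema_failures` whenever a routed fallback model answers in prose or drops a field, which is the kind of quality drift that latency and availability metrics do not show.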
Practical Applications
- Per-model p95 latency monitoring to identify when specific variants like gpt-4o degrade while the broader model family remains operational. Pitfall: Using aggregate latency metrics that hide model-specific regressions.
- Logging provider-returned ‘system_fingerprint’ and Request IDs to correlate mid-day output changes with provider deployment updates. Pitfall: Relying only on the model name, making cost and quality reconstruction impossible during postmortems.
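The fingerprint-logging practice above could look roughly like the sketch below. It assumes the response object exposes `model`, `system_fingerprint`, and a `_request_id` attribute, which is how the OpenAI Python SDK is understood to surface them; treat those attribute names as assumptions to verify against the SDK version in use. The `provenance_record` and `log_provenance` helpers and the `llm.audit` logger name are hypothetical.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("llm.audit")  # hypothetical logger name


def provenance_record(response, region: str) -> dict:
    """Collect the fields needed to tie an output to a provider-side deployment."""
    return {
        "model": getattr(response, "model", None),
        "system_fingerprint": getattr(response, "system_fingerprint", None),
        # Assumed attribute name for the provider request ID.
        "request_id": getattr(response, "_request_id", None),
        "region": region,
    }


def log_provenance(response, region: str) -> dict:
    """Emit one structured log line per call and return the record."""
    record = provenance_record(response, region)
    logger.info("llm.provenance", extra={"llm": record})
    return record


# Stand-in for a real SDK response, so the sketch runs without network access.
fake_response = SimpleNamespace(
    model="gpt-4o", system_fingerprint="fp_example", _request_id="req_example"
)
record = log_provenance(fake_response, region="eu")
print(record)
```

Using `getattr` with a default keeps the logger from failing when a provider omits the fingerprint; a run of `None` values after a period of stable fingerprints is itself a signal worth noting in a postmortem.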