OpenAI Launches GPT-Realtime-2 and Specialized Audio Models in General Availability

OpenAI Releases Three Realtime Audio Models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in the Realtime API

OpenAI has officially transitioned its Realtime API out of beta with the debut of three specialized audio models. The flagship GPT-Realtime-2 achieved a 96.6% score on Big Bench Audio, a 15.2 percentage point improvement over GPT-Realtime-1.5.

Why This Matters

Traditional voice agents often fail due to ‘dead air’ and context loss during multi-step reasoning tasks. By introducing adjustable reasoning effort across five levels and expanding the context window to 128K tokens, OpenAI targets both problems: developers can trade reasoning depth against response latency for each task, and longer conversations stay in context. This lets developers move beyond simple Q&A loops toward multi-step, multi-turn voice agents with controllable performance-latency tradeoffs.

Key Insights

GPT-Realtime-2 features a 128K context window, four times the previous 32K limit, allowing for much longer conversational history (OpenAI, 2026).
Developers can now tune performance via five reasoning levels (minimal, low, medium, high, and xhigh) to optimize for either speed or depth (OpenAI, 2026).
On the Audio MultiChallenge benchmark, GPT-Realtime-2 (xhigh) scores 48.5%, a 13.8 percentage point gain over the 34.7% achieved by GPT-Realtime-1.5 (OpenAI, 2026).
GPT-Realtime-Translate supports live speech conversion for 70+ input languages into 13 output languages at a cost of $0.034 per minute.
GPT-Realtime-Whisper provides streaming transcription with controllable latency, enabling real-time text generation as users speak.
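
The figures above lend themselves to a quick back-of-the-envelope check. The following is a minimal sketch, not an official pricing calculator: it assumes GPT-Realtime-Translate is billed linearly at the quoted $0.034 per minute (the article does not describe rounding or minimum charges), and the function names are hypothetical. It uses Decimal so the arithmetic on prices and scores stays exact.

```python
from decimal import Decimal

# Per-minute price for GPT-Realtime-Translate as quoted in the article.
TRANSLATE_USD_PER_MINUTE = Decimal("0.034")


def translate_cost_usd(minutes: float) -> Decimal:
    """Estimate translation cost, ASSUMING simple linear per-minute billing."""
    return TRANSLATE_USD_PER_MINUTE * Decimal(str(minutes))


def point_gain(new_score: str, old_score: str) -> Decimal:
    """Percentage-point difference between two benchmark scores."""
    return Decimal(new_score) - Decimal(old_score)


if __name__ == "__main__":
    # Cost of interpreting a 90-minute event under the linear-billing assumption.
    print("90-minute session (USD):", translate_cost_usd(90))
    # Audio MultiChallenge: GPT-Realtime-2 (xhigh) versus GPT-Realtime-1.5.
    print("Audio MultiChallenge gain (points):", point_gain("48.5", "34.7"))
    # Big Bench Audio score for 1.5 implied by 96.6% minus the 15.2-point gain;
    # the article does not state this baseline directly.
    print("Implied 1.5 Big Bench Audio score:", Decimal("96.6") - Decimal("15.2"))
```

Under that billing assumption, a 90-minute session comes to about $3.06, and the Audio MultiChallenge gap works out to 13.8 percentage points.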

Practical Applications

Complex Voice Agents: Using GPT-Realtime-2 for healthcare or travel booking, where multi-step reasoning and parallel tool calling are required. Pitfall: Using high reasoning levels for simple customer lookups, which adds unnecessary latency and cost.
Live Event Interpretation: Deploying GPT-Realtime-Translate for bilingual event streaming. Pitfall: Using the dedicated translation model for tasks that require conversational context or function calling, neither of which it supports.
Real-time Captioning: Implementing GPT-Realtime-Whisper for live broadcast transcripts. Pitfall: Setting the latency delay too low, which can reduce transcription accuracy for technical terminology.
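
The pitfalls above amount to a routing decision: pick the model that matches the task, then pick the lowest reasoning level that gets the job done. One way this could be sketched is shown below. Everything here is illustrative: the VoiceTask type, the function names, and the step thresholds are assumptions rather than part of any OpenAI SDK, and the model strings are the article's display names, not confirmed API identifiers.

```python
from dataclasses import dataclass

# The five reasoning levels named in the article, ordered fastest to deepest.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


@dataclass(frozen=True)
class VoiceTask:
    """Hypothetical description of a voice workload (not an OpenAI type)."""
    kind: str                  # "agent", "translate", or "transcribe"
    reasoning_steps: int = 1   # rough count of dependent steps in the task
    needs_tools: bool = False  # True if function calling is required


def choose_model(task: VoiceTask) -> str:
    """Return the article's model name (a display label, not a confirmed API id)."""
    if task.kind == "translate" and not task.needs_tools:
        return "GPT-Realtime-Translate"
    if task.kind == "transcribe":
        return "GPT-Realtime-Whisper"
    # Agent work, and translation that also needs tool calls, goes to the flagship.
    return "GPT-Realtime-2"


def choose_effort(task: VoiceTask) -> str:
    """Map task complexity to a reasoning level using ILLUSTRATIVE thresholds."""
    if task.reasoning_steps <= 1 and not task.needs_tools:
        return "minimal"
    if task.reasoning_steps <= 2:
        return "low"
    if task.reasoning_steps <= 4:
        return "medium"
    if task.reasoning_steps <= 7:
        return "high"
    return "xhigh"


if __name__ == "__main__":
    lookup = VoiceTask(kind="agent")
    booking = VoiceTask(kind="agent", reasoning_steps=6, needs_tools=True)
    for task in (lookup, booking):
        print(task.kind, choose_model(task), choose_effort(task))
```

In this sketch a simple customer lookup stays at the minimal level, while a six-step booking flow with tool calls is routed to a high level. The thresholds would need tuning against measured latency and accuracy in a real deployment.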

References:

https://www.marktechpost.com/2026/05/08/openai-releases-three-realtime-audio-models-gpt-realtime-2-gpt-realtime-translate-and-gpt-realtime-whisper-in-the-realtime-api/
