Fish Audio S2-Pro: High-Fidelity TTS with Dual-AR Architecture and Sub-150ms Latency

Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion

Fish Audio has released S2-Pro, a flagship Large Audio Model (LAM) capable of high-fidelity, multi-speaker synthesis. The system achieves a Time to First Audio (TTFA) of approximately 100ms on NVIDIA H200 hardware.

Why This Matters

Traditional modular TTS pipelines often trade audio quality against generation speed, which creates a bottleneck for real-time interactive agents. S2-Pro’s integrated architecture combines a Dual-AR design with RadixAttention to reduce latency while producing 44.1kHz audio, and the model was trained on more than 300,000 hours of audio data. This move from modular pipelines to integrated Large Audio Models (LAMs) marks a shift toward open architectures that offer granular emotional control without explicit fine-tuning.

Key Insights

Dual-AR Architecture: Splits generation between a 4B-parameter ‘Slow AR’ model that handles linguistic structure and a 400M-parameter ‘Fast AR’ model that handles acoustic refinement.
Residual Vector Quantization (RVQ): Compresses raw 44.1kHz audio into stacked layers of discrete codes, so that fine textures such as breaths and sighs survive reconstruction.
Zero-Shot Voice Cloning: Uses In-Context Learning (ICL), placing a 10-30 second reference clip as a prefix in the context window so the model adopts the speaker's timbre.
Dynamic Emotional Control: Supports natural-language inline tags such as [whisper] or [laugh] to adjust pitch and intensity in real time.
RadixAttention Optimization: Integrated with SGLang to cache the KV states of master voice prompts, which reduces prefill overhead for repeated speakers.
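
To make the RVQ idea above concrete, here is a minimal pure-Python sketch of residual vector quantization on a single 2-D frame. It is illustrative only and is not Fish Audio's codec: the two hand-written codebooks, the frame values, and the function names (`rvq_encode`, `rvq_decode`) are assumptions made for this example. Real codecs learn their codebooks and quantize learned latent frames rather than raw numbers.

```python
from typing import List, Sequence

# Hypothetical, hand-written codebooks (real systems learn these).
# Stage 0 is coarse, stage 1 refines what stage 0 left over.
CODEBOOKS: List[List[List[float]]] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25], [0.0, 0.0]],
]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(frame: Sequence[float], codebooks=CODEBOOKS) -> List[int]:
    """Quantize `frame` stage by stage, each stage coding the previous residual."""
    residual = list(frame)
    codes: List[int] = []
    for codebook in codebooks:
        # Pick the entry closest to what is still unexplained.
        idx = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], residual))
        codes.append(idx)
        # Subtract the chosen entry; the next stage only sees the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks=CODEBOOKS) -> List[float]:
    """Reconstruct a frame by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


frame = [1.2, 0.9]                # made-up input frame
codes = rvq_encode(frame)         # one small integer per stage
approx = rvq_decode(codes)        # coarse entry plus refinement entry
```

Each extra stage adds one more integer per frame and shrinks the reconstruction error, which is why stacked layers can carry fine acoustic detail at a fixed token budget.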

Practical Applications

Use Case: Real-time conversational AI using SGLang and NVIDIA H200 hardware to achieve sub-150ms latency for live agents. Pitfall: Failing to use RadixAttention for master voice prompts results in redundant computation and increased prefill time.
Use Case: Multi-speaker narration generated in a single inference pass by including multiple identities in one context window. Pitfall: Providing reference clips outside the 10-30 second optimal range may degrade voice cloning accuracy.
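
The prefill pitfall in the first use case can be illustrated with a toy prefix cache. This sketch is not SGLang's API or its RadixAttention implementation: it uses a plain token trie instead of a compressed radix tree, stores no real KV tensors, and the class name `PrefixCache` and the token IDs are hypothetical. It only shows why a shared voice-prompt prefix needs to be computed once.

```python
from typing import Dict, List, Sequence


class _Node:
    """One cached token position; a real system would hold KV tensors here."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class PrefixCache:
    """Toy token trie that tracks which request prefixes were already computed."""

    def __init__(self) -> None:
        self._root = _Node()

    def match_len(self, tokens: Sequence[int]) -> int:
        """Length of the longest prefix of `tokens` already in the cache."""
        node, matched = self._root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def prefill(self, tokens: Sequence[int]) -> int:
        """Return how many tokens still need computing, then cache the request."""
        cached = self.match_len(tokens)
        node = self._root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())
        return len(tokens) - cached


# Hypothetical token IDs standing in for an encoded master voice prompt.
voice_prompt: List[int] = list(range(100, 110))
cache = PrefixCache()

first = cache.prefill(voice_prompt + [1, 2, 3])   # cold cache: everything is computed
second = cache.prefill(voice_prompt + [4, 5])     # voice prompt is reused from cache
```

In this toy run the first request computes all 13 tokens and the second computes only the 2 tokens after the shared voice prompt. Without any prefix reuse, the second request would recompute the 10 prompt tokens as well.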

References:

https://www.marktechpost.com/2026/03/10/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion/
