Fish Audio S2-Pro: High-Fidelity TTS with Dual-AR Architecture and Sub-150ms Latency

These articles are AI-generated summaries. Please check the original sources for full details.

Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion

Fish Audio has released S2-Pro, a flagship Large Audio Model (LAM) capable of high-fidelity, multi-speaker synthesis. The system achieves a Time to First Audio (TTFA) of approximately 100ms on NVIDIA H200 hardware.
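
TTFA is the delay between submitting a request and receiving the first playable audio chunk. The sketch below shows one way to measure it for any streaming source. `fake_tts_stream` is a hypothetical stand-in that sleeps and yields silent bytes; it is not the S2-Pro API, and its delay value is arbitrary rather than a benchmark figure.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (not a real API)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # simulated model compute per chunk
        yield b"\x00" * 1024     # placeholder audio bytes


def measure_ttfa(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunk count)."""
    start = time.perf_counter()
    ttfa = None
    n_chunks = 0
    for _chunk in stream:
        if ttfa is None:
            # The first chunk marks the moment playback could begin.
            ttfa = time.perf_counter() - start
        n_chunks += 1
    if ttfa is None:
        raise ValueError("stream produced no audio")
    return ttfa, n_chunks


ttfa_s, total_chunks = measure_ttfa(fake_tts_stream())
print(f"TTFA: {ttfa_s * 1000:.1f} ms over {total_chunks} chunks")
```

Because the generator is lazy, the simulated compute happens inside the timed loop, so the first-chunk delay is included in the measurement.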

Why This Matters

Traditional modular TTS pipelines often trade audio quality against generation speed, which creates a bottleneck for real-time interactive agents. S2-Pro’s integrated architecture combines a Dual-AR approach with RadixAttention to minimize latency while maintaining 44.1kHz fidelity; the model was trained on more than 300,000 hours of audio. This move from modular pipelines to integrated Large Audio Models (LAMs) represents a shift toward open architectures capable of granular emotional control without explicit fine-tuning.

Key Insights

  • Dual-AR Architecture: Splits the work between a 4B-parameter ‘Slow AR’ for linguistic structure and a 400M-parameter ‘Fast AR’ for acoustic refinement.
  • Residual Vector Quantization (RVQ): Compresses raw 44.1kHz audio into discrete layers to reconstruct high-fidelity textures including breaths and sighs.
  • Zero-Shot Voice Cloning: Employs In-Context Learning (ICL) using 10-30 second reference clips as context window prefixes to adopt speaker timbre.
  • Dynamic Emotional Control: Supports natural language inline tags such as [whisper] or [laugh] to adjust pitch and intensity in real time.
  • RadixAttention Optimization: Integrated with SGLang to cache KV states of master voice prompts, drastically reducing prefill overhead for repeated speakers.
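
The RVQ idea in the list above can be sketched in a few lines: each layer quantizes whatever error the previous layers left behind, so later layers add progressively finer detail. This is a minimal illustration of the general technique, not S2-Pro's codec. The 2-D codebooks and the input frame are invented for the example, and each toy codebook includes the zero vector so that adding a layer can never increase the error.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from each layer used."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Invented toy codebooks: coarse, medium, fine (assumption, not real data).
CODEBOOKS = [
    [(0, 0), (1, 0), (0, 1), (1, 1), (-1, 0), (0, -1)],
    [(0, 0), (0.25, 0), (0, 0.25), (-0.25, 0), (0, -0.25), (0.25, 0.25)],
    [(0, 0), (0.05, 0), (0, 0.05), (-0.05, 0), (0, -0.05)],
]
FRAME = (0.8, 0.3)  # stand-in for one latent audio frame

codes = rvq_encode(FRAME, CODEBOOKS)
for k in range(1, len(CODEBOOKS) + 1):
    approx = rvq_decode(codes[:k], CODEBOOKS[:k])
    print(f"layers={k} squared_error={sq_error(FRAME, approx):.4f}")
```

Decoding with only the first few codes gives a coarser reconstruction, which is why the discrete layers can be read as a stack running from broad structure to fine texture.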

Practical Applications

  • Use Case: Real-time conversational AI using SGLang and NVIDIA H200 hardware to achieve sub-150ms latency for live agents. Pitfall: Failing to use RadixAttention for master voice prompts results in redundant computation and increased prefill time.
  • Use Case: Multi-speaker narration generated in a single inference pass by including multiple identities in one context window. Pitfall: Providing reference clips outside the 10-30 second optimal range may degrade voice cloning accuracy.
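
The first pitfall above comes down to prefix reuse: if every request begins with the same voice-prompt tokens, the work for that prefix only needs to be done once. The toy trie below counts how many tokens would still need prefill on each request. It is an illustration of the prefix-caching idea only, not SGLang's RadixAttention implementation or API; it stores no real KV tensors, and the token IDs are invented.

```python
class PrefixCache:
    """Toy token trie standing in for a KV cache (illustrative only)."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of tokens."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record the full sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def prefill(cache, tokens):
    """Return how many tokens must be computed, then cache the sequence."""
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


# Invented token IDs: a 1000-token voice prompt plus short text suffixes.
voice_prompt = list(range(1000))
cache = PrefixCache()
first = prefill(cache, voice_prompt + [5001, 5002, 5003])
second = prefill(cache, voice_prompt + [6001, 6002])
print(f"first request computes {first} tokens, second computes {second}")
```

In this sketch the second request only pays for its own text tokens, because the shared voice prompt is already cached.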
