Supertonic v3: On-Device TTS with 31-Language Support and Expressive Tags

These articles are AI-generated summaries. Please check the original sources for full details.

Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags

Supertone has launched Supertonic v3, the third generation of its ONNX-based on-device text-to-speech system. The model expands language support from 5 to 31 language codes while keeping a compact footprint of approximately 99M parameters. The release also introduces expressive tags and a built-in text normalization engine that reportedly handles technical units better than major cloud-based competitors.

Why This Matters

Most high-fidelity TTS models have parameter counts of roughly 0.7B to 2B and depend on significant cloud resources, which makes edge deployment difficult. Supertonic v3 addresses this with flow matching, which produces usable audio in just 2 inference steps and so needs far less memory and compute than diffusion-based models. Its built-in text normalization targets a common failure point: complex surface forms such as financial amounts ($5.2M) and technical abbreviations (30kph). Where competitors such as ElevenLabs Flash v2.5 and OpenAI TTS-1 failed to process these inputs correctly, Supertonic v3 reads them accurately without an external preprocessing pipeline.
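
To make the normalization problem concrete, here is a minimal sketch of the kind of rewriting involved. It is a toy, not Supertonic's normalizer: the function names, the regular expressions, and the small `SCALES` and `UNITS` tables are all assumptions made for illustration, and it only covers dollar amounts and a few unit abbreviations with values below 1000.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

# Hypothetical lookup tables; a production normalizer would cover far more cases.
SCALES = {"K": "thousand", "M": "million", "B": "billion"}
UNITS = {"kph": "kilometers per hour", "km": "kilometers", "kg": "kilograms"}


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + int_to_words(rest)


def number_to_words(num: str) -> str:
    """Spell out a decimal string such as '5.2', reading the fraction digit by digit."""
    whole, _, frac = num.partition(".")
    words = int_to_words(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return words


def normalize(text: str) -> str:
    """Expand dollar amounts and unit abbreviations into speakable words."""
    def money(m: re.Match) -> str:
        scale = SCALES.get(m.group(2) or "", "")
        parts = (number_to_words(m.group(1)), scale, "dollars")
        return " ".join(p for p in parts if p)

    def unit(m: re.Match) -> str:
        return number_to_words(m.group(1)) + " " + UNITS[m.group(2)]

    number = r"(\d{1,3}(?:\.\d+)?)"
    unit_names = "|".join(sorted(UNITS, key=len, reverse=True))
    text = re.sub(r"\$" + number + r"([KMB])?\b", money, text)
    text = re.sub(r"\b" + number + r"\s?(" + unit_names + r")\b", unit, text)
    return text


print(normalize("Revenue hit $5.2M at 30kph."))
```

Even this small sketch has to decide where the currency word goes and how to read a decimal fraction, which hints at why a normalizer built into the model is attractive compared with maintaining such rules in every application.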

Key Insights

  • Expanded language coverage from 5 to 31 ISO codes, including a special ‘na’ fallback for unknown text (Supertone, 2026).
  • Flow-matching architecture enables high-speed inference on CPU, achieving an average real-time factor (RTF) of 0.3x on an Onyx Boox Go 6 e-reader.
  • Introduction of Length-Aware Rotary Position Embedding (LARoPE) and Self-Purifying Flow Matching to improve text-speech alignment and robustness against noisy labels.
  • Expressive tag support allows prosodic cues to be embedded as inline tags directly in the input text, without separate preprocessing.
  • Public ONNX assets occupy only 404 MB, making the system viable for browser and mobile environments via onnxruntime-web.
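
The few-step claim for flow matching can be illustrated with a generic sampler. The sketch below is not Supertonic's code: it integrates a velocity field from t = 0 to t = 1 with fixed Euler steps, and it uses a hand-written straight-line velocity field as a stand-in for the trained network (which in a real TTS model would be conditioned on text and voice style). The names `euler_sample` and `make_straight_velocity` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t takes the values 0, dt, ..., 1 - dt and never reaches 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned model: points straight at a known target."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.0, 0.0]      # starting sample (would be random noise in practice)
target = [1.0, -2.0]    # what the toy field transports the sample to
print(euler_sample(make_straight_velocity(target), noise, num_steps=2))
```

Because the toy trajectory is a straight line, two Euler steps land on the target. A trained velocity field is only approximately straight, so the step count becomes a trade-off between speed and quality.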

Working Examples

Minimal Python SDK example for synthesizing audio using the Supertonic v3 model.

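from supertonic import TTS

# Load the model; auto_download=True fetches the ONNX assets if needed.
tts = TTS(auto_download=True)

# Select a voice style by name.
style = tts.get_voice_style(voice_name="M1")

text = "A gentle breeze moved through the open window while everyone listened to the story."

# Synthesize English speech; returns the waveform and its duration in seconds.
wav, duration = tts.synthesize(text, voice_style=style, lang="en")

# Write the waveform to disk and report how much audio was produced.
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")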
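
To relate a run like this to the quoted RTF figure, one could time the synthesis call and divide by the length of the audio produced. The helper below is a sketch under the usual definition of real-time factor (processing time divided by audio duration, so values below 1 are faster than real time); the article does not spell out how the 0.3x figure was measured. `real_time_factor`, `measure_rtf`, and `fake_synthesize` are hypothetical names, and the fake function stands in for a real synthesis call so that the block runs on its own.

```python
import time
from typing import Callable


def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], float], text: str) -> float:
    """Time one call to `synthesize`, which must return the audio duration in seconds."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


def fake_synthesize(text: str) -> float:
    """Stand-in for a real TTS call: pretends every character yields 0.05 s of audio."""
    time.sleep(0.01)
    return 0.05 * len(text)


# An RTF of 0.3 means 10 s of audio takes about 3 s to generate.
print(real_time_factor(3.0, 10.0))
print(measure_rtf(fake_synthesize, "A gentle breeze moved through the open window."))
```

With the SDK, the callable would wrap the synthesis call and return the reported duration. Averaging over several sentences gives a steadier number than a single run, since the first call often includes one-time warm-up cost.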

Practical Applications

  • Use case: E-ink e-readers (Onyx Boox Go 6) can perform local TTS in airplane mode with 0.3x RTF. Pitfall: Attempting to use larger 2B parameter models on such hardware typically results in excessive latency and memory exhaustion.
  • Use case: Automated financial reporting systems can correctly verbalize ‘$5.2M’ as ‘five point two million dollars’ using built-in text normalization. Pitfall: Relying on generic TTS systems like OpenAI TTS-1 or Gemini 2.5 Flash often leads to reading failures on technical units and currency formats.
  • Use case: Web applications using onnxruntime-web for pure client-side execution of voice interfaces. Pitfall: Neglecting to handle the ‘na’ fallback for unsupported languages, which could lead to inconsistent synthesis quality for unknown text inputs.
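
The last pitfall can be handled with a small guard in front of the synthesis call. The sketch below assumes the application keeps its own set of supported language codes; the source does not list the 31 codes, so `SUPPORTED_LANGS` here is a placeholder subset and `resolve_lang` is a hypothetical helper, not part of the Supertonic SDK.

```python
from typing import Iterable, Tuple

FALLBACK_LANG = "na"  # the special fallback code mentioned in the release

# Placeholder subset for illustration only; replace with the model's real list.
SUPPORTED_LANGS = {"en", "ko"}


def resolve_lang(requested: str, supported: Iterable[str]) -> Tuple[str, bool]:
    """Return (language code to use, whether the fallback was applied)."""
    # Reduce tags such as "en-US" or "en_US" to their base code "en".
    code = requested.strip().lower().replace("_", "-").split("-")[0]
    if code in set(supported):
        return code, False
    return FALLBACK_LANG, True


for tag in ["en-US", "ko", "xx"]:
    lang, used_fallback = resolve_lang(tag, SUPPORTED_LANGS)
    note = " (fallback: quality may vary)" if used_fallback else ""
    print(f"{tag} -> {lang}{note}")
```

Returning the fallback flag alongside the code lets the application log or surface a warning instead of silently producing lower-quality speech for unknown text.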
