Supertonic v3: On-Device TTS with 31-Language Support and Expressive Tags

Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags

Supertone has launched Supertonic v3, the third generation of its ONNX-based on-device text-to-speech system. The model expands language support from 5 to 31 language codes while keeping a compact footprint of approximately 99M parameters. The release also introduces expressive tags and a built-in text normalization engine that, in the reported comparison, reads technical units more reliably than major cloud-based competitors.

Why This Matters

Most high-fidelity TTS models depend on substantial cloud resources, with parameter counts often ranging from 0.7B to 2B, which makes edge deployment difficult. Supertonic v3 addresses this by using flow matching to produce usable audio in just 2 inference steps, significantly reducing memory and compute requirements compared to diffusion-based models. The built-in text normalization targets a common failure point: standard systems struggle with complex surface forms such as financial amounts ($5.2M) and technical abbreviations (30kph). Competitors such as ElevenLabs Flash v2.5 and OpenAI TTS-1 failed to process these inputs correctly, whereas Supertonic v3 reads them accurately without an external preprocessing pipeline.
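
To make the normalization problem concrete, here is a toy sketch of what expanding surface forms such as $5.2M and 30kph into speakable words involves. It is not Supertonic's normalization engine, whose rules are not described here; the lookup tables, regular expressions, and function names are all hypothetical and cover only a handful of cases.

```python
import re

# Hypothetical lookup tables for illustration; not Supertonic's actual rules.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}
UNITS = {"kph": "kilometers per hour", "mph": "miles per hour", "kg": "kilograms"}


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + int_to_words(rest)


def number_to_words(token: str) -> str:
    """Spell out a decimal string, e.g. '5.2' becomes 'five point two'."""
    whole, _, frac = token.partition(".")
    words = int_to_words(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return words


# Dollar amounts with an optional K/M/B suffix, and numbers followed by a known unit.
CURRENCY = re.compile(r"\$(\d{1,3}(?:\.\d+)?)([KMB])?\b")
UNIT = re.compile(r"\b(\d{1,3}(?:\.\d+)?)\s?(kph|mph|kg)\b")


def normalize(text: str) -> str:
    """Expand the currency and unit patterns above into plain words."""
    def currency(match):
        amount = number_to_words(match.group(1))
        scale = MAGNITUDES.get(match.group(2) or "", "")
        # Always uses the plural "dollars"; a real normalizer would handle "one dollar".
        return " ".join(part for part in (amount, scale, "dollars") if part)

    def unit(match):
        return number_to_words(match.group(1)) + " " + UNITS[match.group(2)]

    return UNIT.sub(unit, CURRENCY.sub(currency, text))


print(normalize("Revenue hit $5.2M."))
print(normalize("The limit is 30kph."))
```

Even this small sketch needs separate rules for magnitude suffixes, decimals, and unit abbreviations, which suggests why systems without a dedicated normalization stage tend to misread such inputs.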

Key Insights

Expanded language coverage from 5 to 31 ISO codes, including a special ‘na’ fallback for unknown text (Supertone, 2026).
Flow-matching architecture enables high-speed inference on CPU, achieving an average real-time factor (RTF) of 0.3x on an Onyx Boox Go 6 e-reader, meaning synthesis takes roughly 30% of the audio's playback duration.
Introduction of Length-Aware Rotary Position Embedding (LARoPE) and Self-Purifying Flow Matching to improve text-speech alignment and robustness against noisy labels.
Expressive tag support allows prosodic cues to be embedded as inline tags directly in the input text, without separate preprocessing.
Public ONNX assets occupy only 404 MB, making the system viable for browser and mobile environments via onnxruntime-web.
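
The few-step claim can be illustrated with a toy sampler. Flow-matching models generate a sample by integrating a velocity field from noise at t=0 to data at t=1; when the trajectories are close to straight lines, a coarse fixed-step Euler integration already lands near the endpoint. The sketch below assumes an idealized, hand-written velocity field in place of the learned network, and all names in it are hypothetical.

```python
def euler_sample(velocity, x0, num_steps=2):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_line_velocity(target):
    """Toy field whose trajectories are straight lines that end at `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.8, -1.3, 0.1]       # stand-in for the initial noise sample
target = [0.25, 0.5, -0.75]    # stand-in for the data point the flow leads to
sample = euler_sample(straight_line_velocity(target), noise, num_steps=2)
print(sample)
```

With perfectly straight paths, two Euler steps reach the target up to floating-point error. A learned velocity field is only approximately straight, so in practice the step count trades speed against quality.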

Working Examples

Minimal Python SDK example for synthesizing audio using the Supertonic v3 model.

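from supertonic import TTS

# Load the model; auto_download=True fetches the model assets if they are not available locally.
tts = TTS(auto_download=True)
# Select the preset voice style named "M1".
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
# Synthesize English speech; the call returns the waveform and its duration.
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
# Write the waveform to a WAV file.
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")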
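
To check a real-time factor (RTF) figure on your own hardware, a small timing helper is enough. The sketch below is an assumption-laden illustration: real_time_factor and fake_synthesize are hypothetical names, and the helper only requires a callable that returns a waveform and a duration in seconds as a plain float.

```python
import time


def real_time_factor(synthesize, text):
    """Return (rtf, audio_seconds) for a single call to synthesize(text).

    `synthesize` is any callable returning (waveform, duration_in_seconds).
    RTF is processing time divided by audio duration, so values below 1.0
    mean synthesis runs faster than playback.
    """
    start = time.perf_counter()
    _, audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds, audio_seconds


def fake_synthesize(text):
    """Stand-in synthesizer: pretends to produce 2 seconds of audio."""
    time.sleep(0.01)
    return [0.0] * 100, 2.0


rtf, seconds = real_time_factor(fake_synthesize, "hello")
print(f"RTF {rtf:.3f} for {seconds:.1f}s of audio")

# With the SDK objects from the example above, the call might look like this,
# assuming the returned duration can be used as a float:
# rtf, seconds = real_time_factor(
#     lambda t: tts.synthesize(t, voice_style=style, lang="en"), text)
```

A single run is noisy, so averaging over several sentences of different lengths gives a more representative figure.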

Practical Applications

Use case: E-ink e-readers (Onyx Boox Go 6) can perform local TTS in airplane mode at an RTF of about 0.3x. Pitfall: Attempting to run larger 2B-parameter models on such hardware typically results in excessive latency and memory exhaustion.
Use case: Automated financial reporting systems can correctly verbalize ‘$5.2M’ as ‘five point two million dollars’ using built-in text normalization. Pitfall: Relying on generic TTS systems like OpenAI TTS-1 or Gemini 2.5 Flash often leads to reading failures on technical units and currency formats.
Use case: Web applications using onnxruntime-web for pure client-side execution of voice interfaces. Pitfall: Neglecting to handle the ‘na’ fallback for unsupported languages, which could lead to inconsistent synthesis quality for unknown text inputs.
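
One way to handle the ‘na’ pitfall is to resolve the language code explicitly before synthesis and record when the fallback was used. This is a minimal sketch under stated assumptions: SUPPORTED_LANGS is a placeholder subset rather than the real list of 31 codes, and resolve_lang is a hypothetical helper, not part of the Supertonic SDK.

```python
# Placeholder subset for illustration; take the real list of supported codes
# from the Supertonic documentation.
SUPPORTED_LANGS = {"en", "ko", "es", "pt", "fr"}
FALLBACK_LANG = "na"  # the special fallback code for unknown text


def resolve_lang(detected):
    """Map a detected language tag to a supported code or to the fallback.

    Returns (code, used_fallback) so callers can log or flag fallback cases.
    """
    if detected:
        # Reduce tags such as "en-US" or "pt_BR" to their base language code.
        base = detected.strip().lower().replace("_", "-").split("-")[0]
        if base in SUPPORTED_LANGS:
            return base, False
    return FALLBACK_LANG, True


for tag in ["en-US", "pt_BR", "xx", None]:
    code, used_fallback = resolve_lang(tag)
    print(tag, "->", code, "(fallback)" if used_fallback else "")
```

Returning the fallback flag alongside the code lets an application warn the user or route the text elsewhere instead of silently accepting lower-quality output.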

References:

https://www.marktechpost.com/2026/05/15/supertone-releases-supertonic-v3-on-device-text-to-speech-model-with-31-language-support-fewer-reading-failures-and-expression-tags/
