Skip to main content

On This Page

Maya1: A New Open Source 3B Voice Model For Expressive Text To Speech On A Single GPU

2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Maya1: A New Open Source 3B Voice Model For Expressive Text To Speech On A Single GPU

Maya Research has released Maya1, a 3B parameter text-to-speech model that generates expressive audio in real time on a single GPU. It outperforms proprietary systems while remaining open source under Apache 2.0.

Why This Matters

Traditional text-to-speech systems often require high computational resources or proprietary APIs, limiting accessibility and scalability. Maya1 addresses this by using a neural audio codec (SNAC) to predict discrete tokens instead of raw waveforms, reducing memory overhead and enabling real-time streaming. This approach lowers deployment costs and broadens use cases for developers and businesses.

Key Insights

  • “Maya1 uses SNAC codec tokens for efficient generation, 2025”: The model predicts 7 tokens per 24 kHz audio frame using SNAC, achieving 0.98 kbps streaming efficiency.
  • “XML-style voice descriptions for natural control”: The model accepts free-form text like “Female voice in her 20s with a British accent, energetic” instead of rigid parameters.
  • “Hugging Face Space demo for interactive use, 2025”: An interactive demo allows users to input text and voice descriptions to generate audio instantly.

Working Example

from transformers import AutoModelForCausalLM
from snac import SNAC

# Load Maya1 model and SNAC decoder
model = AutoModelForCausalLM.from_pretrained("maya-research/maya1", torch_dtype="bfloat16", device_map="auto")
snac_decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

# Generate audio from text and voice description
input_text = "Hello, world! <laugh>"
voice_description = "Female voice, 20s, cheerful, clear diction"
audio = model.generate(voice_description, input_text)
decoded_audio = snac_decoder.decode(audio)

Practical Applications

  • Use Case: Interactive agents and games using Maya1 for real-time expressive TTS.
  • Pitfall: Overlooking the need for a GPU with 16GB+ VRAM, leading to deployment failures on underpowered hardware.

References:


Continue reading

Next article

Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual Speech Recognition Models for 1600+ Languages

Related Content