MOSS-TTS: A New Multilingual Text-to-Speech Model
The new system from the OpenMOSS Team uses a novel 'delay-pattern' architecture to generate natural-sounding speech in Chinese, English, and Japanese.
The OpenMOSS Team has released MOSS-TTS, a new open-weight model for generating high-quality speech from text. The system is multilingual and can produce audio in Chinese, English, and Japanese, which makes it a candidate for voice applications that span those languages.
The model's key innovation lies in its architecture. MOSS-TTS is a non-autoregressive system built around a technique called a "delay pattern." The approach is designed to model the rhythm and prosody of speech more effectively than some traditional methods, which can yield more natural-sounding intonation without generating audio one step at a time.
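The announcement does not spell out how the delay pattern works. In the wider literature on token-based audio generation, the term usually refers to shifting each parallel token stream one step later than the stream before it, so that the streams line up diagonally. The sketch below illustrates that general idea only. Whether MOSS-TTS applies its offsets in exactly this way is an assumption, and `apply_delay_pattern`, `undo_delay_pattern`, and `PAD` are hypothetical names chosen for illustration.

```python
from typing import List, Optional

# Hypothetical padding marker for positions with no token yet.
PAD = None

Grid = List[List[Optional[int]]]


def apply_delay_pattern(codes: Grid, pad: Optional[int] = PAD) -> Grid:
    """Shift stream k to the right by k steps.

    `codes` holds K parallel token streams of equal length T.
    The result holds K streams of length T + K - 1.
    """
    num_streams = len(codes)
    if num_streams == 0:
        return []
    num_steps = len(codes[0])
    if any(len(stream) != num_steps for stream in codes):
        raise ValueError("all streams must have the same length")

    delayed = []
    for k, stream in enumerate(codes):
        left = [pad] * k                       # delay of k steps
        right = [pad] * (num_streams - 1 - k)  # keep rows the same length
        delayed.append(left + list(stream) + right)
    return delayed


def undo_delay_pattern(delayed: Grid) -> Grid:
    """Remove the per-stream shift and recover the aligned grid."""
    num_streams = len(delayed)
    if num_streams == 0:
        return []
    num_steps = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)
```

In this toy grid, the first stream keeps its position, the second starts one step later, and the third starts two steps later. Undoing the shift returns the original aligned streams.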
A Two-Stage System
Like many modern text-to-speech systems, MOSS-TTS operates in two stages:
- First, a text-to-spectrogram model converts the input text into a mel-spectrogram, a time-frequency representation of the sound in which the frequency axis follows the perceptually motivated mel scale.
- Second, a HiFi-GAN vocoder takes this spectrogram and synthesizes it into a final audio waveform.
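The data flow between the two stages can be sketched as follows. This is a shape-only illustration, not the MOSS-TTS code: both stages are placeholders that return silence, and the names `text_to_mel`, `vocode`, and `synthesize`, along with the values of `N_MELS`, `HOP_LENGTH`, and `FRAMES_PER_CHAR`, are assumptions made for the example.

```python
from typing import List

N_MELS = 80          # assumed number of mel frequency bands per frame
HOP_LENGTH = 256     # assumed audio samples produced per spectrogram frame
FRAMES_PER_CHAR = 5  # toy duration rule; a real model predicts durations

Spectrogram = List[List[float]]  # indexed as [frame][mel_band]


def text_to_mel(text: str) -> Spectrogram:
    """Stage 1 placeholder: map text to a [frames][N_MELS] grid.

    A trained acoustic model would fill the grid with predicted
    energies; this stand-in fills it with zeros.
    """
    num_frames = len(text) * FRAMES_PER_CHAR
    return [[0.0] * N_MELS for _ in range(num_frames)]


def vocode(mel: Spectrogram) -> List[float]:
    """Stage 2 placeholder: map a spectrogram to waveform samples.

    A neural vocoder would synthesize audio here; this stand-in
    emits HOP_LENGTH silent samples per frame.
    """
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Run both stages in sequence: text, then spectrogram, then audio."""
    return vocode(text_to_mel(text))


mel = text_to_mel("hello")
audio = synthesize("hello")
```

The point of the split is that the spectrogram is the only interface between the stages, so each one can be trained or replaced without touching the other.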
The complete model, along with instructions for use, is available on the Hugging Face Hub. While the weights are openly accessible, they are released under a custom license that prohibits commercial use, a key consideration for developers looking to integrate the technology.
Sources
- OpenMOSS-Team/MOSS-TTS (Hugging Face)