MOSS-TTS Aims for More Robust Speech Synthesis
A new text-to-speech model introduces 'delay-pattern decoding' to reduce the word-skipping and repetition errors common in parallel generation.
The OpenMOSS team has released MOSS-TTS v1.5, a new model for multilingual text-to-speech synthesis. Built on the VITS architecture, the model is designed to generate natural-sounding speech in Chinese, English, and Japanese.
Many modern TTS systems use parallel decoding to generate audio quickly, but this approach can lead to instability, causing the model to occasionally skip or repeat words. This lack of reliability is a significant hurdle for using such models in production applications.
To address this, MOSS-TTS introduces a technique its creators call "delay-pattern decoding." The mechanism is designed to reduce skipped and repeated words, which the team says makes the model's output more consistent.
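The release notes summarized here do not spell out how the mechanism works. As a rough illustration only, the sketch below shows the delay pattern as it is commonly used in multi-codebook audio token generation (for example in MusicGen): codebook k is shifted right by k steps, so every codebook can be predicted in one pass per step while each token is still produced after the lower codebooks of the same frame. This is an assumption about what the name refers to, not the OpenMOSS implementation; the function names and the PAD value are hypothetical.

```python
from typing import List

PAD = -1  # hypothetical padding id, chosen only for this illustration


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps.

    codes[k][t] is the token of codebook k at audio frame t.
    In the result, that token sits at decoding step t + k.
    """
    num_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern, realigning all codebooks per frame."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Three codebooks, three audio frames.
codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
restored = undo_delay_pattern(delayed)
```

Reading the delayed grid column by column gives the tokens a decoder would emit at each step: the first column holds only codebook 0 of frame 0, and codebook 1 of that frame follows one step later.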
The model is available on the Hugging Face Hub for community use. While the code is licensed under the permissive MIT license, the model weights themselves are restricted to non-commercial research purposes.
Sources
- OpenMOSS-Team/MOSS-TTS-v1.5 (Hugging Face)