MOSS-TTS Aims for More Robust Speech Synthesis

A new text-to-speech model introduces 'delay-pattern decoding' to reduce the word skipping and repetition errors common in parallel generation.

May 25, 2026

The OpenMOSS team has released MOSS-TTS v1.5, a new model for multilingual text-to-speech synthesis. Built on the VITS architecture, the model is designed to generate natural-sounding speech in Chinese, English, and Japanese.

Many modern TTS systems use parallel decoding to generate audio quickly, but this approach can lead to instability, causing the model to occasionally skip or repeat words. This lack of reliability is a significant hurdle for using such models in production applications.

To address this, MOSS-TTS introduces a technique its creators call "delay-pattern decoding." The mechanism is designed to reduce the skipped and repeated words that parallel decoding can produce, which the team says makes the model's output more consistent.
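
The release notes summarized here do not spell out how delay-pattern decoding works inside MOSS-TTS. As a rough illustration only, the sketch below shows the delay pattern as it is commonly described for multi-codebook audio token models: row k of a codebook-by-frame token grid is shifted right by k steps, so each decoding step still emits one token per codebook while later codebooks lag behind earlier ones. Whether MOSS-TTS uses this exact layout is an assumption; the function names, the padding id, and the toy token values are all hypothetical.

```python
PAD = -1  # hypothetical padding id, not taken from MOSS-TTS


def apply_delay(codes, pad=PAD):
    """Shift codebook row k right by k steps, padding both ends.

    `codes` is a list of K rows (one per codebook), each with T tokens.
    The result has K rows of length T + K - 1.
    """
    num_books = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_books - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Invert apply_delay, recovering the original K x T grid."""
    num_books = len(delayed)
    num_frames = len(delayed[0]) - (num_books - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy grid: 3 codebooks, 4 audio frames (values are placeholders).
codes = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]
delayed = apply_delay(codes)
restored = undo_delay(delayed)

for row in delayed:
    print(row)
```

In this layout, the column generated at step t holds frame t for codebook 0, frame t-1 for codebook 1, and so on, which is what lets a model emit all codebooks in parallel at each step without predicting a frame's tokens fully independently of one another.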

The model is available on the Hugging Face Hub for community use. The code is released under the permissive MIT license, but the model weights are restricted to non-commercial research use.

Sources

OpenMOSS-Team/MOSS-TTS-v1.5 (Hugging Face)
