MOSS-TTS: A New Multilingual Text-to-Speech Model

The new system from the OpenMOSS Team uses a novel 'delay-pattern' architecture to generate natural-sounding speech in Chinese, English, and Japanese.

Feb 6, 2026

The OpenMOSS Team has released MOSS-TTS, a new open-source model for generating high-quality speech from text. The system is multilingual, capable of producing audio in Chinese, English, and Japanese, making it a versatile tool for a range of voice applications.

The model's key innovation is its architecture. MOSS-TTS is described as a non-autoregressive system built around a technique called a "delay-pattern." The approach is intended to model the rhythm and prosody of speech more effectively than some traditional methods, which can yield more natural-sounding intonation without generating audio one step at a time.
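The article does not define the delay pattern. In multi-stream audio token models, the term usually means offsetting each parallel token stream by a growing number of steps, so that stream k at a given position lines up with earlier frames of the streams above it. The sketch below illustrates only that general idea. It is an assumption about what the term means here, not MOSS-TTS's confirmed implementation; the function names, the PAD value, and the toy token values are all hypothetical.

```python
PAD = -1  # hypothetical placeholder meaning "no token at this position"


def apply_delay_pattern(streams, pad=PAD):
    """Shift stream k right by k steps, padding so all rows share one length.

    streams: list of K lists, each holding T tokens for the same T frames.
    Returns K rows of length T + K - 1.
    """
    num_streams = len(streams)
    return [
        [pad] * k + list(stream) + [pad] * (num_streams - 1 - k)
        for k, stream in enumerate(streams)
    ]


def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern by slicing each row back to its T tokens."""
    num_streams = len(delayed)
    num_frames = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy example: 3 streams, 4 frames each (values are illustrative only).
streams = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]
delayed = apply_delay_pattern(streams)
for row in delayed:
    print(row)
```

Reading the delayed grid column by column, each column holds frame t of the first stream next to frame t-1 of the second and frame t-2 of the third, which is the staggering the term refers to under this assumption.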

A Two-Stage System

Like many modern text-to-speech systems, MOSS-TTS operates in two stages:

First, a text-to-spectrogram model converts the input text into a mel-spectrogram, a time-frequency representation of the sound in which frequencies are spaced on the perceptual mel scale.
Second, a HiFi-GAN vocoder takes this spectrogram and synthesizes it into a final audio waveform.
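To make the mel-spectrogram in the first stage more concrete, here is a minimal sketch of how frequency bands are laid out on the mel scale. It uses one widely used conversion formula (2595 * log10(1 + f / 700)); whether MOSS-TTS uses this exact variant, and what band count or frequency range it uses, is not stated in the article. The function names, the 8 bands, and the 8 kHz upper limit are assumptions chosen for illustration.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list:
    """Return n_mels + 2 band-edge frequencies in Hz.

    The edges are evenly spaced on the mel scale, so the bands
    get wider in Hz as frequency increases.
    """
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + i * step) for i in range(n_mels + 2)]


# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(n_mels=8, f_min=0.0, f_max=8000.0)
for left, right in zip(edges, edges[1:]):
    print(f"{left:8.1f} Hz to {right:8.1f} Hz")
```

The narrow bands at low frequencies and wide bands at high frequencies mirror how human hearing resolves pitch, which is why this representation is a common intermediate target before a vocoder produces the waveform.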

The complete model, along with instructions for use, is available on the Hugging Face Hub. While the weights are openly accessible, they are released under a custom license that prohibits commercial use, a key consideration for developers looking to integrate the technology.

Sources

OpenMOSS-Team/MOSS-TTS
Hugging Face
