MOSS-TTS-Nano Delivers Multilingual Speech at 100M Params

The new open-source model from OpenMOSS-Team generates high-quality speech in multiple languages while maintaining a remarkably small footprint.

Apr 2, 2026

The field of open-source text-to-speech has a new, compact contender. A group known as OpenMOSS-Team has released MOSS-TTS-Nano, a generative audio model with just 100 million parameters designed for high-quality, multilingual speech synthesis.

The model's key feature is its linguistic flexibility. It officially supports English, Mandarin Chinese, and Cantonese, but its most notable capability is handling mixed-language sentences—a common challenge for speech models. This allows it to generate natural-sounding audio from text that switches between languages, such as "give me a cup of 拿铁."
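To make the mixed-language case concrete, here is a minimal Python sketch that splits a code-switched sentence into runs by script. It only illustrates what such input looks like to a text front end. It is not how MOSS-TTS-Nano processes text, which this article does not describe, and the helper names `script_of` and `split_by_script` are hypothetical.

```python
def script_of(ch: str) -> str:
    """Coarsely classify a character as 'cjk', 'latin', or 'other'."""
    # CJK Unified Ideographs block; covers common Chinese characters only.
    if "\u4e00" <= ch <= "\u9fff":
        return "cjk"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (script, run) pairs, attaching spaces and
    punctuation to whichever run precedes them."""
    runs: list[list[str]] = []
    for ch in text:
        kind = script_of(ch)
        if kind == "other" and runs:
            # Spaces/punctuation stay with the current run.
            runs[-1][1] += ch
        elif runs and runs[-1][0] == kind:
            runs[-1][1] += ch
        else:
            runs.append([kind, ch])
    # Trim whitespace and drop runs that were whitespace only.
    return [(kind, run.strip()) for kind, run in runs if run.strip()]


if __name__ == "__main__":
    print(split_by_script("give me a cup of 拿铁"))
    # [('latin', 'give me a cup of'), ('cjk', '拿铁')]
```

A pipeline that routes each run to a separate per-language voice tends to produce audible seams at the boundary, which is why handling the switch inside one model is notable.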

At just 100M parameters, the model earns the 'Nano' in its name. Its small size makes MOSS-TTS-Nano a compelling option where computational resources are limited, such as on-device assistants, embedded systems, and other edge computing scenarios. It offers an efficient alternative to larger, cloud-dependent text-to-speech APIs.
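For a rough sense of scale, the Python sketch below estimates how much storage 100 million weights need at common numeric precisions. The bytes-per-parameter figures are standard sizes for each format. Which format the released checkpoint uses is an assumption left open here, and the estimate ignores activations, any audio codec, and runtime overhead.

```python
# Rough weight-storage estimate for a 100M-parameter model.
# The article does not state the checkpoint's numeric format,
# so several common formats are listed for comparison.
PARAMS = 100_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, bytes_per_param: int) -> float:
    """Return weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


def footprint_table(num_params: int = PARAMS) -> dict[str, float]:
    """Map each numeric format to its estimated weight size in MB."""
    return {
        name: weight_megabytes(num_params, size)
        for name, size in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for name, mb in footprint_table().items():
        print(f"{name}: {mb:.0f} MB")
    # float32: 400 MB
    # float16: 200 MB
    # int8: 100 MB
```

Even at full 32-bit precision the weights come to about 400 MB, which is within reach of many phones and single-board computers.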

The model is available for download from the team's Hugging Face repository. It's released under a Creative Commons CC BY-NC-SA 4.0 license, which permits academic and personal use but restricts commercial applications.

Sources

OpenMOSS-Team/MOSS-TTS-Nano-100M (Hugging Face)
