
Qwen Releases a Compact Custom-Voice TTS Model

The new 600-million-parameter model from Alibaba's Qwen team can clone voices from short audio clips for multilingual speech synthesis.

Jan 21, 2026


Alibaba's Qwen team has released Qwen3-TTS-12Hz-0.6B-CustomVoice, a new open-source text-to-speech model. With just 600 million parameters, the compact model brings custom voice cloning to the open-source audio landscape.

With this capability, developers can use a short audio sample to create a digital version of a specific voice. The model can then use this cloned voice to synthesize new speech from text in multiple languages, opening up possibilities for personalized applications, custom voice assistants, and dynamic content creation.

A More Accessible Approach

The model's relatively small size makes it easier for researchers and developers to run and fine-tune than larger or proprietary systems. The "12Hz" in its name likely refers to the frame rate of its internal audio representation, which would suggest a design that balances quality against computational cost and suits a wider range of hardware.
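To put the "small size" claim in rough numbers, the sketch below estimates how much memory the weights alone would occupy at common numeric precisions. It assumes exactly 600 million parameters, as stated above, and the standard byte widths for each precision. It ignores activations, any separate audio tokenizer or codec, and framework overhead, so real usage will be higher. The function and variable names are illustrative and not part of any Qwen tooling.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: "0.6B" is taken as exactly 600 million parameters.
NUM_PARAMS = 600_000_000

# Bytes per parameter for common numeric precisions.
PRECISIONS = {"float32": 4, "float16/bfloat16": 2, "int8": 1}

estimates = {
    name: weight_memory_gib(NUM_PARAMS, size)
    for name, size in PRECISIONS.items()
}

for name, gib in estimates.items():
    print(f"{name:>17}: {gib:.2f} GiB")
```

Under these assumptions, the weights need a little over 1 GiB at 16-bit precision, which is consistent with the article's point that the model should fit on modest consumer hardware.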

High-quality open-source TTS models already exist, but those with built-in, easy-to-use voice cloning are less common, which makes this release a useful addition for the open-source AI community. The model is available for download on Hugging Face under the Qwen license; developers should review its terms before commercial use.

Sources

Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice (Hugging Face)


