Qwen Releases a Compact Custom-Voice TTS Model
The new 600-million-parameter model from Alibaba's Qwen team can clone voices from short audio clips for multilingual speech synthesis.
Alibaba's Qwen team has released Qwen3-TTS-12Hz-0.6B-CustomVoice, a new open-source text-to-speech model. At just 600 million parameters, the compact model brings custom voice cloning to the open-source audio landscape.
Developers can supply a short audio sample to create a digital version of a specific voice. The model can then use the cloned voice to synthesize new speech from text in multiple languages, which opens up personalized applications, custom voice assistants, and dynamic content creation.
A More Accessible Approach
The model's relatively small size makes it easier for researchers and developers to run and fine-tune than larger systems. The "12Hz" in its name likely refers to the frame rate of its internal audio representation, meaning roughly 12 frames for each second of audio. A low frame rate would suggest a design that balances quality against computational cost, making the model suitable for a wider range of hardware.
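To put the size claim in rough numbers, the sketch below estimates how much memory the weights alone would occupy at common precisions, and how many frames a clip would produce if "12Hz" does mean 12 frames per second. Both are back-of-envelope assumptions rather than published figures: the parameter count comes from the article, the frame-rate reading is a guess, and the helper names are made up for illustration.

```python
# Back-of-envelope estimates only; none of these numbers are measurements.

PARAMS = 600_000_000  # parameter count stated in the article

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_memory_gib(params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GiB (excludes activations and caches)."""
    return params * bytes_per_param / 2**30


def frames_for_audio(seconds: float, frame_rate_hz: float = 12.0) -> int:
    """Frame count for a clip, ASSUMING "12Hz" means 12 frames per second."""
    return round(seconds * frame_rate_hz)


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: ~{weight_memory_gib(PARAMS, nbytes):.2f} GiB of weights")

print(f"10 s of audio: ~{frames_for_audio(10)} frames at an assumed 12 Hz")
```

Actual memory use during inference will be higher than the weight estimate, since it also includes activations and any audio codec components.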
The release gives the open-source AI community a new option. High-quality open TTS models exist, but those with built-in, easy-to-use voice cloning are less common. The model is available for download on Hugging Face under the Qwen license, whose terms developers should review before any commercial use.
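For readers who want to fetch the files, a minimal sketch using the `huggingface_hub` package might look like the following. It assumes the package is installed and that the repository is publicly accessible under the ID listed in the article's sources. The `download_model` wrapper is a hypothetical helper, and the sketch covers downloading only, not loading the model or running inference.

```python
from typing import Optional

# Repository ID as listed in the article's sources.
REPO_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"


def download_model(local_dir: Optional[str] = None) -> str:
    """Download all files in the repository and return the local folder path.

    Hypothetical helper for illustration; requires network access.
    """
    # Imported lazily so the module can be read without the package installed.
    from huggingface_hub import snapshot_download

    # With local_dir=None the files go to the default Hugging Face cache.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_model()` would return the path of the downloaded snapshot. The repository's model card is the place to check for the supported inference code and the license text.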
Sources
- Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice (Hugging Face)