Qwen Releases Open 1.7B Custom Voice Synthesis Model
Alibaba's Qwen team has released a new text-to-speech model capable of cloning voices from just a few seconds of audio.
The Qwen team at Alibaba has released Qwen3-TTS, a new open-source text-to-speech (TTS) model. The 1.7-billion-parameter model is designed to generate high-quality speech from text and is available under the permissive Apache 2.0 license, which allows commercial use.
The standout feature of the new model is its ability to perform custom voice cloning. According to the release documentation, developers can use a short audio clip, typically between 3 and 10 seconds long, as a reference to synthesize speech in that specific voice. This capability opens up a wide range of applications for personalized and dynamic audio content.
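As a rough illustration of the reference-clip requirement, a developer might check a clip's length before passing it to the model. The sketch below uses only Python's standard `wave` module and is not part of the Qwen release. The function names and the hard 3 and 10 second bounds are assumptions drawn from the range quoted above, and it handles uncompressed WAV files only.

```python
import wave

# Assumed bounds, taken from the "3 to 10 seconds" guidance quoted in the article.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(duration_seconds: float) -> bool:
    """Check whether a clip length falls inside the assumed reference range."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS
```

Calling `is_usable_reference(wav_duration_seconds("reference.wav"))` would flag clips that are too short or too long before any synthesis is attempted. The file name here is a placeholder.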
Technical Details
The model, named Qwen3-TTS-12Hz-1.7B-CustomVoice, operates on a two-stage process. First, a text-to-acoustic model generates an intermediate audio representation from the input text and a voice embedding derived from the reference audio. Then, a vocoder converts this representation into the final audio waveform. The "12Hz" in its name refers to its tokenization rate, which indicates that audio is represented at roughly 12 token frames per second.
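To get a feel for what a 12 Hz tokenization rate implies, the arithmetic can be sketched in a few lines. This assumes the "12Hz" label means about 12 token frames per second of audio. The exact rate, and how many tokens each frame carries, are not stated in the article, so the constant and function name below are illustrative only.

```python
import math

# Assumption: nominal frame rate read off the "12Hz" in the model name.
NOMINAL_FRAME_RATE_HZ = 12.0


def estimated_frames(duration_seconds: float,
                     rate_hz: float = NOMINAL_FRAME_RATE_HZ) -> int:
    """Estimate how many token frames cover a clip of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Round up so a partial final frame still counts.
    return math.ceil(duration_seconds * rate_hz)
```

Under that assumption, a 10-second reference clip corresponds to about 120 frames and a 3-second clip to about 36, which is a far shorter sequence than the raw waveform samples would be.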
This release adds a powerful new tool to the growing ecosystem of open-source generative audio. By providing a capable, permissively licensed voice cloning model, the Qwen team is enabling developers to build more sophisticated and personalized voice applications, from custom assistants to accessibility tools. The model and usage instructions are available on the Hugging Face Hub.
Sources
- Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice (Hugging Face)