Qwen · Alibaba | Text → Speech

Qwen Releases Open 1.7B Custom Voice Synthesis Model

Alibaba's Qwen team has released a new text-to-speech model capable of cloning voices from just a few seconds of audio.

Jan 21, 2026

Notable · Apache 2.0

The Qwen team at Alibaba has released Qwen3-TTS, a new open-source text-to-speech (TTS) model. At 1.7 billion parameters, this model is designed to generate high-quality speech from text and is available under the permissive Apache 2.0 license, allowing for commercial use.

The standout feature of the new model is its ability to perform custom voice cloning. According to the release documentation, developers can use a short audio clip, typically between 3 and 10 seconds long, as a reference to synthesize speech in that specific voice. This capability opens up a wide range of applications for personalized and dynamic audio content.
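As a rough illustration of the reference-clip requirement, the sketch below checks whether a WAV file falls inside the 3 to 10 second window before it is handed to a cloning model. It uses only Python's standard `wave` module. The function names and the decision to treat the window as a hard limit are assumptions made for this example, not part of the Qwen release.

```python
import io
import wave

# Bounds taken from the article's "typically between 3 and 10 seconds";
# treating them as hard limits is an assumption made for this sketch.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(wav_file) -> bool:
    """True if the clip length falls inside the 3-10 second window."""
    return MIN_SECONDS <= clip_duration_seconds(wav_file) <= MAX_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


print(is_usable_reference(make_silent_wav(5.0)))   # True
print(is_usable_reference(make_silent_wav(1.0)))   # False
```

In practice the same check would run on a real recording path, for example `is_usable_reference("speaker.wav")`, where the file name is a placeholder.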

Technical Details

The model, named Qwen3-TTS-12Hz-1.7B-CustomVoice, uses a two-stage process. First, a text-to-acoustic model generates an intermediate audio representation from the input text and a voice embedding derived from the reference audio. Then, a vocoder converts this representation into the final audio waveform. The "12Hz" in its name refers to its tokenization rate: speech is represented as roughly 12 token frames per second of audio, which keeps the sequences the model has to generate short.
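To make the tokenization rate concrete, the arithmetic below estimates how many token frames the first stage would produce for a clip, and how many waveform samples the vocoder would need to emit per frame. The 12 Hz figure is read off the model name and is nominal; the 24 kHz output sample rate is a hypothetical value chosen only for the example, and the function names are illustrative.

```python
import math

# Nominal rate read off the model name ("12Hz"); the exact value used by the
# released tokenizer may differ, so treat this as an assumption.
FRAME_RATE_HZ = 12
# Hypothetical output sample rate, chosen only to make the arithmetic concrete.
SAMPLE_RATE_HZ = 24_000


def frames_for(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames stage one must produce to cover the clip."""
    return math.ceil(seconds * frame_rate_hz)


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Waveform samples the vocoder must emit for each token frame."""
    return sample_rate_hz / frame_rate_hz


print(frames_for(10))        # 120
print(samples_per_frame())   # 2000.0
```

Under these assumptions, ten seconds of speech is only 120 frames for the first stage, while the vocoder expands each frame into 2,000 samples. That split is the usual motivation for a low frame rate: the expensive generative stage works on short sequences.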

This release adds a powerful new tool to the growing ecosystem of open-source generative audio. By providing a capable, permissively licensed voice cloning model, the Qwen team is enabling developers to build more sophisticated and personalized voice applications, from custom assistants to accessibility tools. The model and usage instructions are available on the Hugging Face Hub.
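For readers who want to fetch the files, a minimal sketch using the `huggingface_hub` package's `snapshot_download` function is shown below. It assumes the package is installed (`pip install huggingface_hub`) and that the repository id matches the one listed under Sources. The wrapper name `download_model` is hypothetical, and the sketch stops at downloading: the inference API is documented on the model page and is not reproduced here.

```python
# Repository id as listed under Sources in this article.
REPO_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"


def download_model(repo_id: str = REPO_ID) -> str:
    """Download the repository snapshot and return its local directory."""
    # Imported lazily so this module loads even without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling `download_model()` requires network access and enough disk space for the model weights; it returns the path of the local cache directory that holds the downloaded files.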

Sources

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Hugging Face

