Qwen Releases Open-Source Voice Cloning Model
The new 600-million-parameter Qwen3-TTS model can generate speech in multiple languages and clone voices from short audio clips.
The Qwen team, part of Alibaba, has released Qwen3-TTS, a new open-source model for generating human-like speech. This initial release is a 600-million-parameter base model designed for text-to-speech (TTS) applications, adding another openly available generative audio tool for developers.
The model's two key capabilities are multilingual support and voice cloning. It can generate speech in multiple languages and can mimic a specific person's voice from only a short reference audio sample. This feature, often called zero-shot voice cloning, is useful for building custom voice assistants, dynamic audio content, and accessibility tools.
A Permissive Foundation
Released under the permissive Apache 2.0 license, Qwen3-TTS offers an alternative to proprietary text-to-speech APIs. Developers can experiment with the model, modify it, and build on it, including in commercial products, without restrictive licensing terms.
The model, officially designated Qwen3-TTS-12Hz-0.6B-Base, is available now for download and use. As a "base" model, it serves as a solid foundation intended for further fine-tuning on specific tasks or voices to achieve higher quality and more specialized outputs.
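For a rough sense of what downloading and loading a 600-million-parameter model involves, the weight size can be estimated from the parameter count alone. The sketch below is back-of-the-envelope arithmetic, not a measurement of the released checkpoint: it assumes exactly 600 million parameters, counts only the weights, and uses typical storage precisions, none of which are confirmed by the release notes. The function names are illustrative.

```python
# Rough weight-size estimate for a 600-million-parameter model.
# Assumptions (not from the release): the count is exactly 0.6e9, and only
# the weights are counted (no activations, caches, or audio codec parts).

PARAM_COUNT = 600_000_000

# Bytes per parameter for common storage precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_size_gb(param_count: int, bytes_per_param: int) -> float:
    """Return the weight size in gigabytes (1 GB = 10**9 bytes)."""
    return param_count * bytes_per_param / 1e9


def estimate_all(param_count: int = PARAM_COUNT) -> dict:
    """Map each precision name to its estimated weight size in GB."""
    return {
        name: weight_size_gb(param_count, size)
        for name, size in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for dtype, gb in estimate_all().items():
        print(f"{dtype}: ~{gb:.1f} GB")
```

Under these assumptions the weights come to roughly 1.2 GB at 16-bit precision, which is small enough to fit on most consumer GPUs; the actual download size depends on how the checkpoint is packaged.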
Sources
- Qwen/Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)