Qwen · Alibaba · Text → Speech

Qwen Releases Open-Source Voice Cloning Model

The new 600-million-parameter Qwen3-TTS model can generate speech in multiple languages and clone voices from short audio clips.

Jan 21, 2026

Notable · Apache 2.0

The Qwen team, part of Alibaba, has released Qwen3-TTS, a new open-source model for generating human-like speech. This initial release is a 600-million-parameter base model designed for text-to-speech (TTS) applications, giving developers another openly available generative audio tool.

The model's key capabilities are multilingual support and voice cloning. It can generate speech in multiple languages and can mimic a specific person's voice using only a short audio sample as a reference. This feature, often called zero-shot voice cloning because it requires no additional training on the target voice, is useful for building custom voice assistants, dynamic audio content, and accessibility tools.

A Permissive Foundation

Released under the permissive Apache 2.0 license, Qwen3-TTS offers an alternative to proprietary text-to-speech APIs. The license allows commercial use, modification, and redistribution, so developers can experiment with the model and build on it without restrictive terms.

The model, officially designated Qwen3-TTS-12Hz-0.6B-Base, is available now for download and use. As a "base" model, it is intended as a foundation for further fine-tuning on specific tasks or voices to achieve higher-quality, more specialized output.
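For readers who want to fetch the files, here is a minimal sketch of downloading the repository listed under Sources. It assumes the `huggingface_hub` Python package is installed. The helper names `model_page_url` and `download_model` are hypothetical and chosen for illustration only. The sketch covers only the download, since the article does not describe how the model is loaded or run.

```python
# Sketch only: fetch the model files named in the article from Hugging Face.
# Assumption: `pip install huggingface_hub`. Loading and inference are not shown,
# because the article does not describe the model's runtime API.

REPO_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"  # repository name from the Sources section


def model_page_url(repo_id: str = REPO_ID) -> str:
    """Return the Hugging Face model page URL for a repository ID."""
    return f"https://huggingface.co/{repo_id}"


def download_model(repo_id: str = REPO_ID) -> str:
    """Download every file in the repository and return the local cache path."""
    # Imported lazily so the URL helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling `download_model()` needs network access and enough disk space for the weights. `model_page_url()` only builds the link to the model page.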

Sources

Qwen/Qwen3-TTS-12Hz-0.6B-Base
Hugging Face
