Qwen · Alibaba · Text → Speech

Qwen Unveils Open Model for Custom Voice Synthesis

The new 1.7-billion-parameter text-to-speech model from Alibaba's Qwen team can generate novel voices from short audio prompts.

Jan 21, 2026

Notable · Apache 2.0

Alibaba's Qwen team has expanded its open-source offerings with Qwen3-TTS, a new model dedicated to high-quality speech synthesis. Released under a permissive Apache 2.0 license, this 1.7-billion-parameter system marks a significant entry into the growing field of open text-to-speech (TTS) models.

The model's standout feature is its "Voice Design" capability. Unlike traditional TTS systems that rely on a fixed set of pre-recorded voices, Qwen3-TTS can generate speech in a novel voice by analyzing a short audio prompt. This lets developers create unique voices or clone existing ones for custom applications, a capability previously found mostly in proprietary, API-driven systems.

Multilingual and Prompt-Driven

Qwen3-TTS is designed to be multilingual and is controlled through a combination of text and audio inputs. A user provides the text to be spoken along with a reference audio clip, and the model generates speech that matches the voice characteristics of the reference. The "12Hz" in the model's name likely refers to the frame rate of its internal audio representation, meaning roughly 12 token frames per second of speech rather than the sample rate of the output waveform. Modern neural audio codecs use low frame rates like this to model speech efficiently.
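To make the "12Hz" figure concrete, here is a minimal arithmetic sketch. It assumes the label means about 12 codec frames per second, which is a reading of the model name and not a confirmed specification. The 24 kHz waveform sample rate is also an assumption chosen for illustration, and both function names are hypothetical.

```python
import math

# Assumed values for illustration only; neither is confirmed by the model card.
ASSUMED_FRAME_RATE_HZ = 12.0      # codec frames per second, read from the "12Hz" label
ASSUMED_SAMPLE_RATE_HZ = 24_000   # raw waveform samples per second (hypothetical)


def codec_frames(duration_s: float, frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ) -> int:
    """Number of codec frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def waveform_samples(duration_s: float, sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ) -> int:
    """Number of raw waveform samples in `duration_s` seconds of audio."""
    return round(duration_s * sample_rate_hz)


duration_s = 10.0
frames = codec_frames(duration_s)
samples = waveform_samples(duration_s)

print(f"{duration_s:.0f} s of speech: {frames} codec frames vs {samples} waveform samples")
print(f"Each frame stands in for about {samples // frames} samples")
```

Under these assumptions, a ten-second utterance becomes 120 sequence steps for the model to predict instead of 240,000 raw samples, which is the efficiency argument behind low-frame-rate codecs.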

The release of a powerful, commercially permissive voice design model like Qwen3-TTS is a notable development for the open-source AI community. It provides a foundation for a wide range of applications, including personalized digital assistants, dynamic video game character dialogue, and accessibility tools, without the restrictions of closed platforms.

Sources

Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign, Hugging Face

