Qwen Unveils Open Model for Custom Voice Synthesis
The new 1.7-billion-parameter text-to-speech model from Alibaba's Qwen team can generate novel voices from short audio prompts.
Alibaba's Qwen team has expanded its open-source offerings with Qwen3-TTS, a new model dedicated to high-quality speech synthesis. Released under a permissive Apache 2.0 license, this 1.7-billion-parameter system marks a significant entry into the growing field of open text-to-speech (TTS) models.
The model's standout feature is its "Voice Design" capability. Unlike traditional TTS systems that rely on a fixed set of pre-recorded voices, Qwen3-TTS can generate speech in a novel voice by analyzing a short audio prompt. This allows developers to create unique voices or clone existing ones for custom applications, a capability that until now has been found mostly in proprietary, API-driven systems.
Multilingual and Prompt-Driven
Qwen3-TTS is designed to be multilingual and is controlled through a combination of text and audio inputs. A user provides the text to be spoken along with a reference audio clip, and the model generates speech that matches the voice characteristics of the reference. The "12Hz" in the model's full name, Qwen3-TTS-12Hz-1.7B-VoiceDesign, likely refers to the frame rate of its internal audio representation, that is, roughly 12 audio tokens per second of speech rather than the sampling rate of the output waveform. Modern neural audio codecs use low frame rates like this to model speech efficiently.
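To make the "12Hz" reading concrete, here is a rough back-of-the-envelope sketch of how few codec frames a low frame rate implies compared with raw waveform samples. It assumes the interpretation above (about 12 frames per second) and, purely for illustration, a 24 kHz output waveform. Neither figure is confirmed by the model card, and the function names are hypothetical.

```python
import math


def codec_frames(duration_s: float, frame_rate_hz: float = 12.0) -> int:
    """Frames needed to cover the clip, assuming ~12 frames per second."""
    return math.ceil(duration_s * frame_rate_hz)


def raw_samples(duration_s: float, sample_rate_hz: int = 24_000) -> int:
    """Waveform samples in the clip; 24 kHz is an illustrative assumption."""
    return round(duration_s * sample_rate_hz)


duration = 10.0  # seconds of speech
frames = codec_frames(duration)
samples = raw_samples(duration)

# Each frame stands in for thousands of waveform samples, which is why
# a language-model-style generator can handle long utterances.
print(frames, samples, samples // frames)  # 120 240000 2000
```

Under these assumptions, ten seconds of speech is about 120 generation steps instead of 240,000 samples, which is the efficiency argument behind low-frame-rate codecs.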
The release of a powerful, commercially permissive voice design model like Qwen3-TTS is a notable development for the open-source AI community. It provides a foundational tool for a wide range of applications, including personalized digital assistants, dynamic video game character dialogue, and accessibility tools, without the restrictions of closed platforms.
Sources
- Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign (Hugging Face)