Alibaba Releases CosyVoice 3 for Expressive TTS

The new 500-million-parameter text-to-speech model from Alibaba’s FunAudioLLM team offers multilingual voice cloning and emotional control.

Dec 11, 2025

Alibaba’s FunAudioLLM team, part of the group behind the Qwen model family, has released Fun-CosyVoice3, a 500-million-parameter foundation model for text-to-speech (TTS). The model is designed to generate natural, expressive, and controllable speech, and it is published on Hugging Face.

CosyVoice 3 stands out for a feature set (multilingual synthesis, zero-shot voice cloning, and expressive control) that brings it closer to what leading proprietary services offer. That makes it a practical option for developers building voice applications.

Cloning, Control, and Multilingual Support

The model's core strengths lie in its versatility and fine-grained control. Key features highlighted in the official release include:

Multilingual and Accent Support: CosyVoice 3 handles over ten languages, including English, Chinese, Japanese, French, and Spanish, and can manage code-switching between them.
Zero-Shot Voice Cloning: It can replicate a speaker’s voice from a mere 3-second audio clip, even performing cross-lingual cloning where the target language differs from the source clip.
Expressive Control: The model allows for adjustments to emotion, style, rhythm, and prosody, enabling the generation of nuanced and context-aware speech.
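
For developers preparing reference audio for zero-shot cloning, a simple pre-check on clip length can be useful. The sketch below is illustrative only: it uses Python's standard wave module and treats the 3-second figure from the release as a minimum, which is an assumption rather than a documented requirement. The names MIN_PROMPT_SECONDS, clip_duration_seconds, is_usable_prompt, and make_silent_wav are hypothetical helpers, not part of any CosyVoice API.

```python
import io
import wave

# Assumption: treat the article's "3-second" figure as a minimum clip length.
MIN_PROMPT_SECONDS = 3.0


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_prompt(wav_file, min_seconds: float = MIN_PROMPT_SECONDS) -> bool:
    """Hypothetical pre-check: is the reference clip long enough to try cloning?"""
    return clip_duration_seconds(wav_file) >= min_seconds


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for seconds in (1.0, 3.0, 5.0):
        clip = make_silent_wav(seconds)
        print(f"{seconds:.1f}s clip usable: {is_usable_prompt(clip)}")
```

In practice the same check would be run on a real recording by passing its file path instead of the in-memory buffer. Clip length says nothing about audio quality, so this is only a first filter.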

While the model is available for commercial use, it is released under the tongyi-qianwen-license-1.0, which carries restrictions. Companies with more than 100 million monthly active users must seek a separate license from Alibaba, a detail developers should note before integrating it into large-scale products.

Sources

FunAudioLLM/Fun-CosyVoice3-0.5B-2512 (Hugging Face)
