
Zhipu AI Releases GLM-TTS for Zero-Shot Voice Cloning

This new text-to-speech model can replicate a voice from just a few seconds of audio, using a novel combination of flow matching and reinforcement learning.

Dec 10, 2025


Zhipu AI, the company behind the GLM family of large language models, has released GLM-TTS, a new model for text-to-speech synthesis. The system is capable of zero-shot voice cloning, meaning it can replicate a speaker's voice after hearing just a few seconds of an audio sample. It's designed to be bilingual, supporting both Chinese and English out of the box.

A New Approach to Synthesis

Instead of relying on more common diffusion techniques, GLM-TTS is built on a flow matching architecture, an approach that can train faster and more stably than some alternatives. The model also uses reinforcement learning (RL) to fine-tune its output, specifically to improve prosody (the rhythm, stress, and intonation of the generated speech) so that it sounds more natural and expressive.
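The release summary does not spell out GLM-TTS's training objective, so the following is only a generic sketch of what flow matching means, not the model's actual code. It assumes the simplest straight-line path between a noise sample and a data sample; every function name here is hypothetical, and plain Python lists stand in for acoustic features.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the straight-line path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle(x1):
    """Stand-in for a trained network: the exact velocity field toward one target x1.

    Only valid for t < 1, which is all that euler_sample ever queries.
    """
    def predict_velocity(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
    return predict_velocity


if __name__ == "__main__":
    noise = [0.0, 1.0, -2.0]
    data = [1.0, 3.0, 0.5]
    oracle = make_oracle(data)
    print(flow_matching_loss(oracle, noise, data, t=0.3))
    print(euler_sample(oracle, noise, steps=10))
```

In a real system the oracle would be replaced by a neural network trained to minimize this loss over many noise and data pairs, typically conditioned on text and on the speaker prompt.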

The model's core capability is to take a 3- to 10-second audio prompt of a target voice and generate new speech in that voice from any given text. This makes it useful for applications that need personalized audio without collecting extensive training data for each new voice.
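Before passing a reference clip to any voice-cloning model, it can help to confirm that the clip falls in the expected length range. The sketch below is a hypothetical pre-check built only on Python's standard `wave` module. It is not part of GLM-TTS, and the 3 and 10 second bounds are taken from the description above rather than from any limit the model is known to enforce.

```python
import io
import wave

# Illustrative bounds from the "3- to 10-second" description; assumptions, not model limits.
MIN_PROMPT_SECONDS = 3.0
MAX_PROMPT_SECONDS = 10.0


def wav_duration_seconds(source):
    """Return the duration of a PCM WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_prompt(duration_seconds):
    """True if the clip length falls inside the assumed prompt range."""
    return MIN_PROMPT_SECONDS <= duration_seconds <= MAX_PROMPT_SECONDS


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for seconds in (2, 5, 12):
        duration = wav_duration_seconds(make_silent_wav(seconds))
        print(seconds, duration, is_usable_prompt(duration))
```

A real pipeline would read the speaker's recording from disk instead of generating silence, and might also check sample rate and channel count.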

GLM-TTS is available on Hugging Face, though it is released under a custom license from Zhipu AI that governs its use. Potential users should review its terms, as they differ from standard open-source licenses like Apache 2.0 or MIT.

Sources

zai-org/GLM-TTS (Hugging Face)
