Zhipu AI Releases GLM-TTS for Zero-Shot Voice Cloning
This new text-to-speech model can replicate a voice from just a few seconds of audio, using a novel combination of flow matching and reinforcement learning.

Zhipu AI, the company behind the GLM family of large language models, has released GLM-TTS, a new model for text-to-speech synthesis. The system is capable of zero-shot voice cloning, meaning it can replicate a speaker's voice after hearing just a few seconds of an audio sample. It's designed to be bilingual, supporting both Chinese and English out of the box.
A New Approach to Synthesis
Instead of relying on the more common diffusion techniques, GLM-TTS is built on a flow matching architecture, an approach that can offer faster and more stable training than some alternatives. The model also uses reinforcement learning (RL) to fine-tune its output, specifically to improve prosody (the rhythm, stress, and intonation of the generated speech) so that it sounds more natural and expressive.
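For readers unfamiliar with flow matching, the general idea can be sketched in a few lines. The example below is a toy illustration in NumPy, not GLM-TTS's actual code: it assumes the common straight-line formulation, where a model learns the velocity that carries a noise sample to a data sample, and the function names (interpolate, flow_matching_loss, euler_sample) are hypothetical. In a real TTS system the velocity function would be a neural network conditioned on text and a speaker prompt; here it is replaced by a closed-form "oracle" for a single target point.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity."""
    x_t = interpolate(x0, x1, t)
    target = x1 - x0  # velocity is constant along a straight path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x

# Toy setup (assumption): every data sample is the same point c, so the
# ideal velocity field has the closed form (c - x) / (1 - t).
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 0.5])
oracle = lambda x, t: (c - x) / (1.0 - t)

x0 = rng.standard_normal((4, 3))        # noise samples
x1 = np.tile(c, (4, 1))                 # data samples
t = rng.uniform(0.0, 0.9, size=(4, 1))  # stay away from the t=1 singularity

print("loss with oracle velocity:", flow_matching_loss(oracle, x0, x1, t))
print("samples after integration:", euler_sample(oracle, x0))
```

Training minimizes this loss over many noise, data, and time samples; generation then integrates the learned velocity field in a fixed number of steps.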
The model takes a 3- to 10-second audio prompt of a target voice and generates new speech in that voice from any given text. This makes it useful for applications that need personalized audio without extensive training data for each new voice.
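Anyone preparing a reference clip may want to confirm that it falls within that 3- to 10-second window before passing it to the model. The sketch below uses only Python's standard wave module and assumes an uncompressed WAV file; the helper names are hypothetical and are not part of any GLM-TTS tooling.

```python
import wave

def prompt_duration_seconds(path):
    """Length of a WAV file in seconds (frames divided by sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_valid_prompt(path, min_s=3.0, max_s=10.0):
    """True if the clip length is within the 3- to 10-second range."""
    return min_s <= prompt_duration_seconds(path) <= max_s

def write_silent_wav(path, seconds, rate=16000):
    """Write a mono 16-bit silent WAV file, handy for trying the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

The bounds are parameters, so they can be adjusted if the model's documentation recommends a different range.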
GLM-TTS is available on Hugging Face, though it is released under a custom license from Zhipu AI that governs its use. Potential users should review its terms, as they differ from standard open-source licenses like Apache 2.0 or MIT.
Sources
- zai-org/GLM-TTS (Hugging Face)