OpenBMB Releases VoxCPM for Open Voice Synthesis
The new 500-million-parameter model offers high-quality text-to-speech and zero-shot voice cloning under a permissive license.

The OpenBMB research collective has released VoxCPM-0.5B, an open-source model for speech generation. At 500 million parameters, it is designed to be a relatively lightweight yet capable tool for developers working with synthetic audio. The model is available under the Apache 2.0 license, which permits commercial use and modification.
VoxCPM is built on MiniCPM4, a compact language model from OpenBMB's MiniCPM family. By extending this foundation into the audio domain, OpenBMB provides a speech synthesis model that is both accessible and efficient, continuing the trend of capable, specialized open models in smaller weight classes.
Zero-Shot Voice Cloning
The model's primary strength lies in its ability to perform zero-shot voice cloning. This means it can replicate a person's voice from a short audio sample without requiring any specialized fine-tuning or retraining. Its core features include:
- Bilingual text-to-speech in English and Chinese.
- Zero-shot voice cloning from brief audio clips.
- High-quality, natural-sounding audio output.
For researchers and developers interested in exploring its capabilities, the model is available for download on Hugging Face. Its open license and modest size make it a compelling option for projects that need custom voice generation or real-time speech synthesis.
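As a minimal sketch of the download step, the snippet below fetches the repository files with the `huggingface_hub` package (installed separately with `pip install huggingface_hub`). The helper names `repo_url` and `download_weights` and the target directory are illustrative assumptions, not part of any official VoxCPM tooling; only the repository id comes from the source listing.

```python
from pathlib import Path

# Repository id as listed on Hugging Face for this release.
REPO_ID = "openbmb/VoxCPM-0.5B"


def repo_url(repo_id: str = REPO_ID) -> str:
    """Return the Hugging Face model page URL for a repository id."""
    return f"https://huggingface.co/{repo_id}"


def download_weights(repo_id: str = REPO_ID, target_dir: str = "VoxCPM-0.5B") -> Path:
    """Download every file in the model repository into target_dir.

    Hypothetical helper: the directory name is an arbitrary choice.
    """
    # Imported lazily so this module loads even without the optional dependency.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(repo_id=repo_id, local_dir=target_dir)
    return Path(local_path)


# Example (requires network access and disk space for the model weights):
#   weights_dir = download_weights()
#   print(f"Files from {repo_url()} saved to {weights_dir}")
```

Calling `download_weights()` only retrieves the files. Running synthesis or voice cloning additionally requires the inference code that OpenBMB publishes alongside the model.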
Sources
- openbmb/VoxCPM-0.5B (Hugging Face)