OpenBMB / Text → Speech

OpenBMB Releases VoxCPM for Open Voice Synthesis

The new 500-million-parameter model offers high-quality text-to-speech and zero-shot voice cloning under a permissive license.

Sep 16, 2025

Notable · Apache 2.0

The OpenBMB research collective has released VoxCPM-0.5B, a new open-source model for speech generation. At just 500 million parameters, it's designed to be a relatively lightweight yet capable tool for developers working with synthetic audio. The model is available under a permissive Apache 2.0 license, encouraging broad adoption.
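To put the "lightweight" claim in concrete terms, here is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy at common numeric precisions. It assumes the nominal 500 million parameter count from the announcement and ignores activations, caches, and any auxiliary components, so real-world usage will be higher. The function name and the precision table are illustrative, not part of any VoxCPM tooling.

```python
# Rough weight-only memory estimate for a 500M-parameter model.
# Assumption: the nominal parameter count from the announcement; activations,
# caches, and framework overhead are NOT included.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the size of the raw weights in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 500_000_000  # "500 million parameters" per the article
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(nominal_params, dtype):.2f} GiB")
```

Under these assumptions the weights come to a little under 1 GiB at half precision, which is consistent with the article's framing of the model as small enough for modest hardware.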

VoxCPM builds on OpenBMB's MiniCPM model family, using MiniCPM-4 as its language-model backbone. By extending this foundation into the audio domain, OpenBMB provides a speech synthesis model that is both accessible and efficient, continuing the trend of capable, specialized open models in smaller weight classes.

Zero-Shot Voice Cloning

The model's primary strength lies in its ability to perform zero-shot voice cloning. This means it can replicate a person's voice from a short audio sample without requiring any specialized fine-tuning or retraining. Its core features include:

Bilingual text-to-speech in English and Chinese.
Zero-shot voice cloning from brief audio clips.
High-quality, natural-sounding audio output.

For researchers and developers interested in exploring its capabilities, the model is available for download on Hugging Face. Its open license and modest size make it a compelling option for projects requiring custom voice generation or real-time speech synthesis applications.
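For readers who want to fetch the weights, the following is a minimal sketch using the `huggingface_hub` library's `snapshot_download` function with the repository ID listed under Sources. It only downloads the files; it does not load or run the model, and the helper name `download_voxcpm` is hypothetical. It assumes `huggingface_hub` is installed (for example via `pip install huggingface_hub`) and that network access is available.

```python
# Minimal sketch: download the VoxCPM-0.5B repository from Hugging Face.
# Assumption: huggingface_hub is installed; download_voxcpm is an illustrative
# helper, not part of any official VoxCPM package.
from typing import Optional

REPO_ID = "openbmb/VoxCPM-0.5B"  # repository ID as listed under Sources


def download_voxcpm(local_dir: Optional[str] = None) -> str:
    """Download all files in the model repo and return the local folder path."""
    # Imported lazily so this module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_voxcpm()
    print(f"Model files downloaded to: {path}")
```

When `local_dir` is left unset, the files go to the library's default cache location; passing a directory places them there instead. Running inference is a separate step that depends on the project's own tooling, which is described on the model page.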

Sources

openbmb/VoxCPM-0.5B (Hugging Face)


