VoxCPM 1.5 Brings Open-Source Voice Cloning
The new 500-million-parameter text-to-speech model from OpenBMB supports both English and Chinese and can replicate a voice from a short audio sample.

The field of open-source text-to-speech has a new contender with the release of VoxCPM 1.5 by the OpenBMB research group. The model introduces high-quality, zero-shot voice cloning capabilities to the community, enabling users to generate speech in a specific voice using just a short audio sample.
Built on the MiniCPM-4 architecture, VoxCPM 1.5 is a compact 500-million-parameter model. Its relatively small size makes it easier for developers and researchers to run and fine-tune on a wider range of hardware, lowering the barrier to entry for building custom speech applications.
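To put the 500-million-parameter figure in perspective, the short sketch below estimates how much memory the weights alone would occupy at common numeric precisions. It is back-of-the-envelope arithmetic, not a measurement of VoxCPM 1.5: the parameter count comes from the article, the helper name `weight_memory_gb` is made up for illustration, and the estimate ignores activations, any auxiliary audio components, and framework overhead.

```python
# Rough estimate of the memory needed just to store model weights.
# Assumption: 500 million parameters, as stated in the article.
# Bytes-per-parameter values are the standard sizes of each numeric format.
PARAMS = 500_000_000

BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for fmt, size in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weight_memory_gb(PARAMS, size):.1f} GB")
```

At half precision the weights come to roughly 1 GB, which is why a model of this size can plausibly fit on consumer GPUs and some laptops; actual runtime memory use will be higher.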
Bilingual Voice Synthesis
A key advantage of the model is that it supports both English and Chinese within a single framework. Combined with its permissive Apache 2.0 license, this makes it a versatile tool for global applications. Key features include:
- Zero-shot voice cloning from brief audio clips.
- Bilingual support for English and Chinese.
- An efficient 500M parameter architecture.
By providing an open and powerful tool for voice synthesis, OpenBMB is enabling new possibilities in areas like personalized digital assistants, accessible technology, and creative content generation. Developers can explore the model and its capabilities in the official Hugging Face repository.
Sources
- openbmb/VoxCPM1.5 (Hugging Face)