OpenBMB Releases VoxCPM2 for Expressive TTS
The new diffusion-based model from the OpenBMB research group supports multilingual speech, emotional control, and zero-shot voice cloning.
The OpenBMB research community has released VoxCPM2, an open-source model for text-to-speech synthesis. Built on a diffusion-based architecture, the model aims to generate high-fidelity, expressive speech in multiple languages.
Cloning and Control
VoxCPM2's standout feature is zero-shot voice cloning from a 3-to-20-second audio sample of the target voice. The model can therefore generate speech in a new voice without any additional training on that speaker. It also offers fine-grained control over the output, with key capabilities including:
- Cross-lingual synthesis: Generate speech in one language using a voice from another (e.g., speaking Chinese with an English speaker's vocal characteristics).
- Emotional control: Adjust the emotional tone of the generated speech.
- Multilingual support: Primarily trained on Chinese and English.
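Because the reference clip is expected to fall within the 3-to-20-second window, it can be useful to check a recording's length before attempting a clone. The following is a minimal sketch using only Python's standard `wave` module. The helper names (`clip_duration_seconds`, `is_usable_reference`, `make_silent_wav`) are hypothetical and are not part of any VoxCPM2 API; the sketch also assumes the clip is an uncompressed PCM WAV file.

```python
import io
import wave

# Bounds taken from the 3-to-20 second reference range stated in the article.
MIN_SECONDS = 3.0
MAX_SECONDS = 20.0


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(wav_file) -> bool:
    """True if the clip length falls inside the 3-to-20 second window."""
    return MIN_SECONDS <= clip_duration_seconds(wav_file) <= MAX_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demo helper only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for seconds in (1.0, 5.0, 25.0):
        usable = is_usable_reference(make_silent_wav(seconds))
        print(f"{seconds:>5.1f} s clip usable as reference: {usable}")
```

In practice the same check would be run on a real recording by passing its file path to `is_usable_reference`.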
The model uses a two-stage cascaded diffusion process. The first stage converts text into a mel-spectrogram, a time-frequency representation of the audio. A second-stage vocoder then converts this spectrogram into the final audio waveform. Splitting the work between an acoustic stage and a vocoder is a common design in neural text-to-speech systems.
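The two-stage flow described above can be illustrated with a toy sketch that shows only how data moves between the stages. This is not VoxCPM2's code or API: both functions are placeholder stand-ins rather than diffusion models, and the sample rate, hop length, mel-band count, and frames-per-character rule are all assumed values chosen for illustration.

```python
import math

SAMPLE_RATE = 16000   # assumed output sample rate (Hz)
HOP_LENGTH = 200      # assumed waveform samples per mel frame
N_MELS = 80           # assumed number of mel bands per frame
FRAMES_PER_CHAR = 5   # toy duration rule, purely illustrative


def text_to_mel(text: str) -> list[list[float]]:
    """Stage 1 stand-in: map text to a (frames x mel bands) grid.

    A real acoustic model would predict these values; here each character
    simply yields a few constant frames so the shapes can be inspected.
    """
    frames = []
    for char in text:
        level = (ord(char) % 32) / 32.0
        frames.extend([[level] * N_MELS for _ in range(FRAMES_PER_CHAR)])
    return frames


def mel_to_waveform(mel: list[list[float]]) -> list[float]:
    """Stage 2 stand-in: expand each mel frame into HOP_LENGTH samples.

    A real vocoder would reconstruct speech; this toy emits a 220 Hz tone
    whose loudness follows the mean value of each frame.
    """
    samples = []
    for index, frame in enumerate(mel):
        amplitude = sum(frame) / len(frame)
        for offset in range(HOP_LENGTH):
            t = (index * HOP_LENGTH + offset) / SAMPLE_RATE
            samples.append(amplitude * math.sin(2 * math.pi * 220.0 * t))
    return samples


def synthesize(text: str) -> list[float]:
    """Chain the two stages: text -> mel-spectrogram -> waveform."""
    return mel_to_waveform(text_to_mel(text))


if __name__ == "__main__":
    mel = text_to_mel("hello")
    audio = mel_to_waveform(mel)
    print(f"mel frames: {len(mel)} x {len(mel[0])} bands")
    print(f"waveform: {len(audio)} samples, {len(audio) / SAMPLE_RATE:.2f} s")
```

The point of the sketch is the interface between stages: the first stage fixes the number of frames, and the second stage turns each frame into a fixed number of waveform samples, so either stage can be swapped out as long as the intermediate spectrogram shape stays the same.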
VoxCPM2 brings zero-shot cloning, cross-lingual synthesis, and emotional control to open-source generative audio, capabilities more often associated with proprietary systems. It gives researchers and developers a tool for building custom voice applications. The model is available for download on the Hugging Face Hub, and users should review its custom "OpenBMB Model License" before use, since it governs how the model may be used.
Sources
- openbmb/VoxCPM2 (Hugging Face)