OpenBMB Releases VoxCPM2 for Expressive TTS

The new diffusion-based model from the OpenBMB research group supports multilingual speech, emotional control, and zero-shot voice cloning.

Apr 3, 2026

The OpenBMB research community has released VoxCPM2, a new open-source model for text-to-speech synthesis. Built on a diffusion-based architecture, the model aims to generate high-fidelity, expressive speech in multiple languages.

Cloning and Control

VoxCPM2's standout feature is zero-shot voice cloning from a single 3-to-20-second audio sample of the target voice, which lets it generate speech in a new voice without any speaker-specific training. The model also offers fine-grained control over the output, with key capabilities including:

Cross-lingual synthesis: Generate speech in one language using a voice from another (e.g., speaking Chinese with an English speaker's vocal characteristics).
Emotional control: Adjust the emotional tone of the generated speech.
Multilingual support: Primarily trained on Chinese and English.
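
As a practical illustration of the 3-to-20-second reference window mentioned above, the following minimal sketch checks whether a WAV clip falls inside that range before it is handed to a cloning model. It uses only Python's standard library; the function names and the `make_test_tone` helper are hypothetical and are not part of VoxCPM2 or any OpenBMB API.

```python
import io
import math
import struct
import wave

MIN_SECONDS = 3.0   # lower bound quoted in the article
MAX_SECONDS = 20.0  # upper bound quoted in the article


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (path or binary file-like object)."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(wav_file) -> bool:
    """True if the clip length falls inside the 3-to-20-second window."""
    return MIN_SECONDS <= clip_duration_seconds(wav_file) <= MAX_SECONDS


def make_test_tone(seconds: float, sample_rate: int = 8000) -> io.BytesIO:
    """Hypothetical helper: build an in-memory mono 16-bit WAV with a 440 Hz tone."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        n_samples = int(seconds * sample_rate)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / sample_rate)))
            for i in range(n_samples)
        )
        wav.writeframes(frames)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer
```

For example, `is_usable_reference(make_test_tone(5.0))` accepts a five-second clip, while a one-second clip is rejected. A real application would likely also check audio quality and background noise, which this sketch does not attempt.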

The model uses a two-stage cascaded diffusion process. The first stage converts text into a mel-spectrogram, a time-frequency representation of the audio on a perceptual (mel) frequency scale. A second-stage vocoder then converts this spectrogram into the final audio waveform. Splitting synthesis into a spectrogram stage and a vocoder stage is a common design in neural text-to-speech systems.
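
The data flow of the two-stage design described above can be sketched with placeholder stages. This is not VoxCPM2's code or API: both stage functions are hypothetical stand-ins that return dummy values, and the constants (80 mel bins, a hop length of 256 samples, 5 frames per character) are assumptions chosen only to make the shapes concrete.

```python
from typing import List

N_MELS = 80          # assumed number of mel bins (not stated in the article)
HOP_LENGTH = 256     # assumed audio samples produced per spectrogram frame
FRAMES_PER_CHAR = 5  # assumed, only so the dummy output has a length

Mel = List[List[float]]  # shape: [n_frames][N_MELS]


def acoustic_stage(text: str) -> Mel:
    """Stage 1 stand-in: text -> mel-spectrogram (zeros instead of a real model)."""
    n_frames = len(text) * FRAMES_PER_CHAR
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocoder_stage(mel: Mel) -> List[float]:
    """Stage 2 stand-in: mel-spectrogram -> waveform samples (silence here)."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> mel-spectrogram -> waveform."""
    return vocoder_stage(acoustic_stage(text))
```

The point of the sketch is the interface between the stages: the first stage emits one vector of mel values per frame, and the vocoder expands each frame into a fixed number of waveform samples. In a real system each stand-in would be replaced by a trained network.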

VoxCPM2 is another step forward for open-source generative audio, offering voice cloning and emotional control that rival proprietary systems. It gives researchers and developers a foundation for building custom voice applications. The model is available for download on the Hugging Face Hub, though users should review its custom "OpenBMB Model License" before using it.

Sources

openbmb/VoxCPM2 (Hugging Face)
