Microsoft Releases VibeVoice, a 7B Podcast TTS Model
The new 7-billion-parameter model generates long-form, multi-speaker audio in English and Chinese and is released under the permissive MIT license.

Microsoft has entered the open-source audio space with VibeVoice-7B, a large text-to-speech (TTS) model designed for creating complex, conversational audio. At 7 billion parameters, it represents a significant new entry in a field where high-quality, open models are still less common than their text-based counterparts.
The model's primary strength is its ability to generate long-form, multi-speaker content that mimics the style of a podcast. This capability addresses a key challenge in synthetic audio: maintaining vocal consistency and natural turn-taking over extended durations. VibeVoice-7B supports both English and Chinese, broadening its potential applications.
Key Capabilities
- Multi-Speaker Generation: Creates conversational audio with distinct voices.
- Long-Form Synthesis: Optimized for generating extended content like articles or audiobooks.
- Bilingual Support: Capable of producing speech in both English and Chinese.
- Permissive Licensing: Released under the MIT license, allowing for wide-ranging commercial and research use.
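To make the multi-speaker capability more concrete, here is a minimal sketch of how a conversational script might be prepared before it is handed to a TTS model. The article does not describe the input format VibeVoice-7B expects, so the `Speaker N:` labeling, the `Turn` class, and the `format_script` helper are all hypothetical and shown for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: int  # 1-based speaker index (hypothetical convention)
    text: str     # what this speaker says in this turn


def format_script(turns: list[Turn]) -> str:
    """Render turns as one 'Speaker N: text' line per non-empty turn.

    The 'Speaker N:' label is an assumed format, not a documented one.
    """
    lines = []
    for turn in turns:
        # Collapse runs of whitespace and newlines so each turn is one line.
        text = " ".join(turn.text.split())
        if not text:
            continue  # skip turns that contain no words
        lines.append(f"Speaker {turn.speaker}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Turn(1, "Welcome back to the show."),
        Turn(2, "Thanks for having me."),
    ]
    print(format_script(demo))
```

Keeping each turn on a single labeled line makes it easy to check by eye that voices alternate as intended before spending time on a long synthesis run.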
What makes this release particularly notable is the combination of its scale and permissive license. While proprietary TTS services offer high quality, a powerful open-source alternative like VibeVoice gives developers and creators more control and flexibility. This could accelerate innovation in areas like automated content production, accessibility tools, and dynamic virtual assistants. The model is available for download and experimentation on its Hugging Face repository.
Sources
- vibevoice/VibeVoice-7B (Hugging Face)