Microsoft Releases VibeVoice for Long-Form Audio
The new 1.5-billion-parameter text-to-speech model is designed to generate natural, multi-speaker audio for podcasts and other long-form content.

Microsoft has released VibeVoice-1.5B, a new open-source model aimed at generating high-quality, long-form speech. At 1.5 billion parameters, it's a notable new entry in the text-to-speech (TTS) landscape, focusing on a particularly challenging area: creating natural-sounding, multi-speaker conversations.
The model is designed to produce podcast-style audio and accepts both English and Chinese text. VibeVoice is released under the MIT license, which permits use, modification, and redistribution in both research and commercial projects, provided the copyright and license notice are retained.
Key Capabilities
- Long-form Generation: Produces extended audio rather than only short, sentence-length clips.
- Multi-speaker Support: Can synthesize conversations involving different voices.
- Bilingual: Supports both English and Chinese text input.
- Permissive Licensing: Released under the MIT license, encouraging wide adoption.
The release of VibeVoice matters because it provides a strong open-source alternative for creating sophisticated audio content that has often been the domain of proprietary services. Developers and creators can now experiment with generating entire podcast episodes, dynamic audiobooks, or more complex conversational agents. You can find the model and usage instructions on its Hugging Face repository.
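For readers who want to try the model, here is a minimal Python sketch of a first step, not the official workflow. It assumes the `huggingface_hub` package is installed for the download. The `build_script` helper and its "Speaker N:" line format are hypothetical and used only for illustration, so check the model card for the input format the model actually expects.

```python
from typing import Iterable, Optional, Tuple

# Repository named in this article.
REPO_ID = "microsoft/VibeVoice-1.5B"


def fetch_model(local_dir: Optional[str] = None) -> str:
    """Download the model files and return the local folder path."""
    # Imported lazily so build_script works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Turn (speaker number, text) pairs into one labeled line per turn.

    The "Speaker N:" label is an assumed convention, not a confirmed
    VibeVoice input format.
    """
    lines = []
    for speaker, text in turns:
        text = " ".join(text.split())  # collapse stray whitespace
        if not text:
            continue  # skip empty turns
        lines.append(f"Speaker {speaker}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_script([
        (1, "Welcome to the show."),
        (2, "Glad to be here."),
    ]))
```

Calling `fetch_model()` downloads the full set of model weights, so it is kept separate from the script helper and is not run by default.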
Sources
- microsoft/VibeVoice-1.5B (Hugging Face)