Microsoft Releases VibeVoice, a 7B Podcast TTS Model

The new 7-billion-parameter model generates long-form, multi-speaker audio in English and Chinese and is released under the permissive MIT license.

Sep 4, 2025

Microsoft has entered the open-source audio space with VibeVoice-7B, a large text-to-speech (TTS) model designed for creating complex, conversational audio. At 7 billion parameters, it is a sizable addition to a field where high-quality open models remain less common than they are in text generation.

The model's primary strength is its ability to generate long-form, multi-speaker content that mimics the style of a podcast. This capability addresses a key challenge in synthetic audio: maintaining vocal consistency and natural turn-taking over extended durations. VibeVoice-7B supports both English and Chinese, broadening its potential applications.

Key Capabilities

Multi-Speaker Generation: Creates conversational audio with distinct voices.
Long-Form Synthesis: Optimized for generating extended content like articles or audiobooks.
Bilingual Support: Capable of producing speech in both English and Chinese.
Permissive Licensing: Released under the MIT license, allowing for wide-ranging commercial and research use.

What makes this release particularly notable is the combination of its scale and permissive license. Proprietary TTS services offer high quality, but an open-source alternative like VibeVoice gives developers and creators more control and flexibility. This could accelerate innovation in areas like automated content production, accessibility tools, and dynamic virtual assistants. The model is available for download and experimentation from the vibevoice/VibeVoice-7B repository on Hugging Face.

Sources

vibevoice/VibeVoice-7B, Hugging Face


