Microsoft Releases VibeVoice, a Podcast-Ready TTS Model

The new open-source model specializes in generating long-form, multi-speaker audio in both English and Mandarin, mimicking a natural podcast conversation.

Sep 4, 2025

Microsoft has introduced a new open-source model for text-to-speech synthesis, VibeVoice Large, designed specifically for creating realistic, long-form audio content. Released under the permissive MIT license, the model aims to tackle one of the more challenging frontiers in speech generation: natural, multi-speaker conversations.

Unlike many TTS models optimized for short, single-speaker responses, VibeVoice is built to generate audio that mimics the dynamic flow of a podcast. According to the release materials on Hugging Face, it can handle extended passages of text and differentiate between multiple speakers within the same audio track, supporting both English and Mandarin Chinese.

Why It Matters

The release of VibeVoice addresses a key gap in the open-source AI ecosystem. Creating high-quality, long-form spoken content, especially with multiple voices, has often required complex, proprietary systems or extensive manual editing. By providing a specialized tool for this purpose, Microsoft is enabling developers and creators to build more sophisticated applications, from automated podcast production and audiobook narration to more dynamic virtual assistants.

The model's focus on conversational audio represents a move toward more naturalistic human-computer interaction. As AI becomes more integrated into daily life, the ability to generate speech that is not just clear but also contextually appropriate and engaging is increasingly important. VibeVoice Large is available for download and experimentation now.

Sources

aoi-ot/VibeVoice-Large (Hugging Face)
