
Microsoft Releases VibeVoice for Long-Form Audio

The new 1.5-billion-parameter text-to-speech model is designed to generate natural, multi-speaker audio for podcasts and other long-form content.

Aug 25, 2025


Microsoft has released VibeVoice-1.5B, an open-source model for generating high-quality, long-form speech. The 1.5-billion-parameter text-to-speech (TTS) model targets a particularly challenging problem: creating natural-sounding, multi-speaker conversations.

The model is designed to produce audio in the style of a podcast, and it supports both English and Chinese. VibeVoice is released under the permissive MIT license, which allows use in both research and commercial projects; the license's main condition is that the copyright and license notice be retained.

Key Capabilities

Long-form Generation: Capable of producing extended audio clips beyond typical short sentences.
Multi-speaker Support: Can synthesize conversations involving different voices.
Bilingual: Supports both English and Chinese text input.
Permissive Licensing: Released under the MIT license, encouraging wide adoption.
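
To make the multi-speaker idea concrete, here is a minimal sketch of preparing a conversation script before handing it to a TTS system. The "Speaker N:" label format and the helper names (`Turn`, `parse_script`, `speakers`) are assumptions made for illustration; they are not taken from the VibeVoice documentation, so check the Hugging Face repository for the input format the model actually expects.

```python
import re
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: int
    text: str


# Hypothetical label format: "Speaker 1: some text".
# This is an illustrative assumption, not the model's documented input format.
_LABEL = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.*)$")


def parse_script(script: str) -> list[Turn]:
    """Split a labeled transcript into speaker turns.

    Blank lines are skipped. An unlabeled line is treated as a
    continuation of the previous turn.
    """
    turns: list[Turn] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _LABEL.match(line)
        if match:
            turns.append(Turn(int(match.group(1)), match.group(2)))
        elif turns:
            last = turns[-1]
            turns[-1] = Turn(last.speaker, f"{last.text} {line}".strip())
        else:
            raise ValueError(f"line has no speaker label: {line!r}")
    return turns


def speakers(turns: list[Turn]) -> list[int]:
    """Return the distinct speaker ids in order of first appearance."""
    return list(dict.fromkeys(t.speaker for t in turns))
```

Splitting the script into turns up front makes it easy to count the distinct voices a conversation needs and to catch unlabeled text before any audio is generated.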

The release of VibeVoice matters because it provides a strong open-source alternative for creating sophisticated audio content that has often been the domain of proprietary services. Developers and creators can now experiment with generating entire podcast episodes, dynamic audiobooks, or more complex conversational agents. You can find the model and usage instructions on its Hugging Face repository.

Sources

microsoft/VibeVoice-1.5B (Hugging Face)


