
Microsoft Releases VibeVoice for Speech Transcription

The new open-source automatic speech recognition model handles multilingual transcription and speaker identification out of the box.

Jan 21, 2026


Microsoft has released VibeVoice-ASR, a new foundational model for automatic speech recognition. The system, now available on Hugging Face, is designed to convert spoken audio into written text across multiple languages.

Beyond simple transcription, VibeVoice-ASR's key capability is integrated speaker diarization: identifying and labeling who is speaking and when. This matters for accurately transcribing conversations with multiple participants, such as meetings, interviews, or panel discussions, because it removes the need for a separate post-processing step.
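To make the "separate post-processing step" concrete, here is a minimal sketch of what developers typically do when transcription and diarization come from two different models: each transcript segment is assigned to the speaker turn it overlaps most in time. This is not VibeVoice-ASR's API or method. The `Segment` and `Turn` types, the `label_segments` helper, and the sample timestamps are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A piece of transcribed text with start/end times in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A speaker turn from a separate diarization model (hypothetical type)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list[Segment], turns: list[Turn]) -> list[tuple[str, str]]:
    """Attach to each segment the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker = "unknown"
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = turn.speaker
        labeled.append((best_speaker, seg.text))
    return labeled


if __name__ == "__main__":
    # Made-up sample data standing in for the outputs of two separate models.
    segments = [Segment(0.0, 2.0, "Hello"), Segment(2.1, 4.0, "Hi there")]
    turns = [Turn(0.0, 2.05, "A"), Turn(2.05, 5.0, "B")]
    for speaker, text in label_segments(segments, turns):
        print(f"{speaker}: {text}")
```

A model with built-in diarization would emit speaker-labeled text directly, so alignment code like this, along with its edge cases around overlapping speech and timestamp drift, would not be needed.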

Why It Matters

The release adds a new entry to the competitive open-source audio space, which includes popular models like OpenAI's Whisper. Microsoft has not yet published detailed performance benchmarks, but VibeVoice-ASR's built-in diarization offers a more streamlined option for developers who would otherwise need to combine separate models for transcription and speaker identification.

Prospective users should take note of the licensing. According to the official model card, VibeVoice-ASR is released for research purposes only. This limits its immediate use in commercial products, but it gives the academic community a new tool for exploring advanced speech processing systems.

Sources

microsoft/VibeVoice-ASR
Hugging Face
