Company

Microsoft

9 models · US

Categories: Vision-Language, Speech → Text, Text → Image, Text / LLM, Text → Speech

Releases

Microsoft/Vision-Language

Microsoft's Mage-VL Streams Video Natively

A codec-native multimodal foundation model aims to understand live video and vision-language input in real time.

Jul 26, 2026

Vision-Language, Any-to-Any

Microsoft/Speech → Text

Microsoft's VibeVoice ASR Goes BitNet for CPU Speech

A BitNet-quantized speech recognition model trades GPU dependence for efficient CPU inference in English and Chinese.

Jul 24, 2026

Speech → Text

Microsoft/Text → Image

Microsoft's Mage-Flow packs image editing into 4B

A compact model handles both text-to-image generation and instruction-based edits at native resolution, under a permissive MIT license.

Jul 21, 2026

Text → Image, Image Editing

Microsoft/Vision-Language

Microsoft's Fara1.5-27B targets computer-use agents

A 27B-parameter vision-language model built to drive browsers and desktop apps like a human operator.

Jul 17, 2026

Vision-Language

Microsoft/Vision-Language

Microsoft previews GELab-Zero-4B, a compact GUI agent

The 4-billion-parameter vision-language model targets on-screen and mobile automation, built atop Qwen3-VL.

Jun 30, 2026

Vision-Language

Microsoft/Text / LLM

Microsoft's FastContext is a 4B sub-agent for code

A compact Qwen3-derived model built to explore repositories, released under a permissive MIT license.

Jun 14, 2026

Text / LLM, Code

Microsoft/Speech → Text

Microsoft Releases VibeVoice for Speech Transcription

The new open-source automatic speech recognition model handles multilingual transcription and speaker identification out of the box.

Jan 21, 2026

Speech → Text

Microsoft/Text → Speech

Microsoft Releases VibeVoice for Real-Time AI Speech

The new 500-million-parameter model is designed for generating natural, long-form speech with very low latency for interactive applications.

Dec 4, 2025

Text → Speech

Microsoft/Vision-Language

Microsoft Releases Fara-7B Vision Agent Model

The 7-billion-parameter model is designed to understand and interact with graphical user interfaces, building on Alibaba's open-source Qwen2.5-VL.

Oct 30, 2025

Vision-Language

Microsoft/Text → Speech

Microsoft Releases VibeVoice for Long-Form Audio

The new 1.5-billion-parameter text-to-speech model is designed to generate natural, multi-speaker audio for podcasts and other long-form content.

Aug 25, 2025

Text → Speech