The Open Weights

The daily record of open-source AI. New model releases, leaderboards, and what's coming next — written for people who ship.

Refreshed every 12 hours

© 2026 The Open Weights. An independent publication.

Aggregated by Claude · written with Gemini · curated by humans.

Category · audio

Latest Speech → Text models

The newest open-source Speech → Text releases, from across the ecosystem.

27 releases

zhifeixie/Speech → Text

Mega-ASR Improves on Qwen for Speech Recognition

Researcher Zhifei Xie has released a 1.7B-parameter model that refines Alibaba's Qwen3-ASR, showing improved performance on English and Chinese transcription benchmarks.

May 19, 2026
Speech → Text
Mega-ASR
NVIDIA/Speech → Text

NVIDIA Releases Nemotron-3.5 Streaming ASR Model

The 600-million-parameter model uses a FastConformer architecture for real-time, multilingual speech-to-text applications.

May 15, 2026
Speech → Text
Nemotron 3.5 ASR Streaming 0.6B
Xiaomi/Speech → Text

Xiaomi Releases MiMo Model for Speech Recognition

The new open-source model from the Chinese tech giant offers automatic speech recognition for Mandarin, Cantonese, and English under a permissive MIT license.

Apr 23, 2026
Speech → Text
MiMo-V2.5-ASR
IBM/Speech → Text

IBM Releases 2B Granite Model for Multilingual Speech

The new two-billion-parameter model offers transcription capabilities for at least five major languages under a permissive Apache 2.0 license.

Apr 16, 2026
Speech → Text
Granite Speech 4.1 2B
KRAFTON/Any-to-Any

KRAFTON Releases 9B Bilingual Speech Model

The gaming giant behind 'PUBG' has released Raon-Speech-9B, a multimodal model for English and Korean speech recognition and synthesis.

Mar 30, 2026
Speech → Text · Any-to-Any
Raon-Speech-9B
Cohere/Speech → Text

Cohere Releases Top-Ranked Multilingual Transcription Model

The new automatic speech recognition model from Cohere Labs sets a new benchmark on the Hugging Face Open ASR Leaderboard for multilingual performance.

Mar 24, 2026
Speech → Text
Cohere Transcribe 03-2026
IBM/Speech → Text

IBM Releases 1B Granite Model for Multilingual Speech

The new Apache 2.0-licensed model is part of the company's Granite family and aims to provide high-quality speech-to-text across several languages.

Feb 27, 2026
Speech → Text
Granite 4.0 1B Speech
Qwen · Alibaba/Speech → Text

Qwen Releases 0.6B Model for Audio-Text Alignment

The new open-source tool, based on the Qwen3 architecture, precisely synchronizes audio recordings with their corresponding text transcripts.

Jan 28, 2026
Speech → Text
Qwen3 ForcedAligner 0.6B
Qwen · Alibaba/Speech → Text

Qwen3 Family Expands into Speech Recognition

Alibaba's Qwen team has released a new 1.7-billion-parameter model designed specifically for automatic speech recognition.

Jan 28, 2026
Speech → Text
Qwen3-ASR-1.7B
Qwen · Alibaba/Speech → Text

Qwen Open-Sources Compact Model for Speech Recognition

The new 600-million-parameter Qwen3-ASR model is designed for efficient, high-quality audio transcription under a permissive license.

Jan 28, 2026
Speech → Text
Qwen3-ASR-0.6B
Mistral AI/Speech → Text

Mistral Enters Speech AI with Voxtral Mini Model

The company, known for its powerful text models, has released its first open-source speech recognition system designed for real-time, multilingual transcription.

Jan 21, 2026
Speech → Text
Voxtral Mini 4B Realtime
Microsoft/Speech → Text

Microsoft Releases VibeVoice for Speech Transcription

The new open-source automatic speech recognition model handles multilingual transcription and speaker identification out of the box.

Jan 21, 2026
Speech → Text
VibeVoice ASR
Qwen · Alibaba/Any-to-Any

Qwen's Fun-Audio-Chat: An Open Speech-to-Speech LLM

The 8-billion-parameter model from Alibaba's Qwen team understands speech and generates spoken responses, enabling more natural audio-first applications.

Dec 23, 2025
Speech → Text · Any-to-Any
Fun-Audio-Chat-8B
Google DeepMind/Speech → Text

Google Releases MedASR for Medical Transcription

The new speech recognition model from DeepMind is trained specifically on medical dictation, aiming for higher accuracy in clinical notes.

Dec 18, 2025
Speech → Text
MedASR
NVIDIA/Speech → Text

NVIDIA Releases Streaming Speech-to-Text Model

The 600-million-parameter Nemotron model is designed for real-time English transcription using a cache-aware FastConformer architecture.

Dec 17, 2025
Speech → Text
Nemotron Speech Streaming EN 0.6B
Qwen · Alibaba/Speech → Text

Qwen Releases Compact ASR Model for Streaming Audio

The new Fun-ASR-Nano model from Alibaba's team packs real-time multilingual transcription, speaker diarization, and hotword detection into an efficient package.

Dec 15, 2025
Speech → Text
Fun-ASR-Nano-2512
Zhipu AI/Speech → Text

Zhipu AI Releases Compact Bilingual Speech Model

The new GLM-ASR-Nano model is designed for efficient automatic speech recognition in both English and Mandarin Chinese.

Dec 9, 2025
Speech → Text
GLM-ASR-Nano-2512
NVIDIA/Speech → Text

NVIDIA Releases Real-Time Speaker Diarization Model

The new Sortformer-based model is designed for streaming audio, identifying up to four distinct speakers in real time.

Oct 22, 2025
Speech → Text
Streaming Sortformer Diarization 4spk v2.1
NVIDIA/Speech → Text

NVIDIA's Parakeet ASR Tackles Multi-Speaker Audio

The 600-million-parameter model offers real-time speech-to-text with speaker diarization, built on the efficient FastConformer architecture.

Oct 15, 2025
Speech → Text
Multitalker Parakeet Streaming 0.6B
inclusionAI/Any-to-Any

Ming-UniAudio Brings MoE to Unified Audio AI

A new 16-billion-parameter model from inclusionAI uses a Mixture-of-Experts architecture to handle a wide range of audio tasks efficiently.

Sep 29, 2025
Speech → Text · Any-to-Any
Ming-UniAudio-16B-A3B
Qwen · Alibaba/Any-to-Any · Major release

Qwen3-Omni Arrives With Any-to-Any Multimodality

The new 30B Mixture-of-Experts model from Alibaba's Qwen team can process and generate content across text, image, and audio formats.

Sep 20, 2025
Speech → Text · Any-to-Any
Qwen3-Omni-30B-A3B-Instruct
Xiaomi/Any-to-Any

Xiaomi's MiMo-Audio 7B Tackles Complex Speech Tasks

This new instruction-tuned model from Xiaomi can handle a flexible combination of audio and text inputs and outputs, from transcription to voice synthesis.

Sep 18, 2025
Speech → Text · Any-to-Any
MiMo-Audio-7B-Instruct
StepFun/Any-to-Any

StepFun Releases Step-Audio 2 mini, a Unified Audio AI

The new open-source model handles both speech recognition and audio generation in a single, end-to-end architecture.

Aug 28, 2025
Speech → Text · Any-to-Any
Step-Audio 2 mini
NVIDIA/Speech → Text

NVIDIA Releases Canary 1B v2 Multilingual Speech Model

The new 1-billion-parameter model handles both transcription and translation across five languages using the company's efficient FastConformer architecture.

Aug 4, 2025
Speech → Text
Canary 1B v2
NVIDIA/Speech → Text

NVIDIA Releases 600M Parakeet for Speech Recognition

The new FastConformer model uses a specialized training technique to improve transcription accuracy in noisy, real-world environments.

Aug 4, 2025
Speech → Text
Parakeet TDT 0.6B v3
T-Tech/Speech → Text

T-Tech Releases T-one for Russian Speech Recognition

The new streaming Conformer model from the Russian digital bank is optimized for real-time transcription of telephone conversations.

Jul 14, 2025
Speech → Text
T-one
NVIDIA/Speech → Text

NVIDIA Fuses LLM and ASR in Canary-Qwen 2.5B Model

The 2.5-billion-parameter speech model combines a FastConformer encoder with a Qwen LLM decoder, a hybrid approach to transcription.

Jun 26, 2025
Speech → Text
Canary-Qwen 2.5B