
NVIDIA Fuses LLM and ASR in Canary-Qwen 2.5B Model

The 2.5 billion-parameter speech model combines a FastConformer encoder with a Qwen LLM decoder, a hybrid approach to transcription.

Jun 26, 2025


NVIDIA has released Canary-Qwen 2.5B, an automatic speech recognition (ASR) model built on a hybrid architecture. Rather than relying on a decoder built solely for transcription, the 2.5 billion-parameter model pairs a specialized audio encoder with a general-purpose large language model that generates the text.

This hybrid design is the model's key feature. A FastConformer encoder, a component optimized for processing audio efficiently, converts the input speech into a sequence of acoustic representations. Those representations are then handed off to a decoder based on a Qwen large language model. The aim is to apply the LLM's text generation and contextual understanding to the transcription step, producing transcripts that are more accurate and easier to read.
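The hand-off described above can be pictured with a toy sketch. It is not NVIDIA's implementation: the pooling encoder, the 8x frame reduction, the linear projection, and the idea of placing audio vectors after prompt embeddings are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
from typing import List

Vector = List[float]


def toy_audio_encoder(frames: List[Vector], stride: int = 8) -> List[Vector]:
    """Hypothetical stand-in for an audio encoder.

    Averages every `stride` input frames into one vector, so the output
    sequence is shorter than the input (assumed reduction factor, not a
    figure taken from the article).
    """
    pooled = []
    for start in range(0, len(frames), stride):
        chunk = frames[start:start + stride]
        dim = len(chunk[0])
        pooled.append([sum(f[i] for f in chunk) / len(chunk) for i in range(dim)])
    return pooled


def project(vectors: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Linear map from encoder width to decoder embedding width.

    `weights` has shape [out_dim][in_dim]; this adapter is an assumption.
    """
    return [[sum(w * x for w, x in zip(row, v)) for row in weights] for v in vectors]


def build_decoder_input(prompt: List[Vector], audio: List[Vector]) -> List[Vector]:
    """Concatenate prompt embeddings and projected audio into one sequence."""
    return prompt + audio


# 32 audio frames of width 4 (made-up numbers).
frames = [[float(t)] * 4 for t in range(32)]
# Projection from width 4 to an assumed decoder embedding width of 6.
weights = [[0.1 * (r + 1)] * 4 for r in range(6)]
# Three prompt-token embeddings, e.g. an instruction such as "Transcribe:".
prompt = [[0.0] * 6 for _ in range(3)]

audio_states = toy_audio_encoder(frames)          # 32 frames -> 4 vectors
audio_embeddings = project(audio_states, weights)  # width 4 -> width 6
decoder_input = build_decoder_input(prompt, audio_embeddings)

print(len(audio_states), len(decoder_input))  # prints "4 7"
```

In a real system the decoder would be the language model itself, generating the transcript token by token from a sequence like `decoder_input`; the sketch stops at the point where audio and text share one embedding space.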

According to its Hugging Face model card, the model transcribes English speech and adds punctuation and capitalization automatically, two tasks that ASR systems often handle poorly. Using an LLM as the "brain" for a specialized task reflects a broader trend in AI, where generalist models are adapted to improve specific applications.

Canary-Qwen 2.5B is available on Hugging Face under a CC-BY-4.0 license. The release gives developers a new option for speech-to-text applications and a concrete example of how a dedicated audio encoder and a general-purpose LLM can be combined.

Sources

nvidia/canary-qwen-2.5b (Hugging Face)
