NVIDIA Fuses LLM and ASR in Canary-Qwen 2.5B Model
The 2.5 billion-parameter speech model combines a FastConformer encoder with a Qwen LLM decoder, a hybrid approach to transcription.
NVIDIA has released Canary-Qwen 2.5B, an automatic speech recognition (ASR) model built from two distinct components. Instead of a single, purpose-built speech-to-text network, the 2.5 billion-parameter model pairs a specialized audio encoder with a general-purpose large language model that decodes the text.
This hybrid design is the model's defining feature. A FastConformer encoder, an architecture optimized for processing audio efficiently, converts the input speech into a sequence of acoustic representations. Those representations are then passed to a decoder based on a Qwen large language model, which generates the transcript. The aim is to draw on the LLM's text generation and contextual understanding to produce more accurate, more readable transcriptions.
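The encoder-to-decoder handoff can be illustrated with a toy sketch. It is not NVIDIA's implementation: the function names (`toy_encode`, `project`, `splice_audio`), the two-number "features," and the idea of swapping audio embeddings in for a placeholder slot in the prompt are all assumptions chosen to show the general shape of such a pipeline, in plain Python with no speech libraries.

```python
from typing import List

Vector = List[float]


def toy_encode(samples: List[float], frame_size: int) -> List[Vector]:
    """Hypothetical stand-in for an audio encoder.

    Emits one 2-dim feature vector (mean, energy) per non-overlapping frame,
    so the output sequence is much shorter than the raw sample sequence.
    """
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        mean = sum(frame) / frame_size
        energy = sum(x * x for x in frame) / frame_size
        features.append([mean, energy])
    return features


def project(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Linear map from the encoder's feature size to the decoder's embedding size.

    `weight` has shape (embed_dim, feature_dim).
    """
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight]
            for feat in features]


def splice_audio(prompt: List[Vector], audio: List[Vector], slot: int) -> List[Vector]:
    """Replace the placeholder embedding at index `slot` with the audio embeddings."""
    return prompt[:slot] + audio + prompt[slot + 1:]


# Eight raw samples become two frames of four samples each.
samples = [0.0, 1.0, 0.0, -1.0, 2.0, 2.0, 2.0, 2.0]
features = toy_encode(samples, frame_size=4)

# Assumed 3-dim decoder embedding space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embeddings = project(features, weight)

# Prompt embeddings: text token, audio placeholder, text token.
prompt = [[0.1, 0.1, 0.1], [0.0, 0.0, 0.0], [0.2, 0.2, 0.2]]
decoder_input = splice_audio(prompt, audio_embeddings, slot=1)

print(len(decoder_input))  # text + 2 audio frames + text
```

In a real system the encoder and projection would be trained networks and the combined sequence would be fed to the LLM, which generates the transcript token by token. The sketch only shows how audio and text can end up in one input sequence.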
According to its model card, Canary-Qwen 2.5B transcribes English speech and adds punctuation and capitalization automatically, two details that ASR systems often get wrong. Using an LLM as the "brain" for a specialized task reflects a broader trend in AI, in which generalist models are adapted to improve specific applications.
Canary-Qwen 2.5B is available on Hugging Face, where the model card lists a CC-BY-4.0 license. The release gives developers a new option for speech-to-text applications and a concrete example of how an audio encoder and an LLM can be combined.
Sources
- nvidia/canary-qwen-2.5b (Hugging Face)