NVIDIA Releases Streaming Speech-to-Text Model
The 600-million-parameter Nemotron model is designed for real-time English transcription using a cache-aware FastConformer architecture.

NVIDIA has released Nemotron Speech Streaming EN 0.6B, a new model for automatic speech recognition (ASR). The 600-million-parameter model is built for real-time, streaming transcription of English audio, which suits it to applications that need text while the speaker is still talking.
Built for Real-Time Performance
The model is based on FastConformer, a speed-optimized variant of the Conformer speech encoder. Its key feature is "cache-aware streaming": it processes audio in small chunks as they arrive rather than waiting for an entire recording. By carrying its internal state, or cache, from one chunk to the next, the model can reuse earlier context instead of recomputing it and deliver continuous transcription with minimal delay.
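The idea of carrying a cache between chunks can be illustrated with a toy sketch. This is not NVIDIA's implementation or any library's API: the `causal_layer` and `stream` functions, the `LEFT_CONTEXT` value, and the use of a moving average as a stand-in for an encoder layer are all assumptions made for illustration. It shows only the general principle that, when each output depends on a bounded amount of past input, caching that past input lets chunked processing reproduce the result of a full pass.

```python
from typing import List, Tuple

# Hypothetical setting: each output frame may look at this many past frames.
LEFT_CONTEXT = 3


def causal_layer(frames: List[float], cache: List[float]) -> Tuple[List[float], List[float]]:
    """Toy stand-in for one encoder layer: a causal moving average.

    `cache` holds the most recent input frames from earlier chunks, so the
    outputs for the new chunk match what a pass over the full signal gives.
    """
    padded = cache + frames
    outputs = []
    # Only emit outputs for the new frames; cached frames supply context.
    for i in range(len(cache), len(padded)):
        window = padded[max(0, i - LEFT_CONTEXT): i + 1]
        outputs.append(sum(window) / len(window))
    # Keep just enough history for the next chunk.
    new_cache = padded[-LEFT_CONTEXT:]
    return outputs, new_cache


def stream(chunks: List[List[float]]) -> List[float]:
    """Process chunks in arrival order, passing the cache forward each time."""
    cache: List[float] = []
    results: List[float] = []
    for chunk in chunks:
        out, cache = causal_layer(chunk, cache)
        results.extend(out)
    return results


if __name__ == "__main__":
    audio = [float(x) for x in range(10)]                      # fake "audio" frames
    chunks = [audio[i:i + 4] for i in range(0, len(audio), 4)]  # arrives 4 frames at a time
    full_pass, _ = causal_layer(audio, [])
    print("chunked output matches full pass:", stream(chunks) == full_pass)
```

In this sketch the cache is just the last few input frames. A real streaming encoder would cache richer per-layer state, but the trade-off is the same: a small amount of retained state avoids reprocessing audio that has already been heard.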
This streaming approach is critical for interactive voice applications. Potential use cases include:
- Live captioning for broadcasts and events
- Responsive voice assistants
- Real-time transcription for meetings or customer service calls
By releasing a specialized model for this task, NVIDIA provides developers with another tool for building responsive, voice-enabled products. The model is available on Hugging Face under the NVIDIA Open Model License Agreement, and interested users can find full details in the official repository.
Sources
- nvidia/nemotron-speech-streaming-en-0.6b (Hugging Face)