Qwen Releases Compact ASR Model for Streaming Audio

The new Fun-ASR-Nano model from Alibaba's team packs real-time multilingual transcription, speaker diarization, and hotword detection into an efficient package.

Dec 15, 2025

Alibaba's Qwen team has released Fun-ASR-Nano-2512, a new automatic speech recognition (ASR) model designed for efficiency and real-time performance. As its "Nano" designation suggests, the model is compact, making it a candidate for applications where computational resources are constrained.

Fun-ASR-Nano moves beyond simple transcription by integrating several advanced features often found in much larger systems. Its architecture is built for streaming audio, allowing it to process speech with low latency as it's spoken, rather than waiting for an entire audio file to be complete.
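To make the streaming idea concrete, here is a minimal sketch of chunked processing in plain Python. It does not use the model's actual API: `chunk_audio`, `stream_transcribe`, and `fake_recognizer` are hypothetical names, and the 16 kHz sample rate and 6,400-sample chunk size are assumptions chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_audio(samples: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the final chunk may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def stream_transcribe(
    chunks: Iterable[List[float]],
    recognize_chunk: Callable[[List[float]], str],
) -> Iterator[str]:
    """Emit a partial result as soon as each chunk has been processed."""
    for chunk in chunks:
        yield recognize_chunk(chunk)


def fake_recognizer(chunk: List[float]) -> str:
    # Hypothetical stand-in for a real recognizer: it reports the chunk
    # length instead of producing text.
    return f"<{len(chunk)} samples>"


samples = [0.0] * 16000  # one second of silence at an assumed 16 kHz rate
partials = list(stream_transcribe(chunk_audio(samples, 6400), fake_recognizer))
print(partials)
```

Because results are yielded per chunk, a caller can display partial output while later audio is still arriving, which is the property that distinguishes streaming from whole-file transcription.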

Structured Audio Output

This combination of features makes the model particularly useful for building sophisticated conversational AI and analysis tools. Key capabilities detailed in the official release include:

Speaker diarization: Identifying who is speaking and when.
Word-level timestamps: Aligning transcribed text with its precise timing in the source audio.
Hotword detection: Customizing the model to reliably recognize specific keywords.
Multilingual support: Processing speech from multiple languages.
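As an illustration of how diarization and word-level timestamps combine into structured output, the sketch below groups timestamped words into speaker turns. The `Word` and `Turn` types and the `group_turns` helper are hypothetical and are not taken from the model's release; the actual output format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float
    speaker: str  # diarization label, e.g. "A" or "B"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


# Made-up example data, not real model output.
words = [
    Word("hello", 0.0, 0.4, "A"),
    Word("there", 0.4, 0.8, "A"),
    Word("hi", 1.0, 1.2, "B"),
]
turns = group_turns(words)
for turn in turns:
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

A meeting assistant could render these turns directly as a speaker-attributed transcript, with the timestamps linking each turn back to its position in the recording.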

By packaging these tools into a lightweight model, the Qwen team gives developers a single component for on-device or edge applications, such as smart meeting assistants or embedded voice-controlled interfaces. The model is available under the custom Model-Scope Open-Source License.

Sources

FunAudioLLM/Fun-ASR-Nano-2512 (Hugging Face)
