Qwen · Alibaba / Speech → Text

Qwen Releases 0.6B Model for Audio-Text Alignment

The new open-source tool, based on the Qwen3 architecture, precisely synchronizes audio recordings with their corresponding text transcripts.

Jan 28, 2026

Notable · Apache 2.0

Alibaba's Qwen team has released Qwen3-ForcedAligner-0.6B, a specialized tool for audio processing. The compact 600-million-parameter model is designed for one specific task in speech AI: aligning existing text with an audio recording.

Unlike standard speech-to-text models that generate text from scratch, a forced aligner takes both an audio file and its transcript as input. It then determines the precise start and end times for each word in the audio, effectively synchronizing the two. This capability is essential for creating accurately timed subtitles, preparing high-quality datasets for training other speech models, and conducting phonetic research.
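To make the subtitle use case concrete, here is a minimal Python sketch of what can be done with word-level timings once an aligner has produced them. It does not call the Qwen model. The `WordTiming` record, the `words_to_srt` helper, and the sample timings are all hypothetical, since the article does not describe the model's actual output format. The sketch groups aligned words into short cues and renders them in the standard SRT subtitle layout.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    """One aligned word with times in seconds (assumed shape, not the model's real output)."""
    word: str
    start: float
    end: float


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[WordTiming], max_words: int = 7) -> str:
    """Group aligned words into cues of at most max_words and render them as SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start = format_timestamp(chunk[0].start)
        end = format_timestamp(chunk[-1].end)
        text = " ".join(w.word for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hand-written example timings for illustration, not real model output.
example = [
    WordTiming("hello", 0.00, 0.42),
    WordTiming("world", 0.50, 0.93),
    WordTiming("again", 1.10, 1.58),
]

if __name__ == "__main__":
    print(words_to_srt(example, max_words=2))
```

Grouping by a fixed word count is the simplest possible policy. A real subtitle pipeline would more likely break cues at pauses or punctuation, which word-level start and end times also make possible.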

The model is built on the Qwen3 architecture and is available on the Hugging Face Hub under the permissive Apache 2.0 license, which allows broad commercial use. Its small size of 0.6B parameters suggests it can run efficiently, which would make alignment technology more accessible to developers and researchers.

The release of Qwen3 ForcedAligner adds another foundational component to the open-source audio ecosystem, providing a key tool for building more sophisticated applications that handle spoken language.

Sources

Qwen/Qwen3-ForcedAligner-0.6B (Hugging Face)

