The Open Weights

The daily record of open-source AI. New model releases, leaderboards, and what's coming next — written for people who ship.

Refreshed every 12 hours

© 2026 The Open Weights. An independent publication.

Aggregated by Claude · written with Gemini · curated by humans.

Qwen Releases 30B Model for Audio Captioning

The new Mixture-of-Experts model from Alibaba is fine-tuned to generate detailed, multilingual descriptions for complex audio content.

Sep 15, 2025
Qwen · Alibaba · Any-to-Any
Qwen3-Omni-30B-A3B-Captioner

Alibaba's Qwen team has released Qwen3-Omni-30B-A3B-Captioner, a specialized model designed to generate detailed descriptions of audio content. It belongs to the "omni-modal" Qwen3-Omni family, which can process various data types, but it has been fine-tuned specifically for audio captioning, a more nuanced task than simple speech-to-text transcription.

The model is built on a Mixture-of-Experts (MoE) architecture with 30 billion parameters in total. During inference it activates only about 3 billion of them per token (the "A3B" in the model name), so per-token compute is closer to that of a 3-billion-parameter dense model, although all 30 billion parameters must still be loaded into memory. This lower compute cost makes the model more accessible for researchers and developers to run and experiment with.
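
To put those figures in rough perspective, here is a minimal sketch that estimates the share of parameters active per token and the memory needed just to hold the weights. The 30B and 3B counts are the rounded figures quoted above; the bytes-per-parameter values are assumptions about common precisions, not published specs, and the estimate ignores activations, caches, and any encoder overhead.

```python
# Back-of-the-envelope MoE sizing.
# Parameter counts are the rounded figures from the article;
# the precisions below are illustrative assumptions.

TOTAL_PARAMS = 30e9   # all experts; every one must be resident in memory
ACTIVE_PARAMS = 3e9   # parameters actually used per token at inference

# Assumed storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that do work on a given token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active fraction: {frac:.0%}")
    for name, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
        print(f"{name}: ~{gb:.0f} GB of weights")
```

Under these assumptions, the weights alone come to about 60 GB in bf16, even though only about a tenth of the parameters are exercised on any given token.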

Capabilities and Use Cases

The primary function of the Qwen3-Omni Captioner is to understand and describe complex audio environments in multiple languages. This includes identifying and explaining a wide range of sounds, such as:

  • Ambient noise and environmental sounds
  • Musical cues and instrumentation
  • Overlapping speech and non-speech events

This capability is a valuable building block for advanced accessibility tools, automated media indexing, and content analysis systems that need to understand the full context of an audio track.

The model is available now on the Hugging Face Hub. It's released under a custom research-focused license, so users should review the terms before incorporating it into their work.
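
One way to check the license metadata before downloading anything is to query the Hub's public model endpoint, as in the sketch below. The repository ID comes from the Sources list; the endpoint URL pattern and the `license:` tag convention are assumptions about the Hub's JSON response, and the helper names are hypothetical. The tag is only a pointer, so the full license text on the model card remains the thing to read.

```python
import json
import urllib.request

# Repository ID as listed in this article's sources.
REPO_ID = "Qwen/Qwen3-Omni-30B-A3B-Captioner"


def model_api_url(repo_id: str) -> str:
    """Build the (assumed) public metadata URL for a Hub model."""
    return f"https://huggingface.co/api/models/{repo_id}"


def license_tags(model_info: dict) -> list[str]:
    """Extract license identifiers from the 'tags' list, if present.

    Assumes licenses appear as tags of the form 'license:<name>'.
    """
    prefix = "license:"
    return [
        tag[len(prefix):]
        for tag in model_info.get("tags", [])
        if tag.startswith(prefix)
    ]


def fetch_model_info(repo_id: str) -> dict:
    """Download and parse the model's metadata (requires network access)."""
    with urllib.request.urlopen(model_api_url(repo_id), timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    info = fetch_model_info(REPO_ID)
    print(license_tags(info) or "no license tag found; read the model card")
```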

Sources

  • Qwen/Qwen3-Omni-30B-A3B-Captioner (Hugging Face)

Specs

  • Parameters: 30B · MoE
  • Context window: —
  • License: OTHER
  • Downloads: 4.2K

Modalities

  • Any-to-Any
  • Text → Speech
  • Vision-Language
