Qwen3-Omni Arrives With Any-to-Any Multimodality

The new 30B Mixture-of-Experts model from Alibaba's Qwen team accepts text, image, audio, and video input and responds in text or speech.

Sep 20, 2025

Alibaba's Qwen team has released Qwen3-Omni, a new open model family built for multimodal work. The first release, a 30-billion-parameter instruction-tuned variant, is designed for "any-to-any" tasks: it natively takes in text, vision, and audio inputs and produces both text and speech.

This omni-modal capability sets it apart from typical open-source releases. While many models can interpret images and text, Qwen3-Omni also covers speech-to-text transcription, text-to-speech generation, and visual language understanding. As a result, it can act as a single integrated assistant that understands a spoken query about an image and responds with a spoken answer.

Technical Details

The model uses a Mixture-of-Experts (MoE) architecture with 30 billion total parameters, of which only about 3 billion are active per token during inference (the "A3B" in the model name). That keeps compute cost well below that of a dense model of the same size. According to its official model card, it is built to handle complex, interleaved inputs from different modalities.
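
To illustrate why an MoE model runs only a fraction of its parameters per token, here is a minimal sketch of top-k expert routing in plain Python. It is a toy under stated assumptions, not Qwen3-Omni's implementation: the expert count, the value of k, the random gate scores, and the helper names `top_k_route` and `moe_forward` are all hypothetical.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected experts so their weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k):
    """Mix the outputs of the k selected experts; the others are never called."""
    routed = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    return output, [i for i, _ in routed]


# Hypothetical toy setup: 8 experts, 2 active per token.
# These numbers are illustrative and are not the model's real configuration.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=scale: scale * x
           for scale in range(1, NUM_EXPERTS + 1)]

rng = random.Random(0)  # stand-in for a learned gating network
gate_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(1.0, experts, gate_logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS
print(f"experts used: {used}")
print(f"fraction of expert parameters active: {active_fraction:.2f}")
```

In a real MoE layer the gate scores come from a learned router and each expert is a feed-forward network, but the principle is the same: total parameter count grows with the number of experts, while per-token compute grows only with k.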

Qwen3-Omni gives developers building multimodal applications a single open-weight model that spans text, vision, and audio. The Hugging Face repository lists the license as Apache 2.0, a standard permissive license rather than a custom one, though teams planning commercial use should still confirm the terms on the model card.

Sources

Qwen/Qwen3-Omni-30B-A3B-Instruct (Hugging Face)
