Ming-UniAudio Brings MoE to Unified Audio AI

A new 16-billion-parameter model from inclusionAI uses a Mixture-of-Experts architecture to handle a wide range of audio tasks efficiently.

Sep 29, 2025

A new open-source model aims to unify a broad spectrum of audio AI tasks, from understanding to generation. The model, called Ming-UniAudio, was released by the research group inclusionAI and features a Mixture-of-Experts (MoE) architecture, a technique for increasing model capacity without a proportional rise in computational cost.
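To make the Mixture-of-Experts idea concrete, here is a minimal, generic sketch of top-k expert routing in plain Python. It is not Ming-UniAudio's actual implementation: the expert count, vector size, `top_k` value, and the toy "experts" (simple scaling functions) are all assumptions chosen for illustration. The point it shows is that a router scores every expert but only the few highest-scoring ones are actually run.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route vector x to its top_k experts and mix their outputs.

    Returns the mixed output and the indices of the experts that ran.
    """
    # The router produces one score (logit) per expert via a dot product.
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    probs = softmax(logits)

    # Keep only the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the gate values so the chosen experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        gate = probs[i] / norm
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, chosen


def make_expert(scale):
    """Hypothetical toy expert: multiplies its input by a fixed scale."""
    return lambda x: [scale * v for v in x]


random.seed(0)
NUM_EXPERTS, DIM = 8, 4  # illustrative sizes, not the model's real configuration
experts = [make_expert(s) for s in range(1, NUM_EXPERTS + 1)]
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

x = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(x, router_weights, experts, top_k=2)
print(f"ran {len(chosen)} of {NUM_EXPERTS} experts: {sorted(chosen)}")
```

In this toy setup, six of the eight experts are skipped for the input, which is the source of the compute savings: capacity grows with the number of experts, while per-input work grows only with `top_k`.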

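Ming-UniAudio has 16 billion parameters in total but activates only about 3 billion of them during inference, as the "16B-A3B" suffix in its model name indicates. Per-token compute is therefore closer to that of a 3-billion-parameter dense model than a 16-billion-parameter one, although all 16 billion weights must still be held in memory. That efficiency could lower the barrier for developers and researchers who want to experiment with large-scale audio models.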
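The arithmetic behind those figures can be worked through as a rough back-of-the-envelope estimate. The sketch below uses the article's 16 billion and 3 billion parameter counts; the 2 bytes per parameter (16-bit weights) and the "about 2 FLOPs per active parameter per token" rule of thumb are assumptions, not published numbers for this model.

```python
TOTAL_PARAMS = 16e9    # total parameters (from the article)
ACTIVE_PARAMS = 3e9    # parameters activated during inference (from the article)
BYTES_PER_PARAM = 2    # assumption: 16-bit weights (bf16 or fp16)

# Share of the model that does work for a given token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# All experts must be resident, so memory scales with the total count.
weight_memory_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9

# Rule of thumb (assumption): a forward pass costs ~2 FLOPs per active parameter per token.
flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense = 2 * TOTAL_PARAMS
compute_ratio = flops_per_token_dense / flops_per_token_moe

print(f"active fraction: {active_fraction:.2%}")                  # 18.75%
print(f"weight memory at 16-bit: {weight_memory_gb:.0f} GB")       # 32 GB
print(f"dense-16B vs MoE compute per token: {compute_ratio:.1f}x") # 5.3x
```

Under these assumptions, the model does roughly a fifth of the per-token work of an equally sized dense model, while its weight storage requirement stays the same.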

A Generalist for Sound

The model is described as a "unified" audio model, capable of handling a diverse set of tasks that would typically require separate specialized systems. Its capabilities include:

Speech recognition and translation
Speaker and emotion recognition
Music understanding and tagging
Text-to-speech (TTS) generation

This versatility points toward a future where single, powerful models can serve as the backbone for complex, multi-faceted audio applications. By combining understanding and generation, Ming-UniAudio can power more interactive and seamless voice and sound experiences. The model is available under an Apache 2.0 license, permitting commercial use.

Sources

inclusionAI/Ming-UniAudio-16B-A3B (Hugging Face)
