Ming-UniAudio Brings MoE to Unified Audio AI
A new 16-billion-parameter model from inclusionAI uses a Mixture-of-Experts architecture to handle a wide range of audio tasks efficiently.

A new open-source model aims to unify a broad spectrum of audio AI tasks, from understanding to generation. The model, called Ming-UniAudio, was released by the research group inclusionAI and features a Mixture-of-Experts (MoE) architecture, a technique for increasing model capacity without a proportional rise in computational cost.
Ming-UniAudio has 16 billion parameters in total but activates only 3 billion of them during inference. Because only a fraction of the weights take part in each forward pass, the model needs far less compute per step than a dense model of similar size, which could lower the barrier for developers and researchers who want to experiment with large-scale audio models.
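To make the "total versus active" distinction concrete, here is a minimal sketch of top-k expert routing, a common way MoE layers are built. It is not taken from Ming-UniAudio: the release does not describe its routing scheme here, so the toy experts, the router scores, the choice of k=2, and the function names `route_top_k` and `moe_layer` are all assumptions made for illustration. Only the 16-billion and 3-billion figures come from the model's description.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    top = order[:k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_scores, k):
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))

# Hypothetical toy setup: 8 scalar "experts", 2 active per input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_top_k(router_scores, k=2)
y = moe_layer(1.0, experts, router_scores, k=2)

# Parameter counts as reported for Ming-UniAudio-16B-A3B.
total_params = 16e9
active_params = 3e9
active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.2%}")
```

In this toy case only two of the eight experts are evaluated for the input. With the reported figures, the same idea means that under a fifth of the model's parameters are active during inference.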
A Generalist for Sound
The model is described as a "unified" audio model, capable of handling a diverse set of tasks that would typically require separate specialized systems. Its capabilities include:
- Speech recognition and translation
- Speaker and emotion recognition
- Music understanding and tagging
- Text-to-speech (TTS) generation
This versatility points toward a future in which a single model serves as the backbone for multi-faceted audio applications. By combining understanding and generation in one system, Ming-UniAudio could support voice and sound experiences that both interpret and produce audio without chaining separate specialized models. The model is available under an Apache 2.0 license, which permits commercial use.
Sources
- inclusionAI/Ming-UniAudio-16B-A3B (Hugging Face)