HKUST Releases Audio-Omni, a Unified Audio Model

The new diffusion-based model handles speech, music, and general audio tasks like conversion and editing within a single, versatile framework.

Mar 27, 2026

Notable · CC BY-NC 4.0

Researchers from the Hong Kong University of Science and Technology (HKUST) have released Audio-Omni, a new model that aims to unify a wide range of audio generation tasks. Unlike specialized models designed for a single purpose, Audio-Omni is an "any-to-any" system: it accepts diverse kinds of audio as input and can produce diverse kinds of audio as output.

The model is built on a diffusion-based architecture, which allows it to generate high-fidelity audio by progressively refining noise into a coherent signal. This single framework is designed to understand and process various audio modalities, from human speech to complex musical compositions and environmental sounds, treating them all as interchangeable data types.
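To make "progressively refining noise into a coherent signal" concrete, here is a minimal toy sketch of an iterative denoising loop. It is not Audio-Omni's sampler, whose details the release summary does not give. The names `toy_denoiser` and `sample`, the sine-wave target, and the fixed step strength are all assumptions made for illustration. A trained diffusion model would predict the clean signal from the noisy input alone, whereas this stand-in is simply handed the target.

```python
import math
import random


def toy_denoiser(noisy, target, strength):
    """Stand-in for a trained network (hypothetical).

    Nudges each sample a fraction of the way toward a known target.
    A real model would infer the clean signal without seeing it.
    """
    return [x + strength * (t - x) for x, t in zip(noisy, target)]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length signals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def sample(num_samples=64, num_steps=20, seed=0):
    """Refine Gaussian noise toward a toy waveform, recording the error per step."""
    rng = random.Random(seed)
    # Hypothetical "clean" audio: one cycle of a sine wave.
    target = [math.sin(2 * math.pi * i / num_samples) for i in range(num_samples)]
    # Start from pure Gaussian noise.
    signal = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    errors = [mean_abs_error(signal, target)]
    for _ in range(num_steps):
        # Each pass removes a fixed fraction of the remaining error.
        signal = toy_denoiser(signal, target, strength=0.3)
        errors.append(mean_abs_error(signal, target))
    return signal, errors


if __name__ == "__main__":
    _, errors = sample()
    print(f"error before refinement: {errors[0]:.4f}")
    print(f"error after refinement:  {errors[-1]:.4f}")
```

The point of the sketch is only the shape of the loop: generation starts from noise and applies the same refinement step many times, with the output getting closer to a coherent signal at each pass.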

A Generalist Approach to Audio

Audio-Omni can perform a broad set of tasks that would typically require several separate models. According to its Hugging Face repository, its key capabilities include:

Conversion: Transforming speech to music, music to speech, or one style of music to another.
Generation: Creating music or speech from text prompts.
Editing: Modifying existing audio, such as separating stems or in-painting missing sections.
Continuation: Extending an existing audio clip in a consistent style.
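The release summary does not describe how Audio-Omni represents these tasks internally. One common framing in diffusion-based editing, shown here purely as an assumption, is to express generation, in-painting, and continuation as the same operation with a different mask over the audio timeline: frames marked True are regenerated, frames marked False are kept from the input. The function name `regeneration_mask` and the task labels are hypothetical.

```python
def regeneration_mask(task, total_frames, start=0, end=None):
    """Return a list of booleans, True where new audio should be generated.

    Hypothetical helper for illustration; not part of any Audio-Omni API.
    """
    if end is None:
        end = total_frames
    if not 0 <= start <= end <= total_frames:
        raise ValueError("expected 0 <= start <= end <= total_frames")

    if task == "generation":
        # Nothing is kept: every frame is generated from scratch.
        return [True] * total_frames
    if task == "inpainting":
        # Only the gap [start, end) is regenerated; the rest is preserved.
        return [start <= i < end for i in range(total_frames)]
    if task == "continuation":
        # Frames before `start` are the existing clip; the tail is generated.
        return [i >= start for i in range(total_frames)]
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for task, kwargs in [
        ("generation", {}),
        ("inpainting", {"start": 2, "end": 4}),
        ("continuation", {"start": 3}),
    ]:
        print(task, regeneration_mask(task, 6, **kwargs))
```

Under this framing a single model can serve several editing tasks, because only the mask and the conditioning change, not the generation procedure.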

This release represents another step toward building more generalized foundation models for audio. By consolidating disparate tasks into one model, Audio-Omni points to a future where audio generation is less fragmented and more universally accessible. The model is available for research and non-commercial use under a CC BY-NC 4.0 license.

Sources

HKUSTAudio/Audio-Omni (Hugging Face)
