Latest Any-to-Any models

Microsoft/Vision-Language

Microsoft's Mage-VL Streams Video Natively

A codec-native multimodal foundation model aims to understand live video and vision-language input in real time.

Jul 26, 2026

NVIDIA/Any-to-Any

NVIDIA's Audio-Visual Flamingo Fuses Sound and Sight

A fully open multimodal model aims to reason jointly across audio, images, and long-form video.

Jul 16, 2026

Thinkingmachines/Any-to-Any (Major release)

Thinking Machines Lab debuts Inkling, its first open model

The lab's inaugural open-weights release is a mixture-of-experts system that takes image and audio inputs, shipped under a permissive Apache 2.0 license.

Jul 15, 2026

OpenMOSS/Vision-Language

OpenMOSS Debuts MOSS-VL-Realtime for Live Video

The Chinese research group's new vision-language model targets streaming understanding of video and images rather than static frames.

Jul 14, 2026

Unknown/Any-to-Any

Boogu-Image-0.1 Brings Unified Multimodal to Open Source

A new Apache-licensed model family folds bilingual text-to-image generation and instruction editing into one system.

Jul 13, 2026

NVIDIA/Any-to-Any

NVIDIA's Audex Unifies Audio Understanding and Speech

A new 30B mixture-of-experts model from NVIDIA handles both listening and speaking within a single audio-text architecture.

Jul 6, 2026

Google DeepMind/Any-to-Any (Major release)

Google DeepMind's Gemma 4 Goes Multimodal and MoE

The new open-weights family adds a mixture-of-experts design, encoder-free multimodal inputs, and an optional thinking mode.

Jul 1, 2026

Google DeepMind/Any-to-Any

Google DeepMind Releases TabFM for Tabular Data

A new foundation model brings zero-shot, in-context learning to classification and regression on structured tables.

Jun 29, 2026

SenseTime/Any-to-Any

SenseTime's SenseNova-Vision-7B-MoT Goes Any-to-Any

A single 7B model from SenseTime folds vision-language understanding, image generation, editing, and perception into one system.

Jun 29, 2026

MiniMax/Vision-Language (Major release)

MiniMax Releases M3, a Multimodal MoE Model

The new open-weight model from MiniMax AI combines vision, coding, and reasoning using a Mixture-of-Experts architecture.

Jun 2, 2026

Code Any-to-Any

Google DeepMind/Any-to-Any (Major release)

Google Releases Gemma 4 12B Multimodal Model

The new 12-billion-parameter open model from DeepMind introduces a unified 'any-to-any' architecture for advanced multimodal tasks.

May 23, 2026

Google DeepMind/Any-to-Any (Major release)

Google Releases Gemma 4, a 12B 'Any-to-Any' Model

The new 12-billion-parameter model from Google DeepMind is designed to handle a flexible mix of data types, moving beyond traditional text and image inputs.

May 23, 2026

ByteDance/Any-to-Any (Major release)

ByteDance Releases Lance, a Unified Generative AI Model

The 3-billion-parameter model handles image and video generation, editing, and understanding from a single set of weights under a permissive license.

May 15, 2026

SenseTime/Any-to-Any

SenseTime Releases 8B 'Any-to-Any' Infographic Model

The new 8B-parameter SenseNova U1 model from SenseTime is designed for complex multimodal tasks, including the in-conversation generation and editing of infographics.

May 14, 2026

NVIDIA/Any-to-Any

NVIDIA Releases Efficient Nemotron-3 Multimodal MoE

The new 30-billion-parameter Mixture-of-Experts model handles text and images while using only 3 billion active parameters for inference.

Apr 24, 2026

Google DeepMind/Any-to-Any

Google Releases Gemma 4 Multimodal Open Model

The new 26-billion-parameter model from DeepMind uses a mixture-of-experts design for greater efficiency and is tuned for assistant-style tasks.

Apr 23, 2026

Google DeepMind/Any-to-Any (Major release)

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is an instruction-tuned 'any-to-any' system released under a permissive Apache 2.0 license.

Apr 23, 2026

Google DeepMind/Any-to-Any

Google Releases 4B Multimodal Gemma 4 Assistant

The new 4-billion-parameter model is instruction-tuned for 'any-to-any' tasks, handling a flexible mix of data types.

Apr 23, 2026

Google DeepMind/Any-to-Any

Google Releases 2B Multimodal Gemma 4 Assistant Model

The new compact model from DeepMind is instruction-tuned for 'any-to-any' tasks, capable of processing and generating mixed data types.

Apr 23, 2026

inclusionAI/Any-to-Any

LLaDA2.0-Uni: A Unified MoE for Vision Tasks

The new open-source model from inclusionAI uses a Mixture-of-Experts architecture to handle multiple vision tasks in a single, diffusion-based system.

Apr 22, 2026

SenseTime/Any-to-Any

SenseTime Releases 8B Any-to-Any Multimodal Model

The new SenseNova-U1 model unifies image understanding, generation, and editing within a single 8-billion-parameter framework.

Apr 22, 2026

NVIDIA/Any-to-Any

NVIDIA Releases Nemotron-3-Nano Omni-Modal MoE

The new 30-billion-parameter Mixture-of-Experts model handles any combination of modalities with just 3 billion active parameters.

Apr 20, 2026

KRAFTON/Any-to-Any

KRAFTON Releases 9B Bilingual Speech Model

The gaming giant behind 'PUBG' has released Raon-Speech-9B, a multimodal model for English and Korean speech recognition and synthesis.

Mar 30, 2026

HKUSTAudio/Any-to-Any

HKUST Releases Audio-Omni, a Unified Audio Model

The new diffusion-based model handles speech, music, and general audio tasks like conversion and editing within a single, versatile framework.

Mar 27, 2026

Any-to-Any Music

Meituan/Any-to-Any

Meituan Releases LongCat-Next 'Any-to-Any' AI Model

The Chinese tech company has released the weights for a unified model that can process and generate combinations of text, images, audio, and video.

Mar 25, 2026

GAIR/Image → Video

GAIR Releases daVinci-MagiHuman for Video Generation

The new open-source model from the Generative Artificial Intelligence Research (GAIR) lab can create video clips complete with audio from a variety of inputs.

Mar 21, 2026

Image → Video Any-to-Any

Google DeepMind/Any-to-Any (Major release)

Google Releases Compact Gemma 4 E2B Multimodal Model

The new 2-billion-parameter model from Google DeepMind brings efficient image-and-text understanding to the open-source Gemma family.

Mar 2, 2026

Google DeepMind/Any-to-Any (Major release)

Google's Gemma 4 Arrives with Any-to-Any Multimodal Skills

The new 2-billion-parameter model from DeepMind can process text, vision, and audio, making it a versatile and efficient foundation for developers.

Mar 2, 2026

Google DeepMind/Any-to-Any

Google Releases Gemma 4 E4B, a 4B Multimodal Model

The new 4-billion-parameter vision-language model brings image and text understanding to Google's popular open-source family.

Mar 2, 2026

Google DeepMind/Any-to-Any (Major release)

Google's Gemma 4 Debuts with Any-to-Any Multimodality

The new 4-billion-parameter model from Google DeepMind is designed for versatile input and output, handling text, images, and other data types.

Mar 2, 2026

inclusionAI/Any-to-Any

inclusionAI's Ming 2.0 Tackles Any-to-Any Multimodality

The new open-source Mixture-of-Experts model can process and generate content across text, images, and audio in any combination.

Feb 10, 2026

OpenBMB/Any-to-Any

OpenBMB Releases 'Any-to-Any' Multimodal Model

The new MiniCPM-o 4.5 model from the open-source research group can process and generate interleaved combinations of images, text, and audio.

Feb 3, 2026

OpenBMB/Any-to-Any

MiniCPM-o 4.5 Offers 'Any-to-Any' Multimodal AI

The new model from OpenBMB supports mixed-modality inputs and outputs, from text and images to audio and video, in a single efficient package.

Feb 2, 2026

OpenMOSS/Any-to-Any

OpenMOSS Releases MOVA, a 720p Multimodal Video Generator

The new open model can generate high-definition video with synchronized audio from a flexible combination of text and image prompts.

Jan 28, 2026

Image → Video Any-to-Any

Qwen · Alibaba/Any-to-Any

Qwen's Fun-Audio-Chat: An Open Speech-to-Speech LLM

The 8-billion-parameter model from Alibaba's Qwen team understands and generates spoken responses, enabling more natural audio-first applications.

Dec 23, 2025

FlashLabs/Any-to-Any

FlashLabs Releases Chroma-4B, an Any-to-Any Model

The new 4-billion-parameter model handles text, image, and speech inputs and outputs, including direct speech-to-speech translation.

Nov 28, 2025

BAAI/Any-to-Any

BAAI Releases Emu3.5, an 'Any-to-Any' Multimodal Model

The new open-source model from the Beijing Academy of Artificial Intelligence (BAAI) unifies text and image understanding and generation into a single architecture.

Oct 31, 2025

Any-to-Any Text → Image

Meituan/Any-to-Any

Meituan Debuts LongCat-Flash-Omni, an Any-to-Any AI Model

The new open-source Mixture-of-Experts model can process and generate any combination of text, images, video, audio, and 3D data.

Oct 23, 2025

Kuaishou/Any-to-Any

Kling Releases UniVideo for Generation and Understanding

The new open-source model combines both video generation and comprehension, a rare dual capability built on the Qwen2.5 vision-language foundation.

Oct 18, 2025

Any-to-Any Text → Video

inclusionAI/Any-to-Any

inclusionAI Debuts 'Any-to-Any' Multimodal MoE Model

The new Ming-flash-omni-Preview aims to handle any combination of data modalities using an efficient Mixture-of-Experts architecture.

Oct 14, 2025

inclusionAI/Any-to-Any

inclusionAI Releases Ming-UniVision MoE Multimodal Model

The new 16-billion-parameter model uses a sparse Mixture-of-Experts design to efficiently handle 'any-to-any' data combinations, from text to images.

Sep 30, 2025

Qwen · Alibaba/Vision-Language (Major release)

Qwen Releases 30B MoE Vision Model, Qwen3-VL

The new open-source model from Alibaba uses a Mixture-of-Experts architecture to make its powerful vision-language capabilities more efficient to run.

Sep 30, 2025

inclusionAI/Any-to-Any

Ming-UniAudio Brings MoE to Unified Audio AI

A new 16-billion-parameter model from inclusionAI uses a Mixture-of-Experts architecture to handle a wide range of audio tasks efficiently.

Sep 29, 2025

Qwen · Alibaba/Any-to-Any (Major release)

Qwen3-Omni Arrives With Any-to-Any Multimodality

The new 30B Mixture-of-Experts model from Alibaba's Qwen team can process and generate content across text, image, and audio formats.

Sep 20, 2025

Xiaomi/Any-to-Any

Xiaomi's MiMo-Audio 7B Tackles Complex Speech Tasks

This new instruction-tuned model from Xiaomi can handle a flexible combination of audio and text inputs and outputs, from transcription to voice synthesis.

Sep 18, 2025

Qwen · Alibaba/Any-to-Any

Qwen Releases 'Thinking' Multimodal MoE Model

The new 30-billion-parameter Mixture-of-Experts model from Alibaba's Qwen team is designed to show its reasoning process for complex multimodal tasks.

Sep 15, 2025

Qwen · Alibaba/Any-to-Any

Qwen Releases 30B Model for Audio Captioning

The new Mixture-of-Experts model from Alibaba is fine-tuned to generate detailed, multilingual descriptions for complex audio content.

Sep 15, 2025

Any-to-Any Text → Speech

Alpha-VLLM/Any-to-Any

Lumina-DiMOO: A Diffusion Model for Any-to-Any AI

This new open-source model uses a discrete diffusion approach instead of typical autoregressive decoding to generate and understand a mix of media types.

Sep 9, 2025

Any-to-Any Text → Image

StepFun/Any-to-Any

StepFun Releases Step-Audio 2 mini, a Unified Audio AI

The new open-source model handles both speech recognition and audio generation in a single, end-to-end architecture.

Aug 28, 2025

NexaAI/Any-to-Any

NexaAI Releases OmniNeural-4B for On-Device AI

The new 4-billion-parameter model is designed for 'any-to-any' multimodal tasks and optimized to run efficiently on mobile hardware.

Aug 15, 2025

Skywork/Any-to-Any

Skywork Releases UniPic, a Unified 1.5B Vision Model

The new autoregressive model from the Chinese AI lab can understand, generate, and edit images within a single, compact framework.

Jul 29, 2025

inclusionAI/Any-to-Any

Ming-Lite-Omni 1.5 Brings Any-to-Any Modality to Open Source

The new MIT-licensed model from inclusionAI can process and generate a mix of text, images, audio, and video.

Jul 15, 2025

ByteDance/Any-to-Any

ByteDance Releases Tar-7B for 'Any-to-Any' Multimodality

The new 7-billion-parameter model from the company's SEED team can process and generate a mix of text, images, audio, and video in a single unified framework.

Jul 2, 2025

AIDC-AI/Any-to-Any

Ovis-U1-3B Unifies Image Understanding and Generation

The new 3-billion-parameter model from AIDC-AI combines vision-language understanding and image generation into a single 'any-to-any' framework.

Jun 28, 2025

Any-to-Any Text → Image

FreedomIntelligence/Any-to-Any

Janus-4o-7B Adds Image Generation to 7B Multimodal AI

The new 7-billion-parameter model from FreedomIntelligence can process various inputs and generate or edit images based on text prompts.

Jun 23, 2025