
Meituan Releases LongCat-Next 'Any-to-Any' AI Model

The Chinese tech company has released the weights for a unified model that can process and generate combinations of text, images, audio, and video.

Mar 25, 2026


Chinese technology company Meituan has released the weights for LongCat-Next, an 'any-to-any' multimodal model, under the permissive MIT license. The release is a step towards more general AI systems that work across many data types rather than a fixed few.

Most multimodal models handle specific input-output pairs, such as text-to-image or image-to-text. LongCat-Next is instead designed to accept any mix of text, images, audio, and video as input and to generate any combination of those modalities as output. For example, it could take an image and an audio clip as prompts and produce a descriptive paragraph and a short video in response.
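To give a sense of what "any combination" implies, the short sketch below counts the input-output pairings available across the four modalities. It assumes each request uses at least one input modality and at least one output modality, and it treats a modality as simply present or absent. Neither assumption comes from Meituan's release.

```python
from itertools import combinations

MODALITIES = ("text", "image", "audio", "video")


def non_empty_subsets(items):
    """Return every non-empty combination of the given items."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Assumption: inputs and outputs are each a non-empty set of modalities.
input_mixes = non_empty_subsets(MODALITIES)
output_mixes = non_empty_subsets(MODALITIES)
pairings = [(inp, out) for inp in input_mixes for out in output_mixes]

print(len(input_mixes))  # 15 possible input mixes
print(len(pairings))     # 225 input-output pairings
```

The image-plus-audio to text-plus-video example above is one of those 225 pairings, whereas a text-to-image model covers exactly one.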

A Unified Architecture

The model achieves this versatility through a unified framework. Instead of stitching together separate, specialized encoders and decoders for each data type, LongCat-Next uses a single network trained end to end. This architecture relies on a shared vocabulary to represent and process information from different sources, which lets it generate coherent multimodal content from complex prompts.
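The article does not say how LongCat-Next builds its shared vocabulary. As a rough illustration of the general idea, the sketch below gives each modality its own contiguous block of ids inside one token space, so a single sequence can interleave tokens from different sources. The codebook sizes, the names `CODEBOOK_SIZES`, `to_shared_id` and `from_shared_id`, and the block layout are all hypothetical and are not taken from the model.

```python
from itertools import accumulate

# Hypothetical per-modality codebook sizes; NOT taken from LongCat-Next.
CODEBOOK_SIZES = {"text": 32_000, "image": 8_192, "audio": 4_096, "video": 8_192}

# Give each modality a contiguous block of ids in one shared vocabulary.
_starts = [0, *accumulate(CODEBOOK_SIZES.values())]
OFFSETS = dict(zip(CODEBOOK_SIZES, _starts))
VOCAB_SIZE = _starts[-1]


def to_shared_id(modality, local_id):
    """Map a modality-local token id into the shared id space."""
    if not 0 <= local_id < CODEBOOK_SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def from_shared_id(shared_id):
    """Recover (modality, local_id) from a shared token id."""
    for modality, start in OFFSETS.items():
        if start <= shared_id < start + CODEBOOK_SIZES[modality]:
            return modality, shared_id - start
    raise ValueError(f"{shared_id} is outside the shared vocabulary")


# One interleaved sequence mixing text, image and audio tokens.
sequence = [to_shared_id(m, i) for m, i in [("text", 17), ("image", 5), ("audio", 0)]]
print(sequence)                     # [17, 32005, 40192]
print(from_shared_id(sequence[1]))  # ('image', 5)
```

With a layout like this, one network can predict the next id regardless of modality, and a decoder for the matching modality turns each block of ids back into pixels, audio or text.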

With the weights available on the Hugging Face Hub under the MIT license, researchers and developers can experiment with LongCat-Next directly. Likely uses include creative content generation, data synthesis, and complex reasoning tasks that span multiple domains.
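For readers who want to inspect the release before committing to a full download, here is a minimal sketch using `snapshot_download` from the `huggingface_hub` library. The repository name comes from the Sources list. The helper name `fetch_repo_metadata`, the local directory, and the assumption that the repository contains `.json` and `.md` files are illustrative only.

```python
REPO_ID = "meituan-longcat/LongCat-Next"  # repository name from the Sources list


def fetch_repo_metadata(local_dir="longcat-next"):
    """Download only small config and documentation files, not the weights.

    Assumes the repository holds *.json and *.md files; adjust the patterns
    after checking the file listing on the Hub.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=["*.json", "*.md"],
    )
```

Calling `fetch_repo_metadata()` needs network access and returns the local folder path. Dropping `allow_patterns` fetches the full repository, including the weight files.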

Sources

meituan-longcat/LongCat-Next (Hugging Face)

