BAAI/Any-to-Any

BAAI Releases Emu3.5, an 'Any-to-Any' Multimodal Model

The new open-source model from the Beijing Academy of Artificial Intelligence unifies text and image understanding and generation in a single architecture.

Oct 31, 2025

Notable / Apache 2.0

The Beijing Academy of Artificial Intelligence (BAAI) has released Emu3.5, a new open-source multimodal model. Available under the permissive Apache 2.0 license, Emu3.5 is designed as a native "any-to-any" system that can both understand and generate interleaved text and images within a single, unified framework.

A Unified Architecture

Unlike systems that chain separate, specialized models for different tasks (e.g., one for captioning, another for image generation), Emu3.5 aims to handle diverse combinations of inputs and outputs natively. The model can accept prompts containing both text and images to generate responses that are also a mix of text and new images. This approach moves beyond simple text-to-image or image-to-text capabilities toward more fluid, conversational interactions across modalities.

This unified design represents a significant step toward more integrated and capable AI systems. By handling complex, multimodal instructions within one architecture, models like Emu3.5 could power more sophisticated applications in creative tools, data analysis, and robotics. Researchers and developers can explore the model and its capabilities on its official Hugging Face repository.

Sources

BAAI/Emu3.5
Hugging Face

