Google DeepMind/Any-to-Any

Google Releases Gemma 4 12B Multimodal Model

The new 12-billion-parameter open model from DeepMind introduces a unified 'any-to-any' architecture for advanced multimodal tasks.

May 23, 2026

Major release/Apache 2.0

Google DeepMind has released Gemma 4 12B, a new generation of its open model family. The 12-billion-parameter model is available under the permissive Apache 2.0 license, continuing Google's practice of releasing openly licensed models to the AI community.

Unlike many existing vision-language models, Gemma 4 is built on what Google calls a "unified any-to-any" architecture. This design aims to natively handle a wide variety of data modalities for both input and output, moving beyond the common text-and-image limitations of previous systems.

Why 'Any-to-Any' Matters

This architectural approach matters to developers because it simplifies building applications that must interpret and generate combinations of data types, such as text, images, and potentially other formats in the future. Instead of chaining together multiple specialized models, developers can use a single integrated system, which could enable more fluid and capable AI assistants, creative tools, and analysis engines.
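To make the contrast concrete, here is a minimal Python sketch of the two integration styles. Everything in it is a hypothetical stand-in: `Text`, `Image`, `caption_image`, `answer_text`, `chained_pipeline`, and `unified_model` are illustrative stubs and not part of Gemma 4 or any Google API. The stubs return placeholder strings, so the sketch shows only how the call structure differs, not how a real model behaves.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Text:
    content: str


@dataclass(frozen=True)
class Image:
    description: str  # stand-in for pixel data in this sketch


Part = Union[Text, Image]


# Chained approach: one specialized stub per task, glued together by hand.
def caption_image(image: Image) -> Text:
    """Hypothetical image-to-text model (stub)."""
    return Text(f"caption of {image.description}")


def answer_text(prompt: Text) -> Text:
    """Hypothetical text-only language model (stub)."""
    return Text(f"answer to [{prompt.content}]")


def chained_pipeline(parts: List[Part]) -> Text:
    # Every non-text input is first converted to text by its own model,
    # so any detail the captioner drops never reaches the language model.
    texts = [p if isinstance(p, Text) else caption_image(p) for p in parts]
    return answer_text(Text(" ".join(t.content for t in texts)))


# Unified approach: a single entry point accepts mixed parts directly.
def unified_model(parts: List[Part]) -> List[Part]:
    """Hypothetical any-to-any model (stub): mixed parts in, mixed parts out."""
    seen = ", ".join(type(p).__name__.lower() for p in parts)
    return [Text(f"response conditioned on: {seen}")]


request = [Text("What is in this picture?"), Image("a red bicycle")]
chained = chained_pipeline(request)
unified = unified_model(request)
```

In the chained version, the application code owns the routing and conversion logic for each modality. In the unified version, that logic collapses into one call whose input and output are both lists of mixed parts.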

By releasing a model with this multimodal design under an open license, Google gives open-source developers a new foundation to build on. Researchers and engineers can now experiment with and extend the architecture. The model is available now on Hugging Face.

Sources

google/gemma-4-12B (Hugging Face)


