Google DeepMind/Any-to-Any

Google Releases Gemma 4, a 12B 'Any-to-Any' Model

The new 12-billion-parameter model from Google DeepMind is designed to handle a flexible mix of data types, moving beyond traditional text and image inputs.

May 23, 2026

Major release/Gemma

Google DeepMind has expanded its open-weights portfolio with the release of Gemma 4 12B Instruct, a new 12-billion-parameter model. Its defining feature is a unified 'any-to-any' multimodal architecture: a single model designed to handle diverse data types as both inputs and outputs.

Unlike traditional models, which are often limited to a specific input-output pairing such as text-to-image, Gemma 4 is built for more flexible, generalized reasoning. According to the release details on Hugging Face, its 'any-to-any' design allows it to process combinations of modalities at once, a step toward more capable AI systems.

Why It Matters

Gemma 4 brings a sophisticated architecture, previously seen in much larger, closed models, to an open-weights release. By packaging these capabilities into a relatively efficient 12B-parameter model, Google lets a wider range of researchers and developers experiment with advanced multimodal applications at lower computational cost.

The model is available under the custom Gemma license, which includes specific terms for usage and distribution. As an instruction-tuned variant, Gemma 4 12B is optimized for direct use in conversational and task-oriented applications.

Sources

google/gemma-4-12B-it (Hugging Face)
google-deepmind/gemma v4.0.0 (GitHub)

