Google DeepMind/Any-to-Any

Google's Gemma 4 Arrives with Any-to-Any Multimodal Skills

The new 2-billion-parameter model from DeepMind can process text, vision, and audio, making it a versatile and efficient foundation for developers.

Mar 2, 2026

Major release/Gemma

Google DeepMind has released Gemma 4 E2B IT, the first model in a new generation of its open-weights AI family. The compact, 2-billion-parameter model is distinguished by native "any-to-any" multimodality: it can process and generate combinations of text, vision, and audio.

The "E2B" in the name likely stands for "Efficient 2 Billion," highlighting the model's focus on performance within a small footprint. As an instruction-tuned (-it) variant, it's optimized for chat and direct-response tasks and features a context window of 8192 tokens.
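As a rough illustration of the naming scheme and the context limit described above, the sketch below splits the repository id into its parts and computes how many tokens remain for generation after a prompt. The regular expression, the `ModelName` fields, and the helper names are hypothetical and written only for this example; the 8192-token figure is the one reported in this article, not a value read from the model's configuration.

```python
import re
from dataclasses import dataclass

# Context window as reported in the article (assumption: not read from the model config).
CONTEXT_WINDOW = 8192


@dataclass(frozen=True)
class ModelName:
    """Hypothetical breakdown of a repo id such as 'google/gemma-4-E2B-it'."""
    org: str
    family: str
    generation: int
    size_tag: str
    variant: str


def parse_repo_id(repo_id: str) -> ModelName:
    """Split '<org>/<family>-<generation>-<size>-<variant>' into its parts.

    The pattern is an assumption based on the single id cited in the article.
    """
    match = re.fullmatch(
        r"([^/]+)/([a-z]+)-(\d+)-([A-Za-z0-9]+)-([a-z]+)", repo_id
    )
    if match is None:
        raise ValueError(f"unrecognized repo id: {repo_id!r}")
    org, family, generation, size_tag, variant = match.groups()
    return ModelName(org, family, int(generation), size_tag, variant)


def max_new_tokens(prompt_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for generation once the prompt is counted (never negative)."""
    return max(0, context_window - prompt_tokens)


name = parse_repo_id("google/gemma-4-E2B-it")
print(name.size_tag, name.variant)  # size tag and variant taken from the id
print(max_new_tokens(8000))         # remaining budget for an 8000-token prompt
```

A prompt longer than the window leaves a budget of zero, which is a signal to truncate or summarize the input before sending it to the model.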

Why It Matters

Gemma 4's release marks a significant step for developers seeking powerful AI that can run efficiently on local hardware or in resource-constrained cloud environments. By integrating text, vision, and audio capabilities into a single, compact model, Google is making sophisticated, multi-sensory AI more accessible and lowering the barrier for building complex applications.

The model is available now for researchers and developers on Hugging Face. It is released under the Gemma license terms, which permit commercial use but include specific restrictions, continuing Google's strategy of providing capable but controlled open-weights tools.

Sources

google/gemma-4-E2B-it
Hugging Face

