Google DeepMindAny-to-Any

Google Releases Compact Gemma 4 E2B Multimodal Model

The new 2-billion-parameter model from Google DeepMind brings efficient image-and-text understanding to the open-source Gemma family.

Mar 2, 2026

Major releaseGemma

Google DeepMind has expanded its open-source offerings with the release of Gemma 4 E2B, a new model in the Gemma 4 family. This compact 2-billion-parameter model is designed for efficiency and introduces multimodal capabilities, allowing it to process both images and text to generate text-based responses.

Unlike previous text-only Gemma models, Gemma 4 E2B is a vision-language model (VLM). This means developers can provide it with an image alongside a text prompt to perform tasks like visual question answering, image description, and other forms of visual reasoning. The "E2B" in its name likely signifies its focus on being an "Efficient 2 Billion" parameter model.

Why It Matters

The release of a small, capable VLM like Gemma 4 E2B is significant for making advanced AI more accessible. Its modest size makes it suitable for running on consumer hardware, in edge devices, or in cloud environments where computational cost is a concern. This democratizes access to multimodal technology that has often been restricted to much larger, more demanding models.

The model is available under the Gemma license, and developers can explore its architecture and weights on its official Hugging Face repository.

Sources

google/gemma-4-E2B
Hugging Face
Visit

0 comments

No comments yet. Be the first to weigh in.

Thinking Machines Debuts Inkling Small, a Compact Multimodal MoE

The Apache-2.0 model brings mixture-of-experts efficiency to image, audio, and text tasks in a smaller footprint.

Jul 27, 2026

KRAFTON/Any-to-Any

KRAFTON releases A.X-K2 Raon speech MoE model

The game maker's new open model blends text-to-speech and speech recognition in a single 21B mixture-of-experts system with just 3B active parameters.

Jul 27, 2026

Microsoft/Vision-Language

Microsoft's Mage-VL Streams Video Natively

A codec-native multimodal foundation model aims to understand live video and vision-language input in real time.

Jul 26, 2026

Why It Matters

The model is available under the Gemma license, and developers can explore its architecture and weights on its official Hugging Face repository.