Google Releases Gemma 4, a 26B Vision-Language Model

The new open-source model from DeepMind uses a Mixture-of-Experts architecture to handle both text and image inputs efficiently.

Mar 11, 2026

Major release · Apache 2.0

Google DeepMind has expanded its open-source offerings with the release of Gemma 4 26B Instruct, a new vision-language model. Published under a permissive Apache 2.0 license, this model is designed to understand and process both text and images, making it a versatile tool for multimodal applications.

An Efficient Multimodal Architecture

The key innovation in Gemma 4 is its Mixture-of-Experts (MoE) architecture. While the model contains a total of 26 billion parameters, it is designed for efficiency: only a fraction of them are activated for any given input. The "A4B" in the model's repository name, gemma-4-26B-A4B-it, suggests that approximately 4 billion parameters (roughly 15% of the total) are active at a time, offering strong performance without the full computational cost of a dense 26B model.
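
To make the idea of "active" parameters concrete, here is a minimal sketch of top-k expert routing, the mechanism commonly used in MoE layers. It is not Gemma 4's actual implementation: the expert count (8), the number of experts chosen per token (2), and the function names are hypothetical values picked for illustration. Only the 26B and 4B figures come from the article.

```python
import math
import random


def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def active_fraction(total_params, active_params):
    """Share of parameters that take part in a single forward step."""
    return active_params / total_params


# Hypothetical toy configuration, NOT taken from the Gemma 4 model card.
NUM_EXPERTS = 8
TOP_K = 2

# Stand-in router scores for one token; a real router computes these
# from the token's hidden state.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_top_k(scores, TOP_K)

# Figures from the article: 26B total parameters, roughly 4B active.
fraction = active_fraction(26e9, 4e9)

print("experts used for this token:", [i for i, _ in chosen])
print(f"active share of parameters: {fraction:.1%}")
```

Only the selected experts run for a given token, which is why the compute cost tracks the active parameter count rather than the total, even though all 26B parameters still have to be stored.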

As an instruction-tuned model, Gemma 4 26B is optimized to follow user prompts and commands, making it suitable for a wide range of chat and assistant-style applications. Researchers and developers can access the model and its technical details on its Hugging Face repository.

This release signals Google's continued investment in the open-source AI ecosystem. Because the MoE design activates only part of the model at a time, it makes advanced vision-language capabilities more accessible to developers building applications that interpret images alongside text.

Sources

google/gemma-4-26B-A4B-it, Hugging Face
