Google DeepMindAny-to-Any

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is instruction-tuned and can process both text and images, marking a significant expansion for the Gemma family.

Mar 11, 2026

Major releaseGemma

Google DeepMind has expanded its open-weights lineup with the release of Gemma 4 31B IT, a powerful new multimodal model. This 31-billion-parameter release marks a significant milestone for the Gemma family, introducing vision-language capabilities at a substantial scale.

As a Vision Language Model (VLM), Gemma 4 31B can interpret and process information from both text and images simultaneously. This allows it to handle tasks that were previously out of reach for its text-only predecessors, such as describing the contents of a photograph or answering questions based on visual information.

The "IT" in the model's name signifies that it is instruction-tuned, meaning it has been specifically optimized to follow user prompts and perform conversational tasks. This fine-tuning makes it more reliable and useful for building interactive applications.

The introduction of a capable VLM under the Gemma name advances Google's open-source AI strategy, providing developers with a strong foundation for building sophisticated multimodal applications. The model is available under the Gemma license and can be accessed by researchers and developers on its Hugging Face repository.

Sources

google/gemma-4-31B-it
Hugging Face
Visit

0 comments

No comments yet. Be the first to weigh in.

Thinking Machines Debuts Inkling Small, a Compact Multimodal MoE

The Apache-2.0 model brings mixture-of-experts efficiency to image, audio, and text tasks in a smaller footprint.

Jul 27, 2026

KRAFTON/Any-to-Any

KRAFTON releases A.X-K2 Raon speech MoE model

The game maker's new open model blends text-to-speech and speech recognition in a single 21B mixture-of-experts system with just 3B active parameters.

Jul 27, 2026

Microsoft/Vision-Language

Microsoft's Mage-VL Streams Video Natively

A codec-native multimodal foundation model aims to understand live video and vision-language input in real time.

Jul 26, 2026

Google DeepMindAny-to-Any

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is instruction-tuned and can process both text and images, marking a significant expansion for the Gemma family.

Mar 11, 2026

Major releaseGemma

Sources

google/gemma-4-31B-it
Hugging Face
Visit

0 comments

No comments yet. Be the first to weigh in.

Thinking Machines Debuts Inkling Small, a Compact Multimodal MoE

The Apache-2.0 model brings mixture-of-experts efficiency to image, audio, and text tasks in a smaller footprint.

Jul 27, 2026

KRAFTON/Any-to-Any

KRAFTON releases A.X-K2 Raon speech MoE model

The game maker's new open model blends text-to-speech and speech recognition in a single 21B mixture-of-experts system with just 3B active parameters.

Jul 27, 2026

Microsoft/Vision-Language

Microsoft's Mage-VL Streams Video Natively

A codec-native multimodal foundation model aims to understand live video and vision-language input in real time.

Jul 26, 2026