Google Releases 2B Multimodal Gemma 4 Assistant Model
The new compact model from DeepMind is instruction-tuned for "any-to-any" tasks, capable of processing and generating mixed data types.
Google DeepMind has released a new addition to its open-source Gemma family: a 2-billion-parameter model designed for multimodal assistant tasks. Dubbed "Gemma 4 E2B-it Assistant," the model is notably compact, aiming to bring sophisticated capabilities to a wider range of hardware.
This release is an instruction-tuned variant, meaning it has been fine-tuned to follow user commands and hold conversational interactions. Its key feature is an "any-to-any" architecture, which lets it process and generate a mix of data types beyond just text, a significant capability for a model of its size.
Compact Multimodality
The model's combination of a small parameter count and advanced multimodal features makes it particularly interesting. While larger models have long handled mixed inputs, a capable 2B model opens up new possibilities for developers building applications for edge devices, specialized agents, or scenarios where computational resources are constrained.
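To put "constrained" in rough numbers, the following sketch estimates how much memory the weights alone would occupy at common precisions. It assumes exactly 2 billion parameters, taken from the article's headline figure, and ignores activations, caches, and runtime overhead, so real-world usage will be higher. The function name and the precision table are illustrative, not part of any Gemma tooling.

```python
# Rough weight-only memory estimate for a model with a given parameter count.
# Assumption: exactly 2e9 parameters, based on the article's "2-billion" figure.
# Activations, caches, and framework overhead are deliberately ignored.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(2e9, dtype):.2f} GiB")
```

Under these assumptions, 16-bit weights come to a little under 4 GiB and 4-bit weights to under 1 GiB, which is the range where phones and small single-board computers become plausible targets.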
The Gemma 4 E2B-it Assistant is released under the permissive Apache 2.0 license, which allows both research and commercial use. The model is available now on Hugging Face.
Sources
- google/gemma-4-E2B-it-assistant (Hugging Face)