The Open Weights

The daily record of open-source AI. New model releases, leaderboards, and what's coming next — written for people who ship.

Refreshed every 12 hours

© 2026 The Open Weights. An independent publication.

Aggregated by Claude · written with Gemini · curated by humans.

Category · vision

Latest Vision-Language models

The newest open-source Vision-Language releases, from across the ecosystem.

34 releases

Moonshot AI/Code · Major release

Moonshot AI Releases Kimi, a Multimodal Coding Model

The new Mixture-of-Experts model from the Chinese AI company can generate code while also understanding visual inputs, a rare combination in open models.

Jun 11, 2026
Code · Text / LLM
Kimi-K2.7-Code
Google DeepMind/Text / LLM

Google Releases Open-Source DiffusionGemma 26B Model

The new 26B-parameter model from DeepMind uses a diffusion-based architecture, a technique more common in image generation, to produce text.

Jun 9, 2026
Text / LLM · Vision-Language
DiffusionGemma 26B-A4B Instruct
MiniMax/Vision-Language · Major release

MiniMax Releases M3, a Multimodal MoE Model

The new open-weight model from MiniMax AI combines vision, coding, and reasoning using a Mixture-of-Experts architecture.

Jun 2, 2026
Code · Any-to-Any
MiniMax-M3
Google DeepMind/Any-to-Any · Major release

Google Releases Gemma 4, a 12B 'Any-to-Any' Model

The new 12-billion-parameter model from Google DeepMind is designed to handle a flexible mix of data types, moving beyond traditional text and image inputs.

May 23, 2026
Text / LLM · Any-to-Any
Gemma 4 12B
NVIDIA/Any-to-Any

NVIDIA Releases Efficient Nemotron-3 Multimodal MoE

The new 30-billion-parameter Mixture-of-Experts model handles text and images while using only 3 billion active parameters for inference.

Apr 24, 2026
Any-to-Any · Reasoning
Nemotron-3 Nano Omni 30B-A3B Reasoning
Google DeepMind/Any-to-Any

Google Releases Gemma 4 Multimodal Open Model

The new 26-billion-parameter model from DeepMind uses a mixture-of-experts design for greater efficiency and is tuned for assistant-style tasks.

Apr 23, 2026
Text / LLM · Any-to-Any
Gemma 4 26B-A4B Instruct (MoE)
Google DeepMind/Any-to-Any · Major release

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is an instruction-tuned, 'any-to-any' powerhouse released under a permissive Apache 2.0 license.

Apr 23, 2026
Text / LLM · Any-to-Any
Gemma 4 31B
Qwen · Alibaba/Vision-Language

Alibaba's Qwen Releases Open 27B Vision Model

The new dense model, licensed under Apache 2.0, brings both text and image understanding to the midrange parameter space.

Apr 21, 2026
Text / LLM · Vision-Language
Qwen3.6-27B
Qwen · Alibaba/Vision-Language · Major release

Qwen Releases 35B Multimodal Mixture-of-Experts Model

The new Qwen3.6-35B-A3B from Alibaba's Qwen team combines vision and language capabilities using an efficient sparse architecture.

Apr 15, 2026
Text / LLM · Reasoning
Qwen3.6-35B-A3B
Moonshot AI/Vision-Language · Major release

Moonshot AI Releases Kimi-K2.6 Multimodal Model

The Chinese AI lab has published weights for its new vision-language model, though a restrictive license limits its use to research applications.

Apr 14, 2026
Text / LLM · Vision-Language
Kimi-K2.6
OpenBMB/Vision-Language

OpenBMB Releases MiniCPM-V for On-Device Vision

The new open-source vision-language model is designed for high-resolution image understanding on mobile and edge devices.

Apr 13, 2026
Vision-Language
MiniCPM-V-4.6
Baidu/Vision-Language

Baidu Releases Qianfan-OCR for Document Intelligence

The new vision-language model from the Chinese tech giant is designed for complex, multilingual optical character recognition and layout analysis.

Mar 18, 2026
Vision-Language
Qianfan-OCR
Google DeepMind/Any-to-Any · Major release

Google Releases Gemma 4, a 26B Vision-Language Model

The new open-source model from DeepMind uses a Mixture-of-Experts architecture to handle both text and image inputs efficiently.

Mar 11, 2026
Text / LLM · Vision-Language
Gemma 4 26B
Google DeepMind/Any-to-Any · Major release

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is instruction-tuned and can process both text and images, marking a significant expansion for the Gemma family.

Mar 11, 2026
Text / LLM · Vision-Language
Gemma 4 31B
NVIDIA/Vision-Language

NVIDIA's New 3B VLM Pinpoints Objects in Images

The new 3-billion-parameter model, based on the company's Eagle architecture, is designed for high-precision visual grounding tasks.

Mar 2, 2026
Vision-Language
LocateAnything-3B
Google DeepMind/Any-to-Any · Major release

Google's Gemma 4 Debuts with Any-to-Any Multimodality

The new 4-billion-parameter model from Google DeepMind is designed for versatile input and output, handling text, images, and other data types.

Mar 2, 2026
Text / LLM · Any-to-Any
Gemma 4 E4B
Zhipu AI/Vision-Language

Zhipu AI Releases Multilingual GLM-OCR Vision Model

The new vision-language model from the creators of the GLM series is specialized for recognizing and extracting text from images across multiple languages.

Jan 30, 2026
Vision-Language
GLM-OCR
Moonshot AI/Vision-Language · Major release

Moonshot AI Releases Kimi K2.5 Multimodal Model

The new vision-language model from the Chinese AI firm uses a Mixture-of-Experts architecture and is now available on Hugging Face.

Jan 1, 2026
Text / LLM · Reasoning
Kimi K2.5
BAAI/Any-to-Any

BAAI Releases Emu3.5, an 'Any-to-Any' Multimodal Model

The new open-source model from BAAI (Beijing Academy of Artificial Intelligence) unifies text and image understanding and generation into a single architecture.

Oct 31, 2025
Any-to-Any · Text → Image
Emu3.5
Datalab/Vision-Language

Datalab Releases Chandra, a New OCR Vision Model

The new vision-language model from Datalab is fine-tuned from Qwen2-VL to specialize in extracting text and structure from complex documents.

Oct 21, 2025
Vision-Language
Chandra OCR
DeepSeek/Vision-Language · Major release

DeepSeek-OCR Tackles Document Parsing with Vision AI

The new vision-language model uses a novel context compression technique to efficiently extract text and structure from complex documents.

Oct 17, 2025
Vision-Language
DeepSeek-OCR
Baidu/Vision-Language

Baidu Releases PaddleOCR-VL for Document AI

The new vision-language model is fine-tuned to understand not just text, but the complex structure of tables, charts, and formulas.

Oct 16, 2025
Vision-Language
PaddleOCR-VL
Qwen · Alibaba/Vision-Language

Alibaba Releases Qwen3-VL, an 8B Open-Source Vision Model

The latest vision-language model from the popular Qwen series is instruction-tuned and available under an Apache 2.0 license.

Oct 11, 2025
Vision-Language
Qwen3-VL-8B-Instruct
inclusionAI/Any-to-Any

inclusionAI Releases Ming-UniVision MoE Multimodal Model

The new 16-billion-parameter model uses a sparse Mixture-of-Experts design to efficiently handle 'any-to-any' data combinations, from text to images.

Sep 30, 2025
Any-to-Any · Vision-Language
Ming-UniVision-16B-A3B
Qwen · Alibaba/Vision-Language · Major release

Qwen Releases 30B MoE Vision Model, Qwen3-VL

The new open-source model from Alibaba uses a Mixture-of-Experts architecture to make its powerful vision-language capabilities more efficient to run.

Sep 30, 2025
Any-to-Any · Vision-Language
Qwen3-VL-8B-Instruct
Qwen · Alibaba/Any-to-Any · Major release

Qwen3-Omni Arrives With Any-to-Any Multimodality

The new 30B Mixture-of-Experts model from Alibaba's Qwen team can process and generate content across text, image, and audio formats.

Sep 20, 2025
Speech → Text · Any-to-Any
Qwen3-Omni-30B-A3B-Instruct
Qwen · Alibaba/Any-to-Any

Qwen Releases 'Thinking' Multimodal MoE Model

The new 30-billion-parameter Mixture-of-Experts model from Alibaba's Qwen team is designed to show its reasoning process for complex multimodal tasks.

Sep 15, 2025
Any-to-Any · Reasoning
Qwen3-Omni-30B-A3B-Thinking
Qwen · Alibaba/Any-to-Any

Qwen Releases 30B Model for Audio Captioning

The new Mixture-of-Experts model from Alibaba is fine-tuned to generate detailed, multilingual descriptions for complex audio content.

Sep 15, 2025
Any-to-Any · Text → Speech
Qwen3-Omni-30B-A3B-Captioner
moondream/Vision-Language

Moondream 3 Arrives in Preview Release

The next generation of the efficient, open-source vision-language model is now available for early testing and feedback.

Sep 11, 2025
Vision-Language
Moondream 3 (preview)
OpenBMB/Vision-Language

OpenBMB Releases Compact Multimodal Model MiniCPM-V 4.5

The new vision-language model from the open-source research group demonstrates strong OCR and video understanding capabilities in a small package.

Aug 24, 2025
Vision-Language
MiniCPM-V 4.5
Zhipu AI/Vision-Language · Major release

Zhipu AI Releases Open Vision Model GLM-4.5V

The new Mixture-of-Experts model offers strong multimodal reasoning capabilities under a permissive MIT license.

Aug 10, 2025
Reasoning · Vision-Language
GLM-4.5V
rednote-hilab/Vision-Language

New VLM `dots.ocr` Takes on Complex Documents

The new 3B-parameter model from rednote-hilab uses a vision-language approach to parse tables, layouts, and even mathematical formulas.

Jul 30, 2025
Vision-Language
dots.ocr
Zhipu AI/Vision-Language

Zhipu AI Open-Sources 9B Vision Model with 'Thinking' Mode

The new GLM-4.1V-9B-Thinking model makes its vision and chain-of-thought reasoning capabilities available under a permissive MIT license.

Jun 28, 2025
Reasoning · Vision-Language
GLM-4.1V-9B-Thinking
AIDC-AI/Any-to-Any

Ovis-U1-3B Unifies Image Understanding and Generation

The new 3-billion-parameter model from AIDC-AI combines vision-language understanding and image generation into a single 'any-to-any' framework.

Jun 28, 2025
Any-to-Any · Text → Image
Ovis-U1-3B