Latest open-source Vision-Language models

Thinkingmachines/Vision-Language

Thinking Machines Debuts Inkling Small, a Compact Multimodal MoE

The Apache-2.0 model brings mixture-of-experts efficiency to image, audio, and text tasks in a smaller footprint.

Jul 27, 2026

Text / LLM Any-to-Any

Microsoft/Vision-Language

Microsoft's Mage-VL Streams Video Natively

A codec-native multimodal foundation model aims to understand live video and vision-language input in real time.

Jul 26, 2026

Any-to-Any Vision-Language

Swiss AI/Text / LLM

Apertus v1.5 70B arrives with an Apache-2.0 license

Switzerland's open-model effort ships a 70-billion-parameter, multilingual and multimodal system that anyone can use, modify, and deploy.

Jul 24, 2026

Text / LLM Vision-Language

Microsoft/Vision-Language

Microsoft's Fara1.5-27B targets computer-use agents

A 27B-parameter vision-language model built to drive browsers and desktop apps like a human operator.

Jul 17, 2026

Vision-Language

NVIDIA/Any-to-Any

NVIDIA's Audio-Visual Flamingo Fuses Sound and Sight

A fully open multimodal model aims to reason jointly across audio, images, and long-form video.

Jul 16, 2026

Any-to-Any Vision-Language

InternLM/Vision-Language

InternLM Previews 397B Vision-Language Model

The Intern-S2 preview arrives as a very large multimodal system under a permissive Apache-2.0 license.

Jul 16, 2026

Vision-Language

Thinkingmachines/Any-to-Any Major release

Thinking Machines Lab debuts Inkling, its first open model

The lab's inaugural open-weights release is a mixture-of-experts system that takes image and audio inputs, shipped under a permissive Apache 2.0 license.

Jul 15, 2026

Text / LLM Any-to-Any

OpenMOSS/Vision-Language

OpenMOSS Debuts MOSS-VL-Realtime for Live Video

The Chinese research group's new vision-language model targets streaming understanding of video and images rather than static frames.

Jul 14, 2026

Any-to-Any Vision-Language

ATH MaaS/Vision-Language

Alibaba's OvisOCR2 turns page images into Markdown

A compact 0.8B vision-language model aims to parse full documents—text, tables, and formulas—in a single pass.

Jul 13, 2026

Vision-Language

Google DeepMind/Any-to-Any Major release

Google DeepMind's Gemma 4 Goes Multimodal and MoE

The new open-weights family adds a mixture-of-experts design, encoder-free multimodal inputs, and an optional thinking mode.

Jul 1, 2026

Text / LLM Any-to-Any

Microsoft/Vision-Language

Microsoft previews GELab-Zero-4B, a compact GUI agent

The 4-billion-parameter vision-language model targets on-screen and mobile automation, built atop Qwen3-VL.

Jun 30, 2026

Vision-Language

SenseTime/Any-to-Any

SenseTime's SenseNova-Vision-7B-MoT Goes Any-to-Any

A single 7B model from SenseTime folds vision-language understanding, image generation, editing, and perception into one system.

Jun 29, 2026

Image Editing Any-to-Any

Baidu/Vision-Language

Baidu's PP-OCRv6 packs 50-language OCR into tiny models

The latest release of PaddlePaddle's optical character recognition suite spans models from 1.5M to 34.5M parameters under an Apache 2.0 license.

Jun 22, 2026

Vision-Language

Datalab To/Vision-Language

LIFT: A Qwen3.5-Based VLM for PDF-to-JSON Extraction

Datalab's new open vision-language model targets structured data extraction from documents, turning messy PDFs into clean JSON.

Jun 19, 2026

Vision-Language

Baidu/Vision-Language

Baidu releases Unlimited-OCR under permissive MIT license

The Chinese tech giant's multilingual vision-language model targets text extraction across languages and document types.

Jun 19, 2026

Vision-Language

Moonshot AI/Text / LLM Major release

Moonshot AI releases Kimi K3, a 2.8T-parameter MoE model

The open-weights multimodal model leans into coding and agentic tasks, extending Moonshot's Kimi line into a new scale bracket.

Jun 13, 2026

Text / LLM Reasoning

Moonshot AI/Code Major release

Moonshot AI Releases Kimi, a Multimodal Coding Model

The new Mixture-of-Experts model from the Chinese AI company can generate code while also understanding visual inputs, a rare combination in open models.

Jun 11, 2026

Code Text / LLM

Google DeepMind/Text / LLM

Google Releases Open-Source DiffusionGemma 26B Model

The new 26B-parameter model from DeepMind uses a diffusion-based architecture, a technique more common in image generation, to produce text.

Jun 9, 2026

Text / LLM Vision-Language

Baidu/Vision-Language

PaddleOCR's PP-OCRv6 Adds a Medium Detection Model

Baidu's open-source OCR toolkit ships an Apache-licensed text-line detector in safetensors format, tuned for a balance of accuracy and speed.

Jun 9, 2026

Vision-Language

MiniMax/Vision-Language Major release

MiniMax Releases M3, a Multimodal MoE Model

The new open-weight model from MiniMax AI combines vision, coding, and reasoning using a Mixture-of-Experts architecture.

Jun 2, 2026

Code Any-to-Any

Google DeepMind/Any-to-Any Major release

Google Releases Gemma 4 12B Multimodal Model

The new 12-billion-parameter open model from DeepMind introduces a unified 'any-to-any' architecture for advanced multimodal tasks.

May 23, 2026

Text / LLM Any-to-Any

Google DeepMind/Any-to-Any Major release

Google Releases Gemma 4, a 12B 'Any-to-Any' Model

The new 12-billion-parameter model from Google DeepMind is designed to handle a flexible mix of data types, moving beyond traditional text and image inputs.

May 23, 2026

Text / LLM Any-to-Any

NVIDIA/Any-to-Any

NVIDIA Releases Efficient Nemotron-3 Multimodal MoE

The new 30-billion-parameter Mixture-of-Experts model handles text and images while using only 3 billion active parameters for inference.

Apr 24, 2026

Any-to-Any Reasoning

Google DeepMind/Any-to-Any

Google Releases Gemma 4 Multimodal Open Model

The new 26-billion-parameter model from DeepMind uses a mixture-of-experts design for greater efficiency and is tuned for assistant-style tasks.

Apr 23, 2026

Text / LLM Any-to-Any

Google DeepMind/Any-to-Any Major release

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is an instruction-tuned, 'any-to-any' powerhouse released under a permissive Apache 2.0 license.

Apr 23, 2026

Text / LLM Any-to-Any

Qwen · Alibaba/Vision-Language

Alibaba's Qwen Releases Open 27B Vision Model

The new dense model, licensed under Apache 2.0, brings both text and image understanding to the midrange parameter space.

Apr 21, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language Major release

Qwen Releases 35B Multimodal Mixture-of-Experts Model

The new Qwen3.6-35B-A3B from Alibaba's Qwen team combines vision and language capabilities using an efficient sparse architecture.

Apr 15, 2026

Text / LLM Reasoning

Moonshot AI/Vision-Language Major release

Moonshot AI Releases Kimi-K2.6 Multimodal Model

The Chinese AI lab has published weights for its new vision-language model, though a restrictive license limits its use to research applications.

Apr 14, 2026

Text / LLM Vision-Language

OpenBMB/Vision-Language

OpenBMB Releases MiniCPM-V for On-Device Vision

The new open-source vision-language model is designed for high-resolution image understanding on mobile and edge devices.

Apr 13, 2026

Vision-Language

Tencent/Vision-Language

Tencent Releases 2B Vision Model for Robotics

The new HY-Embodied 0.5 is a vision-language model designed specifically for multi-object tracking in dynamic, real-world environments.

Apr 2, 2026

Vision-Language

Baidu/Vision-Language

Baidu Releases Qianfan-OCR for Document Intelligence

The new vision-language model from the Chinese tech giant is designed for complex, multilingual optical character recognition and layout analysis.

Mar 18, 2026

Vision-Language

Google DeepMind/Any-to-Any Major release

Google Releases Gemma 4, a 26B Vision-Language Model

The new open-source model from DeepMind uses a Mixture-of-Experts architecture to handle both text and image inputs efficiently.

Mar 11, 2026

Text / LLM Vision-Language

Google DeepMind/Any-to-Any Major release

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is instruction-tuned and can process both text and images, marking a significant expansion for the Gemma family.

Mar 11, 2026

Text / LLM Vision-Language

NVIDIA/Vision-Language

NVIDIA's New 3B VLM Pinpoints Objects in Images

The new 3-billion-parameter model, based on the company's Eagle architecture, is designed for high-precision visual grounding tasks.

Mar 2, 2026

Vision-Language

Google DeepMind/Any-to-Any Major release

Google Releases Compact Gemma 4 E2B Multimodal Model

The new 2-billion-parameter model from Google DeepMind brings efficient image-and-text understanding to the open-source Gemma family.

Mar 2, 2026

Text / LLM Any-to-Any

Google DeepMind/Any-to-Any Major release

Google's Gemma 4 Arrives with Any-to-Any Multimodal Skills

The new 2-billion-parameter model from DeepMind can process text, vision, and audio, making it a versatile and efficient foundation for developers.

Mar 2, 2026

Text / LLM Any-to-Any

Google DeepMind/Any-to-Any

Google Releases Gemma 4 E4B, a 4B Multimodal Model

The new 4-billion-parameter vision-language model brings image and text understanding to Google's popular open-source family.

Mar 2, 2026

Text / LLM Any-to-Any

Google DeepMind/Any-to-Any Major release

Google's Gemma 4 Debuts with Any-to-Any Multimodality

The new 4-billion-parameter model from Google DeepMind is designed for versatile input and output, handling text, images, and other data types.

Mar 2, 2026

Text / LLM Any-to-Any

Qwen · Alibaba/Vision-Language

Alibaba's Qwen Releases Compact 0.8B Vision Model

The new 800-million-parameter model is the smallest in the Qwen3.5 family, designed for efficient multimodal tasks on consumer-grade hardware.

Feb 28, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language

Alibaba's Qwen team releases 4B vision-language model

The new Qwen3.5-4B model combines text and image understanding in a compact, permissively licensed package for developers.

Feb 27, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language

Qwen Releases 9B Multimodal Model in New 3.5 Series

The new open-source vision-language model from Alibaba's Qwen team offers strong performance in a compact, Apache 2.0-licensed package.

Feb 27, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language Major release

Qwen Releases Flagship 122B Multimodal MoE Model

The new Qwen3.5-122B-A10B combines a massive parameter count with an efficient Mixture-of-Experts architecture for advanced vision and language tasks.

Feb 24, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language Major release

Qwen Releases 27B Vision Model with Long Context

The new model from Alibaba's Qwen team combines multimodal understanding with a 131K-token context window under a permissive Apache 2.0 license.

Feb 24, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language

Qwen Releases Efficient 35B Multimodal MoE Model

The new Qwen3.5-35B-A3B model from Alibaba combines vision and language capabilities with a resource-friendly Mixture-of-Experts design.

Feb 24, 2026

Text / LLM Vision-Language

Qwen · Alibaba/Vision-Language Major release

Qwen releases flagship 397B multimodal MoE

The new open-source model from Alibaba uses a Mixture-of-Experts architecture to balance massive scale with efficient inference.

Feb 16, 2026

Text / LLM Vision-Language

OpenBMB/Any-to-Any

OpenBMB Releases 'Any-to-Any' Multimodal Model

The new MiniCPM-o 4.5 model from the open-source research group can process and generate interleaved combinations of images, text, and audio.

Feb 3, 2026

Any-to-Any Vision-Language

OpenBMB/Any-to-Any

MiniCPM-o 4.5 Offers 'Any-to-Any' Multimodal AI

The new model from OpenBMB supports mixed-modality inputs and outputs, from text and images to audio and video, in a single efficient package.

Feb 2, 2026

Any-to-Any Vision-Language

Zhipu AI/Vision-Language

Zhipu AI Releases Multilingual GLM-OCR Vision Model

The new vision-language model from the creators of the GLM series is specialized for recognizing and extracting text from images across multiple languages.

Jan 30, 2026

Vision-Language

Baidu/Vision-Language

Baidu Releases Open VLM for Advanced Document OCR

The new PaddleOCR-VL model is built to parse not just text, but also the tables, formulas, and page layouts found in complex documents.

Jan 28, 2026

Vision-Language

DeepSeek/Vision-Language

DeepSeek-OCR-2 Tackles Multilingual Document AI

The new open vision-language model is designed to extract text and understand structure from complex, multilingual documents.

Jan 27, 2026

Vision-Language

LightOn/Vision-Language

LightOn Releases OCR-2, a 1B Document AI Model

The new vision model from the Paris-based AI lab uses the Mistral architecture to extract text and structure from complex documents like PDFs and forms.

Jan 16, 2026

Vision-Language

Google DeepMind/Vision-Language

Google's MedGemma brings open vision AI to medicine

The new 4-billion-parameter vision-language model is specialized for tasks in radiology, pathology, and complex clinical reasoning.

Jan 7, 2026

Reasoning Vision-Language

Moonshot AI/Vision-Language Major release

Moonshot AI Releases Kimi K2.5 Multimodal Model

The new vision-language model from the Chinese AI firm uses a Mixture-of-Experts architecture and is now available on Hugging Face.

Jan 1, 2026

Text / LLM Reasoning

Zhipu AI/Vision-Language

Zhipu AI Releases Fast, Open Vision Model GLM-4.6V-Flash

The new model from the GLM-4.6V family offers a fast, MIT-licensed option for developers working with both text and images.

Dec 7, 2025

Vision-Language

Tencent/Vision-Language

Tencent Releases 1B-Parameter HunyuanOCR Model

The new vision-language model from Tencent Hunyuan offers a compact, end-to-end solution for optical character recognition.

Nov 18, 2025

Vision-Language

Meta AI/Vision-Language

Meta releases SAM 3 for image and video segmentation

The latest Segment Anything Model extends Meta's mask-generation lineage from still images into video, now available on Hugging Face.

Nov 7, 2025

Vision-Language

Baidu/Vision-Language

Baidu Releases Open Vision-Language MoE Model

The new ERNIE 4.5 VL model brings advanced multimodal reasoning to the open-source community with an efficient Mixture-of-Experts architecture.

Nov 7, 2025

Reasoning Vision-Language

BAAI/Any-to-Any

BAAI Releases Emu3.5, an 'Any-to-Any' Multimodal Model

The new open-source model from the Beijing Academy of Artificial Intelligence unifies text and image understanding and generation into a single architecture.

Oct 31, 2025

Any-to-Any Text → Image

Microsoft/Vision-Language

Microsoft Releases Fara-7B Vision Agent Model

The 7-billion-parameter model is designed to understand and interact with graphical user interfaces, building on Alibaba's open-source Qwen2.5-VL.

Oct 30, 2025

Vision-Language

Datalab To/Vision-Language

Datalab Releases Chandra, a New OCR Vision Model

The new vision-language model from Datalab is fine-tuned from Qwen2-VL to specialize in extracting text and structure from complex documents.

Oct 21, 2025

Vision-Language

DeepSeek/Vision-Language Major release

DeepSeek-OCR Tackles Document Parsing with Vision AI

The new vision-language model uses a novel context compression technique to efficiently extract text and structure from complex documents.

Oct 17, 2025

Vision-Language

Baidu/Vision-Language

Baidu Releases PaddleOCR-VL for Document AI

The new vision-language model is fine-tuned to understand not just text, but the complex structure of tables, charts, and formulas.

Oct 16, 2025

Vision-Language

Qwen · Alibaba/Vision-Language

Alibaba Releases Qwen3-VL, an 8B Open-Source Vision Model

The latest vision-language model from the popular Qwen series is instruction-tuned and available under an Apache 2.0 license.

Oct 11, 2025

Vision-Language

inclusionAI/Any-to-Any

inclusionAI Releases Ming-UniVision MoE Multimodal Model

The new 16-billion-parameter model uses a sparse Mixture-of-Experts design to efficiently handle 'any-to-any' data combinations, from text to images.

Sep 30, 2025

Any-to-Any Vision-Language

Qwen · Alibaba/Vision-Language Major release

Qwen Releases 30B MoE Vision Model, Qwen3-VL

The new open-source model from Alibaba uses a Mixture-of-Experts architecture to make its powerful vision-language capabilities more efficient to run.

Sep 30, 2025

Any-to-Any Vision-Language

Qwen · Alibaba/Any-to-Any Major release

Qwen3-Omni Arrives With Any-to-Any Multimodality

The new 30B Mixture-of-Experts model from Alibaba's Qwen team can process and generate content across text, image, and audio formats.

Sep 20, 2025

Speech → Text Any-to-Any

Qwen · Alibaba/Any-to-Any

Qwen Releases 'Thinking' Multimodal MoE Model

The new 30-billion-parameter Mixture-of-Experts model from Alibaba's Qwen team is designed to show its reasoning process for complex multimodal tasks.

Sep 15, 2025

Any-to-Any Reasoning