
Baidu Releases PaddleOCR-VL for Document AI

The new vision-language model is fine-tuned to understand not just text, but the complex structure of tables, charts, and formulas.

Oct 16, 2025


Baidu has released PaddleOCR-VL, a new open-source vision-language model specialized for complex document understanding. The model aims to go beyond simple text recognition by interpreting a page's structural elements, which are a common stumbling block in automated data processing.

Built on the company's ERNIE 4.5 architecture, PaddleOCR-VL is designed to handle challenging optical character recognition (OCR) tasks that often trip up traditional systems. Its capabilities extend to parsing the intricate details of documents, including page layouts, tables, mathematical formulas, and charts.

Because it is a vision-language model (VLM), PaddleOCR-VL can draw on context, treating a document as a cohesive whole rather than a flat sequence of characters. By modeling the relationships between text and visual elements, it can more accurately extract structured data from unstructured sources such as scanned reports or academic papers.

The release gives developers a new open-source option for document intelligence and automation pipelines. It reflects a growing trend of applying large multimodal models to specific, high-value problems in data extraction and analysis. The model is available on Hugging Face under an Apache 2.0 license.

Sources

PaddlePaddle/PaddleOCR-VL (Hugging Face)
