Baidu Releases PaddleOCR-VL for Document AI
The new vision-language model is specialized to read not just text, but also the structure of tables, charts, and formulas.

Baidu has released PaddleOCR-VL, a new open-source vision-language model specialized for complex document understanding. The model aims to go beyond simple text recognition by interpreting the structural elements of a page, a task that is a common stumbling block in automated data processing.
Built on the company's ERNIE 4.5 architecture, PaddleOCR-VL is designed to handle challenging optical character recognition (OCR) tasks that often trip up traditional systems. Its capabilities extend to parsing the intricate details of documents, including page layouts, tables, mathematical formulas, and charts.
Because it is a vision-language model (VLM), PaddleOCR-VL can draw on context, treating a document as a cohesive whole rather than a simple sequence of characters. By modeling the relationships between text and visual elements, it can more accurately extract structured data from unstructured sources such as scanned reports or academic papers.
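
To make "structured data" concrete, here is a minimal sketch of how parsed page elements could be represented and rendered as Markdown in reading order. The `Element` schema and the functions `table_to_markdown` and `page_to_markdown` are hypothetical names invented for this illustration. They are not PaddleOCR-VL's actual output format or API, which the model card on Hugging Face documents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """One parsed page element (hypothetical schema, for illustration only)."""
    kind: str        # "text", "table", "formula", or "chart"
    order: int       # reading-order index on the page
    text: str = ""   # recognized text, LaTeX source, or chart caption
    rows: List[List[str]] = field(default_factory=list)  # table cells, header first


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render table rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def page_to_markdown(elements: List[Element]) -> str:
    """Join elements in reading order, choosing a rendering per element kind."""
    parts = []
    for el in sorted(elements, key=lambda e: e.order):
        if el.kind == "table":
            parts.append(table_to_markdown(el.rows))
        elif el.kind == "formula":
            parts.append(f"$${el.text}$$")
        elif el.kind == "chart":
            parts.append(f"[chart] {el.text}")
        else:
            parts.append(el.text)
    return "\n\n".join(parts)
```

The point of the sketch is the contrast with plain OCR: instead of one flat string of characters, each element carries its type and its position in the reading order, so a table stays a table and a formula stays a formula downstream.
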
The release of PaddleOCR-VL gives developers an openly licensed option for document intelligence and automation pipelines. It reflects a growing trend of applying large multimodal models to specific, high-value problems in data extraction and analysis. The model is available on Hugging Face under an Apache 2.0 license.
Sources
- PaddlePaddle/PaddleOCR-VL (Hugging Face)