rednote-hilab/Vision-Language

New VLM `dots.ocr` Takes on Complex Documents

The new 3B-parameter model from rednote-hilab uses a vision-language approach to parse tables, layouts, and even mathematical formulas.

Jul 30, 2025

Researchers at rednote-hilab have released dots.ocr, a new open-source model designed for sophisticated document understanding. At 3 billion parameters, this vision-language model (VLM) moves beyond simple text extraction to interpret the complex structure of a page.

dots.ocr applies a multimodal approach to Optical Character Recognition (OCR), handling layout detection and content recognition within a single vision-language model. Instead of merely identifying characters in sequence, it models the spatial relationships between elements, allowing it to make sense of a document's overall layout.

Advanced Document Parsing

The model's capabilities make it particularly well-suited for digitizing challenging content. It excels at:

Layout Analysis: Identifying columns, headers, and figures.
Table Extraction: Accurately parsing rows and columns from structured tables.
Formula Recognition: Transcribing complex mathematical and scientific notation, a common failure point for traditional OCR systems.
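
For developers wondering what consuming this kind of output might look like, here is a minimal sketch. It assumes the parser returns one JSON list per page, where each element carries a bounding box, a category label, and the recognized text. The field names (`bbox`, `category`, `text`), the category labels, and the sample values are illustrative assumptions, not confirmed model output, so check the repository for the actual schema.

```python
import json
from collections import defaultdict

# Hypothetical output for one page. The field names and values below are
# assumptions made for illustration; they are not real model output.
SAMPLE_OUTPUT = """
[
  {"bbox": [40, 30, 560, 60], "category": "Title", "text": "Quarterly Report"},
  {"bbox": [40, 80, 560, 200], "category": "Table",
   "text": "<table><tr><td>Q1</td><td>42</td></tr></table>"},
  {"bbox": [40, 220, 560, 260], "category": "Formula", "text": "E = mc^2"}
]
"""


def group_by_category(raw_json):
    """Parse the JSON string and group layout elements by category label."""
    grouped = defaultdict(list)
    for element in json.loads(raw_json):
        grouped[element["category"]].append(element)
    return dict(grouped)


elements = group_by_category(SAMPLE_OUTPUT)

# Pull out only the pieces a downstream pipeline cares about.
tables = [e["text"] for e in elements.get("Table", [])]
formulas = [e["text"] for e in elements.get("Formula", [])]

print(f"{len(tables)} table(s), {len(formulas)} formula(s) found")
```

Grouping by category first keeps downstream steps simple: tables can go to a table parser and formulas to a math renderer, without either step needing to know about the rest of the page.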

The release of dots.ocr gives developers building document intelligence applications a strong, openly licensed alternative. By handling nuanced formats that often require manual intervention, it opens new possibilities for automating data extraction from scientific papers, financial reports, and technical manuals. The model and usage examples are available on its Hugging Face repository.

Sources

rednote-hilab/dots.ocr (Hugging Face)