New VLM `dots.ocr` Takes on Complex Documents
The new 3B-parameter model from rednote-hilab uses a vision-language approach to parse tables, layouts, and even mathematical formulas.

Researchers at rednote-hilab have released dots.ocr, a new open-source model designed for sophisticated document understanding. At 3 billion parameters, this vision-language model (VLM) moves beyond simple text extraction to interpret the complex structure of a page.
According to its model card, dots.ocr is built on a compact 1.7B-parameter language model paired with a vision encoder, and it applies a multimodal approach to Optical Character Recognition (OCR). Instead of merely identifying characters in sequence, it models the spatial relationships between elements, allowing it to make sense of a document's overall layout.
Advanced Document Parsing
The model's capabilities make it particularly well-suited for digitizing challenging content. It excels at:
- Layout Analysis: Identifying columns, headers, and figures.
- Table Extraction: Accurately parsing rows and columns from structured tables.
- Formula Recognition: Transcribing complex mathematical and scientific notation, a common failure point for traditional OCR systems.
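To make the three capabilities above concrete, here is a minimal Python sketch of how downstream code might consume layout-aware output. It assumes the parser emits a JSON list of page elements, each with a bounding box, a category, and its text. The field names (`bbox`, `category`, `text`), the category labels, and the sample data are illustrative assumptions, not something this article confirms; check the repository for the actual schema.

```python
import json
from collections import defaultdict

# Hypothetical parser output: the schema and values are illustrative assumptions.
# bbox is taken to be [x1, y1, x2, y2] in page pixels.
SAMPLE_OUTPUT = """
[
  {"bbox": [50, 40, 550, 80], "category": "Title", "text": "Quarterly Report"},
  {"bbox": [50, 300, 550, 420], "category": "Table", "text": "<table><tr><td>Q1</td><td>10</td></tr></table>"},
  {"bbox": [50, 100, 550, 280], "category": "Text", "text": "Revenue grew in the first quarter."},
  {"bbox": [50, 440, 550, 480], "category": "Formula", "text": "E = mc^2"}
]
"""


def group_by_category(elements):
    """Sort elements top-to-bottom, then left-to-right, and group text by category."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    groups = defaultdict(list)
    for element in ordered:
        groups[element["category"]].append(element["text"])
    return dict(groups)


elements = json.loads(SAMPLE_OUTPUT)
groups = group_by_category(elements)
for category, texts in groups.items():
    print(f"{category}: {len(texts)} element(s)")
```

Sorting by the top-left corner is only a rough stand-in for reading order on a single-column page; multi-column layouts would need a more careful ordering.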
The release of dots.ocr gives developers building document intelligence applications a strong, openly licensed alternative. By handling nuanced formats that often require manual intervention, it opens new possibilities for automating data extraction from scientific papers, financial reports, and technical manuals. The model and usage examples are available on its Hugging Face repository.
Sources
- rednote-hilab/dots.ocr (Hugging Face)