Zhipu AI Releases Multilingual GLM-OCR Vision Model

The new vision-language model from the creators of the GLM series is specialized for recognizing and extracting text from images across multiple languages.

Jan 30, 2026

Zhipu AI, the company behind the prominent GLM series of large language models, has released a new open-source model focused on a classic computer vision task: optical character recognition (OCR). The new model, called GLM-OCR, is a vision-language model (VLM) designed specifically to identify and extract text embedded in images.

The key feature of GLM-OCR is its multilingual capability. According to the project's official release page, the model is trained to handle text in Chinese, English, Korean, and Japanese, making it a potentially valuable tool for applications that need to process documents and images from across East Asia and the English-speaking world. You can find the model and usage instructions on its Hugging Face repository.

Why it matters

High-quality OCR is a foundational technology for digitizing documents, parsing user interfaces, and powering accessibility tools. While powerful OCR services are available through proprietary APIs, strong open-source alternatives empower developers to build applications with more privacy and control. GLM-OCR provides a new, specialized tool for this purpose, particularly for developers working with multilingual content.

Zhipu AI has released the model weights, but potential users should note the license. The model is available under a custom license that limits its use in online services, a key distinction from permissive licenses such as Apache 2.0. Because those limits can rule out some commercial deployments, developers should review the terms carefully before integrating the model into their projects.

Sources

zai-org/GLM-OCR
Hugging Face