Zhipu AI Releases Multilingual GLM-OCR Vision Model
The new vision-language model from the creators of the GLM series is specialized for recognizing and extracting text from images across multiple languages.

Zhipu AI, the company behind the prominent GLM series of large language models, has released a new open-source model focused on a classic computer vision task: optical character recognition (OCR). The new model, called GLM-OCR, is a vision-language model (VLM) designed specifically to identify and extract text embedded in images.
The key feature of GLM-OCR is its multilingual capability. According to the project's official release page, the model is trained to handle text in Chinese, English, Korean, and Japanese, making it a potentially valuable tool for applications that need to process documents and images from across East Asia and the English-speaking world. You can find the model and usage instructions on its Hugging Face repository.
Why it matters
High-quality OCR is a foundational technology for digitizing documents, parsing user interfaces, and powering accessibility tools. While powerful OCR services are available through proprietary APIs, strong open-source alternatives empower developers to build applications with more privacy and control. GLM-OCR provides a new, specialized tool for this purpose, particularly for developers working with multilingual content.
Zhipu AI has released the model weights, but potential users should check the license. The model is available under a custom license that limits its use in online services, a key distinction from more permissive licenses like Apache 2.0. Because this restricts some commercial applications, developers should review the terms carefully before integrating the model into their projects.
Sources
- zai-org/GLM-OCR (Hugging Face)