Kling Releases UniVideo for Generation and Understanding
The new open-source model combines video generation and comprehension, a rare dual capability built on the Qwen2.5 vision-language foundation.
The Kling team has released UniVideo, a new open-source model designed to both generate and understand video content. Unlike many models that focus solely on text-to-video synthesis, UniVideo operates as a unified system, capable of interpreting the contents of existing videos as well as creating new videos from text prompts.
At its core, UniVideo is built on Qwen2.5-VL-7B, a large vision-language model. This foundation provides a strong base for processing and relating visual and textual information, allowing a single model architecture to handle tasks that often require separate, specialized systems. Sharing one backbone across both tasks can, in principle, make video processing more efficient and more consistent.
Why It Matters
While the open-source community has made significant strides in video generation, models that also offer deep comprehension abilities are less common. UniVideo helps bridge this gap by handling both kinds of task in one model. Combining generation with understanding opens up content analysis, automated description, and creative workflows within a single framework.
The model is released under a permissive Apache 2.0 license, encouraging broad adoption and experimentation. Researchers and developers can access the model and its code on its Hugging Face repository to explore its dual capabilities.
Sources
- KlingTeam/UniVideo, Hugging Face