Kuaishou/Any-to-Any

Kling Releases UniVideo for Generation and Understanding

The new open-source model combines video generation and comprehension, a rare dual capability, and is built on the Qwen2.5 vision-language foundation.

Oct 18, 2025

Notable | Apache 2.0

The Kling team has released UniVideo, an open-source model designed to both generate and understand video. Unlike many models that focus solely on text-to-video synthesis, UniVideo operates as a unified system: it can interpret the contents of an existing video and create new videos from text prompts.

At its core, UniVideo is built on Qwen2.5-VL-7B, a 7-billion-parameter vision-language model. This foundation gives it a strong base for relating visual and textual information, so a single model can handle tasks that often require separate, specialized systems. In principle, handling both tasks in one model can make video processing more efficient and coherent than chaining separate systems.

Why It Matters

While the open-source community has made significant strides in video generation, models that also understand video well are less common. UniVideo helps close this gap by handling both in one model. Combining generation with understanding opens up content analysis, automated description, and creative workflows within a single framework.

The model is released under the permissive Apache 2.0 license, which encourages broad adoption and experimentation. Researchers and developers can access the model and its code through its Hugging Face repository.

Sources

KlingTeam/UniVideo (Hugging Face)

