Tencent Releases 2B Vision Model for Robotics
The new HY-Embodied 0.5 is a vision-language model designed specifically for multi-object tracking in dynamic, real-world environments.
Tencent's Hunyuan team has released HY-Embodied 0.5, a new 2-billion-parameter vision-language model aimed at the growing field of embodied AI.
Unlike many general-purpose VLMs that focus on static image captioning, HY-Embodied is built on an end-to-end Multi-object Tracking (MoT) architecture. This allows the model to perceive and follow multiple distinct objects through video sequences—a critical capability for robots and other autonomous agents that need to understand dynamic scenes.
A Foundation for Physical Agents
The model's specialized design bridges the gap between passive visual understanding and the active interaction required in robotics. By providing a unified system for tracking objects over time, HY-Embodied could enable more sophisticated behaviors in applications like:
- Robotic navigation and manipulation
- Autonomous vehicle systems
- Advanced video analysis
The release signals a move toward foundation models for specific, complex domains beyond text and image generation. HY-Embodied 0.5 is available on Hugging Face under a custom license agreement.
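For readers who want to inspect the release, the following is a minimal sketch of fetching the repository files with the `huggingface_hub` library. It assumes the repository ID matches the name in the source link (`tencent/HY-Embodied-0.5`) and that the files can be downloaded without extra access steps. Because the model ships under a custom license, accepting terms or authenticating may be required first. The `download_model` helper and the default directory name are illustrative, not part of the release.

```python
# Repository name as listed in the article's source link (assumed to be the repo ID).
REPO_ID = "tencent/HY-Embodied-0.5"


def download_model(local_dir: str = "hy-embodied-0.5") -> str:
    """Download the model repository files and return the local folder path.

    Hypothetical helper for illustration; requires `pip install huggingface_hub`
    and network access. May fail if the repository requires license acceptance.
    """
    # Imported lazily so this module can be loaded without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_model()` should return the path of the folder containing the downloaded files. How to load and run the model is described in the repository itself and is not covered here.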
Sources
- tencent/HY-Embodied-0.5 (Hugging Face)