JD.com Enters Open-Source AI Video with JoyAI-Echo
The Chinese e-commerce giant has released a new model capable of generating long-form, multi-shot videos with synchronized audio from text prompts.

Chinese e-commerce company JD.com has released JoyAI-Echo, a new open-source model for generating video from text. The release marks the company's entry into the competitive field of open-source generative video, adding another major corporate player to the ecosystem.
Unlike many models that produce short, single clips, JoyAI-Echo is designed for creating "long-form, multi-shot" videos. This allows it to generate a sequence of related scenes that can form a more coherent narrative. Crucially, the model also generates synchronized audio to accompany the video, a feature still emerging in many open video tools.
The model is based on the LTX-Video research framework, which focuses on generating temporally consistent and longer video sequences. The code and model weights are available in its Hugging Face repository, jdopensource/JoyAI-Echo, under an Apache 2.0 license, which permits commercial use.
JoyAI-Echo’s release reflects a broader shift from short clip generation toward more practical, narrative-driven video creation. Its focus on multi-shot storytelling and integrated audio extends what open-source video models can do, giving creators and researchers a new tool for exploring long-form generative content.
Sources
- jdopensource/JoyAI-Echo (Hugging Face)