Baidu Releases NAVA for Text-to-Video with Audio
The new model from the Chinese tech giant uses a Multimodal Diffusion Transformer to generate synchronized audio and video from a text prompt, optionally paired with an image.

Baidu has released the weights for NAVA, a generative model that produces video with synchronized audio. NAVA, which stands for Native Audio-Video Animation, takes either a text prompt or a text prompt plus an image and generates short video clips. The weights and example outputs are available in the model's Hugging Face repository.
Under the hood, NAVA uses a Multimodal Diffusion Transformer (MMDiT). This design lets the model process different data types, such as text and image features, within the same transformer blocks, so that each modality can condition on the others when the prompt is interpreted. The model is built on Wan2.2, the open-weights video foundation model from Alibaba, and extends it to joint audio and video generation.
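To make the idea of mixing modalities inside one block concrete, here is a minimal sketch of joint attention in the general MMDiT style: each modality keeps its own projection weights, but attention runs once over the concatenated token sequence. This is an illustration under stated assumptions, not NAVA's actual block. The function names, the `text_proj`/`media_proj` dictionaries, and the tiny pure-Python tensors are all hypothetical.

```python
import math

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def joint_attention(text_tokens, media_tokens, text_proj, media_proj):
    """Joint attention over two token streams.

    text_proj and media_proj are hypothetical dicts holding separate
    'q', 'k', 'v' weight matrices for each modality.
    """
    q, k, v = [], [], []
    for tokens, proj in ((text_tokens, text_proj), (media_tokens, media_proj)):
        q += [matvec(proj["q"], t) for t in tokens]
        k += [matvec(proj["k"], t) for t in tokens]
        v += [matvec(proj["v"], t) for t in tokens]
    mixed = attention(q, k, v)           # every token attends to both streams
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:]  # split back per modality
```

Because the keys and values come from both streams, a text token's output depends on the media tokens and vice versa, which is the property the article describes. A real model would add normalization, multiple heads, timestep conditioning, and feed-forward layers around this step.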
A More Efficient Method
Rather than the noise-prediction objective of classic diffusion models, NAVA is trained with flow matching. In this more recent approach, the model learns a velocity field that carries a noise sample toward a coherent output, typically along near-straight paths. That can make training more efficient and can allow sampling in fewer steps, though the actual gains depend on the model and sampler. The choice reflects a broader trend toward more computationally efficient generative architectures.
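The flow-matching objective can be sketched in a few lines. The example below assumes the common straight-line (rectified-flow style) formulation, where the training target is the constant velocity from a noise sample to a data sample; NAVA's exact schedule and loss weighting are not described here. All function names are hypothetical, and `make_oracle` is a toy stand-in for a trained network.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample estimate of the flow-matching loss for data point x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # time drawn from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def euler_sample(predict_velocity, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_oracle(x1):
    """Toy 'perfect' model for a dataset containing the single point x1."""
    def predict_velocity(x_t, t):
        # Exact velocity toward x1; valid for t < 1.
        return [(b - a) / (1.0 - t) for a, b in zip(x_t, x1)]
    return predict_velocity
```

With the toy oracle, the loss is zero up to rounding and a handful of Euler steps carries any noise sample to `x1`, which illustrates why straight paths can permit few-step sampling. A real model would replace `make_oracle` with a neural network conditioned on the prompt and trained by minimizing this loss over many data and noise pairs.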
The release of NAVA adds another open-weights model to the competitive text-to-video landscape. Its main differentiator is that it generates audio natively alongside video, whereas many other models leave audio to a separate post-processing step. The model is publicly available but ships under a custom license, so developers and researchers should review the terms before incorporating it into their work.
Sources
- baidu/NAVA (Hugging Face)