
Baidu Releases NAVA for Text-to-Video with Audio

The new model from the Chinese tech giant uses a Multimodal Diffusion Transformer to generate synchronized audio and video from text or image prompts.

May 29, 2026


Baidu has released the weights for NAVA, a generative model that produces short video clips with synchronized audio. NAVA, which stands for Native Audio-Video Animation, accepts either a text prompt alone or a text prompt paired with an image. The weights and example outputs are available in the model's Hugging Face repository.

Under the hood, NAVA uses an architecture known as a Multimodal Diffusion Transformer (MMDiT). This design lets the model process and integrate different data types, such as text and image features, within the same transformer blocks, so the prompt and the content being generated are modeled jointly rather than in separate stages. The model builds on Alibaba's Wan2.2 video foundation model and extends it to joint audio-video generation.
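To make the "same transformer blocks" idea concrete, here is a minimal sketch of joint attention over two token streams. It is an illustration of the general MMDiT pattern, not NAVA's actual code: the function names, the single attention head, and the toy two-dimensional tokens are all assumptions made for the example. Each modality gets its own projection matrices, but attention runs over the concatenated sequence, so every token can attend to tokens from the other modality.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def joint_attention(text_tokens, visual_tokens, text_proj, visual_proj):
    """Single-head attention over both streams at once (hypothetical helper).

    text_proj and visual_proj are (Wq, Wk, Wv) tuples, one set per modality.
    """
    queries, keys, values = [], [], []
    streams = ((text_tokens, text_proj), (visual_tokens, visual_proj))
    for tokens, (w_q, w_k, w_v) in streams:
        for token in tokens:
            queries.append(matvec(w_q, token))
            keys.append(matvec(w_k, token))
            values.append(matvec(w_v, token))

    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Scores are computed against keys from BOTH modalities.
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])

    # Split the joint sequence back into its two streams.
    n_text = len(text_tokens)
    return outputs[:n_text], outputs[n_text:]

# Toy inputs: one text token, two visual tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
proj = (identity, identity, identity)
text_out, visual_out = joint_attention(
    [[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]], proj, proj
)
```

In the toy run, the text token's output picks up a nonzero second component even though its own value vector has none, which shows information flowing from the visual stream into the text stream inside a single attention step.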

A More Efficient Method

Rather than the noise-prediction objective used in traditional diffusion training, NAVA is trained with flow matching. In this more recent approach, the model learns a velocity field that carries samples from noise toward a final, coherent output along a direct path, which can make training more efficient and allow inference in fewer steps. The choice reflects a growing trend toward more computationally efficient generative architectures.
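The flow-matching idea can be sketched in a few lines. The example below assumes the common straight-line formulation, where a training point is an interpolation between a noise sample and a data sample and the regression target is the constant velocity between them; whether NAVA uses exactly this path is not stated in the release, and every name here is hypothetical. A toy "oracle" stands in for the trained network.

```python
import random

def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the straight path: constant and equal to data - noise."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(noise, data, t)
    predicted = predict_velocity(x_t, t)
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)

def euler_sample(predict_velocity, noise, steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        velocity = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

# Toy setup: a single fixed "data" point and one Gaussian noise sample.
random.seed(0)
data = [1.0, -2.0, 0.5]
noise = [random.gauss(0.0, 1.0) for _ in data]

def oracle(x_t, t):
    """Ideal velocity for a single target: points straight at the data."""
    return [(d - x) / (1.0 - t) for d, x in zip(data, x_t)]

loss = flow_matching_loss(oracle, noise, data, t=0.3)
sample = euler_sample(oracle, noise, steps=8)
```

With the oracle, the loss is zero up to rounding and eight Euler steps land on the data point, because the path is a straight line. A real model only approximates the velocity field, so the number of steps needed in practice depends on how straight the learned paths turn out to be.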

The release of NAVA adds another significant open-weights model to the competitive text-to-video landscape. Its ability to generate audio natively alongside video is a key differentiator, since many other models add audio in a separate post-processing step. Although the weights are publicly available, the model ships under a custom license, so developers and researchers should review the terms before incorporating it into their work.

Sources

baidu/NAVA (Hugging Face)
