Baidu Releases NAVA for Text-to-Video with Audio
The new model from the Chinese tech giant uses a Multimodal Diffusion Transformer to generate synchronized audio and video from text or image prompts.
The large diffusion model from the Chinese tech giant is available under the commercially permissive Apache 2.0 license, a notable release for the community.
The new vision-language model from the Chinese tech giant is designed for complex, multilingual optical character recognition and layout analysis.
The new PaddleOCR-VL model is built to parse not just text, but also the tables, formulas, and page layouts found in complex documents.
The new 14-billion-parameter model uses audio input to generate realistic talking head videos from a single still image.
The new vision-language model is fine-tuned to understand not just text, but the complex structure of tables, charts, and formulas.