OpenMOSS Releases MOVA for Joint Video and Audio Generation
The new model generates 360p video from text or images and creates corresponding audio tracks simultaneously, a notable step for integrated audiovisual synthesis.

The OpenMOSS team has introduced MOVA-360p, a new generative model that can create short video clips from either a text description or a starting image. While many recent models focus on visual generation, MOVA stands out by tackling both video and audio in a single process.
MOVA's key feature is its ability to perform joint audio-video generation. Instead of producing a silent video that requires a separate soundtrack, the model synthesizes an accompanying audio track that is thematically consistent with the visual content. This integrated approach aims to create more immersive and complete generative media.
The model architecture is built upon established open-source components, using a Stable Diffusion 1.5 foundation for the visual elements and an audio generation model named Tango for the sound. It outputs clips at a resolution of 360p, positioning it as a tool for research and experimentation in multimodal generation.
The model is available on Hugging Face as OpenMOSS-Team/MOVA-360p, where researchers and developers can read the model card and download the weights. The release is provided under a custom license, so users should review its terms before relying on the model for their specific use cases.
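For readers who want to fetch the weights programmatically, the following is a minimal sketch using the `huggingface_hub` library's `snapshot_download` function. The repository ID is the one listed in the sources. The helper names `model_page_url` and `download_weights` and the local directory are illustrative assumptions, not part of the release, and the download may require accepting the custom license on the model page first.

```python
# Minimal sketch: locate and download the MOVA-360p repository from Hugging Face.
# Assumes `pip install huggingface_hub`; the helper names below are hypothetical.

REPO_ID = "OpenMOSS-Team/MOVA-360p"  # repository ID as listed in the sources


def model_page_url(repo_id: str = REPO_ID) -> str:
    """Return the Hugging Face page where the model card and license live."""
    return f"https://huggingface.co/{repo_id}"


def download_weights(repo_id: str = REPO_ID, local_dir: str = "./MOVA-360p") -> str:
    """Download every file in the repository and return the local path.

    The import is deferred so the module loads even when huggingface_hub
    is not installed. The local_dir default is an arbitrary choice.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    # Print the page to review the license terms before downloading.
    print(model_page_url())
    # Uncomment once the license terms have been reviewed:
    # print(download_weights())
```

The download call is left commented out so that running the script only prints the model page. This keeps the license review as an explicit step before any weights are fetched.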
Sources
- OpenMOSS-Team/MOVA-360p (Hugging Face)