NVIDIA Releases Cosmos3 Image-to-Video World Model
The latest release in NVIDIA's 'world model' research family aims to generate coherent and realistic video from a single static image.
The new model, SANA-WM, uses a bidirectional diffusion process to give creators fine-grained control over camera movement and video editing.
The 600-million-parameter model uses a FastConformer architecture for real-time, multilingual speech-to-text applications.
The new component is a specialized VAE decoder that works with Alibaba's Z-Image model to enhance super-resolution tasks.
The new 30-billion-parameter Mixture-of-Experts model handles text and images while using only 3 billion active parameters for inference.
The new 30-billion-parameter Mixture-of-Experts model handles any combination of modalities with just 3 billion active parameters.
The new 3-billion-parameter model, based on the company's Eagle architecture, is designed for high-precision visual grounding tasks.
The 600-million-parameter Nemotron model is designed for real-time English transcription using a cache-aware FastConformer architecture.
The new diffusion model generates short video clips from text and image prompts, adding another major player to the open video space.
The new ERNIE 4.5 VL model brings advanced multimodal reasoning to the open-source community with an efficient Mixture-of-Experts architecture.
The new Sortformer-based model is designed for streaming audio, identifying up to four distinct speakers in real time.
The 600-million-parameter model offers real-time speech-to-text with speaker diarization, built on the efficient FastConformer architecture.
The new 1-billion-parameter model handles both transcription and translation across five languages using the company's efficient FastConformer architecture.
The new FastConformer model uses a specialized training technique to improve transcription accuracy in noisy, real-world environments.
The 2.5-billion-parameter speech model combines a FastConformer encoder with a Qwen LLM decoder, a hybrid approach to transcription.