The Open Weights

The daily record of open-source AI. New model releases, leaderboards, and what's coming next — written for people who ship.

Refreshed every 12 hours

© 2026 The Open Weights. An independent publication.

Aggregated by Claude · written with Gemini · curated by humans.

The feed

Latest releases

Every new open-source model release and major update — aggregated from across the ecosystem, deduplicated, and refreshed every 12 hours.

164 releases

Zhipu AI/Text / LLM · Major release

Zhipu AI Releases MIT-Licensed GLM-5.2 MoE Model

The new bilingual model from the Chinese AI firm uses a Mixture of Experts architecture and sparse attention under a fully permissive license.

Jun 17, 2026
Text / LLM · Reasoning
GLM-5.2
Weibo AI/Reasoning

Weibo AI Releases VibeThinker-3B, a Compact Reasoning Model

The new 3-billion-parameter model from the Chinese tech giant focuses on challenging benchmarks in mathematics, coding, and graduate-level questions.

Jun 12, 2026
Reasoning · Text / LLM
VibeThinker-3B
Moonshot AI/Code · Major release

Moonshot AI Releases Kimi-K2.7-Code, a Multimodal Coding Model

The new Mixture-of-Experts model from the Chinese AI company can generate code while also understanding visual inputs, a rare combination in open models.

Jun 11, 2026
Code · Vision-Language
Kimi-K2.7-Code
Zyphra/Text → Speech

Zyphra Releases Open-Source Zonos 2 TTS Model

The new text-to-speech model offers a commercially permissive alternative for developers in a field still dominated by closed-source APIs.

Jun 11, 2026
Text → Speech
Zonos 2
Google DeepMind/Text / LLM

Google Releases Open-Source DiffusionGemma 26B Model

The new 26B parameter model from DeepMind uses a diffusion-based architecture, a technique more common in image generation, to produce text.

Jun 9, 2026
Text / LLM · Vision-Language
DiffusionGemma 26B-A4B Instruct
Zhipu AI/Image → Video

Zhipu AI Releases SCAIL-2 for Character Animation

The new open-source diffusion model from the company's research arm generates video clips from a single character image and a sequence of poses.

Jun 9, 2026
Image → Video
SCAIL-2
Cohere/Code

Cohere Releases North-Mini-Code, an Open MoE Model

The new Apache 2.0-licensed model is designed for code generation and agentic chat applications, using a Mixture-of-Experts architecture for efficiency.

Jun 5, 2026
Code · Text / LLM
North-Mini-Code 1.0
Boson AI/Text → Speech

Boson AI's Higgs Audio v3 Offers Expressive, Multilingual TTS

The new 4-billion-parameter text-to-speech model is available for non-commercial use, promising fine-grained control over vocal delivery.

Jun 4, 2026
Text → Speech
Higgs Audio v3 TTS 4B
MiniMax/Vision-Language · Major release

MiniMax Releases M3, a Multimodal MoE Model

The new open-weight model from MiniMax AI combines vision, coding, and reasoning using a Mixture-of-Experts architecture.

Jun 2, 2026
Vision-Language · Any-to-Any
MiniMax-M3
Ideogram/Text → Image

Ideogram 4.0: A 9.3B Open-Weight Text-to-Image Model

The new 9.3 billion parameter model uses a Diffusion Transformer architecture and excels at rendering coherent text within generated images.

May 30, 2026
Text → Image
Ideogram 4.0
Google DeepMind/Any-to-Any · Major release

Google Releases Gemma 4, a 12B 'Any-to-Any' Model

The new 12-billion-parameter model from Google DeepMind is designed to handle a flexible mix of data types, moving beyond traditional text and image inputs.

May 23, 2026
Any-to-Any · Vision-Language
Gemma 4 12B
NVIDIA/Image → Video

NVIDIA Releases SANA-WM, a Camera-Controllable Video Model

The new model, SANA-WM, uses a bidirectional diffusion process to give creators fine-grained control over camera movement and video editing.

May 18, 2026
Image → Video · Text → Video
SANA-WM Bidirectional
NVIDIA/Speech → Text

NVIDIA Releases Nemotron-3.5 Streaming ASR Model

The 600-million-parameter model uses a FastConformer architecture for real-time, multilingual speech-to-text applications.

May 15, 2026
Speech → Text
Nemotron 3.5 ASR Streaming 0.6B
ByteDance/Any-to-Any · Major release

ByteDance Releases Lance, a Unified Generative AI Model

The 3-billion-parameter model handles image and video generation, editing, and understanding from a single set of weights under a permissive license.

May 15, 2026
Any-to-Any · Text → Image
Lance
SenseTime/Any-to-Any

SenseTime Releases 8B 'Any-to-Any' Infographic Model

The new 8B-parameter SenseNova U1 model from SenseTime is designed for complex multimodal tasks, including the in-conversation generation and editing of infographics.

May 14, 2026
Any-to-Any · Text → Image
SenseNova U1 8B MoT Infographic
Lightricks/Image → Video

Lightricks Releases LoRA for AI Lip-Dubbing

The new 'Identity-Control' adapter fine-tunes the company's LTX-2.3 video model to create realistic lip-syncing for dubbing workflows.

May 11, 2026
Image → Video · Text → Video
LTX-2.3
Tencent/Text / LLM

Tencent Releases 1.8B Model for Multilingual Translation

The 1.8 billion-parameter model from the Chinese tech giant is designed for high-quality translation across a wide range of language pairs.

May 11, 2026
Text / LLM
Hunyuan-MT2 1.8B
Supertone/Text → Speech

Supertone Releases On-Device Multilingual TTS Model

The new Supertonic 3 model supports seven languages and is optimized for local inference with the portable ONNX format.

May 6, 2026
Text → Speech
Supertonic 3
NVIDIA/Image Editing

NVIDIA Releases PiD for High-Quality Image Upscaling

The new component is a specialized VAE decoder that works with Alibaba's Z-Image model to enhance super-resolution tasks.

Apr 28, 2026
Image Editing
NVIDIA PiD (Pixel Diffusion Decoder)
NVIDIA/Any-to-Any

NVIDIA Releases Efficient Nemotron-3 Multimodal MoE

The new 30-billion parameter Mixture-of-Experts model handles text and images while using only 3 billion active parameters for inference.

Apr 24, 2026
Any-to-Any · Reasoning
Nemotron-3 Nano Omni 30B-A3B Reasoning
Google DeepMind/Any-to-Any

Google Releases Gemma 4 Multimodal Open Model

The new 26-billion-parameter model from DeepMind uses a mixture-of-experts design for greater efficiency and is tuned for assistant-style tasks.

Apr 23, 2026
Any-to-Any · Text / LLM
Gemma 4 26B-A4B Instruct (MoE)
Google DeepMind/Any-to-Any · Major release

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is instruction-tuned for 'any-to-any' tasks and released under the permissive Apache 2.0 license.

Apr 23, 2026
Any-to-Any · Text / LLM
Gemma 4 12B
Google DeepMind/Any-to-Any

Google Releases 4B Multimodal Gemma 4 Assistant

The new 4-billion-parameter model is instruction-tuned for 'any-to-any' tasks, handling a flexible mix of data types.

Apr 23, 2026
Any-to-Any · Text / LLM
Gemma 4 E4B-it Assistant
Google DeepMind/Any-to-Any

Google Releases 2B Multimodal Gemma 4 Assistant Model

The new compact model from DeepMind is instruction-tuned for "any-to-any" tasks, capable of processing and generating mixed data types.

Apr 23, 2026
Any-to-Any · Text / LLM
Gemma 4 E2B-it Assistant
Xiaomi/Speech → Text

Xiaomi Releases MiMo Model for Speech Recognition

The new open-source model from the Chinese tech giant offers automatic speech recognition for Mandarin, Cantonese, and English under a permissive MIT license.

Apr 23, 2026
Speech → Text
MiMo-V2.5-ASR
inclusionAI/Any-to-Any

LLaDA2.0-Uni: A Unified MoE for Vision Tasks

The new open-source model from inclusionAI uses a Mixture-of-Experts architecture to handle multiple vision tasks in a single, diffusion-based system.

Apr 22, 2026
Any-to-Any · Text → Image
LLaDA2.0-Uni
DeepSeek/Text / LLM · Major release

DeepSeek Releases V4-Pro, an Open MoE Contender

The new flagship model combines a Mixture-of-Experts architecture with a permissive MIT license, positioning it for wide commercial adoption.

Apr 22, 2026
Text / LLM · Reasoning
DeepSeek-V4-Pro
DeepSeek/Text / LLM · Major release

DeepSeek Releases V4-Flash, a Fast MIT-Licensed MoE Model

The new Mixture-of-Experts model from the Hangzhou-based AI lab is optimized for fast, efficient conversational AI and carries a fully permissive license.

Apr 22, 2026
Text / LLM · Reasoning
DeepSeek-V4-Flash
SenseTime/Any-to-Any

SenseTime Releases 8B Any-to-Any Multimodal Model

The new SenseNova-U1 model unifies image understanding, generation, and editing within a single 8-billion-parameter framework.

Apr 22, 2026
Any-to-Any · Text → Image
SenseNova-U1-8B-MoT
Qwen · Alibaba/Vision-Language

Alibaba's Qwen Releases Open 27B Vision Model

The new dense model, licensed under Apache 2.0, brings both text and image understanding to the midrange parameter space.

Apr 21, 2026
Vision-Language · Text / LLM
Qwen3.6-27B
NVIDIA/Any-to-Any

NVIDIA Releases Nemotron-3-Nano Omni-Modal MoE

The new 30-billion-parameter Mixture-of-Experts model handles any combination of modalities with just 3 billion active parameters.

Apr 20, 2026
Any-to-Any · Reasoning
Nemotron-3 Nano Omni 30B-A3B Reasoning
Resemble AI/Text → Speech

Resemble AI Releases Dramabox Voice Cloning TTS Model

The new text-to-speech model uses a diffusion-transformer architecture for high-quality, expressive audio and one-shot voice cloning.

Apr 17, 2026
Text → Speech
Dramabox TTS
IBM/Speech → Text

IBM Releases 2B Granite Model for Multilingual Speech

The new two-billion-parameter model offers transcription capabilities for at least five major languages under a permissive Apache 2.0 license.

Apr 16, 2026
Speech → Text
Granite Speech 4.1 2B
Qwen · Alibaba/Vision-Language · Major release

Qwen Releases 35B Multimodal Mixture-of-Experts Model

The new Qwen3.6-35B-A3B from Alibaba's Qwen team combines vision and language capabilities using an efficient sparse architecture.

Apr 15, 2026
Vision-Language · Text / LLM
Qwen3.6-27B
Motif Technologies/Text → Video

Motif Releases 2B Open-Source Text-to-Video Model

The new Apache 2.0 licensed model uses a diffusion transformer architecture to offer a new open alternative for video generation research.

Apr 14, 2026
Text → Video · Image → Video
Motif-Video-2B
Moonshot AI/Vision-Language · Major release

Moonshot AI Releases Kimi-K2.6 Multimodal Model

The Chinese AI lab has published weights for its new vision-language model, though a restrictive license limits its use to research applications.

Apr 14, 2026
Vision-Language · Text / LLM
Kimi-K2.6
OpenBMB/Vision-Language

OpenBMB Releases MiniCPM-V for On-Device Vision

The new open-source vision-language model is designed for high-resolution image understanding on mobile and edge devices.

Apr 13, 2026
Vision-Language
MiniCPM-V-4.6
MiniMax/Text / LLM

MiniMax Releases M2.7, an MoE Model with FP8 Weights

The new conversational language model from the Chinese AI company uses a Mixture-of-Experts architecture and 8-bit weights, but is released under a restrictive custom license.

Apr 9, 2026
Text / LLM · Reasoning
MiniMax-M2.7
Baidu/Vision-Language

Baidu Releases Qianfan-OCR for Document Intelligence

The new vision-language model from the Chinese tech giant is designed for complex, multilingual optical character recognition and layout analysis.

Mar 18, 2026
Vision-Language
Qianfan-OCR
Google DeepMind/Any-to-Any · Major release

Google Releases Gemma 4, a 26B Vision-Language Model

The new open-source model from DeepMind uses a Mixture-of-Experts architecture to handle both text and image inputs efficiently.

Mar 11, 2026
Vision-Language · Text / LLM
Gemma 4 12B
Google DeepMind/Any-to-Any · Major release

Google Releases Multimodal Gemma 4 31B Model

The new 31-billion-parameter model is instruction-tuned and can process both text and images, marking a significant expansion for the Gemma family.

Mar 11, 2026
Vision-Language · Text / LLM
Gemma 4 12B
Black Forest Labs/Text → Image · Major release

Black Forest Labs Releases 9B FLUX.2 klein Image Model

The new open-weight model offers a more compact, distilled version of the advanced FLUX architecture for text-to-image and editing tasks.

Mar 9, 2026
Text → Image · Image Editing
FLUX.2 klein base 4B
NVIDIA/Vision-Language

NVIDIA's New 3B VLM Pinpoints Objects in Images

The new 3-billion-parameter model, based on the company's Eagle architecture, is designed for high-precision visual grounding tasks.

Mar 2, 2026
Vision-Language
LocateAnything-3B
Google DeepMind/Any-to-Any · Major release

Google's Gemma 4 Debuts with Any-to-Any Multimodality

The new 4-billion parameter model from Google DeepMind is designed for versatile input and output, handling text, images, and other data types.

Mar 2, 2026
Any-to-Any · Vision-Language
Gemma 4 E4B
Xiaomi/Image Editing

Xiaomi Releases Bilingual Image Editing Model FireRed 1.1

The new open-source model from Xiaomi's FireRedTeam leverages the Qwen-Image-Edit pipeline to offer instruction-based image editing in both English and Chinese.

Mar 2, 2026
Image Editing
FireRed Image Edit 1.1
IBM/Speech → Text

IBM Releases 1B Granite Model for Multilingual Speech

The new Apache 2.0-licensed model is part of the company's Granite family and aims to provide high-quality speech-to-text across several languages.

Feb 27, 2026
Speech → Text
Granite 4.0 1B Speech
Hume AI/Text → Speech

Hume AI Releases 3B Multilingual Text-to-Speech Model

The new model, Tada-3B-ML, is designed for fine-grained control over vocal expression across more than 10 languages.

Feb 16, 2026
Text → Speech
Tada-3B-ML
nineninesix/Text → Speech

Kani-TTS-2 Offers New Open-Source Voice Generation

An independent researcher has released a new English text-to-speech model under a permissive license, built on a modern generative foundation.

Feb 12, 2026
Text → Speech
Kani-TTS-2 (English)
Zhipu AI/Text / LLM · Major release

Zhipu AI Releases Open-Source GLM-5 MoE Model

The new Mixture-of-Experts model from the Chinese AI company combines an advanced architecture with a fully permissive MIT license for commercial use.

Feb 11, 2026
Text / LLM · Reasoning
GLM-5
inclusionAI/Any-to-Any

inclusionAI's Ming 2.0 Tackles Any-to-Any Multimodality

The new open-source Mixture-of-Experts model can process and generate content across text, images, and audio in any combination.

Feb 10, 2026
Any-to-Any
Ming-flash-omni 2.0
Nanbeige/Text / LLM

Nanbeige Releases 3B Chinese-Enhanced Language Model

The new Llama-based model was trained from scratch on 3.5 trillion tokens of Chinese and English data to enhance its bilingual capabilities.

Feb 10, 2026
Text / LLM
Nanbeige4.1-3B
OpenMOSS/Text → Speech

MOSS-TTS: A New Multilingual Text-to-Speech Model

The new system from the OpenMOSS Team uses a novel 'delay-pattern' architecture to generate natural-sounding speech in Chinese, English, and Japanese.

Feb 6, 2026
Text → Speech
MOSS-TTS
Soul-AILab/Music

Soul-AILab Releases Zero-Shot Singing Voice Model

The new model, SoulX-Singer, can replicate a singing voice from a short audio sample and supports both English and Chinese under a permissive license.

Feb 6, 2026
Music · Text → Speech
SoulX-Singer
Zhipu AI/Vision-Language

Zhipu AI Releases Multilingual GLM-OCR Vision Model

The new vision-language model from the creators of the GLM series is specialized for recognizing and extracting text from images across multiple languages.

Jan 30, 2026
Vision-Language
GLM-OCR
OpenMOSS/Image → Video

OpenMOSS Releases MOVA for Joint Video and Audio Gen

The new model generates 360p video from text or images and creates corresponding audio tracks simultaneously, a notable step for integrated audiovisual synthesis.

Jan 28, 2026
Image → Video · Text → Video
MOVA-360p
Qwen · Alibaba/Text → Image

Alibaba's Qwen Team Releases Z-Image Diffusion Model

The makers of the popular Qwen language models have published their first open-source text-to-image generator with a permissive Apache 2.0 license.

Jan 23, 2026
Text → Image
Z-Image
OpenMOSS/Text → Speech

LuxTTS Delivers Lightweight, Open-Source Speech Synthesis

The new text-to-speech model is optimized for the ONNX runtime, making it a promising option for efficient, on-device audio generation.

Jan 22, 2026
Text → Speech
LuxTTS
Microsoft/Speech → Text

Microsoft Releases VibeVoice for Speech Transcription

The new open-source automatic speech recognition model handles multilingual transcription and speaker identification out of the box.

Jan 21, 2026
Speech → Text
VibeVoice ASR
Qwen · Alibaba/Text → Speech

Qwen Releases Open-Source Voice Cloning Model

The new 600-million-parameter Qwen3-TTS model can generate speech in multiple languages and clone voices from short audio clips.

Jan 21, 2026
Text → Speech
Qwen3-TTS 0.6B Base
Qwen · Alibaba/Text → Speech

Qwen Releases a Compact Custom-Voice TTS Model

The new 600-million-parameter model from Alibaba's Qwen team can clone voices from short audio clips for multilingual speech synthesis.

Jan 21, 2026
Text → Speech
Qwen3-TTS-12Hz-0.6B CustomVoice
Zhipu AI/Text / LLM

Zhipu AI Releases GLM-4.7-Flash MoE Model

The new Mixture-of-Experts model from the Beijing-based AI company is optimized for speed and released under the permissive MIT license.

Jan 19, 2026
Text / LLM
GLM-4.7-Flash
Black Forest Labs/Text → Image

Black Forest Labs Releases 9B FLUX.2 Image Model

The new text-to-image model emphasizes speed and efficiency with a novel architecture and FP8 quantization.

Jan 14, 2026
Text → Image · Image Editing
FLUX.2 klein base 4B
OpenMOSS/Text → Speech

Soprano TTS Model Leverages Qwen3 Architecture

The new 80-million-parameter text-to-speech model adapts a powerful language model architecture for efficient, open-source audio generation.

Jan 14, 2026
Text → Speech
Soprano-1.1-80M
Black Forest Labs/Text → Image · Major release

Black Forest Labs Releases 9B FLUX.2 Image Model

The new 9-billion-parameter model uses a Diffusion Transformer architecture, promising higher performance than existing open-source alternatives.

Jan 14, 2026
Text → Image · Image Editing
FLUX.2 klein base 4B
Hume AI/Text → Speech

Hume AI Releases TADA 1B for Expressive Speech

The new 1-billion-parameter model combines a Llama 3.2 base with text-to-speech to generate more natural and nuanced audio.

Jan 12, 2026
Text → Speech
TADA 1B
OpenMOSS/Text → Speech

OpenMOSS Releases KugelAudio for European Languages

The new text-to-speech model uses a hybrid diffusion and autoregressive architecture for high-quality, multilingual synthesis.

Jan 11, 2026
Text → Speech
KugelAudio-0-open
Zhipu AI/Text → Image

Zhipu AI Releases Open, Bilingual GLM-Image Model

The new text-to-image model is fluent in both Chinese and English, built on the CogView2 architecture and released under a permissive MIT license.

Jan 8, 2026
Text → Image
GLM-Image
Supertone/Text → Speech

Supertone Open-Sources Supertonic 2 Voice Model

The new text-to-speech model from the audio AI company supports English, Korean, and Spanish and comes in the efficient ONNX format for deployment.

Jan 6, 2026
Text → Speech
Supertonic 2
Lightricks/Image → Video · Major release

Lightricks Releases LTX-2 Multimodal Video Generator

The new diffusion model from the creative app company can generate short video clips from text, images, audio, and even other videos.

Jan 3, 2026
Image → Video · Text → Video
LTX-2
Moonshot AI/Vision-Language · Major release

Moonshot AI Releases Kimi K2.5 Multimodal Model

The new vision-language model from the Chinese AI firm uses a Mixture-of-Experts architecture and is now available on Hugging Face.

Jan 1, 2026
Vision-Language · Text / LLM
Kimi K2.5
Qwen · Alibaba/Any-to-Any

Qwen's Fun-Audio-Chat: An Open Speech-to-Speech LLM

The 8-billion-parameter model from Alibaba's Qwen team understands and generates spoken responses, enabling more natural audio-first applications.

Dec 23, 2025
Any-to-Any · Text → Speech
Fun-Audio-Chat-8B
OpenMOSS/Text → Speech

MiraTTS Brings Qwen2 to Bilingual Speech Synthesis

A new text-to-speech model from OpenMOSS leverages the Qwen2 architecture to generate speech in both English and Chinese.

Dec 17, 2025
Text → Speech
MiraTTS
Qwen · Alibaba/Image Editing

Qwen Releases Open, Bilingual Image Editing Model

The new diffusion model from Alibaba's team allows for precise, instruction-based image modifications in both English and Chinese.

Dec 17, 2025
Image Editing
Qwen-Image-Edit 2511
Qwen · Alibaba/Speech → Text

Qwen Releases Compact ASR Model for Streaming Audio

The new Fun-ASR-Nano model from Alibaba's team packs real-time multilingual transcription, speaker diarization, and hotword detection into an efficient package.

Dec 15, 2025
Speech → Text
Fun-ASR-Nano-2512
Tencent/Image → Video

Tencent's HY-WorldPlay Creates 3D Scenes from One Image

The new model from Tencent's Hunyuan team generates dynamic video and reconstructs 3D environments using a single static picture.

Dec 12, 2025
Image → Video · Text → 3D
HY-WorldPlay
Meituan/Image Editing

Meituan Releases Open, Bilingual Image Editing Model

The new LongCat-Image-Edit model follows natural language instructions to perform complex photo manipulations in both English and Chinese.

Dec 5, 2025
Image Editing
LongCat-Image-Edit
Baidu/Image → Video

Baidu's Live-Avatar Animates Photos With Audio

The new 14-billion-parameter model uses audio input to generate realistic talking head videos from a single still image.

Dec 4, 2025
Image → Video
Live-Avatar
Microsoft/Text → Speech

Microsoft Releases VibeVoice for Real-Time AI Speech

The new 500-million-parameter model is designed for generating natural, long-form speech with very low latency for interactive applications.

Dec 4, 2025
Text → Speech
VibeVoice Realtime 0.5B
FlashLabs/Any-to-Any

FlashLabs Releases Chroma-4B, an Any-to-Any Model

The new 4-billion-parameter model handles text, image, and speech inputs and outputs, including direct speech-to-speech translation.

Nov 28, 2025
Any-to-Any
Chroma-4B
Qwen · Alibaba/Text → Image

Alibaba Releases Z-Image-Turbo, A Fast Open Image Model

The new text-to-image model from the team behind Qwen uses a diffusion transformer to generate high-resolution images in just a single step.

Nov 25, 2025
Text → Image
Z-Image-Turbo
Nari Labs/Text → Speech

Nari Labs Releases Dia2-2B, an Open Voice Cloning Model

The 2-billion-parameter text-to-speech model can clone voices from a short audio sample and is available under an Apache 2.0 license.

Nov 15, 2025
Text → Speech
Dia2-2B
BAAI/Any-to-Any

BAAI Releases Emu3.5, an 'Any-to-Any' Multimodal Model

The new open-source model from the Beijing Academy of Artificial Intelligence (BAAI) unifies text and image understanding and generation into a single architecture.

Oct 31, 2025
Any-to-Any · Vision-Language
Emu3.5
Soul-AILab/Text → Speech

SoulX-Podcast 1.7B Offers Open Multi-Speaker TTS

The new 1.7-billion-parameter model from Soul-AILab is trained on conversational data to generate natural dialogue in English and Chinese.

Oct 27, 2025
Text → Speech
SoulX-Podcast 1.7B
Meituan/Text → Video

Meituan Releases Open-Source LongCat-Video Model

The Chinese tech giant has released a new MIT-licensed model capable of generating video from text, images, or by continuing existing clips.

Oct 24, 2025
Text → Video · Image → Video
LongCat-Video
Meituan/Any-to-Any

Meituan Debuts LongCat-Flash-Omni, an Any-to-Any AI Model

The new open-source Mixture-of-Experts model can process and generate any combination of text, images, video, audio, and 3D data.

Oct 23, 2025
Any-to-Any
LongCat-Flash-Omni
NVIDIA/Speech → Text

NVIDIA Releases Real-Time Speaker Diarization Model

The new Sortformer-based model is designed for streaming audio, identifying up to four distinct speakers in real time.

Oct 22, 2025
Speech → Text
Streaming Sortformer Diarization 4spk v2.1
MiniMax/Text / LLM · Major release

MiniMax Releases M2, an Open-Weight MoE for Agents

The Shanghai-based AI startup has released a new Mixture-of-Experts model focused on complex reasoning, coding, and agentic tasks.

Oct 22, 2025
Text / LLM · Reasoning
MiniMax-M2
Datalab/Vision-Language

Datalab Releases Chandra, a New OCR Vision Model

The new vision-language model from Datalab is fine-tuned from Qwen2-VL to specialize in extracting text and structure from complex documents.

Oct 21, 2025
Vision-Language
Chandra OCR
Kuaishou/Any-to-Any

Kling Releases UniVideo for Generation and Understanding

The new open-source model combines both video generation and comprehension, a rare dual capability built on the Qwen2.5 vision-language foundation.

Oct 18, 2025
Any-to-Any · Text → Video
UniVideo
Maya Research/Text → Speech

Maya Research Releases Maya1, an Expressive TTS Model

The new Apache 2.0 licensed model uses a Llama-based architecture to generate more natural and emotionally nuanced speech from text.

Oct 18, 2025
Text → Speech
Maya1
DeepSeek/Vision-Language · Major release

DeepSeek-OCR Tackles Document Parsing with Vision AI

The new vision-language model uses a novel context compression technique to efficiently extract text and structure from complex documents.

Oct 17, 2025
Vision-Language
DeepSeek-OCR
Baidu/Vision-Language

Baidu Releases PaddleOCR-VL for Document AI

The new vision-language model is fine-tuned to understand not just text, but the complex structure of tables, charts, and formulas.

Oct 16, 2025
Vision-Language
PaddleOCR-VL
NVIDIA/Speech → Text

NVIDIA's Parakeet ASR Tackles Multi-Speaker Audio

The 600-million-parameter model offers real-time speech-to-text with speaker diarization, built on the efficient FastConformer architecture.

Oct 15, 2025
Speech → Text
Multitalker Parakeet Streaming 0.6B
inclusionAI/Any-to-Any

inclusionAI Debuts 'Any-to-Any' Multimodal MoE Model

The new Ming-flash-omni-Preview aims to handle any combination of data modalities using an efficient Mixture of Experts architecture.

Oct 14, 2025
Any-to-Any
Ming-flash-omni-Preview
Qwen · Alibaba/Vision-Language

Alibaba Releases Qwen3-VL, an 8B Open-Source Vision Model

The latest vision-language model from the popular Qwen series is instruction-tuned and available under an Apache 2.0 license.

Oct 11, 2025
Vision-Language
Qwen3-VL-8B-Instruct
Google DeepMind/Text / LLM

Google Releases Compact FunctionGemma Model

The new 270-million-parameter model from Google DeepMind is fine-tuned specifically for reliable function calling and tool use.

Oct 8, 2025
Text / LLM
FunctionGemma 270M IT
EPFL VITA/Image → Video

EPFL Releases SVI for Streaming Image-to-Video

The new open-source model from Swiss researchers uses a novel chunking method to generate indefinitely long videos from a single still image.

Oct 8, 2025
Image → Video
SVI
Krea/Text → Video

Krea Releases Open-Source Real-Time Video Model

The new 14-billion-parameter model is a distilled, more efficient version of a larger foundation, designed for interactive video generation.

Oct 8, 2025
Text → Video
Krea Realtime Video
inclusionAI/Any-to-Any

inclusionAI Releases Ming-UniVision MoE Multimodal Model

The new 16-billion-parameter model uses a sparse Mixture-of-Experts design to efficiently handle 'any-to-any' data combinations, from text to images.

Sep 30, 2025
Any-to-Any · Vision-Language
Ming-UniVision-16B-A3B
Qwen · Alibaba/Vision-Language · Major release

Qwen Releases 30B MoE Vision Model, Qwen3-VL

The new open-source model from Alibaba uses a Mixture-of-Experts architecture to make its powerful vision-language capabilities more efficient to run.

Sep 30, 2025
Vision-Language · Any-to-Any
Qwen3-VL-8B-Instruct
nineninesix/Text → Speech

Kani TTS 370M Offers Compact Multilingual Speech

Based on Liquid AI's LFM2 architecture, the new model offers an efficient option for developers.

Sep 30, 2025
Text → Speech
Kani TTS 370M
chetwinlow1/Image → Video

Ovi Syncs Audio and Video in New Open-Source Model

Built on the Wan2.2 architecture, this new 5-billion-parameter model generates short video clips from a single image and simultaneously creates synchronized audio.

Sep 30, 2025
Image → Video
Ovi
Zhipu AI/Text / LLM · Major release

Zhipu AI Releases Open-Weight MoE Model GLM-4.6

The new Mixture-of-Experts model is available under a permissive MIT license and is optimized for complex reasoning and coding tasks.

Sep 29, 2025
Text / LLM · Reasoning
GLM-4.6
inclusionAI/Any-to-Any

Ming-UniAudio Brings MoE to Unified Audio AI

A new 16-billion-parameter model from inclusionAI uses a Mixture-of-Experts architecture to handle a wide range of audio tasks efficiently.

Sep 29, 2025
Any-to-Any · Text → Speech
Ming-UniAudio-16B-A3B
ByteDance/Image → Video

ByteDance Releases Lynx for Identity-Preserving Video

The new model from the TikTok parent company generates short video clips that maintain a person's likeness from a single reference image.

Sep 26, 2025
Image → Video
Lynx
Tencent/Text → Image · Major release

Tencent Releases HunyuanImage 3.0 Text-to-Image Model

The new text-to-image generator from the Chinese tech giant uses a Mixture-of-Experts architecture for improved efficiency and output quality.

Sep 25, 2025
Text → Image
HunyuanImage 3.0 Instruct
Qwen · Alibaba/Image Editing

Qwen Releases Open-Source Instruction-Based Image Editor

The new model from Alibaba's Qwen team allows users to modify images using natural language prompts instead of complex tools or masks.

Sep 22, 2025
Image Editing
Qwen-Image-Edit-2509
Qwen · Alibaba/Any-to-Any · Major release

Qwen3-Omni Arrives With Any-to-Any Multimodality

The new 30B Mixture-of-Experts model from Alibaba's Qwen team can process and generate content across text, image, and audio formats.

Sep 20, 2025
Any-to-Any · Vision-Language
Qwen3-Omni-30B-A3B-Instruct
Xiaomi/Any-to-Any

Xiaomi's MiMo-Audio 7B Tackles Complex Speech Tasks

This new instruction-tuned model from Xiaomi can handle a flexible combination of audio and text inputs and outputs, from transcription to voice synthesis.

Sep 18, 2025
Any-to-Any · Text → Speech
MiMo-Audio-7B-Instruct
OpenBMB/Text → Speech

OpenBMB Releases VoxCPM for Open Voice Synthesis

The new 500-million-parameter model offers high-quality text-to-speech and zero-shot voice cloning under a permissive license.

Sep 16, 2025
Text → Speech
VoxCPM-0.5B
Qwen · Alibaba/Any-to-Any

Qwen Releases 'Thinking' Multimodal MoE Model

The new 30-billion-parameter Mixture-of-Experts model from Alibaba's Qwen team is designed to show its reasoning process for complex multimodal tasks.

Sep 15, 2025
Any-to-Any · Reasoning
Qwen3-Omni-30B-A3B-Thinking
Qwen · Alibaba/Any-to-Any

Qwen Releases 30B Model for Audio Captioning

The new Mixture-of-Experts model from Alibaba is fine-tuned to generate detailed, multilingual descriptions for complex audio content.

Sep 15, 2025
Any-to-Any · Text → Speech
Qwen3-Omni-30B-A3B-Captioner
neuphonic/Text → Speech

Neuphonic Releases NeuTTS Air for On-Device AI Speech

The new Apache 2.0 text-to-speech model is built on a Qwen2 architecture and optimized for local inference with GGUF support.

Sep 15, 2025
Text → Speech
NeuTTS Air
moondream/Vision-Language

Moondream 3 Arrives in Preview Release

The next generation of the efficient, open-source vision-language model is now available for early testing and feedback.

Sep 11, 2025
Vision-Language
Moondream 3 (preview)
ByteDance/Image → Video

ByteDance Releases HuMo for Human Video Generation

The new open-source model specializes in creating realistic videos of people, separating appearance from motion for greater control.

Sep 10, 2025
Image → Video
HuMo
Qwen · Alibaba/Text → Video

Alibaba's Wan2.2 Adds Control to Open Video

The new 14-billion-parameter model from Alibaba's PAI team offers fine-grained control over video generation using inputs like sketches and depth maps.

Sep 10, 2025
Text → Video
Wan2.2-VACE-Fun-A14B
Qwen · Alibaba/Text / LLM · Major release

Qwen Releases 80B Mixture-of-Experts Model

The new Qwen3-Next model from Alibaba combines a large parameter count with an efficient MoE architecture to balance performance and computational cost.

Sep 9, 2025
Text / LLM
Qwen3-Next-80B-A3B-Instruct
Alpha-VLLM/Any-to-Any

Lumina-DiMOO: A Diffusion Model for Any-to-Any AI

This new open-source model uses a diffusion architecture instead of a typical transformer to generate and understand a mix of media types.

Sep 9, 2025
Any-to-Any · Text → Image
Lumina-DiMOO
Tencent/Text → Image

Tencent SRPO Fine-Tunes SDXL with Preference Optimization

The new text-to-image model uses a novel rejection sampling technique to align Stable Diffusion XL more closely with human aesthetic preferences.

Sep 8, 2025
Text → Image
SRPO
Tencent/Text → Image

Tencent Releases HunyuanImage 2.1 for Bilingual AI Art

The new text-to-image model from the Chinese tech giant is designed to understand both Chinese and English prompts at high resolutions.

Sep 5, 2025
Text → Image
HunyuanImage 2.1
Microsoft/Text → Speech

Microsoft Releases VibeVoice, a 7B Podcast TTS Model

The new 7-billion-parameter model is designed for generating long-form, multi-speaker audio in English and Chinese under a permissive MIT license.

Sep 4, 2025
Text → Speech
VibeVoice-7B
Microsoft/Text → Speech

Microsoft Releases VibeVoice, a Podcast-Ready TTS Model

The new open-source model specializes in generating long-form, multi-speaker audio in both English and Mandarin, mimicking a natural podcast conversation.

Sep 4, 2025
Text → Speech
VibeVoice Large
StepFun/Any-to-Any

StepFun Releases Step-Audio 2 mini, a Unified Audio AI

The new open-source model handles both speech recognition and audio generation in a single, end-to-end architecture.

Aug 28, 2025
Any-to-Any · Text → Speech
Step-Audio 2 mini
Tencent/Image → Video

Tencent's Voyager Model Turns Images into 3D Worlds

The new model from Tencent AI Lab generates temporally and spatially consistent video sequences from a single image, enabling virtual exploration of static scenes.

Aug 27, 2025
Image → Video · Text → 3D
HunyuanWorld-Voyager
Microsoft/Text → Speech

Microsoft Releases VibeVoice for Long-Form Audio

The new 1.5-billion-parameter text-to-speech model is designed to generate natural, multi-speaker audio for podcasts and other long-form content.

Aug 25, 2025
Text → Speech
VibeVoice-1.5B
Qwen · Alibaba/Image → Video

Alibaba Releases 14B Model for Audio-Driven Video

The new Wan2.2-S2V model takes a still image and a speech track to generate a realistic talking-head animation, available under a permissive license.

Aug 25, 2025
Image → Video
Wan2.2-S2V-14B
OpenBMB/Vision-Language

OpenBMB Releases Compact Multimodal Model MiniCPM-V 4.5

The new vision-language model from the open-source research group demonstrates strong OCR and video understanding capabilities in a small package.

Aug 24, 2025
Vision-Language
MiniCPM-V 4.5
DeepSeek/Text / LLM · Major release

DeepSeek Releases 671B MoE Model Under MIT License

The new DeepSeek-V3.1-Base is a massive 671-billion-parameter Mixture-of-Experts model designed for efficient, large-scale research and development.

Aug 19, 2025
Text / LLM · Reasoning
DeepSeek-V3.1-Base
Qwen · Alibaba/Image Editing · Major release

Qwen Releases Open Model for Image Editing

The new open-source model from Alibaba lets users edit images with simple text commands in both English and Chinese.

Aug 17, 2025
Image Editing
Qwen-Image-Edit
NexaAI/Any-to-Any

NexaAI Releases OmniNeural-4B for On-Device AI

The new 4-billion-parameter model is designed for 'any-to-any' multimodal tasks and optimized to run efficiently on mobile hardware.

Aug 15, 2025
Any-to-Any
OmniNeural-4B
Tencent/Image → Video

Tencent Releases Controllable Game Video Model

The new Hunyuan-GameCraft 1.0 is an open image-to-video model that generates interactive game-like scenes with precise camera control.

Aug 13, 2025
Image → Video
Hunyuan-GameCraft 1.0
FrancisRing/Image → Video

StableAvatar Brings Open-Source Talking Heads to Life

A new diffusion-based model from developer FrancisRing animates still images into talking avatars using only an audio track.

Aug 12, 2025
Image → Video
StableAvatar
Zhipu AI/Vision-Language · Major release

Zhipu AI Releases Open Vision Model GLM-4.5V

The new Mixture-of-Experts model offers strong multimodal reasoning capabilities under a permissive MIT license.

Aug 10, 2025
Vision-Language · Reasoning
GLM-4.5V
Skywork/Image → Video

Skywork Releases Open 'World Model' for Playable Video

The new 1.3-billion-parameter model functions as an interactive 'world model,' generating controllable video scenes from a single static image.

Aug 8, 2025
Image → Video
Matrix-Game 2.0
Google DeepMind/Text / LLM

Google Releases Gemma 3 270M for On-Device AI

The new ultra-compact model from DeepMind is designed for efficient performance in resource-constrained environments like mobile and web.

Aug 5, 2025
Text / LLM
Gemma 3 270M
OpenAI/Reasoning · Major release

OpenAI Releases 21B Open-Weight MoE Model

The new `gpt-oss-20b` is an Apache 2.0-licensed Mixture-of-Experts model designed to run efficiently on consumer-grade hardware.

Aug 4, 2025
Reasoning · Text / LLM
gpt-oss-20b
OpenAI/Reasoning · Major release

OpenAI Releases Its First Open-Source MoE Model

The new 117-billion-parameter `gpt-oss-120b` is a Mixture-of-Experts model focused on reasoning, released under a permissive Apache 2.0 license.

Aug 4, 2025
Reasoning · Text / LLM
gpt-oss-120b
NVIDIA/Speech → Text

NVIDIA Releases Canary 1B v2 Multilingual Speech Model

The new 1-billion-parameter model handles both transcription and translation across 25 European languages using the company's efficient FastConformer architecture.

Aug 4, 2025
Speech → Text
Canary 1B v2
NVIDIA/Speech → Text

NVIDIA Releases 600M Parakeet for Speech Recognition

The new FastConformer model uses a specialized training technique to improve transcription accuracy in noisy, real-world environments.

Aug 4, 2025
Speech → Text
Parakeet TDT 0.6B v3
Qwen · Alibaba/Text → Image · Major release

Qwen Releases Open Model for Text-in-Image Generation

The new Apache 2.0 diffusion model from Alibaba's Qwen team focuses on accurately rendering both English and Chinese characters within generated images.

Aug 2, 2025
Text → Image
Qwen-Image
Qwen · Alibaba/Code

Qwen Releases Compact 30B MoE for Coding Agents

The new Apache 2.0 model from Alibaba's Qwen team uses a Mixture-of-Experts architecture to deliver strong performance with only 3B active parameters.

Jul 31, 2025
Code · Text / LLM
Qwen3-Coder-30B-A3B-Instruct
rednote-hilab/Vision-Language

New VLM `dots.ocr` Takes on Complex Documents

The new 3B-parameter model from rednote-hilab uses a vision-language approach to parse tables, layouts, and even mathematical formulas.

Jul 30, 2025
Vision-Language
dots.ocr
Skywork/Any-to-Any

Skywork Releases UniPic, a Unified 1.5B Vision Model

The new autoregressive model from the Chinese AI lab can understand, generate, and edit images within a single, compact framework.

Jul 29, 2025
Any-to-Any · Text → Image
Skywork-UniPic-1.5B
Qwen · Alibaba/Image → Video · Major release

Alibaba Releases Wan2.2, a 14B MoE Video Model

The new open-source diffusion model from the team behind Qwen uses a Mixture-of-Experts architecture to animate still images.

Jul 28, 2025
Image → Video
Wan2.2-I2V-A14B
Qwen · Alibaba/Text → Video

Qwen Releases Wan2.2, a 5B Open-Source Video Model

The new Apache 2.0-licensed model from Alibaba's team generates video from either text prompts or still images, offering a unified approach in a compact package.

Jul 28, 2025
Text → Video · Image → Video
Wan2.2-TI2V-5B
Qwen · Alibaba/Text → Video

Qwen Unveils Wan2.2, a 14B Open Text-to-Video Model

The new Apache 2.0-licensed model from Alibaba's team uses a Mixture-of-Experts architecture for efficient, high-quality video generation.

Jul 24, 2025
Text → Video
Wan2.2 T2V A14B
Qwen · Alibaba/Image → Video

Qwen Releases Wan2.2, a 14B Image-to-Video Model

The new 14-billion-parameter model from Alibaba's AI team uses a Mixture-of-Experts design and is available under the permissive Apache 2.0 license.

Jul 24, 2025
Image → Video
Wan2.2-I2V-A14B
Qwen · Alibaba/Code · Major release

Qwen Releases 480B Open-Source Model for Code Agents

The new flagship coding model from Alibaba's Qwen team uses a massive Mixture-of-Experts architecture and is released under a permissive Apache 2.0 license.

Jul 22, 2025
Code · Text / LLM
Qwen3-Coder-480B-A35B-Instruct
Zhipu AI/Reasoning · Major release

Z.ai Releases 355B-Parameter GLM-4.5 Under MIT License

The new Mixture-of-Experts model combines massive scale with a fully permissive license, targeting complex reasoning and agentic applications.

Jul 20, 2025
Reasoning · Text / LLM
GLM-4.5
Qwen · Alibaba/Text → Video · Major release

Qwen Releases Wan2.2, a 5B Open Video AI Model

The new Apache 2.0-licensed model from Alibaba's team can generate video from both text and image prompts.

Jul 18, 2025
Text → Video · Image → Video
Wan2.2-TI2V-5B
HiDream.ai/Image Editing

HiDream.ai Releases 17B Open Image Editing Model

The new MIT-licensed model, HiDream-E1.1, allows for complex image modifications by following natural language instructions.

Jul 16, 2025
Image Editing
HiDream-E1.1
inclusionAI/Any-to-Any

Ming-Lite-Omni 1.5 Brings Any-to-Any Modality to Open Source

The new MIT-licensed model from inclusionAI can process and generate a mix of text, images, audio, and video, pushing the boundaries of open multimodal AI.

Jul 15, 2025
Any-to-Any
Ming-Lite-Omni 1.5
RaphaelLiu/Image → Video

Pusa V1: A New Open Model for Image-to-Video Animation

Based on the Wan2.1 architecture, this new 14B-parameter model offers fine-grained control over video generation from still images and text.

Jul 14, 2025
Image → Video · Text → Video
Pusa V1
T-Tech/Speech → Text

T-Tech Releases T-one for Russian Speech Recognition

The new streaming Conformer model from the Russian digital bank is optimized for real-time transcription of telephone conversations.

Jul 14, 2025
Speech → Text
T-one
Moonshot AI/Text / LLM · Major release

Moonshot AI Releases Trillion-Parameter Kimi-K2 Model

The new Mixture-of-Experts model brings massive scale to the open-weights community, focusing on complex reasoning and coding tasks with a 128K context window.

Jul 11, 2025
Text / LLM · Reasoning
Kimi-K2-Instruct
Black Forest Labs/Text → Image · Major release

Black Forest Labs Releases FLUX.1 Krea Image Model

The new 12-billion-parameter model, tuned by creative AI platform Krea, focuses on high-quality aesthetic output and prompt fidelity.

Jul 7, 2025
Text → Image
FLUX.1 Krea [dev]
ByteDance/Any-to-Any

ByteDance Releases Tar-7B for 'Any-to-Any' Multimodality

The new 7-billion-parameter model from the company's SEED team can process and generate a mix of text, images, audio, and video in a single unified framework.

Jul 2, 2025
Any-to-Any
Tar-7B
Boson AI/Text → Speech

Boson AI Releases Higgs Audio v2 for Expressive TTS

The new 3-billion-parameter model focuses on generating expressive, multilingual speech and is fully open for commercial use under an Apache 2.0 license.

Jul 1, 2025
Text → Speech
Higgs Audio v2 (3B)
Kyutai/Text → Speech

Kyutai Releases 1.6B Bilingual TTS Model

The French AI lab's new open-source model generates streaming audio in English and French under a permissive license.

Jun 30, 2025
Text → Speech
Kyutai TTS 1.6B (en/fr)
Zhipu AI/Vision-Language

Zhipu AI Open-Sources 9B Vision Model with 'Thinking' Mode

The new GLM-4.1V-9B-Thinking model makes its vision and chain-of-thought reasoning capabilities available under a permissive MIT license.

Jun 28, 2025
Vision-Language · Reasoning
GLM-4.1V-9B-Thinking
AIDC-AI/Any-to-Any

Ovis-U1-3B Unifies Image Understanding and Generation

The new 3-billion-parameter model from AIDC-AI combines vision-language understanding and image generation into a single 'any-to-any' framework.

Jun 28, 2025
Any-to-Any · Vision-Language
Ovis-U1-3B
NVIDIA/Speech → Text

NVIDIA Fuses LLM and ASR in Canary-Qwen 2.5B Model

The 2.5-billion-parameter speech model combines a FastConformer encoder with a Qwen LLM decoder, a hybrid approach to transcription.

Jun 26, 2025
Speech → Text
Canary-Qwen 2.5B
Maya Research/Text → Speech

Veena TTS Model Targets Indian Languages with Llama Base

Maya Research has released a 3-billion-parameter model designed to generate natural-sounding speech in Hindi and English.

Jun 24, 2025
Text → Speech
Veena
FreedomIntelligence/Any-to-Any

Janus-4o-7B Adds Image Generation to 7B Multimodal AI

The new 7-billion-parameter model from FreedomIntelligence can process various inputs and generate or edit images based on text prompts.

Jun 23, 2025
Any-to-Any · Text → Image
Janus-4o-7B