FlashLabs Releases Chroma-4B, an Any-to-Any Model

The new 4-billion-parameter model handles text, image, and speech inputs and outputs, including direct speech-to-speech translation.

Nov 28, 2025

AI research group FlashLabs has released Chroma-4B, a new multimodal model designed for true “any-to-any” capabilities. The 4-billion-parameter model is available under an Apache 2.0 license, making it accessible for both research and commercial applications.

Unlike many multimodal models, which are limited to text and image processing, Chroma-4B can both understand and generate content across text, images, and audio. This opens up use cases, such as direct speech-to-speech translation, that have been difficult for earlier open-source models.

A More Flexible Multimodal Architecture

The model's key feature is its ability to handle complex input and output combinations. According to the release documentation, Chroma-4B supports tasks such as:

Direct speech-to-speech translation
Generating an audio description from an image
Answering text-based questions about an audio clip

This versatility stems from a unified architecture that processes all modalities within a single framework, rather than relying on separate, specialized components.

At 4 billion parameters, Chroma-4B is smaller than many flagship models, but its release marks a step forward for open, natively multimodal AI. By moving beyond the common text-vision pairing, it provides a foundation for more integrated applications that also work with speech. The model and its weights are available on Hugging Face.

Sources

FlashLabs/Chroma-4B (Hugging Face)
