Resemble AI Releases Dramabox Voice Cloning TTS Model
The new text-to-speech model uses a diffusion-transformer architecture for high-quality, expressive audio and one-shot voice cloning.

Resemble AI has publicly released Dramabox, a new text-to-speech (TTS) model designed to generate expressive, high-quality audio. The model's standout feature is one-shot voice cloning: it replicates a speaker's voice from a single short audio clip.
Under the hood, Dramabox employs a diffusion-transformer architecture. According to the company, this approach builds on the LTX-2 flow-matching audio technology, which enables fine-grained control over the generated speech. That control lets the model produce not just clear narration but also speech with emotional nuance and expressiveness, which remains a key challenge in speech synthesis.
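To make the term "flow matching" concrete, here is a minimal toy sketch in Python. It is not Dramabox's code: the velocity function is a hand-written stand-in for the trained diffusion transformer, the target vector is a hypothetical "audio latent", and a real model would condition on the text and the reference voice clip rather than on a known target. The sketch only shows the general idea of starting from noise and following a velocity field from t = 0 to t = 1.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a trained network: velocity that moves x toward target.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    that carries x_t to x1 by t = 1 is (x1 - x_t) / (1 - t).
    """
    return [(tg - xi) / (1.0 - t) for xi, tg in zip(x, target)]


def sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.25, -0.5, 0.75]  # hypothetical "audio latent" values
result = sample(target)
```

In this toy setup the final Euler step lands on the target, so `result` matches `target` up to floating-point error. In a trained system the velocity comes from a neural network, and the number of integration steps trades speed against quality.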
Availability and License
Developers and researchers can access the model weights and inference code on the Hugging Face Hub. Dramabox is released under a custom Community License that permits research and other non-commercial use; any business application requires a separate commercial license.
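For readers who want to pull the files locally, the following is a minimal sketch using the `huggingface_hub` Python package. The repository name is taken from the article's source listing; the helper name `fetch_dramabox` and the `dramabox` target directory are illustrative choices, and the sketch assumes the repository can be downloaded without extra steps. If it is gated behind license acceptance, an authenticated login would be needed first.

```python
REPO_ID = "ResembleAI/Dramabox"  # repository name as listed in the article's sources


def fetch_dramabox(local_dir="dramabox"):
    """Download a snapshot of the repository and return its local path."""
    # Imported lazily so this file can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `fetch_dramabox()` requires network access and `pip install huggingface_hub`. The sketch only fetches files; running inference depends on the code shipped in the repository, and any use remains subject to the Community License terms.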
The release gives the open-weights community a new option for creative and research-oriented audio projects. Its combination of a diffusion-transformer architecture and one-shot voice cloning makes it a notable entry among publicly available TTS models and a high-quality foundation for non-commercial applications.
Sources
- ResembleAI/Dramabox (Hugging Face)