Nari Labs Releases Dia2-2B, an Open Voice Cloning Model
The 2-billion-parameter text-to-speech model can clone voices from a short audio sample and is available under an Apache 2.0 license.
Nari Labs has released Dia2-2B, an open-source text-to-speech (TTS) model. The 2-billion-parameter model is designed for high-fidelity audio generation and ships under the permissive Apache 2.0 license, which allows both commercial and research use.
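For a sense of what a 2-billion-parameter model means in practice, the short sketch below estimates the memory needed to hold the weights alone at common numeric precisions. The parameter count comes from the announcement; the function name and the list of precisions are illustrative assumptions, and the figures exclude activations, any audio codec, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate for a 2-billion-parameter model.
# Assumption: every parameter is stored at the same precision.
PARAMS = 2_000_000_000  # parameter count stated in the article

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(params: int, dtype: str) -> float:
    """Return the memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


# Print one estimate per precision.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.0f} GB")
```

At 16-bit precision the weights come to roughly 4 GB, which suggests the model could fit on a single consumer GPU, though actual requirements depend on how the released checkpoint is stored and run.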
The model's primary capability is zero-shot voice cloning. It can analyze a brief audio sample to capture the acoustic properties of a speaker (including timbre, rhythm, and prosody) and then generate new speech in that voice from any given text. Because the cloning happens at inference time, no new model has to be trained or fine-tuned for each speaker.
Technical Foundations
Dia2-2B is a diffusion-based model, a class of generative models that produce output by iteratively refining a noisy signal. It was trained on more than 200,000 hours of English speech sourced from public domain audiobooks. While it builds on foundational concepts from the Bark model, Dia2 has a distinct architecture and was trained on a new dataset.
The release gives developers an openly available alternative to proprietary TTS and voice cloning APIs. Possible applications range from personalized digital assistants to content creation tools. The model is available for download from the Nari Labs Hugging Face repository.
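As a minimal sketch of fetching the model files, the snippet below uses `snapshot_download` from the `huggingface_hub` Python package. The repository name is taken from the article's source listing; the helper name `download_model` is hypothetical, and the snippet only downloads files. Running inference requires Nari Labs' own code, which is not shown here.

```python
# Sketch: download the Dia2-2B repository from Hugging Face.
# Requires: pip install huggingface_hub
REPO_ID = "nari-labs/Dia2-2B"  # repository name as listed in the article's sources


def download_model(local_dir=None):
    """Download every file in the model repository and return the local path.

    `download_model` is a hypothetical helper for illustration; it simply
    wraps huggingface_hub.snapshot_download.
    """
    # Imported lazily so this module loads even where huggingface_hub
    # is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_model()` stores the files in the local Hugging Face cache and returns that directory; passing `local_dir="./dia2"` would place them in a folder of your choosing instead.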
Sources
- nari-labs/Dia2-2B (Hugging Face)