
MisoLabs Debuts MisoTTS, an Open Voice Model

The new text-to-speech system adapts the decoder-only architecture of language models like Llama to generate more natural-sounding speech.

May 21, 2026


A new contender has entered the open-source speech synthesis space. Startup MisoLabs has released MisoTTS, a text-to-speech (TTS) model that applies popular architectural patterns from large language models to the challenge of generating human-like audio.

Unlike many traditional TTS systems, MisoTTS uses a decoder-only transformer architecture, a design heavily inspired by models in the Llama family. This approach treats audio generation as autoregressive next-token prediction: the model produces speech one piece at a time, each conditioned on everything before it, much as an LLM predicts the next word in a sentence. The goal is to produce more natural and expressive speech by applying the same principles that have dramatically advanced text generation.
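To make the idea concrete, here is a minimal sketch of what autoregressive decoding over audio tokens can look like. It is not MisoTTS code. The discrete audio codebook, the byte-level text tokens, the vocabulary sizes, and every function name below are assumptions made for illustration, and `toy_logits` is a random stand-in for a real transformer.

```python
import math
import random

# All sizes and names here are hypothetical, chosen only for illustration.
TEXT_VOCAB = 256      # assumed: text is fed in as raw UTF-8 bytes
AUDIO_VOCAB = 1024    # assumed: size of a discrete audio codebook
EOS = AUDIO_VOCAB     # extra logit index meaning "end of speech"


def toy_logits(context):
    """Stand-in for a decoder-only transformer.

    A real model would attend over the whole context; this toy version
    just derives pseudo-random scores from the context length and last token.
    """
    last = context[-1] if context else 0
    rng = random.Random(len(context) * 7919 + last)
    return [rng.gauss(0.0, 1.0) for _ in range(AUDIO_VOCAB + 1)]


def sample(logits, temperature, rng):
    """Draw one token id from a temperature-scaled softmax over the logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_audio_tokens(text, max_new_tokens=200, temperature=0.8, seed=0):
    """Generate audio token ids one at a time, conditioned on the text prompt."""
    rng = random.Random(seed)
    context = list(text.encode("utf-8"))  # the prompt: text tokens
    audio = []
    for _ in range(max_new_tokens):
        token = sample(toy_logits(context), temperature, rng)
        if token == EOS:
            break
        audio.append(token)
        # Feed the new token back in, offset past the text ids so both
        # kinds of token share one vocabulary.
        context.append(TEXT_VOCAB + token)
    return audio


tokens = generate_audio_tokens("Hello from a toy model.", max_new_tokens=20)
print(len(tokens), tokens[:5])
```

In a real system, the resulting token ids would still have to pass through a separate audio decoder to become a waveform. That step is omitted here.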

MisoTTS at a Glance

The model was trained on a foundation of public domain audiobooks and currently supports two languages. Key features of the initial release include:

Architecture: 24-layer decoder-only transformer.
Languages: English and Japanese.
License: Creative Commons BY-NC-SA 4.0 (non-commercial use).

The release of MisoTTS highlights a growing trend of cross-pollination in AI research, where successful architectures from one domain are adapted to solve problems in another. While its non-commercial license limits its use in products, it provides researchers and hobbyists a new tool for exploring the intersection of language and speech. The model and code are available now on the Hugging Face Hub.

Sources

MisoLabs/MisoTTS
Hugging Face


