k2-fsa · Text → Speech

OmniVoice TTS Offers Zero-Shot Multilingual Voice Cloning

A new open-source text-to-speech model from the k2-fsa project can replicate a voice and generate speech in multiple languages from a single short audio sample.

Mar 30, 2026


The team behind the k2-fsa speech recognition toolkit has released OmniVoice, a new open-source model for text-to-speech synthesis. Released under an Apache 2.0 license, the model is designed for high-quality, multilingual voice generation from minimal user input.

The system's core feature is zero-shot voice cloning. From a three-second audio clip of a target speaker, OmniVoice can replicate the voice and use it to generate new speech. Cloning works across languages: a user can provide an English voice sample and generate speech in Chinese, Spanish, or another supported language without any additional training.
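For readers who want to try this, the reference clip has to be prepared first. The following is a minimal sketch, using only Python's standard `wave` module, of cutting a longer recording down to the three seconds mentioned above. It does not call OmniVoice. The function names `trim_reference_clip`, `make_silence`, and `duration_seconds` are hypothetical helpers for illustration, and the input is assumed to be uncompressed PCM WAV.

```python
import io
import wave


def trim_reference_clip(wav_bytes: bytes, seconds: float = 3.0) -> bytes:
    """Return a WAV holding at most the first `seconds` of the input audio."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        # Never ask for more frames than the recording actually contains.
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        # Copy channel count, sample width, and sample rate from the source;
        # the frame count in the header is corrected when the writer closes.
        dst.setparams(params)
        dst.writeframes(frames)
    return out.getvalue()


def make_silence(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only as stand-in input."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return out.getvalue()


def duration_seconds(wav_bytes: bytes) -> float:
    """Length of a WAV in seconds, computed from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


if __name__ == "__main__":
    clip = trim_reference_clip(make_silence(5.0))
    print(f"reference clip length: {duration_seconds(clip):.1f} s")
```

In practice the stand-in silence would be replaced by a real recording of the target speaker, and the trimmed clip would be passed to the model as described in the project's own usage examples.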

Beyond cloning, OmniVoice also provides tools for "voice design." By supplying a second audio recording as a style reference, users can transfer its prosody, rhythm, and emotion to the synthesized output. This gives finer control over how the generated voice delivers a line.

OmniVoice lowers the barrier for creating custom, expressive synthetic voices for applications ranging from accessibility tools to content creation. Its ability to separate voice characteristics from language and style provides a flexible foundation for developers and researchers. The model and usage examples are available on Hugging Face.

Sources

k2-fsa/OmniVoice (Hugging Face)
