OmniVoice TTS Offers Zero-Shot Multilingual Voice Cloning
A new open-source text-to-speech model from the k2-fsa project can replicate a voice and generate speech in multiple languages from a single short audio sample.
The team behind the k2-fsa speech recognition toolkit has released OmniVoice, an open-source text-to-speech model. Licensed under Apache 2.0, it is designed for high-quality, multilingual voice generation from minimal user input.
The system's core feature is zero-shot voice cloning. Given just a three-second audio clip of a target speaker, OmniVoice can replicate the voice and use it to generate new speech. The cloning also works across languages: a user can provide an English voice sample and generate speech in Chinese, Spanish, or other supported languages without any additional training on the target speaker.
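As a rough illustration of the input side of this workflow, the sketch below trims a longer recording down to a three-second reference clip using Python's standard `wave` module. It does not call OmniVoice itself, whose API is not described here; the function name `trim_reference_clip` and the assumption that the recording is an uncompressed PCM WAV file are hypothetical choices made for the example.

```python
import wave


def trim_reference_clip(src_path: str, dst_path: str, seconds: float = 3.0) -> float:
    """Copy the first `seconds` of a PCM WAV file to dst_path.

    Hypothetical helper for illustration only; returns the duration
    (in seconds) of the clip that was actually written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Number of frames that corresponds to the requested duration.
        max_frames = int(seconds * src.getframerate())
        # readframes returns fewer frames if the file is shorter.
        frames = src.readframes(max_frames)

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    with wave.open(dst_path, "rb") as check:
        return check.getnframes() / check.getframerate()
```

If the source recording is shorter than the requested length, the helper simply writes what is available and reports the shorter duration.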
Beyond simple cloning, OmniVoice also provides tools for "voice design." By supplying a secondary audio recording as a style reference, users can transfer prosody, rhythm, and emotion to the synthesized output. This enables more granular control over the performance of the generated voice.
OmniVoice lowers the barrier for creating custom, expressive synthetic voices for applications ranging from accessibility tools to content creation. Its ability to separate voice characteristics from language and style provides a flexible foundation for developers and researchers. The model and usage examples are available on Hugging Face.
Sources
- k2-fsa/OmniVoice (Hugging Face)