Supertone Open-Sources Supertonic 2 Voice Model
The new text-to-speech model from the audio AI company supports English, Korean, and Spanish and comes in the efficient ONNX format for deployment.

Audio AI company Supertone has released Supertonic 2, a new open-source model for generating speech from text. The model stands out for its multilingual capabilities, with English, Korean, and Spanish among the languages supported at launch. This release adds another strong contender to the growing ecosystem of open text-to-speech (TTS) systems.
Unlike many research-focused releases, Supertonic 2 is distributed in the ONNX (Open Neural Network Exchange) format. ONNX models can be executed by a standalone runtime such as ONNX Runtime rather than the framework they were trained in, which makes it easier for developers to integrate and run the model efficiently across different platforms and hardware. The choice signals that the model was designed with practical application in mind.
Developers can find the model and usage instructions on the official Hugging Face repository. Supertonic 2 is released under an OpenRAIL license, a popular choice for AI models that permits commercial use but includes restrictions against certain harmful applications, aligning with responsible AI development practices.
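For readers who want a first look at the files, here is a minimal sketch in Python. It is not Supertone's documented workflow: it only downloads the `.onnx` files from the repository named in the sources and lists each model's input names and shapes, which is a common first step before wiring up inference. The helper names (`download_onnx_files`, `find_onnx_files`, `pick_providers`, `describe_inputs`) are hypothetical, and the sketch assumes the `huggingface_hub` and `onnxruntime` packages are installed and that the repository contains `.onnx` files. The actual synthesis pipeline (text preprocessing, voice selection, vocoding) should be taken from the repository's own usage instructions.

```python
from pathlib import Path
from typing import Iterable, List, Sequence

# Repository id as listed in the article's sources.
REPO_ID = "Supertone/supertonic-2"


def download_onnx_files(repo_id: str = REPO_ID) -> Path:
    """Download only the *.onnx files of a Hub repo and return the local folder."""
    # Imported lazily so the other helpers work without this package installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id=repo_id, allow_patterns=["*.onnx"])
    return Path(local_dir)


def find_onnx_files(root: Path) -> List[Path]:
    """Return every .onnx file below `root`, in sorted order."""
    return sorted(Path(root).rglob("*.onnx"))


def pick_providers(
    available: Iterable[str],
    preferred: Sequence[str] = (
        "CUDAExecutionProvider",
        "CoreMLExecutionProvider",
        "CPUExecutionProvider",
    ),
) -> List[str]:
    """Keep the preferred ONNX Runtime providers that exist on this machine."""
    available = set(available)
    chosen = [name for name in preferred if name in available]
    # Fall back to the CPU provider if nothing in the preference list matched.
    return chosen or ["CPUExecutionProvider"]


def describe_inputs(model_path: Path) -> list:
    """Load one ONNX file and report (name, type, shape) for each input."""
    import onnxruntime as ort

    session = ort.InferenceSession(
        str(model_path),
        providers=pick_providers(ort.get_available_providers()),
    )
    return [(arg.name, arg.type, arg.shape) for arg in session.get_inputs()]


# Example usage (requires network access and both packages):
#   folder = download_onnx_files()
#   for path in find_onnx_files(folder):
#       print(path.name, describe_inputs(path))
```

Selecting execution providers at load time is what lets the same model file run on a GPU where one is available and fall back to the CPU elsewhere, which is the portability the ONNX format is meant to provide.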
The availability of high-quality, multilingual, and deployment-ready TTS models is a critical step for building more accessible and global applications. Supertonic 2 provides a valuable new tool for developers looking to integrate voice into their products without relying on proprietary, closed-source APIs.
Sources
- Supertone/supertonic-2 (Hugging Face)