Fish Audio's S2-Pro Brings Expressive TTS to Open Source
The new text-to-speech model can follow natural language instructions to control tone, clone voices from short clips, and speak multiple languages.

A new open-source model from a group called Fish Audio is bringing more expressive and controllable speech synthesis to the community. The model, called S2-Pro, stands out for its ability to follow instructions. Instead of just converting text to audio, it lets users guide the output's style, tone, and emotion with natural language prompts such as "speak in a gentle voice."
The model also supports voice cloning. It can perform zero-shot cloning from a sample as short as three seconds, meaning it can replicate a voice without being specifically trained on it. This works across its supported languages of English, Japanese, and Chinese, so a voice sampled in one language can be used to speak another.
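As a rough illustration of the three-second figure, the sketch below checks whether a reference clip is long enough before it is handed to a cloning pipeline. It uses only Python's standard `wave` module and does not call S2-Pro itself. The helper names and the choice to treat three seconds as a hard minimum are assumptions made for this example, not part of the model's documented interface.

```python
import wave

# "As short as three seconds" comes from the article; using it as a
# strict lower bound is an assumption made for this sketch.
MIN_REFERENCE_SECONDS = 3.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, min_seconds: float = MIN_REFERENCE_SECONDS) -> bool:
    """Hypothetical pre-check: is the clip at least `min_seconds` long?"""
    return clip_duration_seconds(path) >= min_seconds
```

A caller might run `is_usable_reference("speaker.wav")` and ask for a longer recording when it returns `False`. Only uncompressed WAV input is handled here; other formats would need a different reader.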
How It Works
S2-Pro is built on a GPT-VITS architecture, combining a large language model's comprehension with a VITS-based speech synthesis system. The model, weights, and code are all available on the Hugging Face Hub for developers to explore.
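For readers who want to look at the files, here is a minimal sketch of fetching the repository with the `huggingface_hub` package (installed separately with `pip install huggingface_hub`). The repo ID `fishaudio/s2-pro` is taken from this article's source link. The local directory name and the helper functions are assumptions for illustration, and the repository may require accepting its license or logging in before a download succeeds.

```python
from pathlib import Path

# Repo ID as listed in the article's Sources section.
REPO_ID = "fishaudio/s2-pro"


def hub_url(repo_id: str = REPO_ID) -> str:
    """Build the Hugging Face Hub page URL for a model repo."""
    return f"https://huggingface.co/{repo_id}"


def download_model(repo_id: str = REPO_ID, local_dir: str = "s2-pro") -> Path:
    """Download every file in the repo and return the local folder.

    `local_dir` is a hypothetical folder name chosen for this sketch.
    """
    # Imported lazily so the module loads even without the package installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=local_dir))
```

Calling `download_model()` pulls the weights and code into `./s2-pro`. How to run inference from there depends on the instructions in the repository itself, which this sketch does not cover.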
The model's license limits how it can be used. S2-Pro is released under a custom license, the openai-ft-community-license-v0.1, which restricts its use to non-commercial, research, and personal projects. That makes it a valuable tool for experimentation but rules out integration into commercial applications.
This release represents a meaningful step for open-source speech generation. By providing fine-grained, instruction-based control over audio output, S2-Pro gives creators and researchers access to a level of expressiveness typically found only in proprietary, API-gated systems.
Sources
- fishaudio/s2-pro (Hugging Face)