Fish Audio's S2-Pro Brings Expressive TTS to Open Source

The new text-to-speech model can follow natural language instructions to control tone, clone voices from short clips, and speak multiple languages.

Mar 9, 2026

A new open-source model from a group called Fish Audio is bringing more expressive and controllable speech synthesis to the community. The model, S2-Pro, stands out for its ability to follow instructions. Instead of just converting text to audio, users can guide the output's style, tone, and emotion using natural language prompts like "speak in a gentle voice."

The model also supports zero-shot voice cloning: given a reference sample as short as three seconds, it can replicate a voice it was never specifically trained on. Cloning works across its supported languages (English, Japanese, and Chinese), so a voice sampled in one language can be used to speak another.
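For readers preparing a reference clip, a small pre-flight check can confirm the recording meets the three-second figure quoted above. This is a minimal sketch using only Python's standard library; the function names and the threshold constant are illustrative assumptions, not part of S2-Pro's tooling, and the three-second value is taken from this article rather than from a documented hard limit.

```python
import wave

# Figure quoted in the article; treated here as an assumed minimum, not a documented limit.
MIN_REFERENCE_SECONDS = 3.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Return True when the clip is at least `minimum` seconds long."""
    return clip_duration_seconds(path) >= minimum
```

The check only reads WAV headers, so it says nothing about audio quality or background noise, which may matter just as much for cloning results.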

How it Works

S2-Pro is built on a GPT-VITS architecture, combining a large language model's comprehension with a VITS-based speech synthesis system. The model, weights, and code are all available on the Hugging Face Hub for developers to explore.
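Developers who want to inspect the release locally could fetch it with the `huggingface_hub` client. The sketch below assumes the repository ID listed under Sources (`fishaudio/s2-pro`) and a hypothetical target directory; the download may require accepting the license or logging in on the Hub first, and the snippet does not cover loading or running the model.

```python
from pathlib import Path

# Repository ID as listed in the article's Sources section.
REPO_ID = "fishaudio/s2-pro"


def download_s2_pro(target_dir: str = "checkpoints/s2-pro") -> Path:
    """Download a snapshot of the repository and return its local path.

    `target_dir` is an illustrative default, not a path the project prescribes.
    """
    # Imported lazily so this module loads even when huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(repo_id=REPO_ID, local_dir=target_dir)
    return Path(local_path)
```

Calling `download_s2_pro()` pulls the repository's files into the target directory, which can take a while depending on the size of the weights.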

The model comes with usage constraints. S2-Pro is released under a custom license, the openai-ft-community-license-v0.1, which restricts its use to non-commercial, research, and personal projects. That makes it a useful tool for experimentation but unsuitable for integration into commercial applications.

This release represents a meaningful step for open-source speech generation. By providing fine-grained, instruction-based control over audio output, S2-Pro gives creators and researchers access to a level of expressiveness typically found only in proprietary, API-gated systems.

Sources

fishaudio/s2-pro (Hugging Face)
