What You'll Build
A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on an RTX 5060. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.
Hardware data: RTX 5060 (8 GB GDDR7, GB206-250 Blackwell, ~3840 CUDA cores, 128-bit ~448 GB/s, ~145 W, PCIe Gen5 x8 per Wikipedia "GeForce 50 series" and the ASUS Dual RTX 5060 O8G techspec) · openaudio-s1-mini runtime fits in ~5 GB VRAM per a TrueNAS deployment that tested it on both a 24 GB RTX 3090 and a 6 GB RTX A2000 (archy.net, 2026-02-17) · weights ~3.61 GB on disk · See benchmark data
ℹ️ Comfortable fit, not a squeeze. S1 Mini's ~5 GB runtime envelope leaves the RTX 5060's 8 GB budget with roughly 3 GB of headroom, enough for torch.compile graphs, longer prompt batches, or a small concurrent workload. The model loaded successfully on a 6 GB RTX A2000 per archy.net, so the 5060 is over-provisioned for inference.
⚠️ Blackwell + PyTorch wheels. The RTX 5060 is a Blackwell card (compute capability sm_120). Default pip install torch ships only sm_50…sm_90 kernels and will crash with "NVIDIA GeForce RTX 5060 with CUDA capability sm_120 is not compatible with the current PyTorch installation". The same trap was reported and reproduced on the HF discussions tab (RTX 5070 Ti, sm_120) and in upstream fish-speech issue #1126 (RTX 5090, sm_120). The fix is to pick the cu128 or cu129 extra at install time, since both ship sm_120 kernels, instead of letting uv/pip resolve a stale wheel.
⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card: you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. The fish-speech codebase itself ships under the "Fish Audio Research License" (repo LICENSE). For a commercially-usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (model ran on a 6 GB RTX A2000 per archy.net) | RTX 5060 (8 GB) |
| RAM | — (no published minimum; allocate ≥ 8 GB system for the Python env + HF cache) | — |
| Storage | ~4 GB | 1.74 GB model.pth + 1.87 GB codec.pth (HF Files tab) |
| Software | Python 3.12, PyTorch ≥ 2.7 with CUDA 12.8 or 12.9 (sm_120 kernels), Linux/WSL2 | fish-speech (GitHub main) |
Installation
1. Clone the fish-speech repo
Per the official install guide:
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
2. Create the environment and install dependencies
The official docs offer two paths — conda and uv. On Blackwell you must pull a PyTorch build with sm_120 kernels, so use the cu128 or cu129 extra (not the default resolution, which can land a stale sm_50…sm_90-only wheel):
uv sync --python 3.12 --extra cu129
Conda equivalent (official install guide):
conda create -n fish-speech python=3.12 -y
conda activate fish-speech
pip install -e ".[cu129]"
If you previously installed PyTorch from a different index and want to be explicit, the Blackwell-tested combination is torch >= 2.7 from https://download.pytorch.org/whl/cu128 — confirmed working on RTX 5070 (sm_120) by an independent WSL2 install walkthrough. Unlike Ada GPUs (sm_89), Blackwell (sm_120) is not covered by the kernels in a default pip install torch, so the CUDA extra selection is mandatory, not optional.
3. Authenticate with Hugging Face
The S1 Mini repo is gated — you must accept the model's terms on the Hugging Face page and log in locally:
huggingface-cli login
4. Download the weights
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
This pulls ~3.6 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. Reported download size matches: "About 3.5 GB to download" per the archy.net deployment guide.
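A quick way to confirm the download completed is to check that each listed file is present before starting the server. The following is a minimal sketch using only the standard library; the helper names missing_checkpoint_files and total_size_gb are hypothetical, and the file list is taken from the paragraph above rather than from any manifest shipped with the model.

```python
from pathlib import Path

# File names as listed in the download step of this guide.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def total_size_gb(checkpoint_dir):
    """Sum the sizes of the expected files that exist, in decimal gigabytes."""
    root = Path(checkpoint_dir)
    sizes = [
        (root / name).stat().st_size
        for name in EXPECTED_FILES
        if (root / name).is_file()
    ]
    return sum(sizes) / 1e9
```

Calling missing_checkpoint_files("checkpoints/openaudio-s1-mini") from the repo root should return an empty list after a complete download, and total_size_gb should land near the ~3.6 GB figure quoted above.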
Equivalent Python form:
python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
Running
Option A — API server (recommended)
The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:
uv run --python 3.12 python -m tools.api_server \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
The --decoder-config-name modded_dac_vq is mandatory for S1 Mini — it pins the codec architecture variant the distilled model was trained against (official docs).
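The server has to load the checkpoints before it accepts connections, so a script that launches it and immediately sends a request may see a connection error. The sketch below polls the TCP port until it accepts a connection; wait_for_port is a hypothetical helper, the default port matches the --listen value above, and the timeout values are arbitrary assumptions rather than measured load times.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8080, timeout_s=120.0, interval_s=1.0):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only shows the socket is listening,
            # not that the model has finished loading.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A True result means something is listening on the port; it does not by itself prove that the first synthesis request will succeed.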
Synthesise a sentence with curl:
curl -X POST "http://127.0.0.1:8080/v1/tts" \
-H "Content-Type: application/json" \
-d '{"text": "Testing one two three."}' \
--output out.wav
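The same request can be sent from Python without extra dependencies. This is a minimal sketch that mirrors the curl call (POST to /v1/tts with a JSON body containing only "text"); the function names build_tts_request and synthesise are hypothetical, and the 120 s timeout is an assumption, not a documented server limit.

```python
import json
import urllib.request


def build_tts_request(text, base_url="http://127.0.0.1:8080"):
    """Build the same POST request as the curl command: JSON body to /v1/tts."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesise(text, out_path="out.wav", base_url="http://127.0.0.1:8080", timeout_s=120):
    """Send the request and write the response bytes to out_path."""
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)  # number of bytes written
```

With the server running, synthesise("Testing one two three.") should write out.wav in the current directory, matching the curl example.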
Option B — Gradio WebUI
For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:
uv run --python 3.12 python -m tools.run_webui \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
Then open http://127.0.0.1:7860.
Optional: --compile for faster repeat inference
The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference". The first call will pay a one-time compile cost (10–30 s), after which subsequent calls run on a fused graph. The 5060's ~3 GB headroom over the ~5 GB inference envelope leaves room for the compiled graph without OOM risk.
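To see how much of the first call is one-time compile cost on your own card, time several identical requests and compare the first against the rest. The sketch below is a generic wall-clock timer; time_calls and warmup_overhead are hypothetical helpers, and fn stands for whatever callable sends your request.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def warmup_overhead(durations):
    """First-call duration minus the mean of the remaining calls."""
    if len(durations) < 2:
        raise ValueError("need at least two timed calls")
    steady = sum(durations[1:]) / (len(durations) - 1)
    return durations[0] - steady
```

For example, durations of 12.0 s, 2.0 s, and 4.0 s give a steady-state mean of 3.0 s and a warm-up overhead of 9.0 s. Keep the input text identical across calls so the comparison is fair.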
Results
- Speed: No measurement on RTX 5060 has been published. As reference points, an independent TrueNAS deployment (archy.net, Feb 2026) measured 4.85 s to synthesise the "Testing one two three." sentence on an RTX 3090 and 9.35 s on a 6 GB RTX A2000. The RTX 5060 sits between these in raw compute, closer to the A2000 class than the 3090, so expect end-to-end latency on the order of several seconds per short sentence. Submit a measured run via /contribute to seed /check/.
- VRAM usage: ~5 GB during inference, measured on RTX 3090 and confirmed by the model loading on a 6 GB RTX A2000 (archy.net). The article states verbatim: "The model needs about 5GB of VRAM." This is consistent with the derived envelope from the HF Files tab: 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB on disk, plus activations. On the 5060's 8 GB budget, that leaves ~3 GB free for the OS desktop, browser, or a --compile graph.
- Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (in a hurry tone) are honoured per the HF model card. S1 Mini is a distillation of the larger 4 B S1; quality is close but not identical, and the model card publishes per-language WER/CER tables for comparison.
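The VRAM arithmetic in the list above can be written out explicitly. This is only a restatement of the figures quoted in this guide; the "runtime overhead" line assumes the weights occupy the same space in VRAM as on disk, which is an approximation.

```python
# Figures quoted in this guide; all values in GB.
MODEL_PTH_GB = 1.74
CODEC_PTH_GB = 1.87
RUNTIME_VRAM_GB = 5.0  # "about 5GB" per archy.net
CARD_VRAM_GB = 8.0     # RTX 5060

weights_on_disk = MODEL_PTH_GB + CODEC_PTH_GB
# Assumption: weights take the same space in VRAM as on disk.
runtime_overhead = RUNTIME_VRAM_GB - weights_on_disk
headroom = CARD_VRAM_GB - RUNTIME_VRAM_GB

print(f"weights on disk:  {weights_on_disk:.2f} GB")
print(f"runtime overhead: {runtime_overhead:.2f} GB")
print(f"headroom on card: {headroom:.2f} GB")
```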
For the full benchmark data, see /check/openaudio-s1-mini/rtx-5060.
Troubleshooting
CUDA error: no kernel image is available for execution on the device / sm_120 is not compatible
Your PyTorch wheel doesn't ship sm_120 kernels. This is the single most common Blackwell failure for this model — reproduced on RTX 5070 Ti (HF discussion #19) and RTX 5090 (fish-speech#1126), both sm_120 like the 5060. Re-install with a recent CUDA 12.8 or 12.9 wheel:
pip install --upgrade --force-reinstall torch torchaudio --index-url https://download.pytorch.org/whl/cu128
or re-run uv sync --python 3.12 --extra cu129 from a clean .venv. Verify with:
python -c "import torch; print(torch.cuda.get_device_capability())"
# Expect: (12, 0)
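The device capability alone does not tell you whether the installed wheel carries matching kernels. A slightly fuller check compares the capability against the architectures the PyTorch build was compiled for. The sketch below uses torch.cuda.get_arch_list() and torch.cuda.get_device_capability(); the helper names are hypothetical, and the exact-match test is a simplification that ignores any PTX (compute_XX) entries in the list.

```python
def capability_to_arch(capability):
    """Convert a (major, minor) tuple such as (12, 0) to an arch string such as 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(arch_list, capability):
    """True if the build's arch list contains an exact sm_XY match for the device."""
    return capability_to_arch(capability) in arch_list


def check_installed_torch():
    """Run the check against the installed PyTorch and the first visible GPU."""
    import torch  # imported lazily so the helpers above work without torch

    arch_list = torch.cuda.get_arch_list()
    capability = torch.cuda.get_device_capability()
    return build_supports(arch_list, capability)
```

On a working Blackwell install, check_installed_torch() should return True; a False result points at the wheel rather than the driver.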
OSError: You are trying to access a gated repo
You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos.
Codec config name confusion
If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference will fail with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.
Out of memory on first generation
The model runs in ~5 GB but enabling --compile or sending very large prompt batches can push the working set higher. On the 5060 (8 GB) this is rarely an issue, but if you see OOM during a long generation, drop --compile and try the --half flag — flagged by the official docs as a fallback for GPUs lacking native bf16. The RTX 5060 (Blackwell) does support bf16 natively, so --half is normally unnecessary; reach for it only as a last-resort memory-reduction switch.
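If you launch the server from a script, toggling these flags is easier when the command line is built in one place. The sketch below assembles the same arguments used in the Running section, with --compile and --half as opt-in switches; api_server_argv is a hypothetical helper, and it assumes the script runs from the fish-speech repo root inside the environment created earlier.

```python
import sys


def api_server_argv(checkpoint_dir="checkpoints/openaudio-s1-mini",
                    listen="0.0.0.0:8080",
                    compile_graph=False,
                    half=False):
    """Build the argv list for tools.api_server with optional --compile / --half."""
    argv = [
        sys.executable, "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", checkpoint_dir,
        "--decoder-checkpoint-path", f"{checkpoint_dir}/codec.pth",
        "--decoder-config-name", "modded_dac_vq",
    ]
    if compile_graph:
        argv.append("--compile")
    if half:
        argv.append("--half")
    return argv
```

For the OOM fallback described above, subprocess.run(api_server_argv(half=True), check=True) would start the server without --compile and with --half.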