What You'll Build
A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on an RTX 4070 Ti Super. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.
Hardware data: RTX 4070 Ti Super (16 GB VRAM) · openaudio-s1-mini runtime fits in ~5 GB VRAM per a TrueNAS deployment that tested it on a 24 GB RTX 3090 and confirmed it loading on a 6 GB RTX A2000 (archy.net, 2026-02-17) · weights ~3.61 GB on disk · See benchmark data
ℹ️ The RTX 4070 Ti Super is heavily over-provisioned for this model. S1 Mini's ~5 GB runtime envelope leaves the 16 GB card with roughly 11 GB of headroom. The model loaded successfully on a 6 GB RTX A2000 per archy.net, so the 4070 Ti Super's value here is not "does it fit" (it fits trivially) but raw inference speed plus the spare VRAM: enough for torch.compile graphs, long prompt batches, multiple concurrent voice presets resident in memory, or a second small model (e.g. an ASR pipeline) on the same card. See "Using the 16 GB headroom" in Troubleshooting.
⚠️ Non-commercial, share-alike weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card (confirmed via the HF API cardData.license field). This is not a permissive license: you may use the weights for research, demos, and personal projects, but the NC clause forbids commercial deployment or paid services, and the SA clause requires that any redistributed derivative carry the same CC-BY-NC-SA-4.0 terms with attribution. For a commercially usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.
ℹ️ Repo recently renamed. The Hugging Face slug fishaudio/openaudio-s1-mini now 307-redirects to fishaudio/s1-mini: same model, same weights, same cc-by-nc-sa-4.0 license. Both the redirecting old URL and the new one work; the install commands below use the path names the official docs expect.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (model loaded on a 6 GB RTX A2000 per archy.net) | RTX 4070 Ti Super (16 GB) |
| RAM | — (no published minimum; allocate ≥ 8 GB system for the Python env + HF cache) | — |
| Storage | ~4 GB | 1.74 GB model.pth + 1.87 GB codec.pth (HF tree API) |
| Software | Python 3.12, PyTorch ≥ 2.5 with CUDA 12.6+, Linux/WSL2 | fish-speech (GitHub main) |
Installation
1. Clone the fish-speech repo
Per the official install guide:
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
2. Create the environment and install dependencies
The official docs offer two install paths (conda and uv) and three CUDA build extras: cu126, cu128, and cu129. The RTX 4070 Ti Super is Ada-class (compute capability sm_89), which every modern PyTorch wheel supports, so any of the three CUDA extras works; the official docs default to cu129:
uv sync --python 3.12 --extra cu129
Conda equivalent (official install guide):
conda create -n fish-speech python=3.12 -y
conda activate fish-speech
pip install -e .[cu129]
If you prefer a wheel from PyTorch's index directly, use torch >= 2.5 from one of the CUDA wheel indexes that match the three extras documented by the official install guide: https://download.pytorch.org/whl/cu126, or the corresponding cu128 or cu129 path. Unlike Blackwell GPUs (sm_120), the RTX 4070 Ti Super needs no special wheel selection: the default pip install torch already includes sm_89 kernels, and Ada is a supported target for both FlashAttention-2 and Triton.
3. Authenticate with Hugging Face
The S1 Mini repo is gated (gated: auto on the HF API) — you must request access and accept the model's terms on the Hugging Face page, then log in locally:
huggingface-cli login
4. Download the weights
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
This pulls ~3.6 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. The size matches: "About 3.5 GB to download" per the archy.net deployment guide, and the live HF tree API reports 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB total.
Equivalent Python form:
python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
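After the download finishes, a quick sanity check can confirm that the files listed above actually landed in the checkpoint directory. The following is a minimal sketch using only the standard library; the helper names `missing_checkpoint_files` and `total_size_gb` are hypothetical and not part of fish-speech, and the expected file list is copied from the download step above.

```python
from pathlib import Path

# File names taken from the download step in this guide.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def missing_checkpoint_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


def total_size_gb(checkpoint_dir):
    """Sum the size of every regular file under checkpoint_dir, in decimal GB."""
    root = Path(checkpoint_dir)
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file()) / 1e9


if __name__ == "__main__":
    ckpt = "checkpoints/openaudio-s1-mini"
    missing = missing_checkpoint_files(ckpt)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print(f"All expected files present, {total_size_gb(ckpt):.2f} GB on disk")
```

A total close to the 3.61 GB quoted above suggests the download completed; a much smaller figure usually points to an interrupted transfer.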
Running
Option A — API server (recommended)
The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:
uv run --python 3.12 python -m tools.api_server \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).
Synthesise a sentence with curl:
curl -X POST "http://127.0.0.1:8080/v1/tts" \
-H "Content-Type: application/json" \
-d '{"text": "Testing one two three."}' \
--output out.wav
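The same request can be issued from Python without extra dependencies. This is a sketch that mirrors the curl call above (same URL, header, and JSON body); the function names and the 120-second timeout are assumptions chosen for illustration, and any request fields beyond `text` are deliberately left out because the curl example does not use them.

```python
import json
import urllib.request


def build_tts_request(text, base_url="http://127.0.0.1:8080"):
    """Build the POST request that the curl example sends to /v1/tts."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="out.wav", base_url="http://127.0.0.1:8080", timeout=120):
    """Send the request and write the response body to out_path.

    Returns the number of bytes written. The timeout is an assumed value.
    """
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the API server from the previous step running, calling `synthesize("Testing one two three.")` should write the response to out.wav in the current directory.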
Option B — Gradio WebUI
For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:
uv run --python 3.12 python -m tools.run_webui \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
Then open http://127.0.0.1:7860.
Optional: --compile for faster repeat inference
The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference." The first call will pay a one-time compile cost (10–30 s), after which subsequent calls run on a fused graph. The RTX 4070 Ti Super's ~11 GB headroom over the ~5 GB inference envelope leaves room for the compiled graph plus any concurrent workload without OOM risk.
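To see the compile cost on your own card, it helps to time the first call separately from the ones that follow. The sketch below is a generic timing helper, not part of fish-speech: `fn` stands for any zero-argument callable that issues one TTS request, and both helper names are hypothetical.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Separate the first (cold) call from the mean of the later (warm) calls."""
    cold, warm = durations[0], durations[1:]
    warm_mean = sum(warm) / len(warm) if warm else None
    return {"cold_s": cold, "warm_mean_s": warm_mean}
```

Run it once against a server started with --compile and once without; if compilation is paying off, the cold call should be noticeably slower than the warm mean in the first case only.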
Results
- Speed: No measurement on the RTX 4070 Ti Super has been published, and our backend has no benchmark for this pair yet (/check/openaudio-s1-mini/rtx-4070-ti-super returns verdict: unknown). As external reference points only, an independent TrueNAS deployment (archy.net, Feb 2026) measured 4.85 s (≈22 tok/s) to synthesise the "Testing one two three." sentence on a 24 GB RTX 3090, and 9.35 s (≈8 tok/s) on a 6 GB RTX A2000. Both reference cards are Ampere parts; the RTX 4070 Ti Super (Ada, sm_89, 16 GB GDDR6X, 256-bit, ~672 GB/s bandwidth) is one generation newer and should land in the same low-single-digit-seconds range for a short sentence on the first call, and faster once --compile is warm. These are not RTX 4070 Ti Super figures; submit a measured run via /contribute to seed /check/.
- VRAM usage: ~5 GB during inference, measured on a 24 GB RTX 3090 and confirmed by the model loading on a 6 GB RTX A2000 (archy.net). The article states verbatim: "The model needs about 5GB of VRAM." This is consistent with the envelope derived from the live HF tree API: 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB on disk, plus activations. On the RTX 4070 Ti Super's 16 GB budget that leaves ~11 GB free for the OS desktop, browser, --compile graph, batched generations, or a co-resident small model.
- Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (whispering) are honoured per the HF model card. S1 Mini is a distillation of the larger S1, so quality is close but not identical.
For the full benchmark data, see /check/openaudio-s1-mini/rtx-4070-ti-super.
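The headroom figures used throughout this guide come from simple arithmetic, which the following sketch spells out. All constants are the numbers quoted above (none were measured on the RTX 4070 Ti Super), and the helper names are hypothetical.

```python
# Figures quoted in this guide; none are measured on the RTX 4070 Ti Super.
MODEL_PTH_GB = 1.74     # model.pth size per the HF tree API
CODEC_PTH_GB = 1.87     # codec.pth size per the HF tree API
RUNTIME_VRAM_GB = 5.0   # "about 5GB" per archy.net
CARD_VRAM_GB = 16.0     # RTX 4070 Ti Super


def weights_on_disk_gb():
    """Combined size of the two weight files."""
    return round(MODEL_PTH_GB + CODEC_PTH_GB, 2)


def headroom_gb(card_gb=CARD_VRAM_GB, runtime_gb=RUNTIME_VRAM_GB):
    """VRAM left over once the model's runtime envelope is resident."""
    return card_gb - runtime_gb


def fits_alongside(extra_gb, card_gb=CARD_VRAM_GB, runtime_gb=RUNTIME_VRAM_GB):
    """True if an extra workload of extra_gb fits in the remaining headroom."""
    return extra_gb <= headroom_gb(card_gb, runtime_gb)


if __name__ == "__main__":
    print(f"Weights on disk: {weights_on_disk_gb()} GB")
    print(f"Headroom on a 16 GB card: {headroom_gb()} GB")
```

Treat the result as a planning estimate: the desktop, browser, and CUDA context also consume VRAM, so the usable headroom is somewhat lower than the raw subtraction suggests.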
Troubleshooting
OSError: You are trying to access a gated repo
You haven't requested access / accepted the model's terms, or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos. (An unauthenticated download of a gated repo returns HTTP 401 — that's expected, not a broken link.)
Codec config name confusion
If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference will fail with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.
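If launch commands live in scripts or notes, a small pre-flight check can catch the stale codec name before the server starts. This is a sketch under stated assumptions: `check_command` is a hypothetical helper, it only inspects the space-separated flag form shown in this guide, and it knows nothing about how fish-speech itself parses arguments.

```python
import shlex

REQUIRED_DECODER_CONFIG = "modded_dac_vq"


def decoder_config_name(command):
    """Return the value following --decoder-config-name, or None if the flag is absent."""
    # Join backslash-newline continuations so multi-line commands tokenize cleanly.
    args = shlex.split(command.replace("\\\n", " "))
    for i, arg in enumerate(args):
        if arg == "--decoder-config-name" and i + 1 < len(args):
            return args[i + 1]
    return None


def check_command(command):
    """Return 'ok' or a short description of the problem with the launch command."""
    name = decoder_config_name(command)
    if name is None:
        return "missing --decoder-config-name"
    if name != REQUIRED_DECODER_CONFIG:
        return f"found {name!r}, expected {REQUIRED_DECODER_CONFIG!r}"
    return "ok"
```

For example, passing a command that still contains firefly_gan_vq returns a message naming both the stale value and the expected one.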
Out of memory on first generation
The model runs in ~5 GB, but enabling --compile or sending very large prompt batches can push the working set higher. On the RTX 4070 Ti Super the model alone is very unlikely to exhaust memory, since you have ~11 GB of headroom, but it can still bite if you stack additional models on the same card. The first thing to try is dropping --compile. The official docs also describe a --half flag for "GPUs without bf16 support", which switches to fp16. The RTX 4070 Ti Super (Ada) supports bf16 natively, so --half is normally unnecessary; because fp16 and bf16 are both 16-bit formats, it is also unlikely to free much memory when juggling multiple co-resident models.
pip install -e .[cu129] fails to resolve a torch wheel
The cu129 extra needs a recent pip and an up-to-date package index to resolve. If resolution stalls, switch to cu128 or cu126; all three include sm_89 kernels and are functionally equivalent for the RTX 4070 Ti Super. Re-run:
pip install -e .[cu126]
Using the 16 GB headroom
Because S1 Mini only needs ~5 GB at runtime, the RTX 4070 Ti Super is comfortable as a multi-purpose audio card. Common patterns: keep the S1 Mini API server running on port 8080 while a separate ASR or VAD model (e.g. faster-whisper, Silero) loads in another process, or pre-warm --compile for several speakers and batch larger prompt windows. The RTX 4070 Ti Super's PCIe Gen4 x16 link also makes any CPU↔GPU offload smoother than on narrower-bus cards. Monitor headroom with nvidia-smi --query-gpu=memory.used --format=csv -l 1 during your workload to confirm the working set stays well under the 16 GB budget.
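For scripted monitoring, the same nvidia-smi query can be parsed in Python. The sketch below swaps in `--format=csv,noheader,nounits` so each output line is a bare number of MiB; the helper names are hypothetical, and the 16 × 1024 MiB total is the nominal card size assumed for illustration (the driver typically reports slightly less).

```python
import subprocess


def parse_memory_used_mib(csv_text):
    """Parse noheader/nounits nvidia-smi output: one integer (MiB) per GPU line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used_mib():
    """Run nvidia-smi once and return used memory per GPU in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used_mib(result.stdout)


def headroom_mib(used_mib, total_mib=16 * 1024):
    """Remaining VRAM in MiB, assuming a nominal 16 GiB card."""
    return total_mib - used_mib
```

Calling `headroom_mib(query_memory_used_mib()[0])` in a loop while the API server and a second model are both loaded gives a running view of how much of the budget is left.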