
OpenAudio S1 Mini on RTX 4060 Ti 16GB: 13-Language Distilled TTS in ~5 GB VRAM

tts · intermediate · 5GB+ VRAM · May 21, 2026
Prerequisites
  • NVIDIA RTX 4060 Ti 16GB or any CUDA GPU with ≥ 6 GB VRAM
  • Python 3.12 (fresh conda or `uv` environment recommended)
  • Linux or WSL2 (per the official install matrix)
  • Hugging Face account — the `fishaudio/openaudio-s1-mini` repo is gated and requires `huggingface-cli login`
  • ~4 GB free disk for weights (1.74 GB `model.pth` + 1.87 GB `codec.pth`)

What You'll Build

A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on an RTX 4060 Ti 16GB. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.

Hardware data: RTX 4060 Ti 16GB · openaudio-s1-mini runtime fits in ~5 GB VRAM per a TrueNAS deployment that tested it on both a 24 GB RTX 3090 and a 6 GB RTX A2000 (archy.net, 2026-02-17) · weights ~3.61 GB on disk · See benchmark data

ℹ️ The 8 GB → 16 GB tier upgrade is the point. S1 Mini's ~5 GB runtime envelope leaves the RTX 4060 Ti 16GB with roughly 11 GB of headroom — far more than the RTX 4060 8 GB sibling recipe (~3 GB headroom) and enough for torch.compile graphs, long prompt batches, multiple concurrent voice presets resident in memory, or running a second small model (e.g. an ASR pipeline) on the same card. The model loaded successfully on a 6 GB RTX A2000 per archy.net, so the 4060 Ti 16GB is heavily over-provisioned for inference.
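The headroom figures in the note above are plain subtraction. A minimal sketch, assuming the ~5 GB runtime figure reported by archy.net and each card's nominal capacity; the function name `vram_headroom_gb` is illustrative and not part of fish-speech:

```python
def vram_headroom_gb(card_gb: float, runtime_gb: float = 5.0) -> float:
    """Nominal VRAM left over once the model's runtime footprint is resident."""
    if runtime_gb > card_gb:
        raise ValueError("model does not fit on this card")
    return card_gb - runtime_gb


# Assumption: ~5 GB runtime envelope (archy.net figure), nominal card sizes.
for name, capacity in [("RTX 4060 8 GB", 8), ("RTX 4060 Ti 16GB", 16)]:
    print(f"{name}: ~{vram_headroom_gb(capacity):.0f} GB headroom")
```

Usable VRAM is typically a little below the nominal figure once the desktop and driver have taken their share, so treat these numbers as upper bounds.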

⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card — you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. The fish-speech codebase itself ships under the "Fish Audio Research License" (repo LICENSE). For a commercially-usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.

Requirements

Component | Minimum | Tested
GPU | 6 GB VRAM (model ran on a 6 GB RTX A2000 per archy.net) | RTX 4060 Ti 16GB
RAM | — (no published minimum; allocate ≥ 8 GB system for the Python env + HF cache) |
Storage | ~4 GB | 1.74 GB model.pth + 1.87 GB codec.pth (HF Files tab)
Software | Python 3.12, PyTorch ≥ 2.5 with CUDA 12.6+, Linux/WSL2 | fish-speech (GitHub main)

Installation

1. Clone the fish-speech repo

Per the official install guide:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

2. Create the environment and install dependencies

The official docs offer two paths — conda and uv — and three CUDA build extras: cu126, cu128, cu129. The RTX 4060 Ti 16GB is Ada-class (compute capability sm_89), which is supported by every modern PyTorch wheel, so any of the three CUDA extras work; the official docs default to cu129:

uv sync --python 3.12 --extra cu129

Conda equivalent (official install guide):

conda create -n fish-speech python=3.12 -y
conda activate fish-speech
pip install -e ".[cu129]"

If you prefer to install a wheel from PyTorch's index directly, the Ada-tested combination is torch >= 2.5 from https://download.pytorch.org/whl/cu126, cu128, or cu129, matching the three CUDA build extras documented by the official install guide. Unlike Blackwell GPUs (sm_120), the 4060 Ti 16GB needs no special wheel selection: the default pip install torch wheels run on Ada (sm_89), including PyTorch's FlashAttention-2 attention path and Triton-backed torch.compile.
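To confirm that the installed wheel actually sees the card, you can query its compute capability. The sketch below uses `torch.cuda.get_device_capability`, which returns a `(major, minor)` tuple; the helper names `describe_gpu` and `detect_capability` are illustrative, and the bf16 rule (native support from sm_80 onward) is stated here as an assumption of this sketch rather than something the fish-speech docs spell out:

```python
def describe_gpu(capability: tuple[int, int]) -> dict:
    """Summarise a CUDA compute capability such as (8, 9) for Ada."""
    major, minor = capability
    return {
        "arch": f"sm_{major}{minor}",
        # Assumption of this sketch: native bf16 from Ampere (sm_80) onward.
        "native_bf16": capability >= (8, 0),
    }


def detect_capability():
    """Return (major, minor) for GPU 0, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

On an RTX 4060 Ti, `detect_capability()` should return `(8, 9)`; a result of `None` usually means a CPU-only wheel was installed or the driver is not visible inside WSL2.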

3. Authenticate with Hugging Face

The S1 Mini repo is gated — you must accept the model's terms on the Hugging Face page and log in locally:

huggingface-cli login

4. Download the weights

hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

This pulls ~3.6 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. The archy.net deployment guide reports a similar figure: "About 3.5 GB to download".
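A quick sanity check after the download can catch a partial or interrupted fetch before the server fails at load time. This is a minimal sketch that only tests for the five file names listed above and sums their sizes; `check_checkpoint` is a hypothetical helper, not part of fish-speech:

```python
from pathlib import Path

# File names taken from the download step above.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def check_checkpoint(directory: str) -> tuple[list[str], float]:
    """Return (missing file names, total size in GB of the files that exist)."""
    root = Path(directory)
    present = [name for name in EXPECTED_FILES if (root / name).is_file()]
    missing = [name for name in EXPECTED_FILES if name not in present]
    total_bytes = sum((root / name).stat().st_size for name in present)
    return missing, total_bytes / 1e9
```

Calling `check_checkpoint("checkpoints/openaudio-s1-mini")` on a complete download should return an empty list and a total of roughly 3.6.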

Equivalent Python form:

python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"

Running

Option A — API server (recommended)

The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:

uv run --python 3.12 python -m tools.api_server \
  --listen 0.0.0.0:8080 \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).
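The server takes a while to load weights before it accepts connections, so scripted clients are better off waiting for the port than sleeping for a fixed interval. A minimal sketch using only the standard library; `wait_for_port` is an illustrative helper, and an open port only shows that something is listening, not that the model has finished warming up:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0,
                  interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Nothing is listening yet (or the connection timed out); retry.
            time.sleep(interval_s)
    return False
```

With the launch command above, `wait_for_port("127.0.0.1", 8080)` returns `True` once the API server is listening.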

Synthesise a sentence with curl:

curl -X POST "http://127.0.0.1:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing one two three."}' \
  --output out.wav
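The same request can be issued from Python with the standard library. This sketch mirrors the curl call exactly (same URL, header, and single "text" field) and assumes nothing about other request parameters; `build_tts_request` and `synthesise` are illustrative names:

```python
import json
import urllib.request


def build_tts_request(text: str,
                      base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesise(text: str, out_path: str = "out.wav", timeout_s: float = 120.0) -> int:
    """Send the request, write the response body to out_path, return its size in bytes."""
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout_s) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)
```

With the server running, `synthesise("Testing one two three.")` writes out.wav to the current directory. The generous timeout allows for the multi-second latencies reported in the Results section.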

Option B — Gradio WebUI

For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:

uv run --python 3.12 python -m tools.run_webui \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

Then open http://127.0.0.1:7860.

Optional: --compile for faster repeat inference

The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference". The first call will pay a one-time compile cost (10–30 s), after which subsequent calls run on a fused graph. The 4060 Ti 16GB's ~11 GB headroom over the ~5 GB inference envelope leaves room for the compiled graph plus any concurrent workload without OOM risk.

Results

  • Speed: No measurement on RTX 4060 Ti 16GB has been published. As reference points, an independent TrueNAS deployment (archy.net, Feb 2026) measured 4.85 s to synthesise the "Testing one two three." sentence on an RTX 3090 and 9.35 s on a 6 GB RTX A2000. The RTX 4060 Ti 16GB sits between these in raw compute (Ada sm_89, 22 TFLOPS FP16, 288 GB/s bandwidth) — its bandwidth profile is closer to the A2000 than the 3090, so expect end-to-end latency on the order of several seconds per short sentence rather than near-real-time on the first call. Submit a measured run via /contribute to seed /check/.
  • VRAM usage: ~5 GB during inference, measured on an RTX 3090 and corroborated by the model loading on a 6 GB RTX A2000 (archy.net). The article states verbatim: "The model needs about 5GB of VRAM." This is consistent with the on-disk weights listed in the HF Files tab (1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB) plus activations. On the 4060 Ti 16GB, that leaves ~11 GB free for the OS desktop, browser, --compile graph, batched generations, or a co-resident small model.
  • Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (in a hurry tone) are honoured per the HF model card. S1 Mini is a distillation of the larger 4 B S1 — quality is close but not identical; the model card publishes per-language WER/CER tables for comparison.
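Since no RTX 4060 Ti 16GB timing has been published, a small harness makes it easy to produce one. The sketch below reports the first call separately from the rest, because the first call may include model warm-up or the one-time --compile cost; `time_calls` is a hypothetical helper, and the callable you pass in (for example, one that posts the test sentence to the API server) is up to you:

```python
import statistics
import time


def time_calls(fn, runs: int = 5) -> dict:
    """Time `runs` calls of fn; report the first call and the median of the rest."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    rest = durations[1:]
    return {
        "first_s": durations[0],
        "median_rest_s": statistics.median(rest) if rest else None,
    }
```

Reporting both numbers keeps a measurement comparable with the archy.net figures, which are per-sentence wall-clock times.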

For the full benchmark data, see /check/openaudio-s1-mini/rtx-4060-ti-16gb.

Troubleshooting

OSError: You are trying to access a gated repo

You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos.

Codec config name confusion

If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference will fail with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.

Out of memory on first generation

The model runs in ~5 GB, but enabling --compile or sending very large prompt batches can push the working set higher. On the 4060 Ti 16GB, with ~11 GB of headroom, the model alone is very unlikely to run out of memory; OOM errors are far more likely when additional models share the card. If that happens, drop --compile first, then try the --half flag, which the official docs describe as a fallback for GPUs lacking native bf16. The RTX 4060 Ti 16GB (Ada) supports bf16 natively, so --half is normally unnecessary; reach for it only when juggling multiple co-resident models.

pip install -e .[cu129] fails to resolve a torch wheel

The cu129 extra needs a recent pip and an index that already carries matching cu129 torch wheels. If resolution stalls, switch to cu128 or cu126; all three builds run on sm_89 and are functionally equivalent for the RTX 4060 Ti 16GB. Re-run:

pip install -e ".[cu126]"

Using the 16 GB headroom

Because S1 Mini only needs ~5 GB at runtime, the 4060 Ti 16GB is comfortable as a multi-purpose audio card. Common patterns: keep the S1 Mini API server running on port 8080 while a separate ASR or VAD model (e.g. faster-whisper, Silero) loads in another process, or pre-warm --compile for several speakers and batch larger prompt windows. Monitor headroom with nvidia-smi --query-gpu=memory.used --format=csv -l 1 during your workload to confirm the working set stays well under the 16 GB budget before you ship to production.
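For scripted monitoring, the same nvidia-smi query can be parsed from Python. This sketch adds `memory.total` to the query and uses `--format=csv,noheader,nounits` so each GPU yields one "used, total" row in MiB; the helper names are illustrative:

```python
import subprocess


def parse_memory_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows (MiB) from nvidia-smi csv,noheader,nounits output."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi once and return (used_mib, total_mib) per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)
```

Polling `query_gpu_memory()` while both the TTS server and a second model are under load shows how much of the 16 GB budget the combined working set uses.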