How much VRAM does OpenAudio S1 Mini need?

About 5 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps in this guide.

OpenAudio S1 Mini on RTX 4060 Ti 8GB: 13-Language Distilled TTS in ~5 GB VRAM

What You'll Build

A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on an RTX 4060 Ti 8GB. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.

Hardware data: RTX 4060 Ti 8GB (8 GB VRAM, AD106-350 die, 4352 CUDA cores, 288 GB/s memory bandwidth, 160 W TGP; sources: Wikipedia GeForce 40 series, ASUS Dual RTX 4060 Ti 8GB tech-spec) · openaudio-s1-mini runtime fits in ~5 GB VRAM, per a TrueNAS deployment that tested it on both a 24 GB RTX 3090 and a 6 GB RTX A2000 (archy.net, 2026-02-17) · weights ~3.61 GB on disk · See benchmark data

ℹ️ Comfortable fit, not a squeeze. S1 Mini's ~5 GB runtime envelope leaves roughly 3 GB of the RTX 4060 Ti 8GB's VRAM free, enough for torch.compile graphs, longer prompt batches, or a small concurrent workload. The model loaded successfully on a 6 GB RTX A2000 per archy.net, so the 4060 Ti 8GB has capacity to spare for inference.
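The headroom figure is simple arithmetic, sketched here in Python using only numbers quoted in this recipe (the ~5 GB runtime figure from archy.net, the two card sizes, and the file sizes from the HF Files tab). The helper name `headroom_gb` is illustrative and not part of fish-speech.

```python
def headroom_gb(total_vram_gb: float, runtime_gb: float) -> float:
    """VRAM left over once the model's runtime footprint is resident."""
    return total_vram_gb - runtime_gb


# Figures quoted in this recipe, not new measurements.
WEIGHTS_ON_DISK_GB = 1.74 + 1.87  # model.pth + codec.pth (HF Files tab)
RUNTIME_GB = 5.0                  # "about 5GB of VRAM" per archy.net

print(f"weights on disk:       {WEIGHTS_ON_DISK_GB:.2f} GB")
print(f"headroom on 8 GB card: {headroom_gb(8.0, RUNTIME_GB):.1f} GB")
print(f"headroom on 6 GB card: {headroom_gb(6.0, RUNTIME_GB):.1f} GB")
```

The gap between the ~3.61 GB of weights and the ~5 GB runtime figure is what activations and framework overhead are assumed to occupy.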

⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card — you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. The fish-speech codebase itself ships under the "Fish Audio Research License" (repo LICENSE). For a commercially-usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.

ℹ️ Repo recently renamed. The canonical Hugging Face slug fishaudio/openaudio-s1-mini now 307-redirects to fishaudio/s1-mini — same model, same weights, same cc-by-nc-sa-4.0 license. Both the redirecting old URL and the new one work; the install commands below use the path names the official docs expect.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (model ran on a 6 GB RTX A2000 per archy.net)	RTX 4060 Ti 8GB
RAM	No published minimum; allocate ≥ 8 GB of system RAM for the Python env + HF cache	Not recorded
Storage	~4 GB	1.74 GB `model.pth` + 1.87 GB `codec.pth` (HF Files tab)
Software	Python 3.12, PyTorch ≥ 2.5 with CUDA 12.6+, Linux/WSL2	`fish-speech` (GitHub `main`)

Installation

1. Clone the `fish-speech` repo

Per the official install guide:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

2. Create the environment and install dependencies

The official docs offer two paths, conda and uv, and three CUDA build extras: cu126, cu128, cu129. The RTX 4060 Ti 8GB is an Ada-class GPU (compute capability 8.9, sm_89), which current PyTorch CUDA wheels support, so any of the three extras works; the official docs default to cu129:

uv sync --python 3.12 --extra cu129

Conda equivalent (official install guide):

conda create -n fish-speech python=3.12 -y
conda activate fish-speech
pip install -e ".[cu129]"

If you prefer a wheel from PyTorch's index directly, use torch >= 2.5 from https://download.pytorch.org/whl/cu126 (or the cu128 or cu129 index), matching the three CUDA build extras documented by the official install guide. Unlike Blackwell GPUs (sm_120), the 4060 Ti 8GB needs no special wheel selection: a default CUDA build of torch already runs on sm_89.
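Before downloading weights, it can help to confirm that the installed torch build actually sees the card. The sketch below assumes torch is installed in the active environment; `sm_name` and `describe_gpu` are hypothetical helper names, while the `torch.cuda` calls are standard PyTorch APIs.

```python
def sm_name(capability: tuple) -> str:
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def describe_gpu(index: int = 0) -> str:
    """Summarise the GPU that torch sees; torch is imported lazily."""
    import torch  # assumed to be installed by the steps above

    if not torch.cuda.is_available():
        return "CUDA not available: check the driver and the installed torch wheel"
    name = torch.cuda.get_device_name(index)
    cap = torch.cuda.get_device_capability(index)
    total_gib = torch.cuda.get_device_properties(index).total_memory / 1024**3
    return (
        f"{name}: {sm_name(cap)}, {total_gib:.1f} GiB, "
        f"torch {torch.__version__} (CUDA {torch.version.cuda})"
    )
```

Run `print(describe_gpu())` inside the environment; on an RTX 4060 Ti the capability portion should read sm_89.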

3. Authenticate with Hugging Face

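The S1 Mini repo is gated: you must accept the model's terms on the Hugging Face page and log in locally. (On huggingface_hub releases that ship the hf CLI used in step 4, hf auth login is the equivalent command.)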

huggingface-cli login

4. Download the weights

hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

This pulls ~3.6 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. Reported download size matches: "About 3.5 GB to download" per the archy.net deployment guide.

Equivalent Python form:

python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
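A quick way to confirm the download completed is to check for the five files listed above. This is a minimal sketch using only the standard library; the file names come from this recipe, and the helper names are illustrative.

```python
from pathlib import Path

# File names as listed in this recipe for the S1 Mini checkpoint directory.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def missing_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


def total_size_gb(checkpoint_dir, expected=EXPECTED_FILES):
    """Sum the sizes of the expected files that exist, in decimal GB."""
    root = Path(checkpoint_dir)
    sizes = ((root / name).stat().st_size for name in expected if (root / name).is_file())
    return sum(sizes) / 1e9
```

Calling `missing_files("checkpoints/openaudio-s1-mini")` should return an empty list after a complete download, and `total_size_gb` should land near the ~3.6 GB quoted above.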

Running

Option A — API server (recommended)

The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:

uv run --python 3.12 python -m tools.api_server \
  --listen 0.0.0.0:8080 \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).
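Loading the checkpoints takes a moment, so a script that starts the server and then sends requests may want to wait until the port accepts connections. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, and an open port is only assumed (not confirmed by the docs) to mean the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the server command above, `wait_for_port("127.0.0.1", 8080)` returns True once the listener is up.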

Synthesise a sentence with curl:

curl -X POST "http://127.0.0.1:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing one two three."}' \
  --output out.wav
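The same request can be sent from Python. This sketch mirrors the curl call exactly (same URL, header, and JSON body) using only the standard library; the function names are illustrative, and it assumes the server responds with the audio bytes directly, as the curl example implies.

```python
import json
import urllib.request


def build_tts_request(text: str, base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build the POST request that the curl example sends to /v1/tts."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "out.wav",
               base_url: str = "http://127.0.0.1:8080", timeout_s: float = 300.0) -> int:
    """Send the request and write the response body to out_path; returns bytes written."""
    req = build_tts_request(text, base_url)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, `synthesize("Testing one two three.")` writes out.wav in the current directory.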

Option B — Gradio WebUI

For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:

uv run --python 3.12 python -m tools.run_webui \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

Then open http://127.0.0.1:7860.

Optional: `--compile` for faster repeat inference

The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference". The first call pays a one-time compile cost, estimated here at 10–30 s, after which subsequent calls run on a fused graph. The 4060 Ti 8GB's ~3 GB of headroom over the ~5 GB inference envelope leaves room for the compiled graph with little OOM risk.
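To see the compile cost on your own card, time the first request separately from the ones that follow. The sketch below is generic: it times any zero-argument callable, so you would pass in whatever function sends one TTS request. The helper names are hypothetical.

```python
import time
from statistics import median


def time_calls(fn, n: int = 5):
    """Call fn() n times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def warmup_report(durations):
    """Split timings into the first (warm-up) call and the median of the rest."""
    first, rest = durations[0], durations[1:]
    return {
        "first_s": first,
        "steady_median_s": median(rest) if rest else None,
    }
```

A large gap between `first_s` and `steady_median_s` with --compile enabled, and a small one without it, would be consistent with the one-time compile cost described above.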

Results

Speed: No measurement on RTX 4060 Ti 8GB has been published. Submit a measured run via /contribute to seed /check/.
VRAM usage: ~5 GB during inference, measured on an RTX 3090 and confirmed by the model loading on a 6 GB RTX A2000 (archy.net). The article states verbatim: "The model needs about 5GB of VRAM." This is consistent with the file sizes on the HF Files tab: 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB of weights, with the remainder going to activations and runtime overhead. On the 4060 Ti 8GB, that leaves ~3 GB free for the OS desktop, a browser, or a --compile graph.
Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (in a hurry tone) are honoured per the HF model card. S1 Mini is a distillation of the larger 4 B S1 — quality is close but not identical; the model card publishes per-language WER/CER tables for comparison.

For the full benchmark data, see /check/openaudio-s1-mini/rtx-4060-ti-8gb.
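If you want to produce a speed measurement of your own, a common metric for TTS is the real-time factor: wall-clock generation time divided by the duration of the audio produced (below 1.0 means faster than real time). This sketch uses the standard-library `wave` module and assumes the server's output is a PCM WAV file with a correct header; the helper names are illustrative.

```python
import wave


def wav_duration_s(path) -> float:
    """Duration of a PCM WAV file in seconds, from its header."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def real_time_factor(wall_s: float, audio_s: float) -> float:
    """Generation time divided by audio duration; lower is faster."""
    return wall_s / audio_s
```

Time the request that produced out.wav, then compute `real_time_factor(elapsed, wav_duration_s("out.wav"))`.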

Troubleshooting

`OSError: You are trying to access a gated repo`

You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos.

Codec config name confusion

If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference will fail with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.
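One way to avoid this mistake in launch scripts is to build the command in one place and reject any other codec name. The sketch below only assembles the argument list from the flags shown in this recipe; `build_server_cmd` is a hypothetical helper and does not start the server.

```python
import shlex

REQUIRED_DECODER_CONFIG = "modded_dac_vq"  # per the official Running Inference docs


def build_server_cmd(checkpoint_dir: str = "checkpoints/openaudio-s1-mini",
                     listen: str = "0.0.0.0:8080",
                     decoder_config: str = REQUIRED_DECODER_CONFIG,
                     use_compile: bool = False):
    """Assemble the tools.api_server argv, refusing pre-OpenAudio codec names."""
    if decoder_config != REQUIRED_DECODER_CONFIG:
        raise ValueError(
            f"S1 Mini requires --decoder-config-name {REQUIRED_DECODER_CONFIG}, got {decoder_config!r}"
        )
    cmd = [
        "python", "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", checkpoint_dir,
        "--decoder-checkpoint-path", f"{checkpoint_dir}/codec.pth",
        "--decoder-config-name", decoder_config,
    ]
    if use_compile:
        cmd.append("--compile")
    return cmd


# Print the command as a copy-pasteable shell line.
print(shlex.join(build_server_cmd()))
```

The returned list can be passed to `subprocess.run` from inside the fish-speech checkout.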

Out of memory on first generation

The model runs in ~5 GB, but enabling --compile or sending very large prompt batches can push the working set higher. On the 4060 Ti 8GB this is rarely an issue. If you do hit an OOM during a long generation, drop --compile first, then try the --half flag, which the official docs describe as a fallback for GPUs lacking native bf16. The RTX 4060 Ti (Ada) supports bf16 natively, so --half is normally unnecessary; reach for it only as a last-resort memory-reduction switch.
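To see how close a run gets to the 8 GB limit, you can poll nvidia-smi while generating. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result; the helper names are illustrative, and it assumes nvidia-smi is on PATH.

```python
import subprocess


def parse_memory_line(line: str):
    """Parse one 'used, total' CSV line (MiB, no units) into a tuple of ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib(index: int = 0):
    """Return (used_mib, total_mib) for one GPU as reported by nvidia-smi."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_line(out.strip().splitlines()[0])
```

Calling `gpu_memory_mib()` before and during a long generation shows how much of the headroom --compile or a large prompt actually consumes. Note that the figure covers every process on the GPU, including the desktop.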

`pip install -e .[cu129]` fails to resolve a torch wheel

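Resolving the cu129 extra depends on a recent pip and on matching wheels being available on the package index. If resolution stalls, switch to cu128 or cu126; all three support sm_89 and are functionally equivalent for the RTX 4060 Ti 8GB. Re-run (the quotes keep shells such as zsh from treating the brackets as a glob pattern):

pip install -e ".[cu126]"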