OpenAudio S1 Mini on RTX 3060 Ti: 13-Language Distilled TTS in ~5 GB VRAM

tts · intermediate · 5GB+ VRAM · Jun 16, 2026

This intermediate recipe sets up OpenAudio S1 Mini on the RTX 3060 Ti, needing about 5 GB of VRAM.

prerequisites
  • NVIDIA RTX 3060 Ti (8 GB VRAM) or any CUDA GPU with ≥ 6 GB VRAM
  • Python 3.12 (fresh conda or `uv` environment recommended)
  • Linux or WSL2 (per the official install matrix)
  • Hugging Face account — the `fishaudio/openaudio-s1-mini` repo is gated and requires `huggingface-cli login`
  • ~4 GB free disk for weights (1.74 GB `model.pth` + 1.87 GB `codec.pth`)

What You'll Build

A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on an RTX 3060 Ti. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.

Hardware data: RTX 3060 Ti (8 GB VRAM, Ampere GA104-200 sm_86, 256-bit GDDR6 ~448 GB/s, 4864 CUDA cores; sources: Wikipedia GeForce 30 series, ASUS Dual RTX 3060 Ti) · openaudio-s1-mini runtime fits in ~5 GB VRAM per a TrueNAS deployment that tested it on a 24 GB RTX 3090 and confirmed it loading on a 6 GB RTX A2000 (archy.net, 2026-02-17) · weights ~3.61 GB on disk

ℹ️ Comfortable fit, not a squeeze. S1 Mini's ~5 GB runtime envelope leaves the RTX 3060 Ti's 8 GB budget with roughly 3 GB of headroom — enough for torch.compile graphs, longer prompt batches, or a small concurrent workload. The model loaded successfully on a 6 GB RTX A2000 per archy.net, so the 3060 Ti is over-provisioned for inference.
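The headroom claim is simple arithmetic, and the sketch below reproduces it from the figures quoted in this recipe. The variable names are illustrative, and the ~5 GB runtime figure is the archy.net estimate, not a measurement taken on an RTX 3060 Ti.

```python
# Figures quoted in this recipe. RUNTIME_VRAM_GB is the archy.net estimate,
# not a value measured on an RTX 3060 Ti.
MODEL_PTH_GB = 1.74
CODEC_PTH_GB = 1.87
RUNTIME_VRAM_GB = 5.0
CARD_VRAM_GB = 8.0

# Weights as stored on disk.
weights_gb = MODEL_PTH_GB + CODEC_PTH_GB

# Whatever the runtime figure adds on top of the raw weights
# (activations and other allocations).
overhead_gb = RUNTIME_VRAM_GB - weights_gb

# What is left on the card once the model is running.
headroom_gb = CARD_VRAM_GB - RUNTIME_VRAM_GB

print(f"weights on disk : {weights_gb:.2f} GB")
print(f"runtime overhead: {overhead_gb:.2f} GB")
print(f"headroom on card: {headroom_gb:.2f} GB")
```

Swap in your own card's VRAM for CARD_VRAM_GB to see how much room a different GPU would leave.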

⚠️ Non-commercial, share-alike weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card (confirmed via the HF API cardData.license). This is not a permissive license: you may use the weights for research, demos, and personal projects, but the NC clause forbids commercial deployment or paid services, and the SA clause requires that any redistributed derivative carry the same CC-BY-NC-SA-4.0 terms with attribution. The fish-speech codebase itself ships under the "Fish Audio Research License" (repo LICENSE). For a commercially-usable open-weight TTS in the same VRAM class, see VoxCPM or Kokoro.

ℹ️ Repo recently renamed. The original Hugging Face slug fishaudio/openaudio-s1-mini now 307-redirects to fishaudio/s1-mini: same model, same weights, same cc-by-nc-sa-4.0 license. Both the old URL and the new one work; the install commands below use the path names the official docs expect.

Requirements

Component | Minimum | Tested
GPU | 6 GB VRAM (model loaded on a 6 GB RTX A2000 per archy.net) | RTX 3060 Ti (8 GB GDDR6, 256-bit, Ampere GA104-200 sm_86; Wikipedia GeForce 30 series)
RAM | 16 GB system |
Storage | ~4 GB | 1.74 GB model.pth + 1.87 GB codec.pth (HF tree API)
Software | Python 3.12, PyTorch ≥ 2.5 with CUDA 12.4+, Linux/WSL2 | fish-speech (GitHub main)
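Two of these requirements can be checked before anything is installed. The following is a minimal pre-flight sketch using only the standard library; the function names and the decimal-GB convention are assumptions made for illustration, not part of fish-speech.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 12)  # Python version named in the requirements
REQUIRED_DISK_GB = 4.0     # "~4 GB" of storage for the weights


def python_matches(required=REQUIRED_PYTHON, actual=None):
    """Return True when the interpreter's major.minor equals the required one."""
    actual = tuple(sys.version_info[:2]) if actual is None else tuple(actual)
    return actual == tuple(required)


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_free_disk(path=".", need_gb=REQUIRED_DISK_GB):
    """Return True when at least `need_gb` gigabytes are free at `path`."""
    return free_disk_gb(path) >= need_gb


print(f"Python 3.12 interpreter: {python_matches()}")
print(f"Free disk here: {free_disk_gb():.1f} GB (enough: {has_free_disk()})")
```

Run it from the directory where you plan to clone fish-speech, since the weights land under that checkout.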

Installation

1. Clone the fish-speech repo

Per the official install guide:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

2. Create the environment and install dependencies

The official docs offer two paths — conda and uv — and three CUDA build extras: cu126, cu128, cu129. The RTX 3060 Ti is Ampere-class (compute capability sm_86), which is supported by every modern PyTorch wheel, so any of the three CUDA extras work and no special wheel selection is required:

uv sync --python 3.12 --extra cu126

Conda equivalent (official install guide):

conda create -n fish-speech python=3.12 -y
conda activate fish-speech
pip install -e .[cu126]

If you prefer to install a wheel from PyTorch's index directly, use torch >= 2.5 from https://download.pytorch.org/whl/cu124, or stay with one of the cu126/cu128/cu129 build extras documented in the official install guide. Ampere sm_86 kernels have shipped in every PyTorch CUDA wheel since the 11.x series, so the default pip install torch already includes them. Prebuilt FlashAttention-2 wheels have also covered sm_86 since FA2 2.x, so you can keep flash_attention_2 wherever sample code enables it.
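Once the environment is built, it is worth confirming that PyTorch actually sees the card as an sm_86 device with bf16 support. This sketch uses standard torch.cuda calls; the helper names are hypothetical, and the script falls back to a plain message when torch or a CUDA device is missing.

```python
def is_ampere_or_newer(capability):
    """True for compute capability 8.0 and up (an RTX 3060 Ti reports 8.6)."""
    major, _minor = capability
    return major >= 8


def describe_gpu():
    """Summarise GPU 0 as PyTorch sees it, or explain why that is not possible."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed, but no CUDA device is visible"
    capability = torch.cuda.get_device_capability(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    return (
        f"{torch.cuda.get_device_name(0)}: "
        f"sm_{capability[0]}{capability[1]}, "
        f"ampere_or_newer={is_ampere_or_newer(capability)}, "
        f"bf16={torch.cuda.is_bf16_supported()}, "
        f"free={free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB"
    )


print(describe_gpu())
```

Run it with the same interpreter the server will use, for example through uv run inside the fish-speech checkout.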

3. Authenticate with Hugging Face

The S1 Mini repo is gated (gated: auto on the HF API) — you must request access and accept the model's terms on the Hugging Face page, then log in locally:

huggingface-cli login

4. Download the weights

hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

This pulls ~3.6 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path the inference scripts expect. The size agrees with both sources: the archy.net deployment guide says "About 3.5 GB to download", and the live HF tree API reports 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB total.

Equivalent Python form:

python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
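A truncated download usually surfaces later as a confusing load error, so a quick file check can save time. The sketch below looks for the five files listed above and sums the directory size; the function names are illustrative, and the total is approximate because the download tool may leave small metadata files alongside the weights.

```python
from pathlib import Path

# File names listed in this recipe for the S1 Mini checkpoint directory.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def missing_files(ckpt_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in expected if not (root / name).is_file()]


def total_size_gb(ckpt_dir):
    """Sum the size of every file under ckpt_dir, in decimal gigabytes."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return 0.0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file()) / 1e9


ckpt = Path("checkpoints/openaudio-s1-mini")
print("missing:", missing_files(ckpt) or "none")
print(f"size on disk: {total_size_gb(ckpt):.2f} GB")
```

A complete download should report no missing files and a size close to the 3.61 GB quoted above.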

Running

Option A — API server (recommended)

The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:

uv run --python 3.12 python -m tools.api_server \
  --listen 0.0.0.0:8080 \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).

Synthesise a sentence with curl:

curl -X POST "http://127.0.0.1:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing one two three."}' \
  --output out.wav
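The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above (same URL, header, and JSON body); the function names and the 120-second timeout are illustrative assumptions, and the JSON body carries only the text field shown in the curl example.

```python
import json
import urllib.request


def build_tts_request(text, base_url="http://127.0.0.1:8080"):
    """Build the POST request that the curl example sends to /v1/tts."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesise(text, out_path="out.wav", base_url="http://127.0.0.1:8080", timeout=120):
    """Send the text to a running API server and save the returned audio bytes."""
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)


# With the API server running:
# synthesise("Testing one two three.")
```

Separating request construction from sending makes it possible to inspect the payload without a server running.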

Option B — Gradio WebUI

For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:

uv run --python 3.12 python -m tools.run_webui \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

Then open http://127.0.0.1:7860.

Optional: --compile for faster repeat inference

Add the --compile flag to enable torch.compile optimisation for faster inference (official docs). The first call pays a one-time compile cost (10–30 s), after which subsequent calls run on a fused graph. The RTX 3060 Ti's ~3 GB headroom over the ~5 GB inference envelope leaves room for the compiled graph without OOM risk.
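If you start the server from a script, the flags can be assembled programmatically so that --compile becomes a toggle. The sketch below builds the same argument list as the Option A command; it assumes it runs from the fish-speech checkout with the project environment's interpreter (it uses sys.executable rather than uv run), and build_server_cmd is a hypothetical helper, not part of fish-speech.

```python
import shlex
import sys

CKPT_DIR = "checkpoints/openaudio-s1-mini"


def build_server_cmd(listen="0.0.0.0:8080", ckpt_dir=CKPT_DIR,
                     compile_graph=False, half=False):
    """Return the argv list for tools.api_server using this recipe's flags."""
    cmd = [
        sys.executable, "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", ckpt_dir,
        "--decoder-checkpoint-path", f"{ckpt_dir}/codec.pth",
        "--decoder-config-name", "modded_dac_vq",  # required for S1 Mini
    ]
    if compile_graph:
        cmd.append("--compile")  # torch.compile; first call pays the compile cost
    if half:
        cmd.append("--half")     # fp16, for GPUs without bf16 support
    return cmd


# Print the command line without launching anything.
print(shlex.join(build_server_cmd(compile_graph=True)))

# To launch it for real:
# import subprocess
# subprocess.run(build_server_cmd(compile_graph=True), check=True)
```

Keeping the codec config name in one place also guards against pasting an older value from a stale snippet.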

Results

  • Speed: No RTX 3060 Ti measurement has been published, and our backend has no benchmark for this pair yet (/check/openaudio-s1-mini/rtx-3060-ti returns verdict: unknown). As external reference points only, an independent TrueNAS deployment (archy.net, Feb 2026) measured 4.85 s to synthesise the "Testing one two three." sentence on a 24 GB RTX 3090 and 9.35 s on a 6 GB RTX A2000. The RTX 3060 Ti shares the RTX 3090's Ampere generation but is a far smaller card (~448 GB/s memory bandwidth and 4864 CUDA cores versus the 3090's 936 GB/s and 10496 cores, per Wikipedia GeForce 30 series), so neither figure can be quoted as a 3060 Ti target and we do not extrapolate one. Submit a measured run via /contribute to seed /check/.
  • VRAM usage: ~5 GB during inference, measured on a 24 GB RTX 3090 and confirmed by the model loading on a 6 GB RTX A2000 (archy.net). The article states verbatim: "The model needs about 5GB of VRAM." This is consistent with the derived envelope from the live HF tree API — 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB on disk + activations. On the RTX 3060 Ti's 8 GB budget that leaves ~3 GB free for the OS desktop, browser, --compile graph, or longer prompt batches.
  • Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (whispering) are honoured per the HF model card. S1 Mini is a distillation of the larger S1 — quality is close but not identical.
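If you want to produce your own speed figure, wall-clock time alone is hard to compare across sentences of different length. One common way to normalise it is a real-time factor: synthesis time divided by the duration of the audio produced. This sketch assumes the server returns a PCM WAV file that the standard wave module can parse; the function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from frame count and sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synth_seconds, audio_seconds):
    """Synthesis time divided by audio length; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synth_seconds / audio_seconds
```

Measure synth_seconds with time.perf_counter() around your request, then pass it in with wav_duration_seconds("out.wav"). Record whether --compile was enabled, and discard the first call if it was, since that call includes the one-time compile cost.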

For the full benchmark data, see /check/openaudio-s1-mini/rtx-3060-ti.

Troubleshooting

OSError: You are trying to access a gated repo

You haven't requested access / accepted the model's terms, or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos. (An unauthenticated request to the repo's README returns HTTP 401 — that is expected for a gated repo, not a broken link.)

Codec config name confusion

If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference will fail with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.

Out of memory on first generation

The model runs in ~5 GB, but enabling --compile or sending very large prompt batches can push the working set higher. The RTX 3060 Ti's ~3 GB of headroom makes this rare for the model alone; it is more likely when you stack additional models on the same card or run long batched generations. Drop --compile first. The official docs describe the --half flag as a way to use fp16 on GPUs without bf16 support. The RTX 3060 Ti (Ampere) supports bf16 natively, and fp16 and bf16 are both 16-bit formats, so --half is normally unnecessary here and should not be expected to free much memory.
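To see what is occupying the card before blaming the model, you can query nvidia-smi from a script. The sketch below uses nvidia-smi's CSV query output, which reports memory in MiB when nounits is set; the helper names are hypothetical, and the function returns an empty list when nvidia-smi is missing or fails.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' CSV line from nvidia-smi into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    readings = []
    for line in result.stdout.splitlines():
        try:
            readings.append(parse_memory_line(line))
        except ValueError:
            continue  # skip blank or non-numeric lines such as "[N/A]"
    return readings


for used, total in gpu_memory_mib():
    print(f"{used} MiB used of {total} MiB ({total - used} MiB free)")
```

Run it once before starting the server and once after the first generation; the difference approximates the model's footprint on your card.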

pip install -e .[cu126] fails to resolve a torch wheel

If wheel resolution stalls, switch to cu128 or cu129 — all include sm_86 kernels and are functionally equivalent for the RTX 3060 Ti. Re-run:

pip install -e .[cu128]

common questions
How much VRAM does OpenAudio S1 Mini need?

About 5 GB at runtime. The prerequisites call for a GPU with at least 6 GB of VRAM.

Which GPUs is OpenAudio S1 Mini tested on?

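This recipe targets the RTX 3060 Ti (8 GB). The external reference runs cited above were on a 24 GB RTX 3090 and a 6 GB RTX A2000 (archy.net); no RTX 3060 Ti measurement has been published yet.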

How hard is this setup?

Intermediate — follow the steps above.