
OpenAudio S1 Mini on RTX 5070: 13-Language Distilled TTS in ~5 GB VRAM

tts · intermediate · 5 GB+ VRAM · Jun 5, 2026
prerequisites
  • NVIDIA RTX 5070 (12 GB VRAM) or any CUDA GPU with ≥ 6 GB VRAM
  • Python 3.12 (fresh conda or `uv` environment recommended)
  • Linux or WSL2 (per the official install matrix)
  • Hugging Face account — the `fishaudio/openaudio-s1-mini` repo is gated and requires `huggingface-cli login`
  • ~4 GB free disk for weights (1.74 GB `model.pth` + 1.87 GB `codec.pth`)
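
The prerequisites above can be checked before installing anything. This is a minimal sketch using only the Python standard library; the helper names `free_gb` and `preflight` are hypothetical, and the 4 GB threshold is simply the disk figure quoted in the list.

```python
import shutil
import sys

# Weight sizes quoted in this recipe: model.pth + codec.pth, in GB.
WEIGHTS_GB = 1.74 + 1.87


def free_gb(path: str = ".") -> float:
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".", need_gb: float = 4.0) -> dict:
    """Report whether the interpreter and disk match the prerequisites."""
    available = free_gb(path)
    return {
        "python_is_3_12": sys.version_info[:2] == (3, 12),
        "free_gb": round(available, 1),
        "enough_disk": available >= need_gb,
    }
```

Calling `preflight(".")` from the directory where the weights will live returns a small dict you can print. It does not check the GPU or the Hugging Face login.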

What You'll Build

A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on an RTX 5070. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.

Hardware data: RTX 5070 (12 GB VRAM). The openaudio-s1-mini runtime fits in ~5 GB VRAM: an independent WSL2 walkthrough measured 4–6 GB on an RTX 5070, consistent with a TrueNAS deployment that tested the model on a 24 GB RTX 3090 and a 6 GB RTX A2000 (archy.net, 2026-02-17). Weights take ~3.61 GB on disk. See benchmark data.

ℹ️ The 12 GB card is over-provisioned for this 0.5 B model. S1 Mini's runtime footprint (~5 GB) leaves roughly 7 GB of the RTX 5070's 12 GB idle. That headroom is real but not the point of this recipe — S1 Mini fits comfortably on far smaller cards (it loads on a 6 GB RTX A2000 per archy.net). If you want to put the spare VRAM to work, you can colocate a second small model (e.g. a Whisper ASR model for a speech-to-speech loop) on the same card; that workload is out of scope here.

⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card — you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. The fish-speech codebase itself ships under the "Fish Audio Research License" (repo LICENSE). For a commercially-usable open-weight TTS in the same VRAM class, see VoxCPM or Kokoro.

⚠️ Blackwell + PyTorch wheels. The RTX 5070 is a Blackwell card (GB205 die, compute capability sm_120). A default pip install torch ships only sm_50…sm_90 kernels and fails with the error "NVIDIA GeForce RTX 5070 with CUDA capability sm_120 is not compatible with the current PyTorch installation". The same trap has been reported and reproduced for this model on the HF discussions tab, and in upstream fish-speech issue #1126 for another Blackwell card (RTX 5090). The fix is to pick the cu128 or cu129 extra at install time, both of which ship sm_120 kernels, instead of letting uv/pip resolve a stale wheel.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (model ran on a 6 GB RTX A2000 per archy.net) | RTX 5070 (12 GB GDDR7, 192-bit, GB205 sm_120) |
| RAM | 16 GB system | |
| Storage | ~4 GB | 1.74 GB model.pth + 1.87 GB codec.pth (HF Files tab) |
| Software | Python 3.12, PyTorch ≥ 2.7 with CUDA 12.8 or 12.9, Linux/WSL2 | fish-speech (GitHub main) |

Installation

1. Clone the fish-speech repo

Per the official install guide:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

2. Create the environment and install dependencies

The official docs offer two paths — conda and uv. uv resolves Blackwell-compatible wheels faster, so it's the recommended path here. The key flag is --extra cu129 (or cu128) which pulls a PyTorch build with sm_120 kernels:

uv sync --python 3.12 --extra cu129

Conda equivalent (official install guide):

conda create -n fish-speech python=3.12 -y
conda activate fish-speech
pip install -e ".[cu129]"  # quoted so shells such as zsh do not expand the brackets

If you previously installed PyTorch from a different index and want to be explicit, the Blackwell-tested combination is torch >= 2.7 from https://download.pytorch.org/whl/cu128 — confirmed working on an RTX 5070 (sm_120) by an independent WSL2 install walkthrough that names the same cu128 index and the RTX 5070 explicitly.

3. Authenticate with Hugging Face

The S1 Mini repo is gated — you must accept the model's terms on the Hugging Face page and log in locally:

hf auth login  # on older huggingface_hub releases: huggingface-cli login

4. Download the weights

hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

This pulls ~3.6 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. Reported download size matches: "About 3.5GB to download" per the archy.net deployment guide.
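
To confirm the download is complete before starting the server, the file list above can be checked programmatically. This is a sketch under the assumption that the five file names listed in this recipe are the full set the inference scripts need; `missing_files` and `total_size_gb` are hypothetical helper names.

```python
from pathlib import Path

# File names as listed in this recipe for the S1 Mini download.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def missing_files(checkpoint_dir: str) -> list[str]:
    """Return the expected files that are absent from `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def total_size_gb(checkpoint_dir: str) -> float:
    """Sum the sizes of the expected files that are present, in decimal GB."""
    root = Path(checkpoint_dir)
    present = [root / name for name in EXPECTED_FILES if (root / name).is_file()]
    return sum(p.stat().st_size for p in present) / 1e9
```

An empty list from `missing_files("checkpoints/openaudio-s1-mini")` and a total near 3.6 GB suggest the download finished. A much smaller total usually points to an interrupted transfer.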

Equivalent Python form (used by community recipes such as the Furious-Green Japanese-TTS webserver):

python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"

ℹ️ Repo rename. The canonical HF repo has been renamed to fishaudio/s1-mini; the fishaudio/openaudio-s1-mini slug above still resolves (it 307-redirects), so the download commands work unchanged.

Running

Option A — API server (recommended)

The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs and are corroborated by the published API-server command in Furious-Green's webserver:

uv run --python 3.12 python -m tools.api_server \
  --listen 0.0.0.0:8080 \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

The --decoder-config-name modded_dac_vq is mandatory for S1 Mini — it pins the codec architecture variant the distilled model was trained against.
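
If you launch the server from a script rather than a shell, the same command can be assembled in one place so the codec flag is never dropped. This is a sketch: `api_server_cmd` and `start_server` are hypothetical helpers, and the flags are copied from the command above, not taken from a separate API.

```python
import subprocess
from pathlib import PurePosixPath


def api_server_cmd(checkpoint_dir: str = "checkpoints/openaudio-s1-mini",
                   listen: str = "0.0.0.0:8080") -> list[str]:
    """Build the argv for tools.api_server with the flags used in this recipe."""
    ckpt = PurePosixPath(checkpoint_dir)
    return [
        "uv", "run", "--python", "3.12", "python", "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", str(ckpt),
        "--decoder-checkpoint-path", str(ckpt / "codec.pth"),
        "--decoder-config-name", "modded_dac_vq",  # required for S1 Mini
    ]


def start_server(**kwargs) -> subprocess.Popen:
    """Start the server; run from the fish-speech repo root and terminate it yourself."""
    return subprocess.Popen(api_server_cmd(**kwargs))
```

`start_server()` returns a `Popen` handle; call `.terminate()` on it when you are done. Passing `listen="127.0.0.1:8080"` keeps the server off the network, which may be preferable on a shared machine.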

Synthesise a sentence with curl:

curl -X POST "http://127.0.0.1:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing one two three."}' \
  --output out.wav
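
The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above (same URL, header, and JSON body); the function names `build_tts_request` and `synthesize` and the 120-second timeout are assumptions made for illustration.

```python
import json
import urllib.request


def build_tts_request(text: str,
                      base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build the POST request that the curl example sends to /v1/tts."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "out.wav", timeout: float = 120.0) -> int:
    """Send the request and save the response body; returns the number of bytes written."""
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)
```

With the server running, `synthesize("Testing one two three.")` should leave the audio in `out.wav`, as the curl command does. An HTTP error from the server surfaces as `urllib.error.HTTPError`.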

Option B — Gradio WebUI

For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:

uv run --python 3.12 python -m tools.run_webui \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

Then open http://127.0.0.1:7860.

Results

  • Speed: No clean first-party RTX 5070 benchmark exists yet. An independent RTX 5070 WSL2 walkthrough reports only a rough real-time factor, not a measured per-sentence time. An independent TrueNAS deployment (archy.net, Feb 2026) measured 4.85 s to synthesise the "Testing one two three." sentence on an RTX 3090 and 9.35 s on a 6 GB RTX A2000. The RTX 5070 (Blackwell GB205, 12 GB, 192-bit GDDR7) is not a close enough sibling to either card to quote as a target number, so we do not extrapolate one. Submit a measured RTX 5070 run via /contribute to seed /check/.
  • VRAM usage: ~5 GB during inference. The sneekes.app walkthrough measured 4–6 GB during inference directly on an RTX 5070, and archy.net corroborates the range on an RTX 3090 and by loading the model on a 6 GB RTX A2000. The figure is also consistent with the envelope derived from the HF Files tab: 1.74 GB model.pth + 1.87 GB codec.pth = 3.61 GB on disk, plus activations. The RTX 5070's 12 GB leaves ample headroom.
  • Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (in a hurry tone) are honoured per the HF model card. S1 Mini is a distillation of the larger 4 B S1 — quality is close but not identical; the model card publishes per-language WER/CER tables for comparison.
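
For anyone producing a measured run, a real-time factor can be derived from the wall-clock synthesis time and the length of the resulting file. This is a sketch that assumes the server returned a plain PCM WAV with a correct header and that RTF is defined as synthesis time divided by audio duration; the helper names are hypothetical.

```python
import wave


def wav_duration_s(path: str) -> float:
    """Duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def real_time_factor(synthesis_s: float, audio_s: float) -> float:
    """RTF = synthesis time / audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_s / audio_s
```

Time the request yourself (for example with `time.perf_counter()` around the HTTP call) and pass that value together with `wav_duration_s("out.wav")`. The first request after startup may include warm-up cost, so it is worth reporting a later run as well.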

For the full benchmark data, see /check/openaudio-s1-mini/rtx-5070.

Troubleshooting

CUDA error: no kernel image is available for execution on the device

Your PyTorch wheel doesn't ship sm_120 kernels. This is the single most common Blackwell failure for this model — a community user reproduced the sm_120 incompatibility on the HF discussions tab and the same crash is tracked for an RTX 5090 on fish-speech#1126. Re-install with a recent CUDA 12.8 or 12.9 wheel:

pip install --upgrade --force-reinstall torch torchaudio --index-url https://download.pytorch.org/whl/cu128

or re-run uv sync --python 3.12 --extra cu129 from a clean .venv. Verify with:

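python -c "import torch; print(torch.cuda.get_device_capability(), torch.cuda.get_arch_list())"
# Expect: (12, 0) for the card, and 'sm_120' in the printed arch list.
# The capability tuple describes the GPU; the arch list shows what the wheel was built for.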
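
The check that matters is whether the kernels compiled into the installed wheel include the card's architecture. This is a simplified sketch: it looks for an exact `sm_XY` match and ignores PTX forward compatibility, and `wheel_supports` and `check_current_gpu` are hypothetical helper names. `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are the PyTorch calls it relies on.

```python
def wheel_supports(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """True if the wheel's compiled arch list contains the device's sm_XY target."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_current_gpu() -> bool:
    """Compare the installed PyTorch build against the first visible GPU."""
    import torch  # imported lazily so wheel_supports works without PyTorch

    return wheel_supports(torch.cuda.get_arch_list(),
                          torch.cuda.get_device_capability())
```

If `check_current_gpu()` returns `False` on an RTX 5070, the wheel predates Blackwell support and the reinstall above is the next step.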

OSError: You are trying to access a gated repo

You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos. (An unauthenticated request to the repo's README returns HTTP 401 — that is expected for a gated repo, not a broken link.)

Out of memory

The model runs in ~5 GB but torch.compile and large prompt batches can push the working set higher. The RTX 5070 (12 GB, bf16-capable) has a wide margin for this model, so OOM is unlikely. On smaller GPUs that lack native bf16 support, the official Running Inference docs suggest adding the --half flag (fp16 rather than bf16) to the text2semantic step — not needed on the bf16-capable RTX 5070.
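
To see how close a run gets to the card's limit, memory use can be sampled while the server handles a request. This is a sketch that assumes `nvidia-smi` is on the PATH and reports in MiB with the query flags shown; the helper names are hypothetical.

```python
import subprocess


def parse_memory_used_mib(smi_output: str) -> list[int]:
    """Parse one integer per GPU from nvidia-smi's csv,noheader,nounits output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def gpu_memory_used_mib() -> list[int]:
    """Return memory currently in use on each visible GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_used_mib(out)
```

The number covers every process on the GPU, not just the TTS server, so take a reading before starting the server and subtract it if the card also drives a display.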

Codec config name confusion

If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference will fail with a config-load error. S1 Mini requires modded_dac_vq — the same flag used by every cited source (official docs, Furious-Green).