How much VRAM does OpenAudio S1 Mini need?

About 5 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

OpenAudio S1 Mini on RX 7900 XTX: 13-Language Distilled TTS on ROCm (BF16)

What You'll Build

A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.

Hardware data: RX 7900 XTX (24 GB VRAM) · openaudio-s1-mini runtime fits in ~5 GB VRAM · BF16 weights · fish-speech on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu126/cu128/cu129 wheel, no flash-attn wheel, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16 with no memory saving — and at 24 GB you don't need it anyway. Fish-Speech's attention is PyTorch scaled-dot-product attention (SDPA) — verified in the model code, which calls F.scaled_dot_product_attention(...) and never imports the flash_attn package (llama.py). On ROCm that SDPA call routes to AOTriton's forward-only Flash backend automatically; there is no custom kernel to build. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card — you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. For a commercially-usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.

ℹ️ Repo recently renamed. The original Hugging Face slug fishaudio/openaudio-s1-mini now redirects to fishaudio/s1-mini: same model, same weights, same cc-by-nc-sa-4.0 license. Both the old URL and the new one work; the install commands below use the path names the official docs expect.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB, gfx1100)
RAM	— (no published minimum; allocate ≥ 8 GB system for the Python env + HF cache)	—
Storage	~3.5 GB	1.62 GB `model.pth` + 1.74 GB `codec.pth` (HF Files tree)
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.12, PyTorch (ROCm build) ≥ 2.5, Linux	`fish-speech` (GitHub `main`)

The model is a distillation of Fish Audio's S1 and ships in the fish-speech lineage. The repo is gated on Hugging Face, so you must accept the model's terms before the weights download. Because of the gating, the weight-file sizes come from the Hugging Face tree API listing rather than a HEAD probe: model.pth is 1,735,122,974 bytes and codec.pth is 1,871,099,728 bytes. The "GB" figures quoted for these files are binary gigabytes (2^30 bytes): ≈ 1.62 GB + ≈ 1.74 GB ≈ 3.36 GB total, which is about 3.61 GB in decimal units.
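The size arithmetic can be reproduced with a few lines of Python. This is only a sketch of the unit conversion; the byte counts are the ones quoted above, and the helper names `to_gib` and `to_gb` are illustrative.

```python
# Byte counts quoted above (from the Hugging Face tree listing).
SIZES = {"model.pth": 1_735_122_974, "codec.pth": 1_871_099_728}


def to_gib(n_bytes: int) -> float:
    """Binary gigabytes: 2**30 bytes each."""
    return n_bytes / 2**30


def to_gb(n_bytes: int) -> float:
    """Decimal gigabytes: 10**9 bytes each."""
    return n_bytes / 10**9


total = sum(SIZES.values())
for name, size in SIZES.items():
    print(f"{name}: {to_gib(size):.2f} GiB ({to_gb(size):.2f} GB)")
print(f"total: {to_gib(total):.2f} GiB ({to_gb(total):.2f} GB)")
# model.pth: 1.62 GiB (1.74 GB)
# codec.pth: 1.74 GiB (1.87 GB)
# total: 3.36 GiB (3.61 GB)
```

Budget disk space against the decimal figure, since that is what most download tools and drive vendors report.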

Installation

1. Clone the `fish-speech` repo

Per the official install guide:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel from PyTorch's ROCm index — not the cu12x CUDA extras the upstream fish-speech docs default to. Install Torch first against the ROCm wheel index, then install fish-speech's remaining dependencies:

# Fresh environment
conda create -n fish-speech python=3.12 -y
conda activate fish-speech

# PyTorch built for ROCm (NOT CUDA) — replaces the cu126/cu128/cu129 extras
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in the index URL moves over time (6.3 → 6.4 → 7.x). Read the current stable line in the live PyTorch "Get Started" selector (pick ROCm) before running. A community effort to add a native rocm72 extra to fish-speech's pyproject.toml is tracked in fish-speech issue #1246. Note, however, that the config there (ROCm 7.2 / PyTorch 2.11.0) was validated on an RX 9070 (gfx1201, RDNA4), not the 7900 XTX (gfx1100, RDNA3); the thread references the 7900 XTX only as a higher-VRAM card, so treat the exact rocm72 extra as unverified on this GPU. If that change has merged by the time you read this, you can try BACKEND=rocm UV_EXTRA=rocm72 with uv instead of the manual Torch install above, but fall back to the whl/rocm7.2 wheel command if it misbehaves.

3. Install fish-speech's remaining dependencies

With the ROCm Torch build already in place, install the package without its CUDA extra so pip does not pull a cu12x wheel over your ROCm one:

pip install -e .

If pip tries to replace your ROCm Torch with a CUDA build during this step, see "Torch got replaced with a CUDA build" in Troubleshooting.

4. Authenticate with Hugging Face

The S1 Mini repo is gated — you must accept the model's terms on the Hugging Face page and log in locally:

huggingface-cli login

5. Download the weights

hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

This pulls ~3.4 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. File sizes are verified from the HF tree API: model.pth ≈ 1.62 GB + codec.pth ≈ 1.74 GB.

Equivalent Python form:

python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
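Before starting the server, it can help to confirm the download is complete. The sketch below checks the checkpoint folder against the file names and byte counts quoted in this recipe. `check_checkpoint` is a hypothetical helper, not part of fish-speech, and the byte counts are an assumption that breaks if the upstream files are ever re-uploaded.

```python
from pathlib import Path

# Byte counts for the two large files, as quoted in this recipe (assumed current).
EXPECTED_BYTES = {"model.pth": 1_735_122_974, "codec.pth": 1_871_099_728}
# Smaller files listed by the download step; only their presence is checked.
EXPECTED_PRESENT = ["config.json", "tokenizer.tiktoken", "special_tokens.json"]


def check_checkpoint(ckpt_dir, expected_bytes=EXPECTED_BYTES,
                     expected_present=EXPECTED_PRESENT):
    """Return a list of problems; an empty list means the folder looks complete."""
    root = Path(ckpt_dir)
    problems = []
    for name, size in expected_bytes.items():
        path = root / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif path.stat().st_size != size:
            actual = path.stat().st_size
            problems.append(f"size mismatch: {name} is {actual} bytes, expected {size}")
    for name in expected_present:
        if not (root / name).is_file():
            problems.append(f"missing: {name}")
    return problems


if __name__ == "__main__":
    issues = check_checkpoint("checkpoints/openaudio-s1-mini")
    print("OK" if not issues else "\n".join(issues))
```

A size mismatch usually points at an interrupted download; re-running the download command should resume it.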

Running

Option A — API server (recommended)

The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:

python -m tools.api_server \
  --listen 0.0.0.0:8080 \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).
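If you prefer to launch the server from a script, the same flags can be assembled in Python. This is a minimal sketch: `api_server_cmd` is a hypothetical helper, and the only flags it uses are the ones shown in this recipe (including the optional --compile flag described later).

```python
import shlex
import sys


def api_server_cmd(ckpt_dir="checkpoints/openaudio-s1-mini",
                   listen="0.0.0.0:8080", compile_graph=False):
    """Build the tools.api_server command line as an argument list."""
    cmd = [
        sys.executable, "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", ckpt_dir,
        "--decoder-checkpoint-path", f"{ckpt_dir}/codec.pth",
        # Required for S1 Mini; older codec names fail to load.
        "--decoder-config-name", "modded_dac_vq",
    ]
    if compile_graph:
        cmd.append("--compile")
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(api_server_cmd()))
```

To actually start the server, pass the list to `subprocess.run(api_server_cmd(), check=True)` from the fish-speech repo root.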

Synthesise a sentence with curl:

curl -X POST "http://127.0.0.1:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing one two three."}' \
  --output out.wav
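The same request can be sent from Python using only the standard library. This sketch mirrors the curl call above (same URL, header, and JSON body); `build_tts_request` and `synthesize` are illustrative names, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request


def build_tts_request(text, base_url="http://127.0.0.1:8080"):
    """Build the POST request the curl example sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="out.wav", base_url="http://127.0.0.1:8080",
               timeout=120):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)  # number of bytes written
```

With the server running, `synthesize("Testing one two three.")` should leave the audio in out.wav, just as the curl command does.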

Option B — Gradio WebUI

For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:

python -m tools.run_webui \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

Then open http://127.0.0.1:7860.

Optional: `--compile` for faster repeat inference

The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference." The first call pays a one-time compile cost, after which subsequent calls run on a fused graph. On ROCm, torch.compile lowers through Triton-ROCm and works on RDNA3 for mainstream transformer blocks — expect a slower initial compile than on CUDA, and an occasional kernel fallback. The 24 GB 7900 XTX's large headroom over the ~5 GB inference envelope leaves plenty of room for the compiled graph.

Results

Speed: Not measured. No RX 7900 XTX tok/s or seconds-per-sentence figure for this model has been published, and our backend has no benchmark for this pair yet (/check/openaudio-s1-mini/rx-7900-xtx returns verdict: unknown). The figure is omitted rather than transferred from a different card or vendor. If you've measured S1 Mini latency on a 7900 XTX, please contribute it so it lands on /check/openaudio-s1-mini/rx-7900-xtx.
VRAM usage: ~5 GB during inference. This is consistent with the weights on disk: model.pth ≈ 1.62 GB + codec.pth ≈ 1.74 GB ≈ 3.36 GB (HF tree API, binary gigabytes), plus activations and KV cache. On the RX 7900 XTX's 24 GB budget that leaves roughly 19 GB free for the OS desktop, a --compile graph, batched generations, multiple voice presets resident in memory, or a co-resident small model (e.g. an ASR pipeline). Note that the VQ-GAN decoder loads in float32 (~1.87 GB, which matches codec.pth's size in decimal gigabytes) regardless of precision, per fish-speech issue #1246. That is a non-issue at 24 GB, but it is the reason the floor sits near 5 GB rather than 3.4 GB.
Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (whispering) are honoured (45+ markers per the HF model card). S1 Mini is a distillation of the larger S1 — quality is close but not identical. Run the native BF16 weights on this card; there is no quantization tradeoff to consider at 24 GB.
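If you want to contribute a latency figure, a small timing harness keeps the measurement repeatable. The sketch below is an assumption about how one might measure seconds per sentence, not an official benchmark procedure; `time_calls` is a hypothetical helper that times any zero-argument callable.

```python
import statistics
import time


def time_calls(fn, runs=5, warmup=1):
    """Time fn() over several runs, after untimed warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up absorbs one-time costs such as a --compile pass
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }
```

Pass it a callable that performs exactly one TTS request (for example, a function that issues the same POST as the curl command) and report the median together with the sentence you used.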

For the full benchmark data and other-GPU comparisons, see /check/openaudio-s1-mini/rx-7900-xtx.

Troubleshooting

`OSError: You are trying to access a gated repo`

You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos.

"Torch not compiled with CUDA enabled" / Torch got replaced with a CUDA build

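PyTorch's ROCm build exposes the GPU through the same torch.cuda.* namespace (HIP sits underneath), so the device API looks identical to CUDA. If you see "Torch not compiled with CUDA enabled", the ROCm wheel is no longer the one installed: a different PyTorch build (typically the default CPU-only or CUDA wheel from PyPI) replaced it, often pulled in as a transitive dependency during pip install -e . Reinstall the ROCm wheel: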

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and python -c "import torch; print(torch.cuda.is_available())" returns True (HIP reports through the cuda namespace).
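The two one-liners can be folded into a single check. This is a sketch, assuming the version-suffix convention described above (+rocm for ROCm wheels, +cu for CUDA wheels); `classify_torch_build` is a hypothetical helper, while `torch.version.hip` and `torch.version.cuda` are the attributes PyTorch uses to report its build backend.

```python
def classify_torch_build(version, hip_version=None, cuda_version=None):
    """Guess the build flavour from the version string and backend versions."""
    if hip_version or "+rocm" in version:
        return "rocm"
    if cuda_version or "+cu" in version:
        return "cuda"
    return "cpu-or-unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        kind = classify_torch_build(
            torch.__version__, torch.version.hip, torch.version.cuda
        )
        print(torch.__version__, "->", kind,
              "| GPU visible:", torch.cuda.is_available())
```

Anything other than "rocm" on this card means the wheel was replaced and the reinstall above is needed.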

Codec config name confusion

If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference fails with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.

Do not install `flash-attn` or xformers on this card

HF and fish-speech guides written for NVIDIA frequently suggest pip install flash-attn or an xformers wheel. On RDNA3 these are the wrong path: the upstream CK (Composable Kernel) build of Dao-AILab flash-attn targets CDNA/MI accelerators and commonly fails to build on gfx1100, and the ROCm xformers fork is limited. Fish-Speech already routes attention through PyTorch SDPA, which on ROCm dispatches to AOTriton's forward-only Flash backend with no extra install. Leave the attention path alone.

`bf16` / precision flag

The official docs describe a --half flag for "GPUs without bf16 support" to fall back to fp16. The RX 7900 XTX (RDNA3) supports bf16 natively (its WMMA units accept BF16), so --half is normally unnecessary — run the default BF16 path. Reach for --half only if a specific op complains about bf16 on your ROCm version.
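The decision can be scripted if you launch the server from Python. The sketch below assumes PyTorch's `torch.cuda.is_bf16_supported()` is a reasonable proxy for "this GPU has bf16 support"; `precision_flags` and `detect_bf16` are hypothetical helpers, not part of fish-speech.

```python
def precision_flags(bf16_supported):
    """Return extra CLI flags: none for bf16-capable GPUs, --half otherwise."""
    return [] if bf16_supported else ["--half"]


def detect_bf16():
    """Return True/False for bf16 support, or None if it cannot be determined."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return bool(torch.cuda.is_bf16_supported())


if __name__ == "__main__":
    supported = detect_bf16()
    if supported is None:
        print("could not query the GPU; leaving precision flags unchanged")
    else:
        print("extra flags:", precision_flags(supported))
```

On the RX 7900 XTX this is expected to yield no extra flags, which matches the default BF16 path recommended above.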

First `--compile` run is slow or a kernel fails to compile

On ROCm, torch.compile lowers through Triton-ROCm. The first compiled call is slower than on CUDA, and exotic ops can occasionally fail to compile and fall back to eager. If --compile errors outright on your ROCm/Triton combination, drop the flag — eager SDPA inference works without it, and the speed difference for short prompts is small. You can also try PYTORCH_TUNABLEOP_ENABLED=1 to auto-tune GEMM kernels (very slow first run, cached afterward) if you are throughput-bound on long batches.
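When the server is started from a Python launcher, the TunableOp variable can be set for that process only rather than exported shell-wide. This is a minimal sketch; `tunableop_env` is a hypothetical helper, and PYTORCH_TUNABLEOP_ENABLED is the variable named above.

```python
import os


def tunableop_env(base=None, enabled=True):
    """Return a copy of the environment with TunableOp switched on or off."""
    env = dict(os.environ if base is None else base)
    if enabled:
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    else:
        env.pop("PYTORCH_TUNABLEOP_ENABLED", None)
    return env


if __name__ == "__main__":
    print(tunableop_env({}))
    # {'PYTORCH_TUNABLEOP_ENABLED': '1'}
```

Pass the result as the `env=` argument of `subprocess.run` when launching the server, and call it with `enabled=False` to go back to the default kernels without touching your shell profile.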