What You'll Build
A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.
Hardware data: RX 7900 XTX (24 GB VRAM) · openaudio-s1-mini runtime fits in ~5 GB VRAM · BF16 weights · fish-speech on ROCm · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu126/cu128/cu129 wheel, no flash-attn wheel, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16 with no memory saving, and at 24 GB you don't need it anyway. Fish-Speech's attention is PyTorch scaled-dot-product attention (SDPA): the model code (llama.py) calls F.scaled_dot_product_attention(...) and never imports the flash_attn package. On ROCm that SDPA call routes to AOTriton's forward-only Flash backend automatically; there is no custom kernel to build. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.
⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card — you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. For a commercially-usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.
ℹ️ Repo recently renamed. The original Hugging Face slug fishaudio/openaudio-s1-mini now redirects to fishaudio/s1-mini: same model, same weights, same cc-by-nc-sa-4.0 license. Both the redirecting old URL and the new one work; the install commands below use the path names the official docs expect.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB, gfx1100) |
| RAM | — (no published minimum; allocate ≥ 8 GB system for the Python env + HF cache) | — |
| Storage | ~3.5 GiB | 1.62 GiB model.pth + 1.74 GiB codec.pth (HF Files tree) |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | Python 3.12, PyTorch (ROCm build) ≥ 2.5, Linux | fish-speech (GitHub main) |
The model is a distillation of Fish Audio's S1 and ships in the fish-speech lineage. The repo is gated on Hugging Face, so you must accept the model's terms before the weights download. Because the repo is gated, the weight-file sizes come from the Hugging Face tree listing, not a HEAD probe: model.pth is 1,735,122,974 bytes (≈ 1.62 GiB, or 1.74 GB decimal) and codec.pth is 1,871,099,728 bytes (≈ 1.74 GiB, or 1.87 GB decimal), about 3.36 GiB (3.61 GB decimal) in total.
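The figures above mix binary and decimal units, which is easy to misread. As a quick sanity check, the arithmetic can be reproduced from the two byte counts quoted in this guide. This is only a worked calculation, not a download-size probe; the helper names are illustrative.

```python
# Byte counts as quoted in this guide (from the Hugging Face tree listing).
SIZES_BYTES = {
    "model.pth": 1_735_122_974,
    "codec.pth": 1_871_099_728,
}

GIB = 2**30  # binary gibibyte
GB = 10**9   # decimal gigabyte


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gibibytes."""
    return n_bytes / GIB


def to_gb(n_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes."""
    return n_bytes / GB


total = sum(SIZES_BYTES.values())
for name, size in SIZES_BYTES.items():
    print(f"{name}: {to_gib(size):.2f} GiB ({to_gb(size):.2f} GB)")
print(f"total: {to_gib(total):.2f} GiB ({to_gb(total):.2f} GB)")
# model.pth: 1.62 GiB (1.74 GB)
# codec.pth: 1.74 GiB (1.87 GB)
# total: 3.36 GiB (3.61 GB)
```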
Installation
1. Clone the fish-speech repo
Per the official install guide:
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
2. Install PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel from PyTorch's ROCm index — not the cu12x CUDA extras the upstream fish-speech docs default to. Install Torch first against the ROCm wheel index, then install fish-speech's remaining dependencies:
# Fresh environment
conda create -n fish-speech python=3.12 -y
conda activate fish-speech
# PyTorch built for ROCm (NOT CUDA) — replaces the cu126/cu128/cu129 extras
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in the index URL moves over time (6.3 → 6.4 → 7.x). Read the current stable line in the live PyTorch "Get Started" selector (pick ROCm) before running. A community effort to add a native rocm72 extra to fish-speech's pyproject.toml is tracked in fish-speech issue #1246, but note the config there (ROCm 7.2 / PyTorch 2.11.0) was validated on an RX 9070 (gfx1201, RDNA4), not the 7900 XTX (gfx1100, RDNA3); the thread references the 7900 XTX only as a higher-VRAM card, so treat the exact rocm72 extra as unverified on this GPU. If that change has merged by the time you read this, you can try BACKEND=rocm UV_EXTRA=rocm72 with uv instead of the manual Torch install above, but fall back to the whl/rocm7.2 wheel command if it misbehaves.
3. Install fish-speech's remaining dependencies
With the ROCm Torch build already in place, install the package without its CUDA extra so pip does not pull a cu12x wheel over your ROCm one:
pip install -e .
If pip tries to replace your ROCm Torch with a CUDA build during this step, see "Torch got replaced with a CUDA build" in Troubleshooting.
4. Authenticate with Hugging Face
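The S1 Mini repo is gated: accept the model's terms on the Hugging Face page, then log in locally. Recent huggingface_hub releases, which ship the hf command used in the next step, also provide hf auth login as the replacement for the older huggingface-cli login shown here.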
huggingface-cli login
5. Download the weights
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
This pulls about 3.4 GiB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. Almost all of that is the two weight files: model.pth ≈ 1.62 GiB plus codec.pth ≈ 1.74 GiB, per the HF tree API.
Equivalent Python form:
python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
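A gated or interrupted download can leave the directory incomplete, and the server then fails later with a less obvious error. The sketch below checks for the five file names listed above before you launch anything. It uses only the standard library; the function name and the "empty file counts as missing" rule are assumptions made for illustration, not part of fish-speech.

```python
from pathlib import Path

# File names as listed in this guide for the S1 Mini checkpoint directory.
EXPECTED_FILES = (
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
)


def missing_checkpoint_files(checkpoint_dir: str) -> list[str]:
    """Return the expected file names that are absent or empty in checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        # Treat zero-byte files as missing: they usually indicate a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = missing_checkpoint_files("checkpoints/openaudio-s1-mini")
print("all files present" if not missing else f"missing: {missing}")
```

Run it from the fish-speech checkout so the relative path resolves to the same directory the inference scripts read.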
Running
Option A — API server (recommended)
The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:
python -m tools.api_server \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).
Synthesise a sentence with curl:
curl -X POST "http://127.0.0.1:8080/v1/tts" \
-H "Content-Type: application/json" \
-d '{"text": "Testing one two three."}' \
--output out.wav
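If you would rather call the server from a script, the same request can be sketched with Python's standard library. It mirrors the curl call above (same URL, same JSON body, same header) and assumes the server is listening on 127.0.0.1:8080 as launched in Option A. The function names build_tts_request and synthesize are illustrative helpers, not part of fish-speech, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_tts_request(text: str, base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build the same POST that the curl example sends to /v1/tts."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "out.wav", timeout: float = 120.0) -> int:
    """Send the request and write the response body to out_path.

    Returns the number of bytes written. Raises urllib.error.URLError if the
    server is not reachable.
    """
    request = build_tts_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With the server running, synthesize("Testing one two three.") writes out.wav in the current directory, just as the curl command does.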
Option B — Gradio WebUI
For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:
python -m tools.run_webui \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
Then open http://127.0.0.1:7860.
Optional: --compile for faster repeat inference
The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference." The first call pays a one-time compile cost, after which subsequent calls run on a fused graph. On ROCm, torch.compile lowers through Triton-ROCm and works on RDNA3 for mainstream transformer blocks — expect a slower initial compile than on CUDA, and an occasional kernel fallback. The 24 GB 7900 XTX's large headroom over the ~5 GB inference envelope leaves plenty of room for the compiled graph.
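When you toggle --compile on and off to compare behaviour, it helps to build the launch command in one place. The sketch below assembles the Option A command line as an argument list and appends --compile on request. It assumes tools.api_server accepts --compile in the way the quoted docs describe; api_server_argv is a hypothetical helper name.

```python
import shlex

CHECKPOINT_DIR = "checkpoints/openaudio-s1-mini"


def api_server_argv(listen: str = "0.0.0.0:8080", compile_graph: bool = False) -> list[str]:
    """Assemble the Option A launch command, optionally with --compile."""
    argv = [
        "python", "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", CHECKPOINT_DIR,
        "--decoder-checkpoint-path", f"{CHECKPOINT_DIR}/codec.pth",
        "--decoder-config-name", "modded_dac_vq",
    ]
    if compile_graph:
        argv.append("--compile")  # assumption: flag behaves as the quoted docs state
    return argv


# Print a copy-pasteable shell command for the compiled variant.
print(shlex.join(api_server_argv(compile_graph=True)))
```

The list can be passed directly to subprocess.run(argv, cwd=...) with cwd set to the fish-speech checkout, which avoids shell quoting problems.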
Results
- Speed: No measurement on the RX 7900 XTX has been published, and our backend has no benchmark for this pair yet (/check/openaudio-s1-mini/rx-7900-xtx returns verdict: unknown). No verifiable tok/s or seconds-per-sentence figure for this model on this card turned up in research, so the Speed figure is omitted rather than transferred from a different card or vendor. If you've measured S1 Mini latency on a 7900 XTX, please contribute it so it lands on /check/openaudio-s1-mini/rx-7900-xtx.
- VRAM usage: ~5 GB during inference. This is consistent with the weights on disk: model.pth ≈ 1.62 GiB + codec.pth ≈ 1.74 GiB = ~3.36 GiB (HF tree API), plus activations and KV cache. On the RX 7900 XTX's 24 GB budget that leaves roughly 19 GB free for the OS desktop, a --compile graph, batched generations, multiple voice presets resident in memory, or a co-resident small model (e.g. an ASR pipeline). Note that the VQ-GAN decoder loads in float32 (~1.87 GB) regardless of precision, per fish-speech issue #1246. That is a non-issue at 24 GB, but it is the reason the floor sits near 5 GB rather than 3.4 GB.
- Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (whispering) are honoured (45+ markers per the HF model card). S1 Mini is a distillation of the larger S1, so quality is close but not identical. Run the native BF16 weights on this card; there is no quantization tradeoff to consider at 24 GB.
For the full benchmark data and other-GPU comparisons, see /check/openaudio-s1-mini/rx-7900-xtx.
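Since no latency figure exists for this card yet, a small timing harness is one way to produce a number worth contributing. The sketch below measures wall-clock seconds per call for any callable and discards warm-up runs, which matters when the first request pays for model load or --compile. It is a generic stopwatch, not a fish-speech benchmark tool; time_calls and its parameters are illustrative, and the stand-in workload is a sleep so the example runs without a server.

```python
import statistics
import time
from typing import Callable


def time_calls(synthesize_once: Callable[[], object], runs: int = 5, warmup: int = 1) -> dict:
    """Time repeated calls and summarise wall-clock seconds per call."""
    for _ in range(warmup):
        synthesize_once()  # discard warm-up runs (model load, compile, caches)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize_once()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


# Stand-in workload so the sketch runs without a server; swap in a real request.
report = time_calls(lambda: time.sleep(0.01), runs=3)
print(report)
```

To measure the real server, replace the lambda with a function that posts one fixed sentence to /v1/tts, and report the sentence length alongside the median so the figure can be compared across cards.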
Troubleshooting
OSError: You are trying to access a gated repo
You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos.
"Torch not compiled with CUDA enabled" / Torch got replaced with a CUDA build
ROCm builds of PyTorch expose the GPU through the cuda device namespace under HIP, so the device API stays torch.cuda.*. If you instead see "Torch not compiled with CUDA enabled", the ROCm build is no longer the one installed: a different PyTorch wheel (often pulled in as a transitive dependency during pip install -e .) has replaced it. Reinstall the ROCm wheel:
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and python -c "import torch; print(torch.cuda.is_available())" should print True (HIP reports through the cuda namespace).
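The two one-liners can be folded into a single check that also names the kind of build it found. The sketch below reads torch.__version__, torch.version.hip, and torch.version.cuda (the latter two are None when the build lacks that backend) and classifies the install. classify_torch_build is a hypothetical helper, and the string matching on the version suffix is a heuristic rather than an official API.

```python
from typing import Optional


def classify_torch_build(version: str, hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Classify a PyTorch install as 'rocm', 'cuda', or 'cpu'.

    Inputs correspond to torch.__version__, torch.version.hip, torch.version.cuda.
    """
    if hip_version or "+rocm" in version:
        return "rocm"
    if cuda_version or "+cu" in version:
        return "cuda"
    return "cpu"


try:
    import torch
except ImportError:
    print("torch is not installed in this environment")
else:
    kind = classify_torch_build(torch.__version__, torch.version.hip, torch.version.cuda)
    print(f"build: {kind}, version: {torch.__version__}, GPU visible: {torch.cuda.is_available()}")
```

Anything other than build: rocm with GPU visible: True means the reinstall above did not take effect in the active environment.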
Codec config name confusion
If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference fails with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.
Do not install flash-attn or xformers on this card
HF and fish-speech guides written for NVIDIA frequently suggest pip install flash-attn or an xformers wheel. On RDNA3 these are the wrong path: the upstream CK (Composable Kernel) build of Dao-AILab flash-attn targets CDNA/MI accelerators and commonly fails to build on gfx1100, and the ROCm xformers fork is limited. Fish-Speech already routes attention through PyTorch SDPA, which on ROCm dispatches to AOTriton's forward-only Flash backend with no extra install. Leave the attention path alone.
bf16 / precision flag
The official docs describe a --half flag for "GPUs without bf16 support" to fall back to fp16. The RX 7900 XTX (RDNA3) supports bf16 natively (its WMMA units accept BF16), so --half is normally unnecessary — run the default BF16 path. Reach for --half only if a specific op complains about bf16 on your ROCm version.
First --compile run is slow or a kernel fails to compile
On ROCm, torch.compile lowers through Triton-ROCm. The first compiled call is slower than on CUDA, and exotic ops can occasionally fail to compile and fall back to eager. If --compile errors outright on your ROCm/Triton combination, drop the flag — eager SDPA inference works without it, and the speed difference for short prompts is small. You can also try PYTORCH_TUNABLEOP_ENABLED=1 to auto-tune GEMM kernels (very slow first run, cached afterward) if you are throughput-bound on long batches.