How much VRAM does OpenAudio S1 Mini need?

About 5 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the Installation and Running steps in this guide.

OpenAudio S1 Mini on RX 7800 XT: 13-Language Distilled TTS on ROCm (BF16)

What You'll Build

A local 13-language text-to-speech pipeline using OpenAudio S1 Mini — the 0.5 B-parameter distilled version of Fish Audio's S1 model — running on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. You'll launch the official tools.api_server from the fish-speech codebase, hit it with curl, and get back synthesised speech in any of English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, or Portuguese.

Hardware data: RX 7800 XT (16 GB VRAM) · openaudio-s1-mini runtime fits in ~5 GB VRAM · BF16 weights · fish-speech on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu126/cu128/cu129 wheel, no flash-attn wheel, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16 with no memory saving — and at ~5 GB this model never comes close to filling 16 GB anyway. Fish-Speech's attention is PyTorch scaled-dot-product attention (SDPA) — verified in the model code, which calls F.scaled_dot_product_attention(...) and never imports the flash_attn package (llama.py). On ROCm that SDPA call routes to AOTriton's forward-only Flash backend automatically; there is no custom kernel to build. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ Non-commercial weights. The S1 Mini weights are licensed CC-BY-NC-SA-4.0 per the Hugging Face model card — you may use them for research, demos, and personal projects, but not for commercial deployment or paid services. For a commercially-usable open-weight TTS in the same VRAM class, see Kokoro or other CC0/Apache-licensed TTS recipes.

ℹ️ Repo recently renamed. The original Hugging Face slug fishaudio/openaudio-s1-mini now redirects to fishaudio/s1-mini: same model, same weights, same cc-by-nc-sa-4.0 license. Both the old URL and the new one work; the install commands below use the path names the official docs expect.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB, gfx1101)
RAM	— (no published minimum; allocate ≥ 8 GB system for the Python env + HF cache)	—
Storage	~3.5 GB	1.62 GB `model.pth` + 1.74 GB `codec.pth` (HF Files tree)
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.12, PyTorch (ROCm build) ≥ 2.5, Linux	`fish-speech` (GitHub `main`)

The model is a distillation of Fish Audio's S1 and ships in the fish-speech lineage. The repo is gated on Hugging Face: you must accept the model's terms before the weights download. Because the repo is gated, the weight-file sizes come from the Hugging Face tree API listing, not a HEAD probe: model.pth is 1,735,122,974 bytes ≈ 1.62 GB and codec.pth is 1,871,099,728 bytes ≈ 1.74 GB, about 3.36 GB in total. These GB figures are binary (GiB); in decimal units the total is about 3.61 GB.
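The size conversion can be reproduced with a few lines of Python. This is a minimal sketch that only uses the two byte counts quoted above; the helper names are illustrative.

```python
# Byte counts quoted above from the Hugging Face tree listing.
SIZES_BYTES = {
    "model.pth": 1_735_122_974,
    "codec.pth": 1_871_099_728,
}

GIB = 1024 ** 3  # binary gibibyte
GB = 1000 ** 3   # decimal gigabyte


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gibibytes."""
    return n_bytes / GIB


def to_gb(n_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes."""
    return n_bytes / GB


total = sum(SIZES_BYTES.values())
for name, size in SIZES_BYTES.items():
    print(f"{name}: {to_gib(size):.2f} GiB ({to_gb(size):.2f} GB)")
print(f"total: {to_gib(total):.2f} GiB ({to_gb(total):.2f} GB)")
```

Printing both units side by side makes it easier to compare these numbers with tools that report decimal gigabytes.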

Installation

1. Clone the `fish-speech` repo

Per the official install guide:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

2. Install PyTorch for ROCm

The RX 7800 XT is an officially ROCm-supported GPU on Linux: the AMD ROCm system-requirements matrix lists it as gfx1101 with full support. It therefore uses the stable ROCm PyTorch wheel from PyTorch's ROCm index, not the cu12x CUDA extras the upstream fish-speech docs default to. Install Torch first against the ROCm wheel index, then install fish-speech's remaining dependencies:

# Fresh environment
conda create -n fish-speech python=3.12 -y
conda activate fish-speech

# PyTorch built for ROCm (NOT CUDA) — replaces the cu126/cu128/cu129 extras
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in the index URL moves over time (6.3 → 6.4 → 7.x), so read the current stable line in the live PyTorch "Get Started" selector (pick ROCm) before running. A community effort to add a native rocm72 extra to fish-speech's pyproject.toml is tracked in fish-speech issue #1246. Note that the config there (ROCm 7.2 / PyTorch 2.11.0) was validated on an RX 9070 (gfx1201, RDNA4), not this card. The thread references the 7900 XTX only as a higher-VRAM example and does not mention the 7800 XT (gfx1101, RDNA3) at all, so treat the rocm72 extra as community config that has not been validated on the 7800 XT. If that PR has merged by the time you read this, you can try BACKEND=rocm UV_EXTRA=rocm72 with uv instead of the manual Torch install above; fall back to the whl/rocm7.2 wheel command if it misbehaves.

3. Install fish-speech's remaining dependencies

With the ROCm Torch build already in place, install the package without its CUDA extra so pip does not pull a cu12x wheel over your ROCm one:

pip install -e .

If pip tries to replace your ROCm Torch with a CUDA build during this step, see "Torch got replaced with a CUDA build" in Troubleshooting.

4. Authenticate with Hugging Face

The S1 Mini repo is gated — you must accept the model's terms on the Hugging Face page and log in locally:

huggingface-cli login

5. Download the weights

hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

This pulls ~3.4 GB of files (model.pth, codec.pth, config.json, tokenizer.tiktoken, special_tokens.json) into checkpoints/openaudio-s1-mini/, the path expected by the inference scripts. File sizes are verified from the HF tree API: model.pth ≈ 1.62 GB + codec.pth ≈ 1.74 GB.

Equivalent Python form:

python -c "from huggingface_hub import snapshot_download; snapshot_download('fishaudio/openaudio-s1-mini', local_dir='checkpoints/openaudio-s1-mini')"
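Once the download finishes, a quick sanity check can catch a truncated or partial checkpoint before you launch the server. The sketch below is illustrative: the file names and byte counts are the ones quoted in this guide, `check_checkpoint` is a hypothetical helper, and the exact sizes will stop matching if the upstream weights are ever re-uploaded.

```python
from pathlib import Path

# Byte sizes of the two large files as quoted earlier in this guide.
# The three small files are only checked for presence.
EXPECTED_SIZES = {
    "model.pth": 1_735_122_974,
    "codec.pth": 1_871_099_728,
}
REQUIRED_FILES = [
    "model.pth",
    "codec.pth",
    "config.json",
    "tokenizer.tiktoken",
    "special_tokens.json",
]


def check_checkpoint(ckpt_dir, expected_sizes=EXPECTED_SIZES, required=REQUIRED_FILES):
    """Return a list of problems found; an empty list means every check passed."""
    root = Path(ckpt_dir)
    problems = []
    for name in required:
        path = root / name
        if not path.is_file():
            problems.append(f"missing: {name}")
            continue
        want = expected_sizes.get(name)
        have = path.stat().st_size
        if want is not None and have != want:
            problems.append(f"size mismatch: {name} is {have} bytes, expected {want}")
    return problems
```

Calling `check_checkpoint("checkpoints/openaudio-s1-mini")` from the fish-speech repo root should return an empty list when all five files are present and the two weight files have the listed sizes.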

Running

Option A — API server (recommended)

The canonical entrypoint for serving S1 Mini is tools.api_server. The flags below come verbatim from the official Fish Audio Running Inference docs:

python -m tools.api_server \
  --listen 0.0.0.0:8080 \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

The --decoder-config-name modded_dac_vq flag is mandatory for S1 Mini: it pins the codec architecture variant the distilled model was trained against (official docs).

Synthesise a sentence with curl:

curl -X POST "http://127.0.0.1:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing one two three."}' \
  --output out.wav
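The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above (same endpoint, same one-field JSON body); the function names, the default timeout, and the assumption that the response body is the audio file itself are illustrative, not taken from the fish-speech docs.

```python
import json
import urllib.request


def build_tts_request(text, base_url="http://127.0.0.1:8080"):
    """Build the same POST the curl example sends: a JSON body with one "text" field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="out.wav", base_url="http://127.0.0.1:8080", timeout=120):
    """Send the request and write the raw response bytes to out_path.

    Returns the number of bytes written. Requires the API server to be running.
    """
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, `synthesize("Testing one two three.")` should leave an out.wav next to the script, just like the curl command.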

Option B — Gradio WebUI

For an interactive UI, swap tools.api_server for tools.run_webui with the same checkpoint flags:

python -m tools.run_webui \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

Then open http://127.0.0.1:7860.

Optional: `--compile` for faster repeat inference

The official docs note: "Add the --compile flag to enable torch.compile optimization for faster inference." The first call pays a one-time compile cost, after which subsequent calls run on a fused graph. On ROCm, torch.compile lowers through Triton-ROCm and works on RDNA3 for mainstream transformer blocks — expect a slower initial compile than on CUDA, and an occasional kernel fallback. The 16 GB 7800 XT's large headroom over the ~5 GB inference envelope leaves plenty of room for the compiled graph.
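If you switch between eager and compiled runs often, it can help to build the launch command in one place. The sketch below assembles the same argument list used in Option A and appends --compile on request. `server_command` is a hypothetical helper, and the `half` toggle refers to the --half fallback covered under Troubleshooting.

```python
import shlex


def server_command(ckpt_dir="checkpoints/openaudio-s1-mini",
                   listen="0.0.0.0:8080",
                   compile_graph=False,
                   half=False):
    """Return the tools.api_server argv list with the flags used in this guide."""
    cmd = [
        "python", "-m", "tools.api_server",
        "--listen", listen,
        "--llama-checkpoint-path", ckpt_dir,
        "--decoder-checkpoint-path", f"{ckpt_dir}/codec.pth",
        "--decoder-config-name", "modded_dac_vq",
    ]
    if compile_graph:
        cmd.append("--compile")  # one-time compile cost on the first call
    if half:
        cmd.append("--half")     # fp16 fallback; normally unnecessary on RDNA3
    return cmd


# Print a copy-pasteable shell line for the compiled variant.
print(shlex.join(server_command(compile_graph=True)))
```

To launch directly from Python instead of printing, pass the list to `subprocess.run(..., check=True)` from the fish-speech repo root.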

Results

Speed: No measurement of this model on the RX 7800 XT has been published, and our backend has no benchmark for this pair yet (/check/openaudio-s1-mini/rx-7800-xt returns verdict: unknown). We found no verifiable tok/s or seconds-per-sentence figure that names the RX 7800 XT, so the Speed figure is omitted rather than transferred from a different card or vendor. A sibling RDNA3 number would not transfer cleanly: the 7800 XT has roughly two-thirds of the 7900 XTX's memory bandwidth (624 vs 960 GB/s) and fewer WMMA units. If you've measured S1 Mini latency on a 7800 XT, please contribute it so it lands on /check/openaudio-s1-mini/rx-7800-xt.
VRAM usage: ~5 GB during inference. This is consistent with the weights on disk: model.pth ≈ 1.62 GiB + codec.pth ≈ 1.74 GiB ≈ 3.36 GiB (HF tree API), plus activations and KV cache. On the RX 7800 XT's 16 GB budget that leaves roughly 11 GB free for the OS desktop, a --compile graph, batched generations, multiple voice presets resident in memory, or a co-resident small model (e.g. an ASR pipeline). Per fish-speech issue #1246, the VQ-GAN decoder loads in float32 regardless of the precision setting, at ~1.87 GB (a decimal figure that matches the codec.pth byte count). That is easily absorbed at 16 GB, but it is the reason the floor sits near 5 GB rather than 3.4 GB.
Quality notes: 13 languages supported (en, zh, ja, de, fr, es, ko, ar, ru, nl, it, pl, pt); emotion / tone markers like (angry), (laughing), (whispering) are honoured (dozens of markers per the HF model card). S1 Mini is a distillation of the larger S1 — quality is close but not identical. Run the native BF16 weights on this card; at ~5 GB into a 16 GB budget there is no quantization tradeoff to consider.

For the full benchmark data and other-GPU comparisons, see /check/openaudio-s1-mini/rx-7800-xt.
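If you want to produce a latency figure of your own, one simple approach is to time a request and divide by the length of the returned audio. The sketch below is one way to do that, under two assumptions: the server returns a complete PCM WAV file with a valid header, and `fetch_wav` is a callable you supply (hypothetical here) that takes text and returns the WAV bytes.

```python
import io
import time
import wave


def wav_duration_seconds(wav_bytes):
    """Length of a PCM WAV payload in seconds, read from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def time_synthesis(fetch_wav, text):
    """Call fetch_wav(text) once and report latency and real-time factor."""
    start = time.perf_counter()
    wav_bytes = fetch_wav(text)
    elapsed = time.perf_counter() - start
    audio_s = wav_duration_seconds(wav_bytes)
    return {
        "latency_s": elapsed,
        "audio_s": audio_s,
        # RTF below 1 means the audio was generated faster than it plays back.
        "rtf": elapsed / audio_s if audio_s else float("inf"),
    }
```

Discard the first call when --compile is enabled, since it includes the one-time compile cost, and report the sentence text and flags alongside the numbers so others can reproduce them.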

Troubleshooting

`OSError: You are trying to access a gated repo`

You haven't accepted the model's terms or aren't logged in. Visit the model page, click "Agree and access repository", then re-run huggingface-cli login with a token that has read access to gated repos.

"Torch not compiled with CUDA enabled" / Torch got replaced with a CUDA build

ROCm builds of PyTorch expose the GPU through the cuda device namespace via HIP, so the device API stays torch.cuda.*. If you see "Torch not compiled with CUDA enabled", or torch.cuda.is_available() returns False, a non-ROCm build of PyTorch (CPU-only or CUDA) has replaced the ROCm one, often pulled in as a transitive dependency during pip install -e . Reinstall the ROCm wheel:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm suffix (for example +rocm7.2), and python -c "import torch; print(torch.cuda.is_available())" should print True (HIP reports through the cuda namespace).
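The two one-liners can be folded into a single check. This is a minimal sketch: `classify_torch_build` and `installed_torch_build` are hypothetical helper names, and the classification relies on the usual wheel naming convention (a `+rocm…`, `+cu…`, or `+cpu` suffix), so a wheel without a suffix is reported as unknown.

```python
def classify_torch_build(version):
    """Classify a torch.__version__ string by the suffix after '+'."""
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    if local.startswith("cpu"):
        return "cpu"
    return "unknown"  # e.g. source builds or wheels without a suffix


def installed_torch_build():
    """Inspect the installed torch, or return None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "version": torch.__version__,
        "build": classify_torch_build(torch.__version__),
        "hip": torch.version.hip,  # HIP version string on ROCm builds, None otherwise
        "gpu_visible": torch.cuda.is_available(),
    }
```

On a healthy install, `installed_torch_build()` should report `build` as `rocm`, a non-empty `hip` value, and `gpu_visible` as `True`.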

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack — including fish-speech's PyTorch-SDPA attention path — ships kernels for both, so the standard install above runs natively on gfx1101 with no override. But occasionally a prebuilt extension or third-party wheel only carries gfx1100 kernels and refuses to load on gfx1101 with a "no kernel image is available" error. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python -m tools.api_server \
  --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
  --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
  --decoder-config-name modded_dac_vq

This is a legacy fallback, not a default — the 7800 XT is officially supported as gfx1101 (ROCm system-requirements) and the stable ROCm PyTorch wheel runs natively without it. Only reach for the override if a specific library hits a missing-gfx1101-kernel error.
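The decision described above can be written down as a small helper. This sketch assumes the usual AMDGPU target naming, where the last two characters of a name such as gfx1101 are the minor and stepping digits and the rest is the major version; `gfx_to_version` and `override_for` are hypothetical names, and you have to supply the list of targets a library ships yourself.

```python
def gfx_to_version(arch):
    """Convert a target name like 'gfx1101' to dotted 'major.minor.stepping' form.

    Assumes the last two characters are hexadecimal minor and stepping digits
    and everything before them is the major version.
    """
    digits = arch.removeprefix("gfx").split(":")[0]  # drop feature suffixes if present
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"


def override_for(device_arch, shipped_archs, fallback="gfx1100"):
    """Return an HSA_OVERRIDE_GFX_VERSION value, or None when no override is needed."""
    if device_arch in shipped_archs:
        return None                      # native kernels exist: leave the variable unset
    if fallback in shipped_archs:
        return gfx_to_version(fallback)  # mask the card as the fallback target
    raise ValueError(f"no kernels for {device_arch} or {fallback}")
```

For example, `override_for("gfx1101", {"gfx1100"})` yields the 11.0.0 value used in the command above, while a library that ships both targets yields `None`. For PyTorch itself, `torch.cuda.get_arch_list()` is one place to look for the shipped targets; for a third-party wheel, check its own build notes.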

Codec config name confusion

If you copy-paste an older snippet that uses --decoder-config-name firefly_gan_vq (the pre-OpenAudio codec name), inference fails with a config-load error. S1 Mini requires modded_dac_vq per the official Running Inference docs.

Do not install `flash-attn` or xformers on this card

HF and fish-speech guides written for NVIDIA frequently suggest pip install flash-attn or an xformers wheel. On RDNA3 these are the wrong path: the upstream CK (Composable Kernel) build of Dao-AILab flash-attn targets CDNA/MI accelerators and commonly fails to build on gfx1101, and the ROCm xformers fork is limited. Fish-Speech already routes attention through PyTorch SDPA, which on ROCm dispatches to AOTriton's forward-only Flash backend with no extra install. Leave the attention path alone.

`bf16` / precision flag

The official docs describe a --half flag for "GPUs without bf16 support" to fall back to fp16. The RX 7800 XT (RDNA3) supports bf16 natively (its WMMA units accept BF16), so --half is normally unnecessary — run the default BF16 path. Reach for --half only if a specific op complains about bf16 on your ROCm version.

First `--compile` run is slow or a kernel fails to compile

On ROCm, torch.compile lowers through Triton-ROCm. The first compiled call is slower than on CUDA, and exotic ops can occasionally fail to compile and fall back to eager. If --compile errors outright on your ROCm/Triton combination, drop the flag — eager SDPA inference works without it, and the speed difference for short prompts is small. You can also try PYTORCH_TUNABLEOP_ENABLED=1 to auto-tune GEMM kernels (very slow first run, cached afterward) if you are throughput-bound on long batches.