How much VRAM does Qwen3-TTS need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. The Installation and Running steps below cover it.

Qwen3-TTS 1.7B-Base on RX 7800 XT: Multilingual Voice Cloning in 10 Languages on ROCm

What You'll Build

A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack — clone any voice from a short reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian. At ~4.54 GB of BF16 weights this 1.7B model leaves the 16 GB card with comfortable headroom — no quantization needed.

Hardware data: RX 7800 XT (16 GB VRAM) · weights 4.54 GB on disk · BF16 · PyTorch SDPA on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel and no FlashAttention-2 pre-built wheel here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only per AMD GPUOpen "WMMA on RDNA3"), so an FP8 checkpoint would just upcast to BF16 with no memory saving — and at this model's size you don't need it anyway. The attention path is PyTorch scaled-dot-product attention (SDPA), which the model loads via attn_implementation="sdpa". If a guide tells you to pip install flash-attn and pass attn_implementation="flash_attention_2" for this card, it was written for NVIDIA — on RDNA3 the upstream CK FlashAttention build is CDNA/MI-only and fails on gfx1101.

⚠️ The headline ROCm footgun: decode_window_frames = 72. If you run the compiled / CUDA-graphs streaming backend, set the streaming decode_window_frames to 72. Values 66, 67, and 71 trigger a CUDA-graph capture bug on ROCm that causes a 5–10× slowdown per the Qwen3-TTS ROCm config notes; values 64 and 80 are also reported problematic on some setups. The default basic (non-compiled) inference path below is unaffected, but the moment you enable use_cuda_graphs: true, this is the single most important knob on AMD. See Running and Troubleshooting.

⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). The slug's 12Hz infix and Base suffix both matter — sibling variants live on the same Qwen HF org and are not covered here:

Qwen3-TTS-12Hz-1.7B-CustomVoice — same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct=) per the GitHub README.

Qwen3-TTS-12Hz-1.7B-VoiceDesign — generates a voice from a natural-language persona description rather than a reference clip, per the variant table on the HF model card.

Two 0.6B variants — Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice — are also released with the same language coverage and streaming support per the GitHub README. Lighter footprint; same ROCm install path.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	5 GB free	4.54 GB weights (HF Files tab)
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.12, PyTorch (ROCm build), `ffmpeg`	qwen-tts

The model is released under the Apache-2.0 license per the HF card and the weights are not gated on Hugging Face — no access request or login is required to download them. Free for commercial use.

Installation

1. Create the environment

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux (named explicitly as gfx1101 in the ROCm install-on-linux system-requirements matrix), so it uses the ROCm PyTorch wheel — not a cu12x CUDA wheel:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current Linux/ROCm line in the live PyTorch "Get Started" selector before running. AMD also publishes its own Radeon-tuned wheels at repo.radeon.com if you prefer the AMD-recommended path. (If you ever need the experimental RDNA-3-specific nightly index, https://rocm.nightlies.amd.com/v2/gfx110X-all/ covers the whole RDNA3 family — gfx1100/gfx1101/gfx1102 — so it is the right one for the 7800 XT's gfx1101. On the officially-supported stable wheel above you do not need it.)

Confirm you got the ROCm build, not a CUDA one: python -c "import torch; print(torch.__version__, torch.cuda.is_available())" should print a version with a +rocm7.2-style suffix followed by True (ROCm exposes the AMD GPU through the cuda device namespace via HIP).
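For a check you can keep in a script, here is a minimal sketch. The helper names describe_torch_build and check_installed_torch are hypothetical, and the version strings in the comments are illustrative only. It relies on torch.version.hip, which is None on non-ROCm builds.

```python
def describe_torch_build(version, hip_version):
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    if hip_version is not None or "+rocm" in version:
        return "rocm"
    if "+cu" in version:
        return "cuda"
    return "cpu-or-unknown"


def check_installed_torch():
    """Print what the installed PyTorch build is and whether it sees a GPU."""
    import torch  # imported lazily so describe_torch_build works without torch

    kind = describe_torch_build(torch.__version__, torch.version.hip)
    print(f"torch {torch.__version__} -> {kind} build")
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        # Under ROCm this reports the AMD card through the cuda namespace.
        print("device 0:", torch.cuda.get_device_name(0))
    return kind
```

Calling check_installed_torch() inside the qwen3-tts environment should report a rocm build; anything else suggests the wrong wheel index was used.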

3. Install the qwen-tts package

pip install -U qwen-tts

This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.

4. Download the weights

First run will fetch them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main model.safetensors + 682 MB speech_tokenizer/model.safetensors, both visible on the HF Files tab). To pre-cache:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base

5. (Optional) A community ROCm Docker path

If you'd rather not hand-assemble the ROCm environment, a community fork ships a ROCm-tuned image via Dockerfile.rocm and docker-compose.rocm.yml. This is not an official Qwen or AMD release: the upstream QwenLM/Qwen3-TTS repo provides no ROCm support, and the Docker path below comes from the third-party groxaxo/Qwen3-TTS-Openai-Fastapi fork. Per its README, the ROCm image applies AMD-specific tuning (FLASH_ATTENTION_TRITON_AMD_ENABLE, hipBLASLt preference, TunableOp, and GPU_MAX_HW_QUEUES=1) without changing the default CUDA/CPU paths. Run the commands from a checkout of that fork:

# build
docker build -f Dockerfile.rocm -t qwen3-tts-api:rocm .
# or via compose
docker compose -f docker-compose.rocm.yml up qwen3-tts-rocm

Running

Save the following as clone.py and run it with python clone.py. Note attn_implementation="sdpa" and device_map="cuda:0": under ROCm the cuda namespace is your AMD GPU via HIP.

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",            # ROCm/HIP maps this to the 7800 XT
    dtype=torch.bfloat16,           # BF16 — never FP8/FP4 on RDNA3
    attn_implementation="sdpa",     # PyTorch SDPA, NOT flash_attention_2 on AMD
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"

wavs, sr = model.generate_voice_clone(
    text="Local inference on a Radeon GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")

The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
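To reuse one reference clip across several of the ten languages, the generate_voice_clone call from clone.py can be wrapped in a loop. This is a sketch under assumptions: clone_in_languages and save_all are hypothetical helper names, and model is expected to be the Qwen3TTSModel loaded in clone.py.

```python
def clone_in_languages(model, sentences, ref_audio, ref_text):
    """Run generate_voice_clone once per language.

    `sentences` maps full language names ("English", "German", ...) to the
    text to speak. Returns {language: (waveform, sample_rate)}.
    """
    results = {}
    for language, text in sentences.items():
        wavs, sr = model.generate_voice_clone(
            text=text,
            language=language,
            ref_audio=ref_audio,
            ref_text=ref_text,
        )
        results[language] = (wavs[0], sr)
    return results


def save_all(results, prefix="output"):
    """Write one WAV per language, e.g. output_german.wav."""
    import soundfile as sf  # same writer clone.py uses

    paths = []
    for language, (wav, sr) in results.items():
        path = f"{prefix}_{language.lower()}.wav"
        sf.write(path, wav, sr)
        paths.append(path)
    return paths
```

With the model, ref_audio, and ref_text from clone.py in scope, something like save_all(clone_in_languages(model, {"English": "Hello.", "German": "Hallo."}, ref_audio, ref_text)) would write one file per language.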

If you enable the compiled / CUDA-graphs streaming backend

torch.compile and CUDA graphs work on ROCm, not just CUDA, per the Qwen3-TTS ROCm config notes — and the optimized streaming backend can use them for lower latency. If you turn this on, you must set decode_window_frames to 72 for AMD. A minimal config.yaml optimization block:

optimization:
  compile_mode: max-autotune
  use_cuda_graphs: true
  streaming:
    decode_window_frames: 72   # AMD: use 72 (not 64 or 80). NVIDIA: 64 or 80 also work.
    emit_every_frames: 6

On AMD, decode_window_frames values 66, 67, and 71 trigger a CUDA-graph capture bug → 5–10× slowdown; 72 avoids it. This is the single most common ROCm performance regression for this model — see Troubleshooting.
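A small guard can catch a bad window size before the server starts. The sketch below assumes the config.yaml block has already been parsed into a dict (for example with a YAML loader of your choice); check_decode_window is a hypothetical helper, and the value sets simply restate the numbers quoted above.

```python
KNOWN_BAD_ON_ROCM = {66, 67, 71}   # reported to trigger the capture bug
REPORTED_PROBLEMATIC = {64, 80}    # reported problematic on some setups
RECOMMENDED_ON_ROCM = 72


def check_decode_window(config, on_rocm=True):
    """Return (level, message); level is 'ok', 'warn', or 'bad'."""
    opt = config.get("optimization", {})
    if not opt.get("use_cuda_graphs", False):
        return "ok", "CUDA graphs are off; the window size is not the critical knob"
    frames = opt.get("streaming", {}).get("decode_window_frames")
    if not on_rocm:
        return "ok", f"not on ROCm; decode_window_frames={frames} is outside this check"
    if frames == RECOMMENDED_ON_ROCM:
        return "ok", "decode_window_frames=72 is the recommended ROCm value"
    if frames in KNOWN_BAD_ON_ROCM:
        return "bad", f"decode_window_frames={frames} is a known slow value on ROCm; use 72"
    if frames in REPORTED_PROBLEMATIC:
        return "warn", f"decode_window_frames={frames} is reported problematic on some setups; prefer 72"
    return "warn", f"decode_window_frames={frames} is not a value covered here; 72 is the known-good one"


example = {
    "optimization": {
        "use_cuda_graphs": True,
        "streaming": {"decode_window_frames": 71, "emit_every_frames": 6},
    }
}
level, message = check_decode_window(example)
print(level, "-", message)
```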

Results

Speed: Omitted. No verifiable Qwen3-TTS benchmark that names the RX 7800 XT was found, and a number measured on another vendor's architecture (e.g. an NVIDIA RTX card) does not transfer to RDNA3, so this recipe quotes no speed figure rather than borrow one from a different card. Even another RDNA3 card's number would mislead: the 7800 XT has roughly two-thirds of the 7900 XTX's memory bandwidth (624 vs 960 GB/s) and fewer WMMA units. If you've measured Qwen3-TTS real-time factor (RTF) on a 7800 XT, please contribute it so it lands on /check/qwen3tts/rx-7800-xt. As a general note, generation is autoregressive, so each step waits on the previous one, and RTF can sit above 1.0 (synthesis takes longer than the audio it produces) even on capable GPUs.
VRAM usage: The weights are 4.54 GB on disk (3.86 GB main + 682 MB speech tokenizer, HF Files tree). Resident BF16 weights plus KV cache and activations keep the inference working set in single-digit GB, well within the 16 GB 7800 XT. The 8 GB minimum listed under Requirements is a conservative floor; this card has ample headroom. See /check/qwen3tts/rx-7800-xt for any community-submitted measurement.
Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
License: Apache-2.0 per the HF card (ungated — weights download freely). Free for commercial use.
No quantization tradeoff on this card: run the native BF16 weights. There is no FP8/FP4 path on RDNA3 and none is needed for a ~4.5 GB model on a 16 GB card.
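The headroom arithmetic behind these figures can be written out as a short sketch. It assumes the on-disk sizes approximate the resident BF16 weights and ignores KV cache and activations, which this recipe does not measure; vram_budget is a hypothetical helper.

```python
def vram_budget(card_gb=16.0, main_gb=3.86, tokenizer_gb=0.682, floor_gb=8.0):
    """Sum the two weight files and compare against the card and the 8 GB floor."""
    weights_gb = round(main_gb + tokenizer_gb, 3)
    return {
        "weights_gb": weights_gb,
        "free_after_weights_gb": round(card_gb - weights_gb, 3),
        "free_at_floor_gb": round(card_gb - floor_gb, 3),
    }


budget = vram_budget()
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Even if the full 8 GB floor were in use, half of the card would remain free under these assumptions.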

For the live, measured benchmark data on this exact card, see /check/qwen3tts/rx-7800-xt.
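If you want to measure real-time factor yourself, the sketch below times one call and divides by the duration of the audio it returns, so values above 1.0 mean synthesis is slower than playback. The helper names are hypothetical, and the sketch assumes the generate call returns (wavs, sample_rate) as in clone.py, with the waveform already on the host when the call returns.

```python
import time


def real_time_factor(generation_seconds, num_samples, sample_rate):
    """RTF = time spent generating / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return generation_seconds / audio_seconds


def timed_rtf(generate, *args, **kwargs):
    """Time one generate(...) call and return its RTF."""
    start = time.perf_counter()
    wavs, sr = generate(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wavs[0]), sr)
```

A call such as timed_rtf(model.generate_voice_clone, text=..., language="English", ref_audio=ref_audio, ref_text=ref_text) would give one sample; discarding the first run and averaging several is usually fairer, since the first call includes warm-up.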

Tradeoffs vs. siblings

Variant	What you get	When to choose
`Qwen3-TTS-12Hz-1.7B-Base` (this recipe)	Zero-shot voice cloning from a short reference clip	You want to clone arbitrary voices
`Qwen3-TTS-12Hz-1.7B-CustomVoice`	9 curated premium speakers + natural-language style control	You want preset voices without supplying reference audio
`Qwen3-TTS-12Hz-0.6B-Base` / `0.6B-CustomVoice`	Lighter footprint, same 10 languages and streaming	You're packing other models alongside on a tight VRAM budget

Troubleshooting

Generation suddenly runs 5–10× slower with the compiled backend — fix `decode_window_frames`

This is the headline ROCm footgun for Qwen3-TTS. When use_cuda_graphs: true is enabled, decode_window_frames values 66, 67, and 71 trigger a CUDA-graph capture bug on ROCm that causes a 5–10× slowdown per the Qwen3-TTS ROCm config notes. Set it to 72 (values 64 and 80 are also reported problematic on some setups). If you don't run the compiled/streaming backend at all, you won't hit this — but it's the first thing to check the moment AMD throughput collapses.

"Torch not compiled with CUDA enabled" / no GPU detected

A CUDA build of PyTorch got installed instead of the ROCm build. Uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the suffix is +rocm... and torch.cuda.is_available() is True (HIP exposes the AMD GPU under the cuda namespace).

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101 with a "no kernel image is available" / missing-gfx1101-kernel error. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python clone.py

This is a legacy fallback, not a default — the 7800 XT is an officially-supported gfx1101 card and the stable ROCm wheel runs natively on it without the override. Only reach for it if a specific library refuses to load on gfx1101.
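To keep the override scoped to a single run instead of exporting it in your shell, a launcher can set it only for the child process. This is a sketch with hypothetical helper names; it does the same thing as the one-line command above.

```python
import os
import subprocess
import sys


def env_with_gfx_override(base_env=None, version="11.0.0"):
    """Return a copy of the environment with HSA_OVERRIDE_GFX_VERSION set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env


def run_with_override(script="clone.py"):
    """Run the script in a child process that sees the override."""
    return subprocess.run(
        [sys.executable, script],
        env=env_with_gfx_override(),
        check=False,
    )
```

The parent shell and other ROCm programs never see the variable, so the card keeps running natively as gfx1101 everywhere else.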

Do not install flash-attn or pass `attn_implementation="flash_attention_2"`

HF and NVIDIA-oriented guides recommend pip install flash-attn and attn_implementation="flash_attention_2". On RDNA3 that's the wrong path: the upstream CK FlashAttention build is CDNA/MI-only and fails to compile on gfx1101. Use attn_implementation="sdpa" — PyTorch SDPA is the supported attention path on ROCm and needs no extra install.

Don't reach for FP8 to save memory

An FP8 safetensors checkpoint loaded on RDNA3 is upcast to BF16 (RDNA3 WMMA has no FP8 hardware per AMD GPUOpen), so it costs the same memory and gains no compute acceleration. For a ~4.5 GB model on a 16 GB card there is no memory pressure anyway: run the native BF16 weights.

Language code rejected

The language argument expects full names ("English", "French", "Japanese") — short codes like "en" or "fr" raise an error.
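If your application stores short codes, one option is to translate them before calling the model. The mapping below is an assumption of this sketch, not part of qwen-tts: it pairs common ISO 639-1 codes with the ten full names listed in this recipe, and to_language_name is a hypothetical helper.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# Assumed convenience mapping from ISO 639-1 codes to the full names.
CODE_TO_NAME = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def to_language_name(value):
    """Accept a full name or a short code and return the full name."""
    key = value.strip().lower()
    for name in SUPPORTED_LANGUAGES:
        if key == name.lower():
            return name
    if key in CODE_TO_NAME:
        return CODE_TO_NAME[key]
    raise ValueError(
        f"unsupported language {value!r}; expected one of {', '.join(SUPPORTED_LANGUAGES)}"
    )
```

Passing language=to_language_name("fr") then hands the model "French" rather than the rejected short code.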

If you hit a problem not covered here, please report it via our submission form so the next reader benefits.