What You'll Build
A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 4080 — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.
Hardware data: RTX 4080 (16GB, Ada Lovelace sm_89) · weights 4.54 GB on disk · ~5 GB VRAM idle, climbing toward ~8 GB peak during inference per archy.net and the Qwen-canonical discussion thread. Both figures were measured on an RTX 3090; they should carry over because the autoregressive workload's VRAM footprint is dominated by the BF16 weights, not by anything architecture-specific. See benchmark data.
⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:
- Qwen3-TTS-12Hz-1.7B-CustomVoice: same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.
- Qwen3-TTS-12Hz-1.7B-VoiceDesign: generates a voice from a natural-language persona description (rather than a reference clip), per the variant table on the HF model card.
- Two 0.6B variants, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice, are also released by the Qwen team with the same language coverage and streaming support per the GitHub README. Lighter footprint; same install path.
ℹ️ No Blackwell cu128 override needed. The RTX 4080 is Ada Lovelace sm_89: FlashAttention-2 pre-built wheels include full sm_89 kernel coverage per the FlashAttention README, and the stock pip install torch already ships Ada kernels. The Blackwell sm_120 gap tracked in Dao-AILab/flash-attention#2168 does not apply to this card; keep attn_implementation="flash_attention_2" from the canonical HF card verbatim.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM, BF16-capable (Ampere or newer) | RTX 4080 (16GB, Ada sm_89) |
| RAM | 16 GB system | — |
| Storage | 5 GB free | 4.54 GB weights (HF Files tab) |
| Software | Python 3.12, PyTorch with CUDA, ffmpeg | qwen-tts (PyPI) |
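The software and storage rows above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name `preflight` and the choice to test the current directory for free space are illustrative assumptions (the weights actually land in the Hugging Face cache, so point `path` there if it lives on another disk).

```python
import shutil
import sys

REQUIRED_FREE_GB = 5  # "Storage" minimum from the requirements table


def preflight(path=".", required_gb=REQUIRED_FREE_GB):
    """Return simple pass/fail checks for the software and storage rows."""
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal GB
    return {
        "python_3_12": sys.version_info[:2] == (3, 12),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "enough_disk": free_gb >= required_gb,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'check this'}")
```

GPU checks are left out here on purpose, since they need PyTorch, which is only installed in a later step.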
Installation
1. Create the environment
Per the official Qwen3-TTS README and the HF model card:
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
2. Install PyTorch with CUDA
The default PyTorch wheel already includes sm_89 kernels (Ada Lovelace), so the stock command from the official guides works directly:
pip install -U torch
For an explicit CUDA-version pin matching the archy.net walkthrough (the guide targets an RTX 3090 but the wheel works equally on the 4080):
pip install -U torch --index-url https://download.pytorch.org/whl/cu121
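To confirm that the installed wheel actually sees the card, a quick check along these lines may help. It is a sketch, not part of the official guides: the helper name `describe_cuda` is hypothetical, and it returns None when PyTorch or a CUDA device is missing instead of raising.

```python
def describe_cuda():
    """Return basic facts about GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return {
        "name": torch.cuda.get_device_name(0),
        "compute_capability": f"sm_{major}{minor}",
        "bf16": torch.cuda.is_bf16_supported(),
        "cuda_build": torch.version.cuda,  # CUDA version the wheel was built for
    }


print(describe_cuda())
```

On an RTX 4080 the compute capability should read sm_89, matching the Ada Lovelace figure quoted above; a result of None suggests a CPU-only wheel or a driver problem.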
3. Install the qwen-tts package
pip install -U qwen-tts
This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.
4. Install FlashAttention 2 (recommended on Ada)
The HF card recommends FlashAttention 2 to reduce GPU memory usage:
pip install -U flash-attn --no-build-isolation
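The FlashAttention README explicitly lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported by the pre-built wheels, so the RTX 4080 (Ada sm_89) is covered. If pip falls back to compiling flash-attn from source, the build can exhaust system RAM: the README flags machines with less than 96 GB of RAM and many CPU cores as at risk, because too many compile jobs run in parallel. In that case, cap the job count: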
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
5. Download the weights
First run will fetch them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main DiT + 682 MB speech_tokenizer/model.safetensors, both visible on the HF Files tab). To pre-cache:
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
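To see whether the pre-cache step has already run, you can look for the repository folder in the Hugging Face cache. The sketch below assumes the standard hub cache layout (`models--<org>--<name>/snapshots/...`) and the `HF_HUB_CACHE` / `HF_HOME` environment variables; it ignores `XDG_CACHE_HOME`, and the helper names are illustrative.

```python
import os
from pathlib import Path

REPO_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-Base"


def hub_cache_dir(env=None):
    """Resolve the hub cache directory from HF_HUB_CACHE, HF_HOME, or the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def cached_repo_dir(repo_id=REPO_ID, env=None):
    """Folder the hub client uses for a model repo, e.g. models--Qwen--<name>."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id=REPO_ID, env=None):
    """True if at least one snapshot of the repo exists locally."""
    snapshots = cached_repo_dir(repo_id, env) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


print(cached_repo_dir(), "cached" if is_cached() else "not cached yet")
```

This only checks that a snapshot folder exists, not that every file finished downloading; re-running the download command is the safer way to repair a partial cache.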
Running
Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"

wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")
The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Results
- Speed: A long-form benchmark by AWS Principal Solutions Architect Gary A. Stafford on an RTX 4080 SUPER (16 GB) under Windows 11 reports that the 1.7B-Base model produced 15 minutes of clean audio in 23 minutes for a 16,508-character story, an RTF of 1.56x as reported (the rounded minute figures alone work out to roughly 1.5x) (Medium write-up). The RTX 4080 SUPER is a close Ada sm_89 sibling of the RTX 4080, with the same 16 GB envelope, the same architecture, and roughly 4% more compute, so expect very slightly slower numbers (marginally higher RTF) on the plain RTX 4080. This is a single-author Tier-A measurement; once a first-party RTX 4080 benchmark lands via /contribute, the live figure appears at /check/qwen3tts/rtx-4080.
- VRAM usage: ~5 GB idle, climbing toward ~8 GB peak. The 5 GB figure is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB peak is reported by user Geximus in the Qwen-canonical discussion thread, also on an RTX 3090. Both fit comfortably on the 16 GB RTX 4080. The numbers should transfer to the 4080 because the dominant cost is the resident BF16 weights (4.54 GB on disk); the remaining ~3 GB at peak is KV cache plus activations, which depends on the autoregressive decoder and not on the GPU architecture. Once an RTX 4080 benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-4080.
- Languages: 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian), per the HF model card.
- License: Apache-2.0 per the HF card (ungated — weights download freely). Free for commercial use.
For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-4080.
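If you want to compare your own card against the speed figure above, RTF can be computed as wall-clock generation time divided by the duration of the audio produced. The helpers below are a small sketch with hypothetical names; the numbers in the final lines are made up for illustration and are not a measurement.

```python
import time


def audio_seconds(num_samples, sample_rate):
    """Duration of a mono clip given its sample count and sample rate."""
    return num_samples / sample_rate


def real_time_factor(wall_seconds, clip_seconds):
    """RTF > 1.0 means generation took longer than the audio lasts."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return wall_seconds / clip_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Illustrative numbers only: 30 s of compute for a 20 s clip.
print(real_time_factor(30.0, 20.0))
```

With the script from the Running section, one way to use this would be `(wavs, sr), elapsed = timed(model.generate_voice_clone, ...)` followed by `real_time_factor(elapsed, audio_seconds(len(wavs[0]), sr))`.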
Tradeoffs vs. siblings
| Variant | What you get | When to choose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base (this recipe) | Zero-shot voice cloning from a 3-second clip | You want to clone arbitrary voices |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 curated premium speakers + natural-language style control | You want production-grade preset voices without supplying reference audio |
| Qwen3-TTS-12Hz-0.6B-Base / 0.6B-CustomVoice | Lighter footprint, same 10 languages and streaming | You're packing other models alongside on a tight VRAM budget |
Troubleshooting
flash-attn build runs out of RAM
The pip install flash-attn --no-build-isolation step launches a heavy CUDA compile that can exhaust system memory. The HF card recommends MAX_JOBS=4 pip install -U flash-attn --no-build-isolation for machines with less than 96 GB of RAM. If you still hit OOM, drop to MAX_JOBS=2 or skip FlashAttention 2 entirely — the model also runs on attn_implementation="eager", and one RTX 4090 user in the Qwen discussion thread (anujchopra) reports eager attention as "same ( or somewhat better )" speed than FlashAttention 2 for this model on Ada.
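The fallback described above can be automated so the same script runs whether or not the flash-attn build succeeded. This is a sketch under one assumption: that an importable `flash_attn` package means the FlashAttention 2 path is usable. The helper name is hypothetical.

```python
import importlib.util


def pick_attn_implementation():
    """Use FlashAttention 2 when the package is importable, else eager attention."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


print(pick_attn_implementation())
```

Pass the returned string as attn_implementation= in the from_pretrained call from the Running section.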
Generation is slower than RTF 1.0 (audio takes longer than realtime)
Expected behaviour, not a bug. Multiple users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 12-16%. The model is autoregressive and compute-light, so it leaves most of the GPU idle; the official QwenLM team is tracking speedup work on the GitHub repo. Keep dtype=torch.bfloat16, and when synthesising several sentences in the same voice, build the reference prompt once with create_voice_clone_prompt() and reuse it to amortise the reference-encoding pass.
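The idea of amortising the reference-encoding pass with create_voice_clone_prompt() can be sketched as below. Treat the keyword names as assumptions: `ref_audio=` and `ref_text=` mirror the generate_voice_clone call in the Running section, while `voice_clone_prompt=` is an assumed parameter name that should be checked against the installed qwen-tts version. The wrapper `clone_sentences` is hypothetical and works with any object exposing those two methods.

```python
def clone_sentences(model, sentences, ref_audio, ref_text, language="English"):
    """Encode the reference clip once, then reuse it for every sentence."""
    # Assumed keyword names; verify against your qwen-tts version.
    prompt = model.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text)
    clips = []
    sample_rate = None
    for sentence in sentences:
        wavs, sample_rate = model.generate_voice_clone(
            text=sentence,
            language=language,
            voice_clone_prompt=prompt,  # assumed parameter name
        )
        clips.append(wavs[0])
    return clips, sample_rate
```

This does not change the per-sentence decoding cost, which is where most of the time goes; it only avoids re-encoding the reference clip on every call.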
Voice cloning without a reference transcript
The archy.net guide documents an x_vector_only_mode=True flag that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
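One way to apply this rule consistently is to build the call arguments in a small helper that switches modes depending on whether a transcript is available. This is a sketch: the helper name `clone_kwargs` is hypothetical, and passing x_vector_only_mode directly to generate_voice_clone is an assumption based on the flag described above.

```python
def clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build keyword arguments for generate_voice_clone.

    Uses the transcript when one is supplied; otherwise falls back to
    x_vector_only_mode=True (assumed to be accepted by generate_voice_clone).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text and ref_text.strip():
        kwargs["ref_text"] = ref_text
    else:
        kwargs["x_vector_only_mode"] = True
    return kwargs


print(clone_kwargs("Hello there.", "English", "speaker.wav"))
```

Usage would then be `model.generate_voice_clone(**clone_kwargs(...))`, where `speaker.wav` stands in for your own reference clip.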
Language code rejected
The archy.net guide flags that the language argument expects full names ("English", "french", "japanese") — short codes like "en" or "fr" raise an error.
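If your application stores short codes, a small lookup can translate them before calling the model. The mapping below is an illustrative sketch: the ten names come from the language list in this recipe, capitalised as in the Running example, and the two-letter keys are standard ISO 639-1 codes; the helper name is hypothetical.

```python
SUPPORTED = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

ISO_TO_NAME = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalize_language(value):
    """Map a short code or any-case full name to a supported full name."""
    key = value.strip().lower()
    if key in ISO_TO_NAME:
        return ISO_TO_NAME[key]
    for name in SUPPORTED:
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language: {value!r}")


print(normalize_language("fr"))
```

Failing early with a clear ValueError here is a design choice: it surfaces a bad language value before the model is invoked.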
If you hit a problem not covered here, please report it via our submission form so the next reader benefits.