What You'll Build
A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 5060 Ti — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.
Hardware data: RTX 5060 Ti (16 GB VRAM) · weights ~4.54 GB on disk · runtime ~5 GB VRAM idle, climbing toward ~8 GB peak during inference (both measured on an RTX 3090: idle per the archy.net walkthrough, peak per the Qwen-canonical discussion thread) · See benchmark data
⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Two sibling variants exist on the same Qwen HF org and are not covered here:
- Qwen3-TTS-12Hz-1.7B-CustomVoice: same parameter count and same runtime VRAM envelope, but ships 9 pre-defined premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) instead of clone-from-reference. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with a speaker= argument).
- A 0.6B variant is referenced on the Qwen3-TTS GitHub README. It has a lighter footprint, but the official Qwen org's published 12Hz checkpoints on Hugging Face are currently 1.7B-only, so the 0.6B is out of scope until a Qwen/Qwen3-TTS-12Hz-0.6B-Base repo lands.
⚠️ Blackwell + FlashAttention 2. The RTX 5060 Ti is a Blackwell card (compute capability sm_120). FlashAttention 2 pre-built wheels do not include sm_120 kernels as of early 2026 (Dao-AILab/flash-attention#2168), and attn_implementation="flash_attention_2" will fail with "no kernel image is available for execution on the device". The instructions below use eager attention, which a 4090 user in the same Qwen discussion reported as "same or better speed" than FA2 for this model.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM, BF16-capable (Ampere or newer) | RTX 5060 Ti (16 GB) |
| RAM | 16 GB system | — |
| Storage | 5 GB free | 4.54 GB weights (HF Files tab) |
| Software | Python 3.12, PyTorch with CUDA, ffmpeg | qwen-tts (PyPI) |
Installation
1. Create the environment
Per the official Qwen3-TTS README and the HF model card:
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
2. Install PyTorch with CUDA
For RTX 5060 Ti (Blackwell, sm_120), use a recent CUDA 12.8+ wheel. The stock CUDA 12.1 build will fail with "no kernel image is available for execution on the device":
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128
(The community deploy guide uses cu121 because the target was an RTX 3090. Substitute cu128 for Blackwell.)
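To confirm that the installed wheel actually carries sm_120 kernels before loading the model, a quick check along these lines can help. This is a minimal sketch: build_has_native_kernels and describe_local_gpu are hypothetical helper names, while torch.cuda.get_device_capability() and torch.cuda.get_arch_list() are standard PyTorch calls. It only looks for an exact sm_XY entry and does not account for PTX forward compatibility.

```python
def build_has_native_kernels(capability, arch_list):
    """Return True if arch_list names native kernels for this capability.

    capability is a (major, minor) tuple, e.g. (12, 0) for sm_120.
    arch_list is a list of strings such as ["sm_80", "sm_90", "sm_120"].
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_local_gpu():
    """Print the device capability and the arch list of the local PyTorch build."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        print("CUDA is not available to this PyTorch build")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print(f"device capability: sm_{capability[0]}{capability[1]}")
    print(f"build arch list:   {arch_list}")
    print(f"native kernels:    {build_has_native_kernels(capability, arch_list)}")
```

Calling describe_local_gpu() inside the activated environment prints the three values; a False on the last line suggests the wheel came from the wrong index.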
3. Install the qwen-tts package
pip install -U qwen-tts
This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.
4. (Skip on RTX 5060 Ti) FlashAttention 2
The HF card suggests pip install -U flash-attn --no-build-isolation. Skip this on Blackwell — see the warning above. The model defaults to eager / SDPA attention without it.
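If the same script has to run on both Blackwell and older cards, the attention backend can be chosen at runtime instead of being hard-coded. The sketch below is illustrative only: pick_attn_implementation is a hypothetical helper, and the rule "compute capability 12.x falls back to eager" is an assumption drawn from the warning above, not something the qwen-tts package enforces.

```python
import importlib.util


def pick_attn_implementation(capability, flash_attn_installed=None):
    """Choose an attn_implementation string for from_pretrained().

    capability is a (major, minor) tuple such as (12, 0) for sm_120.
    flash_attn_installed can be passed explicitly; when None it is detected.
    """
    if flash_attn_installed is None:
        # find_spec returns None when the package is not importable.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    major, _minor = capability
    # Assumption from this guide: prebuilt FA2 wheels lack sm_120 kernels,
    # so major version 12 and above falls back to eager attention.
    if major >= 12 or not flash_attn_installed:
        return "eager"
    return "flash_attention_2"
```

The returned string can be passed as attn_implementation=... when loading the model, with the capability tuple taken from torch.cuda.get_device_capability().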
5. Download the weights
First run will fetch them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main DiT + 682 MB speech_tokenizer/model.safetensors, both visible on the HF Files tab). To pre-cache:
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
Running
Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="eager",  # NOT flash_attention_2 on Blackwell
)
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"
wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU: clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")
The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Results
- VRAM usage: ~5 GB idle, climbing toward ~8 GB peak. The 5 GB figure is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB peak is reported by user Geximus in the Qwen-canonical discussion thread, also on an RTX 3090. Both fit comfortably on the 5060 Ti's 16 GB. Once a 5060 Ti benchmark lands, the live number appears at /check/qwen3tts/rtx-5060-ti.
- Generation latency: The archy.net guide reports "under 10 seconds for typical phrases" on RTX 3090. Multiple users in the Qwen discussion report RTF 2-4x (i.e. 2-4 seconds of compute per second of audio) across RTX 3090 / 4090 / 5090. Expect similar latency on the 5060 Ti: the model is compute-light and the bottleneck is autoregressive decoding, not raw throughput.
- Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
- License: Apache-2.0 (both Base and CustomVoice HF cards). Free for commercial use.
For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-5060-ti.
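The real-time factor quoted above is just compute time divided by audio duration, so it is easy to measure on your own card. The following is a small sketch with hypothetical helper names (real_time_factor, estimate_compute_seconds); the 2.0 and 4.0 defaults are the range reported in the discussion thread, not a measurement on the 5060 Ti.

```python
def real_time_factor(compute_seconds, num_samples, sample_rate):
    """RTF = seconds of compute per second of generated audio."""
    audio_seconds = num_samples / sample_rate
    return compute_seconds / audio_seconds


def estimate_compute_seconds(audio_seconds, rtf_low=2.0, rtf_high=4.0):
    """Return the (best, worst) expected wall-clock time for a clip."""
    return audio_seconds * rtf_low, audio_seconds * rtf_high


# Example with made-up numbers: 10 s of compute for 120000 samples at 24000 Hz
# is 5 s of audio, so the RTF is 2.0.
example_rtf = real_time_factor(10.0, 120_000, 24_000)
```

To measure a real run, wrap the generate_voice_clone(...) call in two time.perf_counter() readings and pass the difference together with len(wavs[0]) and sr.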
Tradeoffs vs. siblings
| Variant | What you get | When to choose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base (this recipe) | Zero-shot voice cloning from a 3-second clip | You want to clone arbitrary voices |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 curated premium speakers + natural-language style control | You want production-grade preset voices without supplying reference audio |
Troubleshooting
no kernel image is available for execution on the device
Your PyTorch build doesn't include kernels for the 5060 Ti's sm_120 compute capability, or you tried flash_attention_2. Fix: reinstall PyTorch from the cu128 index (pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128) and use attn_implementation="eager". The issue is documented at Dao-AILab/flash-attention#2168.
Generation is slower than RTF 1.0 (audio takes longer than realtime)
Expected behaviour, not a bug. Multiple users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 12-16%. The model is autoregressive and compute-light; the official QwenLM team has opened issue #89 on the GitHub repo to track speedup work. Keep dtype=torch.bfloat16 and pre-batch multiple sentences with create_voice_clone_prompt() to amortise the reference-encoding pass.
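The batching advice can be sketched as follows. Treat the call shapes as assumptions to verify against the installed qwen-tts version: the sketch assumes create_voice_clone_prompt() accepts ref_audio and ref_text keywords, and that generate_voice_clone() accepts lists for text and language plus a voice_clone_prompt keyword. synthesize_batch is a hypothetical wrapper name.

```python
def synthesize_batch(model, sentences, language, ref_audio, ref_text):
    """Encode the reference clip once, then synthesise every sentence with it.

    model is an already-loaded Qwen3TTSModel (or anything with the same methods).
    """
    sentences = list(sentences)
    # Assumed signature: builds a reusable prompt from the reference clip.
    prompt = model.create_voice_clone_prompt(
        ref_audio=ref_audio,
        ref_text=ref_text,
    )
    # Assumed: list inputs produce one waveform per sentence.
    wavs, sr = model.generate_voice_clone(
        text=sentences,
        language=[language] * len(sentences),
        voice_clone_prompt=prompt,
    )
    return wavs, sr
```

The point of the design is that the reference audio is processed a single time regardless of how many sentences follow, which is where the amortisation comes from.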
Voice cloning without a reference transcript
The archy.net guide documents an x_vector_only_mode=True flag that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
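One way to handle both cases in a single code path is to build the keyword arguments conditionally. This is a sketch under the assumption that x_vector_only_mode is accepted as a keyword by generate_voice_clone(), as the guide's description suggests; clone_kwargs is a hypothetical helper.

```python
def clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build keyword arguments for a voice-clone call.

    With a transcript, pass it as ref_text; without one, fall back to
    x_vector_only_mode=True (assumed keyword, per the archy.net guide).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        kwargs["ref_text"] = ref_text
    else:
        kwargs["x_vector_only_mode"] = True
    return kwargs
```

Usage would then look like model.generate_voice_clone(**clone_kwargs("Hello.", "English", "speaker.wav")), where speaker.wav is a placeholder path.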
Language code rejected
The archy.net guide flags that the language argument expects full names ("English", "french", "japanese") — short codes like "en" or "fr" raise an error.
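If language values arrive from user input or config files, a small normaliser in front of the model call avoids that error. The sketch below is not part of qwen-tts: normalise_language and both lookup tables are hypothetical, the short codes are ordinary ISO 639-1 codes, and the capitalised spelling follows the "English" used in clone.py (whether the library itself is case-sensitive is not stated here).

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# ISO 639-1 codes mapped to the full names listed on the model card.
SHORT_CODES = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalise_language(value):
    """Map a short code or loosely formatted name to a full language name."""
    key = value.strip().lower()
    if key in SHORT_CODES:
        return SHORT_CODES[key]
    for name in SUPPORTED_LANGUAGES:
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language: {value!r}")
```

Failing early with a ValueError keeps the problem in your own code rather than surfacing it from inside the generation call.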
If you hit a problem not covered here, please report it via our submission form so the next reader benefits.