What You'll Build
A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 4060 Ti 16GB — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.
Hardware data: RTX 4060 Ti 16GB (Ada Lovelace, sm_89) · weights 4.54 GB on disk · ~5 GB VRAM idle, climbing toward ~8 GB peak during inference, per archy.net and the Qwen-canonical discussion thread. Both figures were measured on an RTX 3090; they should carry over because the VRAM footprint of this autoregressive workload is dominated by the BF16 weights rather than by anything architecture-specific. See benchmark data.
⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:
- Qwen3-TTS-12Hz-1.7B-CustomVoice: same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.
- Two 0.6B variants, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice, are also released by the Qwen team with the same language coverage and streaming support per the GitHub README. Lighter footprint; same install path.
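As a rough illustration of how the CustomVoice call differs from the Base call, the sketch below wraps generate_custom_voice(...) with the speaker= and instruct= arguments named above. The wrapper name speak_preset is hypothetical, and the text= and language= keyword names and the (wavs, sr) return shape are assumptions carried over from the Base example in this recipe; check the GitHub README before relying on them.

```python
# Speaker names as listed above for the 1.7B-CustomVoice checkpoint.
PRESET_SPEAKERS = {
    "Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric",
    "Ryan", "Aiden", "Ono_Anna", "Sohee",
}


def speak_preset(model, text, speaker, language="English", instruct=None):
    """Hypothetical wrapper: validate the speaker, then call the model.

    `model` is assumed to be a loaded CustomVoice checkpoint exposing
    generate_custom_voice(); text= and language= are assumed keyword names.
    """
    if speaker not in PRESET_SPEAKERS:
        raise ValueError(
            f"unknown speaker {speaker!r}; expected one of {sorted(PRESET_SPEAKERS)}"
        )
    kwargs = {"text": text, "language": language, "speaker": speaker}
    if instruct is not None:
        # Natural-language style control is optional.
        kwargs["instruct"] = instruct
    return model.generate_custom_voice(**kwargs)
```

Validating the speaker name locally gives a readable error before any GPU work starts.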
ℹ️ No Blackwell cu128 override needed. Unlike the 5060 Ti sibling recipe, the 4060 Ti 16GB is Ada Lovelace sm_89: FlashAttention-2 pre-built wheels include full sm_89 kernel coverage per the FlashAttention README, and the stock pip install torch already ships Ada kernels. The Blackwell sm_120 gap tracked in Dao-AILab/flash-attention#2168 does not apply to this card; keep attn_implementation="flash_attention_2" from the canonical HF card verbatim.
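To check which case a given card falls into, a small helper can classify the CUDA compute capability. This is a minimal sketch; the function names are hypothetical, and the mapping covers only the architectures mentioned in this recipe plus their immediate neighbours.

```python
def gpu_family(capability):
    """Map a (major, minor) CUDA compute capability to a family label."""
    major, minor = capability
    if (major, minor) == (8, 9):
        return "Ada Lovelace"      # e.g. RTX 4060 Ti, RTX 4090
    if major == 8:
        return "Ampere"            # e.g. A100 (8.0), RTX 3090 (8.6)
    if major == 9:
        return "Hopper"            # e.g. H100 (9.0)
    if major in (10, 12):
        return "Blackwell"         # e.g. RTX 5060 Ti / 5090 (12.0)
    return "other"


def needs_blackwell_override(capability):
    """True only for sm_120, the gap that does not affect Ada cards."""
    return tuple(capability) == (12, 0)
```

On a machine with PyTorch installed, torch.cuda.get_device_capability(0) returns the tuple to pass in; a 4060 Ti should report (8, 9).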
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM, BF16-capable (Ampere or newer) | RTX 4060 Ti 16GB (Ada sm_89) |
| RAM | 16 GB system | — |
| Storage | 5 GB free | 4.54 GB weights (HF Files tab) |
| Software | Python 3.12, PyTorch with CUDA, ffmpeg | qwen-tts (PyPI) |
Installation
1. Create the environment
Per the official Qwen3-TTS README and the HF model card:
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
2. Install PyTorch with CUDA
The default PyTorch wheel already includes sm_89 kernels (Ada Lovelace), so the stock command from the official guides works directly:
pip install -U torch
For an explicit CUDA-version pin matching the archy.net walkthrough (the guide targets an RTX 3090 but the wheel works equally on the 4060 Ti):
pip install -U torch --index-url https://download.pytorch.org/whl/cu121
3. Install the qwen-tts package
pip install -U qwen-tts
This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.
4. Install FlashAttention 2 (recommended on Ada)
The HF card recommends:
pip install -U flash-attn --no-build-isolation
The FlashAttention README explicitly lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported, so the RTX 4060 Ti 16GB (Ada sm_89) is covered. If pip ends up compiling flash-attn from source, the parallel build can exhaust system memory on machines with less than 96 GB of RAM; in that case, cap the number of parallel jobs:
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
5. Download the weights
The first run fetches them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main model weights + 682 MB speech_tokenizer/model.safetensors, both visible on the HF Files tab). To pre-cache:
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
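Before downloading, it can help to confirm the target disk has room. The sketch below adds up the two file sizes quoted above and compares them with free space; the helper names and the 0.5 GB safety margin are assumptions for illustration, not part of any official tooling.

```python
import shutil

MAIN_WEIGHTS_GB = 3.86        # main model weights, per the HF Files tab
SPEECH_TOKENIZER_GB = 0.682   # speech_tokenizer/model.safetensors


def total_download_gb():
    """Total size of the two weight files quoted in this recipe."""
    return MAIN_WEIGHTS_GB + SPEECH_TOKENIZER_GB


def has_room(path=".", margin_gb=0.5):
    """True if the filesystem holding `path` can fit the weights plus a margin."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb() + margin_gb


print(f"need ~{total_download_gb():.2f} GB, enough room here: {has_room()}")
```

Point path at the filesystem that holds your Hugging Face cache rather than the current directory if the two differ.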
Running
Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the Base (voice-cloning) checkpoint in BF16 on the first GPU.
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Reference clip and its transcript (official sample from the HF model card).
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"

# Synthesise a new sentence in the reference speaker's voice.
wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU: clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

# Write the first (only) waveform to disk at the model's sample rate.
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")
The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Results
- VRAM usage: ~5 GB idle, climbing toward ~8 GB peak. The 5 GB figure is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB peak is reported by user Geximus in the Qwen-canonical discussion thread on the same RTX 3090. Both fit comfortably on the 4060 Ti 16GB. The numbers transfer cleanly to Ada because the dominant cost is the resident BF16 weights (4.54 GB on disk); the residual ~3 GB of peak headroom is KV cache + activations and is autoregressive-decoder bound rather than arch-bound. Once a 4060 Ti 16GB benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-4060-ti-16gb.
- Generation latency: The archy.net guide reports "under 10 seconds for typical phrases" on RTX 3090. Multiple users in the Qwen discussion report RTF 2-4x (i.e. 2-4 seconds of compute per second of audio) across RTX 3090 (Geximus, RTF x3), RTX 4090 (anujchopra, "same or somewhat better with attn_implementation=eager"), and RTX 5090 (intlex, RTF x2). The 4060 Ti 16GB lacks a direct measurement, but the bottleneck across all these cards is autoregressive decoding (GPU utilisation ~12-16% per intlex's report), not raw throughput, so expect comparable RTF.
- Languages: 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian), per the HF model card.
- License: Apache-2.0 per the HF card. Free for commercial use.
For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-4060-ti-16gb.
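The real-time factor (RTF) quoted above is just compute time divided by audio duration. The helpers below are a minimal sketch for measuring it on your own card; the function names are hypothetical.

```python
def audio_seconds(num_samples, sample_rate):
    """Duration of a mono waveform in seconds."""
    return num_samples / sample_rate


def real_time_factor(compute_seconds, audio_secs):
    """Seconds of compute per second of audio; values above 1.0 are slower than realtime."""
    if audio_secs <= 0:
        raise ValueError("audio duration must be positive")
    return compute_seconds / audio_secs


def estimated_compute_seconds(audio_secs, rtf):
    """Rough wall-clock estimate for a clip of the given length at a known RTF."""
    return audio_secs * rtf
```

To use it with the script in the Running section, record time.perf_counter() before and after the generate_voice_clone(...) call and pass the difference together with audio_seconds(len(wavs[0]), sr). At the reported RTF of 2-4x, a 5-second sentence would take roughly 10-20 seconds to synthesise.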
Tradeoffs vs. siblings
| Variant | What you get | When to choose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base (this recipe) | Zero-shot voice cloning from a 3-second clip | You want to clone arbitrary voices |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 curated premium speakers + natural-language style control | You want production-grade preset voices without supplying reference audio |
| Qwen3-TTS-12Hz-0.6B-Base / 0.6B-CustomVoice | Lighter footprint, same 10 languages and streaming | You're packing other models alongside on a tight VRAM budget |
Troubleshooting
flash-attn build runs out of RAM
The pip install flash-attn --no-build-isolation step launches a heavy CUDA compile that can exhaust system memory. The HF card recommends MAX_JOBS=4 pip install -U flash-attn --no-build-isolation for machines with less than 96 GB of RAM. If you still hit OOM, drop to MAX_JOBS=2 or skip FlashAttention 2 entirely — the model also runs on attn_implementation="eager" and one RTX 4090 user in the Qwen discussion thread (anujchopra) reports eager attention as "same ( or somewhat better )" speed than FlashAttention 2 for this model on Ada.
Generation is slower than RTF 1.0 (audio takes longer than realtime)
Expected behaviour, not a bug. Multiple users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 12-16%. The model is autoregressive and compute-light; the official QwenLM team is tracking speedup work on the GitHub repo. Keep dtype=torch.bfloat16 and pre-batch multiple sentences with create_voice_clone_prompt() to amortise the reference-encoding pass.
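The prompt-reuse idea can be sketched as follows: encode the reference clip once, then reuse the result for every sentence. The helper name clone_many is hypothetical, and the keyword names passed to create_voice_clone_prompt() and the voice_clone_prompt= argument of generate_voice_clone() are assumptions about the qwen-tts API; confirm them against the GitHub README.

```python
def clone_many(model, ref_audio, ref_text, sentences, language="English"):
    """Encode the reference once, then synthesise each sentence with it.

    `model` is assumed to be a loaded Qwen3TTSModel (Base checkpoint).
    Returns a list of (waveform, sample_rate) pairs, one per sentence.
    """
    # Assumed signature: create_voice_clone_prompt(ref_audio=..., ref_text=...)
    prompt = model.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text)

    results = []
    for sentence in sentences:
        # Assumed keyword: voice_clone_prompt= stands in for ref_audio/ref_text.
        wavs, sr = model.generate_voice_clone(
            text=sentence,
            language=language,
            voice_clone_prompt=prompt,
        )
        results.append((wavs[0], sr))
    return results
```

This does not change the per-sentence decoding cost, so RTF stays in the same range; it only avoids re-encoding the reference audio on every call.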
Voice cloning without a reference transcript
The archy.net guide documents an x_vector_only_mode=True flag that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
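One way to handle both cases in a script is to build the call arguments conditionally. This is a sketch under assumptions: the helper name is hypothetical, and it assumes x_vector_only_mode is accepted as a keyword of generate_voice_clone(), which the archy.net guide should be checked against.

```python
def build_clone_kwargs(text, language, ref_audio, ref_text=None):
    """Return keyword arguments for a voice-clone call.

    With a transcript, pass it as ref_text; without one, fall back to
    x_vector_only_mode=True (assumed keyword placement).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        kwargs["ref_text"] = ref_text
    else:
        kwargs["x_vector_only_mode"] = True
    return kwargs
```

A call would then look like model.generate_voice_clone(**build_clone_kwargs(...)), keeping the transcript path as the default whenever ref_text is available.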
Language code rejected
The archy.net guide flags that the language argument expects full names ("English", "french", "japanese") — short codes like "en" or "fr" raise an error.
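If your own code passes around short codes, a small lookup can translate them before the model call. The helper below is a hypothetical convenience, not part of qwen-tts; the two-letter codes are ISO 639-1 codes for the ten languages listed on the HF model card, and the title-case output is an assumption about what the language argument accepts.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# ISO 639-1 codes mapped to the full names used in this recipe.
_CODE_TO_NAME = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalize_language(value):
    """Turn 'en', 'english', or ' English ' into 'English'; reject anything else."""
    key = value.strip().lower()
    if key in _CODE_TO_NAME:
        return _CODE_TO_NAME[key]
    for name in SUPPORTED_LANGUAGES:
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language {value!r}")
```

Failing early with a ValueError keeps an unsupported code from reaching the model call.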
If you hit a problem not covered here, please report it via our submission form so the next reader benefits.