What You'll Build
A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 4070 — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.
Hardware data: RTX 4070 (12 GB, Ada Lovelace sm_89) · weights 4.54 GB on disk · ~5 GB VRAM baseline with weights resident, climbing toward an ~8 GB peak during inference, per the archy.net walkthrough and the Qwen-canonical discussion thread. Both figures were measured on an RTX 3090; they should transfer because the footprint of this autoregressive workload is dominated by the BF16 weights rather than by anything architecture-specific. An ~8 GB peak leaves about 4 GB of headroom inside the 4070's 12 GB envelope. See the benchmark data at /check/qwen3tts/rtx-4070.
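The headroom arithmetic behind those figures can be written out as a minimal sketch. The constants are the numbers quoted above (the baseline and peak come from RTX 3090 reports, so they are estimates for the 4070), and the function name vram_budget is hypothetical.

```python
# Figures quoted in this recipe; the baseline and peak were reported on an
# RTX 3090 and are only estimates for the RTX 4070.
CARD_VRAM_GB = 12.0   # RTX 4070
BASELINE_GB = 5.0     # weights resident
PEAK_GB = 8.0         # allocated peak during inference


def vram_budget(card_gb: float, baseline_gb: float, peak_gb: float) -> dict:
    """Return headroom, decode-time overhead, and the fraction of the card used at peak."""
    return {
        "headroom_gb": round(card_gb - peak_gb, 2),
        "decode_overhead_gb": round(peak_gb - baseline_gb, 2),
        "peak_fraction": round(peak_gb / card_gb, 2),
    }


budget = vram_budget(CARD_VRAM_GB, BASELINE_GB, PEAK_GB)
print(budget)
```

Swapping in your own measured peak (for example from nvidia-smi) shows how much room is left for other models on the same card.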
⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:
- Qwen3-TTS-12Hz-1.7B-CustomVoice: same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.
- Qwen3-TTS-12Hz-1.7B-VoiceDesign: generates a voice from a natural-language persona description (rather than a reference clip), via generate_voice_design, per the variant table on the HF model card.
- Two 0.6B variants, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice, are also released by the Qwen team with the same 10-language coverage per the HF model card. Lighter footprint; same install path.
ℹ️ Keep FlashAttention-2: no Blackwell override needed. The RTX 4070 is Ada Lovelace sm_89, so FlashAttention-2's pre-built wheels include full kernel coverage: the FlashAttention README lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported, and the stock pip install torch already ships Ada kernels (no special cu128 / sm_120 wheel selection). Keep attn_implementation="flash_attention_2" from the canonical HF card verbatim. The Blackwell sm_120 kernel gap that forces RTX 50-series cards onto eager attention does not affect the 4070.
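If the same script has to run on more than one card, the rule in this note can be encoded as a small helper. This is a sketch only: pick_attn_implementation is a hypothetical name, and the mapping (compute capability major 8 or 9 gets FlashAttention 2, everything else falls back to eager) is an assumption distilled from the note, not an official compatibility table.

```python
def pick_attn_implementation(capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an attention backend name."""
    major, _minor = capability
    # Ampere (sm_80 / sm_86), Ada (sm_89) and Hopper (sm_90) are the GPU
    # classes the FlashAttention README lists as supported.
    if major in (8, 9):
        return "flash_attention_2"
    # Assumption: anything else, including Blackwell sm_120, uses eager attention.
    return "eager"


if __name__ == "__main__":
    print(pick_attn_implementation((8, 9)))  # the RTX 4070 is sm_89
```

With PyTorch installed, torch.cuda.get_device_capability(0) returns the tuple this helper expects, and the result can be passed as attn_implementation=.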
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8 GB VRAM, BF16-capable (Ampere or newer) | RTX 4070 (12GB, Ada sm_89) |
| RAM | 16 GB system | — |
| Storage | 5 GB free | 4.54 GB weights (HF Files tab) |
| Software | Python 3.12, PyTorch with CUDA, ffmpeg | qwen-tts (PyPI) |
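A quick preflight check for the non-GPU rows of this table might look like the sketch below. It uses only the standard library; the helper name preflight, the default path ".", and treating Python 3.12 as a floor rather than an exact pin are all assumptions. Point path at the drive that holds your Hugging Face cache if it differs from the working directory.

```python
import shutil
import sys


def preflight(path: str = ".", min_free_gb: float = 5.0) -> dict:
    """Check Python version, ffmpeg availability, and free disk space at `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= (3, 12),            # assumption: 3.12 is a floor
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "free_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,                   # 5 GB covers the 4.54 GB weights
    }


report = preflight()
for name, value in report.items():
    print(f"{name}: {value}")
```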
Installation
1. Create the environment
Per the official Qwen3-TTS README and the HF model card:
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
2. Install PyTorch with CUDA
The default PyTorch wheel already includes sm_89 kernels (Ada Lovelace), so the stock command works directly — no Blackwell cu128 wheel selection is required for the 4070:
pip install -U torch
For an explicit CUDA-version pin matching the archy.net walkthrough (the guide targets an RTX 3090 but the same wheel covers the 4070's sm_89):
pip install -U torch --index-url https://download.pytorch.org/whl/cu121
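After either install command, it is worth confirming that the wheel you got is a CUDA build and that it sees the card. The sketch below is one way to do that; torch_cuda_report is a hypothetical helper name, and it deliberately returns early when PyTorch is missing so it can be run before and after the install.

```python
import importlib.util


def torch_cuda_report() -> dict:
    """Summarise the installed PyTorch build; safe to call without torch or a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,          # None on CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)
        report["bf16"] = torch.cuda.is_bf16_supported()
    return report


print(torch_cuda_report())
```

On an RTX 4070 the capability entry should read (8, 9), matching the sm_89 figure used throughout this recipe. If cuda_build is None, pip picked a CPU-only wheel and the install should be repeated with an explicit index URL.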
3. Install the qwen-tts package
pip install -U qwen-tts
This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.
4. Install FlashAttention 2 (recommended on Ada)
The HF card recommends FlashAttention 2 to reduce GPU memory usage:
pip install -U flash-attn --no-build-isolation
The FlashAttention README explicitly lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported by the pre-built wheels — the RTX 4070 (Ada sm_89) is covered, and bf16 requires exactly this Ampere/Ada/Hopper class. If flash-attn compilation runs out of RAM (the HF card notes the build is RAM-heavy), cap parallel jobs:
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
5. Download the weights
First run will fetch them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main DiT model.safetensors + 682 MB speech_tokenizer/model.safetensors, both verified live on the HF Files tab). To pre-cache:
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
Running
Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")
The output output.wav lands next to the script. For an interactive demo UI on port 8000, use the bundled CLI from the official README:
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Results
- Speed: No first-party RTX 4070 measurement exists yet, so no headline number is quoted here. The model is autoregressive and compute-light: community users in the Qwen discussion thread report real-time factors (RTF) of roughly 2-4x (i.e. audio takes 2-4x its own duration to generate) even on faster cards. Geximus logs RTF x3 on an RTX 3090, intlex an execution time of ~x2 on an RTX 5090, and chrisoutwright RTF 4.0 on a 3080 mobile, with GPU utilisation hovering around 10-16% in every report. Treat that range as an order-of-magnitude expectation, not a 4070 figure. Once a first-party RTX 4070 benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-4070.
- VRAM usage: ~5 GB baseline with weights resident, climbing toward an ~8 GB peak. The ~5 GB baseline is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB allocated peak is reported by user Geximus in the Qwen-canonical discussion thread on the same RTX 3090. Both fit with headroom inside the 4070's 12 GB envelope. The numbers transfer across cards because the dominant cost is the resident BF16 weights (4.54 GB on disk); the residual ~3 GB of peak is KV cache plus activations from autoregressive decoding rather than anything architecture-specific.
- Languages: 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian), per the HF model card.
- License: Apache-2.0 per the HF card frontmatter (ungated — weights download freely). Free for commercial use.
For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-4070.
Tradeoffs vs. siblings
| Variant | What you get | When to choose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base (this recipe) | Zero-shot voice cloning from a 3-second clip | You want to clone arbitrary voices |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 curated premium speakers + natural-language style control | You want preset voices without supplying reference audio |
| Qwen3-TTS-12Hz-0.6B-Base / 0.6B-CustomVoice | Lighter footprint, same 10 languages | You're packing other models alongside on a tight VRAM budget |
Troubleshooting
flash-attn build runs out of RAM
The pip install flash-attn --no-build-isolation step launches a heavy CUDA compile that can exhaust system memory. The HF card recommends MAX_JOBS=4 pip install -U flash-attn --no-build-isolation for machines with limited RAM. If you still hit OOM, drop to MAX_JOBS=2 or skip FlashAttention 2 entirely — the model also runs on attn_implementation="eager", and one RTX 4090 user (anujchopra) in the Qwen discussion thread reports getting the same (or somewhat better) speed with eager attention on Ada. FlashAttention 2 stays the recommended default on the 4070, but eager is a working fallback if the build fails.
Generation is slower than RTF 1.0 (audio takes longer than realtime)
Expected behaviour, not a bug. Multiple community users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 10-16%. The model is autoregressive and compute-light; the autoregressive decode loop, not raw throughput, is the bottleneck. Keep dtype=torch.bfloat16 and pre-batch multiple sentences to amortise the reference-encoding pass. The QwenLM team is tracking speedup work on the GitHub repo.
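To compare your own card against the community reports, RTF can be computed from the wall-clock generation time and the length of the returned waveform. The sketch below uses the definition given in this recipe (generation time divided by audio duration); the function name is hypothetical and the 24 kHz sample rate in the worked example is purely illustrative, not a claim about the model's output rate.

```python
def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int) -> float:
    """Return generation time divided by audio duration (values above 1.0 are slower than realtime)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_s = num_samples / sample_rate
    return elapsed_s / audio_s


# Worked example: 12 s of wall-clock time for 4 s of audio at an assumed 24 kHz.
rtf = real_time_factor(elapsed_s=12.0, num_samples=96_000, sample_rate=24_000)
print(f"RTF {rtf:.1f}")
```

In clone.py, wrap the generate_voice_clone call with time.perf_counter() and pass len(wavs[0]) and sr to get the figure for a real run.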
Voice cloning without a reference transcript
The archy.net guide documents an x_vector_only_mode=True flag passed to generate_voice_clone that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
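One way to handle both cases in a single code path is to build the keyword arguments first and unpack them into the call. This is a sketch under stated assumptions: build_clone_kwargs is a hypothetical helper, and the x_vector_only_mode flag name is taken from the archy.net guide as described above and has not been verified against the library here.

```python
from typing import Optional


def build_clone_kwargs(
    text: str,
    language: str,
    ref_audio: str,
    ref_text: Optional[str] = None,
) -> dict:
    """Assemble keyword arguments for generate_voice_clone, with or without a transcript."""
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text and ref_text.strip():
        # A transcript is available: pass it so the model gets alignment cues.
        kwargs["ref_text"] = ref_text.strip()
    else:
        # Flag name as documented by the archy.net guide; unverified assumption.
        kwargs["x_vector_only_mode"] = True
    return kwargs


print(build_clone_kwargs("Hello there.", "English", "speaker.wav"))
```

The result would then be used as model.generate_voice_clone(**kwargs), with speaker.wav standing in for your own reference clip.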
Language code rejected
The archy.net guide flags that the language argument expects full language names ("English", "french", "german") — short codes like "en" or "fr" raise an error.
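If your pipeline carries short codes, a small normaliser in front of the call avoids the error. The sketch below maps ISO 639-1 codes to the ten language names listed on the HF model card; the helper name normalise_language is hypothetical, and returning capitalised names (as in clone.py) is an assumption about the casing the library accepts.

```python
SUPPORTED = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# ISO 639-1 codes for the ten supported languages.
CODE_TO_NAME = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalise_language(value: str) -> str:
    """Turn a short code or a loosely cased name into a full language name."""
    key = value.strip().lower()
    if key in CODE_TO_NAME:
        return CODE_TO_NAME[key]
    for name in SUPPORTED:
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language: {value!r}")


print(normalise_language("fr"))
```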
If you hit a problem not covered here, please report it via our submission form so the next reader benefits.