How much VRAM does Qwen3-TTS need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

Qwen3-TTS 1.7B-Base on RTX 4060 Ti 16GB: Multilingual Voice Cloning in 10 Languages with FlashAttention-2

What You'll Build

A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 4060 Ti 16GB — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.

Hardware data: RTX 4060 Ti 16GB (Ada Lovelace, sm_89) · weights 4.54 GB on disk · ~5 GB VRAM idle, climbing toward ~8 GB peak during inference per archy.net and the Qwen-canonical discussion thread. Both figures were measured on an RTX 3090; they should carry over because the VRAM footprint of this autoregressive workload is dominated by the BF16 weights rather than by anything architecture-specific. See the benchmark data under Results.
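The memory figures above reduce to simple arithmetic. The sketch below only restates the quoted numbers; the function name and the three-way split are illustrative, and the idle and peak values are the RTX 3090 measurements, not 4060 Ti readings.

```python
# Figures quoted in this recipe, all in GB. Idle and peak were measured on an
# RTX 3090, so treat them as estimates for the 4060 Ti 16GB.
CARD_VRAM_GB = 16.0
IDLE_VRAM_GB = 5.0
PEAK_VRAM_GB = 8.0


def vram_budget(card_gb, idle_gb, peak_gb):
    """Split the card's memory into resident, transient, and spare portions."""
    return {
        "resident_gb": idle_gb,             # weights plus runtime overhead
        "transient_gb": peak_gb - idle_gb,  # KV cache and activations
        "spare_gb": card_gb - peak_gb,      # left over for other workloads
    }


budget = vram_budget(CARD_VRAM_GB, IDLE_VRAM_GB, PEAK_VRAM_GB)
for name, value in budget.items():
    print(f"{name}: {value:.1f} GB")
```

On these numbers roughly half of the card stays free at peak, which is why the recipe lists 8 GB as the minimum rather than 16 GB.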

⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:

Qwen3-TTS-12Hz-1.7B-CustomVoice — same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.

Two 0.6B variants — Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice — are also released by the Qwen team with the same language coverage and streaming support per the GitHub README. Lighter footprint; same install path.

ℹ️ No Blackwell cu128 override needed. Unlike the 5060 Ti sibling recipe, the 4060 Ti 16GB is Ada Lovelace sm_89 — FlashAttention-2 pre-built wheels include full sm_89 kernel coverage per the FlashAttention README, and the stock pip install torch already ships Ada kernels. The Blackwell sm_120 gap tracked in Dao-AILab/flash-attention#2168 does not apply to this card; keep attn_implementation="flash_attention_2" from the canonical HF card verbatim.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM, BF16-capable (Ampere or newer)	RTX 4060 Ti 16GB (Ada sm_89)
RAM	16 GB system	—
Storage	5 GB free	4.54 GB weights (HF Files tab)
Software	Python 3.12, PyTorch with CUDA, `ffmpeg`	qwen-tts (PyPI)

Installation

1. Create the environment

Per the official Qwen3-TTS README and the HF model card:

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

2. Install PyTorch with CUDA

The default PyTorch wheel already includes sm_89 kernels (Ada Lovelace), so the stock command from the official guides works directly:

pip install -U torch

For an explicit CUDA-version pin matching the archy.net walkthrough (the guide targets an RTX 3090 but the wheel works equally on the 4060 Ti):

pip install -U torch --index-url https://download.pytorch.org/whl/cu121
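Before installing anything else, it can help to confirm that PyTorch actually sees the card. The following is a minimal sketch: `sm_tag` and `report_gpu` are hypothetical helper names, and the script falls back to a message if torch or a CUDA device is missing. On the 4060 Ti the expected tag is sm_89.

```python
def sm_tag(capability):
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_gpu():
    """Print what PyTorch sees; degrade gracefully if torch or CUDA is absent."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print(f"torch {torch.__version__} is installed, but no CUDA device is visible")
        return
    capability = torch.cuda.get_device_capability(0)
    print(f"torch {torch.__version__}, CUDA runtime {torch.version.cuda}")
    print(f"device: {torch.cuda.get_device_name(0)} ({sm_tag(capability)})")
    # The recipe loads the model in bfloat16, so BF16 support matters.
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")


report_gpu()
```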

3. Install the qwen-tts package

pip install -U qwen-tts

This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.

4. Install FlashAttention 2 (recommended on Ada)

The HF card recommends:

pip install -U flash-attn --no-build-isolation

The FlashAttention README explicitly lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported, so the RTX 4060 Ti 16GB (Ada sm_89) is covered. If flash-attn has to compile from source, the build is memory-hungry; on machines with less than 96 GB of RAM, the HF card recommends capping parallel jobs:

MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
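The decision to cap build jobs can be automated. This sketch reads physical RAM through `os.sysconf` (POSIX only) and prints the matching install command. The 96 GB threshold and the value 4 come from the guidance above; the helper names, and the choice to treat an unknown RAM size as "tight", are assumptions of this example.

```python
import os

RAM_THRESHOLD_GB = 96.0  # threshold quoted from the HF card
CAPPED_JOBS = 4          # value used in the command above


def total_ram_gb():
    """Return physical RAM in GB on POSIX systems, or None if unavailable."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    return page_size * page_count / 1e9


def max_jobs_override(ram_gb):
    """Return the environment override to apply, or {} when RAM is plentiful."""
    # Assumption: if RAM cannot be detected, err on the side of capping.
    if ram_gb is None or ram_gb < RAM_THRESHOLD_GB:
        return {"MAX_JOBS": str(CAPPED_JOBS)}
    return {}


override = max_jobs_override(total_ram_gb())
prefix = " ".join(f"{key}={value}" for key, value in override.items())
print(f"{prefix} pip install -U flash-attn --no-build-isolation".strip())
```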

5. Download the weights

The first run fetches them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main model weights + 682 MB speech_tokenizer/model.safetensors, both visible on the HF Files tab). To pre-cache:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
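A quick pre-flight check avoids a half-finished download. The sketch below adds up the two file sizes quoted above and compares the 5 GB requirement against free space in the home directory. That location is an assumption: it is only right if the Hugging Face cache sits in its default place under the home directory.

```python
import shutil
from pathlib import Path

MAIN_WEIGHTS_GB = 3.86       # main weights file, per the HF Files tab
SPEECH_TOKENIZER_GB = 0.682  # speech_tokenizer/model.safetensors
REQUIRED_FREE_GB = 5.0       # from the Requirements table


def total_download_gb():
    """Sum the two checkpoint files that make up the download."""
    return MAIN_WEIGHTS_GB + SPEECH_TOKENIZER_GB


def has_room(free_bytes, required_gb=REQUIRED_FREE_GB):
    """True when the free space (in bytes) covers the requirement (in GB)."""
    return free_bytes / 1e9 >= required_gb


# Assumption: the cache lives on the same filesystem as the home directory.
free_bytes = shutil.disk_usage(Path.home()).free
print(f"download size: {total_download_gb():.2f} GB")
print(f"free space:    {free_bytes / 1e9:.1f} GB")
print("enough room" if has_room(free_bytes) else "free up disk space first")
```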

Running

Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"

wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")

The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
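To exercise the multilingual side, the `generate_voice_clone` call from clone.py can be wrapped in a loop that reuses one reference clip across several target languages. This is a sketch: `clone_in_languages` is a hypothetical helper, and it uses only the keyword arguments already shown in clone.py.

```python
def clone_in_languages(model, lines, ref_audio, ref_text):
    """Generate one clip per language, all with the same reference voice.

    `lines` maps a full language name (e.g. "German") to the sentence to speak.
    Returns a dict of language -> (waveform, sample_rate).
    """
    results = {}
    for language, sentence in lines.items():
        wavs, sr = model.generate_voice_clone(
            text=sentence,
            language=language,
            ref_audio=ref_audio,
            ref_text=ref_text,
        )
        results[language] = (wavs[0], sr)
    return results


# Usage with the objects defined in clone.py (model, ref_audio, ref_text, sf):
# lines = {"English": "Good morning.", "German": "Guten Morgen."}
# for language, (wav, sr) in clone_in_languages(model, lines, ref_audio, ref_text).items():
#     sf.write(f"output_{language.lower()}.wav", wav, sr)
```

Each iteration re-encodes the reference clip, so this is the simple version rather than the fast one.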

Results

VRAM usage: ~5 GB idle, climbing toward ~8 GB peak. The 5 GB figure is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB peak is reported by user Geximus in the Qwen-canonical discussion thread, also on an RTX 3090. Both fit comfortably on the 4060 Ti 16GB. The numbers should transfer to Ada because the dominant cost is keeping the BF16 weights resident (4.54 GB on disk); the remaining ~3 GB between idle and peak is KV cache plus activations, which is bound by the autoregressive decoder rather than by the GPU architecture. Once a 4060 Ti 16GB benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-4060-ti-16gb.
Generation latency: The archy.net guide reports "under 10 seconds for typical phrases" on RTX 3090. Multiple users in the Qwen discussion report RTF 2-4x (i.e. 2-4 seconds of compute per second of audio) across RTX 3090 (Geximus, RTF x3), RTX 4090 (anujchopra, "same or somewhat better with attn_implementation=eager"), and RTX 5090 (intlex, RTF x2). The 4060 Ti 16GB lacks a direct measurement but the bottleneck across all these cards is autoregressive decoding (GPU utilisation ~12-16% per intlex's report), not raw throughput — expect comparable RTF.
Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
License: Apache-2.0 per the HF card. Free for commercial use.
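The real-time factor quoted above is compute time divided by audio duration, so it is easy to measure on your own card. A minimal sketch follows; the helper names are illustrative and the 2 to 4 range is the one reported in the discussion thread, not a 4060 Ti measurement.

```python
import time


def audio_seconds(num_samples, sample_rate):
    """Duration of a mono waveform in seconds."""
    return num_samples / sample_rate


def real_time_factor(compute_seconds, audio_duration_seconds):
    """RTF = seconds of compute per second of generated audio."""
    if audio_duration_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return compute_seconds / audio_duration_seconds


def expected_compute_range(audio_duration_seconds, rtf_low=2.0, rtf_high=4.0):
    """Compute time implied by the reported RTF range."""
    return audio_duration_seconds * rtf_low, audio_duration_seconds * rtf_high


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


low, high = expected_compute_range(10.0)
print(f"a 10 s clip should take roughly {low:.0f} to {high:.0f} s to generate")

# Usage with the objects from clone.py:
# (wavs, sr), elapsed = timed(model.generate_voice_clone, text=..., language=...,
#                             ref_audio=ref_audio, ref_text=ref_text)
# print(real_time_factor(elapsed, audio_seconds(len(wavs[0]), sr)))
```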

For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-4060-ti-16gb.

Tradeoffs vs. siblings

Variant	What you get	When to choose
`Qwen3-TTS-12Hz-1.7B-Base` (this recipe)	Zero-shot voice cloning from a 3-second clip	You want to clone arbitrary voices
`Qwen3-TTS-12Hz-1.7B-CustomVoice`	9 curated premium speakers + natural-language style control	You want production-grade preset voices without supplying reference audio
`Qwen3-TTS-12Hz-0.6B-Base` / `0.6B-CustomVoice`	Lighter footprint, same 10 languages and streaming	You're packing other models alongside on a tight VRAM budget

Troubleshooting

`flash-attn` build runs out of RAM

The pip install flash-attn --no-build-isolation step launches a heavy CUDA compile that can exhaust system memory. The HF card recommends MAX_JOBS=4 pip install -U flash-attn --no-build-isolation for machines with less than 96 GB of RAM. If you still hit OOM, drop to MAX_JOBS=2 or skip FlashAttention 2 entirely. The model also runs with attn_implementation="eager", and one RTX 4090 user in the Qwen discussion thread (anujchopra) reports eager attention running at the "same ( or somewhat better )" speed compared with FlashAttention 2 for this model on Ada.
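If the build cannot be made to work, the loader can fall back to eager attention automatically. The sketch below picks the `attn_implementation` string from two inputs: whether `flash_attn` is importable and the GPU's compute capability. The mapping of major versions 8 and 9 to "Ampere, Ada, or Hopper" is an assumption of this example, as are the helper names.

```python
import importlib.util

# Assumption: compute capability 8.x covers Ampere and Ada, 9.x covers Hopper.
FLASH_ATTN2_MAJORS = {8, 9}


def flash_attn_is_installed():
    """True when the flash_attn package can be found without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(capability, flash_attn_installed):
    """Choose the attn_implementation value for Qwen3TTSModel.from_pretrained."""
    major, _minor = capability
    if flash_attn_installed and major in FLASH_ATTN2_MAJORS:
        return "flash_attention_2"
    return "eager"


# The RTX 4060 Ti reports capability (8, 9).
print(pick_attn_implementation((8, 9), flash_attn_is_installed()))
```

The returned string can be passed as `attn_implementation=` in clone.py in place of the hard-coded value.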

Generation is slower than real time (RTF above 1.0)

Expected behaviour, not a bug. Multiple users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 12-16%. The model is autoregressive and compute-light; the official QwenLM team is tracking speedup work on the GitHub repo. Keep dtype=torch.bfloat16 and pre-batch multiple sentences with create_voice_clone_prompt() to amortise the reference-encoding pass.
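The amortisation idea can be sketched as follows: encode the reference clip once with `create_voice_clone_prompt()`, then pass the result to a single batched generation call. The source only names the method; the `voice_clone_prompt=` keyword and the list-valued `text` and `language` arguments are assumptions based on our reading of the upstream README, so check them against the installed qwen-tts version.

```python
def synthesise_with_cached_prompt(model, sentences, language, ref_audio, ref_text):
    """Encode the reference once, then generate every sentence in one call."""
    # One reference-encoding pass, regardless of how many sentences follow.
    prompt = model.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text)

    # ASSUMPTION: generate_voice_clone accepts lists for text/language and a
    # voice_clone_prompt keyword in place of ref_audio/ref_text.
    wavs, sr = model.generate_voice_clone(
        text=list(sentences),
        language=[language] * len(sentences),
        voice_clone_prompt=prompt,
    )
    return wavs, sr


# Usage with the objects from clone.py:
# wavs, sr = synthesise_with_cached_prompt(
#     model, ["First line.", "Second line."], "English", ref_audio, ref_text)
# for i, wav in enumerate(wavs):
#     sf.write(f"output_{i}.wav", wav, sr)
```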

Voice cloning without a reference transcript

The archy.net guide documents an x_vector_only_mode=True flag that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
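One way to handle both cases is to build the keyword arguments conditionally. In this sketch `build_clone_kwargs` is a hypothetical helper, and passing `x_vector_only_mode=True` directly to `generate_voice_clone` is an assumption; the source confirms the flag exists but not where it is supplied.

```python
def build_clone_kwargs(text, language, ref_audio, ref_text=None):
    """Assemble keyword arguments, with or without a reference transcript."""
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        # Preferred path: the transcript gives the model prosody alignment cues.
        kwargs["ref_text"] = ref_text
    else:
        # ASSUMPTION: the flag is accepted by generate_voice_clone itself.
        kwargs["x_vector_only_mode"] = True
    return kwargs


print(build_clone_kwargs("Hello there.", "English", "speaker.wav"))

# Usage with the model from clone.py:
# wavs, sr = model.generate_voice_clone(**build_clone_kwargs(
#     "Hello there.", "English", "speaker.wav"))
```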

Language code rejected

The archy.net guide flags that the language argument expects full names ("English", "french", "japanese") — short codes like "en" or "fr" raise an error.
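A small normaliser can catch short codes before they reach the model. The sketch below maps ISO 639-1 codes to the ten language names listed under Results; `to_language_name` is a hypothetical helper, and returning the capitalised form (as used in clone.py) is an assumption about what the library prefers.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# ISO 639-1 codes for the ten supported languages.
CODE_TO_NAME = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def to_language_name(value):
    """Turn a short code or any-case name into a full language name."""
    key = value.strip().lower()
    if key in CODE_TO_NAME:
        return CODE_TO_NAME[key]
    for name in SUPPORTED_LANGUAGES:
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language: {value!r}")


print(to_language_name("fr"))
```

Calling it on user input before `generate_voice_clone` turns a library error into an early, readable one.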

If you hit a problem not covered here, please report it via our submission form so the next reader benefits.