
Qwen3-TTS 1.7B-Base on RTX 5060 Ti: Multilingual Voice Cloning in 10 Languages

tts · intermediate · 8 GB+ VRAM · May 19, 2026
Prerequisites
  • NVIDIA RTX 5060 Ti (16GB VRAM) or any CUDA GPU with 8 GB+ VRAM
  • Python 3.12 (fresh conda environment recommended)
  • CUDA-enabled PyTorch build (CUDA 12.1+; for Blackwell GPUs such as the RTX 5060 Ti, use a CUDA 12.8+ build that ships sm_120 kernels)
  • ~5 GB free disk for weights (3.86 GB DiT + 682 MB speech tokenizer)

What You'll Build

A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 5060 Ti — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.

Hardware data: RTX 5060 Ti (16 GB VRAM) · weights ~4.54 GB on disk · runtime ~5 GB VRAM idle, rising to ~8 GB peak during inference (both figures reported on an RTX 3090; the peak comes from the Qwen-canonical discussion thread)

⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Two sibling variants exist on the same Qwen HF org and are not covered here:

  • Qwen3-TTS-12Hz-1.7B-CustomVoice — same parameter count and same runtime VRAM envelope, but ships 9 pre-defined premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) instead of clone-from-reference. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with a speaker= argument).
  • A 0.6B variant is referenced on the Qwen3-TTS GitHub README — lighter footprint, but the official Qwen org's published 12Hz checkpoints on Hugging Face are currently 1.7B-only, so the 0.6B is out of scope until a Qwen/Qwen3-TTS-12Hz-0.6B-Base repo lands.

⚠️ Blackwell + FlashAttention 2. The RTX 5060 Ti is a Blackwell card (compute capability sm_120). FlashAttention 2 pre-built wheels do not include sm_120 kernels as of early 2026 (Dao-AILab/flash-attention#2168), so attn_implementation="flash_attention_2" fails with the error "no kernel image is available for execution on the device". The instructions below use eager attention instead, which a 4090 user in the same Qwen discussion reported as "same or better speed" than FA2 for this model.

Requirements

| Component | Minimum | Tested |
|-----------|---------|--------|
| GPU | 8 GB VRAM, BF16-capable (Ampere or newer) | RTX 5060 Ti (16 GB) |
| RAM | 16 GB system | |
| Storage | 5 GB free | 4.54 GB weights (HF Files tab) |
| Software | Python 3.12, PyTorch with CUDA, ffmpeg | qwen-tts (PyPI) |

Installation

1. Create the environment

Per the official Qwen3-TTS README and the HF model card:

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

2. Install PyTorch with CUDA

For RTX 5060 Ti (Blackwell, sm_120), the stock CUDA 12.1 build will raise "no kernel image is available for execution on the device", so install a recent CUDA 12.8+ wheel instead:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

(The community deploy guide uses cu121 because the target was an RTX 3090. Substitute cu128 for Blackwell.)
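
Before moving on, it can help to confirm that the installed wheel actually lists kernels for the card. The following is a minimal sketch, assuming only the standard torch.cuda.get_arch_list() and torch.cuda.get_device_capability() calls. The helper name build_supports_device is hypothetical, and the check is deliberately conservative: it looks for a native sm_XY entry and ignores any PTX forward-compatibility.

```python
def build_supports_device(arch_list, capability):
    """Return True if the build lists native kernels for this compute capability.

    arch_list  -- e.g. the result of torch.cuda.get_arch_list()
    capability -- (major, minor), e.g. torch.cuda.get_device_capability(0)
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            print("CUDA is not available to this PyTorch build")
        else:
            archs = torch.cuda.get_arch_list()
            cap = torch.cuda.get_device_capability(0)
            print(f"device capability: sm_{cap[0]}{cap[1]}")
            print(f"build arch list:   {archs}")
            if build_supports_device(archs, cap):
                print("OK: native kernels present")
            else:
                print("No native kernels: reinstall from the cu128 index")
```

On an RTX 5060 Ti the reported capability should be sm_120; if that entry is missing from the arch list, the cu128 reinstall above is the likely fix.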

3. Install the qwen-tts package

pip install -U qwen-tts

This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.

4. (Skip on RTX 5060 Ti) FlashAttention 2

The HF card suggests pip install -U flash-attn --no-build-isolation. Skip this on Blackwell — see the warning above. The model defaults to eager / SDPA attention without it.
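
If the same script has to run on both Blackwell and older cards, the attention backend can be chosen at load time rather than hard-coded. This is a sketch under stated assumptions: pick_attn_implementation and flash_attn_is_installed are hypothetical helper names, and the rule only encodes what this recipe documents (FA2 wheels lack sm_120 kernels), not a general compatibility matrix.

```python
import importlib.util


def flash_attn_is_installed():
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(capability, flash_attn_installed):
    """Choose a value for attn_implementation.

    capability -- (major, minor) compute capability of the target GPU.
    Falls back to eager on sm_120 (the case documented in this recipe)
    or whenever flash_attn is not installed.
    """
    if tuple(capability) == (12, 0) or not flash_attn_installed:
        return "eager"
    return "flash_attention_2"
```

The returned string can be passed straight to attn_implementation= in from_pretrained; on the RTX 5060 Ti it always resolves to "eager".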

5. Download the weights

First run will fetch them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main DiT + 682 MB speech_tokenizer/model.safetensors, both visible on the HF Files tab). To pre-cache:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
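
The same pre-cache step can be done from Python with a free-space check first. This is a rough sketch: it uses huggingface_hub.snapshot_download, the helper names enough_disk and precache are hypothetical, and checking the home directory assumes the default Hugging Face cache location has not been moved.

```python
import os
import shutil

# Main checkpoint + speech tokenizer, per the sizes quoted above (decimal GB).
WEIGHTS_GB = 3.86 + 0.682


def enough_disk(free_bytes, needed_gb=5.0):
    """True if free_bytes covers the ~5 GB this recipe asks for."""
    return free_bytes >= needed_gb * 1e9


def precache(repo_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base", check_path=None):
    """Download the repo into the local Hugging Face cache and return its path."""
    # Assumption: the cache lives under the home directory (the default).
    check_path = check_path or os.path.expanduser("~")
    free = shutil.disk_usage(check_path).free
    if not enough_disk(free):
        raise RuntimeError(f"only {free / 1e9:.1f} GB free under {check_path}")
    from huggingface_hub import snapshot_download  # imported lazily
    return snapshot_download(repo_id=repo_id)
```

Calling precache() once before the first run avoids the download happening in the middle of model loading.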

Running

Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="eager",   # NOT flash_attention_2 on Blackwell
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"

wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")

The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000

Results

  • VRAM usage: ~5 GB idle, climbing toward ~8 GB peak. The 5 GB figure is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB peak is reported by user Geximus in the Qwen-canonical discussion thread on the same RTX 3090. Both fit comfortably on the 5060 Ti's 16 GB. Once a 5060 Ti benchmark lands, the live number appears at /check/qwen3tts/rtx-5060-ti.
  • Generation latency: The archy.net guide reports "under 10 seconds for typical phrases" on RTX 3090. Multiple users in the Qwen discussion report RTF 2-4x (i.e. 2-4 seconds of compute per second of audio) across RTX 3090 / 4090 / 5090. Expect similar latency on the 5060 Ti — the model is compute-light and the bottleneck is autoregressive decoding, not raw throughput.
  • Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
  • License: Apache-2.0 (both Base and CustomVoice HF cards). Free for commercial use.

For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-5060-ti.
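
To compare your own card against the RTF figures quoted above, the real-time factor can be computed from wall-clock time and the length of the returned waveform. A minimal sketch, using the same convention as this recipe (seconds of compute per second of audio); the helper names timed and real_time_factor are hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(compute_seconds, num_samples, sample_rate):
    """Seconds of compute per second of generated audio (RTF > 1 is slower than realtime)."""
    audio_seconds = num_samples / sample_rate
    return compute_seconds / audio_seconds


# Usage with the clone.py script above (not executed here):
#   (wavs, sr), elapsed = timed(model.generate_voice_clone, text=..., language="English",
#                               ref_audio=ref_audio, ref_text=ref_text)
#   print(f"RTF: {real_time_factor(elapsed, len(wavs[0]), sr):.2f}")
```

The first call after loading usually includes warm-up overhead, so timing a second or third generation gives a fairer number.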

Tradeoffs vs. siblings

| Variant | What you get | When to choose |
|---------|--------------|----------------|
| Qwen3-TTS-12Hz-1.7B-Base (this recipe) | Zero-shot voice cloning from a 3-second clip | You want to clone arbitrary voices |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 curated premium speakers + natural-language style control | You want production-grade preset voices without supplying reference audio |

Troubleshooting

no kernel image is available for execution on the device

Your PyTorch build doesn't include kernels for the 5060 Ti's sm_120 compute capability, or you tried flash_attention_2. Fix: reinstall PyTorch from the cu128 index (pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128) and use attn_implementation="eager". The issue is documented at Dao-AILab/flash-attention#2168.

Generation is slower than RTF 1.0 (audio takes longer than realtime)

Expected behaviour, not a bug. Multiple users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 12-16%. The model is autoregressive and compute-light; the official QwenLM team has opened issue #89 on the GitHub repo to track speedup work. Keep dtype=torch.bfloat16 and pre-batch multiple sentences with create_voice_clone_prompt() to amortise the reference-encoding pass.
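
The prompt-reuse idea can be sketched as follows. Only the create_voice_clone_prompt() name comes from this recipe; the voice_clone_prompt= keyword and the list-valued text= and language= arguments are assumptions to verify against the installed qwen-tts version, and clone_many is a hypothetical wrapper.

```python
def clone_many(model, sentences, language, ref_audio, ref_text):
    """Encode the reference clip once, then synthesise every sentence with it.

    model is expected to be a loaded Qwen3TTSModel. Returns (wavs, sample_rate).
    """
    sentences = list(sentences)
    # One reference-encoding pass, shared by all sentences.
    prompt = model.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text)
    # Assumption: generate_voice_clone accepts lists and a prebuilt prompt.
    wavs, sr = model.generate_voice_clone(
        text=sentences,
        language=[language] * len(sentences),
        voice_clone_prompt=prompt,
    )
    return wavs, sr
```

This does not change the autoregressive decoding cost per sentence; it only avoids re-encoding the same reference audio on every call.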

Voice cloning without a reference transcript

The archy.net guide documents an x_vector_only_mode=True flag that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
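
One way to handle both cases in a single code path is to build the call arguments conditionally. This is a sketch only: the flag name x_vector_only_mode comes from the guide cited above, but passing it directly to generate_voice_clone(...) is an assumption, and clone_kwargs is a hypothetical helper.

```python
def clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build keyword arguments for generate_voice_clone.

    With a transcript, pass it as ref_text; without one, fall back to
    x_vector_only_mode=True (assumed to be accepted by generate_voice_clone).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        kwargs["ref_text"] = ref_text
    else:
        kwargs["x_vector_only_mode"] = True
    return kwargs


# Usage (not executed here):
#   wavs, sr = model.generate_voice_clone(**clone_kwargs("Hello.", "English", "ref.wav"))
```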

Language code rejected

The archy.net guide flags that the language argument expects full names ("English", "French", "Japanese"); short codes like "en" or "fr" raise an error.
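
If your upstream data carries short codes, a small lookup can translate them before the call. A minimal sketch: the ten language names are the ones listed in this recipe, while the two-letter ISO 639-1 mapping and the normalize_language helper are additions for illustration, not part of qwen-tts.

```python
SUPPORTED = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# Assumed ISO 639-1 codes for the ten supported languages.
CODE_TO_NAME = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalize_language(value):
    """Map a short code or any-case name to the capitalised full name."""
    key = value.strip().lower()
    if key in CODE_TO_NAME:
        return CODE_TO_NAME[key]
    for name in SUPPORTED:
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language: {value!r}")
```

Passing language=normalize_language("fr") then hands the model "French" rather than the rejected short code.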

If you hit a problem not covered here, please report it via our submission form so the next reader benefits.