Qwen3-TTS 1.7B-Base on RTX 5070: Multilingual Voice Cloning in 10 Languages

What You'll Build

A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 5070 — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.

Hardware data: RTX 5070 (12 GB VRAM, Blackwell sm_120) · weights 4.54 GB on disk · ~5 GB VRAM baseline with weights resident, rising to a ~8 GB peak during inference, per the archy.net walkthrough and the Qwen-canonical discussion thread. Both figures were measured on an RTX 3090; they should carry over because the footprint of this autoregressive workload is dominated by BF16 weights rather than anything architecture-specific. A ~8 GB peak leaves roughly 4 GB of headroom inside the 5070's 12 GB. See benchmark data.
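The headroom estimate is plain arithmetic on the figures quoted in this recipe. The sketch below only restates those numbers; the variable names are illustrative, and the ~8 GB peak is the RTX 3090 report, assumed (not measured) to carry over to the 5070.

```python
# Figures quoted in this recipe, in GB. The peak is an RTX 3090 report
# and is only assumed to carry over to the RTX 5070.
CARD_VRAM_GB = 12.0
BASELINE_GB = 5.0            # weights resident, before generation
PEAK_GB = 8.0                # reported peak during inference
MAIN_WEIGHTS_GB = 3.86       # model.safetensors
SPEECH_TOKENIZER_GB = 0.682  # speech_tokenizer/model.safetensors

weights_on_disk = MAIN_WEIGHTS_GB + SPEECH_TOKENIZER_GB
headroom = CARD_VRAM_GB - PEAK_GB
decode_overhead = PEAK_GB - BASELINE_GB  # KV cache + activations

print(f"weights on disk:  {weights_on_disk:.2f} GB")
print(f"headroom at peak: {headroom:.1f} GB")
print(f"decode overhead:  {decode_overhead:.1f} GB")
```

If you plan to run another model alongside, the headroom figure is the budget to compare against.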

⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:

Qwen3-TTS-12Hz-1.7B-CustomVoice — same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.

Two 0.6B variants — Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice — are also released by the Qwen team with the same 10-language coverage per the HF model card. Lighter footprint; same install path.

⚠️ Blackwell + FlashAttention 2. The RTX 5070 is a Blackwell card (compute capability sm_120, GB205 die). As of mid-2026, FlashAttention 2 pre-built wheels do not include sm_120 kernels (Dao-AILab/flash-attention#2168, still open, tracks the Blackwell/RTX 50-series CUDA error), yet the canonical HF model card hardcodes attn_implementation="flash_attention_2" in every Quick Start snippet. On the 5070 that line fails at the first inference call with "no kernel image is available for execution on the device". The instructions below override it to eager attention. One RTX 4090 user in the same Qwen discussion (anujchopra) reported eager attention running at the same speed as FlashAttention 2, or slightly faster, for this model, so dropping FA2 on Blackwell should cost little or nothing.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM, BF16-capable (Ampere or newer)	RTX 5070 (12 GB, Blackwell sm_120)
RAM	16 GB system	—
Storage	5 GB free	4.54 GB weights (HF Files tab)
Software	Python 3.12, PyTorch with CUDA (`cu128`), `ffmpeg`	qwen-tts (PyPI)

Installation

1. Create the environment

Per the official Qwen3-TTS README and the HF model card:

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

2. Install PyTorch with CUDA (cu128 for Blackwell)

The RTX 5070 is Blackwell (sm_120). PyTorch's cu128 wheels include sm_120 kernels, so install from the cu128 index explicitly rather than relying on whatever a bare pip install torch resolves to; an older build raises no kernel image is available for execution on the device:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

(The community deploy guide uses cu121 because its target was an RTX 3090. Substitute cu128 for any Blackwell sm_120 card.)
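After installing, it can help to confirm that the build actually lists sm_120 before loading the model. The following is a minimal sketch: torch.cuda.get_arch_list() is a real PyTorch call, while the two helper names are hypothetical. It is a coarse check on the compiled architecture list only.

```python
def build_supports_arch(arch_list, arch="sm_120"):
    """Return True if `arch` is among the architectures a build was compiled for."""
    return arch in arch_list


def check_local_build(arch="sm_120"):
    """Inspect the installed PyTorch build.

    Returns None when torch is not installed or no CUDA device is visible,
    otherwise True/False depending on whether `arch` is listed.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return build_supports_arch(torch.cuda.get_arch_list(), arch)


if __name__ == "__main__":
    result = check_local_build()
    if result is None:
        print("torch or a CUDA device is not available; nothing to check")
    elif result:
        print("this PyTorch build lists sm_120")
    else:
        print("sm_120 missing: reinstall from the cu128 index")
```

A False result is the situation that produces the no kernel image error at inference time.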

3. Install the qwen-tts package

pip install -U qwen-tts

This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.

4. (Skip on RTX 5070) FlashAttention 2

The HF card suggests pip install -U flash-attn --no-build-isolation. Skip this on Blackwell — the pre-built wheels have no sm_120 kernels yet (see the warning above). The model runs on eager / SDPA attention without it.
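If the same script has to run on both Blackwell and older cards, the attention backend can be chosen from the device's compute capability. This is a sketch under a stated assumption: that FlashAttention 2 wheels cover compute capability 8.x and 9.x (Ampere, Ada, Hopper) but not 12.x. The helper names are hypothetical; the capability tuple has the shape returned by torch.cuda.get_device_capability().

```python
import importlib.util


def flash_attn_is_installed():
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(capability, flash_attn_installed):
    """Choose a value for attn_implementation from a (major, minor) capability.

    Assumption: FlashAttention 2 pre-built wheels cover capability 8.x and 9.x
    only, so Blackwell sm_120 (12, 0) always falls back to eager attention.
    """
    major, _minor = capability
    if major in (8, 9) and flash_attn_installed:
        return "flash_attention_2"
    return "eager"


# RTX 5070 reports compute capability (12, 0).
print(pick_attn_implementation((12, 0), flash_attn_is_installed()))  # eager
```

Pass the returned string as attn_implementation when loading the model.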

5. Download the weights

First run fetches them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main model.safetensors + 682 MB speech_tokenizer/model.safetensors, both verified live on the HF Files tab). To pre-cache:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base

Running

Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="eager",   # NOT flash_attention_2 on Blackwell sm_120
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it!"

wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")

The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000

Results

Speed: Omitted, because no benchmark names the RTX 5070. The latency reports in the Qwen discussion thread cover an RTX 3090 (Geximus, RTF 3x), an RTX 4090 (anujchopra), an RTX 5090 (intlex, execution time ~2x), and a 3080 mobile (chrisoutwright, RTF 4.0); none is the 5070. We do not carry over 5070 Ti / 5080 numbers either: the 5070 has ~31% fewer CUDA cores and ~25% less memory bandwidth than the 5070 Ti, and the gap to the 5080 is wider, so a higher-tier figure would overstate it. Because this workload is bound by the autoregressive decode loop (GPU utilisation hovers around 10-16% on every card reported in the thread), the 5070 will probably land somewhere in the RTF 2-4x band, but until a 5070-named measurement exists we leave the number out. If you run it, please contribute a benchmark so the live figure appears at /check/qwen3tts/rtx-5070.
VRAM usage: ~5 GB baseline with weights resident, rising to a ~8 GB peak. The ~5 GB baseline comes from the archy.net Ubuntu Server walkthrough on an RTX 3090; the ~8 GB peak comes from user Geximus in the Qwen-canonical discussion thread, who logged roughly 8 GB of GPU memory allocated, also on an RTX 3090. Both fit within the 5070's 12 GB with headroom: the dominant cost is the resident BF16 weights (4.54 GB on disk), and the remainder of the peak is KV cache and activations from autoregressive decoding rather than anything architecture-specific. Once a 5070 benchmark lands via /contribute, the measured number appears at /check/qwen3tts/rtx-5070.
Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
License: Apache-2.0 per the HF card frontmatter. Free for commercial use.

For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-5070.
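If you want to measure your own 5070, the RTF figures quoted under Speed follow the usual convention of wall-clock synthesis time divided by the duration of the generated audio, so values above 1.0 are slower than realtime. A minimal sketch, with a hypothetical helper name and illustrative numbers:

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio is empty")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 24,000 samples at 24,000 Hz is 1 s of audio,
# so 3 s of synthesis time gives an RTF of 3.0.
print(real_time_factor(3.0, 24_000, 24_000))  # 3.0
```

In practice, wrap the generate_voice_clone call with time.perf_counter() and pass the elapsed time together with len(wavs[0]) and sr, assuming wavs[0] is a one-dimensional sample array as in clone.py.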

Tradeoffs vs. siblings

Variant	What you get	When to choose
`Qwen3-TTS-12Hz-1.7B-Base` (this recipe)	Zero-shot voice cloning from a 3-second clip	You want to clone arbitrary voices
`Qwen3-TTS-12Hz-1.7B-CustomVoice`	9 curated premium speakers + natural-language style control	You want production-grade preset voices without supplying reference audio
`Qwen3-TTS-12Hz-0.6B-Base` / `0.6B-CustomVoice`	Lighter footprint, same 10 languages	You're packing other models alongside on a tight VRAM budget

Troubleshooting

`no kernel image is available for execution on the device`

Your PyTorch build doesn't include kernels for the 5070's sm_120 (Blackwell) compute capability, or you left flash_attention_2 in the code. Fix: reinstall PyTorch from the cu128 index (pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128) and use attn_implementation="eager". FlashAttention 2's missing sm_120 kernels are tracked at Dao-AILab/flash-attention#2168 — until pre-built wheels ship Blackwell kernels, do not enable FA2 on this card.

Generation is slower than realtime (RTF above 1.0)

Expected behaviour, not a bug. Multiple users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 10-16%. The bottleneck is the sequential autoregressive decode loop, not raw GPU throughput, so a faster card helps less than you might expect. Keep dtype=torch.bfloat16, and when synthesising several sentences in the same voice, build the reference prompt once with create_voice_clone_prompt() and batch the sentences against it to amortise the reference-encoding pass.
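One way to reuse the reference encoding across sentences is sketched below. The create_voice_clone_prompt() and generate_voice_clone() method names come from this recipe; the voice_clone_prompt keyword and the list-valued text and language arguments are assumptions to verify against the GitHub README, and clone_many is a hypothetical helper.

```python
def clone_many(model, ref_audio, ref_text, sentences, language="English"):
    """Encode the reference clip once, then synthesise all sentences in one call.

    `model` is expected to be a loaded Qwen3TTSModel. The keyword names
    passed below (voice_clone_prompt, list-valued text/language) are
    assumptions, not a confirmed signature.
    """
    sentences = list(sentences)
    # Single reference-encoding pass, shared by every sentence.
    prompt = model.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text)
    wavs, sr = model.generate_voice_clone(
        text=sentences,
        language=[language] * len(sentences),
        voice_clone_prompt=prompt,
    )
    return wavs, sr
```

With the model from clone.py, a call such as clone_many(model, ref_audio, ref_text, ["First line.", "Second line."]) would return one waveform per sentence if the assumed signature holds.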

Voice cloning without a reference transcript

The archy.net guide documents an x_vector_only_mode=True flag that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
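The choice between the two modes can be expressed as a small argument builder. This sketch assumes x_vector_only_mode is accepted directly by generate_voice_clone(), which this recipe does not confirm; build_clone_kwargs is a hypothetical helper.

```python
def build_clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build keyword arguments for a voice-clone call.

    With a transcript, pass it as ref_text. Without one, fall back to
    x_vector_only_mode=True (assumed to be a generate_voice_clone keyword).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        kwargs["ref_text"] = ref_text
    else:
        kwargs["x_vector_only_mode"] = True
    return kwargs


print(build_clone_kwargs("Hello there.", "English", "speaker.wav"))
```

The result would be unpacked into the call, for example model.generate_voice_clone(**build_clone_kwargs(...)).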

Language code rejected

The archy.net guide flags that the language argument expects full language names ("English", "French", "Japanese") — short codes like "en" or "fr" raise an error.
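If your calling code carries short codes, a small normaliser can map them to the full names before they reach the model. This is a sketch: the ten names are the ones listed in this recipe, the two-letter keys are standard ISO 639-1 codes, and normalise_language is a hypothetical helper rather than part of qwen-tts.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)

# ISO 639-1 codes for the ten supported languages.
ISO_639_1 = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalise_language(value):
    """Map a short code or loosely cased name to a full language name."""
    candidate = value.strip()
    name = ISO_639_1.get(candidate.lower(), candidate.title())
    if name not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {value!r}")
    return name


print(normalise_language("fr"))        # French
print(normalise_language("japanese"))  # Japanese
```

Anything outside the ten supported languages raises ValueError up front instead of failing inside the model call.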

If you hit a problem not covered here, please report it via our submission form so the next reader benefits.