Qwen3-TTS 1.7B-Base on RTX 4070: Multilingual Voice Cloning in 10 Languages with FlashAttention-2

What You'll Build

A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 4070 — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.

Hardware data: RTX 4070 (12 GB, Ada Lovelace sm_89) · weights 4.54 GB on disk · ~5 GB VRAM baseline with weights resident, climbing toward a ~8 GB peak during inference, per the archy.net walkthrough and the Qwen discussion thread. Both figures were measured on an RTX 3090; they should transfer because the footprint of this autoregressive workload is dominated by the BF16 weights rather than anything architecture-specific. The ~8 GB peak leaves roughly 4 GB of headroom inside the 4070's 12 GB envelope. See the benchmark data at /check/qwen3tts/rtx-4070.

⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:

Qwen3-TTS-12Hz-1.7B-CustomVoice — same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.

Qwen3-TTS-12Hz-1.7B-VoiceDesign — generates a voice from a natural-language persona description (rather than a reference clip), via generate_voice_design, per the variant table on the HF model card.

Two 0.6B variants — Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice — are also released by the Qwen team with the same 10-language coverage per the HF model card. Lighter footprint; same install path.

ℹ️ Keep FlashAttention-2 — no Blackwell override needed. The RTX 4070 is Ada Lovelace sm_89, so FlashAttention-2's pre-built wheels include full kernel coverage: the FlashAttention README lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported, and the stock pip install torch already ships Ada kernels (no special cu128/sm_120 wheel selection). Keep attn_implementation="flash_attention_2" from the canonical HF card verbatim — the Blackwell sm_120 kernel gap that forces RTX 50-series cards onto eager attention does not affect the 4070.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM, BF16-capable (Ampere or newer)	RTX 4070 (12GB, Ada sm_89)
RAM	16 GB system	—
Storage	5 GB free	4.54 GB weights (HF Files tab)
Software	Python 3.12, PyTorch with CUDA, `ffmpeg`	qwen-tts (PyPI)
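
The software row of the table can be checked before installing anything. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper name, and treating Python versions newer than 3.12 as acceptable is an assumption, not something the Qwen README states.

```python
import shutil
import sys


def check_requirements(min_python=(3, 12)):
    """Return a dict of requirement name -> bool for the local machine."""
    return {
        # The recipe's conda env pins Python 3.12; accepting newer
        # versions is an assumption made for this sketch.
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

It does not check the GPU or the PyTorch install; those only become testable after step 2 of the installation.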

Installation

1. Create the environment

Per the official Qwen3-TTS README and the HF model card:

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

2. Install PyTorch with CUDA

The default PyTorch wheel already includes sm_89 kernels (Ada Lovelace), so the stock command works directly — no Blackwell cu128 wheel selection is required for the 4070:

pip install -U torch

For an explicit CUDA-version pin matching the archy.net walkthrough (the guide targets an RTX 3090 but the same wheel covers the 4070's sm_89):

pip install -U torch --index-url https://download.pytorch.org/whl/cu121
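
To confirm that the installed wheel actually sees the card as sm_89, something like the sketch below may help. `classify` and `detect_gpu` are hypothetical helper names; the `torch.cuda` calls are standard PyTorch, and the architecture mapping covers only the compute capabilities this recipe discusses.

```python
def classify(major, minor):
    """Return (sm_tag, architecture, in_flash_attn_class) for a compute capability."""
    sm = f"sm_{major}{minor}"
    if (major, minor) == (8, 9):
        arch = "Ada Lovelace"
    elif major == 8:
        arch = "Ampere"
    elif major == 9:
        arch = "Hopper"
    elif major in (10, 12):
        arch = "Blackwell"
    else:
        arch = "other"
    # The FlashAttention README lists Ampere, Ada, and Hopper as supported.
    return sm, arch, arch in ("Ampere", "Ada Lovelace", "Hopper")


def detect_gpu(index=0):
    """Query PyTorch for the device name and classify its capability."""
    import torch  # imported lazily so classify() works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("PyTorch cannot see a CUDA device")
    major, minor = torch.cuda.get_device_capability(index)
    return torch.cuda.get_device_name(index), classify(major, minor)
```

On an RTX 4070, `detect_gpu()` would be expected to report sm_89; a `RuntimeError` instead usually points at a CPU-only wheel or a driver problem.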

3. Install the qwen-tts package

pip install -U qwen-tts

This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.

4. Install FlashAttention 2 (recommended on Ada)

The HF card recommends FlashAttention 2 to reduce GPU memory usage:

pip install -U flash-attn --no-build-isolation

The RTX 4070 (Ada sm_89) falls inside the class the FlashAttention README lists as supported: "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)". The same README ties bf16 support to that Ampere/Ada/Hopper class. If the flash-attn build runs out of RAM (the HF card notes that it is RAM-heavy), cap the number of parallel jobs:

MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

5. Download the weights

The first run fetches the weights automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base: a 3.86 GB main model.safetensors plus a 682 MB speech_tokenizer/model.safetensors, 4.54 GB in total (both sizes verified live on the HF Files tab). To pre-cache:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
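
A quick way to catch an interrupted download is to compare the on-disk size with the 3.86 GB + 682 MB quoted above. The sketch below uses hypothetical helper names and assumes that the Files tab reports decimal gigabytes and that `root` holds exactly one copy of the files (for example a snapshot directory, not the whole cache tree, where blobs and symlinks would be counted twice). The 0.2 GB tolerance is an arbitrary choice.

```python
import os


def total_size_gb(root):
    """Sum the sizes of all regular files under root, in decimal gigabytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # getsize follows symlinks, so linked weight files are counted
            # at their real size.
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e9


def looks_complete(root, expected_gb=3.86 + 0.682, tolerance_gb=0.2):
    """True when the on-disk size is within tolerance of the expected total."""
    return abs(total_size_gb(root) - expected_gb) <= tolerance_gb
```

A `False` result is only a hint to re-run the download, not proof of corruption; small config and tokenizer files also contribute to the total.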

Running

Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

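# Load the checkpoint in BF16 on the first CUDA device.
# If flash-attn is not installed, switch attn_implementation to "eager".
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Reference clip and its transcript: this is the voice that gets cloned.
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."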

wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")

The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
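
To render the same cloned voice in several of the ten languages, the call from clone.py can be wrapped in a loop. This is a rough sketch: `build_jobs`, `synthesise_all`, and the output file naming are hypothetical, and the `generate_voice_clone` call simply mirrors the arguments used in the script above.

```python
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)


def build_jobs(sentences):
    """Turn {language: text} into (language, text, filename) tuples."""
    jobs = []
    for language, text in sentences.items():
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {language!r}")
        jobs.append((language, text, f"output_{language.lower()}.wav"))
    return jobs


def synthesise_all(model, sentences, ref_audio, ref_text):
    """Clone the same reference voice once per language and write WAV files."""
    import soundfile as sf  # lazy import: only needed when writing audio

    for language, text, filename in build_jobs(sentences):
        wavs, sr = model.generate_voice_clone(
            text=text,
            language=language,
            ref_audio=ref_audio,
            ref_text=ref_text,
        )
        sf.write(filename, wavs[0], sr)
```

With the `model`, `ref_audio`, and `ref_text` from clone.py, a call such as `synthesise_all(model, {"English": "Hello.", "German": "Hallo."}, ref_audio, ref_text)` would write output_english.wav and output_german.wav. Validating the language names up front avoids loading the model only to fail on the first unsupported entry.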

Results

Speed: No first-party RTX 4070 measurement exists yet, so no headline number is quoted here. Community users in the Qwen discussion thread report real-time factors (RTF) of roughly 2-4x, meaning audio takes 2-4 times its own duration to generate, across a wide range of cards: Geximus logs RTF x3 on an RTX 3090, intlex reports an execution time of ~x2 on an RTX 5090, and chrisoutwright sees RTF 4.0 on a 3080 mobile. GPU utilisation hovers around 10-16% in every report, which fits an autoregressive, compute-light model that leaves most of the GPU idle. Treat that range as an order-of-magnitude expectation, not a 4070 figure. Once a first-party RTX 4070 benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-4070.
VRAM usage: ~5 GB baseline with weights resident, climbing toward a ~8 GB peak. The ~5 GB baseline comes from the archy.net Ubuntu Server walkthrough on an RTX 3090; the ~8 GB allocated peak is reported by user Geximus in the Qwen discussion thread, also on an RTX 3090. Both fit inside the 4070's 12 GB envelope with headroom. The numbers should transfer across cards because the dominant cost is the resident BF16 weights (4.54 GB on disk); the remaining ~3 GB of peak is KV cache and activations from autoregressive decoding rather than anything architecture-specific.
Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
License: Apache-2.0 per the HF card frontmatter (ungated — weights download freely). Free for commercial use.

For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-4070.
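
Until a measured 4070 figure exists, peak VRAM can be checked locally with PyTorch's memory statistics. The sketch below is one way to do it; `headroom_gb` and `measure_peak_vram` are hypothetical helper names, and treating the card as exactly 12 decimal gigabytes is a simplification.

```python
def headroom_gb(peak_bytes, card_gb=12.0):
    """Return (peak in GB, headroom left on the card in GB), rounded to 2 places."""
    peak_gb = peak_bytes / 1e9
    return round(peak_gb, 2), round(card_gb - peak_gb, 2)


def measure_peak_vram(generate, device=0):
    """Run generate() once and return PyTorch's peak allocated bytes."""
    import torch  # lazy import so headroom_gb stays usable without CUDA

    # Reset the high-water mark; weights already on the GPU still count,
    # because the peak restarts from the current allocation.
    torch.cuda.reset_peak_memory_stats(device)
    generate()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device)
```

A typical call would be `measure_peak_vram(lambda: model.generate_voice_clone(...))` with the arguments from clone.py, followed by `headroom_gb(peak)`. The allocator figure excludes the CUDA context and cached-but-free blocks, so `nvidia-smi` will normally show a somewhat higher number.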

Tradeoffs vs. siblings

Variant	What you get	When to choose
`Qwen3-TTS-12Hz-1.7B-Base` (this recipe)	Zero-shot voice cloning from a 3-second clip	You want to clone arbitrary voices
`Qwen3-TTS-12Hz-1.7B-CustomVoice`	9 curated premium speakers + natural-language style control	You want preset voices without supplying reference audio
`Qwen3-TTS-12Hz-0.6B-Base` / `0.6B-CustomVoice`	Lighter footprint, same 10 languages	You're packing other models alongside on a tight VRAM budget

Troubleshooting

`flash-attn` build runs out of RAM

The pip install flash-attn --no-build-isolation step launches a heavy CUDA compile that can exhaust system memory. The HF card recommends MAX_JOBS=4 pip install -U flash-attn --no-build-isolation for machines with limited RAM. If you still hit OOM, drop to MAX_JOBS=2 or skip FlashAttention 2 entirely — the model also runs on attn_implementation="eager", and one RTX 4090 user (anujchopra) in the Qwen discussion thread reports getting the same (or somewhat better) speed with eager attention on Ada. FlashAttention 2 stays the recommended default on the 4070, but eager is a working fallback if the build fails.
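
The eager fallback can be selected automatically instead of editing the script by hand. A minimal sketch, assuming the package is importable as `flash_attn` (its usual module name); `pick_attn_implementation` is a hypothetical helper.

```python
import importlib.util


def pick_attn_implementation(module_name="flash_attn"):
    """Return "flash_attention_2" if the module can be found, else "eager"."""
    # find_spec only locates the module; it does not import or initialise it.
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "eager"
```

Passing `attn_implementation=pick_attn_implementation()` to `Qwen3TTSModel.from_pretrained` keeps one script working on machines with and without a successful flash-attn build. It only checks that the package is present, not that its kernels match the installed PyTorch and CUDA versions.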

Generation is slower than RTF 1.0 (audio takes longer than realtime)

Expected behaviour, not a bug. Multiple community users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 10-16%. The model is autoregressive and compute-light: the sequential, token-by-token decode loop, not raw GPU throughput, is the bottleneck. Keep dtype=torch.bfloat16 and batch multiple sentences per run to amortise the reference-encoding pass. The QwenLM team is tracking speedup work on the GitHub repo.
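
To compare a local run against the community figures, RTF can be computed as wall-clock generation time divided by the duration of the audio produced, which matches the convention used in this recipe (values above 1.0 are slower than realtime). The helper names below are hypothetical.

```python
import time


def real_time_factor(generation_seconds, num_samples, sample_rate):
    """RTF = generation time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return generation_seconds / audio_seconds


def timed(generate):
    """Call generate() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = generate()
    return result, time.perf_counter() - start
```

For example, `(wavs, sr), elapsed = timed(lambda: model.generate_voice_clone(...))` followed by `real_time_factor(elapsed, len(wavs[0]), sr)`. This assumes `wavs[0]` is a one-dimensional array of samples, as the `sf.write` call in clone.py suggests. The first call after loading typically includes warm-up cost, so timing a second run gives a fairer number.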

Voice cloning without a reference transcript

The archy.net guide documents an x_vector_only_mode=True flag passed to generate_voice_clone that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
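
The two modes can share one call site by building the keyword arguments conditionally. This sketch relies on the `x_vector_only_mode` flag as described in the archy.net guide; `clone_kwargs` is a hypothetical helper, and the exact behaviour of the flag should be checked against the installed qwen-tts version.

```python
def clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build keyword arguments for generate_voice_clone, with or without a transcript."""
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        # Transcript available: pass it so the model gets alignment cues.
        kwargs["ref_text"] = ref_text
    else:
        # No transcript: fall back to the flag documented by the archy.net guide.
        kwargs["x_vector_only_mode"] = True
    return kwargs
```

The result is passed straight through, as in `wavs, sr = model.generate_voice_clone(**clone_kwargs("Hello.", "English", ref_audio))`.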

Language code rejected

The archy.net guide flags that the language argument expects full language names ("English", "French", "German"); short codes like "en" or "fr" raise an error.
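
If language values arrive as short codes from elsewhere in a pipeline, they can be mapped to full names before the call. The mapping below is an illustrative assumption built from ISO 639-1 codes and the ten languages listed on the model card; `normalise_language` is a hypothetical helper and not part of qwen-tts.

```python
LANGUAGE_NAMES = {
    "zh": "Chinese", "en": "English", "ja": "Japanese", "ko": "Korean",
    "de": "German", "fr": "French", "ru": "Russian", "pt": "Portuguese",
    "es": "Spanish", "it": "Italian",
}


def normalise_language(value):
    """Map an ISO 639-1 code or a loosely cased name to a full language name."""
    key = value.strip().lower()
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]
    for name in LANGUAGE_NAMES.values():
        if name.lower() == key:
            return name
    raise ValueError(f"unsupported language: {value!r}")
```

Calling `normalise_language("fr")` yields the full name to pass as `language=`, and anything outside the ten supported languages fails early with a clear message rather than inside the model call.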

If you hit a problem not covered here, please report it via our submission form so the next reader benefits.