Qwen3-TTS 1.7B-Base on RTX 3060: Multilingual Voice Cloning in 10 Languages with FlashAttention-2

tts · intermediate · 8GB+ VRAM · Jun 14, 2026

This intermediate recipe sets up Qwen3-TTS 1.7B-Base for voice cloning on the RTX 3060 and needs about 8 GB of VRAM at peak.

prerequisites
  • NVIDIA RTX 3060 (12GB, Ampere sm_86) or any CUDA GPU with 8 GB+ VRAM
  • Python 3.12 (fresh conda environment recommended)
  • CUDA-enabled PyTorch build (CUDA 12.x); the stock `pip install torch` already ships sm_86 kernels
  • ~5 GB free disk for weights (3.86 GB main model + 682 MB speech tokenizer)

What You'll Build

A local zero-shot text-to-speech pipeline using Qwen3-TTS-12Hz-1.7B-Base on an RTX 3060 — clone any voice from a 3-second reference clip, then synthesise new sentences in Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, or Italian.

Hardware data: RTX 3060 (12GB, Ampere GA106 sm_86, 360 GB/s GDDR6 per TechPowerUp). Weights are 4.54 GB on disk. VRAM sits at a ~5 GB weights-resident baseline and climbs toward a ~8 GB peak during inference, per the archy.net walkthrough and the Qwen-canonical discussion thread. Both figures were measured on an RTX 3090, which shares the 3060's Ampere sm_86 architecture, so the footprint transfers within-architecture rather than by extrapolation. That ~8 GB peak leaves about 4 GB of headroom inside the 3060's 12 GB envelope. See the benchmark data at /check/qwen3tts/rtx-3060.
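
The budget above is plain arithmetic, sketched here for readers who want to redo it for another card. Every number is copied from this recipe; the variable names are illustrative, and the sketch mixes on-disk and in-VRAM gigabytes as loosely as the text does.

```python
# VRAM budget sketch using only the figures quoted in this recipe.
MAIN_WEIGHTS_GB = 3.86     # model.safetensors, size from the HF Files tab
TOKENIZER_GB = 0.682       # speech_tokenizer/model.safetensors
BASELINE_VRAM_GB = 5.0     # weights-resident baseline (RTX 3090 report)
PEAK_VRAM_GB = 8.0         # allocated peak during inference (RTX 3090 report)
CARD_VRAM_GB = 12.0        # RTX 3060

weights_on_disk = MAIN_WEIGHTS_GB + TOKENIZER_GB
decode_overhead = PEAK_VRAM_GB - BASELINE_VRAM_GB   # KV cache + activations
headroom = CARD_VRAM_GB - PEAK_VRAM_GB

print(f"weights on disk:  {weights_on_disk:.2f} GB")
print(f"decode overhead: ~{decode_overhead:.0f} GB")
print(f"headroom on card: ~{headroom:.0f} GB")
```

Swapping CARD_VRAM_GB for 8.0 shows why 8 GB is the stated minimum: the headroom drops to roughly zero.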

⚠️ Variant pinned. This recipe targets the 1.7B-Base checkpoint (voice cloning). Several sibling variants live on the same Qwen HF org and are not covered here:

  • Qwen3-TTS-12Hz-1.7B-CustomVoice — same 1.7B parameter count and runtime VRAM envelope, but ships 9 pre-defined premium speakers plus natural-language style control via the instruct= argument. The install / runtime steps below carry over; only the inference call changes (generate_custom_voice(...) with speaker= and optional instruct= arguments) per the GitHub README.
  • Qwen3-TTS-12Hz-1.7B-VoiceDesign — generates a voice from a natural-language persona description (rather than a reference clip), via generate_voice_design, per the variant table on the HF model card.
  • Two 0.6B variants — Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-0.6B-CustomVoice — are also released by the Qwen team with the same 10-language coverage per the HF model card. Lighter footprint; same install path.

ℹ️ Keep FlashAttention-2: the 3060 is a fully supported FA2 architecture. The RTX 3060 is Ampere sm_86, and FlashAttention-2's pre-built wheels include kernel coverage for it. The FlashAttention README lists "Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100)" as supported and notes that bf16 requires this same Ampere/Ada/Hopper class. The stock pip install torch already ships sm_86 kernels, so no special cu128/sm_120 wheel selection is needed. Keep attn_implementation="flash_attention_2" from the canonical HF card verbatim. The Blackwell sm_120 kernel gap that forces RTX 50-series cards onto eager attention does not affect Ampere; sm_86 is among the oldest architectures FA2 has shipped kernels for.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM, BF16-capable (Ampere or newer) | RTX 3060 (12GB, Ampere sm_86)
RAM | 16 GB system | -
Storage | 5 GB free | 4.54 GB weights (HF Files tab)
Software | Python 3.12, PyTorch with CUDA, ffmpeg | qwen-tts (PyPI)

Installation

1. Create the environment

Per the official Qwen3-TTS README and the HF model card:

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

2. Install PyTorch with CUDA

The default PyTorch wheel already includes sm_86 kernels (Ampere), so the stock command works directly — no Blackwell cu128 wheel selection is required for the 3060:

pip install -U torch

For an explicit CUDA-version pin matching the archy.net walkthrough (the guide measured on an RTX 3090, and the same cu121 wheel covers the 3060's sm_86):

pip install -U torch --index-url https://download.pytorch.org/whl/cu121
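
Before moving on, it can help to confirm that the installed PyTorch build actually sees the card. The sketch below is one way to do that; `supports_fa2` and `report_gpu` are hypothetical helper names, and the capability range is a simplification of the "Ampere, Ada, or Hopper" class named in the FlashAttention README, not an official compatibility table.

```python
import importlib.util


def supports_fa2(capability):
    """Rough check: Ampere (8.0) through Hopper (9.0), per the README's class list.

    Assumption: anything outside that range (Turing 7.5, Blackwell 12.0) is
    treated as unsupported, matching the sm_120 gap described in this recipe.
    """
    return (8, 0) <= tuple(capability) <= (9, 0)


def report_gpu():
    """Return a one-line summary of the first CUDA device, if any."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    major, minor = torch.cuda.get_device_capability(0)
    return (
        f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}, "
        f"bf16={torch.cuda.is_bf16_supported()}, "
        f"FA2 class={supports_fa2((major, minor))}"
    )


print(report_gpu())
```

On an RTX 3060 the capability pair should come back as (8, 6), which is the sm_86 target this recipe assumes.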

3. Install the qwen-tts package

pip install -U qwen-tts

This installs the Qwen3TTSModel Python class and the qwen-tts-demo CLI entrypoint, per the GitHub README.

4. Install FlashAttention 2 (recommended on Ampere)

The HF card recommends FlashAttention 2 to reduce GPU memory usage:

pip install -U flash-attn --no-build-isolation

The RTX 3060 (Ampere sm_86) falls in the Ampere class that the FlashAttention README lists as supported, so the standard install applies. If flash-attn compilation runs out of RAM (the HF card notes the build is RAM-heavy), cap parallel jobs:

MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

5. Download the weights

The first run fetches them automatically from Qwen/Qwen3-TTS-12Hz-1.7B-Base (3.86 GB main model.safetensors + 682 MB speech_tokenizer/model.safetensors, both sizes as listed on the HF Files tab). To pre-cache:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base
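
The same pre-cache step can be scripted from Python, with a disk-space check first. This is a sketch under two assumptions: that the Hugging Face cache lives under the home directory (its default unless overridden), and that the 5 GB figure from the prerequisites is the right threshold. `free_gb`, `has_room` and `precache` are hypothetical helper names; `snapshot_download` is the huggingface_hub function.

```python
import os
import shutil


def free_gb(path):
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=None, needed_gb=5.0):
    """True if the target filesystem has at least `needed_gb` free."""
    if path is None:
        path = os.path.expanduser("~")  # assumed default cache location
    return free_gb(path) >= needed_gb


def precache(repo_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base"):
    """Download the repo into the local Hugging Face cache and return its path."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=repo_id)


if __name__ == "__main__":
    print("enough free disk for the weights:", has_room())
```

Calling precache() only after has_room() returns True avoids a half-finished download on a full disk.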

Running

Save as clone.py and run with python clone.py. The reference audio URL below is the official sample from the HF model card:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the Base (voice-cloning) checkpoint in bf16 on the first CUDA device.
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Reference clip and its transcript (official sample from the HF model card).
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

# Synthesise new text in the cloned voice.
wavs, sr = model.generate_voice_clone(
    text="Local inference on a consumer GPU — clean, multilingual, in your own voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

# Write the first returned waveform at the sample rate the model reports.
sf.write("output.wav", wavs[0], sr)
print(f"wrote output.wav @ {sr} Hz")

The script writes output.wav to the current working directory. For an interactive demo UI on port 8000, use the bundled CLI from the official README:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
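
To reuse one reference clip across several languages, the generate_voice_clone call from clone.py can be wrapped in a loop. The sketch below uses only the arguments shown in that script; `clone_sentences` is a hypothetical helper, and it makes one call per sentence, so it is a convenience wrapper rather than a speed-up.

```python
def clone_sentences(model, sentences, ref_audio, ref_text):
    """Synthesise each (language, text) pair with the same reference voice.

    `model` is expected to expose generate_voice_clone(...) as in clone.py.
    Returns a list of (language, waveform, sample_rate) tuples.
    """
    results = []
    for language, text in sentences:
        wavs, sr = model.generate_voice_clone(
            text=text,
            language=language,
            ref_audio=ref_audio,
            ref_text=ref_text,
        )
        results.append((language, wavs[0], sr))
    return results


# Example input: full language names, as the language argument expects.
SENTENCES = [
    ("English", "This sentence was generated locally."),
    ("German", "Dieser Satz wurde lokal erzeugt."),
    ("French", "Cette phrase a été générée localement."),
]
```

With the model, ref_audio and ref_text from clone.py in scope, each result can be saved with sf.write(f"output_{language.lower()}.wav", wav, sr) while iterating over clone_sentences(model, SENTENCES, ref_audio, ref_text).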

Results

  • Speed: No first-party RTX 3060 measurement exists yet, so no headline number is quoted here. The model is autoregressive and compute-light: community users in the Qwen discussion thread report real-time factors (RTF) of roughly 2-4x, meaning audio takes 2-4x its own duration to generate, even on faster cards. User Geximus logs RTF x3 on an RTX 3090, chrisoutwright RTF 4.0 on a 3080 mobile, and intlex an execution time around x2 on an RTX 5090, with GPU utilisation hovering around 10-16% in every report. The RTX 3060 has lower compute and memory bandwidth (360 GB/s) than any of those cards, so treat that range as a best case (expect an RTF at or above it), not a 3060 figure. Once a first-party RTX 3060 benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-3060.
  • VRAM usage: ~5 GB weights-resident baseline, climbing toward ~8 GB peak. The ~5 GB baseline is from the archy.net Ubuntu Server walkthrough running on an RTX 3090; the ~8 GB allocated peak is reported by user Geximus in the Qwen-canonical discussion thread, also on an RTX 3090. Both the RTX 3090 and the RTX 3060 are Ampere sm_86, so this is a within-architecture transfer, not a cross-arch extrapolation, and both figures fit with headroom inside the 3060's 12 GB envelope. The dominant cost is the resident BF16 weights (4.54 GB on disk); the remaining ~3 GB of peak is KV cache plus activations from autoregressive decoding. Once an RTX 3060 benchmark lands via /contribute, the live number appears at /check/qwen3tts/rtx-3060.
  • Languages: 10 — Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, per the HF model card.
  • License: Apache-2.0 per the HF card frontmatter (ungated — weights download freely). Free for commercial use.
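
Since the VRAM figures above come from RTX 3090 reports, readers may want to measure their own card. A minimal sketch, assuming PyTorch's peak-memory counters are a fair proxy for the "allocated peak" quoted in the thread; `measure_peak_vram` is a hypothetical helper, and it reports GiB (1024³ bytes), which reads slightly lower than decimal GB.

```python
import importlib.util


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1024**3 bytes)."""
    return num_bytes / 1024**3


def measure_peak_vram(run):
    """Call `run()` and return (result, peak_gib).

    peak_gib is None when PyTorch or a CUDA device is unavailable, so the
    helper still works (without a measurement) on a CPU-only machine.
    """
    torch = None
    if importlib.util.find_spec("torch") is not None:
        import torch as _torch

        if _torch.cuda.is_available():
            torch = _torch
            torch.cuda.reset_peak_memory_stats()

    result = run()

    if torch is None:
        return result, None
    torch.cuda.synchronize()
    return result, bytes_to_gib(torch.cuda.max_memory_allocated())
```

Wrapping the generate_voice_clone call from clone.py in a lambda and passing it to measure_peak_vram gives a per-run peak; the counter only covers memory allocated through PyTorch, so nvidia-smi will typically show a somewhat higher number.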

For the live, measured benchmark data on this exact card, see /check/qwen3tts/rtx-3060.

Tradeoffs vs. siblings

Variant | What you get | When to choose
Qwen3-TTS-12Hz-1.7B-Base (this recipe) | Zero-shot voice cloning from a 3-second clip | You want to clone arbitrary voices
Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 curated premium speakers + natural-language style control | You want preset voices without supplying reference audio
Qwen3-TTS-12Hz-0.6B-Base / 0.6B-CustomVoice | Lighter footprint, same 10 languages | You're packing other models alongside on a tight VRAM budget

Troubleshooting

flash-attn build runs out of RAM

The pip install flash-attn --no-build-isolation step launches a heavy CUDA compile that can exhaust system memory. The HF card recommends MAX_JOBS=4 pip install -U flash-attn --no-build-isolation for machines with limited RAM. If you still hit OOM, drop to MAX_JOBS=2 or skip FlashAttention 2 entirely — the model also runs on attn_implementation="eager", and one RTX 4090 user (anujchopra) in the Qwen discussion thread reports getting the same (or somewhat better) speed with eager attention. FlashAttention 2 stays the recommended default on the 3060 — its kernels are fully supported on Ampere sm_86 — but eager is a working fallback if the build fails.
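
The eager fallback can be selected automatically rather than by editing the script. A minimal sketch: `pick_attn_implementation` is a hypothetical helper that only checks whether the flash_attn package is importable, which is an assumption-laden shortcut, since an installed package does not guarantee that its kernels load on a given GPU.

```python
import importlib.util


def pick_attn_implementation():
    """Prefer FlashAttention 2 when the package is installed, else eager."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


def load_kwargs():
    """Keyword arguments for from_pretrained, minus the torch dtype."""
    return {
        "device_map": "cuda:0",
        "attn_implementation": pick_attn_implementation(),
    }


print("attention backend:", pick_attn_implementation())
```

In clone.py this would be used as Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base", dtype=torch.bfloat16, **load_kwargs()).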

Generation is slower than RTF 1.0 (audio takes longer than realtime)

Expected behaviour, not a bug. Multiple community users in the Qwen discussion thread report RTF 2-4x even on RTX 4090 / 5090 with FlashAttention 2 enabled, with GPU utilisation hovering around 10-16%. The model is autoregressive and compute-light; the autoregressive decode loop, not raw throughput, is the bottleneck — which is also why the 3060's lower bandwidth matters less here than it would for a throughput-bound model. Keep dtype=torch.bfloat16 and pre-batch multiple sentences to amortise the reference-encoding pass. The QwenLM team is tracking speedup work on the GitHub repo.
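
To compare a local run against the community numbers, the real-time factor can be computed as generation time divided by audio duration. The sketch below follows that definition; `real_time_factor` and `timed_generate` are hypothetical helpers, and the timing includes any one-off warm-up cost on the first call.

```python
import time


def real_time_factor(elapsed_s, num_samples, sample_rate):
    """RTF = seconds spent generating / seconds of audio produced."""
    audio_s = num_samples / sample_rate
    return elapsed_s / audio_s


def timed_generate(generate):
    """Call `generate()` (returning (wavs, sr)) and attach the measured RTF."""
    start = time.perf_counter()
    wavs, sr = generate()
    elapsed = time.perf_counter() - start
    return wavs, sr, real_time_factor(elapsed, len(wavs[0]), sr)
```

For example, timed_generate(lambda: model.generate_voice_clone(text=..., language="English", ref_audio=ref_audio, ref_text=ref_text)) returns the waveform list, the sample rate and the RTF for that call. An RTF of 3.0 means six seconds of compute for two seconds of audio.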

Voice cloning without a reference transcript

The archy.net guide documents an x_vector_only_mode=True flag passed to generate_voice_clone that lets you clone from the reference audio without providing a transcript — useful when you don't know what the speaker said. The HF card adds that this uses only the speaker embedding, so cloning quality may be reduced. When you do have a transcript, supplying it via ref_text gives the model prosody alignment cues and improves output quality.
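
One way to handle both cases in a single code path is to build the keyword arguments conditionally. This sketch relies on the x_vector_only_mode flag as described in the archy.net guide; `clone_kwargs` is a hypothetical helper, not part of the qwen-tts package.

```python
def clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build generate_voice_clone arguments, with or without a transcript.

    With a transcript, pass it as ref_text. Without one, fall back to
    x_vector_only_mode=True (speaker embedding only, possibly lower quality).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text:
        kwargs["ref_text"] = ref_text
    else:
        kwargs["x_vector_only_mode"] = True
    return kwargs
```

The result is passed straight through, as in model.generate_voice_clone(**clone_kwargs("Hello there.", "English", "speaker.wav")), where speaker.wav stands in for any local reference clip.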

Language code rejected

The archy.net guide flags that the language argument expects full language names ("english", "french", "german") — short codes like "en" or "fr" raise an error. The supported set is auto, Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish.
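
A small guard can catch short codes before they reach the model. The sketch below validates against the supported set listed above and returns the capitalised spelling used in clone.py; `check_language` is a hypothetical helper, and whether the library itself is case-sensitive is not something this recipe confirms.

```python
SUPPORTED_LANGUAGES = (
    "auto", "Chinese", "English", "French", "German", "Italian",
    "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
)


def check_language(name):
    """Return the listed spelling of `name`, or raise ValueError.

    Matching ignores case and surrounding whitespace; short codes such as
    "en" or "fr" are rejected with a message listing the valid names.
    """
    lookup = {lang.lower(): lang for lang in SUPPORTED_LANGUAGES}
    key = name.strip().lower()
    if key not in lookup:
        valid = ", ".join(SUPPORTED_LANGUAGES)
        raise ValueError(f"unsupported language {name!r}; use one of: {valid}")
    return lookup[key]
```

Calling check_language("en") raises ValueError with the list of valid names, while check_language("french") returns "French".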

If you hit a problem not covered here, please report it via our submission form so the next reader benefits.

common questions
How much VRAM does Qwen3-TTS need?

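About 8 GB at peak: a ~5 GB weights-resident baseline that climbs toward ~8 GB during inference, which is why this recipe sets 8 GB as its minimum.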

Which GPUs is Qwen3-TTS tested on?

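This recipe targets the RTX 3060 (12 GB). No first-party RTX 3060 benchmark exists yet; the VRAM and speed figures quoted above come from community reports on other cards, chiefly the RTX 3090.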

How hard is this setup?

Intermediate — follow the steps above.