What You'll Build
A local zero-shot text-to-speech setup on an RTX 5060 Ti 16 GB that clones any voice from a 3-5 second reference clip and speaks it back across any of 646 documented languages. The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer.
Hardware data: RTX 5060 Ti (16 GB VRAM) · ~0.6 s to generate 5 s of audio with the low-VRAM config · benchmark data: /check/omnivoice/rtx-5060-ti
ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number — testing was reported on Intel Arc A310 (4 GB) and Arc Pro B50 (16 GB) (README). The 4 GB figure here is the working default from the community-tested low-VRAM wrapper on a 5060 Ti; raw FP16 on a 16 GB card has plenty of headroom. Once a community 5060 Ti benchmark lands at /check/omnivoice/rtx-5060-ti we'll replace the envelope with the measured peak.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4 GB VRAM (CUDA), any consumer NVIDIA card | RTX 5060 Ti 16 GB (Blackwell sm_120) |
| RAM | 8 GB system RAM | — |
| Storage | ~3.3 GB (2.45 GB main weights + 806 MB audio tokenizer + tokenizer) | — |
| Python | 3.10 or newer | — |
| CUDA | 12.8 (cu128 wheel required for sm_120) | — |
| Reference audio | 3-5 s WAV, mono | — |
Model weights total ~3.3 GB on disk from the HuggingFace Files tab: model.safetensors is 2.45 GB and the audio_tokenizer/model.safetensors adds 806 MB. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident footprint.
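The halving estimate is plain arithmetic on the file sizes quoted above. A minimal sketch, assuming both files hold almost entirely FP32 parameters (4 bytes each) and treating 1 GB as 1000 MB; the helper name is illustrative and not part of OmniVoice:

```python
# Rough weight-footprint estimate from the on-disk file sizes quoted above.
# Assumption: both files are almost entirely FP32 parameters (4 bytes each).
FP32_BYTES = 4
FP16_BYTES = 2


def resident_gb(disk_gb: float, src_bytes: int = FP32_BYTES,
                dst_bytes: int = FP16_BYTES) -> float:
    """Scale an on-disk size by the ratio of target to source element width."""
    return disk_gb * dst_bytes / src_bytes


main_weights_gb = 2.45       # model.safetensors
audio_tokenizer_gb = 0.806   # audio_tokenizer/model.safetensors (806 MB)

total_fp32_gb = main_weights_gb + audio_tokenizer_gb
total_fp16_gb = resident_gb(total_fp32_gb)

print(f"FP32 on disk: {total_fp32_gb:.2f} GB")
print(f"FP16 weights in memory: {total_fp16_gb:.2f} GB")
```

This covers weights only. Activations and generation buffers come on top, so treat the result as a floor rather than a peak.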
Installation
1. Create a clean Python env
python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
2. Install PyTorch with CUDA 12.8 (Blackwell-compatible)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
--extra-index-url https://download.pytorch.org/whl/cu128
This is the exact wheel pin from the OmniVoice README. The 5060 Ti is Blackwell (sm_120) and requires cu128 — older cu121/cu124 wheels will not load kernels for it.
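Before installing anything else, it can help to confirm that the wheel you just installed was built with kernels for your card. The sketch below is one way to check, using `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`. The two helper functions are illustrative names, and the check is a plain membership test that ignores PTX forward compatibility:

```python
# Pre-flight check: does the installed PyTorch build ship kernels for this GPU?
def arch_name(capability):
    """Turn a (major, minor) compute capability into an 'sm_XY' string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(capability, arch_list):
    """True if the wheel's compiled arch list contains the GPU's architecture."""
    return arch_name(capability) in arch_list


try:
    import torch
except ImportError:  # keeps the helpers usable when PyTorch is not installed
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)  # (12, 0) corresponds to sm_120
    archs = torch.cuda.get_arch_list()         # architectures compiled into this wheel
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print(arch_name(cap), "in wheel arch list:", build_supports(cap, archs))
else:
    print("PyTorch with a visible CUDA device was not found")
```

If the last line prints False for sm_120, reinstall from the cu128 index before going further.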
3. Install OmniVoice
pip install omnivoice
PyPI ships omnivoice 0.1.5 (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.
4. Prepare a reference clip
Pick a 3-5 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly — see Troubleshooting for why auto-transcription is risky on Blackwell right now.
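A quick sanity check on the clip can catch stereo or over-long files before they reach the model. This is a minimal sketch using only the standard-library `wave` module, which reads PCM WAV files (not float or compressed variants). The function names and the 3-5 s bounds as defaults are choices made for this example:

```python
import wave


def inspect_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return channels, rate, duration


def reference_problems(path, min_s=3.0, max_s=5.0):
    """List the ways a clip misses the mono, 3-5 s guidance (empty list = fine)."""
    channels, _, duration = inspect_wav(path)
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems
```

Calling reference_problems("ref.wav") returns an empty list when the clip fits the guidance, and a list of human-readable complaints otherwise.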
Running
Save this as tts.py next to your ref.wav:
from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

sf.write("out.wav", audio[0], 24000)
This is the canonical snippet from the upstream model card and GitHub README. Run it:
python tts.py
You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.
ComfyUI alternative
If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py
It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream FP32 source (repo).
Results
- Speed: ~0.6 s to generate 5 s of audio on an RTX 5060 Ti using the Wladastic low-VRAM wrapper with LM_QUANT=nf4 and DTYPE=float16. The same wrapper measures 0.2 s on an RTX 4080. Upstream's hardware-unspecified RTF 0.025 claim is omitted here; the 5060 Ti number above is what's been measured on this card.
- VRAM usage: Working envelope ~4 GB on a 5060 Ti with the nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the wrapper author hit OOMs at 3 GB with longer reference clips). A baseline (no quantization, fp16 only) loads roughly 1.6 GB of weights plus generation buffers; see /check/omnivoice/rtx-5060-ti for the measured peak once it's seeded.
- Quality notes: OmniVoice covers 646 languages totalling 581k training hours, but quality is heavily long-tailed: English has 206k hours, Chinese 111k, and many smaller languages sit below 1 hour of training data. Cross-lingual transfer is imperfect (see HF Discussion #22).
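To put the speed figures on the same scale as the upstream RTF claim, divide generation time by audio duration. A minimal sketch, assuming the RTX 4080 measurement was also taken on 5 s of audio (the source implies this but does not state it):

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """RTF = time spent generating / duration of audio produced (lower is faster)."""
    return generation_s / audio_s


# Figures quoted in the Results list above.
reported = {
    "RTX 5060 Ti (low-VRAM wrapper)": real_time_factor(0.6, 5.0),
    "RTX 4080 (same wrapper)": real_time_factor(0.2, 5.0),  # assumed 5 s of audio
}
upstream_rtf = 0.025  # upstream claim, hardware unspecified

for name, rtf in reported.items():
    ratio = rtf / upstream_rtf
    print(f"{name}: RTF {rtf:.3f}, {ratio:.1f}x the upstream figure")
```

Under that assumption the 5060 Ti lands at an RTF of about 0.12 and the 4080 at about 0.04, both slower than the upstream number but comfortably faster than real time.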
For the full benchmark data, see /check/omnivoice/rtx-5060-ti.
Troubleshooting
Output sounds like noise / garbled artifacts on a Blackwell card
A 5090 user reports exactly this in Issue #155 — open as of mid-May 2026, with the reporter explicitly confirming the bug persists even with --no-asr (so it is not the Whisper auto-transcription path). Root cause is still under investigation by the k2-fsa maintainers as of this writing. Workarounds reported on that thread vary by setup; passing ref_text explicitly (per the quick-start snippet above) is the most consistent. If you hit garbled output on a 5060 Ti, add your reproduction to the issue thread — Blackwell-specific datapoints are still being collected.
flex_attention shared-memory error during fine-tuning
If you try to fine-tune on an RTX 4090 or RTX A6000 you'll hit a 99 KB / 128 KB shared-memory wall — see Issue #83. This is fine-tuning only; inference uses smaller blocks and is unaffected. The 5060 Ti (sm_120 / Blackwell) has 228 KB of shared memory and is in the supported tier per that thread.
VRAM spikes / OOM with a long reference clip
The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget, eventually requiring CPU offload to stabilise. Workaround: keep your reference clip under 3.5 s, or set CPU_OFFLOAD=1 in that wrapper to push the LM weights to system RAM at the cost of a small latency hit.
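Trimming the clip does not need an audio editor. The sketch below uses the standard-library `wave` module to keep only the first 3.5 s of a PCM WAV. The function name and output path are illustrative, and the hard cut applies no fade:

```python
import wave


def trim_wav(src, dst, max_seconds=3.5):
    """Copy the first max_seconds of a PCM WAV to dst; return the new duration."""
    with wave.open(src, "rb") as rf:
        params = rf.getparams()
        keep = min(rf.getnframes(), int(max_seconds * rf.getframerate()))
        frames = rf.readframes(keep)
    with wave.open(dst, "wb") as wf:
        wf.setparams(params)    # same channel count, sample width, and rate
        wf.writeframes(frames)  # the header frame count is corrected on write
    return keep / params.framerate
```

For example, trim_wav("ref.wav", "ref_short.wav") writes the shortened clip alongside the original. Because the cut can land mid-word, listen to the result and update ref_text so it matches exactly what remains.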
pip install fails / wrong CUDA version
You must use the +cu128 PyTorch wheel for the 5060 Ti. The default pip install torch index ships cu121, which won't initialise kernels on sm_120. Re-install with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2.