How much VRAM does OmniVoice need?

About 4 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

OmniVoice on RTX 5060 Ti: Zero-Shot Voice Cloning Across 646 Languages

What You'll Build

A local zero-shot text-to-speech setup on an RTX 5060 Ti 16 GB that clones any voice from a 3-5 second reference clip and speaks it back across any of 646 documented languages. The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer.

Hardware data: RTX 5060 Ti (16 GB VRAM) · ~0.6 s to generate 5 s of audio with the low-VRAM config (source) · See benchmark data

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number — testing was reported on Intel Arc A310 (4 GB) and Arc Pro B50 (16 GB) (README). The 4 GB figure here is the working default from the community-tested low-VRAM wrapper on a 5060 Ti; raw FP16 on a 16 GB card has plenty of headroom. Once a community 5060 Ti benchmark lands at /check/omnivoice/rtx-5060-ti we'll replace the envelope with the measured peak.

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (CUDA), any consumer NVIDIA card	RTX 5060 Ti 16 GB (Blackwell sm_120)
RAM	8 GB system RAM	—
Storage	~3.3 GB (2.45 GB main weights + 806 MB audio tokenizer + tokenizer)	—
Python	3.10 or newer	—
CUDA	12.8 (cu128 wheel required for sm_120)	—
Reference audio	3-5 s WAV, mono	—

Model weights total ~3.3 GB on disk from the HuggingFace Files tab: model.safetensors is 2.45 GB and the audio_tokenizer/model.safetensors adds 806 MB. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident footprint.
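
The halving claim can be checked with quick arithmetic. The sketch below uses only the two file sizes quoted above; it assumes decimal gigabytes and that every tensor in both files is cast to FP16, and it ignores activations and generation buffers, so treat the result as a floor rather than a peak.

```python
# Back-of-envelope weight footprint from the file sizes quoted above.
MAIN_WEIGHTS_GB = 2.45      # model.safetensors, FP32 on disk
AUDIO_TOKENIZER_GB = 0.806  # audio_tokenizer/model.safetensors, FP32 on disk


def fp16_footprint_gb(fp32_gb: float) -> float:
    """FP16 stores 2 bytes per parameter instead of 4, so the size halves."""
    return fp32_gb / 2


disk_total_gb = MAIN_WEIGHTS_GB + AUDIO_TOKENIZER_GB
# Assumption: all weights are cast, including the audio tokenizer.
resident_fp16_gb = fp16_footprint_gb(disk_total_gb)

print(f"on disk (FP32): {disk_total_gb:.2f} GB")
print(f"resident (FP16, weights only): {resident_fp16_gb:.2f} GB")
```

The weights-only figure lands at roughly 1.6 GB, which leaves the rest of a 4 GB budget for buffers.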

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch with CUDA 12.8 (Blackwell-compatible)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

This is the exact wheel pin from the OmniVoice README. The 5060 Ti is Blackwell (sm_120) and requires cu128 — older cu121/cu124 wheels will not load kernels for it.
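
To confirm the installed wheel actually carries sm_120 kernels, a small check like the one below may help. It is a sketch, not an official diagnostic: wheel_supports is a hypothetical helper that only looks for an exact sm_XY entry and ignores PTX forward compatibility. The torch calls it relies on (torch.version.cuda, torch.cuda.get_device_capability, torch.cuda.get_arch_list) are standard PyTorch APIs.

```python
def wheel_supports(arch_list, capability):
    """Return True if the wheel's compiled arch list has an exact match for the GPU.

    Hypothetical helper: checks for an 'sm_XY' entry only.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)  # (12, 0) is sm_120
            archs = torch.cuda.get_arch_list()         # e.g. ['sm_90', 'sm_120']
            print("GPU capability:", cap, "| wheel archs:", archs)
            print("kernels available for this GPU:", wheel_supports(archs, cap))
        else:
            print("CUDA is not available; check the driver and the wheel.")
```

If the last line prints False on a 5060 Ti, the environment most likely picked up a non-cu128 wheel.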

3. Install OmniVoice

pip install omnivoice

PyPI ships omnivoice 0.1.5 (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-5 second mono WAV of the voice you want to clone and write down exactly what is said in it. Save the audio as ref.wav in your working directory. Always pass the transcript explicitly rather than relying on auto-transcription; the Troubleshooting section covers an open Blackwell issue where an explicit ref_text is the most consistently reported workaround.
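
A quick sanity check on the clip can catch a stereo or over-long file before it reaches the model. The sketch below uses only Python's standard wave module, so it assumes an uncompressed PCM WAV. inspect_wav and clip_problems are illustrative names, not part of OmniVoice.

```python
import os
import wave


def inspect_wav(path):
    """Return (duration_seconds, channels, sample_rate) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, wav.getnchannels(), rate


def clip_problems(duration, channels, min_s=3.0, max_s=5.0):
    """List the ways a clip falls outside the 3-5 s mono guideline."""
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems


if __name__ == "__main__":
    if os.path.exists("ref.wav"):
        duration, channels, rate = inspect_wav("ref.wav")
        print(f"ref.wav: {duration:.2f} s, {channels} ch, {rate} Hz")
        for problem in clip_problems(duration, channels):
            print("warning:", problem)
```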

Running

Save this as tts.py next to your ref.wav:

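from omnivoice import OmniVoice
import soundfile as sf  # used to write the output WAV
import torch

# Load the model onto the first CUDA device, casting the FP32 weights to FP16.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Clone the voice in ref.wav; ref_text must be the transcript of that clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Write the first generated waveform at 24 kHz.
sf.write("out.wav", audio[0], 24000)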

This is the canonical snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.

ComfyUI alternative

If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream FP32 source (repo).

Results

Speed: ~0.6 s to generate 5 s of audio on an RTX 5060 Ti using the Wladastic low-VRAM wrapper with LM_QUANT=nf4 and DTYPE=float16. The same wrapper measures 0.2 s on an RTX 4080. Upstream quotes an RTF of 0.025 without naming the hardware, so it is not used for comparison here; the 5060 Ti number above is the one measured on this card.
VRAM usage: Working envelope ~4 GB on a 5060 Ti with the nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the wrapper author hit OOMs at 3 GB with longer reference clips). A baseline (no quantization, fp16 only) loads roughly 1.6 GB of weights plus generation buffers; see /check/omnivoice/rtx-5060-ti for the measured peak once it's seeded.
Quality notes: OmniVoice covers 646 languages totalling 581k training hours, but quality is heavily long-tailed — English has 206k hours, Chinese 111k, and many smaller languages sit below 1 hour of training data. Cross-lingual transfer is imperfect (see HF Discussion #22).
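
The speed figures above can be put on the same scale as upstream's RTF by dividing generation time by audio length. This is a minimal sketch that assumes the RTX 4080 measurement also refers to 5 s of generated audio, which the wrapper's numbers imply but do not state outright.

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    return generation_s / audio_s


AUDIO_SECONDS = 5.0  # assumption: both measurements generated 5 s of audio
measurements = {"RTX 5060 Ti": 0.6, "RTX 4080": 0.2}

for gpu, generation_s in measurements.items():
    rtf = real_time_factor(generation_s, AUDIO_SECONDS)
    print(f"{gpu}: RTF {rtf:.2f} ({1 / rtf:.1f}x faster than real time)")
```

An RTF below 1.0 means audio is produced faster than it plays back.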

For the full benchmark data, see /check/omnivoice/rtx-5060-ti.
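
To take a peak reading on your own card in the meantime, PyTorch's allocator statistics can be wrapped around a generation call. The sketch below is one way to do it; report_peak and to_gib are hypothetical helpers. torch.cuda.max_memory_allocated only counts memory held by PyTorch tensors, so the figure will be lower than what nvidia-smi shows for the whole process.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (1024**3 bytes)."""
    return num_bytes / 1024**3


def report_peak(run):
    """Call run() and return the peak GiB allocated by PyTorch, or None without CUDA."""
    try:
        import torch
    except ImportError:
        run()
        return None
    if not torch.cuda.is_available():
        run()
        return None
    torch.cuda.reset_peak_memory_stats()
    run()
    torch.cuda.synchronize()  # make sure queued GPU work has finished
    return to_gib(torch.cuda.max_memory_allocated())
```

In tts.py this could wrap the generation step, for example report_peak(lambda: model.generate(...)) with the same arguments as the script.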

Troubleshooting

Output sounds like noise / garbled artifacts on a Blackwell card

A 5090 user reports exactly this in Issue #155 — open as of mid-May 2026, with the reporter explicitly confirming the bug persists even with --no-asr (so it is not the Whisper auto-transcription path). Root cause is still under investigation by the k2-fsa maintainers as of this writing. Workarounds reported on that thread vary by setup; passing ref_text explicitly (per the quick-start snippet above) is the most consistent. If you hit garbled output on a 5060 Ti, add your reproduction to the issue thread — Blackwell-specific datapoints are still being collected.

`flex_attention` shared-memory error during fine-tuning

If you try to fine-tune on an RTX 4090 or RTX A6000 you'll hit a 99 KB / 128 KB shared-memory wall (see Issue #83). This affects fine-tuning only; inference uses smaller blocks and is unaffected. The 5060 Ti is in the same boat: consumer and workstation Blackwell (sm_120) is capped at ~99 KB of shared memory per block, so it hits this wall too. A commenter on that thread confirmed that an RTX PRO 6000 Blackwell Workstation (sm_120) fails identically. The 228 KB shared-memory tier belongs to datacenter Blackwell (sm_100, e.g. B200), not the sm_120 silicon in this card. Use the maintainers' SDPA fine-tuning path (examples/config/train_config_finetune_sdpa.json) if you need to train on the 5060 Ti.

VRAM spikes / OOM with a long reference clip

The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget, eventually requiring CPU offload to stabilise. Workaround: keep your reference clip under 3.5 s, or set CPU_OFFLOAD=1 in that wrapper to push the LM weights to system RAM at the cost of a small latency hit.
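
Trimming the clip does not need an audio editor. The sketch below uses Python's standard wave module, so it assumes an uncompressed PCM WAV. trim_wav is an illustrative helper, not part of OmniVoice or the wrapper.

```python
import wave


def trim_wav(src, dst, max_seconds=3.5):
    """Copy src to dst, keeping at most max_seconds from the start.

    Returns the duration of the written clip in seconds.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)     # frame count in the header is fixed up on close
        writer.writeframes(frames)
    return keep / params.framerate


if __name__ == "__main__":
    import os
    if os.path.exists("ref.wav"):
        print(f"wrote {trim_wav('ref.wav', 'ref_short.wav'):.2f} s to ref_short.wav")
```

Because the cut is taken from the start of the file, shorten ref_text to match whatever speech remains in the trimmed clip.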

`pip install` fails / wrong CUDA version

You must use the +cu128 PyTorch wheel for the 5060 Ti. A plain pip install torch can resolve to a wheel built for an older CUDA version (cu121 in the case described here), which won't initialise kernels on sm_120. Re-install with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2.