OmniVoice on RTX 5060: Zero-Shot Voice Cloning Across 646 Languages in 8 GB

What You'll Build

A local zero-shot text-to-speech setup on an RTX 5060 8 GB that clones any voice from a 3-5 second reference clip and speaks it back across 646 documented languages (per the HuggingFace model-card metadata; the GitHub README phrases it as "600+"). The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer.

Hardware data: RTX 5060 (8 GB VRAM) · ~4 GB working envelope per the community-tested low-VRAM wrapper · See benchmark data

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number — testing was reported on Intel Arc A310 (4 GB) and Arc Pro B50 (16 GB) (README). The 4 GB figure here is the working default from the community-tested low-VRAM wrapper (MAX_VRAM_GB=4) on consumer NVIDIA, plus the bf16/fp16 4-6 GB band documented by the Saganaki22 ComfyUI node. On the 5060's 8 GB that leaves ~4 GB of headroom — enough to absorb the occasional spike, but tighter than the 16 GB cards both sources measured on. Once a 5060 benchmark lands at /check/omnivoice/rtx-5060 we'll replace the envelope with the measured peak.

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (CUDA), any consumer NVIDIA card	RTX 5060 8 GB (Blackwell sm_120)
RAM	8 GB system RAM	—
Storage	~3.3 GB total (`model.safetensors` 2.45 GB + audio tokenizer + tokenizer)	—
Python	3.10 or newer	—
CUDA	12.8 (cu128 wheel required for sm_120)	—
Reference audio	3-5 s WAV, mono	—

Model weight totals come from the HuggingFace Files tab — model.safetensors is 2.45 GB and the repo's reported total is 3.27 GB, with the remainder split between the audio tokenizer and tokenizer JSON. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident footprint, which is what fits comfortably under 4 GB on the 5060.
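The halving claim can be sanity-checked with simple arithmetic. The sketch below assumes the 2.45 GB file is almost entirely FP32 parameters (4 bytes each) and ignores activations, the audio tokenizer, and CUDA context overhead, so it estimates weight memory only and is not a VRAM prediction. The function name is illustrative.

```python
# Rough weight-memory estimate when casting an FP32 checkpoint to FP16.
# Assumption: the checkpoint is essentially all FP32 tensors (4 bytes each).

FP32_BYTES = 4
FP16_BYTES = 2

def cast_footprint_gb(fp32_size_gb: float, target_bytes: int = FP16_BYTES) -> float:
    """Scale an FP32 checkpoint size to its size after casting every weight."""
    return fp32_size_gb * target_bytes / FP32_BYTES

weights_fp16 = cast_footprint_gb(2.45)  # model.safetensors size from the Files tab
print(f"FP16 weights: ~{weights_fp16:.2f} GB")
print(f"Left under a 4 GB envelope: ~{4 - weights_fp16:.2f} GB")
```

Whatever remains under the envelope has to cover activations, the audio tokenizer, and the CUDA context, which is why the weight figure alone understates real usage.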

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch with CUDA 12.8 (Blackwell-required)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

This is the exact wheel pin from the OmniVoice README. The 5060 is Blackwell (sm_120) and requires cu128 — older cu121/cu124 wheels won't load kernels for it, so here the upstream pin is mandatory rather than merely recommended.
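Before moving on, it can help to confirm that the wheel pip actually installed carries the cu128 build tag. The following is a minimal sketch: the helper names are made up for illustration, and the torch calls in the guarded section (torch.__version__, torch.cuda.is_available, torch.cuda.get_device_capability) only run if PyTorch is importable.

```python
# Check that a PyTorch version string carries the cu128 local build tag.
from typing import Optional

def cuda_tag(version: str) -> Optional[str]:
    """Return the local build tag after '+', e.g. 'cu128', or None if absent."""
    _, sep, tag = version.partition("+")
    return tag if sep else None

def is_cu128(version: str) -> bool:
    """True when the version string was built against CUDA 12.8."""
    return cuda_tag(version) == "cu128"

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        status = "ok" if is_cu128(torch.__version__) else "not a cu128 build"
        print(torch.__version__, "->", status)
        if torch.cuda.is_available():
            # A capability of (12, 0) corresponds to sm_120.
            print("compute capability:", torch.cuda.get_device_capability(0))
```

A version string without a "+cu128" suffix is not proof of a wrong build on every platform, so treat a failed check as a prompt to reinstall from the cu128 index rather than as a definitive diagnosis.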

3. Install OmniVoice

pip install omnivoice

PyPI ships the canonical omnivoice 0.1.5 package (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-5 second mono WAV of the voice you want to clone and write down exactly what is said in it. Save the audio as ref.wav in your working directory. Clips longer than ~4 s have been reported to spike VRAM (see Troubleshooting), so stay toward the short end of that range. Always provide the transcript explicitly; see Troubleshooting for why auto-transcription is risky on Blackwell right now.
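A quick way to confirm that the clip is mono and within the 3-5 s window is to read its header with Python's standard-library wave module. This is a sketch under one assumption: ref.wav is integer PCM, since wave cannot open float-encoded WAV files. The function names are illustrative.

```python
# Inspect a reference clip with the standard-library wave module.
# Assumption: the file is integer PCM; wave cannot open float-encoded WAVs.
import wave

def describe_wav(path: str) -> tuple:
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration

def clip_problems(path: str, min_s: float = 3.0, max_s: float = 5.0) -> list:
    """List human-readable problems; an empty list means the clip looks fine."""
    channels, _, duration = describe_wav(path)
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not min_s <= duration <= max_s:
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems
```

Calling clip_problems("ref.wav") before running inference catches a stereo or over-long clip early, without loading the model.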

Running

Save this as tts.py next to your ref.wav:

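from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the upstream checkpoint onto the GPU, cast to FP16 to roughly halve weight memory.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Synthesize the target text in the voice of ref.wav.
# ref_text must be the transcript of the reference clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Write the first generated waveform as a 24 kHz WAV.
sf.write("out.wav", audio[0], 24000)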

This is the canonical snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.

ComfyUI alternative

If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model and can load it in bf16 or fp16. The node documents a working set of ~4-6 GB at bf16/fp16 and ~2-4 GB with CPU offload:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream FP32 source (repo).

Results

VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The same author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s — on the 5060's 8 GB card this means the spike consumes essentially all available memory, so keep reference clips short. The Saganaki22 ComfyUI node documents ~4-6 GB at bf16/fp16 and ~2-4 GB with CPU offload, corroborating the same band. See /check/omnivoice/rtx-5060 for the measured peak once it's seeded.
Speed: Not cited here. The only consumer-NVIDIA measurements available are the Wladastic wrapper's 0.6 s / 5 s on an RTX 5060 Ti and 0.2 s / 5 s on an RTX 4080 — both are higher-tier cards than the 5060, so quoting either as the 5060's expected speed would mislead. Upstream's hardware-unspecified "RTF as low as 0.025" claim is omitted for the same reason. Submit your own measurement to /check/omnivoice/rtx-5060 to seed the empirical data — or add it via /contribute.
Quality notes: OmniVoice covers 646 languages, but quality is heavily long-tailed and cross-lingual transfer is imperfect; see the maintainer's note in HF Discussion #22.

For the full benchmark data, see /check/omnivoice/rtx-5060.
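If you want to contribute a speed measurement, the usual metric is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is one way to compute it. It reads the "0.6 s / 5 s" figures above as seconds of compute per seconds of audio, which is an interpretation and not something the source states, and the helper names are illustrative.

```python
# Real-time factor (RTF) = synthesis time / duration of the audio produced.
# Lower is faster; RTF < 1 means faster than real time.
import time

SAMPLE_RATE = 24000  # output rate used by the sf.write call in tts.py

def rtf(elapsed_s: float, num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Compute RTF from wall-clock seconds and the number of output samples."""
    audio_s = num_samples / sample_rate
    return elapsed_s / audio_s

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Reading "0.6 s / 5 s" as 0.6 s of compute for 5 s of audio:
print(round(rtf(0.6, 5 * SAMPLE_RATE), 3))
```

In tts.py this could be used as audio, elapsed = timed(model.generate, text=..., ref_audio=..., ref_text=...) followed by rtf(elapsed, len(audio[0])), assuming audio[0] is a one-dimensional array of samples. Time a second call rather than the first, since the first includes weight loading and warm-up.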

Troubleshooting

VRAM spikes / OOM with a long reference clip

This is the most likely issue on an 8 GB card. The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget, eventually requiring CPU offload to stabilise. Workaround: keep your reference clip under 3.5 s, or set CPU_OFFLOAD=1 in that wrapper to push the LM weights to system RAM (the same discussion documents 1.3 GB GPU + 2.4 GB CPU after offload). The Saganaki22 ComfyUI node documents the same offload path with a ~2-4 GB working set.
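Trimming the clip to the suggested 3.5 s does not need an audio editor. The sketch below uses only the standard-library wave module and assumes an integer-PCM WAV; trim_wav is an illustrative helper, not part of OmniVoice or the wrapper.

```python
# Trim a PCM WAV to at most max_seconds using only the standard library.
import wave

def trim_wav(src: str, dst: str, max_seconds: float = 3.5) -> float:
    """Copy the first max_seconds of src to dst; return the new duration in seconds."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)    # the frame count in the header is corrected on write
        writer.writeframes(frames)
    return keep / params.framerate
```

After trimming, shorten ref_text so it matches only the words that remain in the clip; a transcript that runs past the audio no longer describes the reference.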

Garbled / noisy output on RTX 50-series (Blackwell)

A 5090 user reports audio corruption in Issue #155 and confirmed that the bug persists even with --no-asr, so the Whisper auto-transcription path is not the cause. The reporter speculates about a Blackwell-specific interaction (the GPU generation itself, CUDA 12.8, or the attention kernels), and as of mid-May 2026 the root cause is still open and under investigation by the k2-fsa maintainers. The RTX 5060 is the same Blackwell generation (sm_120) as the 5090, so this is a datapoint worth watching on this card. If you hit garbled output, passing ref_text explicitly (per the quick-start snippet above) is the most consistent reported workaround, although the --no-asr result means it is not a guaranteed fix. If it fails, add your reproduction to the issue thread, where Blackwell-specific datapoints are still being collected.

Fine-tuning fails with a shared-memory error (likely applies to the 5060 too)

OmniVoice's flex_attention training path needs ~128 KB of shared memory per block, exceeding the ~99 KB per-block hardware limit on Ada Lovelace (sm_89) and Ampere (sm_86) consumer/workstation cards. See Issue #83, which names the RTX 4090 and A6000 and states "fine-tuning is currently only possible on A100 / H100 / Blackwell — every other consumer or workstation card will fail." Do not read "Blackwell" there as covering the RTX 5060: the ~228 KB shared-memory figure belongs to the data-center parts (Hopper sm_90, Blackwell sm_100), while NVIDIA's CUDA compute-capability tables list consumer Blackwell (sm_120) with the same ~99 KB per-block limit as Ada. Until someone reports a successful run on an RTX 50-series card, assume fine-tuning will hit this error on the 5060 as well. Inference works on all of these cards regardless; this wall is fine-tuning only.

`pip install` fails / wrong CUDA version

You must use the +cu128 PyTorch wheel for the 5060. Wheels built against older CUDA toolkits (cu121/cu124), which a plain pip install torch can resolve to depending on the version and platform, contain no sm_120 kernels, so you'll see "CUDA error: no kernel image is available for execution on the device" at the first inference call. Force-reinstall PyTorch with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2 to match the upstream pin exactly.
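To tell a wrong-wheel problem apart from other failures, you can compare the architectures the installed PyTorch build was compiled for against the GPU's compute capability. This is a minimal sketch: the helper names are illustrative, the check only counts exact sm_ matches (it ignores PTX forward compatibility), and the guarded section relies on torch.cuda.get_arch_list and torch.cuda.get_device_capability.

```python
# Decide whether a PyTorch build ships kernels for a given GPU.
# Simplification: only exact sm_XY matches count; PTX forward-compat is ignored.

def arch_name(capability: tuple) -> str:
    """Map a (major, minor) compute capability to its sm_ name, e.g. (12, 0) -> 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"

def build_supports(arch_list: list, capability: tuple) -> bool:
    """True when the build's architecture list contains the GPU's sm_ name."""
    return arch_name(capability) in arch_list

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(arch_name(cap), "supported:", build_supports(archs, cap))
        else:
            print("CUDA is not available to this PyTorch build")
```

If sm_120 is missing from the list, the "no kernel image" error is expected and reinstalling from the cu128 index is the fix; if it is present, look elsewhere for the cause.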