OmniVoice on RTX 4060: Zero-Shot Voice Cloning Across 646 Languages in 8 GB

What You'll Build

A local zero-shot text-to-speech setup on an RTX 4060 8 GB that clones any voice from a 3-5 second reference clip and speaks it back across 646 documented languages (per the HuggingFace model-card metadata; the GitHub README phrases it as "600+"). The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer.

Hardware data: RTX 4060 (8 GB VRAM) · ~4 GB working envelope per the community-tested low-VRAM wrapper · See benchmark data

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number; testing was reported on Intel Arc A310 (4 GB) and Arc Pro B50 (16 GB) (README). The 4 GB figure here is the working default of the community-tested low-VRAM wrapper (MAX_VRAM_GB=4) on consumer NVIDIA, supported by the 4-6 GB bf16/fp16 band documented by the Saganaki22 ComfyUI node. On the 4060's 8 GB that leaves ~4 GB of headroom: enough to absorb a moderate spike, but tighter than on the 16 GB cards both sources measured on, and fully consumed by the up-to-8 GB spikes reported for longer reference clips (see Results). Once a 4060 benchmark lands at /check/omnivoice/rtx-4060 we'll replace the envelope with the measured peak.

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (CUDA), any consumer NVIDIA card	RTX 4060 8 GB (Ada Lovelace sm_89)
RAM	8 GB system RAM	—
Storage	~3.3 GB total (`model.safetensors` 2.45 GB + audio tokenizer + tokenizer)	—
Python	3.10 or newer	—
CUDA	12.x (cu128 per upstream pin; cu121/cu124 also load on Ada Lovelace)	—
Reference audio	3-5 s WAV, mono	—

Model weight totals come from the HuggingFace Files tab — model.safetensors is 2.45 GB and the repo's reported total is 3.27 GB, with the remainder split between the audio tokenizer and tokenizer JSON. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident footprint, which is what fits comfortably under 4 GB on the 4060.
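As a rough check on the halving claim, the arithmetic can be sketched as below. It assumes the whole 2.45 GB model.safetensors file holds FP32 weights (4 bytes per parameter) and ignores the audio tokenizer, activations, and CUDA context overhead, so the result is a lower bound on resident memory rather than a measurement.

```python
# Back-of-envelope weight footprint, assuming model.safetensors is entirely FP32.
FP32_BYTES = 4
FP16_BYTES = 2


def cast_footprint_gb(fp32_size_gb: float, target_bytes: int = FP16_BYTES) -> float:
    """Estimate the weight size after casting FP32 weights to a narrower dtype."""
    return fp32_size_gb * target_bytes / FP32_BYTES


fp32_gb = 2.45  # model.safetensors size from the HuggingFace Files tab
fp16_gb = cast_footprint_gb(fp32_gb)
print(f"FP32 weights: {fp32_gb:.2f} GB -> FP16 weights: ~{fp16_gb:.3f} GB")
```

The gap between this weights-only estimate and the ~4 GB envelope is what remains for activations, the audio tokenizer, and runtime overhead.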

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch (CUDA 12.8 per upstream, or cu121/cu124 if you already have it)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

This is the exact wheel pin from the OmniVoice README. The 4060 is Ada Lovelace (sm_89) so cu121/cu124 wheels also work — but matching upstream avoids surprises when their kernel set changes.
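Before moving on, it can help to confirm which CUDA build of PyTorch actually got installed and that the GPU is visible. The following is a minimal sketch using standard torch.cuda calls; the helper name describe_cuda is made up for this example and is not part of OmniVoice.

```python
# Sanity-check the PyTorch install before pulling model weights.
def describe_cuda() -> dict:
    """Collect basic facts about the installed PyTorch build and the visible GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,   # the upstream pin ends in +cu128
        "cuda_build": torch.version.cuda,     # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["device_name"] = torch.cuda.get_device_name(0)
        # A tuple such as (8, 9) corresponds to sm_89.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
        info["total_vram_gb"] = props.total_memory / 1024**3
    return info


for key, value in describe_cuda().items():
    print(f"{key}: {value}")
```

If cuda_available comes back False on a machine with a working driver, the usual cause is a CPU-only wheel, which the force-reinstall in Troubleshooting addresses.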

3. Install OmniVoice

pip install omnivoice

PyPI ships the canonical omnivoice package (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

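Pick a 3-5 second mono WAV of the voice you want to clone and write down exactly what is said in it. On an 8 GB card, aim for the short end of that range: VRAM spikes have been reported with reference audio longer than ~4 s (see Troubleshooting). Save the audio as ref.wav in your working directory. Always provide the transcript explicitly; see Troubleshooting for why auto-transcription is risky right now.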
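The clip requirements (mono, 3-5 s) can be checked up front with Python's standard-library wave module. This is a sketch only: inspect_wav and reference_clip_problems are hypothetical helper names, and wave reads uncompressed PCM WAV files only, so other encodings would need a different reader.

```python
import wave


def inspect_wav(path: str) -> dict:
    """Read duration, channel count, and sample rate from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "duration_s": frames / rate,
            "channels": wav.getnchannels(),
            "sample_rate": rate,
        }


def reference_clip_problems(path: str, min_s: float = 3.0, max_s: float = 5.0) -> list:
    """Return human-readable problems; an empty list means the clip fits the recipe."""
    info = inspect_wav(path)
    problems = []
    if info["channels"] != 1:
        problems.append(f"expected mono, found {info['channels']} channels")
    if not (min_s <= info["duration_s"] <= max_s):
        problems.append(
            f"duration {info['duration_s']:.2f} s is outside {min_s}-{max_s} s"
        )
    return problems
```

Calling reference_clip_problems("ref.wav") before running inference returns an empty list when the clip matches the table in Requirements.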

Running

Save this as tts.py next to your ref.wav:

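from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the weights onto the first CUDA device, cast to FP16 at load time.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Clone the voice in ref.wav; ref_text must be the transcript of that clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Write the first generated waveform as 24 kHz audio.
sf.write("out.wav", audio[0], 24000)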

This is the canonical snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.

ComfyUI alternative

If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model. Its documentation lists a working set of ~4-6 GB at bf16/fp16 and ~2-4 GB with CPU offload:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream FP32 source (repo).

Results

VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The same author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s — on the 4060's 8 GB card this means the spike consumes essentially all available memory, so keep reference clips short. The Saganaki22 ComfyUI node documents ~4-6 GB at bf16/fp16 and ~2-4 GB with CPU offload, corroborating the same band. See /check/omnivoice/rtx-4060 for the measured peak once it's seeded.
Speed: Not cited here. The only consumer-NVIDIA measurements available are the Wladastic wrapper's 0.6 s / 5 s on an RTX 5060 Ti and 0.2 s / 5 s on an RTX 4080 — both are faster cards than the 4060, so quoting either as the 4060's expected speed would mislead. Upstream's hardware-unspecified "RTF as low as 0.025" claim is omitted for the same reason. Submit your own measurement to /check/omnivoice/rtx-4060 to seed the empirical data.
Quality notes: OmniVoice covers 646 languages, but quality is heavily long-tailed; see HF Discussion #22 for the maintainer's note on imperfect cross-lingual transfer across the long tail of languages.

For the full benchmark data, see /check/omnivoice/rtx-4060.
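To produce a measurement of your own, one approach is to wrap a single generation call with a timer and PyTorch's peak-memory counters. The sketch below is an assumption-laden helper, not an upstream tool: measure_generation is a hypothetical name, and it assumes the callable returns a 1-D sequence of audio samples at 24 kHz.

```python
import time


def measure_generation(generate_fn, sample_rate: int = 24000) -> dict:
    """Time one generation call and, when CUDA is usable, record peak allocated VRAM.

    generate_fn is a hypothetical zero-argument callable that returns a 1-D
    sequence of audio samples (an assumption about the output shape).
    """
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch, use_cuda = None, False

    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()

    start = time.perf_counter()
    samples = generate_fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    result = {
        "elapsed_s": elapsed,
        "audio_s": audio_seconds,
        # Real-time factor: compute time per second of audio (lower is faster).
        "rtf": elapsed / audio_seconds,
    }
    if use_cuda:
        # Peak memory held by PyTorch tensors only; nvidia-smi typically shows
        # more because it also counts the CUDA context and allocator cache.
        result["peak_vram_gb"] = torch.cuda.max_memory_allocated() / 1024**3
    return result
```

With the model from tts.py loaded, passing a lambda that calls model.generate with your text, ref_audio, and ref_text and returns audio[0] would apply it, provided audio[0] is a flat array of samples. Run one warm-up generation first so that one-time initialisation does not inflate the timing.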

Troubleshooting

VRAM spikes / OOM with a long reference clip

This is the most likely issue on an 8 GB card. The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget, eventually requiring CPU offload to stabilise. Workaround: keep your reference clip under 3.5 s, or set CPU_OFFLOAD=1 in that wrapper to push the LM weights to system RAM (the same discussion documents 1.3 GB GPU + 2.4 GB CPU after offload). The Saganaki22 ComfyUI node documents the same offload path with a ~2-4 GB working set.
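If your source recording is longer than the 3.5 s target, the clip can be cut down with the standard-library wave module. This is a minimal sketch for uncompressed PCM WAV files; trim_wav is a hypothetical helper name, and it keeps the start of the file without looking for word boundaries.

```python
import wave


def trim_wav(src: str, dst: str, max_seconds: float = 3.5) -> float:
    """Copy at most max_seconds from the start of a PCM WAV file; return the new duration."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)    # same channels, sample width, and rate
        writer.writeframes(frames)  # the header frame count is corrected on write
    return keep / params.framerate
```

After trimming, update ref_text so it covers only the words that remain in the shortened clip.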

Garbled / noisy output on RTX 50-series (not confirmed on 4060)

A 5090 user reports audio corruption in Issue #155 and explicitly confirmed the bug persists even with --no-asr (so it's not the Whisper auto-transcription path). The reporter speculates about Blackwell-specific kernel issues, and as of mid-May 2026 root cause is still under investigation by the k2-fsa maintainers — they haven't ruled out Ada Lovelace yet, but no 40-series reports have surfaced on the issue. If you hit garbled output on a 4060, passing ref_text explicitly (per the quick-start snippet above) is the most consistent reported workaround; if that fails, add your reproduction to the issue thread.

Fine-tuning fails with a shared-memory error

OmniVoice's flex_attention training path needs ~128 KB of shared memory per block, exceeding the 99-100 KB hardware limit on Ada Lovelace and consumer Ampere (sm_86) cards. The A100 is also Ampere, but its data-center die (sm_80) has a larger per-block limit. See Issue #83, which explicitly states "fine-tuning is currently only possible on A100 / H100 / Blackwell — every other consumer or workstation card will fail." The 4060 is sm_89 (same family as the RTX 4090, which is also affected). This applies to fine-tuning only; inference uses smaller blocks and works on the 4060.
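The shared-memory comparison can be written out as a small lookup. The per-block limits below are the maximum opt-in values from NVIDIA's CUDA C++ Programming Guide, not something the OmniVoice repo publishes. The 128 KB requirement is the approximate figure from Issue #83. Only the Ampere, Ada Lovelace, and Hopper entries are included, and fits_training_kernel is a hypothetical helper name.

```python
# Max opt-in shared memory per thread block (KB) by compute capability,
# as listed in NVIDIA's CUDA C++ Programming Guide.
MAX_SMEM_PER_BLOCK_KB = {
    (8, 0): 163,  # A100 (data-center Ampere)
    (8, 6): 99,   # consumer Ampere (RTX 30-series)
    (8, 9): 99,   # Ada Lovelace (RTX 40-series, including the 4060)
    (9, 0): 227,  # H100
}
REQUIRED_KB = 128  # approximate flex_attention training requirement per Issue #83


def fits_training_kernel(capability: tuple) -> bool:
    """True when the per-block shared-memory limit covers the training kernel."""
    return MAX_SMEM_PER_BLOCK_KB[capability] >= REQUIRED_KB


for cap, limit in MAX_SMEM_PER_BLOCK_KB.items():
    verdict = "fits" if fits_training_kernel(cap) else "too small"
    print(f"sm_{cap[0]}{cap[1]}: {limit} KB limit vs ~{REQUIRED_KB} KB needed -> {verdict}")
```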

`pip install` fails / wrong CUDA version

The upstream pin is the +cu128 wheel. On the 4060 (sm_89) the cu121 and cu124 wheels also load — but if you see a "CUDA error: no kernel image" at the first inference call, force-reinstall PyTorch with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2 to match upstream exactly.