
OmniVoice on RTX 4060 Ti 16GB: Zero-Shot Voice Cloning Across 646 Languages with Room to Spare

tts · intermediate · 4 GB+ VRAM · May 21, 2026
prerequisites
  • NVIDIA RTX 4060 Ti 16 GB (or any CUDA GPU with ~4 GB free VRAM)
  • Python 3.10 or newer
  • CUDA 12.x toolkit / driver (the upstream wheel is cu128, but the 4060 Ti 16 GB is sm_89 — cu121/cu124 wheels work equally)
  • A short reference clip (3-5 s WAV, mono, 16-24 kHz) plus its transcription

What You'll Build

A local zero-shot text-to-speech setup on an RTX 4060 Ti 16 GB that clones any voice from a 3-5 second reference clip and speaks it back across 646 documented languages (per the HuggingFace model-card metadata; the GitHub README phrases it as "600+"). The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer.

The 4060 Ti 16 GB is the easiest tier for this model — the working envelope is around 4 GB, so you have roughly 12 GB of headroom to stack a second model (an ASR for live transcription, a small LLM for chat, etc.) in the same process or alongside it.

Hardware data: RTX 4060 Ti 16 GB · ~4 GB working envelope per the community-tested low-VRAM wrapper · See benchmark data

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number — testing was reported on Intel Arc A310 (4 GB) and Arc Pro B50 (16 GB) (README). The 4 GB figure here is the working default from the community-tested low-VRAM wrapper (MAX_VRAM_GB=4, raised from 3 GB after the author hit OOMs) plus the bf16/fp16 4-6 GB band documented by the Saganaki22 ComfyUI node. On the 4060 Ti's 16 GB that leaves ~12 GB of headroom — easily enough to absorb the spike documented below and to keep a second model resident. Once a 4060 Ti 16 GB benchmark lands at /check/omnivoice/rtx-4060-ti-16gb we'll replace the envelope with the measured peak.
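When deciding whether a second model fits, it is safer to budget against the spike than against the steady-state envelope. The sketch below uses the figures quoted in this recipe (16 GB card, ~4 GB envelope, up to 8 GB spike on long reference clips). The function name and the reserve_gb allowance are assumptions for illustration, and none of the defaults are measured peaks.

```python
def fits_alongside(second_model_gb, total_gb=16.0, envelope_gb=4.0,
                   spike_gb=8.0, reserve_gb=0.0):
    """Check a second model against steady-state and spike VRAM budgets.

    Defaults are the envelope figures quoted in this recipe, not measured
    peaks. reserve_gb is a hypothetical allowance for whatever else already
    occupies the card (desktop session, CUDA context).
    """
    usable = total_gb - reserve_gb
    steady_free = usable - envelope_gb - second_model_gb
    spike_free = usable - spike_gb - second_model_gb
    return {
        "steady_free_gb": steady_free,
        "spike_free_gb": spike_free,
        "fits_steady": steady_free >= 0,
        "fits_spike": spike_free >= 0,
    }


if __name__ == "__main__":
    for size_gb in (6.0, 10.0):
        print(size_gb, fits_alongside(size_gb))
```

With these defaults a 6 GB companion model leaves 6 GB free in steady state and 2 GB during the spike. A 10 GB one fits only until a long reference clip triggers the spike.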

Requirements

Component | Minimum | Tested
GPU | 4 GB VRAM (CUDA), any consumer NVIDIA card | RTX 4060 Ti 16 GB (Ada Lovelace sm_89)
RAM | 8 GB system RAM |
Storage | ~3.3 GB total (model.safetensors 2.45 GB + audio tokenizer 806 MB + tokenizer JSON) |
Python | 3.10 or newer |
CUDA | 12.x (cu128 per upstream pin; cu121/cu124 also load on Ada Lovelace) |
Reference audio | 3-5 s WAV, mono |

Model weight totals come from the HuggingFace Files tab: model.safetensors is 2.45 GB and audio_tokenizer/model.safetensors is 806 MB, with the remainder split between tokenizer JSON and the chat template. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident weight footprint, which is what produces the ~4 GB working envelope.
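The storage total and the effect of the FP16 cast can be checked with quick arithmetic. This is a rough sketch, assuming the Files tab reports decimal GB/MB, that every stored value is FP32, and that the cast applies to both safetensors files. It counts weights only, so activations, the CUDA context, and allocator overhead come on top.

```python
# File sizes as listed on the Files tab (assumed to be decimal GB / MB).
LM_BYTES = 2_450_000_000         # model.safetensors
TOKENIZER_BYTES = 806_000_000    # audio_tokenizer/model.safetensors


def weights_gb(bytes_per_value, stored_bytes_per_value=4):
    """Weight footprint in GB after casting FP32 storage to another width."""
    scale = bytes_per_value / stored_bytes_per_value
    return (LM_BYTES + TOKENIZER_BYTES) * scale / 1e9


fp32_gb = weights_gb(4)   # as downloaded
fp16_gb = weights_gb(2)   # after a float16 cast at load time
# Number of values in model.safetensors if all of them are 4-byte floats.
approx_params_billions = LM_BYTES / 4 / 1e9

print(f"FP32 weights: {fp32_gb:.3f} GB")
print(f"FP16 weights: {fp16_gb:.3f} GB")
print(f"model.safetensors holds about {approx_params_billions:.2f}B values")
```

Under those assumptions the two weight files come to about 3.26 GB on disk and about 1.63 GB once cast to FP16, and the main file holds roughly 0.61 billion values, which is in line with a Qwen3-0.6B base.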

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch (CUDA 12.8 per upstream, or cu121/cu124 if you already have it)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

This is the exact wheel pin from the OmniVoice README. The 4060 Ti 16 GB is Ada Lovelace (sm_89) so cu121/cu124 wheels also work — Ada has been supported in stock PyTorch wheels for several releases and doesn't need anything Blackwell-specific. Matching upstream's cu128 pin just avoids surprises when their kernel set changes.

3. Install OmniVoice

pip install omnivoice

PyPI ships the canonical omnivoice package (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-5 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly — see Troubleshooting for why auto-transcription is risky right now.
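Before running inference it can help to confirm that ref.wav matches the prerequisites (mono, 16-24 kHz, 3-5 s). The helper below is a minimal sketch using only Python's standard wave module. The function name and its default thresholds are illustrative and simply restate the prerequisites above. The wave module reads integer PCM files, so a float-encoded WAV will raise wave.Error and needs converting first.

```python
import wave


def check_reference_clip(path, min_s=3.0, max_s=5.0,
                         min_rate=16000, max_rate=24000):
    """Return a list of problems with a PCM WAV clip (empty list means OK)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not min_rate <= rate <= max_rate:
        problems.append(f"sample rate {rate} Hz is outside {min_rate}-{max_rate} Hz")
    if not min_s <= duration <= max_s:
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems


if __name__ == "__main__":
    issues = check_reference_clip("ref.wav")
    print("ref.wav looks fine" if not issues else "\n".join(issues))
```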

Running

Save this as tts.py next to your ref.wav:

from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

sf.write("out.wav", audio[0], 24000)

This is the canonical snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.

ComfyUI alternative

If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model. Its documentation puts bf16/fp16 loading in a ~4-6 GB band and CPU offload at a ~2-4 GB working set:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream FP32 source (repo).

Results

  • VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The Saganaki22 ComfyUI node explicitly documents a ~4-6 GB band at bf16/fp16 and ~2-4 GB with CPU offload, corroborating the same band from an independent runtime path. On the 4060 Ti's 16 GB that leaves substantial headroom — the same author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s, and even that spike consumes only half of this card's memory. See /check/omnivoice/rtx-4060-ti-16gb for the measured peak once it's seeded.
  • Speed: Not cited here for the 4060 Ti 16 GB specifically. The closest consumer-NVIDIA measurements are the Wladastic wrapper's 0.6 s to generate 5 s of audio on an RTX 5060 Ti and 0.2 s on an RTX 4080. Neither maps cleanly onto the 4060 Ti 16 GB: the 4080 shares the Ada Lovelace architecture but has far more SMs, and the 5060 Ti is a Blackwell card with the same 128-bit bus but faster GDDR7 memory. Quoting either as the 4060 Ti 16 GB's expected speed would mislead. Upstream's hardware-unspecified "RTF as low as 0.025" claim is omitted for the same reason. Submit your own measurement to /check/omnivoice/rtx-4060-ti-16gb to seed the empirical data.
  • Quality notes: OmniVoice covers 646 languages, but quality is heavily long-tailed and cross-lingual transfer is imperfect — see HF Discussion #22 for the maintainer's "cross-lingual transfer is not perfect" note. English and Chinese dominate the training mix; many smaller languages sit on minimal data.
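If you want to produce your own speed and VRAM numbers, a small timing harness is enough. The sketch below is an assumption-laden starting point, not an official benchmark script: benchmark and real_time_factor are hypothetical helper names, and the code assumes audio[0] is a 1-D sequence of samples at 24 kHz, as the sf.write call in tts.py implies. torch.cuda.max_memory_allocated reports the peak of PyTorch's own allocator, so expect nvidia-smi to show a somewhat higher figure.

```python
import time

try:
    import torch
except ImportError:  # lets the helpers be imported on a machine without PyTorch
    torch = None


def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def benchmark(generate, sample_rate=24000):
    """Time one call to `generate`; report RTF and peak VRAM when CUDA is present."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    audio = generate()
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    audio_seconds = len(audio[0]) / sample_rate  # assumes audio[0] is 1-D samples
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3 if use_cuda else None
    return {
        "seconds": elapsed,
        "audio_seconds": audio_seconds,
        "rtf": real_time_factor(elapsed, audio_seconds),
        "peak_vram_gb": peak_gb,
    }

# Usage with the model from tts.py (run one untimed warm-up call first):
# result = benchmark(lambda: model.generate(
#     text="Hello, this is a test of zero-shot voice cloning.",
#     ref_audio="ref.wav",
#     ref_text="Transcription of the reference audio.",
# ))
```

For scale, the wrapper figures quoted above work out to an RTF of 0.12 on the RTX 5060 Ti (0.6 s for 5 s of audio) and 0.04 on the RTX 4080 (0.2 s for 5 s).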

For the full benchmark data, see /check/omnivoice/rtx-4060-ti-16gb.

Troubleshooting

VRAM spikes / OOM with a long reference clip

This is the most likely VRAM-related issue if you push the working set hard. The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget set, eventually requiring CPU offload to stabilise. On a 4060 Ti's 16 GB this still fits comfortably — the spike is half the card — but if you're stacking other models in the same process and watching nvidia-smi, plan for that headroom. Workaround if you want to keep VRAM tight: keep your reference clip under 3.5 s (the Wladastic wrapper's documented threshold), or set CPU_OFFLOAD=true in that wrapper to push the LM weights to system RAM (the same discussion documents 1.3 GB GPU + 2.4 GB CPU after offload). The Saganaki22 ComfyUI node documents the same offload path with a ~2-4 GB working set.
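Trimming the reference clip can be done without extra dependencies. The sketch below uses the standard wave module to copy only the first few seconds of a PCM WAV. The function name trim_wav and the 3.4 s default (chosen to sit just under the 3.5 s threshold mentioned above) are assumptions for illustration.

```python
import wave


def trim_wav(src, dst, max_seconds=3.4):
    """Copy at most `max_seconds` from the start of `src` into `dst`.

    Returns the duration (in seconds) actually written.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        # int() floors, so the result never exceeds max_seconds.
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)     # same channels, sample width, and rate
        writer.writeframes(frames)   # header frame count is corrected on write
    return keep / params.framerate


if __name__ == "__main__":
    print(trim_wav("ref.wav", "ref_short.wav"))
```

A hard cut like this can land mid-word, so listen to the result and update ref_text to match exactly what remains in the trimmed clip.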

Fine-tuning fails with a shared-memory error

OmniVoice's flex_attention training path needs ~128 KB of shared memory per block, exceeding the 99 KB hardware limit on Ada Lovelace and Ampere cards — see Issue #83, which lists the RTX 4090 (Ada sm_89) and RTX A6000 (Ampere sm_86) as failing and the A100/H100/Blackwell tier (228 KB shared memory) as the only supported hardware for fine-tuning. The 4060 Ti 16 GB is sm_89 — same family as the 4090 — and is affected. This is fine-tuning only; inference uses smaller blocks and works on the 4060 Ti 16 GB without modification.

pip install fails / wrong CUDA version

The upstream pin is the +cu128 wheel. On the 4060 Ti 16 GB (sm_89) the cu121 and cu124 wheels also load — Ada Lovelace has been part of stock PyTorch wheels for several releases. If you see a CUDA error: no kernel image is available for execution on the device at the first inference call, force-reinstall PyTorch with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2 to match upstream exactly.
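To see why an older wheel can still load on sm_89, and to diagnose the "no kernel image" error, compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below is a simplified compatibility check, not PyTorch's own logic: kernel_available is a hypothetical helper, it treats a compiled sm_XY kernel as usable on devices with the same major version and an equal or higher minor version, treats compute_XY (PTX) entries as usable on equal or newer devices, and handles suffixed targets such as sm_90a as exact-match only.

```python
import re

try:
    import torch
except ImportError:  # the pure check below still works without PyTorch
    torch = None


def kernel_available(capability, arch_list):
    """True if a build listing `arch_list` should run on `capability` (major, minor)."""
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)(\d)([a-z]?)", arch)
        if not match:
            continue
        kind, suffix = match.group(1), match.group(4)
        built = (int(match.group(2)), int(match.group(3)))
        if suffix:
            # Architecture-specific targets: simplified to exact match only.
            if built == (major, minor):
                return True
        elif kind == "sm" and built[0] == major and built[1] <= minor:
            return True  # binary compatible within the same major version
        elif kind == "compute" and built <= (major, minor):
            return True  # PTX can be JIT-compiled for newer devices
    return False


if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    print("device capability:", cap)
    print("kernel available:", kernel_available(cap, torch.cuda.get_arch_list()))
```

On a 4060 Ti the capability should print as (8, 9). A build that lists sm_86 but not sm_89 still passes this check, which matches the observation that cu121/cu124 wheels load on Ada Lovelace.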

Garbled / noisy output (Blackwell-only)

A 5090 user reports audio corruption in Issue #155 and explicitly confirmed it persists even with --no-asr (ruling out the Whisper auto-transcription path). As of mid-May 2026 the issue is open, no 40-series reports have surfaced on the thread, and root cause is still under investigation by the k2-fsa maintainers. The 4060 Ti 16 GB is Ada Lovelace (sm_89), not Blackwell — this bug has not been reproduced on this architecture and you should not expect to hit it. If you do, passing ref_text explicitly (per the quick-start snippet above) is the most consistent reported workaround on adjacent threads, and adding your reproduction to that issue helps the maintainers narrow it.