OmniVoice on RTX 4070 Ti SUPER: Zero-Shot Voice Cloning Across 600+ Languages with Room to Spare

What You'll Build

A local zero-shot text-to-speech setup on an RTX 4070 Ti SUPER that clones any voice from a 3-5 second reference clip and speaks it back across the 600+ languages the model card advertises (the HuggingFace card's language: metadata carries 646 ISO codes; the README phrases the feature as "600+ Languages Supported"). The model is k2-fsa's OmniVoice, a Qwen3-0.6B-based finetune wired into a diffusion-language-model-style TTS architecture with a discrete audio tokenizer.

The RTX 4070 Ti SUPER is wildly over-provisioned for this 0.6B-class model — the working envelope is around 4 GB, so on the card's 16 GB you have roughly 12 GB of headroom to stack a second model (an ASR for live transcription, a small LLM for chat, etc.) in the same process or alongside it. The recipe leads with the single-model install and then shows how to spend that spare VRAM.

Hardware data: RTX 4070 Ti SUPER (16 GB VRAM) · ~4 GB working envelope per the community-tested low-VRAM wrapper · See benchmark data

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number. The ~4 GB figure here is the working default from the community-tested low-VRAM wrapper (MAX_VRAM_GB=4, raised from 3 GB after the author hit OOMs with longer reference clips). On the 4070 Ti SUPER's 16 GB that leaves ~12 GB of headroom — easily enough to keep a second model resident. Once an RTX 4070 Ti SUPER benchmark lands at /check/omnivoice/rtx-4070-ti-super we'll replace the envelope with the measured peak.

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (CUDA), any consumer NVIDIA card	RTX 4070 Ti SUPER (16 GB, Ada Lovelace sm_89)
RAM	8 GB system RAM	—
Storage	~3.3 GB total (`model.safetensors` 2.45 GB + audio tokenizer 806 MB + tokenizer JSON)	—
Python	3.10 or newer	—
CUDA	12.x (cu128 per upstream pin; cu121/cu124 also load on Ada Lovelace)	—
Reference audio	3-5 s WAV, mono	—

Model weight totals come from the HuggingFace Files tab: model.safetensors is 2.45 GB and audio_tokenizer/model.safetensors is 806 MB, with the remainder split between the tokenizer JSON and the chat template. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident weight footprint, which helps keep the working set inside the ~4 GB envelope.
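The weight arithmetic behind those file sizes can be sketched as follows. This assumes every tensor in both safetensors files is stored as FP32 and that halving the precision halves the bytes; activations, the reference-audio context, and the CUDA context are not modeled, so read the result as a floor for weights only, not a VRAM prediction.

```python
# File sizes quoted from the HuggingFace Files tab, in GB.
MODEL_GB = 2.45
AUDIO_TOKENIZER_GB = 0.806


def weight_footprint_gb(bytes_per_param: int, stored_bytes_per_param: int = 4) -> float:
    """Scale the on-disk size to another precision (assumption: all tensors are FP32 on disk)."""
    scale = bytes_per_param / stored_bytes_per_param
    return (MODEL_GB + AUDIO_TOKENIZER_GB) * scale


fp32_gb = weight_footprint_gb(4)  # weights as shipped
fp16_gb = weight_footprint_gb(2)  # weights after an FP16 cast at load time
print(f"FP32 weights: {fp32_gb:.2f} GB")
print(f"FP16 weights: {fp16_gb:.2f} GB")
```

The gap between the FP16 weight total and the ~4 GB envelope is what remains for everything the sketch leaves out.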

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch (CUDA 12.8 per upstream, or cu121/cu124 if you already have it)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

This is the wheel pin from the OmniVoice README. The RTX 4070 Ti SUPER is Ada Lovelace (sm_89) so cu121/cu124 wheels also work — Ada has been supported in stock PyTorch wheels for several releases and needs nothing Blackwell-specific. Matching upstream's cu128 pin just avoids surprises when their kernel set changes.
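Before moving on, it can help to confirm which PyTorch build actually landed in the venv. The helper below is a small sketch, not part of the OmniVoice tooling: `torch_report` is a hypothetical name, and it only reads standard PyTorch attributes. It imports PyTorch lazily so the script still runs, and says so, if the install step failed.

```python
import importlib.util


def torch_report() -> dict:
    """Collect basic facts about the installed PyTorch build, if there is one."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report
    import torch  # lazy import: keeps the script usable when PyTorch is missing

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Ada Lovelace cards such as the RTX 4070 Ti SUPER report (8, 9).
        report["capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back False on a machine with a working driver, the usual suspect is a CPU-only wheel pulled from the default index.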

3. Install OmniVoice

pip install omnivoice

PyPI ships the canonical omnivoice package (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-5 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly — see Troubleshooting for why auto-transcription is risky right now.
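A quick way to confirm the clip meets the 3-5 second, mono guideline is to read its header with Python's standard `wave` module. This is a sketch under assumptions: `check_reference_clip` is a hypothetical helper, and `wave` only understands uncompressed PCM WAV files, so other encodings will raise an error rather than report a duration.

```python
import wave


def check_reference_clip(path: str, min_s: float = 3.0, max_s: float = 5.0) -> dict:
    """Report duration and channel count for a PCM WAV and list anything outside the guideline."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration_s = wav.getnframes() / sample_rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration_s <= max_s):
        problems.append(f"duration {duration_s:.2f} s is outside {min_s}-{max_s} s")
    return {
        "duration_s": duration_s,
        "channels": channels,
        "sample_rate": sample_rate,
        "problems": problems,
    }
```

Calling `check_reference_clip("ref.wav")` returns an empty `problems` list when the clip is in range; anything else is worth trimming or downmixing before the first run.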

Running

Save this as tts.py next to your ref.wav:

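from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the upstream weights onto the GPU, cast to FP16 to cut the resident footprint.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Clone the voice in ref.wav; ref_text should be the exact words spoken in that clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Write the first generated waveform as a 24 kHz WAV.
sf.write("out.wav", audio[0], 24000)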

This is the canonical snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.

ComfyUI alternative

If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model and ships a bf16 variant that brings the working set under ~2 GB:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream FP32 source.

Spending the headroom — colocating a second model

Because OmniVoice's working set is ~4 GB and the RTX 4070 Ti SUPER has 16 GB, the genuinely card-specific story isn't "does it fit" (it fits on a 4 GB card) — it's what to do with the ~12 GB of spare VRAM. Concrete next steps:

Live transcribe-then-clone. Keep a Whisper-class ASR resident on the same card to transcribe the reference clip on the fly, then feed its text into OmniVoice's ref_text. The two models share the 16 GB comfortably.
Chat-to-speech. Load a 7-8B LLM at a 4-bit quant (~5-6 GB) alongside OmniVoice and pipe generated text straight into TTS, all in one process.
Batch multi-speaker. The ComfyUI node above exposes multi-speaker and longform nodes; the 4070 Ti SUPER's headroom lets you run several speaker contexts without offloading.

When you stack models, watch nvidia-smi and keep OmniVoice's reference clips short (see Troubleshooting) so a transient spike doesn't collide with a colocated model's peak.
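A back-of-envelope budget makes the colocation question concrete. The sketch below uses the figures quoted in this recipe (~4 GB for OmniVoice, ~8 GB for its reported spike with long reference clips, ~6 GB for a 4-bit 7-8B LLM); the 1 GB reserve for the desktop and CUDA context is an assumption, and `fits` is a hypothetical helper, not a measurement tool.

```python
CARD_GB = 16.0  # RTX 4070 Ti SUPER


def fits(budget_gb: dict, card_gb: float = CARD_GB, reserve_gb: float = 1.0) -> tuple:
    """Return (fits, total_gb, spare_gb) for a set of per-model VRAM estimates.

    reserve_gb is an assumed allowance for the display and CUDA context.
    """
    total = sum(budget_gb.values())
    spare = card_gb - reserve_gb - total
    return spare >= 0, total, spare


# Steady state: OmniVoice at its ~4 GB envelope next to a 4-bit 7-8B LLM.
steady = fits({"omnivoice": 4.0, "llm_4bit": 6.0})
# Worst case: OmniVoice spiking to the ~8 GB reported for long reference clips.
spike = fits({"omnivoice": 8.0, "llm_4bit": 6.0})
print("steady:", steady)
print("spike: ", spike)
```

Under these estimates the steady state leaves about 5 GB spare, while the spike case still fits but with only about 1 GB to spare, which is why short reference clips matter once a second model is resident.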

Results

Speed: No RTX 4070 Ti SUPER benchmark is in our database yet (/check/omnivoice/rtx-4070-ti-super is currently unknown), and no community write-up gives a timing figure for this exact card. The upstream card's "RTF as low as 0.025 (40x faster than real-time)" claim names no GPU, so it is not used here. The only hands-on timings from the community low-VRAM wrapper were measured on an RTX 4080 and an RTX 5060 Ti, which are different cards, so they are not reproduced as 4070 Ti SUPER numbers. Submit your own measurement to /check/omnivoice/rtx-4070-ti-super to seed the empirical data.
VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The same author reports an aggressive CPU-offload path measured at 1.3 GB on the GPU plus 2.4 GB on system RAM (Discussion #20). On the 4070 Ti SUPER's 16 GB even the un-offloaded ~4 GB envelope leaves substantial headroom. See /check/omnivoice/rtx-4070-ti-super for the measured peak once it's seeded.
Quality notes: OmniVoice advertises 600+ languages, but coverage is uneven across them — community users have flagged that cross-lingual transfer (cloning a voice from one language into another) is imperfect, with audible accent leakage (see HF Discussion #22). English and Chinese are the best-supported. Always pass ref_text explicitly rather than relying on auto-transcription.

For the full benchmark data, see /check/omnivoice/rtx-4070-ti-super.
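If you want to contribute a timing, the real-time factor (RTF) is the generation time divided by the duration of the audio produced, so an RTF of 0.025 corresponds to 40x faster than real time. The helpers below are a minimal sketch; `real_time_factor` and `timed` are hypothetical names, and the 24 kHz sample rate is taken from the `sf.write` call in tts.py.

```python
import time

SAMPLE_RATE = 24000  # matches the sf.write call in tts.py


def real_time_factor(generation_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """RTF = time spent generating / duration of the audio produced (lower is faster)."""
    audio_seconds = num_samples / sample_rate
    return generation_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Illustration with made-up numbers: 0.5 s of compute for 10 s of audio.
example_rtf = real_time_factor(0.5, 10 * SAMPLE_RATE)
print(f"RTF {example_rtf:.3f}, {1 / example_rtf:.0f}x real time")
```

In tts.py this would wrap the existing call, for example `audio, secs = timed(model.generate, text=..., ref_audio="ref.wav", ref_text=...)` followed by `real_time_factor(secs, len(audio[0]))`, assuming `audio[0]` is a one-dimensional array of samples. Discard the first run, since it includes weight loading and warm-up.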

Troubleshooting

VRAM spikes / OOM with a long reference clip

The most likely VRAM-related issue is pushing the working set with a long reference clip. The Wladastic wrapper author raised the default MAX_VRAM_GB from 3 to 4 after hitting out-of-memory errors with longer reference clips, and separately reported in HF Discussion #20 that VRAM "spikes up to 8gb" on samples beyond ~4 s even with a 4 GB cap set. On the 4070 Ti SUPER's 16 GB even an 8 GB transient spike is only half the card, so a single-model run has plenty of room — but if you're colocating models per the section above, keep reference clips short or enable the wrapper's CPU offload (CPU_OFFLOAD=true MAX_VRAM_GB=3 CPU_OFFLOAD_GB=8). After offload the same author measured 1.3 GB on the GPU and 2.4 GB on system RAM (Discussion #20).
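To see whether a given reference clip triggers a spike on your own card, you can poll nvidia-smi while the generation runs and keep the highest reading. The sketch below relies on nvidia-smi's `--query-gpu` CSV output; the function names are hypothetical, and two caveats apply: the readings cover every process on the GPU, not just OmniVoice, and polling can miss spikes shorter than the sampling interval.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text: str) -> list:
    """Parse 'used, total' lines (MiB) into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_memory() -> list:
    """Query nvidia-smi once; raises if the tool is missing or exits with an error."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


def track_peak(samples: list) -> int:
    """Highest used-MiB value seen on GPU 0 across repeated read_memory() results."""
    return max(rows[0][0] for rows in samples)
```

A typical use is to call `read_memory()` from a background thread every 100 ms or so during `model.generate`, collect the results in a list, and pass that list to `track_peak` afterwards.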

Fine-tuning fails with a shared-memory error

OmniVoice's flex_attention training path requests ~128 KB of shared memory per block, above the ~99 KB per-block limit on Ada Lovelace (sm_89) and consumer Ampere (sm_86) cards. See Issue #83 (now closed), whose title and body name the RTX 4090 (Ada sm_89) and the RTX A6000 (Ampere sm_86) as failing identically. The RTX 4070 Ti SUPER is Ada sm_89, the same family as the 4090, and is affected the same way. This applies to fine-tuning only; inference uses smaller blocks and works on the RTX 4070 Ti SUPER without modification.

`pip install` fails / wrong CUDA version

The upstream pin is the +cu128 wheel. On the RTX 4070 Ti SUPER (sm_89) the cu121 and cu124 wheels also load — Ada Lovelace has been part of stock PyTorch wheels for several releases. If you see CUDA error: no kernel image is available for execution on the device at the first inference call, force-reinstall PyTorch with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2 to match upstream exactly.
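The "no kernel image" error means none of the architectures compiled into the wheel can run on the device. The check below is a sketch of the general CUDA compatibility rules, not PyTorch's own logic: binary (`sm_XY`) code runs on devices with the same major version and an equal or higher minor version, and PTX (`compute_XY`) can be JIT-compiled for any newer capability. `arch_covers` is a hypothetical helper, and the example list is made up for illustration; on a real install the list comes from `torch.cuda.get_arch_list()`.

```python
import re


def arch_covers(arch_list: list, capability: tuple) -> bool:
    """True if any compiled arch can run on a device with the given (major, minor) capability."""
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)", arch)
        if match is None:
            continue  # skip suffixed entries such as 'sm_90a', which target one exact chip
        kind, digits = match.group(1), match.group(2)
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # binary code: same major version, equal or lower minor
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX: forward-compatible with newer capabilities
    return False


# Hypothetical arch list for illustration only.
example_archs = ["sm_75", "sm_80", "sm_86", "sm_90"]
print(arch_covers(example_archs, (8, 9)))
```

With the example list the check passes for the RTX 4070 Ti SUPER's (8, 9) capability because of the `sm_80` and `sm_86` entries. If it fails against your wheel's real arch list, reinstalling from the cu128 index is the fix described above.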

Garbled / noisy output (reported on Blackwell, not on Ada)

A 5090 user reports audio corruption in Issue #155 that persists even with --no-asr (ruling out the Whisper auto-transcription path). As of late May 2026 the issue is open; a maintainer has asked the reporter for exact reproduction steps, and root cause is still under investigation. The RTX 4070 Ti SUPER is Ada Lovelace (sm_89), not Blackwell — this bug has not been reproduced on this architecture and you should not expect to hit it. If you do, passing ref_text explicitly (per the quick-start snippet above) is the most consistent reported workaround, and adding your reproduction to that issue helps the maintainers narrow it.