OmniVoice on RTX 5070 Ti: Zero-Shot Voice Cloning Across 646 Languages

What You'll Build

A local zero-shot text-to-speech setup on an RTX 5070 Ti 16 GB that clones any voice from a short reference clip and speaks it back across 646 documented languages (the GitHub README phrases it as "600+"; the structured language list pins the exact figure at 646). The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer.

The 5070 Ti's 16 GB is wildly over-provisioned for this ~4 GB workload — the install and runtime below are identical to any 16 GB consumer card. The genuinely 5070-Ti-specific angle is the ~12 GB of headroom left over: enough to keep a second model resident (an ASR for live transcription, a small chat LLM, etc.) in the same process. See the headroom note in Results.

Hardware data: RTX 5070 Ti (16 GB VRAM) · ~4 GB working envelope per the community-tested low-VRAM wrapper · See benchmark data

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number — upstream testing was reported on Intel Arc A310 (4 GB) and Arc Pro B50 (16 GB) (README). The 4 GB figure here is the working default from the community-tested low-VRAM wrapper (MAX_VRAM_GB=4, raised from 3 GB after the author hit OOMs) plus the on-disk weight math below. On the 5070 Ti's 16 GB that leaves ~12 GB of headroom. Once a 5070 Ti benchmark lands at /check/omnivoice/rtx-5070-ti we'll replace the envelope with the measured peak.

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (CUDA), any consumer NVIDIA card	RTX 5070 Ti 16 GB (Blackwell sm_120)
RAM	8 GB system RAM	—
Storage	~3.3 GB (2.45 GB main weights + 806 MB audio tokenizer + tokenizer JSON)	—
Python	3.10 or newer	—
CUDA	12.8 (cu128 wheel required for sm_120)	—
Reference audio	3-10 s WAV, mono	—

Model weights total ~3.3 GB on disk from the HuggingFace Files tab: model.safetensors is 2.45 GB and audio_tokenizer/model.safetensors adds 806 MB, with the remainder split between tokenizer.json and the chat template. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident weight footprint (to about 1.6 GB if both safetensors files are cast), which keeps the weights well inside the ~4 GB working envelope and leaves the rest for activations and reference-audio processing.
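
The weight arithmetic can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: it assumes every tensor in both safetensors files is stored as FP32 and that both are cast to FP16 at load time, which the upstream card does not spell out. The function names are illustrative.

```python
# File sizes as quoted from the HuggingFace Files tab.
MAIN_WEIGHTS_GB = 2.45       # model.safetensors
AUDIO_TOKENIZER_GB = 0.806   # audio_tokenizer/model.safetensors
ENVELOPE_GB = 4.0            # working envelope from the low-VRAM wrapper


def on_disk_weights_gb(main=MAIN_WEIGHTS_GB, tokenizer=AUDIO_TOKENIZER_GB):
    """Sum of the two safetensors files (tokenizer.json etc. excluded)."""
    return main + tokenizer


def fp16_resident_gb(fp32_gb):
    """Assumption: all tensors are FP32 (4 bytes) and become FP16 (2 bytes)."""
    return fp32_gb * 2 / 4


disk = on_disk_weights_gb()
resident = fp16_resident_gb(disk)
print(f"on disk (FP32):       {disk:.2f} GB")
print(f"resident (FP16 est.): {resident:.2f} GB")
print(f"left in envelope:     {ENVELOPE_GB - resident:.2f} GB for activations")
```

The remainder is an upper bound on what activations and audio buffers can use before the envelope is exceeded; the real split will differ.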

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch with CUDA 12.8 (Blackwell-compatible)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

This is the exact wheel pin from the OmniVoice README. The 5070 Ti is Blackwell (sm_120) and requires cu128 — older cu121/cu124 wheels will not load kernels for it.
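
Before moving on, it can help to confirm that the installed wheel actually carries kernels for the card. The sketch below is a conservative check under one assumption: it only accepts an exact sm_XY match in the build's architecture list, so it may report False on setups that would still work through PTX. The helper name blackwell_ready is hypothetical; the torch calls are standard PyTorch.

```python
def blackwell_ready(cuda_build, capability, arch_list):
    """Return True if the wheel lists kernels for the detected GPU.

    cuda_build: torch.version.cuda (None on CPU-only wheels)
    capability: (major, minor) from torch.cuda.get_device_capability()
    arch_list:  torch.cuda.get_arch_list(), e.g. ["sm_90", "sm_120"]
    """
    if cuda_build is None:
        return False
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available; check the driver and the wheel.")
        return
    capability = torch.cuda.get_device_capability(0)
    print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0), "| capability:", capability)
    ok = blackwell_ready(torch.version.cuda, capability, torch.cuda.get_arch_list())
    print("kernels for this GPU present:", ok)


if __name__ == "__main__":
    main()
```

On a 5070 Ti the capability should read (12, 0), so the check looks for sm_120 in the list.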

3. Install OmniVoice

pip install omnivoice

PyPI ships omnivoice 0.1.5 (requires-python >=3.10). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-10 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly via ref_text — see Troubleshooting for why leaning on auto-transcription is risky on Blackwell right now.
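
A quick sanity check on the clip can catch a stereo or over-long file before it reaches the model. This is a minimal sketch using only the standard-library wave module, which reads PCM WAV files only (a float-encoded WAV raises wave.Error). The function name check_reference is illustrative; the 3-10 s bounds come from the requirement above.

```python
import sys
import wave


def check_reference(path, min_s=3.0, max_s=10.0):
    """Return a list of problems with the clip; an empty list means it looks fine."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        duration = wf.getnframes() / wf.getframerate()
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems


if __name__ == "__main__":
    # Usage: python check_ref.py ref.wav
    for clip in sys.argv[1:]:
        issues = check_reference(clip)
        print(clip, "OK" if not issues else "; ".join(issues))
```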

Running

Save this as tts.py next to your ref.wav:

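from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the model onto the first CUDA device, casting the FP32 weights to FP16.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Clone the voice in ref.wav; replace ref_text with the transcript of that clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Write the first generated waveform as a 24 kHz WAV.
sf.write("out.wav", audio[0], 24000)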

This is the canonical Voice Cloning snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see the weights download on the first run (or resolve from the HuggingFace cache on later runs), then a short delay before out.wav (24 kHz mono) lands in your working directory. Per the upstream tips, keep the reference clip to 3-10 seconds: longer audio slows inference and can degrade cloning quality.
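
If you want numbers for your own card, the same call can be wrapped with a timer and PyTorch's peak-memory counter. This is a sketch under stated assumptions: it reuses the from_pretrained and generate arguments from tts.py unchanged, assumes len(audio[0]) is the sample count at 24 kHz, and reports only memory held by PyTorch's allocator, which will read lower than nvidia-smi. The summarize helper is illustrative.

```python
import time


def summarize(elapsed_s, n_samples, sample_rate, peak_bytes):
    """Collect the figures worth reporting from one generation run."""
    return {
        "elapsed_s": round(elapsed_s, 3),
        "audio_s": round(n_samples / sample_rate, 3),
        "peak_gib": round(peak_bytes / 2**30, 2),
    }


def main():
    try:
        import torch
        from omnivoice import OmniVoice
    except ImportError:
        print("torch and omnivoice must be installed to run the measurement.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device found.")
        return

    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
    )
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()          # do not time work queued during loading
    start = time.perf_counter()
    audio = model.generate(
        text="Hello, this is a test of zero-shot voice cloning.",
        ref_audio="ref.wav",
        ref_text="Transcription of the reference audio.",
    )
    torch.cuda.synchronize()          # wait for the GPU before stopping the clock
    elapsed = time.perf_counter() - start
    print(summarize(elapsed, len(audio[0]), 24000, torch.cuda.max_memory_allocated()))


if __name__ == "__main__":
    main()
```

The first call usually includes one-off warm-up cost, so a second run in the same process tends to give a more representative time.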

ComfyUI alternative

If you live in ComfyUI, the community node from drbaph and Saganaki22 wraps the same model:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS, and links back to k2-fsa/OmniVoice as the upstream source (repo).

Results

Speed: Not cited for the 5070 Ti specifically. The only named-GPU community measurements are the Wladastic wrapper's 0.6 s to generate 5 s of audio on an RTX 5060 Ti (a real-time factor of 0.12) and 0.2 s on an RTX 4080 (RTF 0.04 for the same 5 s). The 5070 Ti has roughly 2× the memory bandwidth of the 5060 Ti (896 GB/s vs 448 GB/s) and more compute, so it should land at or below the 5060 Ti's time, but quoting either card's number as the 5070 Ti's would be a guess, not a measurement. Upstream's hardware-unspecified "RTF as low as 0.025" claim is not used as a 5070 Ti figure for the same reason. Submit your own measurement to /check/omnivoice/rtx-5070-ti to seed the empirical data.
VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The same wrapper's experimental CPU-offload path pushes resident GPU memory below 1.5 GB by offloading 2-3 GB of weights to system RAM. On the 5070 Ti's 16 GB this all fits with room to spare — see /check/omnivoice/rtx-5070-ti for the measured peak once it's seeded.
Headroom — the real 5070 Ti angle: at ~4 GB resident you have roughly 12 GB free on this card. That's enough to keep a Whisper-large ASR model loaded for live transcription alongside the TTS head, run a small Q4 chat LLM in the same process, or batch several voice-clone requests without unloading between calls. The single-model install above is identical across the 16 GB tier; the spare VRAM is what the bigger card actually buys you.
Quality notes: OmniVoice covers 646 languages totalling 581k hours of training data, but coverage is heavily long-tailed — a handful of languages dominate the hours and many sit on single-digit hours. Cross-lingual cloning is imperfect: per the upstream README tips, when the reference audio and target speech are in different languages the output carries an accent from the reference audio's language. See HF Discussion #22 for community notes on this.
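
To compare the community timings with upstream's RTF claim on one scale, the conversion is a single division. The sketch below only restates the numbers quoted in the Speed note; it assumes the RTX 4080 figure refers to the same 5 s of audio as the 5060 Ti figure, and it says nothing about the 5070 Ti.

```python
def real_time_factor(generation_s, audio_s):
    """RTF = time spent generating / duration of audio produced (lower is faster)."""
    return generation_s / audio_s


AUDIO_S = 5.0
community_timings_s = {"RTX 5060 Ti": 0.6, "RTX 4080": 0.2}

for gpu, generation_s in community_timings_s.items():
    print(f"{gpu}: RTF {real_time_factor(generation_s, AUDIO_S):.2f}")

# Upstream's hardware-unspecified claim, turned back into seconds for 5 s of audio.
UPSTREAM_RTF = 0.025
print(f"upstream claim: {UPSTREAM_RTF * AUDIO_S:.3f} s for {AUDIO_S:.0f} s of audio")
```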

For the full benchmark data, see /check/omnivoice/rtx-5070-ti.

Troubleshooting

Garbled / noisy output on a Blackwell card

An RTX 5090 user reports corrupted, noise-like output instead of intelligible speech in Issue #155 (open as of mid-2026; the reporter and the commenters proposing workarounds are community users, not maintainers). The reporter's environment runs Whisper (whisper-large-v3-turbo) for reference transcription on a Windows RTX 5090 with the cu128 PyTorch wheel. A maintainer has asked for an exact reproduction command and the root cause is still under investigation, so treat this as an open community report, not a confirmed Blackwell defect. The 5070 Ti is the same Blackwell sm_120 architecture as the 5090 in that report, so it is in scope. Two community observations surfaced on the thread without maintainer confirmation: the reporter found that long input text (beyond roughly 785 characters) triggered the corruption and that auto-chunking the text fixed it for them, and another user suspected the Voice Design (instruct) path specifically. If you hit garbled output, try shorter inputs or chunking first, always pass ref_text explicitly, and add your reproduction to the issue; Blackwell datapoints are still being collected.
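
One way to try the chunking workaround is to split the input at sentence boundaries before calling generate. This is a minimal sketch, not the reporter's implementation: the 700-character default is an assumed safety margin below the roughly 785 characters mentioned on the thread, and the splitter only recognises ".", "!" and "?", so scripts with other sentence punctuation need a different rule.

```python
import re


def chunk_text(text, max_chars=700):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = " ".join(["This is one sentence of the script."] * 40)
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

Each chunk can then be passed as text to model.generate with the same ref_audio and ref_text, and the resulting waveforms joined before writing. Whether the joins sound seamless is something to verify by ear.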

Fine-tuning fails with a shared-memory error

Inference works fine on the 5070 Ti; this only affects training. OmniVoice's default flex_attention training kernel needs ~128 KB of shared memory per block, but consumer and workstation cards — including Blackwell sm_120 — are capped at ~99 KB per block. Issue #83 names the RTX 4090 (Ada sm_89) and RTX A6000 (Ampere sm_86) as failing at the 99 KB limit, and a commenter confirmed an RTX PRO 6000 Blackwell Workstation (sm_120) hits the same 99 KB wall — workstation/consumer Blackwell does not get the datacenter-class shared memory (comment). The maintainers have since added an SDPA fine-tuning path: use the examples/config/train_config_finetune_sdpa.json config per the maintainer comment if you need to fine-tune. The community monkey-patch (pinning all Triton block axes to 32×32) is the other documented route. None of this affects inference on the 5070 Ti.

VRAM spikes / OOM with a long reference clip

The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget set, eventually requiring CPU offload to stabilise. On the 5070 Ti's 16 GB the spike is only half the card, so it fits — but if you're stacking other models in the same process, plan for that headroom. To keep VRAM tight, keep the reference clip under 3.5 s (the wrapper's documented threshold) or set CPU_OFFLOAD=true in that wrapper to push the LM weights to system RAM at a small latency cost.
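
Trimming the clip is easy to script. The sketch below keeps the first 3.5 s of a PCM WAV using only the standard library; the function name trim_wav and the output filename are illustrative. It cuts at a fixed time rather than at a word boundary, so shorten ref_text to match whatever speech survives the cut.

```python
import sys
import wave


def trim_wav(src, dst, max_seconds=3.5):
    """Copy at most max_seconds of audio from src to dst; return the kept duration."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)      # frame count in the header is fixed up on close
        writer.writeframes(frames)
    return keep / params.framerate


if __name__ == "__main__":
    # Usage: python trim_ref.py ref.wav ref_short.wav
    if len(sys.argv) == 3:
        kept = trim_wav(sys.argv[1], sys.argv[2])
        print(f"kept {kept:.2f} s")
```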

pip install fails / wrong CUDA version

You must use the +cu128 PyTorch wheel for the 5070 Ti. The default pip install torch index ships an older CUDA build that won't initialise kernels on sm_120, surfacing as CUDA error: no kernel image is available for execution on the device at the first inference call. Re-install with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2.