OmniVoice on RTX 5070: Zero-Shot Voice Cloning Across 646 Languages

tts · intermediate · 4GB+ VRAM · Jun 5, 2026

Prerequisites
  • NVIDIA RTX 5070 12 GB (or any CUDA GPU with ~4 GB free VRAM)
  • Python 3.10 or newer
  • CUDA 12.8 toolkit / driver (Blackwell sm_120 needs the cu128 PyTorch wheel)
  • A short reference clip (3-10 s WAV, mono) plus its transcription

What You'll Build

A local zero-shot text-to-speech setup on an RTX 5070 12 GB that clones any voice from a short reference clip and speaks it back across 646 documented languages (the GitHub README phrases it as "600+"; the structured language list pins the exact figure at 646 across 581k hours of training data). The model is k2-fsa's OmniVoice, a Qwen3-0.6B finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer, released under Apache-2.0.

The RTX 5070's 12 GB comfortably fits this ~4 GB workload: the card has roughly three times the memory the model needs, so the install and runtime below follow the standard path with headroom to spare. The install steps are identical to the 16 GB Blackwell siblings (same sm_120 architecture); what changes on the 12 GB card is the display-headroom math, covered in Results.

Hardware data: RTX 5070 (12 GB VRAM) · ~4 GB working envelope per the community-tested low-VRAM wrapper · benchmark data at /check/omnivoice/rtx-5070

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number. The ~4 GB figure here is the working default from the community-tested low-VRAM wrapper (MAX_VRAM_GB=4, raised from 3 GB after the author hit OOMs) plus the on-disk weight math below. On the 5070's 12 GB that leaves comfortable headroom even with a display attached. Once a 5070 benchmark lands at /check/omnivoice/rtx-5070 we'll replace the envelope with the measured peak.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 4 GB VRAM (CUDA), any consumer NVIDIA card | RTX 5070 12 GB (Blackwell sm_120) |
| RAM | 8 GB system RAM | |
| Storage | ~3.3 GB (2.45 GB main weights + 806 MB audio tokenizer + tokenizer JSON) | |
| Python | 3.10 or newer | |
| CUDA | 12.8 (cu128 wheel required for sm_120) | |
| Reference audio | 3-10 s WAV, mono | |

Model weights total ~3.3 GB on disk per the HuggingFace Files tab: model.safetensors is 2.45 GB and audio_tokenizer/model.safetensors adds 806 MB, with the remainder split between tokenizer.json and the chat template. The upstream repo ships FP32; casting to FP16 at load time roughly halves the resident weight footprint, which is what keeps the model inside the ~4 GB working envelope once activations and runtime overhead are added.
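The disk-to-memory arithmetic above can be sketched in a few lines of Python. This is a back-of-envelope estimate only: it assumes every tensor in both safetensors files is stored in FP32 and casts cleanly to FP16, and it ignores activations, the CUDA context, and allocator overhead, so the result is a floor for the weights and not a VRAM prediction.

```python
# File sizes in GB, as listed in the Requirements section.
MAIN_WEIGHTS_GB = 2.45
AUDIO_TOKENIZER_GB = 0.806


def fp16_weight_estimate_gb(fp32_gb: float) -> float:
    """Assumption: all weights are FP32 on disk, so FP16 needs half the bytes."""
    return fp32_gb / 2


on_disk_gb = MAIN_WEIGHTS_GB + AUDIO_TOKENIZER_GB
resident_gb = fp16_weight_estimate_gb(on_disk_gb)

print(f"safetensors on disk:    {on_disk_gb:.2f} GB")
print(f"estimated FP16 weights: {resident_gb:.2f} GB")
```

The gap between this estimate and the ~4 GB envelope is the budget left for everything the estimate leaves out.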

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch with CUDA 12.8 (Blackwell-compatible)

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

This is the exact wheel pin from the OmniVoice README. The 5070 is Blackwell (sm_120) and requires cu128 — older cu121/cu124 wheels will not load kernels for it.
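Before installing anything else, it can help to confirm that the wheel you ended up with was actually built for the card. The sketch below is one way to do that, assuming a simplified rule: the wheel supports the GPU if its compiled architecture list contains the device's exact sm_XY target (it deliberately ignores PTX forward compatibility). The helper name wheel_supports_device is hypothetical; torch.cuda.get_arch_list and torch.cuda.get_device_capability are the PyTorch calls it relies on.

```python
def wheel_supports_device(arch_list, capability):
    """Return True if the wheel's arch list contains the device's sm_XY target.

    Simplified check (assumption): exact match only, no PTX fallback.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


try:
    import torch
except ImportError:  # lets the helper be imported on a machine without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    archs = torch.cuda.get_arch_list()          # architectures compiled into this wheel
    cap = torch.cuda.get_device_capability(0)   # (12, 0) is expected on sm_120
    print("torch", torch.__version__, "built against CUDA", torch.version.cuda)
    print("device capability:", cap)
    print("wheel supports device:", wheel_supports_device(archs, cap))
```

If the last line prints False on a 5070, the installed wheel is not the cu128 build and step 2 needs to be repeated.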

3. Install OmniVoice

pip install omnivoice

PyPI ships omnivoice 0.1.5 (requires-python >=3.10). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-10 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly via ref_text — see Troubleshooting for why leaning on auto-transcription is risky on Blackwell right now.
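A quick sanity check on the clip can catch the two easy mistakes (stereo audio, wrong length) before the model is ever loaded. This is a minimal sketch using only the standard-library wave module, which reads uncompressed PCM WAV files; the function name reference_clip_problems and its defaults are illustrative, with the 3-10 s window taken from the guidance above.

```python
import wave


def reference_clip_problems(path, min_s=3.0, max_s=10.0):
    """Return a list of human-readable problems; an empty list means the clip looks fine."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        duration = wav.getnframes() / wav.getframerate()

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems
```

Calling reference_clip_problems("ref.wav") and printing the result is enough for a one-off check. A wave.Error at open time usually means the file is not plain PCM and should be re-exported.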

Running

Save this as tts.py next to your ref.wav:

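from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the weights onto the first GPU and cast them to FP16 at load time.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Clone the voice in ref.wav; ref_text should be the transcript of that clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Write the first generated waveform to disk at 24 kHz.
sf.write("out.wav", audio[0], 24000)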

This is the canonical voice-cloning snippet from the upstream model card and GitHub README. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory. Per the upstream tips, keep the reference clip to 3-10 seconds — longer audio slows inference and can degrade cloning quality.
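Because the ~4 GB figure in this recipe is an envelope and not a measurement, it can be worth recording the peak for your own run. The sketch below wraps any inference call with PyTorch's peak-memory counters. The helper names are hypothetical; the torch.cuda functions are standard PyTorch. Note that max_memory_allocated only counts tensors handled by PyTorch's allocator, so nvidia-smi will typically report a higher number that also includes the CUDA context and cached blocks.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def measure_peak_vram_gib(run_inference):
    """Run a zero-argument callable and return PyTorch's peak allocated GPU memory in GiB."""
    import torch  # imported lazily so bytes_to_gib works without a GPU stack

    torch.cuda.reset_peak_memory_stats()
    run_inference()
    torch.cuda.synchronize()  # make sure queued GPU work has finished
    return bytes_to_gib(torch.cuda.max_memory_allocated())
```

In tts.py this could wrap the existing call, for example measure_peak_vram_gib(lambda: model.generate(text=..., ref_audio="ref.wav", ref_text=...)), with the arguments filled in as in the script above.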

ComfyUI alternative

If you live in ComfyUI, the community node from Saganaki22 wraps the same model and credits k2-fsa/OmniVoice as the upstream fp32 source:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS. The install.py script installs omnivoice with --no-deps to avoid downgrading the cu128 PyTorch you set up in step 2.

Results

  • Speed: Not cited for the RTX 5070. The only named-GPU community measurements come from the Wladastic wrapper on an RTX 5060 Ti and an RTX 4080. Both are different cards from this recipe's target, so quoting either as the 5070's number would be a guess rather than a measurement. The 5070 sits between them in memory bandwidth and compute (above the 5060 Ti, below the 4080), so neither figure is a safe bound. Upstream's hardware-unspecified "RTF as low as 0.025" claim is omitted for the same reason. Submit your own measurement to /check/omnivoice/rtx-5070 to seed the empirical data.
  • VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The same wrapper's experimental CPU-offload path pushes resident GPU memory below 1.5 GB by offloading 2-3 GB of weights to system RAM. On the 5070's 12 GB this fits with room to spare; see /check/omnivoice/rtx-5070 for the measured peak once it's seeded.
  • Headroom on the 12 GB card: a 12 GB desktop card with a monitor attached exposes about 10.5-11.3 GB usable, so at ~4 GB resident you have roughly 6.5-7.3 GB free on the 5070. That is enough to keep a small Whisper ASR model loaded for live transcription alongside the TTS head, or to batch a few voice-clone requests without unloading between calls. It is tighter than the 16 GB siblings, though, so plan around the reference-clip VRAM spike noted in Troubleshooting before stacking other models.
  • Quality notes: OmniVoice covers 646 languages totalling 581k hours of training data, but coverage is heavily long-tailed — a handful of languages dominate the hours and many sit on single-digit hours. Cross-lingual cloning is imperfect: per the upstream README tips, when the reference audio and target speech are in different languages the output can carry an accent from the reference audio's language. See HF Discussion #22 for a community note on this.
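The headroom arithmetic in the list above can be reproduced directly. This sketch uses only numbers quoted in this recipe (10.5-11.3 GB usable with a display attached, the ~4 GB envelope, and the 8 GB spike from Troubleshooting); the function and constant names are illustrative.

```python
USABLE_RANGE_GB = (10.5, 11.3)   # 12 GB card with a monitor attached
ENVELOPE_GB = 4.0                # working envelope for OmniVoice
SPIKE_GB = 8.0                   # reported spike with long reference clips


def free_after_load_gb(usable_gb, load_gb):
    """VRAM left for other models once the given load is resident."""
    return usable_gb - load_gb


for label, load in (("working envelope", ENVELOPE_GB), ("reference-clip spike", SPIKE_GB)):
    low, high = (free_after_load_gb(usable, load) for usable in USABLE_RANGE_GB)
    print(f"{label}: {low:.1f}-{high:.1f} GB free")
```

The second line of output is the one to plan around when stacking models: during a spike, the free margin shrinks to a few gigabytes.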

For the full benchmark data, see /check/omnivoice/rtx-5070.
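If you want to contribute a speed figure, real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so lower is faster. The sketch below is one way to compute it, assuming the 24 kHz output rate used in tts.py; the helper names are hypothetical.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio is empty")
    return synthesis_seconds / audio_seconds


def timed(fn):
    """Call fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

A measurement would wrap the model.generate call in timed(...) and pass the length of the returned waveform as num_samples. Run one warm-up generation first so that weight loading and kernel compilation are not counted, and, since CUDA work can be asynchronous, consider calling torch.cuda.synchronize() inside the timed function before it returns.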

Troubleshooting

Garbled / noisy output on a Blackwell card

An RTX 5090 user reports corrupted, noise-like output instead of intelligible speech in Issue #155 (open as of mid-2026; the reporter and the users suggesting workarounds are community members, not maintainers). The reporter notes the failure persisted even after disabling the Whisper reference-transcription step with --no-asr, so it is not simply an ASR problem. A maintainer has asked for an exact reproduction command and the root cause is still under investigation, so treat this as an open community report, not a confirmed Blackwell defect. The 5070 is the same Blackwell sm_120 architecture as the 5090 in that report, so it is in scope.

The reporter's own workaround on the thread (not maintainer-confirmed) was that long input text, beyond roughly 785 characters, triggered the corruption, and auto-chunking the text into ~750-character pieces fixed it for them; a second user suspected the Voice Design (instruct) path specifically. If you hit garbled output, try shorter inputs or chunking first, always pass ref_text explicitly, and add your reproduction to the issue, since Blackwell datapoints are still being collected.
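The chunking workaround described above is easy to approximate. This is a minimal sketch, not the reporter's actual code: it assumes sentence boundaries can be found by splitting on whitespace after ".", "!" or "?", and it hard-splits any single sentence that is longer than the limit (which can cut through a word). The function name chunk_text and the 750-character default are taken from the description, not from OmniVoice itself.

```python
import re


def chunk_text(text, max_chars=750):
    """Split text into pieces of at most max_chars, breaking at sentence ends where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is cut at the limit as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate      # sentence still fits in the current chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesised separately with the same ref_audio and ref_text, and the resulting waveforms concatenated before writing out.wav.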

Fine-tuning fails with a shared-memory error

Inference works fine on the 5070; this only affects training. OmniVoice's default flex_attention training kernel needs roughly 128 KB of shared memory per block, but consumer and workstation cards — including Blackwell sm_120 — are capped at about 99 KB per block. Issue #83 reports the RTX 4090 (Ada sm_89) and RTX A6000 (Ampere sm_86) failing at the 99 KB limit, and a commenter confirmed an RTX PRO 6000 Blackwell Workstation (sm_120) hits the same 99 KB wall — workstation and consumer Blackwell do not get the datacenter-class shared-memory budget. The maintainers have since added an SDPA fine-tuning path: use the examples/config/train_config_finetune_sdpa.json config per the maintainer comment if you need to fine-tune. The community monkey-patch (pinning all Triton block axes to 32x32) is the other documented route. None of this affects inference on the 5070.

VRAM spikes / OOM with a long reference clip

The Wladastic wrapper author observed VRAM spiking up to 8 GB on reference audio longer than ~4 s even with a 4 GB budget set, eventually requiring CPU offload to stabilise. On the 5070's 12 GB the spike still fits, but it eats a large share of the card's display-adjusted usable memory — so if you are stacking other models in the same process, keep the reference clip under ~3.5 s (the wrapper's documented threshold) or set CPU_OFFLOAD=true in that wrapper to push the LM weights to system RAM at a small latency cost.
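If the spike is a concern, trimming the reference clip is the cheapest mitigation. The sketch below cuts a PCM WAV to a maximum length with the standard-library wave module; the function name trim_wav is hypothetical and the 3.5 s default comes from the threshold quoted above. It keeps the start of the clip and discards the rest, which is an assumption about where the cleanest speech is.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=3.5):
    """Copy at most max_seconds from the start of src_path into dst_path.

    Returns the duration of the written clip in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        keep = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(keep)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)     # same channels, sample width and rate
        dst.writeframes(frames)   # header frame count is corrected on close
    return keep / params.framerate
```

After trimming, shorten ref_text so it contains only the words that are still audible in the new clip, since the transcript is expected to match the reference audio.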

pip install fails / wrong CUDA version

You must use the +cu128 PyTorch wheel for the 5070. The default pip install torch index ships an older CUDA build that won't initialise kernels on sm_120, surfacing as CUDA error: no kernel image is available for execution on the device at the first inference call. Re-install with the --extra-index-url https://download.pytorch.org/whl/cu128 flag from step 2.