What You'll Build
A local zero-shot text-to-speech setup on an RTX 4070 12 GB that clones any voice from a short reference clip and speaks it back across 646 documented languages (the README phrases it as "600+ Languages Supported"; the structured language list pins the exact figure at 646 across 581k hours of training data). The model is k2-fsa's OmniVoice, a Qwen3-0.6B finetune wired into a diffusion-language-model TTS head with a discrete audio tokenizer, released under Apache-2.0.
The RTX 4070's 12 GB comfortably fits this ~4 GB workload: the card has roughly three times the memory the model needs, so the install and runtime below follow the standard path with headroom left over. What you get on the 12 GB card is a clean single-model setup plus enough spare VRAM to keep one small companion model resident (an ASR for live transcription, for example), covered in Results.
Hardware data: RTX 4070 (12 GB VRAM) · ~4 GB working envelope per the community-tested low-VRAM wrapper · See benchmark data
ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number. The ~4 GB figure here is the working default from the community-tested low-VRAM wrapper (MAX_VRAM_GB=4, raised from 3 GB after the author hit OOMs with longer reference clips) plus the on-disk weight math below. On the 4070's 12 GB that leaves comfortable headroom even with a display attached. Once an RTX 4070 benchmark lands at /check/omnivoice/rtx-4070 we'll replace the envelope with the measured peak.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4 GB VRAM (CUDA), any consumer NVIDIA card | RTX 4070 12 GB (Ada Lovelace sm_89) |
| RAM | 8 GB system RAM | — |
| Storage | ~3.3 GB (2.45 GB main weights + 806 MB audio tokenizer + tokenizer JSON) | — |
| Python | 3.10 or newer | — |
| CUDA | 12.x (default cu124 wheel works on Ada sm_89; cu121/cu128 also load) | — |
| Reference audio | 3-10 s WAV, mono | — |
Model weights total ~3.3 GB on disk from the HuggingFace Files tab: model.safetensors is 2.45 GB and audio_tokenizer/model.safetensors adds 806 MB, with the remainder split between tokenizer.json and the chat template. The upstream repo ships FP32; casting to FP16 at load time roughly halves the weight footprint to about 1.6 GB, which is how the model fits inside the ~4 GB working envelope with room left for everything besides the weights.
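The weight math above can be reproduced in a few lines. This is a minimal sketch using only the file sizes quoted in this guide; it assumes both safetensors files are cast to FP16 (the guide only states this for the load call in general), and the helper name is ours, not part of OmniVoice.

```python
# File sizes as listed in this guide, in decimal GB.
MAIN_WEIGHTS_GB = 2.45       # model.safetensors
AUDIO_TOKENIZER_GB = 0.806   # audio_tokenizer/model.safetensors
ENVELOPE_GB = 4.0            # working envelope from the low-VRAM wrapper default


def weights_footprint_gb(dtype_ratio=1.0):
    """Combined weight size scaled by a dtype ratio (FP16 / FP32 = 0.5).

    Assumption: both files shrink by the same ratio when cast.
    """
    return (MAIN_WEIGHTS_GB + AUDIO_TOKENIZER_GB) * dtype_ratio


fp32_gb = weights_footprint_gb(1.0)
fp16_gb = weights_footprint_gb(0.5)
print(f"FP32 weights: {fp32_gb:.2f} GB")
print(f"FP16 weights: {fp16_gb:.2f} GB")
print(f"Left inside the envelope for non-weight memory: {ENVELOPE_GB - fp16_gb:.2f} GB")
```

These are derived figures, not measurements: the real resident size also depends on activations and allocator overhead, which this arithmetic does not model.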
Installation
1. Create a clean Python env
python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
2. Install PyTorch (CUDA 12.x)
pip install torch==2.8.0 torchaudio==2.8.0
The RTX 4070 is Ada Lovelace (sm_89), which stock PyTorch wheels have supported for several releases: the default CUDA 12.x build already runs on sm_89, so no special wheel selection is required. This is the key difference from Blackwell (RTX 50-series, sm_120) cards, which need the +cu128 wheel; the 4070 does not. If you prefer to match the OmniVoice README's example exactly, pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128 also loads on Ada, but it is not required.
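Before moving on, it can help to confirm that the wheel you just installed actually covers your card. The sketch below is one way to do that with `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. As far as we know, compiled kernels are compatible within a major compute-capability version, so the list may show `sm_86` rather than a literal `sm_89` entry; that compatibility rule is an assumption of this sketch rather than something the OmniVoice docs state, and `device_is_covered` is a hypothetical helper name.

```python
import re


def device_is_covered(arch_list, major, minor):
    """Return True if some compiled kernel in arch_list can run on (major, minor).

    Assumption: a kernel built for sm_XY runs on devices with the same major
    version and an equal or higher minor version (so sm_86 covers sm_89).
    """
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)", arch)
        if not match:
            continue  # skip PTX entries such as "compute_90" and suffixed variants
        sm = int(match.group(1))
        if sm // 10 == major and sm % 10 <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            print("this torch build cannot see a CUDA device")
        else:
            major, minor = torch.cuda.get_device_capability(0)
            arch_list = torch.cuda.get_arch_list()
            print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")
            print("kernels in wheel:", arch_list)
            print("covered:", device_is_covered(arch_list, major, minor))
```

If the last line prints `covered: False`, the torch build in this environment is the first thing to suspect.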
3. Install OmniVoice
pip install omnivoice
PyPI ships the omnivoice package (Apache-2.0). The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.
4. Prepare a reference clip
Pick a 3-10 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly via ref_text — see Troubleshooting for why leaning on auto-transcription is risky.
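A quick sanity check on ref.wav can catch a stereo or overlong clip before it reaches the model. The following is a minimal sketch using only Python's standard-library `wave` module, which reads uncompressed PCM WAV files; `check_reference_clip` is a hypothetical helper, and the 3-10 s bounds simply mirror the recommendation in this guide.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # range recommended in this guide


def check_reference_clip(path):
    """Return a list of problems; an empty list means the clip looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        duration = wav.getnframes() / wav.getframerate()
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration {duration:.2f} s is outside "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f} s"
        )
    return problems
```

Calling `check_reference_clip("ref.wav")` from the working directory returns an empty list when the clip is mono and within range. It does not check the transcript, so ref_text still has to be written by hand.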
Running
Save this as tts.py next to your ref.wav:
from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,  # cast the FP32 upstream weights to FP16 at load time
)
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",  # replace with what ref.wav actually says
)
sf.write("out.wav", audio[0], 24000)  # 24 kHz output
This is the canonical voice-cloning snippet from the upstream model card and GitHub README. Run it:
python tts.py
You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory. Per the upstream tips, keep the reference clip to 3-10 seconds — longer audio slows inference and can degrade cloning quality.
ComfyUI alternative
If you live in ComfyUI, the community node from Saganaki22 wraps the same model and credits k2-fsa/OmniVoice as the upstream fp32 source:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py
It exposes nodes for voice clone, voice design, multi-speaker, and longform TTS (repo). The install.py script installs omnivoice with --no-deps to avoid disturbing the PyTorch you set up in step 2.
Results
- Speed: Not cited for the RTX 4070. Our database has no RTX 4070 benchmark yet (/check/omnivoice/rtx-4070 is currently unknown), and the only named-GPU community measurements come from the Wladastic wrapper on an RTX 5060 Ti and an RTX 4080. Those are different cards from this recipe's target, so quoting either as the 4070's number would be a guess rather than a measurement. Upstream's hardware-unspecified "RTF as low as 0.025" claim names no GPU, so it is omitted here too. Submit your own measurement to /check/omnivoice/rtx-4070 to seed the empirical data.
- VRAM usage: Working envelope ~4 GB on consumer NVIDIA with the Wladastic wrapper's nf4 LM + fp16 TTS recipe (default MAX_VRAM_GB=4 after the author hit OOMs at 3 GB with longer reference clips). The same author reports an aggressive CPU-offload path measured at 1.3 GB on the GPU plus 2.4 GB on system RAM (Discussion #20). On the 4070's 12 GB even the un-offloaded ~4 GB envelope leaves comfortable headroom; see /check/omnivoice/rtx-4070 for the measured peak once it's seeded.
- Headroom on the 12 GB card: at ~4 GB resident you have roughly 6-7 GB free on the 4070 after the desktop's display reservation (a 12 GB desktop card with a monitor attached exposes about 10.5-11.3 GB usable). That is enough to keep a small Whisper ASR model loaded for live transcription alongside the TTS head, or to batch a few voice-clone requests without unloading between calls. Plan around the reference-clip VRAM spike noted in Troubleshooting before stacking other models, and note that on the 4070's PCIe Gen4 x16 link any CPU-offload path streams weights at roughly half the bandwidth of a Gen5 card.
- Quality notes: OmniVoice covers 646 languages totalling 581k hours of training data, but coverage is heavily long-tailed: a handful of languages dominate the hours and many sit on single-digit hours. Cross-lingual cloning is imperfect: per the upstream README tips, when the reference audio and the target speech are in different languages the output can carry an accent from the reference audio's language. See HF Discussion #22 for a community note on this. English and Chinese are the best-supported; always pass ref_text explicitly rather than relying on auto-transcription.
For the full benchmark data, see /check/omnivoice/rtx-4070.
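If you want to produce your own speed and VRAM numbers for the 4070, one possible approach is sketched below. RTF (real-time factor) is generation time divided by the duration of the generated audio, so lower is faster. The `measure` helper is hypothetical; it assumes the generate call returns a sequence whose first item is the waveform (as tts.py indexes `audio[0]`) and that the output rate is 24 kHz. `torch.cuda.max_memory_allocated()` only counts memory held by PyTorch's allocator, so expect it to read lower than nvidia-smi.

```python
import time

SAMPLE_RATE = 24000  # output rate used by the tts.py example in this guide


def real_time_factor(elapsed_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """RTF = wall-clock generation time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure(generate, *args, **kwargs):
    """Time one call to `generate`; return (RTF, peak allocated VRAM in GB).

    Assumption: generate(...) returns a sequence whose first item is the
    waveform, with one entry per sample along its first axis.
    """
    import torch  # imported lazily so real_time_factor works without a GPU

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    audio = generate(*args, **kwargs)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    return real_time_factor(elapsed, len(audio[0])), peak_gb
```

With the model from tts.py loaded, a call would look like `rtf, peak_gb = measure(model.generate, text="...", ref_audio="ref.wav", ref_text="...")`. The first call usually includes one-time warm-up cost, so it is worth measuring a second call as well.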
Troubleshooting
VRAM spikes / OOM with a long reference clip
The most likely VRAM-related issue is pushing the working set with a long reference clip. The Wladastic wrapper author raised the default MAX_VRAM_GB from 3 to 4 after hitting out-of-memory errors with longer reference clips, and separately reported in HF Discussion #20 that VRAM spiked up to 8 GB on samples beyond ~4 s even with a 4 GB cap set. On the 4070's 12 GB an 8 GB transient spike still fits, but it eats a large share of the card's display-adjusted usable memory — so if you are colocating models per the headroom note above, keep reference clips under ~3.5 s (the wrapper's documented threshold) or enable the wrapper's CPU offload (CPU_OFFLOAD=true MAX_VRAM_GB=3 CPU_OFFLOAD_GB=8). After offload the same author measured 1.3 GB on the GPU and 2.4 GB on system RAM (Discussion #20).
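To see what the spike means for colocated models, the headroom arithmetic can be written out. This is a rough budgeting sketch built only from the figures quoted in this guide (10.5-11.3 GB usable, ~4 GB resident, up to 8 GB during a spike); the helper names are ours, and the companion-model sizes you pass in are your own estimates.

```python
USABLE_GB_RANGE = (10.5, 11.3)  # 12 GB card with a monitor attached, per this guide
RESIDENT_GB = 4.0               # working envelope
SPIKE_GB = 8.0                  # transient peak reported in Discussion #20


def free_after(load_gb, usable_range=USABLE_GB_RANGE):
    """Return (low, high) free VRAM in GB once `load_gb` is in use."""
    low, high = usable_range
    return (round(low - load_gb, 1), round(high - load_gb, 1))


def companion_fits(companion_gb, load_gb=SPIKE_GB):
    """Conservative check against the low end of usable VRAM during a spike."""
    return companion_gb <= free_after(load_gb)[0]


print("steady state:", free_after(RESIDENT_GB))  # (6.5, 7.3)
print("during spike:", free_after(SPIKE_GB))     # (2.5, 3.3)
```

Budgeting a companion model against the spike figure rather than the steady-state one is the cautious choice if you plan to use reference clips longer than the wrapper's threshold.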
Fine-tuning fails with a shared-memory error
This affects training only — inference works fine on the 4070. OmniVoice's default flex_attention training kernel requests roughly 128 KB of shared memory per block, but consumer and workstation cards are capped at about 99 KB per block. Issue #83 (now closed) is titled "Fine-tuning flex_attention kernel exceeds shared memory limit on Ampere/Ada GPUs (sm_86, sm_89)" and names the RTX 4090 (Ada sm_89) and RTX A6000 (Ampere sm_86) as failing identically at the 99 KB limit. The RTX 4070 is Ada sm_89 — the same architecture as the 4090 in that report — so the training path hits the same wall. The maintainer has since added an SDPA fine-tuning path: use the examples/config/train_config_finetune_sdpa.json config per the maintainer comment if you need to fine-tune. A community monkey-patch that pins all Triton block axes to 32x32 (comment) is the other documented route. None of this affects inference on the 4070.
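The decision described above can be sketched as a small lookup. This is illustrative only: `finetune_config` is a hypothetical helper, the set of affected architectures is limited to the two named in Issue #83, and the KB figures are the approximate values quoted in this guide. It says nothing about architectures the issue does not mention.

```python
REQUESTED_KB = 128  # approx. shared memory per block requested by the flex_attention kernel
LIMIT_KB = 99       # approx. per-block cap reported in Issue #83

# Compute capabilities named in Issue #83: sm_86 (RTX A6000) and sm_89 (RTX 4090).
AFFECTED = {(8, 6), (8, 9)}

SDPA_CONFIG = "examples/config/train_config_finetune_sdpa.json"


def finetune_config(major, minor):
    """Return the SDPA config path for architectures named in Issue #83, else None.

    None means "this guide has no recommendation", not "the default path works".
    """
    if (major, minor) in AFFECTED and REQUESTED_KB > LIMIT_KB:
        return SDPA_CONFIG
    return None


print(finetune_config(8, 9))  # the RTX 4070 is sm_89
```

On a machine with PyTorch installed, the `(major, minor)` pair can be read with `torch.cuda.get_device_capability()`.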
pip install fails / wrong CUDA version
The RTX 4070 (sm_89) loads stock PyTorch wheels — the default pip install torch index already ships sm_89 kernels, unlike Blackwell sm_120 cards which require the +cu128 build. If you nonetheless see CUDA error: no kernel image is available for execution on the device at the first inference call, your environment has an unusually old torch; force-reinstall a recent release (e.g. pip install --upgrade --force-reinstall torch==2.8.0 torchaudio==2.8.0), which carries Ada kernels.