How much VRAM does OmniVoice need?

About 4 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

OmniVoice on RX 7900 XTX: Zero-Shot Voice Cloning Across 646 Languages on ROCm (BF16)

What You'll Build

A local zero-shot text-to-speech setup on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) running through the ROCm stack. It clones a voice from a 3-5 second reference clip and speaks in that voice across the languages the model supports: the HuggingFace card tags 646 language codes and describes the model as "a massively multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages". The model is k2-fsa's OmniVoice, a finetune of Qwen3-0.6B-Base wired into a diffusion-language-model-style TTS architecture with a discrete audio tokenizer.

OmniVoice is pure PyTorch — there is no flash-attn dependency and no custom CUDA kernel to compile, so it ports to AMD cleanly. The upstream README itself confirms the attention path falls back to PyTorch's scaled-dot-product attention when a flash-attn kernel isn't present (it documents "flash_attn is not available on XPU; the model automatically falls back to SDPA" for Intel Arc) — the exact same SDPA path the model takes on ROCm. The RX 7900 XTX is wildly over-provisioned for this 0.6B-class model: the working envelope is around 4 GB, so on the 7900 XTX's 24 GB you have ~20 GB of headroom to stack a second model (an ASR for live transcription, a small LLM for chat) in the same process.

Hardware data: RX 7900 XTX (24 GB VRAM) · BF16 on ROCm · ~4 GB working envelope · Benchmark data: /check/omnivoice/rx-7900-xtx

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack, so there is no cu12x wheel, no flash-attn wheel, and no FP8/FP4 path. RDNA3's WMMA units accept only FP16, BF16, INT8, and INT4, so an FP8 checkpoint would just upcast to BF16 with no memory saving, and at 24 GB you don't need quantization anyway. The attention path is PyTorch SDPA, which is exactly what OmniVoice falls back to when no flash-attn kernel is available. Under ROCm/HIP, PyTorch still exposes the GPU through the cuda device namespace: torch.cuda.is_available() returns True and device_map="cuda:0" targets the AMD card. That is expected behavior, not a sign of a CUDA install.

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number, and we have no first-party RX 7900 XTX benchmark yet. The ~4 GB figure is this recipe's working estimate at BF16/FP16: the two weight files total ~3.26 GB (model.safetensors 2.45 GB + audio_tokenizer/model.safetensors 806 MB), and inference activations for this 0.6B-class model add little on top. On the 7900 XTX's 24 GB that leaves ~20 GB free. Once a measurement lands at /check/omnivoice/rx-7900-xtx we'll replace the envelope with the measured peak. Contribute one if you run it.
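The envelope arithmetic can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the file sizes are the ones quoted above, the 4 GB envelope is the recipe's assumption rather than a measurement, and the variable names are illustrative.

```python
# File sizes as listed on the HuggingFace Files tab (decimal GB).
WEIGHTS_GB = {
    "model.safetensors": 2.45,
    "audio_tokenizer/model.safetensors": 0.806,
}
ENVELOPE_GB = 4.0  # working envelope assumed by this recipe, not a measured peak
CARD_GB = 24.0     # RX 7900 XTX

weights_total = sum(WEIGHTS_GB.values())
activation_allowance = ENVELOPE_GB - weights_total  # what the envelope leaves for activations
headroom = CARD_GB - ENVELOPE_GB                    # what the card leaves for other models

print(f"weights on disk:      {weights_total:.2f} GB")
print(f"left for activations: {activation_allowance:.2f} GB")
print(f"headroom on the card: {headroom:.0f} GB")
```

Swap in a measured peak for ENVELOPE_GB once you have one; the headroom figure follows directly.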

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB, RDNA3 / gfx1100)
RAM	8 GB system RAM	—
Storage	~3.3 GB total (`model.safetensors` 2.45 GB + audio tokenizer 806 MB + tokenizer JSON)	—
Driver	AMD ROCm on Linux	—
Python	3.10 or newer	—
Reference audio	3-5 s WAV, mono	—

Model weight totals come from the HuggingFace Files tab — model.safetensors is 2.45 GB and audio_tokenizer/model.safetensors is 806 MB, with the remainder split between the tokenizer JSON and the chat template. The package is released under the Apache-2.0 License and the weights are not gated on Hugging Face — no access request or login is required. Loading at BF16/FP16 (the snippet below passes dtype) keeps the resident footprint near the ~4 GB working envelope.

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel — the ROCm equivalent of the cu12x wheel the upstream README pins for NVIDIA. Select the ROCm option from the PyTorch "Get Started" selector and use the generated command, which is of the form:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in that URL moves over time (6.3 → 6.4 → 7.x) as new ROCm releases land — the value above reflects the stable option visible on the PyTorch selector at the time of writing. Always read the current line from the live PyTorch "Get Started" page and match the wheel's ROCm version to the ROCm stack you have installed on the host. AMD also publishes its own Radeon-tuned wheels at repo.radeon.com if you prefer the vendor build.
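One way to check the wheel against the host after installing is to read the ROCm tag out of the PyTorch version string. The sketch below assumes the usual "+rocmX.Y" local-version suffix; the helper names and the example version strings are hypothetical, and how you obtain the host ROCm version is left to you.

```python
import re
from typing import Optional


def rocm_tag(torch_version: str) -> Optional[str]:
    """Return the ROCm version embedded in a torch version string, or None.

    A string ending in '+rocm6.3' yields '6.3'; CUDA or CPU builds yield None.
    """
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", torch_version)
    return match.group(1) if match else None


def same_major_minor(wheel_rocm: str, host_rocm: str) -> bool:
    """Compare only the major.minor parts, so '6.3' matches '6.3.1'."""
    return wheel_rocm.split(".")[:2] == host_rocm.split(".")[:2]
```

In a live session you would call rocm_tag(torch.__version__) and compare the result with the ROCm version installed on the host. A return value of None means the installed wheel is not a ROCm build.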

3. Install OmniVoice

pip install omnivoice

PyPI ships the canonical omnivoice package (Apache-2.0). It is pure PyTorch — it pulls in no flash-attn wheel and builds no custom CUDA/HIP kernel, so this pip install works the same on ROCm as on CUDA. The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.
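If you would rather not wait for the download on the first inference call, the weights can be fetched ahead of time. This is a sketch under two assumptions: that huggingface_hub is present in the environment (it normally comes in with HuggingFace-based packages, but that is not confirmed for omnivoice), and that the cache lives under your home directory, which is the default unless HF_HOME is set. The helper names are illustrative.

```python
import os
import shutil

REPO_ID = "k2-fsa/OmniVoice"
NEEDED_GB = 3.3  # approximate total from the Requirements table


def free_disk_gb(path: str = os.path.expanduser("~")) -> float:
    """Free space, in decimal GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def prefetch(repo_id: str = REPO_ID) -> str:
    """Download the repo into the HuggingFace cache and return its local path."""
    # Imported lazily so the disk check works even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Compare free_disk_gb() against NEEDED_GB first, then call prefetch(); later from_pretrained calls should resolve from the cache.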

4. Prepare a reference clip

Pick a 3-5 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly — see Troubleshooting for why auto-transcription is risky right now.
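A quick sanity check on the clip can catch a stereo or overlong file before it reaches the model. The sketch below uses only the standard-library wave module, so it handles plain PCM WAV files only; the function names and the 3-5 s defaults are taken from this recipe, not from OmniVoice itself.

```python
import wave


def describe_wav(path: str) -> tuple[int, int, float]:
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


def check_reference(path: str, min_s: float = 3.0, max_s: float = 5.0) -> list[str]:
    """Return a list of problems; an empty list means the clip fits the recipe."""
    channels, _rate, duration = describe_wav(path)
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems
```

Calling check_reference("ref.wav") returns an empty list when the clip is mono and 3-5 seconds long.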

Running

Save this as tts.py next to your ref.wav:

from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

sf.write("out.wav", audio[0], 24000)

This is the canonical snippet from the upstream model card and GitHub README, with one ROCm-appropriate change: dtype=torch.bfloat16 instead of float16. RDNA3 has native BF16 in its WMMA units, BF16 has a wider exponent range than FP16 (less prone to overflow), and at 24 GB there is no reason to chase the smaller-but-narrower format — FP16 works too if you prefer. device_map="cuda:0" is correct on AMD: under ROCm/HIP, PyTorch routes the cuda device namespace to the Radeon card, so this targets the 7900 XTX. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.
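The BF16-versus-FP16 choice in the snippet comes down to exponent range, which can be worked out from the bit layouts alone. The sketch below derives the largest finite value of each format from its exponent and mantissa widths (5/10 for FP16, 8/7 for BF16); the helper is illustrative and does not touch PyTorch.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the top usable code is 2**e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


FP16_MAX = max_finite(exponent_bits=5, mantissa_bits=10)
BF16_MAX = max_finite(exponent_bits=8, mantissa_bits=7)

print(f"FP16 max finite value: {FP16_MAX:.6g}")
print(f"BF16 max finite value: {BF16_MAX:.6g}")
```

The trade-off runs the other way for precision: BF16 keeps 7 mantissa bits against FP16's 10, which is why FP16 remains a reasonable option if you prefer it.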

Spending the headroom — colocating a second model

Because OmniVoice's working set is ~4 GB and the RX 7900 XTX has 24 GB, the genuinely card-specific story isn't "does it fit" (it fits on a 4 GB card) — it's what to do with the ~20 GB of spare VRAM. Concrete next steps:

Live transcribe-then-clone. Keep a Whisper-class ASR resident on the same card to transcribe the reference clip on the fly, then feed its text into OmniVoice's ref_text. The two models share the 24 GB comfortably.
Chat-to-speech. Load a 7-8B LLM (Ollama on ROCm is the cleanest "just works" surface on RDNA3) alongside OmniVoice and pipe generated text straight into TTS.
Batch multi-speaker / longform. The 7900 XTX's headroom lets you run several speaker contexts without offloading.

When you stack models, watch rocm-smi and keep OmniVoice's reference clips short (see Troubleshooting) so a transient spike doesn't collide with a colocated model's peak.
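Alongside rocm-smi, you can ask PyTorch how much VRAM is free before loading the second model. This is a minimal sketch: torch.cuda.mem_get_info is a real PyTorch call that returns free and total bytes, but the helper names and the 2 GiB safety margin are assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def can_colocate(free_bytes: int, needed_gib: float, margin_gib: float = 2.0) -> bool:
    """True if `needed_gib` plus a safety margin fits in the free VRAM."""
    return free_bytes >= (needed_gib + margin_gib) * GIB


def free_vram_bytes(device: int = 0) -> int:
    """Free VRAM on the given device, as reported by PyTorch."""
    import torch  # lazy import: can_colocate stays usable without a GPU

    free, _total = torch.cuda.mem_get_info(device)
    return free
```

With OmniVoice already loaded, can_colocate(free_vram_bytes(), needed_gib=...) gives a rough go/no-go for the second model; size the margin to cover the transient spikes described in Troubleshooting.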

Results

Speed: Omitted. No OmniVoice benchmark for the RX 7900 XTX is in our database yet (/check/omnivoice/rx-7900-xtx is currently unknown), and we found no verifiable 7900 XTX measurement elsewhere, so we give no figure rather than transfer one from a different GPU or vendor. Upstream's RTF claim names no hardware, so it is not quoted here. If you've measured OmniVoice generation time on a 7900 XTX, please contribute it so it lands on /check/omnivoice/rx-7900-xtx.
VRAM usage: Working envelope ~4 GB at BF16/FP16: the two weight files total ~3.26 GB (HF Files tree), and this 0.6B-class model's inference activations add little. On the 7900 XTX's 24 GB, even a transient spike on a long reference clip (see Troubleshooting) sits comfortably within the card. See /check/omnivoice/rx-7900-xtx for the measured peak once it's seeded.
Quality notes: OmniVoice advertises 646 languages, but coverage is uneven across them — community users have flagged that cross-lingual transfer (cloning a voice from one language into another) is imperfect, with audible accent leakage (see HF Discussion #22). English and Chinese are the best-supported. Always pass ref_text explicitly rather than relying on auto-transcription. There is no quantization tradeoff to weigh on this card — run the native BF16/FP16 weights.

For the full benchmark data, see /check/omnivoice/rx-7900-xtx.
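If you want to produce a measurement to contribute, a timing and peak-memory wrapper around one generate call is enough to start. The sketch below relies on PyTorch's own peak-memory counters, which see only tensors allocated through PyTorch and so will read lower than rocm-smi; the function names are hypothetical, and treating audio[0] as a one-dimensional array of samples is an assumption based on the sf.write call in the quick-start snippet.

```python
import time


def real_time_factor(generation_seconds: float, n_samples: int, sample_rate: int = 24000) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    return generation_seconds / (n_samples / sample_rate)


def measure(model, **generate_kwargs):
    """Run one model.generate call; return (audio, seconds, peak_allocated_gib)."""
    import torch  # lazy import: real_time_factor stays usable without a GPU

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()  # make sure earlier GPU work is not counted
    start = time.perf_counter()
    audio = model.generate(**generate_kwargs)
    torch.cuda.synchronize()  # wait for generation to finish before stopping the clock
    seconds = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    return audio, seconds, peak_gib
```

Call measure(model, text=..., ref_audio="ref.wav", ref_text=...) at least twice and report the second run, since the first includes one-time warm-up; then pass the elapsed seconds and len(audio[0]) to real_time_factor.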

Troubleshooting

"Torch not compiled with CUDA enabled"

This error means the installed PyTorch has no GPU backend compiled in, typically because a CPU-only wheel got installed instead of the ROCm build. Uninstall and reinstall against the ROCm wheel index from step 2:

pip uninstall torch torchaudio
pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (under HIP, ROCm masquerades as the cuda device namespace — that is why the device_map="cuda:0" snippet above is correct on an AMD card).

VRAM spikes / OOM with a long reference clip

The most likely VRAM-related issue is pushing the working set with a long reference clip — community reports note the resident set can transiently spike well above the ~4 GB idle envelope on samples beyond ~4 seconds (see HF Discussion #20). On the 7900 XTX's 24 GB a single-model run has enormous headroom for that, but if you're colocating models per the section above, keep reference clips short (3-5 s) so a spike doesn't collide with a colocated model's peak. Monitor with rocm-smi.
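If your source recording runs long, trimming it before it reaches the model is the simplest way to stay inside the short-clip range. The sketch below cuts a PCM WAV to its first few seconds using only the standard-library wave module; the function name and the 5-second default are illustrative, and it keeps the start of the file rather than picking the cleanest segment.

```python
import wave


def trim_wav(src: str, dst: str, max_seconds: float = 5.0) -> float:
    """Copy at most `max_seconds` from the start of a PCM WAV; return the new duration."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)    # same channels, sample width, and rate
        writer.writeframes(frames)  # the header frame count is corrected on close
    return keep / params.framerate
```

Remember to shorten ref_text to match whatever speech survives the cut, since the transcript should describe exactly what is in the clip.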

Garbled / noisy output

An RTX 5090 (Blackwell) user reported audio corruption in Issue #155 that persisted even with --no-asr. As of this writing the root cause is still under investigation upstream. The report came from a CUDA/Blackwell card and has not been tied to RDNA3, so treat it as a general OmniVoice robustness note rather than an AMD-specific bug. The most consistently reported workaround is to pass ref_text explicitly (as in the quick-start snippet above) rather than relying on auto-transcription, which removes the ASR step as a failure source.

Don't reach for flash-attn, xformers, or an FP8 build

Guides written for NVIDIA frequently suggest installing a flash-attn wheel, pip install xformers, or loading an FP8 checkpoint to save memory. On RDNA3 these are the wrong path: there is no flash-attn wheel for gfx1100 (OmniVoice already falls back to PyTorch SDPA when none is present — exactly what you want here), the ROCm xformers fork is limited, and RDNA3 has no FP8 hardware so an FP8 checkpoint upcasts to BF16 with no memory win. Stick with the plain pip install omnivoice + ROCm PyTorch wheel above and the BF16 dtype in the snippet.