How much VRAM does OmniVoice need?

About 4 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

OmniVoice on RX 7800 XT: Zero-Shot Voice Cloning Across 646 Languages on ROCm (BF16)

What You'll Build

A local zero-shot text-to-speech setup on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) running through the ROCm stack. It clones any voice from a 3-5 second reference clip and speaks it back across the 646 languages the model card tags (the HuggingFace card lists 646 language codes and describes the model as a "massively multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages"). The model is k2-fsa's OmniVoice, a Qwen3-0.6B-Base-based finetune wired into a diffusion-language-model-style TTS architecture with a discrete audio tokenizer.

OmniVoice is pure PyTorch — there is no flash-attn dependency and no custom CUDA kernel to compile, so it ports to AMD cleanly. With no flash-attn kernel present, PyTorch falls back to its scaled-dot-product attention (SDPA) — exactly the attention path the model takes on ROCm. At ~4 GB, OmniVoice is a small model for this 16 GB card: the working envelope leaves ~12 GB of headroom, enough to stack a second model (an ASR for live transcription, a small LLM for chat) in the same process.

Hardware data: RX 7800 XT (16 GB VRAM) · BF16 on ROCm · ~4 GB working envelope · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack, so there is no cu12x wheel, no flash-attn wheel, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only, so an FP8 checkpoint would just upcast to BF16 with no memory saving, and at this model's ~4 GB envelope you don't need quantization anyway. The attention path is PyTorch SDPA, which is what OmniVoice uses when no flash-attn kernel is available. Under ROCm/HIP, PyTorch still exposes the GPU through the cuda device namespace (torch.cuda.is_available() returns True, and device_map="cuda:0" targets the AMD card); that is expected and does not indicate a CUDA install.

ℹ️ VRAM envelope, not a measured peak. The upstream k2-fsa card doesn't publish a VRAM number, and we have no first-party RX 7800 XT benchmark yet. The ~4 GB figure is the model's working default at BF16/FP16: the two weight files total ~3.26 GB (model.safetensors 2.45 GB + audio_tokenizer/model.safetensors 806 MB), and inference activations for this 0.6B-class model normally add little on top (see Troubleshooting for a reported transient spike on long reference clips). On the 7800 XT's 16 GB that leaves ~12 GB free. Once a measurement lands at /check/omnivoice/rx-7800-xt we'll replace the envelope with the measured peak; contribute one if you run it.

Requirements

Component	Minimum	Tested
GPU	4 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB, RDNA3 / gfx1101)
RAM	8 GB system RAM	—
Storage	~3.3 GB total (`model.safetensors` 2.45 GB + audio tokenizer 806 MB + tokenizer JSON)	—
Driver	AMD ROCm on Linux	—
Python	3.10 or newer	—
Reference audio	3-5 s WAV, mono	—

Model weight totals come from the HuggingFace Files tree — model.safetensors is 2.45 GB (2,450,344,112 bytes) and audio_tokenizer/model.safetensors is 806 MB (805,665,628 bytes), with the remainder split between the tokenizer JSON and the chat template. The package is released under the Apache-2.0 License and the weights are not gated on Hugging Face — no access request or login is required. Loading at BF16/FP16 (the snippet below passes dtype) keeps the resident footprint near the ~4 GB working envelope.
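The envelope arithmetic can be checked directly from the byte counts quoted above. This is a small worked sketch: the two byte counts and the 4 GB and 16 GB figures come from this recipe, while the variable names are illustrative only.

```python
# Byte counts as quoted in this recipe from the HuggingFace Files tree.
MODEL_BYTES = 2_450_344_112            # model.safetensors
AUDIO_TOKENIZER_BYTES = 805_665_628    # audio_tokenizer/model.safetensors

CARD_VRAM_GB = 16          # RX 7800 XT
WORKING_ENVELOPE_GB = 4    # this recipe's estimate, not a measured peak

weights_bytes = MODEL_BYTES + AUDIO_TOKENIZER_BYTES
weights_gb = weights_bytes / 1e9       # decimal GB, the unit the file sizes are quoted in
weights_gib = weights_bytes / 2**30    # binary GiB, for tools that report in binary units

print(f"weights on disk: {weights_gb:.2f} GB ({weights_gib:.2f} GiB)")
print(f"envelope slack over weights: {WORKING_ENVELOPE_GB - weights_gb:.2f} GB")
print(f"headroom on the card: {CARD_VRAM_GB - WORKING_ENVELOPE_GB} GB")
```

The slack between the weights and the 4 GB envelope is what is left for activations, the tokenizer, and allocator overhead in a typical short-clip run.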

Installation

1. Create a clean Python env

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel — the ROCm equivalent of the cu12x wheel the upstream README pins for NVIDIA. Select the ROCm option from the PyTorch "Get Started" selector and use the generated command, which is of the form:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in that URL moves over time (6.3 → 6.4 → 7.x) as new ROCm releases land — the value above reflects the stable option visible on the PyTorch selector at the time of writing. Always read the current line from the live PyTorch "Get Started" page and match the wheel's ROCm version to the ROCm stack you have installed on the host. AMD also publishes its own Radeon-tuned wheels at repo.radeon.com if you prefer the vendor build. There is also a separate experimental RDNA3-specific nightly index — pip install --pre torch torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ — whose gfx110X-all name covers the whole RDNA3 family (gfx1100/gfx1101/gfx1102), so it is the right experimental index for the 7800 XT's gfx1101 if you ever need it. On officially-supported Linux you do not — the stable wheel above is the canonical path.
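One way to sanity-check the "match the wheel to the host" advice is to compare the ROCm tag embedded in the wheel's version string with the host's ROCm version. The sketch below is a hypothetical helper, not part of PyTorch or ROCm: it assumes the wheel version carries a +rocmX.Y suffix, that you supply the host version string yourself, and that comparing major.minor is a strict enough reading of "match". The example version strings are illustrative.

```python
import re

def wheel_rocm_version(torch_version: str):
    """Return (major, minor) parsed from a version like '2.7.1+rocm6.3', else None."""
    match = re.search(r"\+rocm(\d+)\.(\d+)", torch_version)
    return (int(match.group(1)), int(match.group(2))) if match else None

def matches_host(torch_version: str, host_rocm: str) -> bool:
    """True when the wheel's ROCm major.minor equals the host's major.minor."""
    wheel = wheel_rocm_version(torch_version)
    host = tuple(int(part) for part in host_rocm.split(".")[:2])
    return wheel is not None and wheel == host

# Illustrative inputs: a ROCm 6.3 wheel checked against two host stacks.
print(matches_host("2.7.1+rocm6.3", "6.3.2"))
print(matches_host("2.7.1+rocm6.3", "6.4.0"))
```

In practice you would pass torch.__version__ as the first argument after installing the wheel.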

3. Install OmniVoice

pip install omnivoice

PyPI ships the canonical omnivoice package (Apache-2.0). It is pure PyTorch — it pulls in no flash-attn wheel and builds no custom CUDA/HIP kernel, so this pip install works the same on ROCm as on CUDA. The first inference call downloads weights into your HuggingFace cache from k2-fsa/OmniVoice.

4. Prepare a reference clip

Pick a 3-5 second mono WAV of the voice you want to clone and write down what's being said. Save the audio as ref.wav in your working directory. Always provide the transcript explicitly — see Troubleshooting for why auto-transcription is risky right now.
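Before running inference it can help to confirm the clip really is mono and 3-5 seconds long. The following is a minimal sketch using only the standard-library wave module; inspect_reference is a hypothetical helper name, and the module reads plain PCM WAV files only, so a float or compressed WAV would need converting first.

```python
import os
import wave

def inspect_reference(path: str, min_s: float = 3.0, max_s: float = 5.0) -> dict:
    """Report channels, sample rate, duration, and whether the clip fits the recipe."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return {
        "channels": channels,
        "sample_rate": rate,
        "duration_s": duration,
        "ok": channels == 1 and min_s <= duration <= max_s,
    }

if __name__ == "__main__" and os.path.exists("ref.wav"):
    print(inspect_reference("ref.wav"))
```

If "ok" comes back False, re-export the clip as mono or trim it before moving on.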

Running

Save this as tts.py next to your ref.wav:

from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

sf.write("out.wav", audio[0], 24000)

This is the canonical snippet from the upstream model card and GitHub README, with one ROCm-appropriate change: dtype=torch.bfloat16 instead of the card's float16. RDNA3 has native BF16 in its WMMA units, BF16 has a wider exponent range than FP16 (less prone to overflow), and there is no reason to chase the smaller-but-narrower format here — FP16 works too if you prefer. device_map="cuda:0" is correct on AMD: under ROCm/HIP, PyTorch routes the cuda device namespace to the Radeon card, so this targets the 7800 XT. Run it:

python tts.py

You should see weights resolve from the cache, then a short delay before out.wav (24 kHz mono) lands in your working directory.
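The BF16-over-FP16 choice in the snippet comes down to exponent width, which a short calculation makes concrete. This sketch derives the largest finite value of each format from its field widths (FP16: 5 exponent bits and 10 mantissa bits; BF16: 8 and 7); largest_finite is an illustrative helper, not a PyTorch call.

```python
def largest_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent encodes inf/NaN, so the top usable code is 2**bits - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent

FP16_MAX = largest_finite(exponent_bits=5, mantissa_bits=10)
BF16_MAX = largest_finite(exponent_bits=8, mantissa_bits=7)

print(f"FP16 max: {FP16_MAX:.0f}, relative step: {2 ** -10}")
print(f"BF16 max: {BF16_MAX:.3e}, relative step: {2 ** -7}")
```

The trade-off runs both ways: BF16 is far harder to overflow, but its 7 mantissa bits give a coarser step than FP16's 10, which is why FP16 remains a reasonable alternative if you prefer it.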

Spending the headroom — colocating a second model

Because OmniVoice's working set is ~4 GB and the RX 7800 XT has 16 GB, the genuinely card-specific story isn't "does it fit" (it fits on a 4 GB card) — it's what to do with the ~12 GB of spare VRAM. Concrete next steps:

Live transcribe-then-clone. Keep a Whisper-class ASR resident on the same card to transcribe the reference clip on the fly, then feed its text into OmniVoice's ref_text. The two models share the 16 GB comfortably.
Chat-to-speech. Load a 7-8B LLM (Ollama on ROCm is the cleanest "just works" surface on RDNA3) alongside OmniVoice and pipe generated text straight into TTS — a Q4_K_M-quant 8B (~5-6 GB) plus OmniVoice's ~4 GB still leaves room on the 16 GB card.
Batch multi-speaker / longform. The 7800 XT's headroom lets you run several short speaker contexts without offloading.

When you stack models, watch rocm-smi and keep OmniVoice's reference clips short (see Troubleshooting). A transient spike on a long clip has been reported to reach ~8 GB, which leaves less margin for a colocated model on 16 GB than it would on a 24 GB card.
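The colocation budget can be worked through with the figures already quoted in this section. This is a rough sketch only: the 4 GB envelope, the ~8 GB reported spike, and the 5-6 GB quantized LLM are this recipe's numbers, margin_gb is a hypothetical helper, and the sum ignores KV cache growth, allocator fragmentation, and any VRAM the desktop itself holds.

```python
CARD_GB = 16.0  # RX 7800 XT

def margin_gb(*residents_gb: float, card_gb: float = CARD_GB) -> float:
    """VRAM left after summing the given resident sets (all in GB)."""
    return card_gb - sum(residents_gb)

OMNIVOICE_STEADY = 4.0   # working envelope from this recipe
OMNIVOICE_SPIKE = 8.0    # community-reported transient on long reference clips
LLM_Q4_8B = 6.0          # upper end of the ~5-6 GB range quoted above

print("steady state:", margin_gb(OMNIVOICE_STEADY, LLM_Q4_8B), "GB free")
print("during spike:", margin_gb(OMNIVOICE_SPIKE, LLM_Q4_8B), "GB free")
print("spike, 24 GB card:", margin_gb(OMNIVOICE_SPIKE, LLM_Q4_8B, card_gb=24.0), "GB free")
```

The spike case is the one to plan around, since that is when a colocated model's own peak is most likely to collide with OmniVoice's.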

Results

Speed: No RX 7800 XT OmniVoice benchmark is in our database yet (/check/omnivoice/rx-7800-xt is currently unknown), and we found no verifiable RX 7800 XT measurement elsewhere, so the Speed figure is omitted rather than borrowed from a different GPU or vendor. Upstream's RTF claim ("RTF as low as 0.025") names no GPU, so it is not quoted here as this card's speed. If you've measured OmniVoice generation time on a 7800 XT, please contribute it so it lands on /check/omnivoice/rx-7800-xt.
VRAM usage: Working envelope ~4 GB at BF16/FP16. The two weight files total ~3.26 GB (HF Files tree), and this 0.6B-class model's inference activations normally add little. On the 7800 XT's 16 GB a single-model run has ample room, even with the reported ~8 GB transient spike on a long reference clip described in Troubleshooting. See /check/omnivoice/rx-7800-xt for the measured peak once it's seeded.
Quality notes: OmniVoice advertises 646 languages, but coverage is uneven across them — a community user has flagged that cross-lingual transfer (cloning a voice from one language into another) is imperfect, with audible accent leakage (see HF Discussion #22). English and Chinese are the best-supported. Always pass ref_text explicitly rather than relying on auto-transcription. There is no quantization tradeoff to weigh on this card — run the native BF16/FP16 weights.

For the full benchmark data, see /check/omnivoice/rx-7800-xt.
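If you want to contribute a speed figure, real-time factor (RTF) is wall-clock generation time divided by the duration of the audio produced, so lower is faster. The sketch below assumes the 24 kHz output rate from the snippet above; real_time_factor and timed are hypothetical helpers, not part of the omnivoice package.

```python
import time

def real_time_factor(generation_seconds: float, num_samples: int,
                     sample_rate: int = 24000) -> float:
    """RTF = wall-clock generation time / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return generation_seconds / audio_seconds

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Worked example: 10 s of 24 kHz audio generated in 0.25 s.
print(real_time_factor(0.25, 240_000))
```

In tts.py this would wrap the model.generate call, taking the sample count from the returned audio on the assumption that audio[0] is a one-dimensional array of samples, as the sf.write call suggests. Discard the first call as warm-up, and report the clip length and dtype alongside the number.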

Troubleshooting

"Torch not compiled with CUDA enabled"

This error means the installed PyTorch build has no GPU backend compiled in, typically a CPU-only wheel rather than the ROCm build. Uninstall it and reinstall from the ROCm wheel index in step 2:

pip uninstall torch torchaudio
pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (under HIP, ROCm masquerades as the cuda device namespace — that is why the device_map="cuda:0" snippet above is correct on an AMD card).
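The same check can be scripted so it names the build type explicitly. This is a minimal sketch: torch.version.hip and torch.version.cuda are real PyTorch attributes (each is None when the build lacks that backend), while classify_build and report are illustrative helper names.

```python
def classify_build(hip_version, cuda_version) -> str:
    """Name the backend a PyTorch build was compiled for."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu-only"

def report() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    build = classify_build(getattr(torch.version, "hip", None), torch.version.cuda)
    print("torch", torch.__version__, "->", build, "build")
    if torch.cuda.is_available():
        # Under ROCm the Radeon card is listed through the cuda namespace.
        print("device 0:", torch.cuda.get_device_name(0))
    else:
        print("no GPU visible through the cuda namespace")

if __name__ == "__main__":
    report()
```

A "rocm" build with no visible device points at the host driver or permissions rather than the wheel; a "cpu-only" or "cuda" build means the reinstall above is the fix.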

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101 (a "no kernel image is available" / missing-gfx1101-kernel error). The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python tts.py

This is a legacy fallback, not a default — OmniVoice is pure PyTorch on the stable ROCm wheel and runs natively on gfx1101 without it. Only reach for it if a specific dependency refuses to load on the 7800 XT.
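If you launch tts.py from another Python program rather than a shell, the same override can be passed through the child process environment. This is a hedged sketch: env_with_gfx_override is a hypothetical helper, and it leaves any value you have already exported untouched.

```python
import os

def env_with_gfx_override(base_env=None, version: str = "11.0.0") -> dict:
    """Copy an environment and add HSA_OVERRIDE_GFX_VERSION unless it is already set."""
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("HSA_OVERRIDE_GFX_VERSION", version)
    return env

# Intended use (not executed here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "tts.py"], env=env_with_gfx_override(), check=True)
print(env_with_gfx_override({"PATH": "/usr/bin"})["HSA_OVERRIDE_GFX_VERSION"])
```

Setting the variable on the child process keeps the override scoped to the one run that needs it instead of your whole session.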

VRAM spikes / OOM with a long reference clip

The most likely VRAM-related issue is a long reference clip inflating the working set. A community user reports that the resident set can transiently spike to ~8 GB on reference samples longer than ~4 seconds, even after capping the budget and clearing the cache between inferences (see HF Discussion #20). On the 7800 XT's 16 GB a single-model run absorbs that spike with room to spare, but if you're colocating models per the section above, keep reference clips at the short end of the 3-5 s range (about 4 s or less) so a spike doesn't collide with a colocated model's peak. Monitor with rocm-smi. The same thread notes that offloading parts of the model to CPU RAM can cut resident VRAM further if you need it.
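When a reference recording runs long, trimming it is the simplest way to stay under the length where the spike was reported. The sketch below uses only the standard-library wave module and assumes a plain PCM WAV; trim_wav is a hypothetical helper, and the 4-second default mirrors the threshold mentioned in the community report.

```python
import wave

def trim_wav(src: str, dst: str, max_seconds: float = 4.0) -> float:
    """Copy at most max_seconds from the start of src to dst; return dst's duration."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)    # the frame count in the header is fixed up on close
        writer.writeframes(frames)
    return keep / params.framerate
```

For example, trim_wav("long_take.wav", "ref.wav") keeps the first four seconds of a hypothetical longer recording. Remember to shorten ref_text so it matches only the words that survive the cut.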

Garbled / noisy output

A 5090 (Blackwell) user reported audio corruption in Issue #155, persisting even with --no-asr. As of this writing the root cause is still under investigation upstream, and it has not been tied to RDNA3 (it was reported on a CUDA/Blackwell card, not an AMD one). The most consistent reported workaround is to pass ref_text explicitly (per the quick-start snippet above) rather than relying on the auto-transcription path, which removes the ASR step as a failure source. This is a general OmniVoice robustness note, not an RDNA3-specific one.

Don't reach for flash-attn, xformers, or an FP8 build

Guides written for NVIDIA frequently suggest installing a flash-attn wheel, pip install xformers, or loading an FP8 checkpoint to save memory. On RDNA3 these are the wrong path: there is no flash-attn wheel for gfx1101 (OmniVoice already falls back to PyTorch SDPA when none is present — exactly what you want here), the ROCm xformers fork is limited, and RDNA3 has no FP8 hardware so an FP8 checkpoint upcasts to BF16 with no memory win. Stick with the plain pip install omnivoice + ROCm PyTorch wheel above and the BF16 dtype in the snippet.