How much VRAM does VoxCPM need?

About 5 GB peak at inference. The recipe's minimum supported card is 6 GB of VRAM.

How hard is this setup?

Beginner. Follow the steps below.

VoxCPM-0.5B on RX 7800 XT: Zero-Shot Voice Cloning TTS on ROCm (BF16)

What You'll Build

A local zero-shot voice-cloning text-to-speech setup running VoxCPM-0.5B on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Clone any voice from a few seconds of reference audio and synthesize natural, context-aware speech — all offline, no API calls. VoxCPM-0.5B is a pure-PyTorch model, so it runs through ROCm's standard PyTorch path with PyTorch SDPA attention and the native BF16/FP16 weights; at ~5 GB peak it is never close to memory-bound on a 16 GB card.

Hardware data: RX 7800 XT (16 GB VRAM) · BF16 · PyTorch on ROCm 7.2 · Benchmark data: /check/voxcpm/rx-7800-xt

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no flash-attn build, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16/FP16 with no memory saving — and at 0.5B / ~5 GB you don't need quantization at all. VoxCPM is pure PyTorch, so attention runs through PyTorch's scaled-dot-product attention (SDPA) — do not install or build FlashAttention for this. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

ℹ️ This recipe targets openbmb/VoxCPM-0.5B (the original Sept-2025 release), not the newer VoxCPM1.5 or VoxCPM2. OpenBMB now ships several checkpoints under the same voxcpm package, and the upstream GitHub README's examples now default to VoxCPM2 — VoxCPM-0.5B is listed there as "Legacy." So always pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained to stay on the variant this recipe documents. The "0.5B" is the parameter count, not the VRAM cost — at inference it uses ~5 GB, leaving most of the 7800 XT's 16 GB free. (VoxCPM2's vllm serve --omni OpenAI-compatible server is a VoxCPM2-only path and is out of scope here; the legacy 0.5B uses the plain Python API below.)

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	8 GB system	—
Storage	2 GB (model weights)	~1.6 GB
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.10+, PyTorch (ROCm 7.2 build)	—

The checkpoint on the canonical HuggingFace card is two files — pytorch_model.bin (~1.3 GB) plus the audio VAE audiovae.pth (~301 MB), about 1.6 GB on disk total — so 2 GB of free storage covers it. VoxCPM-0.5B is released under the Apache-2.0 license (commercial use permitted) and the weights are not gated on Hugging Face — no access request or login is required to download them.

Installation

1. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Install PyTorch from the ROCm wheel index (this is the AMD equivalent of the CUDA wheel — same package names, different index URL):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in the index URL moves over time (6.3 → 6.4 → 7.x). Read the current stable ROCm line from the live PyTorch "Get Started Locally" selector (choose Stable · Linux · Pip · Python · ROCm) before running, and use whatever whl/rocmX.Y tag it prints. Match the wheel's ROCm version to the ROCm runtime you have installed on the system.
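As a rough aid for matching the wheel tag to the installed runtime, the sketch below derives a whl/rocmX.Y index URL from a ROCm version string. The helper names are hypothetical, and the /opt/rocm/.info/version path is an assumption about a default ROCm install. PyTorch does not necessarily publish wheels for every ROCm X.Y, so treat the printed URL as a candidate to check against the official selector, not as a guarantee.

```python
import re
from pathlib import Path

PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def rocm_index_url(rocm_version: str) -> str:
    """Map a ROCm version string such as '7.2.1-43' to a wheel index URL."""
    match = re.match(r"(\d+)\.(\d+)", rocm_version.strip())
    if match is None:
        raise ValueError(f"unrecognised ROCm version: {rocm_version!r}")
    major, minor = match.groups()
    return f"{PYTORCH_WHEEL_BASE}/rocm{major}.{minor}"


def installed_rocm_version(info_file: Path = Path("/opt/rocm/.info/version")):
    """Return the installed ROCm version string, or None if the file is absent.

    The default path is an assumption; adjust it for non-default installs.
    """
    try:
        return info_file.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    version = installed_rocm_version()
    if version is None:
        print("No ROCm version file found; check your install location.")
    else:
        print(f"ROCm {version}: candidate index {rocm_index_url(version)}")
```

If the printed tag differs from the one the PyTorch selector shows, prefer the selector's tag and confirm it is compatible with your runtime.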

Confirm the installed build is the ROCm one and the GPU is visible:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

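The version string should carry a +rocm7.2-style suffix, and torch.cuda.is_available() should return True. ROCm builds expose the GPU through the torch.cuda namespace via HIP; there is no separate torch.rocm API. On a ROCm build, torch.version.hip is also set to a version string rather than None.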
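For a slightly more informative check than the one-liner, the following sketch classifies the installed build from torch.version.hip and torch.version.cuda and prints the device name. The function names are illustrative, not part of PyTorch or VoxCPM.

```python
from typing import Optional


def classify_torch_build(hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build as 'rocm', 'cuda', or 'cpu' from its version fields."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def main() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return

    kind = classify_torch_build(
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
    print(f"torch {torch.__version__} ({kind} build)")
    if torch.cuda.is_available():
        # On ROCm this reports the Radeon card through the cuda namespace.
        print("device:", torch.cuda.get_device_name(0))
    else:
        print("no GPU visible to PyTorch")


if __name__ == "__main__":
    main()
```

A "rocm" build that reports no visible GPU usually points at the system ROCm runtime or device permissions instead of the wheel.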

2. Install VoxCPM

The canonical HuggingFace card and the official OpenBMB/VoxCPM repo both install the package via PyPI. Install it after PyTorch so it picks up your ROCm Torch build rather than pulling a CUDA one:

pip install voxcpm

This pulls in the voxcpm package and its remaining dependencies. To install from source instead (needed for the Gradio web demo):

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

3. Download the model weights

The weights download automatically on first run, or you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull the newer VoxCPM2 variant:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="openbmb/VoxCPM-0.5B", local_dir="./voxcpm-0.5b")

About 1.6 GB lands in ./voxcpm-0.5b (pytorch_model.bin + audiovae.pth).
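To confirm the pre-fetch completed, a small standard-library sketch like the one below can list what landed in the directory and flag missing checkpoint files. The two expected file names come from the model card description above; the helper names are hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = ("pytorch_model.bin", "audiovae.pth")


def file_sizes(model_dir) -> dict:
    """Return {relative_path: size_in_bytes} for every file under model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return {}
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def missing_files(sizes: dict) -> list:
    """List expected checkpoint files that are absent from the size map."""
    return [name for name in EXPECTED_FILES if name not in sizes]


def total_gb(sizes: dict) -> float:
    """Total size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return sum(sizes.values()) / 1e9


if __name__ == "__main__":
    sizes = file_sizes("./voxcpm-0.5b")
    print(f"{len(sizes)} files, {total_gb(sizes):.2f} GB on disk")
    for name in missing_files(sizes):
        print("missing:", name)
```

The total includes any small metadata files the downloader writes next to the weights, so expect it to land slightly above the sum of the two checkpoint files.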

Running

The Python API follows the canonical VoxCPM-0.5B model card — load the checkpoint and synthesize a clip with a cloned voice. Nothing here is AMD-specific: VoxCPM runs the same PyTorch code on ROCm as on CUDA, because PyTorch maps the cuda device onto your Radeon GPU under HIP:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="Hello! This is a zero-shot voice cloning demo.",
    prompt_wav_path="reference.wav",   # a few seconds of the target voice
    prompt_text="This is the reference transcript.",
    cfg_value=2.0,                     # LM guidance — higher tracks the prompt more closely
    inference_timesteps=10,            # higher = better quality, lower = faster
    normalize=True,                    # enable external text-normalization
    denoise=True,                      # enable external denoiser for the reference clip
    retry_badcase=True,                # auto re-run unstable generations
)

sf.write("output.wav", wav, 16000)

VoxCPM-0.5B emits 16 kHz audio, which is why the example passes 16000 as the sample rate (the model card does the same). A command-line entry point is also available (voxcpm --help, or python -m voxcpm.cli --help). The first run downloads the weights; subsequent runs load from cache. The synthesized clip is written to output.wav.

Results

Speed: No first-party RX 7800 XT benchmark exists yet, so no speed figure is quoted. For reference, the canonical model card reports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090, but that is a different vendor and a different stack (CUDA, not ROCm), and the number does not transfer to this card. VoxCPM is built for real-time use (streaming synthesis is designed to run faster than playback, i.e. RTF below 1), but the exact RTF on the 7800 XT needs a first-party measurement. If you have measured it, please contribute the result so it lands on /check/voxcpm/rx-7800-xt.
VRAM usage: ~5 GB peak at inference, well within the 7800 XT's 16 GB. The ~1.6 GB of on-disk weights (model plus audio VAE) share that footprint with activations and runtime overhead, leaving roughly 11 GB of the card free. There is no quantization tradeoff to consider on this card: run the native BF16/FP16 weights.
Quality notes: VoxCPM produces context-aware prosody — it infers emphasis and intonation from sentence structure (per the model card, "context-aware speech generation"). Reference audio of 3–10 seconds works best for cloning. Bilingual: Chinese and English. Apache-2.0 license — commercial use permitted.
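For anyone measuring their own number, RTF is simply synthesis time divided by the duration of the audio produced. The sketch below shows that arithmetic with hypothetical helper names; it assumes generate returns a one-dimensional array of samples at 16 kHz, as in the Running example.

```python
import time

SAMPLE_RATE = 16_000  # VoxCPM-0.5B output rate, per the model card


def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio."""
    audio_s = num_samples / sample_rate
    if audio_s <= 0:
        raise ValueError("no audio was generated")
    return elapsed_s / audio_s


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage sketch (requires a loaded model, so it is left commented out):
# wav, elapsed = timed(model.generate, text="Hello!", inference_timesteps=10)
# print(f"RTF: {real_time_factor(elapsed, len(wav)):.2f}")
```

Discard the first call when benchmarking, since it includes one-time warm-up, and report the inference_timesteps value alongside the RTF because it directly affects speed.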

For the full benchmark data and other-GPU comparisons, see /check/voxcpm/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled" / GPU not used

This usually means a non-ROCm build of PyTorch is installed: either a CPU-only wheel, or a default wheel that pip install voxcpm pulled in as a dependency. Uninstall it and reinstall PyTorch from the ROCm wheel index; voxcpm itself can stay installed:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (HIP exposes the GPU through the cuda namespace).

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python your_script.py

This is a legacy fallback, not a default: VoxCPM is pure PyTorch and runs natively on gfx1101 through the stable ROCm wheel without it. Only reach for it if a specific dependency fails with a missing-gfx1101-kernel error, which typically surfaces as "HIP error: invalid device function" or "no kernel image is available".
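If the override is needed for one script only, it can be scoped to a child process so it does not leak into the rest of the shell session. This is a minimal sketch with hypothetical helper names; the variable has to be in the environment before the HIP runtime initializes, which is why it is passed to a fresh interpreter.

```python
import os
import subprocess
import sys


def env_with_gfx_override(base_env=None, version: str = "11.0.0") -> dict:
    """Copy an environment and add HSA_OVERRIDE_GFX_VERSION if it is not already set."""
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("HSA_OVERRIDE_GFX_VERSION", version)
    return env


def run_with_override(script_path: str) -> int:
    """Run a Python script in a child process with the gfx1100 mask applied."""
    completed = subprocess.run([sys.executable, script_path], env=env_with_gfx_override())
    return completed.returncode


# Usage sketch: run_with_override("your_script.py")
```

Using setdefault keeps any value the user has already exported, so the helper never silently replaces an explicit setting.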

Do not install xformers or FlashAttention

HuggingFace and TTS guides written for NVIDIA frequently suggest pip install xformers or building a flash-attn wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and upstream FlashAttention's C++/CK build does not reliably compile on gfx1101. VoxCPM is pure PyTorch and already routes attention through PyTorch SDPA on this stack — leave it on the default and install nothing extra for attention.

Out of memory on a smaller card

VoxCPM-0.5B needs ~5 GB. The 16 GB 7800 XT has ample headroom — OOM here almost always means another process is holding VRAM (check rocm-smi). On smaller 6–8 GB ROCm cards, close other GPU apps first.
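To see how much VRAM is actually free before loading the model, a short check along these lines can help. It is a sketch: the ~5 GB requirement is the figure from this recipe treated as approximate, the helper name is hypothetical, and torch.cuda.mem_get_info reports free and total bytes for the current device.

```python
GIB = 1024 ** 3


def headroom_gib(free_bytes: int, needed_gib: float = 5.0) -> float:
    """Free VRAM minus the approximate requirement, in GiB (negative means too little)."""
    return free_bytes / GIB - needed_gib


def main() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("no GPU visible to PyTorch")
        return

    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free {free_bytes / GIB:.1f} GiB of {total_bytes / GIB:.1f} GiB")
    if headroom_gib(free_bytes) < 0:
        print("Less than ~5 GiB free: another process is probably holding VRAM.")


if __name__ == "__main__":
    main()
```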

First run is slow / downloads weights

The first from_pretrained / generate call downloads ~1.6 GB of weights from HuggingFace. Subsequent runs use the local cache. Pre-fetch with snapshot_download (see Installation) to control timing.

Voice clone quality is poor

Use clean reference audio (3–10 s, minimal background noise). Provide an accurate prompt_text transcript — mismatched transcripts degrade prosody. Lower cfg_value (toward 1.5) if the voice sounds strained; raise it for tighter text adherence. The API's retry_badcase=True mode (documented on the model card) re-runs unstable clips automatically.
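A quick way to catch clips that are too short or too long is to check their duration before synthesis. The sketch below uses only the standard-library wave module, so it assumes an uncompressed PCM WAV file; the function names are hypothetical and the 3 to 10 second window is the guidance from this recipe.

```python
import wave


def wav_duration_s(path) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_length_ok(path, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the recommended cloning window."""
    return min_s <= wav_duration_s(path) <= max_s


# Usage sketch:
# if not reference_length_ok("reference.wav"):
#     print("Trim or re-record the reference clip to 3-10 seconds.")
```

Duration is only a first filter; background noise and transcript accuracy still need a listen and a read-through.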

Accidentally pulled VoxCPM2 weights

The voxcpm package's newer examples default from_pretrained to openbmb/VoxCPM2. On a 16 GB card the symptom is a change in behavior, not an out-of-memory error: VoxCPM2 emits 48 kHz audio across 30 languages, while VoxCPM-0.5B emits 16 kHz across Chinese and English. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.
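One way to make the pin hard to forget is to keep the repo id and its sample rate together in one place and fail loudly on anything else. This is an illustrative sketch with hypothetical names; the two sample rates are the ones stated in this recipe.

```python
LEGACY_REPO_ID = "openbmb/VoxCPM-0.5B"

# Output sample rates as described in this recipe.
SAMPLE_RATES = {
    "openbmb/VoxCPM-0.5B": 16_000,
    "openbmb/VoxCPM2": 48_000,
}


def require_legacy(repo_id: str) -> str:
    """Return repo_id unchanged, or raise if it is not the variant this recipe covers."""
    if repo_id != LEGACY_REPO_ID:
        raise ValueError(f"expected {LEGACY_REPO_ID!r}, got {repo_id!r}")
    return repo_id


def sample_rate_for(repo_id: str) -> int:
    """Look up the output sample rate to pass when writing audio to disk."""
    return SAMPLE_RATES[repo_id]


# Usage sketch:
# model = VoxCPM.from_pretrained(require_legacy(LEGACY_REPO_ID))
# sf.write("output.wav", wav, sample_rate_for(LEGACY_REPO_ID))
```

Writing 48 kHz audio with a 16000 header (or the reverse) produces slowed-down or sped-up playback, so tying the rate to the repo id avoids that mismatch.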
