How much VRAM does VoxCPM2 need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

VoxCPM2 on RX 7800 XT: 30-Language 48kHz Voice Cloning on ROCm (BF16)

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter, tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone — running on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description alone (gender, age, tone, emotion, pace). VoxCPM2 is the successor to VoxCPM (the original 0.5B model): the v2 jump moves to 2B parameters, upgrades audio output from 16 kHz to 48 kHz, and expands language coverage to 30 languages including several Chinese dialects.

The voxcpm package is pure PyTorch: it does not depend on FlashAttention, custom CUDA kernels, or transformers.pipeline, so it runs on whatever PyTorch build is installed. On this card that means the ROCm PyTorch wheel and PyTorch's scaled-dot-product attention (SDPA). At ~8 GB the model uses about half of the 7800 XT's 16 GB, so the native BF16 weights run unquantized with roughly 8 GB of headroom, and no GGUF conversion or offloading is needed.

Hardware data: RX 7800 XT (16 GB VRAM) · BF16 · voxcpm on ROCm 7.2 · ~8 GB VRAM per the official model card · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no flash-attn install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16/FP16 with no memory saving — and at ~8 GB on a 16 GB card you don't need it anyway. VoxCPM2's official quickstart shows no attn_implementation / FlashAttention step at all; the model runs on PyTorch SDPA out of the box. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (ROCm-supported AMD card) — model card lists `VRAM: ~8 GB` in the Model Details table	RX 7800 XT (RDNA3, Navi 32, gfx1101, 16 GB)
RAM	8 GB	—
Storage	~5 GB (`model.safetensors` 4.58 GB + `audiovae.pth` 377 MB, per the HF Files tree)	—
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0 (ROCm build) (source)	—

VoxCPM2 is released under the Apache-2.0 license — free for commercial use, per the model card License section. The weights are not gated; no access request or login is required to download them.

Installation

1. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Install PyTorch from the ROCm wheel index before installing voxcpm, so that pip doesn't pull a default CUDA build:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the stable ROCm PyTorch wheel is tagged rocm7.2 — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live PyTorch "Get Started" selector (or the ComfyUI README "AMD GPUs (Linux)" section) before running. AMD also publishes its own Radeon-recommended wheels at repo.radeon.com if you prefer the vendor build. There is a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) whose gfx110X-all name covers the whole RDNA3 family (gfx1100/gfx1101/gfx1102) — you do not need it for the officially-supported 7800 XT; the stable whl/rocm7.2 wheel above is the canonical path.

Confirm you got the ROCm build, not a CUDA one:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

The version string should carry a +rocm7.2-style suffix, and torch.cuda.is_available() should return True. ROCm builds of PyTorch expose AMD GPUs through the torch.cuda namespace via HIP, so the device name in your code stays "cuda".
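The suffix check can also be scripted. The following is a minimal sketch, not part of the voxcpm package: `is_rocm_build` is a hypothetical helper that only inspects the version string, and the version numbers in the example calls are made-up placeholders.

```python
def is_rocm_build(version: str) -> bool:
    """Return True if a torch.__version__ string carries a +rocm local tag."""
    # The build tag follows '+', e.g. "<version>+rocm7.2" or "<version>+cu128".
    _, _, local_tag = version.partition("+")
    return local_tag.startswith("rocm")


# Placeholder version strings, for illustration only.
print(is_rocm_build("2.9.0+rocm7.2"))
print(is_rocm_build("2.9.0+cu128"))
```

In a real environment you would call `is_rocm_build(torch.__version__)`. ROCm builds also set `torch.version.hip` to a version string, while CUDA builds leave it as `None`, which gives a second signal if the suffix is ever missing.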

2. Install the `voxcpm` package

The canonical install is the published PyPI package, identical on both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

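Because PyTorch is already present from step 1, pip should treat the Torch requirement as satisfied and install only the voxcpm package and its remaining (non-Torch) dependencies. Re-run the check from step 1 afterwards to confirm that your ROCm build was not replaced.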

3. (Optional) Install from source for the web demo

If you want the Gradio playground, clone the repo and install in editable mode:

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

4. (Optional) Pre-download model weights

Weights download on first inference automatically, but you can pre-fetch them to control where they land — about 5 GB (model.safetensors 4.58 GB + audiovae.pth 377 MB):

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM2")
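Before fetching, it can help to confirm the target disk has room. This is a rough sketch using only the two file sizes quoted above. The helper names and the 1 GB safety margin are illustrative assumptions, not documented requirements.

```python
import shutil

# File sizes quoted in this recipe (from the HF Files tree), in GB.
MODEL_GB = 4.58      # model.safetensors
AUDIOVAE_GB = 0.377  # audiovae.pth (377 MB)


def weights_size_gb() -> float:
    """Combined size of the two weight files listed above."""
    return MODEL_GB + AUDIOVAE_GB


def has_room_for_weights(path: str = ".", margin_gb: float = 1.0) -> bool:
    """True if the filesystem holding `path` can take the weights plus a margin.

    `margin_gb` is an arbitrary illustrative buffer.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= weights_size_gb() + margin_gb


print(f"weights: {weights_size_gb():.2f} GB, room here: {has_room_for_weights()}")
```

Pass the directory where the download will land. By default the Hugging Face cache lives under `~/.cache/huggingface` unless `HF_HOME` points elsewhere.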

Running

Python — basic synthesis

The loader is taken verbatim from the VoxCPM2 model card: from voxcpm import VoxCPM, then VoxCPM.from_pretrained(...). There is no attn_implementation argument and no FlashAttention step; the model runs on PyTorch SDPA, which is the only attention path this recipe relies on for RDNA3:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)

Two notes on this snippet, both straight from the HF model card:

load_denoiser=False skips loading the optional denoiser, which saves memory and download time. Leave it at False unless you want the denoiser applied to prompt/reference audio for voice cloning.
The sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded — VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz.

Raising inference_timesteps improves quality at the cost of slower synthesis.
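To confirm that a written file really is 48 kHz, the header can be read back with the standard library. This is a small sketch: `wav_info` is a hypothetical helper, and it assumes the file was written as integer PCM, since Python's `wave` module rejects floating-point WAV data.

```python
import wave


def wav_info(path: str) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) for an integer-PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        duration = wav_file.getnframes() / rate
    return rate, duration
```

For example, `wav_info("output.wav")` after running the snippet above should report a sample rate of 48000 if the model's sample rate matches the 48 kHz figure stated here.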

Python — zero-shot voice cloning

Supply a short reference clip and the model mimics the speaker's timbre, accent, and pacing. Per the HF card, basic cloning passes reference_wav_path; "Ultimate Cloning" additionally passes prompt_wav_path plus the reference's exact transcript via prompt_text for maximum fidelity:

wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)
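The snippet above shows only basic cloning. As a sketch of how the two modes differ, the hypothetical helper below builds the keyword arguments for either one, using only the parameter names given in this section. It assumes the same clip serves as both the reference and the prompt audio. Check the model card before relying on that.

```python
from typing import Optional


def cloning_kwargs(text: str, reference_wav: str,
                   transcript: Optional[str] = None) -> dict:
    """Build keyword arguments for model.generate() in either cloning mode.

    Without a transcript: basic cloning (reference_wav_path only).
    With a transcript: also pass prompt_wav_path and prompt_text.
    Assumption: one clip is used as both reference and prompt audio.
    """
    kwargs = {"text": text, "reference_wav_path": reference_wav}
    if transcript is not None:
        kwargs["prompt_wav_path"] = reference_wav
        kwargs["prompt_text"] = transcript
    return kwargs


print(sorted(cloning_kwargs("Hello.", "speaker.wav", "Exact words in the clip.")))
```

Usage would be `wav = model.generate(**cloning_kwargs(...))`. The transcript must match the words spoken in the clip exactly, as noted above.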

Python — voice design (new in v2)

VoxCPM2 lets you describe the desired voice in natural language inside the text itself, as shown on the HF card:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("designed.wav", wav, model.tts_model.sample_rate)
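If voice descriptions come from user input or a config file, a tiny formatter keeps the parenthesised prefix consistent. This is an illustrative sketch: `design_text` is a hypothetical helper, and rejecting parentheses inside the description is an assumption made here to avoid ambiguous prompts, not a documented rule.

```python
def design_text(description: str, text: str) -> str:
    """Prefix `text` with a parenthesised voice description."""
    description = description.strip()
    if not description:
        return text  # no description: plain synthesis text
    if "(" in description or ")" in description:
        # Assumption: nested parentheses would make the prefix ambiguous.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


print(design_text("A young woman, gentle and sweet voice",
                  "Hello, welcome to VoxCPM2!"))
```

The result is passed as the `text` argument of `model.generate`, exactly like the hand-written string above.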

Streaming

The package exposes a streaming generator that yields audio chunks as they are produced, per the HF card:

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM2!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
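With streaming, the time until the first chunk arrives is usually the number of interest. The sketch below drains any chunk iterator and records that latency. `collect_stream` is a hypothetical helper, `fake_stream` is a stand-in generator and not the VoxCPM2 API, and the code assumes each chunk is a one-dimensional sequence of samples.

```python
import time
from typing import Iterable, List, Sequence, Tuple


def collect_stream(
    chunks: Iterable[Sequence[float]],
) -> Tuple[List[Sequence[float]], float, int]:
    """Drain an iterator; return (chunks, seconds_to_first_chunk, total_samples)."""
    start = time.perf_counter()
    first_latency = float("nan")  # stays NaN if the stream yields nothing
    collected: List[Sequence[float]] = []
    total_samples = 0
    for chunk in chunks:
        if not collected:
            first_latency = time.perf_counter() - start
        collected.append(chunk)
        total_samples += len(chunk)
    return collected, first_latency, total_samples


def fake_stream():
    """Stand-in generator for illustration only; NOT the VoxCPM2 API."""
    for _ in range(3):
        yield [0.0] * 4800


chunks, latency, n_samples = collect_stream(fake_stream())
print(f"{len(chunks)} chunks, {n_samples} samples, first chunk after {latency:.6f} s")
```

With the real model you would pass `model.generate_streaming(text=...)` in place of `fake_stream()` and concatenate the returned chunks as in the snippet above.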

Results

Speed: No RX 7800 XT-specific VoxCPM2 benchmark exists yet, and no first-party RTF figure for this card could be verified on the source page itself — so the Speed figure is omitted rather than transferred from a different GPU. (The model card's published Real-Time Factor of ~0.30 was measured on an NVIDIA RTX 4090, a different vendor and architecture; carrying that number to a ROCm RDNA3 card would be misleading.) If you've measured VoxCPM2 RTF on a 7800 XT, please contribute it so it lands on /check/voxcpm2/rx-7800-xt.
VRAM usage: ~8 GB per the official Model Details table. On a 16 GB RX 7800 XT that leaves roughly 8 GB free — enough to colocate a small LLM or an ASR front-end on the same card, or to raise TTS concurrency.
Quality notes: 48 kHz studio-quality output via the AudioVAE V2 path, tokenizer-free diffusion-autoregressive architecture on the MiniCPM-4 backbone, 30 supported languages plus several Chinese dialects. The route on this card is the native BF16 path through the voxcpm package — there is no quantization tradeoff to consider, and no FP8/FP4 path on RDNA3 (the hardware has no such formats; an FP8 checkpoint would only upcast). Apache-2.0 licensed — free for commercial use per the model card License section.

For the full benchmark data, see /check/voxcpm2/rx-7800-xt.
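If you want to measure the Real-Time Factor on your own card, the arithmetic is simply synthesis time divided by the duration of the audio produced. This is a minimal sketch, and `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 3 s of compute for 10 s of 48 kHz audio.
print(real_time_factor(3.0, 480_000, 48_000))
```

In practice you would time `model.generate(...)` with `time.perf_counter()`, then pass `len(wav)` and `model.tts_model.sample_rate`. Discard the first run, since it includes warm-up. An RTF below 1.0 means synthesis is faster than real time.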

Troubleshooting

"Torch not compiled with CUDA enabled" / no GPU detected

This means a CUDA build of PyTorch got installed instead of the ROCm build (easy to do if pip install voxcpm ran first and pulled a default Torch wheel). Uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build with python -c "import torch; print(torch.__version__)" — it should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True under HIP.

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack — including the stable PyTorch ROCm wheel — ships kernels for both, so voxcpm runs natively on gfx1101 with no extra flags. Occasionally a prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101, surfacing as a "no kernel image is available" / missing-gfx1101-kernel error. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python your_script.py

This is a legacy fallback, not a default — you should not need it for the pure-PyTorch voxcpm path. Only reach for it if a specific dependency complains about a missing gfx1101 kernel.
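The same override can be applied from a Python launcher instead of the shell. The sketch below sets the variable in a child process, on the assumption that it has to be in place before the ROCm runtime initialises. Both helper names are hypothetical.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional


def gfx1100_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy an environment and add the gfx1100 override from the command above."""
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
    return env


def run_with_override(script: str) -> int:
    """Run `script` in a child interpreter with the override set; return its exit code."""
    return subprocess.run([sys.executable, script], env=gfx1100_env()).returncode
```

`run_with_override("your_script.py")` is equivalent to the one-line shell command above. Running a child process avoids changing the environment of the launcher itself.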

`torch.compile` / Triton errors during warm-up

torch._dynamo.exc.Unsupported or Triton import failures on launch can occur on the ROCm Triton stack for some kernels. Disable compile optimisation when loading:

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", optimize=False)

The mainstream transformer blocks VoxCPM2 uses are generally well covered by ROCm Triton, but if warm-up fails, optimize=False falls back to eager execution. See the GitHub repo for version-pinning guidance.
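The fallback can be automated so a script keeps working on machines where compilation fails. This is a sketch under two assumptions: the failure surfaces as an exception while loading, and the `optimize` argument accepts `True` as well as `False`. `load_with_eager_fallback` is a hypothetical helper, not part of voxcpm.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_eager_fallback(loader: Callable[..., T]) -> T:
    """Call loader(optimize=True); on any exception retry with optimize=False."""
    try:
        return loader(optimize=True)
    except Exception as exc:  # broad on purpose: compile errors vary by stack
        print(f"optimized load failed ({type(exc).__name__}); retrying eagerly")
        return loader(optimize=False)
```

Usage would look like `model = load_with_eager_fallback(lambda optimize: VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False, optimize=optimize))`. The broad `except` also retries on unrelated errors such as a failed download, so narrow it once you know which exception your stack raises.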

Do not install flash-attn or xformers

HF and GitHub guides written for NVIDIA frequently suggest pip install flash-attn or pip install xformers. On RDNA3 these are the wrong path: the upstream CK FlashAttention build targets CDNA/MI accelerators and typically fails to compile on consumer RDNA3 cards, and the ROCm xformers fork is limited. VoxCPM2 needs neither — it routes attention through PyTorch SDPA on this stack by default. Don't add an attention dependency.

Garbled Chinese audio in third-party runtimes

VoxCPM2 ships a custom VoxCPM2Tokenizer (tokenization_voxcpm2.py) that splits multi-character Chinese tokens into single-character IDs before embedding — the model was trained that way, and a stock LlamaTokenizerFast produces multi-character tokens the model never saw, yielding garbled Chinese output. Using the voxcpm package as shown above applies this splitting automatically; if you wire VoxCPM2 into a different inference framework, make sure the bundled tokenizer is loaded so Chinese synthesis stays correct.
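To make the splitting step concrete, here is a toy illustration on plain strings. It is not the bundled VoxCPM2Tokenizer, which works on token IDs and may apply different rules. The function name and the CJK range check are assumptions made for this example.

```python
def split_cjk_tokens(tokens: list[str]) -> list[str]:
    """Split tokens made of two or more CJK ideographs into single characters."""

    def is_cjk(ch: str) -> bool:
        # CJK Unified Ideographs block only; a simplification for this sketch.
        return "\u4e00" <= ch <= "\u9fff"

    result: list[str] = []
    for token in tokens:
        if len(token) > 1 and all(is_cjk(ch) for ch in token):
            result.extend(token)  # one output token per character
        else:
            result.append(token)  # non-Chinese or single-character token
    return result


print(split_cjk_tokens(["你好", "world", "吗"]))
```

A third-party runtime that skips this step would feed the model multi-character tokens such as "你好" as a single unit, which is the mismatch described above.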

Tight on VRAM with other models loaded

A 16 GB RX 7800 XT has ~8 GB of headroom over VoxCPM2's ~8 GB footprint, so OOM here almost always means another process is holding VRAM (check rocm-smi). If you deliberately stack VoxCPM2 alongside an LLM, ASR model, or another TTS pipeline, set load_denoiser=False (as in every snippet above) to skip the optional reference-audio denoiser and keep VoxCPM2 at its baseline footprint.
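When planning what else to load next to VoxCPM2, the budget is simple subtraction. The sketch below uses the 16 GB and ~8 GB figures from this recipe. The 1 GB reserve for the desktop and runtime is an arbitrary assumption, and the footprints you pass for other models are your own estimates.

```python
from typing import Iterable

CARD_VRAM_GB = 16.0    # RX 7800 XT
VOXCPM2_VRAM_GB = 8.0  # ~8 GB per the model card


def headroom_gb(other_models_gb: Iterable[float] = (),
                reserve_gb: float = 1.0) -> float:
    """VRAM left after VoxCPM2, the other models, and a safety reserve."""
    return CARD_VRAM_GB - VOXCPM2_VRAM_GB - sum(other_models_gb) - reserve_gb


def fits_alongside(other_models_gb: Iterable[float] = (),
                   reserve_gb: float = 1.0) -> bool:
    """True if the planned stack stays within the card's VRAM."""
    return headroom_gb(other_models_gb, reserve_gb) >= 0


print(headroom_gb([5.0]), fits_alongside([5.0]))
```

Treat the result as a planning estimate only. Actual usage depends on sequence length and concurrency, so confirm with rocm-smi once everything is loaded.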