VoxCPM-0.5B on RTX 4080: Zero-Shot Voice Cloning TTS in ~5 GB VRAM

What You'll Build

A local zero-shot voice-cloning text-to-speech setup running VoxCPM-0.5B on your RTX 4080. Clone any voice from a few seconds of reference audio and synthesize natural, context-aware speech — all offline, no API calls.

Hardware data: RTX 4080 (16GB VRAM) · ~5 GB peak · See benchmark data

ℹ️ This recipe targets openbmb/VoxCPM-0.5B (the original Sept-2025 release), not the newer VoxCPM1.5 or VoxCPM2. OpenBMB now ships three checkpoints under the same voxcpm package — VoxCPM2 (2B, 30 languages, 48 kHz), VoxCPM1.5 (0.6B, bilingual, 44.1 kHz), and VoxCPM-0.5B (0.5B, Chinese/English, 16 kHz). The voxcpm package's newer GitHub examples default to VoxCPM2, so always pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained to stay on the variant this recipe documents. All three fit a 16 GB card; the choice is feature trade-offs, not headroom. The "0.5B" is the parameter count, not the VRAM cost — at inference it uses ~5 GB, leaving most of your 16 GB card free.

Requirements

Component	Minimum	Tested
GPU	6GB+ VRAM	RTX 4080 (16GB)
RAM	8GB	—
Storage	2 GB (model weights)	~1.6 GB
Software	Python 3.10+, PyTorch 2.5+, CUDA 12.0+	—

The checkpoint on the canonical HuggingFace card is two files — pytorch_model.bin (~1.3 GB) plus the audio VAE audiovae.pth (~0.3 GB), about 1.6 GB on disk total — so 2 GB of free storage covers it.
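The software row of the requirements table can be checked before installing anything. The sketch below is one possible preflight check, not part of the voxcpm package: python_ok and describe_gpu are hypothetical helper names, and the PyTorch import is lazy so the script still runs on a machine without PyTorch.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum from the requirements table


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def describe_gpu():
    """Return a one-line summary of PyTorch, CUDA, and GPU 0, or the reason it is unavailable."""
    try:
        import torch  # imported lazily so this check works without PyTorch installed
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    vram_gib = props.total_memory / 1024**3
    return (
        f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"{props.name} with {vram_gib:.1f} GiB VRAM"
    )
```

Calling print(python_ok()) and print(describe_gpu()) from a REPL gives a quick go/no-go before the larger install.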

Installation

1. Install VoxCPM

The canonical HuggingFace card and the official OpenBMB/VoxCPM repo both install via PyPI:

pip install voxcpm

This pulls in the voxcpm package and its dependencies. To install from source instead (needed for the Gradio web demo):

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

2. Download the model weights

The weights download automatically on first run, or you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull a newer variant:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="openbmb/VoxCPM-0.5B", local_dir="./voxcpm-0.5b")

About 1.6 GB lands in ./voxcpm-0.5b (pytorch_model.bin + audiovae.pth).
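To confirm the download finished and matches the expected ~1.6 GB, you can sum the file sizes in the target directory. This is a minimal standard-library sketch; dir_size_gb is a hypothetical helper, and the result may differ slightly from 1.6 GB because the repo also contains small config files.

```python
from pathlib import Path


def dir_size_gb(path):
    """Sum the sizes of all regular files under `path`, in decimal gigabytes."""
    total_bytes = sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
    return total_bytes / 1e9


if __name__ == "__main__":
    target = Path("./voxcpm-0.5b")  # the local_dir used in the snapshot_download call
    if target.is_dir():
        print(f"{target}: {dir_size_gb(target):.2f} GB")
    else:
        print(f"{target} not found; run snapshot_download first")
```

A total far below 1.6 GB usually points to an interrupted download; re-running snapshot_download resumes it.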

Running

The Python API follows the canonical VoxCPM-0.5B model card — load the checkpoint and synthesize a clip with a cloned voice:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="Hello! This is a zero-shot voice cloning demo.",
    prompt_wav_path="reference.wav",   # a few seconds of the target voice
    prompt_text="This is the reference transcript.",
    cfg_value=2.0,                     # LM guidance — higher tracks the prompt more closely
    inference_timesteps=10,            # higher = better quality, lower = faster
    normalize=True,                    # enable external text-normalization
    denoise=True,                      # enable external denoiser for the reference clip
)

sf.write("output.wav", wav, 16000)

The card notes the 16000 sample rate — VoxCPM-0.5B emits 16 kHz audio. A command-line entry point is also available (voxcpm --help, or python -m voxcpm.cli --help). First run downloads the weights; subsequent runs load from cache. Output lands in output.wav.

Results

Speed: No first-party RTX 4080 benchmark exists yet; contribute one here. For reference, the canonical model card reports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090. RTF is synthesis time divided by the duration of the generated audio, so lower is faster. The 4090 is a different card (24 GB, ~1.4× the 4080's memory bandwidth), so treat 0.17 as a best-case reference rather than an RTX 4080 measurement; the 4080 will likely land at or somewhat above that figure. Both cards are Ada (sm_89), so installation and workflow are identical; only wall-clock time differs.
VRAM usage: ~5 GB peak at inference, comfortably within the RTX 4080's 16 GB envelope (see benchmark data). The ~1.6 GB of on-disk weights (main model plus audio VAE), together with activations and runtime overhead, stay well under the card's capacity.
Quality notes: VoxCPM produces context-aware prosody — it infers emphasis and intonation from sentence structure (per the model card, "context-aware speech generation"). Reference audio of 3–10 seconds works best for cloning. Bilingual: Chinese and English. Apache-2.0 license — commercial use permitted.

For the full benchmark data, see /check/voxcpm/rtx-4080.
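If you want to measure your own Real-Time Factor on the RTX 4080, the calculation is synthesis time divided by the duration of the audio produced. The sketch below assumes generate returns a one-dimensional array of samples at 16 kHz, as in the Running example; real_time_factor and time_synthesis are hypothetical helpers, not part of voxcpm.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate=16000):
    """RTF = synthesis time / duration of generated audio. Below 1.0 is faster than real time."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def time_synthesis(synthesize, sample_rate=16000):
    """Run `synthesize()` once and return (waveform, rtf).

    `synthesize` is any zero-argument callable that returns a 1-D sequence of
    samples, for example: lambda: model.generate(text="...", ...)
    """
    start = time.perf_counter()
    wav = synthesize()
    elapsed = time.perf_counter() - start
    return wav, real_time_factor(elapsed, len(wav), sample_rate)
```

Run one untimed warm-up call first: the first generate call includes weight loading and CUDA initialization, which would inflate the result.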

Troubleshooting

Out of memory on a smaller card

VoxCPM-0.5B needs ~5 GB. On 6–8 GB cards, close other GPU apps. The 16 GB RTX 4080 has ample headroom — OOM here usually means another process is holding VRAM (check nvidia-smi).
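To see which processes are holding VRAM without reading the full nvidia-smi table, you can query its CSV output. This sketch assumes nvidia-smi is on PATH and uses its --query-compute-apps option; parse_compute_apps and list_gpu_processes are hypothetical helpers. Rows where the driver reports no memory figure (for example "[N/A]") are skipped.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` rows (memory in MiB) into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue
        # Split from both ends so a comma inside the process name is preserved.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        pid, name, used = pid.strip(), name.strip(), used.strip()
        if not (pid.isdigit() and used.isdigit()):
            continue  # e.g. "[N/A]" when the driver cannot report per-process memory
        rows.append({"pid": int(pid), "name": name, "used_mib": int(used)})
    return rows


def list_gpu_processes():
    """Return compute processes on the GPU, or an empty list if nvidia-smi is unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in sorted(list_gpu_processes(), key=lambda p: p["used_mib"], reverse=True):
        print(f'{proc["used_mib"]:>6} MiB  pid {proc["pid"]}  {proc["name"]}')
```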

First run is slow / downloads weights

The first from_pretrained / generate call downloads ~1.6 GB of weights from HuggingFace. Subsequent runs use the local cache. Pre-fetch with snapshot_download (see Installation) to control timing.

Voice clone quality is poor

Use clean reference audio (3–10 s, minimal background noise). Provide an accurate prompt_text transcript — mismatched transcripts degrade prosody. Lower cfg_value (toward 1.5) if the voice sounds strained; raise it for tighter text adherence. If a clip comes out unstable, the API's retry_badcase=True mode (documented on the model card) re-runs bad cases automatically.
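The 3–10 s guideline is easy to check before synthesis. The sketch below reads the clip length from the WAV header using only the standard library; wav_duration_seconds and reference_clip_ok are hypothetical helpers. Python's wave module handles uncompressed PCM WAV only, so a float or compressed file raises wave.Error and needs converting first.

```python
import wave


def wav_duration_seconds(path):
    """Duration of an uncompressed PCM WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def reference_clip_ok(path, min_seconds=3.0, max_seconds=10.0):
    """Return (ok, duration); ok is True when the clip length is within the suggested range."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration
```

For example, ok, seconds = reference_clip_ok("reference.wav") tells you whether to trim or re-record the clip before passing it as prompt_wav_path. It checks length only; background noise still needs a listen.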

Accidentally pulled VoxCPM2 weights

The voxcpm package's newer examples default from_pretrained to openbmb/VoxCPM2. On a 16 GB card this is a feature-mix surprise, not an OOM — VoxCPM2 emits 48 kHz across 30 languages, while VoxCPM-0.5B emits 16 kHz across Chinese/English. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.
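One way to catch a variant mix-up early is to derive the output sample rate from the model id instead of hard-coding 16000. The sketch below encodes the three rates listed in this recipe; expected_sample_rate is a hypothetical helper, and it matches on the final path component so it also accepts a local directory such as ./voxcpm-0.5b.

```python
# Output sample rates per variant, as listed in this recipe (Hz).
SAMPLE_RATES_HZ = {
    "voxcpm-0.5b": 16000,
    "voxcpm1.5": 44100,
    "voxcpm2": 48000,
}


def expected_sample_rate(repo_id):
    """Map a model id or local directory name to its output sample rate in Hz."""
    variant = repo_id.rstrip("/").split("/")[-1].lower()
    try:
        return SAMPLE_RATES_HZ[variant]
    except KeyError:
        raise ValueError(f"unknown VoxCPM variant: {repo_id!r}") from None
```

With MODEL_ID = "openbmb/VoxCPM-0.5B" defined once, pass it to from_pretrained and write audio with sf.write("output.wav", wav, expected_sample_rate(MODEL_ID)); a typo or an unexpected variant then fails loudly instead of producing audio at the wrong speed.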
