What You'll Build
A local zero-shot voice-cloning text-to-speech setup running VoxCPM-0.5B on your RTX 4080 SUPER. Clone any voice from a few seconds of reference audio and synthesize natural, context-aware speech, fully offline after the one-time weight download, with no API calls.
Hardware data: RTX 4080 SUPER (16GB VRAM) · ~5 GB peak · See benchmark data
ℹ️ This recipe targets openbmb/VoxCPM-0.5B (the original Sept-2025 release), not the newer VoxCPM1.5 or VoxCPM2. OpenBMB now ships three checkpoints under the same voxcpm package: VoxCPM2 (2B, 30 languages, 48 kHz), VoxCPM1.5 (0.6B, bilingual, 44.1 kHz), and VoxCPM-0.5B (0.5B, Chinese/English, 16 kHz). The voxcpm package's newer GitHub examples default to VoxCPM2, so always pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained to stay on the variant this recipe documents. All three fit a 16 GB card; the choice is feature trade-offs, not headroom. The "0.5B" is the parameter count, not the VRAM cost: at inference it uses ~5 GB, leaving most of your 16 GB card free.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6GB+ VRAM | RTX 4080 SUPER (16GB) |
| RAM | 8GB | — |
| Storage | 2 GB (model weights) | ~1.6 GB |
| Software | Python 3.10+, PyTorch 2.5+, CUDA 12.0+ | — |
The checkpoint on the canonical HuggingFace card is two files — pytorch_model.bin (~1.3 GB) plus the audio VAE audiovae.pth (~0.3 GB), about 1.6 GB on disk total — so 2 GB of free storage covers it.
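The software row of the table can be checked before installing anything else. The following is a minimal sketch, not part of the voxcpm package: `parse_version`, `at_least`, and `check_environment` are hypothetical helper names, and the thresholds are simply the minimums listed above.

```python
import re
import sys


def parse_version(text):
    """Turn '2.5.1+cu121' into (2, 5, 1); non-numeric suffixes are dropped."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(version_text, minimum):
    """True if version_text is >= minimum, padding short versions with zeros."""
    parsed = parse_version(version_text)
    padded = parsed + (0,) * (len(minimum) - len(parsed))
    return padded >= minimum


def check_environment():
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required, found " + sys.version.split()[0])
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if not at_least(torch.__version__, (2, 5)):
        problems.append("PyTorch 2.5+ required, found " + torch.__version__)
    cuda = torch.version.cuda  # None on CPU-only builds of PyTorch
    if cuda is None or not at_least(cuda, (12, 0)):
        problems.append("CUDA 12.0+ build of PyTorch required, found " + str(cuda))
    elif not torch.cuda.is_available():
        problems.append("PyTorch was built with CUDA but no GPU is visible")
    return problems
```

Calling `check_environment()` and printing the returned list gives a quick pass/fail summary; it does not check RAM or disk space.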
Installation
1. Install VoxCPM
The canonical HuggingFace card and the official OpenBMB/VoxCPM repo both install via PyPI:
pip install voxcpm
This pulls in the voxcpm package and its dependencies. To install from source instead (needed for the Gradio web demo):
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .
2. Download the model weights
The weights download automatically on first run, or you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull a newer variant:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="openbmb/VoxCPM-0.5B", local_dir="./voxcpm-0.5b")
About 1.6 GB lands in ./voxcpm-0.5b (pytorch_model.bin + audiovae.pth).
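To confirm the pre-fetch finished, you can look for the two files and total up what is on disk. This is a rough sketch using only the standard library; `summarize_checkpoint` is a hypothetical helper, and the file names are the ones quoted from the model card above. The directory will also hold small config and metadata files, so the total will not be exactly 1.6 GB.

```python
from pathlib import Path

# File names as listed on the model card (see Requirements).
EXPECTED_FILES = ("pytorch_model.bin", "audiovae.pth")


def summarize_checkpoint(local_dir):
    """Return (missing_files, total_size_in_GB) for a checkpoint directory."""
    root = Path(local_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return missing, total_bytes / 1e9
```

For example, `missing, size_gb = summarize_checkpoint("./voxcpm-0.5b")` should give an empty `missing` list and a size close to 1.6 once the download is complete.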
Running
The Python API follows the canonical VoxCPM-0.5B model card — load the checkpoint and synthesize a clip with a cloned voice:
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="Hello! This is a zero-shot voice cloning demo.",
    prompt_wav_path="reference.wav",  # a few seconds of the target voice
    prompt_text="This is the reference transcript.",
    cfg_value=2.0,                    # LM guidance: higher tracks the prompt more closely
    inference_timesteps=10,           # higher = better quality, lower = faster
    normalize=True,                   # enable external text normalization
    denoise=True,                     # enable external denoiser for the reference clip
)

sf.write("output.wav", wav, 16000)
The card notes the 16000 sample rate — VoxCPM-0.5B emits 16 kHz audio. A command-line entry point is also available (voxcpm --help, or python -m voxcpm.cli --help). First run downloads the weights; subsequent runs load from cache. Output lands in output.wav.
Results
- Speed: No first-party RTX 4080 SUPER benchmark exists yet. For reference, the canonical model card reports streaming synthesis with a Real-Time Factor (RTF, synthesis time divided by audio duration; lower is faster) as low as 0.17 on a consumer-grade NVIDIA RTX 4090. The 4090 is a different, faster card (24 GB, ~1.4× the 4080 SUPER's memory bandwidth), so treat 0.17 as a best-case reference rather than an RTX 4080 SUPER measurement; the RTF on this card is likely somewhat higher. Both cards are Ada (sm_89), so the install and workflow are identical; only wall-clock time differs.
- VRAM usage: ~5 GB peak at inference, comfortably within the RTX 4080 SUPER's 16 GB (see benchmark data). The ~1.6 GB of on-disk weights, which already includes the audio VAE, plus activations and runtime overhead stay well under the card's capacity.
- Quality notes: Per the model card, VoxCPM produces context-aware prosody: it infers emphasis and intonation from sentence structure and adapts speaking style to the content. Reference audio of 3–10 seconds works best for cloning. The model is bilingual (Chinese and English). The weights and code are open-sourced under Apache-2.0.
For the full benchmark data, see /check/voxcpm/rtx-4080-super.
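If you want a rough RTF number for your own card, the arithmetic is simple: time spent synthesizing divided by the duration of the audio produced. The sketch below assumes the 16 kHz output rate of VoxCPM-0.5B; `real_time_factor` and `timed` are hypothetical helpers, and timing a plain `generate` call is not the same measurement as the streaming figure on the model card.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=16000):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

As a worked example, taking 1.7 s to produce 160,000 samples (10 s of audio at 16 kHz) gives an RTF of about 0.17. In practice you would call `wav, seconds = timed(model.generate, text="...")` and then `real_time_factor(seconds, len(wav))`, discarding the first call because it includes model warm-up.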
Troubleshooting
Out of memory on a smaller card
VoxCPM-0.5B needs ~5 GB. On 6–8 GB cards, close other GPU apps. The 16 GB RTX 4080 SUPER has ample headroom — OOM here usually means another process is holding VRAM (check nvidia-smi).
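A quick way to see whether something else is holding VRAM is to ask PyTorch for the free and total memory before loading the model. This is a minimal sketch: `has_headroom` and `report_vram` are hypothetical helpers, the 5 GB threshold is the approximate peak quoted in this recipe, and `torch.cuda.mem_get_info` is the only PyTorch call used.

```python
REQUIRED_GB = 5.0  # approximate inference peak quoted in this recipe


def has_headroom(free_bytes, required_gb=REQUIRED_GB):
    """True if the free VRAM covers the expected peak."""
    return free_bytes / 1024**3 >= required_gb


def report_vram(device=0):
    """Print free/total VRAM for one GPU and flag a likely shortfall."""
    import torch  # imported lazily so has_headroom works without PyTorch

    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    gib = 1024**3
    print(f"GPU {device}: {free_bytes / gib:.1f} GiB free of {total_bytes / gib:.1f} GiB")
    if not has_headroom(free_bytes):
        print("Less than ~5 GB free; check nvidia-smi for other processes.")
```

Run `report_vram()` before `from_pretrained`; if the free figure is far below the total on an otherwise idle card, another process is the likely cause.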
First run is slow / downloads weights
The first from_pretrained / generate call downloads ~1.6 GB of weights from HuggingFace. Subsequent runs use the local cache. Pre-fetch with snapshot_download (see Installation) to control timing.
Voice clone quality is poor
Use clean reference audio (3–10 s, minimal background noise). Provide an accurate prompt_text transcript — mismatched transcripts degrade prosody. Lower cfg_value (toward 1.5) if the voice sounds strained; raise it for tighter text adherence. If a clip comes out unstable, the API's retry_badcase=True mode (documented on the model card) re-runs bad cases automatically.
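The 3–10 s guideline is easy to check before you spend time on synthesis. The sketch below uses Python's standard `wave` module, so it assumes the reference clip is an uncompressed PCM WAV file; `reference_duration_seconds` and `is_good_length` are hypothetical helper names.

```python
import wave


def reference_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (frames / frame rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_good_length(seconds, low=3.0, high=10.0):
    """True if the clip falls inside the recommended 3-10 s window."""
    return low <= seconds <= high
```

For example, `is_good_length(reference_duration_seconds("reference.wav"))` returns False for a one-second clip, which is a hint to record a longer sample. It says nothing about background noise, which still needs a listen.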
KV cache fills on very long inputs
A community user hit an OutOfMemoryError with a "KV cache is full" message when synthesizing a very long single input (~1500 characters in one call); see OpenBMB/VoxCPM issue #52. This is a per-call context limit, not a card-fit problem: the RTX 4080 SUPER's 16 GB is not the bottleneck. Split long passages into shorter sentences or paragraphs, synthesize them in a loop, and concatenate the resulting clips.
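The split-and-concatenate workaround can be sketched as follows. This is an illustration under stated assumptions, not a feature of the voxcpm package: `split_text` and `synthesize_long` are hypothetical helpers, the 300-character limit is an arbitrary value chosen to sit well below the ~1500 characters reported in the issue, and the code assumes `model.generate` returns a one-dimensional NumPy array as in the Running example.

```python
import re


def split_text(text, max_chars=300):
    """Split text at sentence boundaries into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than
    cut mid-word, so a chunk can exceed the limit in that case.
    """
    # Split after ., !, ? followed by whitespace, or directly after CJK stops.
    sentences = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(model, text, max_chars=300, **generate_kwargs):
    """Synthesize each chunk separately and join the audio end to end."""
    import numpy as np  # lazy import keeps split_text dependency-free

    chunks = split_text(text, max_chars)
    clips = [model.generate(text=chunk, **generate_kwargs) for chunk in chunks]
    return np.concatenate(clips)
```

Pass the same `prompt_wav_path` and `prompt_text` through `generate_kwargs` for every chunk so the cloned voice stays consistent. Chunks are joined with no pause between them, so you may want to insert a short stretch of silence if the result sounds rushed.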
Accidentally pulled VoxCPM2 weights
The voxcpm package's newer examples default from_pretrained to openbmb/VoxCPM2. On a 16 GB card this is a feature-mix surprise, not an OOM — VoxCPM2 emits 48 kHz across 30 languages, while VoxCPM-0.5B emits 16 kHz across Chinese/English. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.
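One way to keep the repo id and the output sample rate from drifting apart is to define both in one place. The sketch below is a hypothetical helper, not part of the voxcpm package; the rates are the ones listed in the note at the top of this recipe, and the lookup keys on the final component of the repo id.

```python
MODEL_ID = "openbmb/VoxCPM-0.5B"  # the variant this recipe documents

# Output sample rates as listed in the note at the top of this recipe.
SAMPLE_RATE_HZ = {
    "VoxCPM-0.5B": 16000,
    "VoxCPM1.5": 44100,
    "VoxCPM2": 48000,
}


def sample_rate_for(model_id):
    """Look up the output sample rate from the repo id's last path component."""
    variant = model_id.rsplit("/", 1)[-1]
    if variant not in SAMPLE_RATE_HZ:
        raise ValueError(f"Unknown VoxCPM variant: {model_id}")
    return SAMPLE_RATE_HZ[variant]
```

With this in place, load the model with `VoxCPM.from_pretrained(MODEL_ID)` and write the file with `sf.write("output.wav", wav, sample_rate_for(MODEL_ID))`, so a changed variant cannot silently produce audio saved at the wrong rate.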
Want to add your own benchmark? Contribute here.