What You'll Build
A local zero-shot voice-cloning text-to-speech setup running VoxCPM-0.5B on your RTX 4070 Ti SUPER. Clone any voice from a few seconds of reference audio and synthesize natural, context-aware speech — all offline, no API calls.
Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · ~5 GB peak · See benchmark data
ℹ️ This recipe targets openbmb/VoxCPM-0.5B (the original Sept-2025 release), not the newer VoxCPM1.5 or VoxCPM2. OpenBMB now ships several checkpoints under the same voxcpm package, and the package's newer GitHub examples default to VoxCPM2, so always pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained to stay on the variant this recipe documents. The "0.5B" is the parameter count, not the VRAM cost: at inference it uses ~5 GB, leaving most of your 16 GB card free.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6GB+ VRAM | RTX 4070 Ti SUPER (16GB) |
| RAM | 8GB | — |
| Storage | 2 GB (model weights) | ~1.6 GB |
| Software | Python 3.10+, PyTorch 2.1+, CUDA 12.0+ | — |
The checkpoint on the canonical HuggingFace card is two files — pytorch_model.bin (~1.3 GB) plus the audio VAE audiovae.pth (~0.3 GB), about 1.6 GB on disk total — so 2 GB of free storage covers it. (There is no model.safetensors; the weights ship as a .bin plus the .pth VAE.)
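The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the voxcpm package: `check_requirements` is a hypothetical helper, the thresholds are the table's minimums, and the GPU checks are skipped (reported as `None`) when PyTorch is not installed yet.

```python
import shutil
import sys


def check_requirements(cache_dir=".", min_free_gb=2.0, min_vram_gb=6.0):
    """Return a dict of pass/fail flags for the recipe's minimum requirements.

    Hypothetical helper: thresholds mirror the Requirements table.
    """
    report = {
        # Python 3.10+ per the table.
        "python_ok": sys.version_info >= (3, 10),
        # Free space on the drive holding cache_dir (decimal GB).
        "disk_ok": shutil.disk_usage(cache_dir).free / 1e9 >= min_free_gb,
        "cuda_ok": None,
        "vram_ok": None,
    }
    try:
        import torch  # optional: only needed for the GPU checks
    except ImportError:
        return report  # GPU flags stay None when PyTorch is absent

    report["cuda_ok"] = bool(torch.cuda.is_available())
    if report["cuda_ok"]:
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        report["vram_ok"] = total_gb >= min_vram_gb
    else:
        report["vram_ok"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {ok}")
```

Point `cache_dir` at the drive that holds your HuggingFace cache if it is not the current one.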
Installation
1. Install VoxCPM
The canonical HuggingFace card and the official OpenBMB/VoxCPM repo both install via PyPI:
pip install voxcpm
This pulls in the voxcpm package and its dependencies — everything the Python workflow below needs.
2. Download the model weights
The weights download automatically on first run, or you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull a newer variant:
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
About 1.6 GB lands in your HuggingFace cache (pytorch_model.bin + audiovae.pth).
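To confirm that the pre-fetch pulled the expected variant, you can total the size of the downloaded snapshot. This is a sketch under assumptions: `dir_size_gb` and `prefetch_and_measure` are hypothetical helper names, and the total may come out slightly above 1.6 GB because the repo also carries small config files.

```python
from pathlib import Path


def dir_size_gb(root):
    """Total size of all files under root, in decimal GB (symlinks are followed)."""
    total = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total / 1e9


def prefetch_and_measure(repo_id="openbmb/VoxCPM-0.5B"):
    """Download (or reuse) the snapshot and report where it lives and how big it is."""
    # Imported lazily so dir_size_gb works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id)  # returns the local snapshot folder
    return local_dir, dir_size_gb(local_dir)
```

Calling `prefetch_and_measure()` once triggers the download; a result far from ~1.6 GB suggests a different checkpoint was fetched.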
Running
The Python API follows the canonical VoxCPM-0.5B model card — load the checkpoint and synthesize a clip with a cloned voice:
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model, designed to generate highly expressive speech.",
    prompt_wav_path="reference.wav",  # a few seconds of the target voice (None to skip cloning)
    prompt_text="This is the reference transcript.",
    cfg_value=2.0,           # LM guidance: higher tracks the prompt more closely
    inference_timesteps=10,  # higher = better quality, lower = faster
    normalize=True,          # enable external text-normalization
    denoise=True,            # enable external denoiser for the reference clip
    retry_badcase=True,      # auto-retry on unstable ("unstoppable") generations
)

sf.write("output.wav", wav, 16000)
The card writes the output at the 16000 sample rate — VoxCPM-0.5B emits 16 kHz audio. A command-line entry point is also available after install (voxcpm --help, or python -m voxcpm.cli --help); for example:
voxcpm --text "Hello VoxCPM" --output out.wav
First run downloads the weights; subsequent runs load from cache. The Python example writes output.wav; the command-line example writes out.wav.
Results
- Speed: No first-party RTX 4070 Ti SUPER benchmark exists yet (contribute one here). For reference, the canonical model card reports streaming synthesis with a Real-Time Factor (RTF, synthesis time divided by audio duration; lower is faster) as low as 0.17 on a consumer-grade NVIDIA RTX 4090. The 4090 is a faster, larger card (24 GB, more CUDA cores and memory bandwidth), so treat 0.17 as a best-case reference rather than an RTX 4070 Ti SUPER measurement, and expect a somewhat higher RTF on this card. The install and workflow are identical (both cards are Ada, sm_89); only wall-clock time differs.
- VRAM usage: ~5 GB peak at inference, comfortably within the RTX 4070 Ti SUPER's 16 GB envelope (see benchmark data). The ~1.6 GB of on-disk weights (audio VAE included), plus activations and runtime overhead, stay well under the card's capacity.
- Quality notes: VoxCPM produces context-aware prosody — per the model card, it "comprehends text to infer and generate appropriate prosody." Reference audio of a few seconds works best for cloning. Bilingual: Chinese and English. Apache-2.0 license (per the model card) — commercial use permitted, though the card notes the release is intended for research and development and recommends testing before production use.
For the full benchmark data, see /check/voxcpm/rtx-4070-ti-super.
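If you want to measure your own RTF for the speed figure above, the arithmetic is simple: synthesis time divided by the duration of the audio produced. The sketch below uses hypothetical helper names (`real_time_factor`, `time_synthesis`) and assumes the synthesis call returns a 1-D array of samples at 16 kHz, as in the Running example.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=16000):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


def time_synthesis(synthesize, sample_rate=16000):
    """Run a zero-argument synthesis callable once and return (wav, rtf)."""
    start = time.perf_counter()
    wav = synthesize()
    elapsed = time.perf_counter() - start
    return wav, real_time_factor(elapsed, len(wav), sample_rate)


# Example wiring (assumes `model` was loaded as in the Running section):
# wav, rtf = time_synthesis(lambda: model.generate(text="Hello VoxCPM"))
# print(f"RTF: {rtf:.2f}")
```

At an RTF of 0.17, ten seconds of audio take about 1.7 seconds to synthesize. Time a second call rather than the first, since the first one includes model warm-up.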
Troubleshooting
Accidentally pulled VoxCPM2 weights
The voxcpm package's newer examples default from_pretrained to the larger VoxCPM2 checkpoint. On a 16 GB card this shows up as unexpected behavior rather than an OOM: VoxCPM2 and VoxCPM-0.5B differ in language coverage and output sample rate. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.
Out of memory on a smaller card or with very long input
VoxCPM-0.5B needs ~5 GB; the 16 GB RTX 4070 Ti SUPER has ample headroom, so a fresh-run OOM here usually means another process is holding VRAM (check nvidia-smi). Note that very long single-shot inputs grow the KV cache and can exhaust memory on smaller cards: issue #52 reports a "KV cache is full" failure on a 15 GB Colab GPU when generating very long passages. Split long text into shorter segments.
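One way to split long text is to break on sentence boundaries and pack sentences into chunks below a character budget. This is a minimal sketch: `split_text` is a hypothetical helper, and the 200-character default is an arbitrary assumption, not a documented VoxCPM limit.

```python
import re


def split_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # English punctuation must be followed by whitespace (so "3.14" stays intact);
    # Chinese full-width punctuation splits immediately.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [s.strip() for s in parts if s and s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then go through its own generate call with the same reference clip, and the resulting arrays can be joined (for example with numpy.concatenate) before writing the file.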
Voice clone quality is poor
Use clean reference audio (a few seconds, minimal background noise) and an accurate prompt_text transcript — mismatched transcripts degrade prosody. Lower cfg_value (toward 1.5) if the voice sounds strained; raise it for tighter text adherence. If a clip comes out unstable, the API's retry_badcase=True mode (documented on the model card) re-runs bad cases automatically.
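A quick look at the reference clip can rule out the most common problems before you start tuning parameters. The sketch below uses only the standard-library wave module, so it reads PCM .wav files only. `describe_wav` and `reference_warnings` are hypothetical helpers, and the 3 to 30 second bounds are assumptions standing in for "a few seconds", not limits from the model card.

```python
import wave


def describe_wav(path):
    """Return duration, sample rate, and channel count of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "seconds": wf.getnframes() / rate,
            "sample_rate": rate,
            "channels": wf.getnchannels(),
        }


def reference_warnings(info, min_seconds=3.0, max_seconds=30.0):
    """List human-readable warnings; the bounds are illustrative assumptions."""
    warnings = []
    if info["seconds"] < min_seconds:
        warnings.append(f"clip is only {info['seconds']:.1f} s; try a longer sample")
    if info["seconds"] > max_seconds:
        warnings.append(f"clip is {info['seconds']:.1f} s; try trimming it")
    if info["channels"] > 1:
        warnings.append("clip is not mono; consider downmixing")
    return warnings
```

For example, `reference_warnings(describe_wav("reference.wav"))` returns an empty list when none of these checks trip.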
First run is slow / downloads weights
The first from_pretrained / generate call downloads ~1.6 GB of weights from HuggingFace. Subsequent runs use the local cache. Pre-fetch with snapshot_download (see Installation) to control timing.