What You'll Build
A local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B — a 0.5B-parameter TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav.
Hardware data: RTX 5060 Ti (16 GB VRAM) · VoxCPM-0.5B fits in ~5 GB leaving plenty of headroom · See benchmark data
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (model needs ~5 GB per official comparison table) | RTX 5060 Ti (16 GB) |
| RAM | 8 GB | — |
| Storage | ~2 GB (model weights + denoiser/ASR helpers) | — |
| Software | Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0 (source) | — |
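The software row above can be checked before installing anything. The following is a minimal sketch, not an official preflight script: the helper names `python_version_ok` and `cuda_summary` are made up for illustration, and the version bounds are simply copied from the table.

```python
import importlib.util
import sys

# Bounds copied from the requirements table: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_ok(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON_EXCLUSIVE


def cuda_summary():
    """Describe the local PyTorch/CUDA setup, or say why it is unknown."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check also runs without PyTorch

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}, no CUDA device visible"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"PyTorch {torch.__version__}, {props.name}, {vram_gb:.1f} GB VRAM"
```

Calling `print(python_version_ok(), cuda_summary())` gives a quick read on whether the interpreter and GPU meet the table's minimums.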
Installation
1. Install the voxcpm package
The simplest path is the published PyPI package:
pip install voxcpm
This is the canonical install per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README.
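To confirm the install landed in the active environment, a small standard-library check is usually enough. This sketch assumes only that the distribution is named `voxcpm`, as in the pip command above; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

`installed_version("voxcpm")` returns a version string after a successful install and `None` otherwise.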
2. (Optional) Install from source for the web demo
If you want the Gradio playground, clone the repo and install in editable mode:
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .
Source: VoxCPM Quick Start docs.
3. (Optional) Pre-download model weights
Weights download on first inference automatically, but you can pre-fetch them to control where they land:
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
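One way to control the download location is the `local_dir` argument of `snapshot_download`. The sketch below is illustrative: `target_dir` and `prefetch` are hypothetical helper names, and how to point VoxCPM at that folder afterwards is not covered here.

```python
from pathlib import Path

REPO_ID = "openbmb/VoxCPM-0.5B"


def target_dir(base):
    """Folder under `base` named after the repo, e.g. <base>/VoxCPM-0.5B."""
    return Path(base) / REPO_ID.split("/")[-1]


def prefetch(base):
    """Download the snapshot into target_dir(base) and return that path."""
    # Imported lazily; the download needs network access on the first run.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=str(target_dir(base)))
```

For example, `prefetch("models")` would place the files under `models/VoxCPM-0.5B`.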
Running
Python — basic synthesis
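import soundfile as sf
from voxcpm import VoxCPM

# Weights are downloaded on first use if they are not already cached.
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model.",
    prompt_wav_path=None,      # no reference clip: synthesize from text alone
    cfg_value=2.0,
    inference_timesteps=10,    # more steps: better quality, slower synthesis
    normalize=True,
    denoise=True,
)
sf.write("output.wav", wav, 16000)  # VoxCPM-0.5B emits 16 kHz audio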
Note the 16000 sample rate: VoxCPM-0.5B emits 16 kHz audio. Raising inference_timesteps improves quality at the cost of speed; lowering it does the reverse.
Python — zero-shot voice cloning
Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:
# Reuses `sf` and `model` from the basic synthesis example.
wav = model.generate(
    text="Hello — this is your cloned voice speaking.",
    prompt_wav_path="reference.wav",
    prompt_text="reference transcript matching the wav",  # must match the clip
    cfg_value=2.0,
    inference_timesteps=10,
    denoise=True,
)
sf.write("cloned.wav", wav, 16000)
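Before cloning, it can help to inspect the reference clip. This is a rough sketch using the standard-library `wave` module, which reads PCM .wav files only. The helper names, the 3 to 30 second bounds, and the mono preference are illustrative assumptions, not documented VoxCPM requirements.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_s) for a PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    return rate, channels, duration


def reference_warnings(path, min_s=3.0, max_s=30.0):
    """List human-readable concerns; the bounds are illustrative guesses."""
    rate, channels, duration = describe_wav(path)
    notes = []
    if duration < min_s:
        notes.append(f"clip is only {duration:.1f} s long")
    if duration > max_s:
        notes.append(f"clip is {duration:.1f} s long; consider trimming")
    if channels != 1:
        notes.append(f"clip has {channels} channels; mono is the safer choice")
    return notes
```

An empty list from `reference_warnings("reference.wav")` means none of these rough checks fired; it does not guarantee a good clone.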
Gradio web demo
If you installed from source (step 2), launch the local UI:
python app.py
Results
- Speed: RTF (Real-Time Factor) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 (i.e. 1 s of audio synthesised in ~0.17 s of GPU time), per the official OpenBMB README and HF model card. RTX 5060 Ti numbers are not yet community-benchmarked; track /check/voxcpm/rtx-5060-ti for when they land.
- VRAM usage: ~5 GB per the official model comparison table. Leaves the 16 GB RTX 5060 Ti free for other workloads.
- Quality notes: 16 kHz output sample rate, 12.5 Hz acoustic frame rate, 2 supported languages (Chinese, English) per the official Models & Versions table. Voice cloning is "continuation-only" — the model continues from your reference clip rather than fully replacing the speaker's identity from scratch. Trained on a 1.8 M-hour bilingual corpus.
For the full benchmark data, see /check/voxcpm/rtx-5060-ti.
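The RTF figure quoted above is plain arithmetic: synthesis time divided by the duration of the audio produced. A minimal sketch for measuring it on your own card follows, assuming the 16 kHz output rate stated in this article; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000  # VoxCPM-0.5B output rate, per this article


def audio_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Duration of a mono waveform with `num_samples` samples."""
    return num_samples / sample_rate


def real_time_factor(elapsed_s, num_samples, sample_rate=SAMPLE_RATE_HZ):
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    return elapsed_s / audio_seconds(num_samples, sample_rate)
```

To use it, time a `model.generate(...)` call with `time.perf_counter()` and pass the elapsed seconds together with `len(wav)`. The first call may include one-time setup, so timing a second call is likely more representative.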
Troubleshooting
torch.compile errors on launch
The Quick Start docs recommend passing optimize=False to VoxCPM.from_pretrained if you hit torch.compile platform issues — common on Windows or older CUDA stacks.
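If you would rather fall back automatically, a small wrapper can retry with the flag turned off. This is a sketch under one assumption: that the failure surfaces while the model is loading. If the error only appears on the first `generate()` call, pass `optimize=False` directly instead. `load_with_fallback` is a hypothetical helper, not part of voxcpm.

```python
def load_with_fallback(loader):
    """Try loader(optimize=True); on any error retry with optimize=False.

    Returns (model, optimize_flag_used).
    """
    try:
        return loader(optimize=True), True
    except Exception as err:  # broad on purpose: failures vary by platform
        print(f"optimized load failed ({err!r}); retrying with optimize=False")
        return loader(optimize=False), False


# Intended usage (requires voxcpm to be installed):
#   from voxcpm import VoxCPM
#   model, optimized = load_with_fallback(
#       lambda optimize: VoxCPM.from_pretrained(
#           "openbmb/VoxCPM-0.5B", optimize=optimize
#       )
#   )
```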
Strained or weird-sounding voice
Per the Hugging Face card, lower cfg_value (e.g. from 2.0 toward 1.5) to loosen guidance when the voice sounds strained; raise it for closer adherence to the prompt speech style or input text. Long or highly expressive inputs can make the model unstable; if that happens, chunk the text or reduce inference_timesteps.
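Chunking long input can be as simple as splitting on sentence boundaries and packing sentences up to a length budget. The following is one possible sketch, not a VoxCPM feature: `chunk_text` and the 200-character default are assumptions, a single over-long sentence is kept whole, and sentences are rejoined with a space even for Chinese text.

```python
import re

# Split after English end punctuation followed by whitespace, or directly
# after Chinese end punctuation (which is normally not followed by a space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Intended usage (assumes `model` from the snippets above and NumPy):
#   parts = [model.generate(text=c, cfg_value=2.0, inference_timesteps=10)
#            for c in chunk_text(long_text)]
#   wav = np.concatenate(parts)
```

Each chunk is synthesised independently, so prosody may shift slightly at the joins.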
Background noise in cloned voices
Set denoise=True (as in the snippets above) or enable "Prompt Speech Enhancement" in the Gradio UI. Either option pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning, as documented on the HF model card.