What You'll Build
A local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B, a 0.5B-parameter, tokenizer-free TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav. On the RTX 5070's 12 GB the model needs about 5 GB, under half the card, so it fits comfortably even with a display attached.
Hardware data: RTX 5070 (12 GB VRAM) · VoxCPM-0.5B fits in ~5 GB leaving ~7 GB free · See benchmark data
ℹ️ This recipe pins VoxCPM-0.5B (legacy), not VoxCPM1.5 or VoxCPM2. The OpenBMB repo ships three versions per the official Models & Versions table: VoxCPM2 (2B backbone, ~8 GB VRAM, 48 kHz, 30 languages; the latest and now the repo's default), VoxCPM1.5 (0.6B, ~6 GB, 44.1 kHz, 2 languages; stable), and VoxCPM-0.5B (0.5B, ~5 GB, 16 kHz, 2 languages; legacy). All three fit on the RTX 5070's 12 GB, so the choice here is feature trade-offs, not headroom. This recipe pins 0.5B because it is the original, most-cited release for Chinese/English voice cloning. The upstream Quick Start and CLI have moved their default to openbmb/VoxCPM2: the GitHub README now headlines VoxCPM2 and its sample code calls from_pretrained("openbmb/VoxCPM2"), and the ReadTheDocs quickstart advises new projects to start with VoxCPM 2. So you must pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained (shown below) or you will silently load the 2B VoxCPM2 checkpoint instead. If you want 48 kHz audio or one of the 30 languages, switch to VoxCPM2 deliberately.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (model needs ~5 GB per official Models & Versions table) | RTX 5070 (12 GB) |
| RAM | 8 GB | — |
| Storage | ~2 GB (1.30 GB pytorch_model.bin + 0.30 GB audiovae.pth + denoiser/ASR helpers) | — |
| Software | Python >= 3.10 (<3.13), PyTorch >= 2.5.0, CUDA 12.8 (source) | — |
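The Python bounds in the table can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper name `python_supported` is illustrative and not part of the voxcpm package.

```python
import sys

# Bounds copied from the Requirements table: Python >= 3.10 and < 3.13.
MIN_PY = (3, 10)
MAX_PY_EXCLUSIVE = (3, 13)


def python_supported(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PY <= major_minor < MAX_PY_EXCLUSIVE


print("Python", sys.version.split()[0], "supported:", python_supported())
```

Run it with the interpreter you intend to install into; a virtual environment can point at a different Python than your shell default.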
Installation
1. Install PyTorch with the cu128 wheel (Blackwell)
The RTX 5070 is a Blackwell (GB205, sm_120) GPU. PyTorch's default pip install torch already ships sm_120 kernels via the CUDA 12.8 (cu128) build, so a current PyTorch release supports the card. Pin cu128 explicitly to be safe:
pip install torch --index-url https://download.pytorch.org/whl/cu128
VoxCPM-0.5B runs through standard PyTorch attention; it does not require FlashAttention-2, so the FA2 sm_120 wheel gap that bites other Blackwell recipes does not apply here.
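To confirm that the installed PyTorch build actually targets the card, you can compare the device's compute capability with the architectures the wheel was compiled for. This is a sketch under the assumption that the GPU is device 0; the helper names are illustrative, and `check_gpu` simply returns None when torch or CUDA is unavailable.

```python
def is_sm120(capability):
    """True when a (major, minor) compute capability equals sm_120."""
    return tuple(capability) == (12, 0)


def arch_list_has_sm120(arch_list):
    """True when the PyTorch build lists kernels compiled for sm_120."""
    return "sm_120" in arch_list


def check_gpu():
    """Report what the installed PyTorch sees; None without torch or CUDA."""
    try:
        import torch  # imported lazily so the helpers above work without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability(0)  # assumes GPU index 0
    archs = torch.cuda.get_arch_list()
    return {
        "name": torch.cuda.get_device_name(0),
        "is_sm120": is_sm120(capability),
        "build_has_sm120": arch_list_has_sm120(archs),
    }


print(check_gpu())
```

If `is_sm120` is True but `build_has_sm120` is False, the wheel likely predates Blackwell support and reinstalling from the cu128 index is the first thing to try.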
2. Install the voxcpm package
The canonical install is the published PyPI package, per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:
pip install voxcpm
3. (Optional) Install from source for the Gradio web demo
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .
Source: VoxCPM Quick Start docs.
4. (Optional) Pre-download model weights
Weights download on first inference automatically, but you can pre-fetch them to control where they land:
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
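The call above stores the files in the default Hugging Face cache. If the goal is to choose the destination, `snapshot_download` accepts a `local_dir` argument. The sketch below is one way to use it; the folder naming scheme and the helper names `weights_dir` and `prefetch` are assumptions made for illustration.

```python
from pathlib import Path


def weights_dir(base, repo_id="openbmb/VoxCPM-0.5B"):
    """Map a repo id to a folder under `base` (hypothetical naming scheme)."""
    return Path(base) / repo_id.replace("/", "--")


def prefetch(base, repo_id="openbmb/VoxCPM-0.5B"):
    """Download the repo snapshot into a predictable local folder."""
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = weights_dir(base, repo_id)
    # local_dir places the files in `target` instead of the default HF cache.
    return snapshot_download(repo_id, local_dir=str(target))


# Usage (needs network access, so it is not executed here):
#   path = prefetch("./models")
#   print("weights stored in", path)
```

Setting the `HF_HOME` environment variable before the first run is an alternative if you would rather relocate the whole cache than a single repo.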
If you plan to use the denoiser / ASR helpers for cleaner cloning (recommended), also pre-fetch the ZipEnhancer and SenseVoice-Small models that the HF card references:
from modelscope import snapshot_download
snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")
snapshot_download("iic/SenseVoiceSmall")
Running
Python — basic synthesis
import soundfile as sf
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model.",
    prompt_wav_path=None,
    prompt_text=None,
    cfg_value=2.0,
    inference_timesteps=10,
    normalize=True,
    denoise=True,
)
sf.write("output.wav", wav, 16000)
Note the 16000 sample rate: VoxCPM-0.5B emits 16 kHz audio. Higher inference_timesteps values give better quality at the cost of speed; lower values do the reverse. Passing the full "openbmb/VoxCPM-0.5B" repo path keeps you on the legacy 0.5B checkpoint; omitting it now resolves to VoxCPM2.
Python — zero-shot voice cloning
Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:
wav = model.generate(
    text="Hello — this is your cloned voice speaking.",
    prompt_wav_path="reference.wav",
    prompt_text="reference transcript matching the wav",
    cfg_value=2.0,
    inference_timesteps=10,
    denoise=True,
)
sf.write("cloned.wav", wav, 16000)
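Before cloning, it can help to inspect the reference clip so that an empty or hour-long file does not slip through. The sketch below uses the standard-library `wave` module, which reads uncompressed PCM .wav files only. The duration bounds in `looks_usable` are hypothetical defaults chosen to match "a few seconds", not limits documented by VoxCPM.

```python
import wave


def describe_wav(path):
    """Return sample rate, channel count, and duration of a PCM .wav file."""
    with wave.open(path, "rb") as clip:
        rate = clip.getframerate()
        frames = clip.getnframes()
        return {
            "sample_rate": rate,
            "channels": clip.getnchannels(),
            "seconds": frames / rate,
        }


def looks_usable(info, min_seconds=2.0, max_seconds=30.0):
    """Hypothetical sanity bounds for a short reference clip; tune to taste."""
    return min_seconds <= info["seconds"] <= max_seconds


# Usage:
#   info = describe_wav("reference.wav")
#   print(info, "usable:", looks_usable(info))
```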
CLI
After installation the voxcpm entry point is available. Pass --hf-model-id openbmb/VoxCPM-0.5B to stay on this recipe's variant (the legacy 0.5B CLI is documented on the HF model card):
voxcpm --text "Hello VoxCPM" --output out.wav --hf-model-id openbmb/VoxCPM-0.5B
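For several lines of text, the same command can be generated programmatically. This sketch builds one argument list per line using only the three flags shown above; the output file naming and the helper name `build_command` are illustrative assumptions.

```python
import shlex

MODEL_ID = "openbmb/VoxCPM-0.5B"


def build_command(text, output_path, model_id=MODEL_ID):
    """Build the argv list for one voxcpm CLI call (flags as shown above)."""
    return [
        "voxcpm",
        "--text", text,
        "--output", output_path,
        "--hf-model-id", model_id,
    ]


lines = ["Hello VoxCPM", "A second sentence."]
commands = [
    build_command(text, f"out_{i:02d}.wav")
    for i, text in enumerate(lines, start=1)
]
for cmd in commands:
    # Printed for review; pass cmd to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Each CLI invocation is a separate process and so loads the model again; for more than a handful of lines, calling `model.generate` in a loop from Python avoids the repeated start-up.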
Gradio web demo
If you installed from source (step 3), launch the local UI:
python app.py
Results
- Speed: Not yet benchmarked on the RTX 5070. The only published figure is RTF (Real-Time Factor) ~0.17 for VoxCPM-0.5B on an RTX 4090 per the official Models & Versions table. The RTX 5070 is a different, smaller Blackwell GPU (GB205, sm_120) with markedly lower memory bandwidth (~672 GB/s vs the RTX 4090's ~1008 GB/s) and fewer CUDA cores, and TTS inference is memory-bandwidth-sensitive — so the 4090 RTF does not transfer as an RTX 5070 number. Track and contribute a measurement at /check/voxcpm/rtx-5070.
- VRAM usage: ~5 GB per the official Models & Versions table. On disk the model is ~1.6 GB (1.30 GB pytorch_model.bin + 0.30 GB audiovae.pth); the ~5 GB runtime peak covers activations plus the optional denoiser/ASR helper models. That fits the RTX 5070's 12 GB comfortably: roughly 7 GB stays free even with a display attached, so there is no need for offload or a smaller variant on this card.
- Quality notes: 16 kHz output sample rate, 12.5 Hz LM token rate, 2 supported languages (Chinese, English), continuation-only voice cloning, all per the official Models & Versions table. Trained on a 1.8 M-hour bilingual corpus per the HF model card. Apache-2.0 license; commercial use is permitted.
For the full benchmark data, see /check/voxcpm/rtx-5070.
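If you want to produce your own RTX 5070 figure, RTF is the time spent synthesising divided by the duration of the audio produced. The sketch below shows that arithmetic with a small timing wrapper; it assumes `generate` returns a 1-D array of samples at 16 kHz, as the `sf.write` calls in the Running section suggest. The helper names are illustrative.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate=16000):
    """RTF = time spent synthesising / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage with the model from the Running section (not executed here):
#   wav, elapsed = timed(model.generate, text="...", cfg_value=2.0,
#                        inference_timesteps=10)
#   print("RTF:", real_time_factor(elapsed, len(wav)))
```

Because torch.compile is enabled by default, the first call probably includes one-off warm-up time; discarding it and averaging several later runs should give a fairer number.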
Troubleshooting
torch.compile errors on launch
The VoxCPM helper enables torch.compile optimisations by default. The Quick Start docs note that if you hit platform-specific torch.compile issues, you can pass optimize=False to VoxCPM.from_pretrained — useful on Windows or older CUDA stacks.
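One way to apply that advice without hard-coding it is to try the default load and retry with optimize=False on failure. This is a sketch that assumes the error surfaces during from_pretrained; if it only appears on the first generate call, pass optimize=False directly instead. The wrapper name `load_with_fallback` is hypothetical.

```python
def load_with_fallback(loader, model_id="openbmb/VoxCPM-0.5B"):
    """Try the default (compiled) load first, then retry with optimize=False."""
    try:
        return loader(model_id)
    except Exception as err:  # broad on purpose: compile failures vary by platform
        print(f"default load failed ({err!r}); retrying with optimize=False")
        return loader(model_id, optimize=False)


# Usage (not executed here):
#   from voxcpm import VoxCPM
#   model = load_with_fallback(VoxCPM.from_pretrained)
```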
Strained or weird-sounding voice
Per the Hugging Face card, lower the cfg_value (e.g. from 2.0 toward 1.5) to relax adherence to the text; raise it for tighter prompt following. For long or expressive inputs the model may exhibit instability — chunk the text or reduce inference_timesteps if needed.
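Chunking long text can be as simple as grouping whole sentences under a character budget. The sketch below splits on ASCII sentence punctuation followed by whitespace and on full-width Chinese punctuation; the 200-character default is an arbitrary assumption, not a documented VoxCPM limit, and a single sentence longer than the budget is kept whole.

```python
import re

# Split after ., ! or ? when followed by whitespace, or directly after
# full-width Chinese sentence punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Usage (not executed here): synthesise each chunk, then join the audio,
# e.g. with numpy.concatenate if generate returns NumPy arrays.
#   parts = [model.generate(text=c, cfg_value=2.0, inference_timesteps=10)
#            for c in chunk_text(long_text)]
```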
Background noise in cloned voices
Set denoise=True (as the snippets above do) or enable "Prompt Speech Enhancement" in the Gradio UI. Either option pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning. Documented on the HF model card.
Accidentally loaded VoxCPM2 instead of 0.5B
If you called VoxCPM.from_pretrained() without the explicit "openbmb/VoxCPM-0.5B" argument, or ran the voxcpm CLI without --hf-model-id, you got VoxCPM2 (the new repo default — 2B, ~8 GB VRAM, 48 kHz, 30 languages). On the RTX 5070's 12 GB both fit, so this is a feature-mix surprise rather than an OOM. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.