What You'll Build
A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description like "A young woman, gentle and sweet voice".
Hardware data: RTX 5060 Ti (16 GB VRAM) · VoxCPM2 fits in ~8 GB, leaving ~8 GB of headroom for other workloads · Benchmark data: /check/voxcpm2/rtx-5060-ti
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | ~8 GB VRAM (per the official Hugging Face model card) | RTX 5060 Ti (16 GB) |
| RAM | 8 GB | — |
| Storage | ~4 GB (model weights + denoiser/ASR helpers) | — |
| Software | Python ≥ 3.10 and < 3.13, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 | — |
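Before installing, it can help to confirm that the environment meets the software row of the table. The following is a minimal sketch, not an official check: the helper names (`python_supported`, `version_at_least`, `preflight`) are hypothetical, and the version bounds are copied from the table above. It uses only the standard library unless PyTorch is already installed.

```python
import sys

# Bounds copied from the requirements table: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_supported(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON_EXCLUSIVE


def version_at_least(found, required):
    """Compare dotted version strings numerically, ignoring local tags like '+cu124'."""
    def parse(text):
        nums = []
        for piece in text.split("+")[0].split(".")[:3]:
            digits = ""
            for ch in piece:
                if not ch.isdigit():
                    break  # stop at suffixes such as 'a0' or 'rc1'
                digits += ch
            nums.append(int(digits) if digits else 0)
        return tuple(nums + [0] * (3 - len(nums)))  # pad so '12' equals '12.0.0'
    return parse(found) >= parse(required)


def preflight():
    """Collect pass/fail flags; torch-related keys are None when torch is absent."""
    report = {"python_ok": python_supported(), "torch_ok": None,
              "cuda_ok": None, "gpu_visible": None}
    try:
        import torch
    except ImportError:
        return report
    report["torch_ok"] = version_at_least(torch.__version__, "2.5.0")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda_ok"] = bool(cuda) and version_at_least(cuda, "12.0")
    report["gpu_visible"] = torch.cuda.is_available()
    return report

# Usage: print(preflight())
```

A `None` value means the check could not be run (PyTorch not installed yet), which is expected on a fresh environment.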
Installation
1. Install the voxcpm package
The canonical install is the published PyPI package, identical across both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:
pip install voxcpm
2. (Optional) Install from source for the web demo
If you want the Gradio playground, clone the repo and install in editable mode:
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .
Source: VoxCPM 2 Quick Start docs.
3. (Optional) Pre-download model weights
Weights download on first inference automatically, but you can pre-fetch them to control where they land:
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM2")
Running
Python — basic synthesis
import soundfile as sf
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM2")
wav = model.generate(
    text="VoxCPM2 is an innovative end-to-end TTS model.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
Note that the sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded — VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz. Higher inference_timesteps trades speed for quality.
Python — zero-shot voice cloning
Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing. The Quick Start docs cover both "Controllable Cloning" and "Ultimate Cloning" variants using the prompt_wav_path / reference_wav_path parameters.
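As a rough illustration of the clip-plus-transcript variant, the sketch below only assembles and validates the arguments for a cloning call. `build_clone_kwargs` is a hypothetical helper, and `prompt_text` is an assumed parameter name for the transcript that this guide does not confirm; `prompt_wav_path` is the name given above. Check the Quick Start docs for the exact signature before relying on it.

```python
from pathlib import Path


def build_clone_kwargs(text, prompt_wav_path, prompt_text,
                       cfg_value=2.0, inference_timesteps=10):
    """Validate the reference clip and transcript, then return generate() kwargs."""
    wav_path = Path(prompt_wav_path)
    if not wav_path.is_file():
        raise FileNotFoundError(f"reference clip not found: {wav_path}")
    transcript = prompt_text.strip()
    if not transcript:
        raise ValueError("prompt_text must be the transcript of the reference clip")
    return {
        "text": text,
        "prompt_wav_path": str(wav_path),
        "prompt_text": transcript,  # ASSUMPTION: parameter name not confirmed here
        "cfg_value": cfg_value,
        "inference_timesteps": inference_timesteps,
    }

# Usage (assumes `model` and `sf` from the basic synthesis example):
# kwargs = build_clone_kwargs("Text to speak.", "speaker.wav", "Words spoken in speaker.wav")
# wav = model.generate(**kwargs)
# sf.write("cloned.wav", wav, model.tts_model.sample_rate)
```

Validating the path up front gives a clear error before the model is invoked, rather than a failure deep inside audio loading.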
Python — voice design (new in v2)
VoxCPM2 lets you describe the desired voice in natural language inside the text itself, as shown on the HF card:
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("designed.wav", wav, model.tts_model.sample_rate)
Gradio web demo
If you installed from source (step 2), launch the local UI:
python app.py
Per the Quick Start docs, the web demo also downloads an additional ASR model (SenseVoice-Small) on first use.
Results
- Speed: RTF (Real-Time Factor) as low as ~0.30 on an NVIDIA RTX 4090 (i.e. 1 s of audio synthesised in ~0.3 s of GPU time), per the official OpenBMB model card and GitHub README; the same page reports ~0.13 with the Nano-vLLM accelerated path. RTX 5060 Ti numbers are not yet community-benchmarked; track /check/voxcpm2/rtx-5060-ti for when they land.
- VRAM usage: ~8 GB per the official model comparison table (also confirmed in the VoxCPM 2 FAQ table). Leaves about half of the RTX 5060 Ti's 16 GB free.
- Quality notes: 48 kHz studio-quality output, tokenizer-free diffusion-autoregressive architecture (LocEnc → TSLM → RALM → LocDiT), 30 supported languages plus several Chinese dialects, Apache-2.0 license (free for commercial use). MLX 8-bit / 4-bit variants exist under the mlx-community namespace for Apple Silicon, but on NVIDIA the stock bf16 path is the supported route.
For the full benchmark data, see /check/voxcpm2/rtx-5060-ti.
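To produce an RTF figure on your own card, divide the wall-clock synthesis time by the duration of the generated audio. The sketch below shows that arithmetic; `real_time_factor` and `timed_rtf` are hypothetical helpers, and the timing assumes the synthesis call returns only once the waveform is available in host memory.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesising / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


def timed_rtf(synthesize, sample_rate):
    """Run a zero-argument synthesis callable and return (waveform, RTF)."""
    start = time.perf_counter()
    wav = synthesize()
    elapsed = time.perf_counter() - start
    return wav, real_time_factor(elapsed, len(wav), sample_rate)

# Usage (assumes `model` from the basic synthesis example):
# wav, rtf = timed_rtf(
#     lambda: model.generate(text="Benchmark sentence.", cfg_value=2.0, inference_timesteps=10),
#     model.tts_model.sample_rate,
# )
# print(f"RTF: {rtf:.2f}")
```

The first call typically includes warm-up cost, so it is worth discarding one run and averaging several later ones before comparing against published figures.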
Troubleshooting
torch.compile / Triton errors during warm-up
Per the VoxCPM 2 FAQ, torch._dynamo.exc.Unsupported or Triton import failures on launch are most common on Windows or older CUDA stacks. Quick fix:
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", optimize=False)
For a permanent fix, pin matching Triton / PyTorch versions (the FAQ table maps PyTorch 2.5 → Triton 3.1, 2.6 → 3.2, 2.7 → 3.3, 2.8 → 3.4); on Windows install triton-windows.
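The version pairs above can be turned into a small lookup that suggests a pin for the installed PyTorch. This is a sketch only: `matching_triton` and `pin_command` are hypothetical helpers, the table holds just the four pairs quoted from the FAQ, and applying the same series numbers to `triton-windows` is an assumption to verify against that package's own documentation.

```python
# PyTorch (major, minor) -> Triton release series, as quoted from the FAQ table.
TRITON_FOR_TORCH = {
    (2, 5): "3.1",
    (2, 6): "3.2",
    (2, 7): "3.3",
    (2, 8): "3.4",
}


def matching_triton(torch_version):
    """Return the Triton series for a PyTorch version string, or None if unlisted."""
    parts = torch_version.split("+")[0].split(".")  # drop local tags like '+cu124'
    try:
        key = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        return None
    return TRITON_FOR_TORCH.get(key)


def pin_command(torch_version, windows=False):
    """Build a pip command pinning the matching series, or None if no match."""
    series = matching_triton(torch_version)
    if series is None:
        return None
    # ASSUMPTION: triton-windows follows the same series numbering.
    package = "triton-windows" if windows else "triton"
    return f'pip install "{package}=={series}.*"'

# Usage:
# import torch
# print(pin_command(torch.__version__))
```

Returning `None` for unlisted versions avoids guessing a pairing the FAQ does not state.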
Could not load libtorchcodec when using reference audio
The FAQ recommends installing FFmpeg system-wide and then running pip install torchcodec. If torchcodec cannot be installed, it suggests forcing the soundfile backend with torchaudio.set_audio_backend("soundfile") instead.
Tight on VRAM with other models loaded
Set load_denoiser=False to skip loading the ZipEnhancer reference-audio denoiser, as documented in the Quick Start docs — the denoiser is only required when you want to enhance prompt or reference audio for voice cloning, and per the FAQ it runs on CPU even when CUDA is active. A 16 GB RTX 5060 Ti has comfortable headroom for the default config, but this knob is useful when stacking VoxCPM2 alongside an LLM or image model.