VoxCPM-0.5B on RTX 4060 Ti 16GB: Zero-Shot Voice Cloning TTS in ~5 GB VRAM

tts · beginner · 5GB+ VRAM · May 21, 2026
models
tools
  • VoxCPM
prerequisites
  • NVIDIA GPU with >= 6 GB VRAM (RTX 4060 Ti 16GB has ~11 GB headroom over peak)
  • Python >= 3.10 (<3.13)
  • PyTorch >= 2.5.0 with CUDA >= 12.0

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B — a 0.5B-parameter TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav.

Hardware data: RTX 4060 Ti 16GB · VoxCPM-0.5B fits in ~5 GB, leaving ~11 GB headroom · Benchmark data: /check/voxcpm/rtx-4060-ti-16gb

ℹ️ Recipe targets VoxCPM-0.5B (legacy), not VoxCPM1.5 or VoxCPM2. The OpenBMB repo now ships three versions, per the official comparison table:
  • VoxCPM2: 2B backbone, ~8 GB VRAM, 48 kHz, 30 languages (latest)
  • VoxCPM1.5: 0.6B, ~6 GB VRAM, 44.1 kHz, 2 languages (stable)
  • VoxCPM-0.5B: 0.5B, ~5 GB VRAM, 16 kHz, 2 languages (legacy)
All three fit comfortably on a 16 GB card, so the choice is a feature trade-off rather than a question of headroom. This recipe pins 0.5B because it is the original release and the most mature for Chinese/English voice cloning workflows. If you want 48 kHz audio or languages other than Chinese and English, change the from_pretrained argument to openbmb/VoxCPM2. Otherwise, pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained so the helper doesn't default to a newer checkpoint.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 6 GB VRAM (model needs ~5 GB per official comparison table) | RTX 4060 Ti 16GB |
| RAM | 8 GB | |
| Storage | ~2 GB (model weights + denoiser/ASR helpers) | |
| Software | Python >= 3.10 (<3.13), PyTorch >= 2.5.0, CUDA >= 12.0 (source) | |
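
The Python bounds in the requirements above can be checked before installing anything. The following is a minimal sketch, not part of the VoxCPM tooling: the function name python_version_ok is hypothetical, and the PyTorch lines only report what is already installed (they are skipped if PyTorch is missing).

```python
import sys

# Bounds taken from the requirements: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_ok(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    print("Python OK:", python_version_ok())
    try:
        import torch  # optional: only reported if PyTorch is already installed
        print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("PyTorch not installed yet")
```

If "CUDA available" prints False on a machine with an NVIDIA GPU, the installed PyTorch wheel is probably a CPU-only build.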

Installation

1. Install the voxcpm package

The simplest path is the published PyPI package:

pip install voxcpm

This is the canonical install per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README.

2. (Optional) Install from source for the web demo

If you want the Gradio playground, clone the repo and install in editable mode:

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

Source: VoxCPM Quick Start docs.

3. (Optional) Pre-download model weights

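Weights download automatically the first time the model is loaded, but you can pre-fetch them to control where they land:

from huggingface_hub import snapshot_download
# Downloads into the default Hugging Face cache; pass cache_dir=... or local_dir=... to choose another location.
snapshot_download("openbmb/VoxCPM-0.5B")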

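If you plan to use the denoiser / ASR helpers for cloning (recommended), also pre-fetch the ZipEnhancer model referenced on the HF card. It is hosted on ModelScope rather than the Hugging Face Hub, so fetch it with the modelscope client:

from modelscope import snapshot_download as ms_snapshot_download
ms_snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")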

Running

Python — basic synthesis

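import soundfile as sf
from voxcpm import VoxCPM

# Pin the legacy 0.5B checkpoint explicitly.
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model.",
    prompt_wav_path=None,     # no reference clip: plain TTS, no cloning
    cfg_value=2.0,            # guidance strength
    inference_timesteps=10,   # more steps: better quality, slower synthesis
    normalize=True,           # run text normalisation on the input
    denoise=True,             # run the denoiser on the reference clip, when one is given
)
# VoxCPM-0.5B outputs 16 kHz audio, so write the file at 16000 Hz.
sf.write("output.wav", wav, 16000)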

Note the 16000 sample rate: VoxCPM-0.5B emits 16 kHz audio. Raising inference_timesteps improves quality at the cost of speed; lowering it does the reverse. Passing the full "openbmb/VoxCPM-0.5B" repo path keeps you on the legacy 0.5B checkpoint; omitting it can resolve to a newer variant.

Python — zero-shot voice cloning

Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:

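# Reuses `model` and `sf` from the basic synthesis snippet.
wav = model.generate(
    text="Hello, this is your cloned voice speaking.",
    prompt_wav_path="reference.wav",                       # short clip of the target speaker
    prompt_text="reference transcript matching the wav",   # what is said in reference.wav
    cfg_value=2.0,
    inference_timesteps=10,
    denoise=True,                                          # clean up the reference clip first
)
sf.write("cloned.wav", wav, 16000)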
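
Before cloning, it can help to confirm what the reference clip actually contains. The sketch below uses only Python's standard wave module (which reads uncompressed PCM .wav files). The function names and the 3 to 30 second thresholds are illustrative assumptions, not limits documented by VoxCPM.

```python
import wave


def inspect_reference_clip(path):
    """Read basic properties of a PCM .wav file using only the standard library."""
    with wave.open(path, "rb") as clip:
        sample_rate = clip.getframerate()
        channels = clip.getnchannels()
        duration_s = clip.getnframes() / sample_rate
    return {"sample_rate": sample_rate, "channels": channels, "duration_s": duration_s}


def reference_clip_warnings(info, min_s=3.0, max_s=30.0):
    """Return warnings about clip length; min_s and max_s are assumed thresholds."""
    warnings = []
    if info["duration_s"] < min_s:
        warnings.append(f"clip is shorter than {min_s} s")
    if info["duration_s"] > max_s:
        warnings.append(f"clip is longer than {max_s} s")
    return warnings
```

A call such as reference_clip_warnings(inspect_reference_clip("reference.wav")) returns an empty list when the clip length falls inside the assumed range.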

Gradio web demo

If you installed from source (step 2), launch the local UI:

python app.py

Results

  • Speed: RTF (Real-Time Factor) as low as 0.17 on a consumer-grade NVIDIA RTX 4090, i.e. 1 s of audio synthesised in ~0.17 s of GPU time, per the official OpenBMB README and HF model card. The RTX 4060 Ti 16GB has roughly a third of the RTX 4090's memory bandwidth and FP16 throughput, so expect a noticeably higher RTF; no community benchmark for the RTX 4060 Ti 16GB has been published yet. Track /check/voxcpm/rtx-4060-ti-16gb for when one is.
  • VRAM usage: ~5 GB per the official model comparison table. That leaves ~11 GB headroom on the RTX 4060 Ti 16GB — comfortable margin for the denoiser/ASR helper models, longer reference clips, and concurrent workloads.
  • Quality notes: 16 kHz output sample rate, 12.5 Hz LM token rate, 2 supported languages (Chinese, English) per the official Models & Versions table. Voice cloning is "continuation-only" — the model continues from your reference clip rather than fully replacing the speaker's identity from scratch. Trained on a 1.8 M-hour bilingual corpus per the HF model card. Apache-2.0 license — commercial use is permitted.
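
Since no RTX 4060 Ti 16GB figure is published, you can measure RTF on your own card. The sketch below implements the definition from the Speed bullet (synthesis time divided by audio duration). The helper names are hypothetical, and measure_rtf assumes generate() returns a one-dimensional sequence of samples, as the snippets in the Running section imply.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=16000):
    """RTF = time spent synthesising / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def measure_rtf(generate, sample_rate=16000):
    """Time one call to generate(), a zero-argument callable returning samples."""
    start = time.perf_counter()
    samples = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)
```

For example, measure_rtf(lambda: model.generate(text="A test sentence.")) times a single synthesis. The first call usually includes model warm-up (and compilation when torch.compile is enabled), so run it once untimed and average several later runs.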

For the full benchmark data, see /check/voxcpm/rtx-4060-ti-16gb.

Troubleshooting

torch.compile errors on launch

The VoxCPM helper enables torch.compile optimisations by default. The Quick Start docs recommend passing optimize=False to VoxCPM.from_pretrained if you hit torch.compile platform issues — common on Windows or older CUDA stacks.
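
One way to apply this automatically is to retry the load with optimize=False when the default load fails. This is a minimal sketch under stated assumptions: load_with_compile_fallback is a hypothetical helper, the loader argument stands in for VoxCPM.from_pretrained, and the optimize keyword is taken from the Quick Start recommendation above. It only catches errors raised while loading; a compile error that first appears during generate() would not be caught.

```python
def load_with_compile_fallback(loader, repo_id):
    """Try the default (optimised) load first, then retry with optimize=False."""
    try:
        return loader(repo_id)
    except Exception as err:  # broad on purpose: the failure type varies by platform
        print(f"Optimised load failed ({err!r}); retrying with optimize=False")
        return loader(repo_id, optimize=False)
```

Usage would look like model = load_with_compile_fallback(VoxCPM.from_pretrained, "openbmb/VoxCPM-0.5B").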

Strained or weird-sounding voice

Per the Hugging Face card, lower the cfg_value (e.g. from 2.0 toward 1.5) if the voice sounds strained; raise it for closer adherence to the input text. Long or highly expressive inputs can make the model unstable, so chunk the text or reduce inference_timesteps if needed.
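
Chunking long input can be done with a simple sentence splitter. The sketch below is a deliberately naive example: chunk_text and the 200-character default are assumptions, not VoxCPM settings. It splits after English and Chinese sentence-ending punctuation, so abbreviations and decimals such as "3.5" will be split too.

```python
import re

# Split after sentence-ending punctuation, covering English and Chinese marks.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()  # sentences are joined with a space
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to model.generate separately and the resulting arrays joined, for instance with numpy.concatenate, before writing the file.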

Background noise in cloned voices

Set denoise=True (as the snippets above already do) or enable "Prompt Speech Enhancement" in the Gradio UI. Either option pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning, as documented on the HF model card.

Accidentally pulled VoxCPM2 or VoxCPM1.5 weights

If you typed VoxCPM.from_pretrained() without an explicit repo argument, or used "openbmb/VoxCPM" (the umbrella repo, which returns 401 on its rendered page), the helper may resolve to the latest variant. On a 16 GB card all three variants fit, so this is a feature-mix surprise rather than an OOM — VoxCPM2 emits 48 kHz audio across 30 languages, VoxCPM1.5 emits 44.1 kHz across Chinese/English, VoxCPM-0.5B emits 16 kHz across Chinese/English (per the official comparison table). Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.
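
A wrong sample rate in sf.write makes the audio play too fast or too slow, so it can help to derive the rate from the repo id instead of hard-coding 16000. The lookup below is a sketch built from the comparison-table figures quoted above. The helper name is hypothetical, and the "openbmb/VoxCPM1.5" repo id is an assumption, since only the other two ids appear in this recipe.

```python
# Output sample rates per the comparison table quoted in this recipe.
# "openbmb/VoxCPM1.5" is an assumed repo id; verify it before relying on it.
SAMPLE_RATES = {
    "openbmb/VoxCPM-0.5B": 16000,
    "openbmb/VoxCPM1.5": 44100,
    "openbmb/VoxCPM2": 48000,
}


def sample_rate_for(repo_id):
    """Return the output sample rate for a known repo id, or raise ValueError."""
    try:
        return SAMPLE_RATES[repo_id]
    except KeyError:
        known = ", ".join(sorted(SAMPLE_RATES))
        raise ValueError(f"Unknown repo id {repo_id!r}; expected one of: {known}") from None
```

With this in place, the write call becomes sf.write("output.wav", wav, sample_rate_for("openbmb/VoxCPM-0.5B")), and an unpinned or mistyped id fails loudly instead of producing mis-pitched audio.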