VoxCPM-0.5B on RTX 4070: Zero-Shot Voice Cloning TTS in ~5 GB VRAM

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B — a 0.5B-parameter, tokenizer-free TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav. On the RTX 4070's 12 GB the model uses well under half the card, so it fits comfortably even with a display attached.

Hardware data: RTX 4070 (12 GB VRAM) · VoxCPM-0.5B fits in ~5 GB leaving ~7 GB free · See benchmark data

ℹ️ This recipe pins VoxCPM-0.5B (legacy), not VoxCPM1.5 or VoxCPM2. Per the official Models & Versions table, the OpenBMB repo ships three versions:

Version	Size	VRAM	Sample rate	Languages	Status
VoxCPM2	2B backbone	~8 GB	48 kHz	30	latest, now the repo's default
VoxCPM1.5	0.6B	~6 GB	44.1 kHz	2	stable
VoxCPM-0.5B	0.5B	~5 GB	16 kHz	2	legacy

All three fit on the RTX 4070's 12 GB, so the choice here is about feature trade-offs, not headroom. This recipe pins 0.5B because it is the original, most-cited release for Chinese/English voice cloning. The upstream Quick Start and CLI now default to openbmb/VoxCPM2: the GitHub README headlines VoxCPM2, and its sample code calls from_pretrained("openbmb/VoxCPM2", load_denoiser=False). You must therefore pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained (shown below), or you will silently load the 2B VoxCPM2 checkpoint instead. If you want 48 kHz audio or one of the 30 languages, switch to VoxCPM2 deliberately.

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (model needs ~5 GB per official Models & Versions table)	RTX 4070 (12 GB)
RAM	8 GB	—
Storage	~2 GB (1.30 GB `pytorch_model.bin` + 0.30 GB `audiovae.pth` + denoiser/ASR helpers)	~1.6 GB on disk
Software	Python >= 3.10 (<3.13), PyTorch >= 2.5.0, CUDA >= 12.0	—

The checkpoint on the canonical HuggingFace card is two weight files — pytorch_model.bin (~1.30 GB) plus the audio VAE audiovae.pth (~0.30 GB), about 1.6 GB on disk total — so 2 GB of free storage covers the weights (the denoiser/ASR helpers add more if you enable them). There is no model.safetensors; the weights ship as a .bin plus the .pth VAE.
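The storage arithmetic above can be checked before downloading. The following is a minimal sketch using only the Python standard library; the helper names and the 0.4 GB safety margin are illustrative assumptions, and it assumes the weights land in the default Hugging Face cache under your home directory.

```python
import shutil
from pathlib import Path

# File sizes quoted in this recipe for the VoxCPM-0.5B checkpoint, in GB.
WEIGHT_FILES_GB = {"pytorch_model.bin": 1.30, "audiovae.pth": 0.30}


def total_weights_gb(files=WEIGHT_FILES_GB):
    """Sum the quoted checkpoint file sizes."""
    return sum(files.values())


def has_room_for_weights(path=None, margin_gb=0.4):
    """Return True if the filesystem holding `path` can take the weights.

    margin_gb is an illustrative assumption, not an upstream figure.
    """
    target = Path(path) if path is not None else Path.home()
    free_gb = shutil.disk_usage(target).free / 1e9
    return free_gb >= total_weights_gb() + margin_gb


print(f"weights: {total_weights_gb():.2f} GB, room available: {has_room_for_weights()}")
```

Pass a different path to has_room_for_weights if your cache lives on another drive.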

Installation

1. Install PyTorch (Ada / RTX 4070)

The RTX 4070 is an Ada Lovelace (AD104, sm_89) GPU. Unlike Blackwell RTX 50-series cards, no special CUDA-12.8 (cu128) wheel is needed — the default pip install torch already ships sm_89 kernels, so a current PyTorch satisfies the RTX 4070:

pip install torch

VoxCPM-0.5B runs through standard PyTorch attention; it does not require FlashAttention-2, so there is no attention-backend flag to reason about on this card.
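To confirm that the installed wheel actually sees the card as sm_89, a quick check along these lines may help. It is a minimal sketch: arch_tag and report_gpu are illustrative helper names, not part of PyTorch or VoxCPM, and PyTorch is imported lazily so the helper still runs where it is not installed.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_gpu():
    """Print the first visible GPU; return its arch tag, or None if unavailable."""
    try:
        import torch  # lazy import: arch_tag above works without PyTorch
    except ImportError:
        print("PyTorch is not installed")
        return None
    if not torch.cuda.is_available():
        print("PyTorch cannot see a CUDA device")
        return None
    tag = arch_tag(torch.cuda.get_device_capability(0))
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"{torch.cuda.get_device_name(0)}: {tag}, {total_gb:.1f} GB total")
    return tag


report_gpu()
```

On an RTX 4070 the tag should read sm_89; any other result suggests a different GPU is selected as device 0.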

2. Install the `voxcpm` package

The canonical install is the published PyPI package, per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

3. (Optional) Install from source for the Gradio web demo

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

4. (Optional) Pre-download model weights

Weights download on first inference automatically, but you can pre-fetch them to control where they land — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull the newer VoxCPM2:

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")

If you plan to use the denoiser / ASR helpers for cleaner cloning (recommended), also pre-fetch the ZipEnhancer and SenseVoice-Small models that the HF card references:

from modelscope import snapshot_download
snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")
snapshot_download("iic/SenseVoiceSmall")

Running

Python — basic synthesis

The Python API follows the canonical VoxCPM-0.5B model card — load the checkpoint and synthesize a clip:

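import soundfile as sf
from voxcpm import VoxCPM

# Pass the repo id explicitly so the legacy 0.5B checkpoint is loaded
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model.",
    prompt_wav_path=None,      # no reference clip: plain synthesis
    prompt_text=None,          # transcript of the reference clip, unused here
    cfg_value=2.0,             # guidance strength
    inference_timesteps=10,    # higher = better quality, slower
    normalize=True,            # text normalization
    denoise=True,              # denoise the reference clip when one is given
)
sf.write("output.wav", wav, 16000)  # VoxCPM-0.5B emits 16 kHz audio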

Note the 16000 sample rate: VoxCPM-0.5B emits 16 kHz audio. Raising inference_timesteps improves quality at the cost of speed; lowering it does the reverse. Passing the full "openbmb/VoxCPM-0.5B" repo path keeps you on the legacy 0.5B checkpoint; omitting it now resolves to VoxCPM2.

Python — zero-shot voice cloning

Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:

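# Reuses `model` and `sf` from the basic synthesis snippet
wav = model.generate(
    text="Hello, this is your cloned voice speaking.",
    prompt_wav_path="reference.wav",                       # short clip of the target speaker
    prompt_text="reference transcript matching the wav",   # what is spoken in the clip
    cfg_value=2.0,
    inference_timesteps=10,
    denoise=True,                                          # clean up the reference clip first
)
sf.write("cloned.wav", wav, 16000)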
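Before cloning, it can be worth confirming that the reference clip really is a short recording. The sketch below uses the standard-library wave module, which reads PCM .wav files only. The helper names and the 3 to 30 second bounds are illustrative assumptions, not limits stated by the model card. It demonstrates itself on a generated silent clip so it runs without any input file.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


def check_reference(path, min_s=3.0, max_s=30.0):
    """Return a list of warnings; the duration bounds are assumptions."""
    _, _, duration = describe_wav(path)
    warnings = []
    if duration < min_s:
        warnings.append(f"clip is only {duration:.1f} s long")
    if duration > max_s:
        warnings.append(f"clip is {duration:.1f} s long; consider trimming it")
    return warnings


# Demo: write 5 seconds of 16-bit mono silence at 16 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000 * 5)

demo_info = describe_wav(demo_path)
print(demo_info, check_reference(demo_path))
```

In practice you would call check_reference("reference.wav") and fix any warnings before passing the file as prompt_wav_path.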

CLI

After installation the voxcpm entry point is available. Pass --hf-model-id openbmb/VoxCPM-0.5B to stay on this recipe's variant (the legacy 0.5B CLI is documented on the HF model card):

voxcpm --text "Hello VoxCPM" --output out.wav --hf-model-id openbmb/VoxCPM-0.5B
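For several lines of text, the same command can be scripted. This is a minimal sketch that uses only the three flags shown above; the helper names and the line_001.wav naming scheme are hypothetical. By default it only prints the commands, and it runs them only when run=True. Each CLI call starts a new process and reloads the model, so for large batches the Python API is likely the better fit.

```python
import shlex
import subprocess

MODEL_ID = "openbmb/VoxCPM-0.5B"


def build_command(text, output, model_id=MODEL_ID):
    """Build the argv list for one voxcpm call, always pinning the model id."""
    return ["voxcpm", "--text", text, "--output", output, "--hf-model-id", model_id]


def synthesize_all(lines, run=False):
    """Build one command per input line; execute them only when run=True."""
    commands = [
        build_command(text, f"line_{index:03d}.wav")
        for index, text in enumerate(lines, start=1)
    ]
    for command in commands:
        print(shlex.join(command))  # shell-quoted form, safe to copy and paste
        if run:
            subprocess.run(command, check=True)
    return commands


commands = synthesize_all(["Hello VoxCPM", "Second sentence."])
```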

Gradio web demo

If you installed from source (step 3), launch the local UI:

python app.py

First run downloads ~1.6 GB of weights; subsequent runs load from the local cache.

Results

Speed: No first-party RTX 4070 benchmark exists yet. The only published figure is an RTF (real-time factor) of ~0.17 for VoxCPM-0.5B on an RTX 4090, per the official Models & Versions table. The RTX 4090 is a much larger, faster card (24 GB, far more CUDA cores, and ~2× the memory bandwidth), and TTS inference is memory-bandwidth-sensitive, so that 0.17 should not be read as an RTX 4070 number. Track or contribute a measurement at /check/voxcpm/rtx-4070.
VRAM usage: ~5 GB per the official Models & Versions table. On disk the model is ~1.6 GB (1.30 GB pytorch_model.bin + 0.30 GB audiovae.pth); the ~5 GB runtime peak covers activations plus the optional denoiser/ASR helper models. That fits the RTX 4070's 12 GB comfortably: roughly 7 GB stays free even with a display attached, so there is no need for offload or a smaller variant on this card.
Quality notes: 16 kHz output sample rate, 2 supported languages (Chinese, English), continuation-style voice cloning — all per the official Models & Versions table. Per the HF model card, the model "comprehends text to infer and generate appropriate prosody," trained on a 1.8 million-hour bilingual corpus. Apache-2.0 license — commercial use is permitted.
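If you want to produce your own RTX 4070 speed figure, RTF is simply generation time divided by the duration of the audio produced. The sketch below is one way to measure it, under assumptions: measure_rtf is an illustrative helper, and it expects a callable that returns a one-dimensional sequence of samples, as model.generate appears to in the snippets above. A stand-in generator is used here so the code runs without a GPU.

```python
import time


def real_time_factor(compute_seconds, audio_seconds):
    """RTF = time spent generating / duration of audio produced (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def measure_rtf(generate, text, sample_rate=16000):
    """Time one call to generate(text) and return (rtf, audio_seconds)."""
    start = time.perf_counter()
    samples = generate(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return real_time_factor(elapsed, audio_seconds), audio_seconds


# Stand-in generator: 2 seconds of silence at 16 kHz instead of a real model call.
fake_rtf, fake_seconds = measure_rtf(lambda text: [0.0] * 32000, "hello")
print(f"audio: {fake_seconds:.1f} s, RTF: {fake_rtf:.4f}")
```

With the real model you might pass lambda t: model.generate(text=t, cfg_value=2.0, inference_timesteps=10). Run one untimed warm-up call first, since the first generation typically includes one-off setup cost.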

For the full benchmark data, see /check/voxcpm/rtx-4070.

Troubleshooting

Accidentally loaded VoxCPM2 instead of 0.5B

If you called VoxCPM.from_pretrained() without the explicit "openbmb/VoxCPM-0.5B" argument, or ran the voxcpm CLI without --hf-model-id, you got VoxCPM2 (the new repo default — 2B, ~8 GB VRAM, 48 kHz, 30 languages per the Models & Versions table). On the RTX 4070's 12 GB both fit, so this is a feature-mix surprise rather than an OOM. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.

Out of memory on a smaller card or with very long input

VoxCPM-0.5B needs ~5 GB; the 12 GB RTX 4070 has ample headroom, so a fresh-run OOM here usually means another process is holding VRAM (check nvidia-smi). Very long single-shot inputs grow the KV cache and can exhaust memory on smaller cards: issue #52 reports a "KV cache is full" failure on a 15 GB Colab GPU when generating very long passages. Split long text into shorter segments.
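Splitting long text can be done at sentence boundaries so each segment stays short. This is a minimal sketch, not an upstream utility: split_text is a hypothetical helper, the 200-character limit is an arbitrary assumption to tune, and the sentence pattern covers only basic English and Chinese punctuation.

```python
import re

# Split after . ! ? when followed by whitespace, or directly after 。！？
SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
CJK_END = ("。", "！", "？")


def split_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if not current:
            candidate = sentence
        else:
            separator = "" if current.endswith(CJK_END) else " "
            candidate = current + separator + sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", max_chars=10))
```

Each chunk can then be passed to model.generate in turn and the resulting arrays joined, for example with numpy.concatenate, before writing a single file.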

Strained or weird-sounding voice

Per the Hugging Face card, lower cfg_value (e.g. from 2.0 toward 1.5) to relax adherence to the text, or raise it for tighter adherence. For long or expressive inputs the model may become unstable. In that case, either enable the API's retry_badcase=True mode (documented on the model card), which re-runs unstable generations automatically, or chunk the text and reduce inference_timesteps.

Background noise in cloned voices

Set denoise=True (as in the snippets above) or enable "Prompt Speech Enhancement" in the Gradio UI. Either option pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning, as documented on the HF model card.

Want to add your own benchmark? Contribute at /check/voxcpm/rtx-4070.