VoxCPM-0.5B on RTX 4070 Ti SUPER: Zero-Shot Voice Cloning TTS in ~5 GB VRAM

What You'll Build

A local zero-shot voice-cloning text-to-speech setup running VoxCPM-0.5B on your RTX 4070 Ti SUPER. Clone any voice from a few seconds of reference audio and synthesize natural, context-aware speech — all offline, no API calls.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · ~5 GB peak · See benchmark data

ℹ️ This recipe targets openbmb/VoxCPM-0.5B (the original Sept-2025 release), not the newer VoxCPM1.5 or VoxCPM2. OpenBMB now ships several checkpoints under the same voxcpm package, and the package's newer GitHub examples default to VoxCPM2 — so always pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained to stay on the variant this recipe documents. The "0.5B" is the parameter count, not the VRAM cost — at inference it uses ~5 GB, leaving most of your 16 GB card free.

Requirements

Component	Minimum	Tested
GPU	6 GB+ VRAM	RTX 4070 Ti SUPER (16 GB)
RAM	8 GB	—
Storage	2 GB (model weights)	~1.6 GB
Software	Python 3.10+, PyTorch 2.1+, CUDA 12.0+	—

The checkpoint on the canonical HuggingFace card is two files — pytorch_model.bin (~1.3 GB) plus the audio VAE audiovae.pth (~0.3 GB), about 1.6 GB on disk total — so 2 GB of free storage covers it. (There is no model.safetensors; the weights ship as a .bin plus the .pth VAE.)

Installation

1. Install VoxCPM

The canonical HuggingFace card and the official OpenBMB/VoxCPM repo both install via PyPI:

pip install voxcpm

This pulls in the voxcpm package and its dependencies — everything the Python workflow below needs.

2. Download the model weights

The weights download automatically on first run, or you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull a newer variant:

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")

About 1.6 GB lands in your HuggingFace cache (pytorch_model.bin + audiovae.pth).
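To confirm what actually landed on disk, you can sum the file sizes under the folder that snapshot_download returns. The following is a minimal sketch, not part of the voxcpm package: dir_size_gb and fetch_and_report are hypothetical helper names. The total also counts small extras such as config files, so expect roughly the figure above rather than an exact match.

```python
import os


def dir_size_gb(root: str) -> float:
    """Return the total size of all files under root, in GB (10^9 bytes)."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # getsize follows symlinks, which matters because the HuggingFace
            # cache usually stores snapshot files as links to shared blobs.
            if os.path.exists(path):
                total_bytes += os.path.getsize(path)
    return total_bytes / 1e9


def fetch_and_report(repo_id: str = "openbmb/VoxCPM-0.5B") -> float:
    """Download (or reuse) the snapshot and print its on-disk size."""
    # Imported lazily so dir_size_gb works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id)  # returns the local snapshot path
    size_gb = dir_size_gb(local_dir)
    print(f"{local_dir}: {size_gb:.2f} GB")
    return size_gb
```

Calling fetch_and_report() once before the first synthesis run moves the download out of the from_pretrained call, and a result far above 1.6 GB is a hint that a different checkpoint was fetched.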

Running

The Python API follows the canonical VoxCPM-0.5B model card — load the checkpoint and synthesize a clip with a cloned voice:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model, designed to generate highly expressive speech.",
    prompt_wav_path="reference.wav",   # a few seconds of the target voice (None to skip cloning)
    prompt_text="This is the reference transcript.",
    cfg_value=2.0,             # LM guidance — higher tracks the prompt more closely
    inference_timesteps=10,    # higher = better quality, lower = faster
    normalize=True,            # enable external text-normalization
    denoise=True,              # enable external denoiser for the reference clip
    retry_badcase=True,        # auto-retry on unstable ("unstoppable") generations
)

sf.write("output.wav", wav, 16000)

The card writes the output at the 16000 sample rate — VoxCPM-0.5B emits 16 kHz audio. A command-line entry point is also available after install (voxcpm --help, or python -m voxcpm.cli --help); for example:

voxcpm --text "Hello VoxCPM" --output out.wav

First run downloads the weights; subsequent runs load from cache. The Python example writes output.wav; the CLI example writes out.wav.

Results

Speed: No first-party RTX 4070 Ti SUPER benchmark exists yet (contribute one here). For reference, the canonical model card reports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090. RTF is synthesis time divided by audio duration, so lower is faster. The 4090 is a faster, larger card (24 GB, more CUDA cores and memory bandwidth), so treat 0.17 as a best-case reference and expect a somewhat higher RTF on the RTX 4070 Ti SUPER. The install and workflow are identical (both Ada sm_89); only wall-clock time differs.
VRAM usage: ~5 GB peak at inference, comfortably within the RTX 4070 Ti SUPER's 16 GB envelope (see benchmark data). The ~1.6 GB of on-disk weights (language model plus audio VAE), together with activations and runtime overhead, stay well under the card's capacity.
Quality notes: VoxCPM produces context-aware prosody — per the model card, it "comprehends text to infer and generate appropriate prosody." Reference audio of a few seconds works best for cloning. Bilingual: Chinese and English. Apache-2.0 license (per the model card) — commercial use permitted, though the card notes the release is intended for research and development and recommends testing before production use.

For the full benchmark data, see /check/voxcpm/rtx-4070-ti-super.
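If you want to measure your own card, the sketch below shows one way to compute RTF and peak VRAM for a single generate call. It is an illustration under assumptions, not a benchmark harness from the voxcpm project: real_time_factor and benchmark_once are hypothetical names, the model is assumed to be a loaded VoxCPM instance whose generate() returns a 1-D array of samples at 16 kHz (as in the Running section), and it times a non-streaming call, so the number is not directly comparable to the streaming RTF quoted on the model card.

```python
import time


def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int = 16000) -> float:
    """RTF = synthesis time / audio duration. Below 1.0 is faster than real time."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_s / audio_seconds


def benchmark_once(model, sample_rate: int = 16000, **generate_kwargs) -> dict:
    """Time one generate() call and report RTF plus peak PyTorch VRAM.

    generate_kwargs are passed through unchanged, for example text=...,
    prompt_wav_path=..., prompt_text=... as in the Running section.
    """
    import torch  # imported lazily so real_time_factor works without PyTorch

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    wav = model.generate(**generate_kwargs)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    return {
        "rtf": real_time_factor(elapsed, len(wav), sample_rate),
        # Memory held by PyTorch tensors only; nvidia-smi will read higher
        # because it also counts the CUDA context and the allocator's cache.
        "peak_vram_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```

The first call after loading usually includes one-time warm-up cost, so run benchmark_once twice and report the second result. As a sanity check on the arithmetic, 2 seconds of compute for 10 seconds of audio gives an RTF of 0.2.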

Troubleshooting

Accidentally pulled VoxCPM2 weights

The voxcpm package's newer examples default from_pretrained to the larger VoxCPM2 checkpoint. On a 16 GB card this shows up as unexpected behavior rather than an OOM: VoxCPM2 and VoxCPM-0.5B differ in language coverage and output sample rate, so the 16000 passed to sf.write in this recipe would no longer match the audio. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.

Out of memory on a smaller card or with very long input

VoxCPM-0.5B needs ~5 GB; the 16 GB RTX 4070 Ti SUPER has ample headroom, so a fresh-run OOM here usually means another process is holding VRAM (check nvidia-smi). Very long single-shot inputs grow the KV cache and can exhaust memory on smaller cards: issue #52 reports a "KV cache is full" failure on a 15 GB Colab GPU when generating very long passages. Split long text into shorter segments and synthesize them one at a time.
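One way to split long text is to pack whole sentences into segments under a character budget and synthesize each segment separately. The sketch below is an illustration, not a voxcpm feature: split_into_segments and synthesize_long are hypothetical helpers, the 200-character default is an arbitrary assumption to tune for your card, and generate() is assumed to return a 1-D NumPy-compatible array as in the Running section.

```python
import re

# Split after ASCII terminators followed by whitespace, or directly after
# full-width Chinese terminators (which are usually not followed by a space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_into_segments(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own segment.
    """
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            segments.append(current)  # budget exceeded: close the segment
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def synthesize_long(model, text: str, max_chars: int = 200, **generate_kwargs):
    """Generate each segment separately and join the audio end to end."""
    import numpy as np  # lazy import so the splitter has no dependencies

    clips = [
        model.generate(text=segment, **generate_kwargs)
        for segment in split_into_segments(text, max_chars)
    ]
    if not clips:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(clips)
```

Passing the same prompt_wav_path and prompt_text for every segment keeps the cloned voice consistent across the joins. The splitter is deliberately simple: it leaves decimals such as 0.5B alone because no whitespace follows the dot, but it will break after abbreviations like "Dr." and it joins sentences with a space even in Chinese text.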

Voice clone quality is poor

Use clean reference audio (a few seconds, minimal background noise) and an accurate prompt_text transcript; mismatched transcripts degrade prosody. Lower cfg_value (toward 1.5) if the voice sounds strained, or raise it for tighter adherence to the prompt. If a clip comes out unstable, the API's retry_badcase=True mode (documented on the model card) re-runs bad cases automatically.
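Before blaming the model, it can help to check the reference clip itself. The sketch below reads a clip's basic properties with Python's standard wave module and flags clips that look too short or too long. inspect_wav and reference_warnings are hypothetical helpers, and the 3-second and 30-second thresholds are illustrative assumptions, not limits documented by VoxCPM. The wave module only reads uncompressed PCM WAV files.

```python
import wave


def inspect_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wf.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def reference_warnings(info: dict, min_s: float = 3.0, max_s: float = 30.0) -> list[str]:
    """Return human-readable warnings; both thresholds are assumptions."""
    warnings = []
    if info["duration_s"] < min_s:
        warnings.append(f"clip is {info['duration_s']:.1f} s; try at least {min_s:.0f} s")
    if info["duration_s"] > max_s:
        warnings.append(f"clip is {info['duration_s']:.1f} s; consider trimming it")
    return warnings
```

Running reference_warnings(inspect_wav("reference.wav")) before synthesis gives a quick sanity check; an empty list means the clip passed these rough duration checks, not that it is clean audio.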

First run is slow / downloads weights

The first from_pretrained / generate call downloads ~1.6 GB of weights from HuggingFace. Subsequent runs use the local cache. Pre-fetch with snapshot_download (see Installation) to control timing.
