VoxCPM-0.5B on RTX 5070 Ti: Zero-Shot Voice Cloning TTS in ~5 GB VRAM

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B, a 0.5B-parameter, tokenizer-free TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav. On the RTX 5070 Ti's 16 GB the model uses roughly a third of the card (~5 GB), which leaves ample room for other workloads (see "Spending the headroom" below).

Hardware data: RTX 5070 Ti (16 GB VRAM) · VoxCPM-0.5B fits in ~5 GB leaving ~11 GB headroom · See benchmark data

ℹ️ This recipe pins VoxCPM-0.5B (legacy), not VoxCPM1.5 or VoxCPM2. The OpenBMB repo ships three versions per the official Models & Versions table: VoxCPM2 (2B backbone, ~8 GB VRAM, 48 kHz, 30 languages — the latest, now the repo's default), VoxCPM1.5 (0.6B, ~6 GB, 44.1 kHz, 2 languages — stable), and VoxCPM-0.5B (0.5B, ~5 GB, 16 kHz, 2 languages — legacy). All three fit comfortably on a 16 GB card, so the choice here is feature trade-offs, not headroom. This recipe pins 0.5B because it is the original, most-cited release for Chinese/English voice cloning. The upstream Quick Start and CLI now default to openbmb/VoxCPM2 — so you must pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained (shown below) or you will silently load the 2B VoxCPM2 checkpoint instead. If you want 48 kHz audio or one of the 30 languages, switch to VoxCPM2 deliberately.
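
The trade-off in the note above can be written down as a small lookup. This is an illustrative sketch only: the figures are copied from the note, while the `Variant` class, the `VARIANTS` list, and the `pick_variant` helper are hypothetical names that are not part of the voxcpm package. Counting languages is a crude stand-in for checking which specific languages you need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    vram_gb: float       # approximate runtime VRAM from the note above
    sample_rate_hz: int
    languages: int       # number of supported languages


# Figures copied from the note above; all VRAM values are approximate.
VARIANTS = [
    Variant("VoxCPM-0.5B", 5, 16_000, 2),
    Variant("VoxCPM1.5", 6, 44_100, 2),
    Variant("VoxCPM2", 8, 48_000, 30),
]


def pick_variant(min_sample_rate_hz, min_languages, vram_gb):
    """Return the lowest-VRAM variant that meets the requirements, or None."""
    fits = [
        v for v in VARIANTS
        if v.sample_rate_hz >= min_sample_rate_hz
        and v.languages >= min_languages
        and v.vram_gb <= vram_gb
    ]
    return min(fits, key=lambda v: v.vram_gb) if fits else None


if __name__ == "__main__":
    # 16 kHz Chinese/English on a 16 GB card: the legacy 0.5B model is enough.
    print(pick_variant(16_000, 2, 16).name)
    # Asking for 48 kHz output leaves only VoxCPM2.
    print(pick_variant(48_000, 2, 16).name)
```

On a 16 GB card every variant passes the VRAM filter, so the sample-rate and language requirements are what actually decide the result.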

Requirements

Component	Minimum	Tested
GPU	6 GB VRAM (model needs ~5 GB per official Models & Versions table)	RTX 5070 Ti (16 GB)
RAM	8 GB	—
Storage	~2 GB (1.30 GB `pytorch_model.bin` + 0.30 GB `audiovae.pth` + denoiser/ASR helpers)	—
Software	Python >= 3.10 (<3.13), PyTorch >= 2.5.0, CUDA 12.8 (source)	—

Installation

1. Install PyTorch with the cu128 wheel (Blackwell)

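The RTX 5070 Ti is a Blackwell (GB203, sm_120) GPU, so it needs a PyTorch build that ships sm_120 kernels. The CUDA 12.8 (cu128) wheels include them, and a current default pip install torch resolves to such a build. In practice that means PyTorch 2.7 or newer, even though the voxcpm package itself only asks for >= 2.5.0. Pin the cu128 index explicitly to be safe: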

pip install torch --index-url https://download.pytorch.org/whl/cu128

VoxCPM-0.5B runs through standard PyTorch attention; it does not require FlashAttention-2, so the FA2 sm_120 wheel gap that bites other Blackwell recipes does not apply here.
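
To confirm the installed wheel really carries Blackwell kernels, a quick check along these lines may help. It is a minimal sketch: `has_sm120` is a hypothetical helper name, while `torch.cuda.get_arch_list()`, `torch.cuda.get_device_capability()`, and `torch.version.cuda` are standard PyTorch calls. The import is guarded so the script also runs on a machine without PyTorch.

```python
def has_sm120(arch_list):
    """True if a list such as ['sm_90', 'sm_120'] includes sm_120 kernels."""
    return "sm_120" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.version.cuda is None on CPU-only builds.
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        if torch.cuda.is_available():
            print("device:", torch.cuda.get_device_name(0))
            print("compute capability:", torch.cuda.get_device_capability(0))
        # get_arch_list() is empty when CUDA is unavailable.
        print("sm_120 kernels present:", has_sm120(torch.cuda.get_arch_list()))
```

If the last line reports False on a working CUDA install, the wheel was most likely pulled from a different index and should be reinstalled from the cu128 URL above.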

2. Install the `voxcpm` package

The canonical install is the published PyPI package, per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

3. (Optional) Install from source for the Gradio web demo

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

Source: VoxCPM Quick Start docs.

4. (Optional) Pre-download model weights

Weights download on first inference automatically, but you can pre-fetch them to control where they land:

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")

If you plan to use the denoiser / ASR helpers for cleaner cloning (recommended), also pre-fetch the ZipEnhancer and SenseVoice-Small models that the HF card references:

from modelscope import snapshot_download
snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")
snapshot_download("iic/SenseVoiceSmall")

Running

Python — basic synthesis

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model.",
    prompt_wav_path=None,
    prompt_text=None,
    cfg_value=2.0,
    inference_timesteps=10,
    normalize=True,
    denoise=True,
)
sf.write("output.wav", wav, 16000)

Note the 16000 sample rate: VoxCPM-0.5B emits 16 kHz audio. Raising inference_timesteps improves quality at the cost of speed; lowering it does the reverse. Passing the full "openbmb/VoxCPM-0.5B" repo path keeps you on the legacy 0.5B checkpoint; omitting it now resolves to VoxCPM2.

Python — zero-shot voice cloning

Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:

wav = model.generate(
    text="Hello — this is your cloned voice speaking.",
    prompt_wav_path="reference.wav",
    prompt_text="reference transcript matching the wav",
    cfg_value=2.0,
    inference_timesteps=10,
    denoise=True,
)
sf.write("cloned.wav", wav, 16000)
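
Before cloning, it can be worth checking that the reference clip is what you expect. The sketch below uses only Python's standard `wave` module, so it reads uncompressed PCM .wav files only. The `inspect_reference` name and the `min_seconds` / `max_seconds` thresholds are illustrative assumptions, not limits documented by VoxCPM.

```python
import wave


def inspect_reference(path, min_seconds=3.0, max_seconds=30.0):
    """Return (duration_s, sample_rate_hz, channels, warnings) for a PCM .wav.

    min_seconds and max_seconds are illustrative thresholds, not VoxCPM limits.
    """
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / float(sample_rate)

    warnings = []
    if duration < min_seconds:
        warnings.append(f"clip is only {duration:.1f} s; consider a longer sample")
    if duration > max_seconds:
        warnings.append(f"clip is {duration:.1f} s; consider trimming it")
    if channels != 1:
        warnings.append(f"clip has {channels} channels; a mono clip is simpler to check")
    return duration, sample_rate, channels, warnings


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        duration, rate, channels, notes = inspect_reference(sys.argv[1])
        print(f"{duration:.2f} s, {rate} Hz, {channels} channel(s)")
        for note in notes:
            print("warning:", note)
```

Run it as `python inspect_reference.py reference.wav` (the script name is arbitrary) and fix any warnings before passing the clip as prompt_wav_path.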

CLI

After installation the voxcpm entry point is available. Pass --hf-model-id openbmb/VoxCPM-0.5B to stay on this recipe's variant (the CLI otherwise defaults to VoxCPM2):

voxcpm --text "Hello VoxCPM" --output out.wav --hf-model-id openbmb/VoxCPM-0.5B
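
When rendering several lines through the CLI, it is easy to forget the model flag on one of them. A small wrapper that always appends it is sketched below. It uses only the three flags shown above (--text, --output, --hf-model-id); `build_command` and the output file naming are hypothetical choices for illustration.

```python
import shlex

MODEL_ID = "openbmb/VoxCPM-0.5B"


def build_command(text, output_path, model_id=MODEL_ID):
    """Build the argv list for one voxcpm CLI call, always pinning the model id."""
    return ["voxcpm", "--text", text, "--output", output_path,
            "--hf-model-id", model_id]


if __name__ == "__main__":
    lines = ["Hello VoxCPM", "This is the second line."]
    for i, line in enumerate(lines, start=1):
        cmd = build_command(line, f"out_{i:03d}.wav")
        # Print a copy-pasteable shell line. To execute instead, pass the
        # list to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Each CLI invocation is a separate process that loads the model again, so for more than a handful of lines the Python API with one loaded model is likely the faster route.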

Gradio web demo

If you installed from source (step 3), launch the local UI:

python app.py

Results

Speed: Not yet benchmarked on the RTX 5070 Ti. The only published figure is an RTF (real-time factor: synthesis time divided by the duration of the audio produced, so lower is faster) of ~0.17 for VoxCPM-0.5B on an RTX 4090, per the official Models & Versions table. The RTX 5070 Ti is a different architecture (Blackwell GB203, sm_120) with a different bandwidth/compute profile, so the 4090 figure should not be quoted as a 5070 Ti result. TTS inference is memory-bandwidth-sensitive, and a 5070 Ti measurement could land on either side of the 4090's. Track and contribute a measurement at /check/voxcpm/rtx-5070-ti.
VRAM usage: ~5 GB per the official Models & Versions table. On disk the model is ~1.6 GB (1.30 GB pytorch_model.bin + 0.30 GB audiovae.pth); the ~5 GB runtime peak covers activations plus the optional denoiser/ASR helper models. That leaves ~11 GB free on the RTX 5070 Ti's 16 GB.
Quality notes: 16 kHz output sample rate, 12.5 Hz LM token rate, 2 supported languages (Chinese, English), continuation-only voice cloning — all per the official Models & Versions table. Trained on a 1.8 M-hour bilingual corpus per the HF model card. Apache-2.0 license — commercial use is permitted.
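
Since no RTX 5070 Ti speed figure exists yet, here is one way to take your own measurement. RTF is synthesis wall time divided by the duration of the audio produced. The sketch assumes model.generate returns a 1-D array of samples at 16 kHz, as the sf.write calls earlier imply; `measure_rtf` and `expected_wall_seconds` are hypothetical helper names.

```python
import time


def measure_rtf(synthesize, sample_rate_hz=16_000):
    """Time one synthesis call and return (rtf, audio_seconds, wall_seconds).

    `synthesize` is any zero-argument callable returning a 1-D sequence of
    samples, for example: lambda: model.generate(text="...").
    """
    start = time.perf_counter()
    samples = synthesize()
    wall_seconds = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate_hz
    return wall_seconds / audio_seconds, audio_seconds, wall_seconds


def expected_wall_seconds(audio_seconds, rtf):
    """Synthesis time implied by a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # Published RTX 4090 figure from the table: RTF ~0.17, so 10 s of audio
    # would take roughly 1.7 s to synthesize on that card.
    print(round(expected_wall_seconds(10.0, 0.17), 2))
```

Because torch.compile is enabled by default, the first call may include warm-up time; discarding it and averaging several later calls should give a fairer number.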

For the full benchmark data, see /check/voxcpm/rtx-5070-ti.

Spending the headroom

VoxCPM-0.5B uses ~5 GB of the RTX 5070 Ti's 16 GB, so ~11 GB sits idle during synthesis. Practical ways to use it:

Colocate the denoiser and ASR helpers permanently. The ZipEnhancer (iic/speech_zipenhancer_ans_multiloss_16k_base) and SenseVoice-Small models used for prompt enhancement and reference-clip ASR are small; keeping them resident alongside VoxCPM avoids reload latency on every cloned generation.
Run a small LLM alongside for a text-then-speech pipeline — e.g. a 7B chat model at a 4-bit quant (~5–6 GB) fits in the remaining headroom, letting one card both draft text and voice it.
Batch longer scripts without spilling — the spare VRAM absorbs longer reference clips and multi-utterance batches comfortably.
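
The headroom arithmetic behind these suggestions is simple enough to script. The sketch below only subtracts the approximate figures quoted in this recipe; `remaining_gb` is a hypothetical helper, and the numbers ignore CUDA context overhead and any LLM KV cache, so treat the results as upper bounds.

```python
def remaining_gb(total_gb, *resident_gb):
    """VRAM left after subtracting each resident workload (approximate)."""
    return total_gb - sum(resident_gb)


if __name__ == "__main__":
    CARD_GB = 16      # RTX 5070 Ti
    VOXCPM_GB = 5     # VoxCPM-0.5B, per the Models & Versions table
    LLM_GB = (5, 6)   # 7B chat model at 4-bit, range quoted above

    print("after VoxCPM:", remaining_gb(CARD_GB, VOXCPM_GB), "GB")
    for llm in LLM_GB:
        left = remaining_gb(CARD_GB, VOXCPM_GB, llm)
        print(f"after VoxCPM + {llm} GB LLM: {left} GB")
```

Even with the larger LLM estimate, about 5 GB remains for the helper models and longer batches.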

Troubleshooting

`torch.compile` errors on launch

The VoxCPM helper enables torch.compile optimisations by default. The Quick Start docs note that if you hit platform-specific torch.compile issues, you can pass optimize=False to VoxCPM.from_pretrained — useful on Windows or older CUDA stacks.

Strained or weird-sounding voice

Per the Hugging Face card, lower cfg_value (e.g. from 2.0 toward 1.5) to loosen the guidance when the voice sounds strained; raise it for tighter adherence to the prompt. Long or highly expressive inputs can make the model unstable; if that happens, split the text into shorter chunks or reduce inference_timesteps.
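
One simple way to chunk long input is to split on sentence boundaries and group whole sentences up to a length cap. The sketch below is an assumption about how you might do it, not something the voxcpm package provides: `chunk_text` and the 200-character default are hypothetical, and the splitter only knows a few English and Chinese punctuation marks.

```python
import re

# ASCII marks must be followed by whitespace (so "3.14" stays intact);
# full-width Chinese marks split even with no following space.
_SENTENCE_END = re.compile(r"(?<=[.!?;])\s+|(?<=[。！？；])\s*")


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    Sentences are rejoined with a single space. A single sentence longer
    than max_chars is kept intact as its own chunk.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("One. Two! Three? 四。五。", max_chars=10):
        print(repr(chunk))
```

Each chunk can then be passed to model.generate in turn with the same prompt_wav_path and prompt_text, and the resulting arrays joined (for example with numpy.concatenate) before writing the file.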

Background noise in cloned voices

Set denoise=True (as the snippets above already do) or enable "Prompt Speech Enhancement" in the Gradio UI. Either option pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning, as documented on the HF model card.

Accidentally loaded VoxCPM2 instead of 0.5B

If you called VoxCPM.from_pretrained() without the explicit "openbmb/VoxCPM-0.5B" argument, or ran the voxcpm CLI without --hf-model-id, you got VoxCPM2 (the new repo default — 2B, ~8 GB VRAM, 48 kHz, 30 languages). On the RTX 5070 Ti's 16 GB both fit, so this is a feature-mix surprise rather than an OOM. Always pass "openbmb/VoxCPM-0.5B" verbatim to stay on the variant this recipe documents.