VoxCPM2 on RTX 5070: 30-Language 48kHz Voice Cloning in ~8 GB VRAM

tts · beginner · 8GB+ VRAM · Jun 5, 2026
Prerequisites
  • NVIDIA GPU with ≥ 8 GB VRAM (RTX 5070 12GB has ~4 GB of headroom over the requirement)
  • Python ≥ 3.10 (<3.13)
  • PyTorch ≥ 2.5.0 with CUDA ≥ 12.0 — on Blackwell (sm_120) prefer a CUDA 12.8+ / 13.0 toolchain (see Installation)

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description like "A young woman, gentle and sweet voice". VoxCPM2 is the successor to VoxCPM (the original 0.5B model): the v2 jump moves to 2B parameters, upgrades audio output from 16 kHz to 48 kHz (via AudioVAE V2's built-in super-resolution), and expands language coverage to 30 languages plus several Chinese dialects.

Hardware data: RTX 5070 (12 GB VRAM) · ~8 GB VRAM requirement per the official model card leaves ~4 GB free for other workloads · See benchmark data

ℹ️ Blackwell note. The RTX 5070 is an sm_120 (Blackwell GB205) card. A default pip install torch from an older CUDA 12.x index (before cu128) does not ship Triton kernels with native sm_120 support, which causes silent fallbacks or crashes, so the install below pins a CUDA 12.8+ / 13.0 PyTorch wheel. This is the one place where the RTX 5070 recipe differs from the same model on older (Ada/Ampere) cards.

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (the official HF model card lists ~8 GB; the Quick Start FAQ's VRAM & RTF reference table confirms it) | RTX 5070 (12 GB GDDR7, Blackwell GB205 sm_120)
RAM | 8 GB |
Storage | ~5 GB (model.safetensors 4.58 GB + audiovae.pth 0.38 GB on the HF repo) |
Software | Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0; CUDA 12.8+/13.0 on Blackwell |

Installation

1. Install a Blackwell-compatible PyTorch first

On sm_120 cards, install PyTorch from a CUDA 12.8 (cu128) or newer index before the voxcpm package, so the correct CUDA build wins over any transitive CPU-only torch that dependencies might pull in:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

The motivation is documented on the canonical repo: PR #250 explains that CUDA 12.x does not ship Triton wheels with native sm_120 support — causing silent fallbacks or crashes on RTX 5000-series (Blackwell) GPUs — and that a CUDA 13.0 toolchain resolves this and lets VoxCPM2 run at full speed on the RTX 5070/5080/5090. A CUDA 13.0 (cu130) wheel works equally well if you prefer the newest toolchain.
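
One way to confirm that the right wheel won is to look at the CUDA tag in torch.__version__. The sketch below assumes the usual PyTorch local-version convention (for example 2.7.0+cu128) and treats cu128 as the minimum for sm_120, as in the step above. The helper names are hypothetical, and the code deliberately avoids importing torch so it can run anywhere.

```python
import re

# cu128 is the minimum CUDA build this recipe installs for sm_120 cards.
MIN_BLACKWELL_CUDA = (12, 8)


def wheel_cuda_version(torch_version: str):
    """Parse the CUDA tag from a version string such as '2.7.0+cu128'.

    Returns (major, minor), or None for CPU-only or untagged builds.
    """
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)  # '128' means CUDA 12.8, '130' means CUDA 13.0
    return int(digits[:-1]), int(digits[-1])


def wheel_ok_for_blackwell(torch_version: str) -> bool:
    """True when the wheel was built against CUDA 12.8 or newer."""
    cuda = wheel_cuda_version(torch_version)
    return cuda is not None and cuda >= MIN_BLACKWELL_CUDA
```

On the target machine, call wheel_ok_for_blackwell(torch.__version__) after import torch; torch.cuda.get_device_capability() should also report (12, 0) on an sm_120 card.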

2. Install the voxcpm package

The canonical install is the published PyPI package, identical across both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

3. (Optional) Install from source for the web demo

If you want the Gradio playground, clone the repo and install in editable mode:

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

Source: VoxCPM 2 Quick Start docs.

4. (Optional) Pre-download model weights

Weights download automatically on first inference, but you can pre-fetch them into the local Hugging Face cache so the first run does not stall on the ~5 GB download:

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM2")

Running

Python — basic synthesis

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)

Two notes on this snippet, both from the HF model card and the Quick Start docs:

  • load_denoiser=False skips loading the optional ZipEnhancer denoiser. Keep it off unless you plan to use voice cloning with prompt/reference audio — the denoiser is only needed to clean up reference clips, and per the FAQ it runs on CPU even when CUDA is active.
  • The sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded — VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz.

Higher inference_timesteps trades speed for quality.
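
To see what that trade-off costs on your own card, a small timing harness is enough. This is a minimal sketch: time_timesteps is a hypothetical helper, and generate_fn stands for any callable that accepts text and inference_timesteps keyword arguments, as model.generate does in the snippet above.

```python
import time
from typing import Callable, Dict, Iterable


def time_timesteps(
    generate_fn: Callable[..., object],
    text: str,
    steps: Iterable[int],
) -> Dict[int, float]:
    """Return wall-clock seconds spent per inference_timesteps setting."""
    timings: Dict[int, float] = {}
    for n in steps:
        start = time.perf_counter()
        generate_fn(text=text, inference_timesteps=n)
        timings[n] = time.perf_counter() - start
    return timings
```

A call such as time_timesteps(model.generate, "A short test sentence.", [5, 10, 20]) returns one timing per setting. Run one throwaway generation first, since the first call may include torch.compile warm-up when optimize=True.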

Python — voice design (new in v2)

VoxCPM2 lets you describe the desired voice in natural language inside the text itself, as shown on the HF card:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
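
If you generate many voices, it can help to build the prompt programmatically. The helper below is a hypothetical convenience, not part of the voxcpm package; it only reproduces the "(description)text" layout shown in the example above.

```python
def voice_design_text(description: str, text: str) -> str:
    """Prefix text with a parenthesised voice description.

    An empty description returns the text unchanged, so the same call
    works for plain synthesis.
    """
    description = description.strip().strip("()")
    if not description:
        return text
    return f"({description}){text}"
```

Pass the result as the text argument, for example model.generate(text=voice_design_text("A young woman, gentle and sweet voice", "Hello, welcome to VoxCPM2!"), cfg_value=2.0, inference_timesteps=10).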

Python — zero-shot voice cloning

Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing. The HF card covers both "Controllable Cloning" and "Ultimate Cloning" variants via the reference_wav_path / prompt_wav_path parameters. For cloning, you'll generally want load_denoiser=True so the reference audio is cleaned up before voiceprint extraction.
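
A minimal sketch of assembling the cloning arguments follows. reference_wav_path and prompt_wav_path are the parameter names mentioned above; prompt_text as the name of the transcript parameter is an assumption, so check it against the HF card before relying on it. clone_kwargs itself is a hypothetical helper, and the cfg_value and inference_timesteps defaults are simply copied from the basic synthesis example.

```python
from typing import Any, Dict, Optional


def clone_kwargs(
    text: str,
    reference_wav_path: Optional[str] = None,
    prompt_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,  # assumed parameter name for the transcript
) -> Dict[str, Any]:
    """Build keyword arguments for a cloning call, omitting unset options."""
    if reference_wav_path is None and prompt_wav_path is None:
        raise ValueError("voice cloning needs a reference or prompt clip")
    kwargs: Dict[str, Any] = {
        "text": text,
        "cfg_value": 2.0,
        "inference_timesteps": 10,
    }
    optional = {
        "reference_wav_path": reference_wav_path,
        "prompt_wav_path": prompt_wav_path,
        "prompt_text": prompt_text,
    }
    # Only forward the options the caller actually supplied.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs
```

With a model loaded using load_denoiser=True, the call would then look like wav = model.generate(**clone_kwargs("Text to speak.", reference_wav_path="speaker.wav")).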

Gradio web demo

If you installed from source (step 3), launch the local UI:

python app.py

Results

  • Speed: Not yet community-benchmarked on the RTX 5070 — no source names this card. The official model card publishes a Real-Time Factor (RTF) of ~0.30 standard / ~0.13 with the Nano-VLLM accelerated path, but those figures are measured on an NVIDIA RTX 4090 (Ada), a different architecture — treat them as a reference point, not an RTX 5070 prediction. The RTX 5070 has substantially less compute and memory bandwidth than the 4090, so do not assume the 4090's RTF transfers. A community report on the canonical repo (Issue #282) notes early Blackwell builds ran VoxCPM2 inference below a 4090's throughput on an RTX Pro 6000 Blackwell card — a different Blackwell GPU, but a signal that Blackwell tuning was still maturing at that time. Track /check/voxcpm2/rtx-5070 and share your own measurement via /contribute when it lands.
  • VRAM usage: ~8 GB for the stock bf16 path, per the official model card's Model Details table and independently confirmed in the VoxCPM 2 FAQ VRAM & RTF reference table. On the RTX 5070's 12 GB that leaves roughly 4 GB free — enough to keep the runtime stable and a light second workload resident, but not the half-the-card headroom a 16 GB card would offer.
  • Quality notes: 48 kHz studio-quality output, tokenizer-free diffusion-autoregressive architecture (LocEnc → TSLM → RALM → LocDiT) on the MiniCPM-4 backbone, 30 supported languages plus several Chinese dialects, Apache-2.0 license (free for commercial use). Audio path is 16 kHz reference → AudioVAE V2 → 48 kHz output per the official model card architecture section. MLX 8-bit / 4-bit variants exist under the mlx-community namespace for Apple Silicon, but on NVIDIA the stock bf16 path is the supported route.

For the full benchmark data, see /check/voxcpm2/rtx-5070.
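
To produce a comparable number yourself, RTF is conventionally the synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below assumes the model card uses that conventional definition; real_time_factor is a hypothetical helper name.

```python
def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_s = num_samples / sample_rate
    return elapsed_s / audio_s
```

For example, wrap model.generate(...) in time.perf_counter() calls and pass the elapsed seconds, len(wav), and model.tts_model.sample_rate. Ten seconds of 48 kHz audio generated in three seconds gives an RTF of 0.3.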

Making use of the 12 GB card

The model needs ~8 GB; the RTX 5070 has 12 GB. That spare ~4 GB is modest but useful: it keeps optimize=True (torch.compile + CUDA Graphs) comfortable and leaves room for a small colocated helper, a margin an 8 GB card would not give you:

  • Keep the voice-cloning helper resident — with load_denoiser=True the ZipEnhancer denoiser and the SenseVoice-Small ASR helper (downloaded by the web demo on first use) fit alongside the model for an end-to-end clone-from-clip service. The denoiser runs on CPU per the FAQ, so it costs little GPU memory.
  • Keep optimize=True warm — torch.compile + CUDA Graphs (the default) reserves extra working memory for compiled kernels; the ~4 GB headroom means you don't have to disable it to keep the run stable.
  • Tighter than a 16 GB card — if you specifically want to colocate a 7B-class LLM at Q4 (~4–5 GB) on the same GPU, the 12 GB envelope is too tight to do it alongside VoxCPM2's ~8 GB; step up to a 16 GB card for that, or run the LLM in a separate process with paging.
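
The budget arithmetic behind these bullets can be sketched in a few lines. The figures are the nominal ones quoted in this recipe (~8 GB for VoxCPM2, ~4 to 5 GB for a 7B-class LLM at Q4), not measurements, and vram_headroom_gb is a hypothetical helper.

```python
from typing import Dict


def vram_headroom_gb(
    card_gb: float,
    workloads_gb: Dict[str, float],
    reserve_gb: float = 0.0,
) -> float:
    """VRAM left after all resident workloads and an optional safety reserve."""
    return card_gb - sum(workloads_gb.values()) - reserve_gb


# Nominal figures from this recipe, not measured values.
tts_only = vram_headroom_gb(12, {"voxcpm2": 8})
with_llm_12gb = vram_headroom_gb(12, {"voxcpm2": 8, "llm_7b_q4": 4.5})
with_llm_16gb = vram_headroom_gb(16, {"voxcpm2": 8, "llm_7b_q4": 4.5})
```

A negative result, as with the 12 GB card plus a Q4 LLM, means the combination does not fit; the same pair leaves a positive margin on a 16 GB card.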

Troubleshooting

Crashes or "not compatible with the current PyTorch installation" on the RTX 5070

This is the Blackwell (sm_120) kernel-availability gap. A "not compatible with the current PyTorch installation" error, or silent fallbacks or crashes after load, means torch was installed without sm_120 kernels. The fix is to (re)install PyTorch from a CUDA 12.8+ index as shown in Installation step 1; in Issue #258 a VoxCPM collaborator advised a 5000-series user to upgrade both CUDA and torch to resolve exactly this incompatibility. PR #250 documents that CUDA 13.0 is the cleanest toolchain for native sm_120 support.

torch.compile / Triton errors during warm-up

Per the VoxCPM 2 FAQ, torch._dynamo.exc.Unsupported errors (often mentioning einops) or Triton import failures during warm-up come from mismatched PyTorch/Triton versions. Quick fix — skip torch.compile entirely:

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", optimize=False)

For a permanent fix, pin matching versions: the FAQ's compatibility table maps PyTorch 2.4/2.5 → Triton 3.1, 2.6 → 3.2, 2.7 → 3.3, 2.8 → 3.4. On Windows, Issue #258 reports that optimize=True (torch.compile + CUDA Graphs) can also hit a thread-local-storage assertion after a few generations when inference runs inside Gradio's background worker thread — optimize=False sidesteps it. For maximum stability on Blackwell, a VoxCPM collaborator on that thread recommends the Nano-VLLM accelerated runtime as an alternative.
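
The compatibility pairs quoted above can be encoded as a lookup so a mismatch is caught before warm-up. This is a sketch: the mapping contains only the pairs listed in this section, and the helper names are hypothetical.

```python
from typing import Optional

# PyTorch minor version -> Triton minor version, as quoted from the FAQ above.
TRITON_FOR_TORCH = {
    "2.4": "3.1",
    "2.5": "3.1",
    "2.6": "3.2",
    "2.7": "3.3",
    "2.8": "3.4",
}


def expected_triton(torch_version: str) -> Optional[str]:
    """Return the Triton minor version paired with a PyTorch version string."""
    base = torch_version.split("+")[0]  # drop a local tag such as '+cu128'
    major_minor = ".".join(base.split(".")[:2])
    return TRITON_FOR_TORCH.get(major_minor)


def versions_match(torch_version: str, triton_version: str) -> bool:
    """True when the installed Triton matches the listed pairing."""
    expected = expected_triton(torch_version)
    if expected is None:
        return False  # version not covered by the table above
    actual = ".".join(triton_version.split(".")[:2])
    return actual == expected
```

In practice you would pass torch.__version__ and triton.__version__; a False result for a listed PyTorch version is a hint to pin Triton or fall back to optimize=False.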

Could not load libtorchcodec when using reference audio

The FAQ recommends installing FFmpeg system-wide and running pip install torchcodec, or forcing torchaudio.set_audio_backend("soundfile") if torchcodec cannot be installed. This only matters when you pass reference/prompt audio for voice cloning.