
VoxCPM2 on RTX 4060 Ti 16GB: 30-Language 48kHz Voice Cloning with Headroom to Spare

tts · beginner · 8GB+ VRAM · May 21, 2026
prerequisites
  • NVIDIA GPU with ≥ 8 GB VRAM (RTX 4060 Ti 16GB has ~8 GB of headroom over the requirement)
  • Python ≥ 3.10 (<3.13)
  • PyTorch ≥ 2.5.0 with CUDA ≥ 12.0

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description like "A young woman, gentle and sweet voice". VoxCPM2 is the successor to VoxCPM (the original 0.5B model): the v2 jump moves to 2B parameters, upgrades audio output from 16 kHz to 48 kHz, and expands language coverage to 30 languages including several Chinese dialects.

Hardware data: RTX 4060 Ti 16GB · ~8 GB VRAM requirement per the official model card leaves ~8 GB free for other workloads · See benchmark data

Requirements

Component | Minimum | Tested
GPU | 8 GB VRAM (model card lists VRAM: ~8 GB on the official HF page and confirms in the Quick Start FAQ table) | RTX 4060 Ti 16GB (Ada Lovelace, sm_89)
RAM | 8 GB | -
Storage | ~4 GB (model weights + optional denoiser/ASR helpers) | -
Software | Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0 (source) | -
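
The Python bounds listed above can be checked before installing anything. The following is a minimal sketch using only the standard library; `python_supported` is a hypothetical helper written for this recipe, not part of the voxcpm package.

```python
import sys

# Bounds taken from the requirements above: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_supported(version=None):
    """Return True if (major, minor) falls inside the supported range.

    Hypothetical helper for this recipe, not a voxcpm API.
    """
    if version is None:
        version = sys.version_info[:2]
    return MIN_PYTHON <= tuple(version[:2]) < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_supported() else "outside the supported range"
    print(f"Python {current}: {status}")
```

The upper bound is exclusive, so 3.12 passes and 3.13 does not.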

Installation

1. Install the voxcpm package

The canonical install is the published PyPI package, identical across both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

Unlike Blackwell GPUs (sm_120), the 4060 Ti 16GB (Ada sm_89) has full FlashAttention-2 and Triton kernel coverage on stock CUDA wheels — the default pip install torch already includes sm_89 kernels, so no special wheel selection is required.

2. (Optional) Install from source for the web demo

If you want the Gradio playground, clone the repo and install in editable mode:

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

Source: VoxCPM 2 Quick Start docs.

3. (Optional) Pre-download model weights

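Weights download automatically on first inference, but you can pre-fetch them ahead of time. By default they land in the Hugging Face cache, whose location you can change with the HF_HOME environment variable:

from huggingface_hub import snapshot_download
# Fetches every file in the repo into the Hugging Face cache (relocatable via HF_HOME).
snapshot_download("openbmb/VoxCPM2")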

Running

Python — basic synthesis

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 is an innovative end-to-end TTS model.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)

Two notes on this snippet, both straight from the VoxCPM 2 Quick Start docs:

  • load_denoiser=False skips loading the optional ZipEnhancer denoiser. Leave the denoiser disabled unless you plan to use voice cloning with prompt/reference audio; skipping it saves memory and download time.
  • The sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded — VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz.

Raising inference_timesteps improves quality at the cost of slower synthesis.

Python — zero-shot voice cloning

Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing. The OpenBMB/VoxCPM GitHub README covers both "Controllable Cloning" and "Ultimate Cloning" variants using the prompt_wav_path / reference_wav_path parameters. For cloning, you'll generally want load_denoiser=True so the reference audio is cleaned up before voiceprint extraction.
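
As a rough sketch of how the two variants might be wired up, the helper below only builds the keyword arguments and does not call the model. The names prompt_wav_path and reference_wav_path come from the README as cited above; the prompt_text argument for the transcript and the way the two variants are split here are assumptions, so check the README before relying on them. `clone_kwargs` itself is a hypothetical helper.

```python
def clone_kwargs(text, reference_wav_path, prompt_text=None):
    """Build keyword arguments for a cloning call to model.generate().

    Hypothetical helper. Assumptions (verify against the README):
      - reference_wav_path alone selects reference-only cloning;
      - supplying a transcript additionally passes the same clip as
        prompt_wav_path together with prompt_text.
    """
    kwargs = {
        "text": text,
        "reference_wav_path": reference_wav_path,
        "cfg_value": 2.0,
        "inference_timesteps": 10,
    }
    if prompt_text is not None:
        kwargs["prompt_wav_path"] = reference_wav_path
        kwargs["prompt_text"] = prompt_text
    return kwargs


if __name__ == "__main__":
    print(clone_kwargs("Hello.", "voice.wav"))
    print(clone_kwargs("Hello.", "voice.wav", prompt_text="Reference transcript."))
```

With a model loaded using load_denoiser=True, the result would be passed on as wav = model.generate(**clone_kwargs(...)).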

Python — voice design (new in v2)

VoxCPM2 lets you describe the desired voice in natural language inside the text itself, as shown on the HF card:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("designed.wav", wav, model.tts_model.sample_rate)
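
When descriptions come from user input or a config file, it can help to assemble the string in one place. The sketch below mirrors the "(description)text" layout of the example above; `design_prompt` is a hypothetical helper, not part of voxcpm, and the layout is taken only from that single example.

```python
def design_prompt(description, text):
    """Prefix `text` with a parenthesised voice description.

    Hypothetical helper, not a voxcpm API. Surrounding whitespace and any
    parentheses the caller already added are removed before wrapping.
    """
    description = description.strip().strip("()").strip()
    if not description:
        raise ValueError("description must not be empty")
    return f"({description}){text}"


if __name__ == "__main__":
    print(design_prompt("A young woman, gentle and sweet voice",
                        "Hello, welcome to VoxCPM2."))
```

The returned string can be passed as the text argument of model.generate in place of the hand-written literal.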

CLI commands

The package also ships a voxcpm CLI, per the Quick Start docs:

voxcpm design --text "Hello from VoxCPM2!" --output out.wav
voxcpm clone --text "This is a cloned voice sample." --reference-audio path/to/voice.wav --output out.wav --denoise

Gradio web demo

If you installed from source (step 2), launch the local UI:

python app.py

Per the Quick Start docs, the web demo also downloads an additional ASR model (SenseVoice-Small) on first use.

Results

  • Speed: The official OpenBMB model card and GitHub README report a vendor RTF (real-time factor) of ~0.30 on an NVIDIA RTX 4090, i.e. 1 s of audio synthesised in ~0.3 s of GPU time, and ~0.13 with the Nano-vLLM accelerated path. The RTX 4060 Ti 16GB shares the Ada Lovelace generation (sm_89) but has materially lower memory bandwidth and compute than the 4090. TTS workloads of this class are bandwidth-bound during the AR decode step, so expect an RTF noticeably above (slower than) the 4090's ~0.30. Treat the RTX 4090 number as a reference from a faster sibling, not as a prediction for this card. RTX 4060 Ti 16GB RTF has not yet been community-benchmarked; track /check/voxcpm2/rtx-4060-ti-16gb and share your own results via /contribute.
  • VRAM usage: ~8 GB per the official model comparison table (also confirmed in the VoxCPM 2 FAQ table). Leaves about half of the RTX 4060 Ti 16GB free for stacking another model (small LLM, image classifier, etc.).
  • Quality notes: 48 kHz studio-quality output, tokenizer-free diffusion-autoregressive architecture (LocEnc → TSLM → RALM → LocDiT) on the MiniCPM-4 backbone, 30 supported languages plus several Chinese dialects, Apache-2.0 license (free for commercial use). Audio path is 16 kHz reference → AudioVAE V2 → 48 kHz output per the official model card architecture section. MLX 8-bit / 4-bit variants exist under the mlx-community namespace for Apple Silicon, but on NVIDIA the stock bf16 path is the supported route.
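
To produce your own speed figure for this card, RTF can be computed as synthesis time divided by the duration of the audio produced. This is a minimal sketch of that arithmetic; `real_time_factor` and `timed` are hypothetical helpers, and wall-clock timing is used here as a stand-in for GPU time.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesising / duration of the audio produced."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # 0.3 s of compute for 1 s of 48 kHz audio, the vendor's RTX 4090 case.
    print(real_time_factor(0.3, 48_000, 48_000))
```

With the model loaded, wav, elapsed = timed(model.generate, text=...) followed by real_time_factor(elapsed, len(wav), model.tts_model.sample_rate) would give a local figure, assuming wav is a one-dimensional array. The first call is likely to include warm-up, so it is worth discarding it and averaging later runs.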

For the full benchmark data, see /check/voxcpm2/rtx-4060-ti-16gb.

Troubleshooting

torch.compile / Triton errors during warm-up

Per the VoxCPM 2 FAQ, torch._dynamo.exc.Unsupported or Triton import failures on launch are most common on Windows or older CUDA stacks. Quick fix:

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", optimize=False)

For a permanent fix, pin matching Triton / PyTorch versions (the FAQ table maps PyTorch 2.5 → Triton 3.1, 2.6 → 3.2, 2.7 → 3.3, 2.8 → 3.4); on Windows install triton-windows. The 4060 Ti 16GB itself is fine here — Ada sm_89 is in the well-tested kernel-coverage band for Triton 3.1+.
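
The version pairs above can be kept as a small lookup so the expected Triton release can be compared with what is installed. This is a sketch built only from the four pairs quoted in the text; `expected_triton` is a hypothetical helper and returns None for any PyTorch release outside that list.

```python
# PyTorch minor release -> Triton minor release, as listed in the text above.
TORCH_TO_TRITON = {
    "2.5": "3.1",
    "2.6": "3.2",
    "2.7": "3.3",
    "2.8": "3.4",
}


def expected_triton(torch_version):
    """Return the matching Triton release for a PyTorch version string.

    Accepts strings such as "2.6.0+cu124"; returns None for releases
    the list does not cover. Hypothetical helper, not a voxcpm API.
    """
    parts = torch_version.split("+")[0].split(".")
    if len(parts) < 2:
        return None
    return TORCH_TO_TRITON.get(f"{parts[0]}.{parts[1]}")


if __name__ == "__main__":
    for version in ("2.5.1", "2.6.0+cu124", "2.9.0"):
        print(version, "->", expected_triton(version))
```

In a real environment the argument would typically be torch.__version__, and the result would be compared against the installed Triton version.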

Could not load libtorchcodec when using reference audio

The FAQ recommends installing FFmpeg system-wide and running pip install torchcodec, or, if torchcodec cannot be installed, forcing the soundfile backend with torchaudio.set_audio_backend("soundfile").

Tight on VRAM with other models loaded

A 16 GB RTX 4060 Ti has ~8 GB of headroom over the default config. If you want to stack VoxCPM2 alongside an LLM, image model, or another TTS pipeline, set load_denoiser=False to skip loading the ZipEnhancer reference-audio denoiser, as documented in the Quick Start docs. The denoiser is only needed when you want to enhance prompt or reference audio for voice cloning. Per the FAQ it runs on CPU even when CUDA is active, so disabling it reclaims CPU RAM rather than VRAM.
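
Before stacking a second model, a quick budget check makes the headroom concrete. The sketch below is plain arithmetic under stated assumptions: the ~8 GB VoxCPM2 figure is the model card number quoted above, the 1 GB overhead allowance is a guess to tune for your system, and both helper names are hypothetical.

```python
def remaining_vram_gb(total_gb, reserved_gb, overhead_gb=1.0):
    """VRAM left after reserving memory for loaded models and overhead.

    overhead_gb is an assumed allowance for the CUDA context, desktop
    compositor and allocator fragmentation; tune it for your system.
    """
    return total_gb - reserved_gb - overhead_gb


def fits(total_gb, model_gbs, overhead_gb=1.0):
    """True if all listed model footprints fit within total_gb."""
    return remaining_vram_gb(total_gb, sum(model_gbs), overhead_gb) >= 0


if __name__ == "__main__":
    # 16 GB card, ~8 GB for VoxCPM2, plus a hypothetical 5 GB second model.
    print(remaining_vram_gb(16, 8))
    print(fits(16, [8, 5]))
    print(fits(16, [8, 8]))
```

Measured usage on your own machine is a better input than the model card figure once you have it.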