VoxCPM2 on RTX 4080: 30-Language 48kHz Voice Cloning with 8 GB to Spare

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter, tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description alone (gender, age, tone, emotion, pace). VoxCPM2 is the successor to VoxCPM (the original 0.5B model): the v2 jump moves to 2B parameters, upgrades audio output from 16 kHz to 48 kHz, and expands language coverage to 30 languages including several Chinese dialects.

Hardware data: RTX 4080 (16 GB VRAM) · ~8 GB VRAM requirement per the official model card, leaving roughly 8 GB free for other workloads · Benchmark data: /check/voxcpm2/rtx-4080

ℹ️ The 4080 is over-provisioned for this model. VoxCPM2 needs only ~8 GB per the official model card, so a 16 GB RTX 4080 runs it with about half the card free. The interesting question on this GPU is not "does it fit" (it fits easily) but "what to do with the spare ~8 GB" — see Colocating with a second model below.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (model card lists `VRAM: ~8 GB` in the Model Details table)	RTX 4080 (Ada Lovelace AD103, sm_89, 16 GB)
RAM	8 GB	—
Storage	~5 GB (model.safetensors ~4.58 GB + AudioVAE ~0.38 GB, per the HF Files tab)	—
Software	Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0 (source)	—
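
The version bounds in the table can be checked before installing anything. The following is a minimal sketch, not part of the voxcpm package: the helper names python_supported and report_environment are hypothetical, and the bounds are simply copied from the Software row above. The PyTorch part only runs if PyTorch is already installed.

```python
import sys

# Bounds copied from the Requirements table: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_supported(version=None):
    """Return True if the (major, minor) version falls inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


def report_environment():
    """Print the Python version and, if PyTorch is installed, its CUDA details."""
    print("Python", sys.version.split()[0], "supported:", python_supported())
    try:
        import torch  # optional: only present after installation
    except ImportError:
        print("PyTorch not installed yet")
        return
    print("PyTorch", torch.__version__, "built for CUDA", torch.version.cuda)
    if torch.cuda.is_available():
        # An RTX 4080 is expected to report (8, 9), i.e. sm_89.
        print("Compute capability:", torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    report_environment()
```

Running the file as a script prints a short report; importing it gives you python_supported for use in your own setup checks.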

Installation

1. Install the `voxcpm` package

The canonical install is the published PyPI package, identical across both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

The RTX 4080 (Ada Lovelace, sm_89) has full FlashAttention-2 and Triton kernel coverage on stock CUDA wheels — the default pip install torch already includes sm_89 kernels, so no special wheel selection or index URL is required.

2. (Optional) Install from source for the web demo

If you want the Gradio playground, clone the repo and install in editable mode:

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

Source: VoxCPM 2 Quick Start docs.

3. (Optional) Pre-download model weights

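Weights download automatically on first inference, but you can pre-fetch them. By default they go to the Hugging Face cache; set the HF_HOME environment variable beforehand if you want that cache somewhere else:

from huggingface_hub import snapshot_download

# Downloads (or reuses) the snapshot and prints the local directory it lives in.
print(snapshot_download("openbmb/VoxCPM2"))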

Running

Python — basic synthesis

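import soundfile as sf
from voxcpm import VoxCPM

# Load the model; load_denoiser=False skips the optional reference-audio denoiser.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,            # guidance strength; value taken from the HF card example
    inference_timesteps=10,   # more timesteps: slower, higher quality
)
# Use the model's own sample rate instead of hardcoding 48000.
sf.write("output.wav", wav, model.tts_model.sample_rate)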

Three notes on this snippet; the first two come straight from the HF model card:

load_denoiser=False skips loading the optional denoiser, which saves memory and download time. Leave it at False unless you plan to enhance prompt/reference audio for voice cloning.
The sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded. VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz.
inference_timesteps sets the speed/quality trade-off: higher values improve quality at the cost of slower synthesis.

Python — zero-shot voice cloning

Supply a short reference clip and the model mimics the speaker's timbre, accent, and pacing. Per the HF card, basic cloning passes reference_wav_path; "Ultimate Cloning" additionally passes prompt_wav_path plus the reference's exact transcript via prompt_text for maximum fidelity:

wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)
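
The two cloning modes differ only in which keyword arguments reach generate. The sketch below assembles those arguments without touching the model. The helper name cloning_kwargs is hypothetical; the parameter names reference_wav_path, prompt_wav_path, and prompt_text are the ones named above, and the rule that the prompt clip and its transcript must be supplied together is an assumption drawn from that description, not a documented constraint.

```python
def cloning_kwargs(text, reference_wav_path, prompt_wav_path=None, prompt_text=None):
    """Build keyword arguments for basic or "Ultimate" cloning.

    Hypothetical helper: it only assembles a dict and never touches the model.
    """
    # Assumption: the prompt clip and its transcript must be supplied together.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be given together")
    kwargs = {"text": text, "reference_wav_path": reference_wav_path}
    if prompt_wav_path is not None:
        kwargs["prompt_wav_path"] = prompt_wav_path
        kwargs["prompt_text"] = prompt_text
    return kwargs


basic = cloning_kwargs("Hello from a cloned voice.", "speaker.wav")
ultimate = cloning_kwargs(
    "Hello from a cloned voice.",
    "speaker.wav",
    prompt_wav_path="speaker.wav",
    prompt_text="Exact transcript of speaker.wav goes here.",
)
```

With a loaded model, either dict can be passed straight through, for example wav = model.generate(**ultimate).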

Python — voice design (new in v2)

VoxCPM2 lets you describe the desired voice in natural language as a parenthesised prefix inside the text itself, as shown on the HF card:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("designed.wav", wav, model.tts_model.sample_rate)
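
If you generate many designed voices, it can help to build the prefixed text in one place. This is a minimal sketch: design_prompt is a hypothetical helper, the "(description)text" layout is taken from the example above, and the rejection of nested parentheses is a cautious assumption rather than a documented rule.

```python
def design_prompt(description, text):
    """Prefix text with a parenthesised voice description."""
    description = description.strip()
    if not description:
        return text  # no description: plain synthesis text
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could confuse the description prefix.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = design_prompt(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
```

The resulting string is what you would pass as text to model.generate.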

Streaming

The package exposes a streaming generator that yields audio chunks as they are produced, per the HF card:

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM2!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
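
Concatenating every chunk defeats part of the point of streaming for long passages. As an alternative, the sketch below appends each chunk to a WAV file as it arrives, using only the standard-library wave module instead of soundfile. It assumes each chunk is a 1-D sequence of floats in the range -1.0 to 1.0 (not confirmed by the HF card) and writes 16-bit PCM; write_stream_to_wav is a hypothetical helper name.

```python
import struct
import wave


def write_stream_to_wav(chunks, path, sample_rate):
    """Append each chunk to a 16-bit mono WAV file as it arrives.

    Assumes every chunk is a 1-D sequence of floats in [-1.0, 1.0].
    Returns the number of samples written.
    """
    written = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to the valid range, then scale to signed 16-bit integers.
            ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wav_file.writeframes(struct.pack(f"<{len(ints)}h", *ints))
            written += len(ints)
    return written
```

With a loaded model this would be called as write_stream_to_wav(model.generate_streaming(text="..."), "streaming.wav", model.tts_model.sample_rate). The per-sample Python loop is slow for long audio; a NumPy conversion would be the natural optimisation.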

Colocating with a second model — the real use case

VoxCPM2's ~8 GB footprint leaves roughly 8 GB free on a 16 GB RTX 4080. That headroom is the genuine reason to run this model on a 4080 rather than a smaller card. The RTX 4080's PCIe Gen4 x16 host link (versus the RTX 4060 Ti's Gen4 x8) also has twice the lanes for CPU↔GPU staging, so host transfers are faster than on the narrower card. Practical ways to spend the spare VRAM:

Pair with a small LLM for a voice-assistant loop. A 7B–8B language model at Q4 (~5–6 GB) fits alongside VoxCPM2, letting you run text generation and speech synthesis on the same card.
Add an ASR front-end. Load a Whisper-class transcription model next to VoxCPM2 to build a full speech-to-speech pipeline (transcribe → reason → synthesise) on one GPU.
Batch multiple TTS streams. With memory to spare, raise concurrency rather than serialising requests one at a time.

When you stack models, keep load_denoiser=False (see Troubleshooting) so VoxCPM2's footprint stays at its baseline.
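
Before stacking models, it is worth doing the arithmetic explicitly. The sketch below uses only the approximate figures quoted in this guide; remaining_vram_gb is a hypothetical helper, and the estimate ignores KV cache, activations, and CUDA context overhead, all of which reduce real headroom.

```python
def remaining_vram_gb(total_gb, footprints_gb):
    """Return VRAM left after loading every model in footprints_gb (name -> GB)."""
    return total_gb - sum(footprints_gb.values())


# Figures quoted in this guide; treat them as approximate.
stack = {"VoxCPM2": 8.0, "7B-8B LLM at Q4 (upper estimate)": 6.0}
free_gb = remaining_vram_gb(16.0, stack)
print(f"Estimated headroom: {free_gb:.1f} GB")
```

On a live system, torch.cuda.mem_get_info() returns the actual free and total bytes on the device, which is a better guide than any static estimate.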

Results

Speed: No RTX 4080-specific benchmark exists yet. The official model card reports a Real-Time Factor (RTF) of ~0.30 measured on an NVIDIA RTX 4090 (1 s of audio synthesised in ~0.3 s of GPU time), and ~0.13 with the Nano-vLLM accelerated path. The RTX 4080 is the same Ada Lovelace generation (sm_89) but has lower memory bandwidth (~717 GB/s vs the 4090's ~1008 GB/s) and lower compute. Because TTS workloads of this class are bandwidth-bound during the autoregressive decode step, expect the 4080 to be slower (a higher RTF) than the 4090's ~0.30 figure. Treat the RTX 4090 number strictly as a faster-sibling reference, not a 4080 prediction. Track /check/voxcpm2/rtx-4080 for community results and share your own via /contribute.
VRAM usage: ~8 GB per the official Model Details table. On a 16 GB RTX 4080 that leaves roughly 8 GB free for stacking another model.
Quality notes: 48 kHz studio-quality output via the AudioVAE V2 path (16 kHz reference → 48 kHz output), tokenizer-free diffusion-autoregressive architecture (LocEnc → TSLM → RALM → LocDiT) on the MiniCPM-4 backbone, 30 supported languages plus several Chinese dialects. Apache-2.0 licensed — free for commercial use per the model card License section. The supported NVIDIA route is the stock bf16 path through the voxcpm package.
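
Since no RTX 4080 figure exists yet, you may want to measure RTF yourself. The sketch below computes it as synthesis time divided by audio duration, matching the definition used above. Both function names are hypothetical, and measure_rtf accepts any zero-argument callable that returns a 1-D sequence of samples, so it makes no assumptions about the voxcpm API beyond what the earlier snippets show.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesising / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(generate, sample_rate):
    """Time one call to generate() and return its RTF.

    generate is any zero-argument callable returning a 1-D sequence of samples.
    """
    start = time.perf_counter()
    wav = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wav) / sample_rate)
```

With a loaded model, a call might look like measure_rtf(lambda: model.generate(text="..."), model.tts_model.sample_rate). Run one untimed warm-up generation first so that one-off compilation and weight loading do not inflate the result.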

For the full benchmark data, see /check/voxcpm2/rtx-4080.

Troubleshooting

`torch.compile` / Triton errors during warm-up

torch._dynamo.exc.Unsupported or Triton import failures on launch are most common on Windows or older CUDA stacks. Disable compile optimisation when loading:

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", optimize=False)

The RTX 4080 itself is fine here — Ada sm_89 is in the well-tested kernel-coverage band for the Triton releases that ship with PyTorch 2.5+. See the GitHub repo for version-pinning guidance.
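
If the same script has to run on both Linux and Windows machines, one option is to pick the optimize flag per platform. This is only a sketch of one possible policy: default_optimize is a hypothetical helper, it reacts to the operating system alone, and it does not detect the older CUDA stacks mentioned above.

```python
import platform


def default_optimize(system=None):
    """Return False on Windows, where compile/Triton failures are most common."""
    system = system or platform.system()
    return system != "Windows"


# Keyword arguments for VoxCPM.from_pretrained, as used elsewhere in this guide.
load_kwargs = {"load_denoiser": False, "optimize": default_optimize()}
```

The dict would then be passed as VoxCPM.from_pretrained("openbmb/VoxCPM2", **load_kwargs).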

Garbled Chinese audio in third-party runtimes

VoxCPM2 ships a custom VoxCPM2Tokenizer (tokenization_voxcpm2.py) that splits multi-character Chinese tokens into single-character IDs before embedding — the model was trained that way, and a stock LlamaTokenizerFast produces multi-character tokens the model never saw, yielding garbled Chinese output. Using the voxcpm package as shown above applies this splitting automatically; if you wire VoxCPM2 into a different inference framework (vLLM, Nano-vLLM, etc.), make sure the bundled tokenizer is loaded so Chinese synthesis stays correct.
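
To make the splitting step concrete, here is an illustrative sketch that operates on token strings. It is not the code from tokenization_voxcpm2.py: the real tokenizer works on token IDs, the function names here are hypothetical, and the choice of the CJK Unified Ideographs block (U+4E00 to U+9FFF) as the test for "Chinese" is a simplifying assumption.

```python
def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def split_multichar_cjk(tokens):
    """Split tokens made entirely of CJK ideographs into single characters."""
    out = []
    for token in tokens:
        if len(token) > 1 and all(is_cjk(ch) for ch in token):
            out.extend(token)  # one entry per character
        else:
            out.append(token)
    return out


pieces = split_multichar_cjk(["你好", "世界", "hello", "！"])
```

A runtime that skips this step would feed the model the two-character tokens unchanged, which is the failure mode described above.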

Tight on VRAM with other models loaded

A 16 GB RTX 4080 has ~8 GB of comfortable headroom over VoxCPM2's default footprint. If you stack it alongside an LLM, ASR model, or another TTS pipeline, set load_denoiser=False (as in the basic synthesis snippet) to skip the optional reference-audio denoiser. The denoiser is only needed when enhancing prompt/reference audio for cloning; leaving it off keeps VoxCPM2 at its baseline footprint and frees memory for the colocated model.