VoxCPM2 on RTX 4070: 30-Language 48 kHz Voice Cloning in ~8 GB VRAM

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter, tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description like "A young woman, gentle and sweet voice". VoxCPM2 is the successor to VoxCPM (the original 0.5B model): per the Models & Versions table, the v2 jump moves to 2B parameters, upgrades audio output from 16 kHz to 48 kHz (via AudioVAE V2's built-in super-resolution), and expands language coverage from 2 to 30 languages plus several Chinese dialects.

Hardware data: RTX 4070 (12 GB VRAM) · the ~8 GB VRAM requirement from the official model card leaves ~4 GB free for other workloads · benchmark data: /check/voxcpm2/rtx-4070

ℹ️ Pick the right variant. This recipe targets VoxCPM2 (the latest 2B model). It is distinct from the older VoxCPM-0.5B and the intermediate VoxCPM1.5 (0.6B) — those are separate, smaller models with fewer languages and lower sample rates. Load openbmb/VoxCPM2 exactly as shown below to get the 2B / 30-language / 48 kHz model.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (model card lists `VRAM: ~8 GB` in the Model Details table)	RTX 4070 (Ada Lovelace AD104, sm_89, 12 GB)
RAM	8 GB	—
Storage	~5 GB (`model.safetensors` 4.58 GB + `audiovae.pth` 0.38 GB, per the HF Files tab)	—
Software	Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0 (source)	—
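Before installing, it can help to confirm the machine meets the table above. The following is a minimal sketch, not part of the voxcpm package: the helper names (python_supported, version_at_least, check_environment) are made up for this recipe, the thresholds are copied from the table, and the torch import is guarded so the script still runs before PyTorch is installed. It checks free disk space in your home directory on the assumption that the weights land in the default Hugging Face cache there.

```python
import re
import shutil
import sys
from pathlib import Path


def python_supported(version=None):
    """True for Python >= 3.10 and < 3.13, the range listed in the table."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 13)


def _numeric(version, width=3):
    """'2.5.1+cu121' -> (2, 5, 1); short versions are padded with zeros."""
    nums = [int(n) for n in re.findall(r"\d+", version.split("+")[0])]
    return tuple((nums + [0] * width)[:width])


def version_at_least(found, minimum):
    """Compare dotted version strings on their first three numeric parts."""
    return _numeric(found) >= _numeric(minimum)


def check_environment(min_free_disk_gb=5.0):
    """Return a dict of checks; keys are hypothetical names for this sketch."""
    report = {
        "python_ok": python_supported(),
        # Assumption: weights go to the default cache under the home directory.
        "disk_ok": shutil.disk_usage(Path.home()).free / 1e9 >= min_free_disk_gb,
    }
    try:
        import torch
    except ImportError:
        report["torch_ok"] = False
        return report
    report["torch_ok"] = version_at_least(torch.__version__, "2.5.0")
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda_ok"] = bool(
        torch.cuda.is_available()
        and cuda_version
        and version_at_least(cuda_version, "12.0")
    )
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        report["vram_gb"] = round(props.total_memory / 1e9, 1)
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

Any False entry points at the row of the table to fix first; vram_gb is reported as a number so you can compare it against the ~8 GB figure yourself.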

Installation

1. Install the `voxcpm` package

The canonical install is the published PyPI package, identical across both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

The RTX 4070 (Ada Lovelace, sm_89) runs the stock bf16 path through the voxcpm package on default CUDA wheels — the default pip install torch already includes sm_89 kernels, so no special wheel selection or index URL is required. The package handles attention via PyTorch's built-in SDPA; there is no FlashAttention-2 install step to add.

2. (Optional) Pre-download model weights

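Weights download automatically on first inference, but you can pre-fetch them so the first run does not stall on the ~5 GB download. By default they land in the Hugging Face cache; set the HF_HOME environment variable before running to relocate it:

from huggingface_hub import snapshot_download

# Returns the local snapshot directory inside the Hugging Face cache.
print(snapshot_download("openbmb/VoxCPM2"))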

3. (Optional) Run the web demo

To launch the Gradio playground, run app.py from the root of a local clone of the OpenBMB/VoxCPM GitHub repository, per the GitHub README:

python app.py --port 8808   # then open http://localhost:8808

Running

Python — basic synthesis

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)

Two notes on this snippet, both straight from the HF model card:

load_denoiser=False skips loading the optional reference-audio denoiser. Keep it off unless you plan to use voice cloning with prompt/reference audio — the denoiser is only needed to clean up reference clips, and leaving it off saves memory and download time.
The sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded — VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz.

Raising inference_timesteps trades speed for quality: more steps generally sound better but take longer to generate.
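To see that trade-off on your own card, you can time a few settings side by side. This is a rough sketch: sweep_timesteps is a hypothetical helper, it only relies on the generate() keyword arguments shown above, and it assumes the returned audio is a 1-D array so that len() gives the sample count.

```python
import time


def sweep_timesteps(model, text, steps=(5, 10, 20), cfg_value=2.0):
    """Time model.generate at several inference_timesteps settings.

    Returns a list of (timesteps, seconds, num_samples) tuples.
    """
    results = []
    for n in steps:
        start = time.perf_counter()
        wav = model.generate(text=text, cfg_value=cfg_value, inference_timesteps=n)
        elapsed = time.perf_counter() - start
        results.append((n, elapsed, len(wav)))
    return results


# Usage, with the model loaded as in the snippet above:
# for n, seconds, samples in sweep_timesteps(model, "A short test sentence."):
#     print(f"{n:>3} steps: {seconds:.2f} s for {samples} samples")
```

The first timed call may include one-off warm-up cost, so consider running a throwaway generation before the sweep and listening to each output rather than judging on time alone.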

Python — voice design (new in v2)

VoxCPM2 lets you describe the desired voice in natural language inside the text itself, as shown on the HF card — put the description in parentheses at the start of text:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
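Since the description is just a parenthesised prefix on text, a small helper can keep the voice description separate from the words to be spoken. design_text is a hypothetical name for this recipe; only the "(description)text" format comes from the example above.

```python
def design_text(description, content):
    """Prefix content with a parenthesised voice description."""
    # Tolerate descriptions that already carry their own parentheses.
    description = description.strip().strip("()").strip()
    if not description:
        return content
    return f"({description}){content}"


# Usage, with the model loaded as in the basic synthesis snippet:
# wav = model.generate(
#     text=design_text("A young woman, gentle and sweet voice", "Hello, welcome to VoxCPM2!"),
#     cfg_value=2.0,
#     inference_timesteps=10,
# )
```

This makes it easy to render the same sentence with several candidate descriptions and compare the results by ear.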

Python — zero-shot voice cloning

Supply a short reference clip and the model mimics the speaker's timbre, accent, and pacing. Per the HF card, basic cloning passes reference_wav_path; "Ultimate Cloning" additionally passes prompt_wav_path plus the reference's exact transcript via prompt_text for maximum fidelity:

wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

For cloning, you'll generally want to load the model with load_denoiser=True (rather than the load_denoiser=False used in the basic synthesis example) so the reference audio is cleaned up before voiceprint extraction.
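The two cloning modes described above differ only in which keyword arguments reach generate(). The sketch below builds those arguments; clone_kwargs is a hypothetical helper, and passing the same clip as both reference_wav_path and prompt_wav_path is an assumption made for illustration, so check the HF card if you want to use different clips.

```python
def clone_kwargs(text, reference_wav, prompt_text=None):
    """Build generate() keyword arguments for basic or "Ultimate" cloning.

    Without prompt_text: basic cloning (reference_wav_path only).
    With prompt_text: also pass prompt_wav_path and the exact transcript.
    """
    kwargs = {"text": text, "reference_wav_path": str(reference_wav)}
    if prompt_text:
        # Assumption: the reference clip doubles as the prompt clip.
        kwargs["prompt_wav_path"] = str(reference_wav)
        kwargs["prompt_text"] = prompt_text
    return kwargs


# Usage, with a model loaded with load_denoiser=True:
# wav = model.generate(**clone_kwargs(
#     "This is a cloned voice generated by VoxCPM2.",
#     "speaker.wav",
#     prompt_text="The exact words spoken in speaker.wav.",
# ))
```

The transcript has to match what is actually said in the clip; a paraphrase defeats the purpose of supplying it.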

CLI

The voxcpm package also ships a CLI, per the GitHub README:

# Voice design (no reference audio needed)
voxcpm design --text "VoxCPM2 brings studio-quality multilingual speech synthesis." --output out.wav

# Voice cloning (reference audio)
voxcpm clone --text "This is a voice cloning demo." --reference-audio path/to/voice.wav --output out.wav

Streaming

The package exposes a streaming generator that yields audio chunks as they are produced, per the HF card:

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
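The snippet above collects every chunk before writing, which gives up the main benefit of streaming. As an alternative sketch, the helper below writes each chunk to disk as it arrives, using only the standard library. write_stream_to_wav is a hypothetical name, and it assumes each chunk is a 1-D sequence of mono float samples in the range [-1, 1]; verify that against the chunks you actually receive.

```python
import struct
import wave


def write_stream_to_wav(chunks, path, sample_rate):
    """Write mono float chunks to a 16-bit PCM WAV file as they arrive.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        out.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to the int16 range.
            samples = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in chunk]
            # "<h" = little-endian int16, as the WAV format expects.
            out.writeframes(struct.pack(f"<{len(samples)}h", *samples))
            total += len(samples)
    return total


# Usage, with the model loaded as in the basic synthesis snippet:
# write_stream_to_wav(
#     model.generate_streaming(text="Streaming is easy with VoxCPM!"),
#     "streaming.wav",
#     model.tts_model.sample_rate,
# )
```

The per-sample Python loop is slow for long clips; it is written this way to avoid extra dependencies, and the same idea works with soundfile if you prefer to keep float output.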

Results

Speed: Not yet community-benchmarked on the RTX 4070; no source names this card. The official model card publishes a Real-Time Factor (RTF, seconds of compute per second of audio, so lower is faster) of ~0.30 on the standard path and ~0.13 with the Nano-vLLM accelerated path, but those figures were measured on an NVIDIA RTX 4090 (Ada), a substantially faster card with more compute and roughly twice the memory bandwidth (384-bit bus, ~1 TB/s) of the RTX 4070 (192-bit bus, ~504 GB/s). TTS workloads of this class tend to be bandwidth-bound during the autoregressive decode step, so expect the RTX 4070's RTF to come in above (slower than) the 4090's ~0.30. Treat the RTX 4090 number strictly as a best-case figure from a faster sibling, not as an RTX 4070 prediction. Track /check/voxcpm2/rtx-4070 for results and share your own via /contribute.
VRAM usage: ~8 GB for the stock bf16 path, per the official Model Details table. On the RTX 4070's 12 GB that leaves roughly 4 GB free — enough to keep the runtime stable and a light second workload resident, but not the half-the-card headroom a 16 GB card would offer.
Quality notes: 48 kHz studio-quality output via the AudioVAE V2 path (16 kHz reference → 48 kHz output), tokenizer-free diffusion-autoregressive architecture (LocEnc → TSLM → RALM → LocDiT) on the MiniCPM-4 backbone, 30 supported languages plus several Chinese dialects. Apache-2.0 licensed — free for commercial use per the model card License section. The supported NVIDIA route is the stock bf16 path through the voxcpm package.
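Until a community figure exists, you can measure your own number. RTF is generation time divided by the duration of the audio produced, which the sketch below computes. Both function names are hypothetical, and measure_rtf assumes the generated audio is a 1-D array of samples so that len() gives the sample count.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """RTF = compute time / audio duration. Below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return elapsed_seconds / audio_seconds


def measure_rtf(generate, sample_rate):
    """Time one call to generate() and return its RTF."""
    start = time.perf_counter()
    wav = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wav), sample_rate)


# Usage, with the model loaded as in the Running section:
# rtf = measure_rtf(
#     lambda: model.generate(text="A benchmark sentence.", cfg_value=2.0, inference_timesteps=10),
#     model.tts_model.sample_rate,
# )
# print(f"RTF: {rtf:.2f}")
```

For example, 3 seconds of compute for 10 seconds of 48 kHz audio gives an RTF of 0.30. Run one throwaway generation first so that any warm-up cost does not inflate the measurement.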

For the full benchmark data, see /check/voxcpm2/rtx-4070.

Making use of the 12 GB card

The model needs ~8 GB; the RTX 4070 has 12 GB. That spare ~4 GB is modest but useful — it keeps the default optimize=True (torch.compile + CUDA Graphs) comfortable and leaves room for a small colocated helper rather than forcing you onto an 8 GB card with zero margin:

Keep the voice-cloning helper resident — with load_denoiser=True the reference-audio denoiser and the ASR helper the web demo downloads on first use fit alongside the model for an end-to-end clone-from-clip service.
Keep optimize=True warm — torch.compile + CUDA Graphs (the default) reserves extra working memory for compiled kernels; the ~4 GB headroom means you don't have to disable it to keep the run stable.
Tighter than a 16 GB card — if you specifically want to colocate a 7B-class LLM at Q4 (~4–5 GB) on the same GPU, the 12 GB envelope is too tight to do it alongside VoxCPM2's ~8 GB; step up to a 16 GB card for that, or run the LLM in a separate process with paging.
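The budgeting behind these three points is simple subtraction, sketched below with the approximate figures quoted in this section. The function names are hypothetical, the ~4.5 GB LLM figure is the midpoint of the 4–5 GB range above, and live_free_vram_gb is an optional check that returns None when PyTorch or CUDA is unavailable.

```python
def vram_headroom_gb(card_gb, workloads_gb):
    """Card capacity minus the sum of resident workloads, in GB."""
    return card_gb - sum(workloads_gb.values())


def live_free_vram_gb():
    """Free VRAM on the current CUDA device in GB, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


if __name__ == "__main__":
    # Approximate figures from the text above.
    print(vram_headroom_gb(12, {"VoxCPM2": 8}))                 # 4 GB spare
    print(vram_headroom_gb(12, {"VoxCPM2": 8, "7B Q4": 4.5}))   # negative: does not fit
    print(vram_headroom_gb(16, {"VoxCPM2": 8, "7B Q4": 4.5}))   # fits on a 16 GB card
```

A negative result means the combination will not fit; a small positive one still needs margin for activations and compiled-kernel working memory, so treat these as planning numbers rather than guarantees.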

Troubleshooting

`torch.compile` errors during warm-up

torch._dynamo.exc.Unsupported or Triton import failures on launch are most common on Windows or older CUDA stacks. If you hit platform-specific torch.compile issues, skip it by passing optimize=False:

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False, optimize=False)

The RTX 4070 itself is fine here — Ada sm_89 is in the well-tested kernel-coverage band for the Triton releases that ship with PyTorch 2.5+, and the default pip install torch already includes the sm_89 kernels this card needs.
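If you want one script that works across machines, you can try the optimized path first and fall back automatically. This is a sketch under an assumption: that the torch.compile failure surfaces as an exception while the model is being loaded and warmed up. load_with_fallback is a hypothetical helper, and it takes the loader as an argument so it does not depend on any voxcpm internals.

```python
def load_with_fallback(loader, model_id="openbmb/VoxCPM2", **kwargs):
    """Try optimize=True first; on any exception retry with optimize=False.

    Returns (model, optimized) where optimized says which path succeeded.
    """
    try:
        return loader(model_id, optimize=True, **kwargs), True
    except Exception as exc:  # broad on purpose: compile errors vary by platform
        print(f"optimize=True failed ({type(exc).__name__}: {exc}); "
              "retrying with optimize=False")
        return loader(model_id, optimize=False, **kwargs), False


# Usage:
# from voxcpm import VoxCPM
# model, optimized = load_with_fallback(VoxCPM.from_pretrained, load_denoiser=False)
```

Because the except clause is broad, an unrelated failure such as a missing download will simply fail again on the retry and raise as usual. If the compile error only appears on the first generate() call on your platform, pass optimize=False directly instead.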

Concurrent multi-process workers crash with CUDA Graph errors

If you run two or more independent Python processes, each loading VoxCPM2 against the same GPU and serving inference concurrently, you may hit CUDA-level failures ranging from silent empty exceptions to a CUDACachingAllocator invalid device pointer abort. In GitHub Issue #269, a VoxCPM collaborator confirmed this is a known issue caused by the CUDA Graph optimization path enabled by torch.compile and recommended using Nano-vLLM-VoxCPM or vLLM-Omni for concurrent inference instead, since their single-process serving architecture avoids the multi-process CUDA Graph instability. For single-process, single-stream use on the RTX 4070 this issue does not apply.

Garbled Chinese audio in third-party runtimes

"Tokenizer-free" describes the audio side of VoxCPM2; text still passes through a tokenizer. The model ships a custom VoxCPM2Tokenizer (tokenization_voxcpm2.py) that splits multi-character Chinese tokens into single-character IDs before embedding. The model was trained that way, so a stock LlamaTokenizerFast produces multi-character tokens the model never saw, and the Chinese output comes out garbled. Using the voxcpm package as shown above applies this splitting automatically; if you wire VoxCPM2 into a different inference framework, make sure the bundled tokenizer is loaded so Chinese synthesis stays correct.
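To make the splitting concrete, here is a toy illustration on token strings. It is not the code in tokenization_voxcpm2.py, which works on token IDs and may differ in detail; split_multichar_chinese and is_cjk are hypothetical names, and the check covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_multichar_chinese(tokens):
    """Split tokens made entirely of several CJK characters into single characters."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and all(is_cjk(ch) for ch in tok):
            out.extend(tok)  # one entry per character
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    # A stock tokenizer might emit "你好" as one token; the model expects two.
    print(split_multichar_chinese(["你好", "world", "世界", "!"]))
```

When porting to another runtime, a quick sanity check is to tokenize a Chinese sentence and confirm that no token decodes to more than one Chinese character.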