VoxCPM2 on RTX 4070 Ti Super: 30-Language 48 kHz Voice Cloning with 8 GB to Spare

What You'll Build

A local text-to-speech pipeline using OpenBMB's VoxCPM2 — a 2B-parameter, tokenizer-free, diffusion-autoregressive TTS model built on the MiniCPM-4 backbone. It synthesises 48 kHz studio-quality audio in 30 languages, supports zero-shot voice cloning from a short reference clip, and adds "voice design" — generating a voice from a natural-language description alone (gender, age, tone, emotion, pace). VoxCPM2 is the successor to VoxCPM (the original 0.5B model): per the Models & Versions table, the v2 jump moves to 2B parameters, upgrades audio output from 16 kHz to 48 kHz, and expands language coverage from 2 to 30 languages plus several Chinese dialects.

Hardware data: RTX 4070 Ti Super (16 GB VRAM) · ~8 GB VRAM requirement per the official model card leaves roughly 8 GB free for other workloads · See benchmark data

ℹ️ The 4070 Ti Super is over-provisioned for this model. VoxCPM2 needs only ~8 GB per the official model card, so a 16 GB RTX 4070 Ti Super runs it with about half the card free. The interesting question on this GPU is not "does it fit" (it fits easily) but "what to do with the spare ~8 GB" — see Colocating with a second model below.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (model card lists `VRAM: ~8 GB` in the Model Details table)	RTX 4070 Ti Super (Ada Lovelace AD103, sm_89, 16 GB)
RAM	8 GB	—
Storage	~5 GB (model.safetensors ~4.58 GB + AudioVAE ~0.38 GB, per the HF Files tab)	—
Software	Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0 (source)	—
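Before installing, the requirements in the table above can be checked with a short preflight script. This is a minimal sketch, not part of the voxcpm package: the helper names (python_supported, version_tuple, enough_disk, torch_report) are made up for illustration, and the 5 GB threshold is simply the approximate storage figure from the table.

```python
import shutil
import sys

MIN_FREE_GB = 5.0  # approximate download size from the Requirements table


def python_supported(version=None):
    """True when the interpreter is in the documented 3.10 <= v < 3.13 range."""
    version = sys.version_info if version is None else version
    return (3, 10) <= tuple(version[:2]) < (3, 13)


def version_tuple(text):
    """Parse 'major.minor' from a version string such as '2.5.1+cu124'."""
    parts = []
    for piece in text.split("+")[0].split(".")[:2]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts)


def enough_disk(path=".", needed_gb=MIN_FREE_GB):
    """True when the filesystem holding `path` has at least needed_gb free."""
    return shutil.disk_usage(path).free / 1e9 >= needed_gb


def torch_report():
    """Describe the local PyTorch/CUDA setup, or return None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    available = torch.cuda.is_available()
    return {
        "torch_ok": version_tuple(torch.__version__) >= (2, 5),
        "cuda_available": available,
        "cuda_build": torch.version.cuda,
        # An RTX 4070 Ti Super should report (8, 9), i.e. sm_89.
        "capability": torch.cuda.get_device_capability(0) if available else None,
    }
```

Calling python_supported(), enough_disk() and torch_report() in a REPL gives a quick picture of what is missing. torch_report() returns None until PyTorch is installed, which is expected on a fresh environment.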

Installation

1. Install the `voxcpm` package

The canonical install is the published PyPI package, identical across both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

The RTX 4070 Ti Super (Ada Lovelace, sm_89) runs the stock bf16 path through the voxcpm package on default CUDA wheels — the default pip install torch already includes sm_89 kernels, so no special wheel selection or index URL is required. The package handles attention via PyTorch's built-in SDPA; there is no FlashAttention-2 install step to add.

2. (Optional) Pre-download model weights

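Weights download automatically on first inference, but you can pre-fetch them so the first call does not stall on the ~5 GB download. By default they land in the Hugging Face cache; set the HF_HOME environment variable beforehand if that cache needs to live on a different disk: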

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM2")

3. (Optional) Run the web demo

To launch the Gradio playground, run app.py from the root of a cloned OpenBMB/VoxCPM checkout, per the GitHub README:

python app.py --port 8808   # then open http://localhost:8808

Running

Python — basic synthesis

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)

Two notes on this snippet, both straight from the HF model card:

load_denoiser=False skips loading the optional reference-audio denoiser. The usage guide notes the denoiser runs in a 16 kHz pipeline and can slightly change voice characteristics — keep it off unless cloning quality needs it; it saves memory and download time.
The sample rate is read off the loaded model (model.tts_model.sample_rate) rather than hardcoded — VoxCPM2 emits 48 kHz audio, up from VoxCPM v1's 16 kHz.

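Higher inference_timesteps values trade speed for quality: more diffusion steps generally give cleaner audio but take longer to synthesise, and lower values do the reverse.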

Python — zero-shot voice cloning

Supply a short reference clip and the model mimics the speaker's timbre, accent, and pacing. Per the HF card, basic cloning passes only reference_wav_path, as in the snippet below; "Ultimate Cloning" additionally passes prompt_wav_path plus the reference's exact transcript via prompt_text for maximum fidelity:

wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)
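The "Ultimate Cloning" variant described above can be sketched as a small helper that builds the keyword arguments for model.generate. The parameter names (reference_wav_path, prompt_wav_path, prompt_text) come from the text above; the helper name cloning_kwargs and the choice to reuse the same clip for both audio arguments are assumptions made for illustration.

```python
def cloning_kwargs(text, reference_wav, transcript=None):
    """Build keyword arguments for model.generate.

    Without a transcript this is basic cloning (reference_wav_path only).
    With a transcript it also sets prompt_wav_path and prompt_text.
    Assumption: the same clip serves as both reference and prompt audio.
    """
    kwargs = {"text": text, "reference_wav_path": reference_wav}
    if transcript:
        kwargs["prompt_wav_path"] = reference_wav
        kwargs["prompt_text"] = transcript
    return kwargs


# Usage with the model loaded earlier (not executed here):
#   wav = model.generate(**cloning_kwargs(
#       "This is a cloned voice generated by VoxCPM2.",
#       "speaker.wav",
#       transcript="Exact words spoken in speaker.wav.",
#   ))
```

The transcript should match the reference clip word for word, since the text above calls for the reference's exact transcript.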

Python — voice design (new in v2)

VoxCPM2 lets you describe the desired voice in natural language inside the text itself, as shown on the HF card — put the description in parentheses at the start of text:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
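When the description and the spoken text come from separate inputs (a form field, a config file), a tiny helper keeps the parenthesised prefix consistent. This is only a sketch of the string format shown above; design_text is a hypothetical name, not a voxcpm function.

```python
def design_text(description, text):
    """Prefix `text` with a parenthesised voice description.

    Mirrors the format in the snippet above: "(description)text".
    """
    description = description.strip().strip("()").strip()
    if not description:
        raise ValueError("voice description must not be empty")
    return f"({description}){text}"


prompt = design_text("A young woman, gentle and sweet voice",
                     "Hello, welcome to VoxCPM2!")
```

The resulting string is passed as the text argument to model.generate exactly as in the voice design snippet.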

CLI

The voxcpm package also ships a CLI, per the GitHub README:

# Voice design (no reference audio needed)
voxcpm design --text "VoxCPM2 brings studio-quality multilingual speech synthesis." --output out.wav

# Voice cloning (reference audio)
voxcpm clone --text "This is a voice cloning demo." --reference-audio path/to/voice.wav --output out.wav
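To script the CLI over several lines of text, the two commands above can be wrapped with subprocess. This is a sketch that uses only the flags shown above (--text, --output, --reference-audio); the function names and the output file naming scheme are illustrative assumptions.

```python
import subprocess


def design_argv(text, output):
    """Argument list for the `voxcpm design` command shown above."""
    return ["voxcpm", "design", "--text", text, "--output", output]


def clone_argv(text, reference_audio, output):
    """Argument list for the `voxcpm clone` command shown above."""
    return ["voxcpm", "clone", "--text", text,
            "--reference-audio", reference_audio, "--output", output]


def run_design_batch(lines, prefix="line"):
    """Synthesise each line to its own file; raises if any command fails."""
    outputs = []
    for index, text in enumerate(lines, start=1):
        output = f"{prefix}_{index:03d}.wav"
        subprocess.run(design_argv(text, output), check=True)
        outputs.append(output)
    return outputs
```

Each CLI call starts a new process and therefore reloads the model, so for more than a handful of lines the Python API with one loaded model is likely the faster route.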

Streaming

The package exposes a streaming generator that yields audio chunks as they are produced, per the HF card:

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
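The practical benefit of streaming is a short wait before the first audio arrives, so it is worth measuring. The helper below is a minimal sketch that works with any chunk iterator; collect_with_latency is a hypothetical name, and it takes a zero-argument callable so the timer starts before the stream is created.

```python
import time


def collect_with_latency(make_stream):
    """Drain a chunk stream and report how long the first chunk took.

    make_stream: zero-argument callable returning an iterable of audio chunks.
    Returns (seconds_to_first_chunk, chunks); the latency is None when the
    stream yields nothing.
    """
    start = time.perf_counter()
    first_chunk_seconds = None
    chunks = []
    for chunk in make_stream():
        if first_chunk_seconds is None:
            first_chunk_seconds = time.perf_counter() - start
        chunks.append(chunk)
    return first_chunk_seconds, chunks


# Usage with the model loaded earlier (not executed here):
#   latency, chunks = collect_with_latency(
#       lambda: model.generate_streaming(text="Streaming is easy with VoxCPM!"))
```

The returned chunks can be concatenated and written out exactly as in the streaming snippet above.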

Colocating with a second model — the real use case

VoxCPM2's ~8 GB footprint leaves roughly 8 GB free on a 16 GB RTX 4070 Ti Super. That headroom is the genuine reason to run this model on a 4070 Ti Super rather than a smaller card. Practical ways to spend the spare VRAM:

Pair with a small LLM for a voice-assistant loop. A 7B–8B language model at Q4 (~5–6 GB) fits alongside VoxCPM2, letting you run text generation and speech synthesis on the same card.
Add an ASR front-end. Load a Whisper-class transcription model next to VoxCPM2 to build a full speech-to-speech pipeline (transcribe → reason → synthesise) on one GPU.
Serve concurrent TTS streams the supported way. With memory to spare you can raise concurrency, but do it through a single-process serving runtime — Nano-vLLM-VoxCPM or vLLM-Omni — rather than spawning multiple subprocess workers, which a maintainer has confirmed is unstable (see Troubleshooting).

When you stack models, keep load_denoiser=False (see Troubleshooting) so VoxCPM2's footprint stays at its baseline.
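The budgeting above is simple arithmetic, and writing it down makes the margin explicit. The sketch below uses the figures quoted in this guide (16 GB card, ~8 GB for VoxCPM2, up to ~6 GB for a Q4 7B-8B LLM); the function names are hypothetical, and the estimates ignore runtime overhead such as the CUDA context, activations, and an LLM's KV cache.

```python
def vram_headroom_gb(total_gb, footprints_gb):
    """VRAM left after loading every model in footprints_gb (name -> GB)."""
    return total_gb - sum(footprints_gb.values())


def free_vram_gb(device=0):
    """Live free VRAM in GB via torch.cuda.mem_get_info, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1e9


# Figures taken from this guide; treat them as estimates, not measurements.
plan = {"VoxCPM2": 8.0, "7B-8B LLM at Q4 (upper estimate)": 6.0}
headroom = vram_headroom_gb(16.0, plan)
print(f"planned headroom: {headroom:.1f} GB")  # planned headroom: 2.0 GB
```

A planned margin of about 2 GB is thin, so checking free_vram_gb() after loading the first model and before loading the second is a sensible precaution.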

Results

Speed: No RTX 4070 Ti Super-specific benchmark exists yet, so no throughput number is quoted for this card. For reference only, the official model card reports a Real-Time Factor (RTF: synthesis time divided by audio duration, so lower is faster) of ~0.30 on an NVIDIA RTX 4090 with the standard path and ~0.13 with the Nano-vLLM accelerated path. The RTX 4070 Ti Super is the same Ada Lovelace generation (sm_89) but has lower compute and substantially lower memory bandwidth (256-bit bus vs the 4090's 384-bit). Autoregressive decoding in TTS models of this class is typically bandwidth-bound, so expect an RTF above (slower than) the 4090's ~0.30. Treat the RTX 4090 figure strictly as a best-case bound from a faster sibling, not as a 4070 Ti Super prediction. Track /check/voxcpm2/rtx-4070-ti-super for community results and share your own via /contribute.
VRAM usage: ~8 GB per the official Model Details table. On a 16 GB RTX 4070 Ti Super that leaves roughly 8 GB free for stacking another model.
Quality notes: 48 kHz studio-quality output via the AudioVAE V2 path (16 kHz reference → 48 kHz output), tokenizer-free diffusion-autoregressive architecture (LocEnc → TSLM → RALM → LocDiT) on the MiniCPM-4 backbone, 30 supported languages plus several Chinese dialects. Apache-2.0 licensed — free for commercial use per the model card License section. The supported NVIDIA route is the stock bf16 path through the voxcpm package.

For the full benchmark data, see /check/voxcpm2/rtx-4070-ti-super.
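Since no RTF figure exists for this card yet, measuring your own takes only a few lines. The sketch below uses the standard definition (synthesis time divided by the duration of the audio produced); the function names are hypothetical, and it assumes the generate callable returns a one-dimensional waveform whose length is the sample count, as the snippets above do when writing to disk.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesising / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


def timed_rtf(generate, sample_rate, **kwargs):
    """Time one generate(**kwargs) call and return its RTF."""
    start = time.perf_counter()
    wav = generate(**kwargs)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wav), sample_rate)


# Usage with the model loaded earlier (not executed here):
#   rtf = timed_rtf(model.generate, model.tts_model.sample_rate,
#                   text="A sentence long enough to be representative.",
#                   cfg_value=2.0, inference_timesteps=10)
```

As a reading aid, an RTF of 0.30 means 10 seconds of audio takes about 3 seconds to synthesise. Run the measurement a few times and discard the first result, which may include one-off warm-up cost.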

Troubleshooting

`torch.compile` errors during warm-up

torch._dynamo.exc.Unsupported or Triton import failures on launch are most common on Windows or older CUDA stacks. The Quick Start docs note: if you hit platform-specific torch.compile issues, pass optimize=False:

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False, optimize=False)

The RTX 4070 Ti Super itself is not the cause here: Ada (sm_89) is well covered by the Triton releases that ship with PyTorch 2.5+.
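If the same script has to run on machines with and without a working torch.compile stack, the optimize=False workaround can be applied automatically. This is a sketch under two assumptions: that the failure surfaces while the model is being constructed, and that retrying once on any exception is acceptable. load_with_fallback is a hypothetical helper, not part of voxcpm.

```python
def load_with_fallback(loader, **kwargs):
    """Call loader(**kwargs); on failure retry once with optimize=False.

    `loader` is any callable that accepts the same keyword arguments as
    VoxCPM.from_pretrained in the snippets above.
    """
    try:
        return loader(**kwargs)
    except Exception as err:  # broad on purpose: compile errors vary by platform
        print(f"default load failed ({type(err).__name__}); retrying with optimize=False")
        retry_kwargs = dict(kwargs, optimize=False)
        return loader(**retry_kwargs)


# Usage (not executed here):
#   from voxcpm import VoxCPM
#   model = load_with_fallback(
#       lambda **kw: VoxCPM.from_pretrained("openbmb/VoxCPM2", **kw),
#       load_denoiser=False,
#   )
```

Because the except clause is broad, an unrelated failure such as a network error also triggers one retry before the exception propagates, so read the printed error type before assuming torch.compile was at fault.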

Concurrent multi-process workers crash with CUDA Graph errors

If you run two or more independent Python processes that each load VoxCPM2 on the same GPU and serve inference concurrently, you may hit CUDA-level failures, ranging from silent empty exceptions to a CUDACachingAllocator invalid device pointer abort. In GitHub Issue #269 a VoxCPM maintainer confirmed this is a known issue caused by the CUDA Graph optimization path that torch.compile enables, and recommended Nano-vLLM-VoxCPM or vLLM-Omni for concurrent inference instead, because their single-process serving architecture avoids the multi-process CUDA Graph instability. This issue does not apply to single-process, single-stream use on the 4070 Ti Super.

Garbled Chinese audio in third-party runtimes

Although VoxCPM2 is tokenizer-free on the audio side (it does not quantise speech into discrete tokens), text input still passes through a text tokenizer. The model ships a custom VoxCPM2Tokenizer (tokenization_voxcpm2.py) that splits multi-character Chinese tokens into single-character IDs before embedding. The model was trained that way, so a stock LlamaTokenizerFast produces multi-character tokens the model never saw, which yields garbled Chinese output. Using the voxcpm package as shown above applies this splitting automatically; if you wire VoxCPM2 into a different inference framework, make sure the bundled tokenizer is loaded so Chinese synthesis stays correct.
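The splitting idea can be illustrated on plain strings. The toy function below is not the bundled tokenizer's implementation, which works on token IDs; it only shows what "split multi-character Chinese tokens into single characters" means. The function names and the use of the basic CJK Unified Ideographs range are assumptions for this sketch.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_multichar_chinese(tokens):
    """Split tokens made of several Chinese characters into single characters.

    Other tokens (Latin text, single characters) pass through unchanged.
    """
    out = []
    for tok in tokens:
        if len(tok) > 1 and all(is_cjk(ch) for ch in tok):
            out.extend(tok)  # one output token per character
        else:
            out.append(tok)
    return out


pieces = split_multichar_chinese(["你好", "hello", "世界", "我"])
# pieces == ["你", "好", "hello", "世", "界", "我"]
```

When porting to another runtime, a quick sanity check along these lines is to tokenize a Chinese sentence and confirm that no resulting token spans more than one Chinese character.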

Tight on VRAM with other models loaded

A 16 GB RTX 4070 Ti Super has ~8 GB of comfortable headroom over VoxCPM2's default footprint, but if you stack it alongside an LLM, ASR model, or another TTS pipeline, set load_denoiser=False (as in every snippet above) to skip the optional reference-audio denoiser. The denoiser is only needed when enhancing prompt/reference audio for cloning; leaving it off keeps VoxCPM2 at its baseline footprint and frees memory for the colocated model.