VoxCPM-0.5B on Apple M2 Max: Zero-Shot Voice Cloning TTS in Unified Memory (MPS)

What You'll Build

A fully-local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B on an Apple M2 Max — running on Apple's MPS (Metal) backend through stock PyTorch, with no NVIDIA GPU, no CUDA, and no FlashAttention. VoxCPM-0.5B is a 0.5B-parameter, tokenizer-free TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav. The weights are only ~1.6 GB on disk, so on the M2 Max's 64 GB unified memory this is one of the most comfortable models the site covers — it fits with room to spare and needs no memory tuning.

Hardware data: Apple M2 Max (64 GB unified memory) · VoxCPM-0.5B weights ~1.6 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — not 64 GB of dedicated VRAM. By default macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's recommendedMaxWorkingSetSize). VoxCPM-0.5B's ~1.6 GB of weights (and a runtime peak in the low single-digit GB even with the optional denoiser/ASR helpers loaded) sit so far below that ceiling that the addressable-share caveat is moot here — it runs comfortably on any Apple Silicon Mac, down to an 8 GB MacBook, with no wired-limit tuning.
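The headroom arithmetic in that note can be sketched in a few lines. This is a rough illustration only: the 75% share is the default figure quoted above for the M2 Max, not a value read from the system, and smaller Macs may apply a different share. The helper names are hypothetical.

```python
# Rough headroom arithmetic. The 0.75 share is an assumption taken from the
# note above; it is not queried from macOS or Metal.
def addressable_gb(unified_gb: float, gpu_share: float = 0.75) -> float:
    """Estimate how much unified memory the GPU can address by default."""
    return unified_gb * gpu_share


def weights_fraction(weights_gb: float, unified_gb: float, gpu_share: float = 0.75) -> float:
    """Fraction of the GPU-addressable pool taken by the model weights."""
    return weights_gb / addressable_gb(unified_gb, gpu_share)


for total_gb in (8, 64):
    budget = addressable_gb(total_gb)
    share = weights_fraction(1.6, total_gb)
    print(f"{total_gb} GB unified: ~{budget:.0f} GB addressable, weights use {share:.1%}")
```

The estimate covers the weights only; activations and the optional helper models come on top of it.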

ℹ️ This recipe pins VoxCPM-0.5B (legacy), not VoxCPM1.5 or VoxCPM2. The OpenBMB repo ships three versions per the official Models & Versions table: VoxCPM2 (2B backbone, 48 kHz, 30 languages — the latest, now the repo's default), VoxCPM1.5 (0.6B, 44.1 kHz), and VoxCPM-0.5B (0.5B, 16 kHz, Chinese/English — the original release). The upstream Quick Start and CLI have moved their default to openbmb/VoxCPM2 — the GitHub README now headlines VoxCPM2 and its sample code calls from_pretrained("openbmb/VoxCPM2", load_denoiser=False). So you must pass "openbmb/VoxCPM-0.5B" explicitly to from_pretrained (shown below) or you will silently load the 2B VoxCPM2 checkpoint instead. This recipe pins 0.5B because it is the original, most-cited release for Chinese/English voice cloning and the lightest fit.

Requirements

Component	Minimum	Tested
GPU / memory	8 GB unified memory (model weights ~1.6 GB; low-single-digit-GB runtime peak)	Apple M2 Max (64 GB unified memory, ~48 GB GPU-addressable)
RAM	Same pool — unified	64 GB unified
Storage	~2 GB for the weights (~1.6 GB core + denoiser/ASR helpers if enabled)	~1.6 GB on disk
Software	Python >= 3.10 (<3.13), PyTorch >= 2.5.0, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is normally addressable unified memory, not raw capacity — but VoxCPM-0.5B barely touches it. The checkpoint on the canonical Hugging Face card is two weight files — pytorch_model.bin (~1.30 GB) plus the audio VAE audiovae.pth (~0.30 GB), about 1.6 GB on disk total (verified via the HF tree for openbmb/VoxCPM-0.5B). There is no model.safetensors; the weights ship as a .bin plus the .pth VAE. Against the ~48 GB the M2 Max's GPU can address by default, that leaves enormous headroom for activations, the optional denoiser/ASR helper models, and macOS itself — this recipe is not memory-bound on this hardware.

Installation

1. Install PyTorch with the MPS (Metal) backend

The standard PyTorch pip wheel for macOS already includes the MPS backend — there is no CUDA wheel, no cu12x build flag, and no FlashAttention to install on Apple Silicon. A current PyTorch (>= 2.5.0, as required by the OpenBMB/VoxCPM repo) satisfies VoxCPM-0.5B:

pip install torch

VoxCPM-0.5B runs through standard PyTorch attention (SDPA on MPS); it does not require FlashAttention-2, bitsandbytes, GPTQ, or AWQ — none of which have a working GPU path on macOS arm64 anyway. There is no attention-backend or quantization flag to reason about on this Mac.

2. Install the `voxcpm` package

The canonical install is the published PyPI package, per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:

pip install voxcpm

3. (Optional) Install from source for the Gradio web demo

git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .

4. (Optional) Pre-download model weights

Weights download on first inference automatically, but you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull the newer VoxCPM2:

from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
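To confirm that the download matches the ~1.6 GB quoted above, a small standard-library sketch can total the files in the snapshot folder. It relies on snapshot_download returning the local folder path; the dir_size_gb helper is a hypothetical name, not part of huggingface_hub.

```python
from pathlib import Path


def dir_size_gb(folder: str) -> float:
    """Sum the sizes of all regular files under `folder`, in decimal GB."""
    total_bytes = sum(
        p.stat().st_size for p in Path(folder).rglob("*") if p.is_file()
    )
    return total_bytes / 1e9


# Usage (assumes the weights were fetched as in the step above):
# from huggingface_hub import snapshot_download
# print(f"{dir_size_gb(snapshot_download('openbmb/VoxCPM-0.5B')):.2f} GB")
```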

If you plan to use the denoiser / ASR helpers for cleaner cloning (recommended), also pre-fetch the ZipEnhancer and SenseVoice-Small models that the HF card references:

from modelscope import snapshot_download
snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")
snapshot_download("iic/SenseVoiceSmall")

Running

Selecting the MPS (Metal) device

MPS support is a property of the voxcpm package, not of any one checkpoint — it was added by the OpenBMB team after the 0.5B card was first published. On the model card's HF discussion "Mac mps support?", the OpenBMB maintainer confirmed the update (discussion #11), and a community user reported a successful run on an Apple M3 Pro the same day. The package's device selector accepts auto, cpu, mps, cuda, and cuda:N; per the OpenBMB/VoxCPM README, "On Apple Silicon Macs, auto uses MPS when available."
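Before loading the model, it can help to check what PyTorch itself reports. The following is a minimal sketch that uses only torch.backends.mps; the pick_device helper is a hypothetical name and is not how the voxcpm package resolves auto internally.

```python
def pick_device() -> str:
    """Return "mps" when PyTorch reports a usable Metal backend, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe

    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_built() and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The returned string can be passed as the device argument in the snippets that follow.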

Python — basic synthesis

Load the legacy 0.5B checkpoint and synthesize a clip. Pass device="mps" (or device="auto", which resolves to MPS on Apple Silicon) and optimize=False — torch.compile acceleration (optimize=True) is a CUDA-oriented path and should be off on MPS (see Troubleshooting). The generate() parameters below mirror the VoxCPM-0.5B model card:

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B", device="mps", optimize=False)
wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model.",
    prompt_wav_path=None,
    prompt_text=None,
    cfg_value=2.0,
    inference_timesteps=10,
    normalize=True,
    denoise=True,
)
sf.write("output.wav", wav, 16000)

Note the 16000 sample rate: VoxCPM-0.5B emits 16 kHz audio. Raising inference_timesteps improves quality at the cost of speed; lowering it does the reverse. Passing the full "openbmb/VoxCPM-0.5B" repo path keeps you on the legacy 0.5B checkpoint; omitting it now resolves to VoxCPM2.

Python — zero-shot voice cloning

Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:

wav = model.generate(
    text="Hello — this is your cloned voice speaking.",
    prompt_wav_path="reference.wav",
    prompt_text="reference transcript matching the wav",
    cfg_value=2.0,
    inference_timesteps=10,
    denoise=True,
)
sf.write("cloned.wav", wav, 16000)
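Before cloning, it may be worth inspecting the reference clip. The sketch below reads the header of a PCM .wav file with Python's standard wave module; it is an illustrative pre-check, not something the voxcpm package requires, and the describe_wav helper is a hypothetical name. The wave module only handles uncompressed PCM files, so other encodings will raise an error.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic header fields from a PCM .wav file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": frames / rate,
        }


# Usage: print(describe_wav("reference.wav"))
```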

CLI

After installation the voxcpm entry point is available. The CLI uses a design subcommand; on Apple Silicon select MPS with --device mps (or --device auto) and pass --no-optimize — the optimize path is a CUDA torch.compile acceleration and should be off on MPS:

voxcpm design --text "Hello VoxCPM" --device mps --no-optimize --output out.wav

The design CLI loads the package's default checkpoint. To guarantee you stay on this recipe's 0.5B weights (upstream's default has since moved to openbmb/VoxCPM2), pin them through the Python API shown above — VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B", device="mps", optimize=False) — which takes the model id explicitly.
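One way to get command-line convenience while keeping the checkpoint pinned is a small wrapper script around the Python API. This is a sketch under stated assumptions: the flag names (--text, --output, --device, --model-id) and both function names are hypothetical and belong to this wrapper, not to the voxcpm CLI. The from_pretrained and generate calls mirror the snippets above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Flags for this wrapper only; they are not the voxcpm CLI's flags."""
    parser = argparse.ArgumentParser(description="Pinned VoxCPM-0.5B synthesis")
    parser.add_argument("--text", required=True, help="text to synthesize")
    parser.add_argument("--output", default="output.wav", help="output .wav path")
    parser.add_argument("--device", default="mps", help="mps, cpu, or auto")
    parser.add_argument("--model-id", default="openbmb/VoxCPM-0.5B")
    return parser


def synthesize(args: argparse.Namespace) -> None:
    # Imported lazily so that --help works without the heavy dependencies.
    import soundfile as sf
    from voxcpm import VoxCPM

    model = VoxCPM.from_pretrained(args.model_id, device=args.device, optimize=False)
    wav = model.generate(
        text=args.text,
        prompt_wav_path=None,
        prompt_text=None,
        cfg_value=2.0,
        inference_timesteps=10,
        normalize=True,
        denoise=True,
    )
    sf.write(args.output, wav, 16000)  # VoxCPM-0.5B emits 16 kHz audio


# To run as a script, call: synthesize(build_parser().parse_args())
```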

Gradio web demo

If you installed from source (step 3), launch the local UI on MPS:

python app.py --device auto

First run downloads ~1.6 GB of weights; subsequent runs load from the local cache.

Results

Speed: No first-party Apple M2 Max benchmark exists for this pair yet: /check/voxcpm/m2-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a throughput figure. The only published RTF (~0.17 for VoxCPM-0.5B) was measured on an RTX 4090, per the official Models & Versions table; that is a CUDA card, so the figure does not transfer to Apple Silicon. The lone community Apple datapoint (an M3 Pro run noted in discussion #11) comes from a different chip with different memory bandwidth than the M2 Max, so it is an existence proof that the model runs on MPS, not a number that carries over to the M2 Max. Track and contribute a measurement at /check/voxcpm/m2-max.
Memory usage: ~1.6 GB resident for the weights (1.30 GB pytorch_model.bin + 0.30 GB audiovae.pth), plus activations and the optional denoiser/ASR helper models — a runtime peak in the low single-digit GB. Against the M2 Max's ~48 GB default-addressable pool, memory is not the limiting factor on this hardware. See benchmark data.
Quality notes: 16 kHz output sample rate, two supported languages (Chinese, English), continuation-style voice cloning — all per the official Models & Versions table. Per the HF model card, the model comprehends text to infer and generate appropriate prosody. Apache-2.0 license — commercial use is permitted.
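If you want to measure speed yourself, the real-time factor (RTF) mentioned under Speed is synthesis time divided by the duration of the audio produced, so lower is faster. The sketch below is a minimal way to compute it; the helper names are hypothetical, and it assumes generate() returns a one-dimensional array of samples at 16 kHz, as the sf.write calls above imply.

```python
import time


def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int = 16000) -> float:
    """RTF = seconds spent synthesizing / seconds of audio produced."""
    audio_s = num_samples / sample_rate
    if audio_s <= 0:
        raise ValueError("no audio was produced")
    return elapsed_s / audio_s


def timed(fn, *args, **kwargs):
    """Call fn and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage (assumes `model` was loaded as in the Running section):
# wav, elapsed = timed(model.generate, text="Hello VoxCPM")
# print(f"RTF: {real_time_factor(elapsed, len(wav)):.2f}")
```

The first call usually includes one-time warm-up costs, so timing a second or third call gives a more representative figure.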

For the full benchmark data (and to be the first to populate it), see /check/voxcpm/m2-max.

Troubleshooting

`optimize=True` / torch.compile errors on MPS

The optimize=True flag enables torch.compile acceleration, which is primarily a CUDA path; torch.compile/Inductor support is limited on MPS. Pass optimize=False (as in the snippets above) on Apple Silicon. If you still hit a runtime error on MPS, fall back to CPU explicitly with device="cpu" and optimize=False: slower, but it sidesteps MPS-specific failures. (See the VoxCPM device guidance via discussion #11 and the OpenBMB/VoxCPM README.)
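The MPS-then-CPU fallback can be sketched as a small helper that tries each device in order. This is a hypothetical illustration: load_with_fallback is not part of voxcpm, and it only catches a RuntimeError raised while loading. An error that first appears inside generate() would need the same treatment around that call.

```python
def load_with_fallback(loader, devices=("mps", "cpu")):
    """Try loader(device) for each device in order; return (model, device)."""
    last_error = None
    for device in devices:
        try:
            return loader(device), device
        except RuntimeError as err:
            last_error = err  # remember the failure and try the next device
    raise RuntimeError(f"no device worked: {devices}") from last_error


# Usage (mirrors the from_pretrained call used throughout this recipe):
# from voxcpm import VoxCPM
# model, device = load_with_fallback(
#     lambda d: VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B", device=d, optimize=False)
# )
```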

Accidentally loaded VoxCPM2 instead of 0.5B

If you called VoxCPM.from_pretrained() without the explicit "openbmb/VoxCPM-0.5B" argument, or ran the voxcpm CLI without --hf-model-id, you got VoxCPM2 (the new repo default: 2B, 48 kHz, 30 languages per the Models & Versions table). VoxCPM2 still fits the M2 Max's large memory budget easily, but it is a different model with different output. If you specifically want the lightweight 16 kHz bilingual 0.5B this recipe documents, always pass "openbmb/VoxCPM-0.5B" verbatim.

A generic tutorial told me to install FlashAttention / bitsandbytes / a `cu12x` wheel

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes/GPTQ/AWQ kernel on macOS — PyTorch on the Mac uses the MPS backend with SDPA attention. If a generic VoxCPM or TTS tutorial tells you to pip install flash-attn, pass --load-in-4bit, or install a CUDA wheel, skip those steps entirely; the pip install torch + pip install voxcpm + device="mps" path above is the complete Apple path.

Strained or weird-sounding voice

Per the Hugging Face card, lower the cfg_value (e.g. from 2.0 toward 1.5) to relax adherence to the text; raise it for tighter prompt following. For long or expressive inputs the model may exhibit instability. The API's retry_badcase=True mode (documented on the model card) re-runs unstable generations automatically; alternatively, chunk the text and reduce inference_timesteps.
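Chunking long text can be as simple as grouping sentences under a length cap and synthesizing each chunk separately. The sketch below is a deliberately naive splitter: chunk_text and the 200-character default are hypothetical choices, not values from the model card. It splits on Latin and CJK sentence punctuation, so it will also break inside decimals such as "3.14", and it joins sentences with a space.

```python
import re

# One or more non-terminator characters followed by any run of terminators.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def chunk_text(text: str, max_chars: int = 200) -> list:
    """Group whole sentences into chunks of at most roughly max_chars."""
    sentences = [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)  # the current chunk is full, start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))
```

Each chunk can then go through model.generate() in turn. A single sentence longer than max_chars is kept whole rather than cut mid-sentence.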

Background noise in cloned voices

Set denoise=True (the default in the snippets above) or enable "Prompt Speech Enhancement" in the Gradio UI — this pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning. Documented on the HF model card.
