What You'll Build
A fully-local text-to-speech pipeline using OpenBMB's VoxCPM-0.5B on an Apple M2 Max — running on Apple's MPS (Metal) backend through stock PyTorch, with no NVIDIA GPU, no CUDA, and no FlashAttention. VoxCPM-0.5B is a 0.5B-parameter, tokenizer-free TTS model built on the MiniCPM-4 backbone that does zero-shot voice cloning from a short reference clip. You'll generate 16 kHz audio in Chinese or English, either from text alone or by cloning a voice you supply as a few-second .wav. The weights are only ~1.6 GB on disk, so on the M2 Max's 64 GB unified memory this is one of the most comfortable models the site covers — it fits with room to spare and needs no memory tuning.
Hardware data: Apple M2 Max (64 GB unified memory) · VoxCPM-0.5B weights ~1.6 GB on disk · See benchmark data
ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — not 64 GB of dedicated VRAM. By default macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's
recommendedMaxWorkingSetSize). VoxCPM-0.5B's ~1.6 GB of weights (and a runtime peak in the low single-digit GB even with the optional denoiser/ASR helpers loaded) sit so far below that ceiling that the addressable-share caveat is moot here — it runs comfortably on any Apple Silicon Mac, down to an 8 GB MacBook, with no wired-limit tuning.
ℹ️ This recipe pins VoxCPM-0.5B (legacy), not VoxCPM1.5 or VoxCPM2. The OpenBMB repo ships three versions per the official Models & Versions table: VoxCPM2 (2B backbone, 48 kHz, 30 languages — the latest, now the repo's default), VoxCPM1.5 (0.6B, 44.1 kHz), and VoxCPM-0.5B (0.5B, 16 kHz, Chinese/English — the original release). The upstream Quick Start and CLI have moved their default to
openbmb/VoxCPM2— the GitHub README now headlines VoxCPM2 and its sample code callsfrom_pretrained("openbmb/VoxCPM2", load_denoiser=False). So you must pass"openbmb/VoxCPM-0.5B"explicitly tofrom_pretrained(shown below) or you will silently load the 2B VoxCPM2 checkpoint instead. This recipe pins 0.5B because it is the original, most-cited release for Chinese/English voice cloning and the lightest fit.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU / memory | 8 GB unified memory (model weights ~1.6 GB; low-single-digit-GB runtime peak) | Apple M2 Max (64 GB unified memory, ~48 GB GPU-addressable) |
| RAM | Same pool — unified | 64 GB unified |
| Storage | ~2 GB for the weights (~1.6 GB core + denoiser/ASR helpers if enabled) | ~1.6 GB on disk |
| Software | Python >= 3.10 (<3.13), PyTorch >= 2.5.0, macOS Sonoma 14 / Sequoia 15+ | macOS Sequoia 15 |
The binding constraint on Apple Silicon is normally addressable unified memory, not raw capacity — but VoxCPM-0.5B barely touches it. The checkpoint on the canonical Hugging Face card is two weight files — pytorch_model.bin (~1.30 GB) plus the audio VAE audiovae.pth (~0.30 GB), about 1.6 GB on disk total (verified via the HF tree for openbmb/VoxCPM-0.5B). There is no model.safetensors; the weights ship as a .bin plus the .pth VAE. Against the ~48 GB the M2 Max's GPU can address by default, that leaves enormous headroom for activations, the optional denoiser/ASR helper models, and macOS itself — this recipe is not memory-bound on this hardware.
Installation
1. Install PyTorch with the MPS (Metal) backend
The standard PyTorch pip wheel for macOS already includes the MPS backend — there is no CUDA wheel, no cu12x build flag, and no FlashAttention to install on Apple Silicon. A current PyTorch (>= 2.5.0, as required by the OpenBMB/VoxCPM repo) satisfies VoxCPM-0.5B:
pip install torch
VoxCPM-0.5B runs through standard PyTorch attention (SDPA on MPS); it does not require FlashAttention-2, bitsandbytes, GPTQ, or AWQ — none of which have a working GPU path on macOS arm64 anyway. There is no attention-backend or quantization flag to reason about on this Mac.
2. Install the voxcpm package
The canonical install is the published PyPI package, per both the Hugging Face model card and the official OpenBMB/VoxCPM GitHub README:
pip install voxcpm
3. (Optional) Install from source for the Gradio web demo
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
pip install -e .
4. (Optional) Pre-download model weights
Weights download on first inference automatically, but you can pre-fetch them — pass the full openbmb/VoxCPM-0.5B repo id so you don't pull the newer VoxCPM2:
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
If you plan to use the denoiser / ASR helpers for cleaner cloning (recommended), also pre-fetch the ZipEnhancer and SenseVoice-Small models that the HF card references:
from modelscope import snapshot_download
snapshot_download("iic/speech_zipenhancer_ans_multiloss_16k_base")
snapshot_download("iic/SenseVoiceSmall")
Running
Selecting the MPS (Metal) device
MPS support is a property of the voxcpm package, not of any one checkpoint — it was added by the OpenBMB team after the 0.5B card was first published. On the model card's HF discussion "Mac mps support?", the OpenBMB maintainer confirmed the update (discussion #11), and a community user reported a successful run on an Apple M3 Pro the same day. The package's device selector accepts auto, cpu, mps, cuda, and cuda:N; per the OpenBMB/VoxCPM README, "On Apple Silicon Macs, auto uses MPS when available."
Python — basic synthesis
Load the legacy 0.5B checkpoint and synthesize a clip. Pass device="mps" (or device="auto", which resolves to MPS on Apple Silicon) and optimize=False — torch.compile acceleration (optimize=True) is a CUDA-oriented path and should be off on MPS (see Troubleshooting). The generate() parameters below mirror the VoxCPM-0.5B model card:
import soundfile as sf
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B", device="mps", optimize=False)
wav = model.generate(
text="VoxCPM is an innovative end-to-end TTS model.",
prompt_wav_path=None,
prompt_text=None,
cfg_value=2.0,
inference_timesteps=10,
normalize=True,
denoise=True,
)
sf.write("output.wav", wav, 16000)
Note the 16000 sample rate — VoxCPM-0.5B emits 16 kHz audio. Higher inference_timesteps trades speed for quality; lower trades the opposite. Passing the full "openbmb/VoxCPM-0.5B" repo path keeps you on the legacy 0.5B checkpoint; omitting it now resolves to VoxCPM2.
Python — zero-shot voice cloning
Supply a short reference clip plus its transcript and the model will mimic the speaker's timbre, accent, and pacing:
wav = model.generate(
text="Hello — this is your cloned voice speaking.",
prompt_wav_path="reference.wav",
prompt_text="reference transcript matching the wav",
cfg_value=2.0,
inference_timesteps=10,
denoise=True,
)
sf.write("cloned.wav", wav, 16000)
CLI
After installation the voxcpm entry point is available. The CLI uses a design subcommand; on Apple Silicon select MPS with --device mps (or --device auto) and pass --no-optimize — the optimize path is a CUDA torch.compile acceleration and should be off on MPS:
voxcpm design --text "Hello VoxCPM" --device mps --no-optimize --output out.wav
The design CLI loads the package's default checkpoint. To guarantee you stay on this recipe's 0.5B weights (upstream's default has since moved to openbmb/VoxCPM2), pin them through the Python API shown above — VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B", device="mps", optimize=False) — which takes the model id explicitly.
Gradio web demo
If you installed from source (step 3), launch the local UI on MPS:
python app.py --device auto
First run downloads ~1.6 GB of weights; subsequent runs load from the local cache.
Results
- Speed: No first-party Apple M2 Max benchmark for this pair exists yet — /check/voxcpm/m2-max currently returns
verdict: unknownwith no measurements, and contribute one here. We are deliberately not quoting a throughput figure. The only published RTF (~0.17 for VoxCPM-0.5B) is on an RTX 4090 per the official Models & Versions table — a CUDA card, so it does not transfer to Apple Silicon. The lone community Apple datapoint (an M3 Pro run noted in discussion #11) is a different chip on a different memory bandwidth than the M2 Max, so it is an existence proof of "it runs on MPS," not a forwardable M2 Max number. Track and contribute a measurement at /check/voxcpm/m2-max. - Memory usage: ~1.6 GB resident for the weights (1.30 GB
pytorch_model.bin+ 0.30 GBaudiovae.pth), plus activations and the optional denoiser/ASR helper models — a runtime peak in the low single-digit GB. Against the M2 Max's ~48 GB default-addressable pool, memory is not the limiting factor on this hardware. See benchmark data. - Quality notes: 16 kHz output sample rate, two supported languages (Chinese, English), continuation-style voice cloning — all per the official Models & Versions table. Per the HF model card, the model comprehends text to infer and generate appropriate prosody. Apache-2.0 license — commercial use is permitted.
For the full benchmark data (and to be the first to populate it), see /check/voxcpm/m2-max.
Troubleshooting
optimize=True / torch.compile errors on MPS
The optimize=True flag enables torch.compile acceleration, which is primarily a CUDA path; torch.compile/Inductor is limited on MPS. Pass optimize=False (as in the snippets above) on Apple Silicon. If you still hit a runtime error on MPS, fall back to CPU explicitly with device="cpu" and optimize=False — slower, but it always works. (See the VoxCPM device guidance via discussion #11 and the OpenBMB/VoxCPM README.)
Accidentally loaded VoxCPM2 instead of 0.5B
If you called VoxCPM.from_pretrained() without the explicit "openbmb/VoxCPM-0.5B" argument, or ran the voxcpm CLI without --hf-model-id, you got VoxCPM2 (the new repo default — 2B, 48 kHz, 30 languages per the Models & Versions table). VoxCPM2 still fits the M2 Max's large memory budget easily, but it is a different model with different output (48 kHz, 30 languages) — if you specifically want the lightweight 16 kHz bilingual 0.5B this recipe documents, always pass "openbmb/VoxCPM-0.5B" verbatim.
A generic tutorial told me to install FlashAttention / bitsandbytes / a cu12x wheel
None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes/GPTQ/AWQ kernel on macOS — PyTorch on the Mac uses the MPS backend with SDPA attention. If a generic VoxCPM or TTS tutorial tells you to pip install flash-attn, pass --load-in-4bit, or install a CUDA wheel, skip those steps entirely; the pip install torch + pip install voxcpm + device="mps" path above is the complete Apple path.
Strained or weird-sounding voice
Per the Hugging Face card, lower the cfg_value (e.g. from 2.0 toward 1.5) to relax adherence to the text; raise it for tighter prompt following. For long or expressive inputs the model may exhibit instability — the API's retry_badcase=True mode (documented on the model card) re-runs unstable generations automatically, or chunk the text and reduce inference_timesteps.
Background noise in cloned voices
Set denoise=True (the default in the snippets above) or enable "Prompt Speech Enhancement" in the Gradio UI — this pipes the reference clip through iic/speech_zipenhancer_ans_multiloss_16k_base before cloning. Documented on the HF model card.
Want to add your own benchmark? Contribute here.