How much VRAM does Kokoro TTS need?

About 1 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps in this recipe.

Kokoro TTS on Apple M2 Max: 82M Text-to-Speech, 54 Voices, Native MLX-Audio

What You'll Build

A fully local text-to-speech pipeline built on hexgrad/Kokoro-82M, an 82-million-parameter, Apache-2.0 TTS model that emits 24 kHz audio in 54 voices across 9 languages (voices counted from the canonical VOICES.md), running natively on Apple Silicon through MLX-Audio. No NVIDIA GPU, no CUDA, no FlashAttention: Kokoro runs directly on the M2 Max's GPU through Metal via MLX. With a 327 MB bf16 footprint, the model is small enough to make this one of the easiest on-ramps to local speech synthesis on a Mac.

Hardware data: Apple M2 Max (64 GB unified memory) · MLX-Audio bf16 weights are a single 327 MB safetensors shard (HF tree) · See benchmark data

ℹ️ Runs on any Apple Silicon; tested on M2 Max. Kokoro is close to hardware-agnostic. At 82M parameters / 327 MB it fits with room to spare even on a 16 GB MacBook Air; the M2 Max's 64 GB of unified memory is simply the configuration this recipe was authored against. The MLX-Audio commands below apply unchanged to any Apple Silicon Mac (M1/M2/M3/M4, Pro/Max/Ultra).

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — not 64 GB of dedicated VRAM. By default macOS lets the GPU address roughly 75% of it (~48 GB via Metal's recommendedMaxWorkingSetSize). At 327 MB Kokoro sits so far below that ceiling that the addressable-share caveat is completely moot — there is no fit question here, no wired-limit tuning, on any Apple Silicon Mac the site covers.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable — vastly more than needed)	Apple M2 Max (64 GB unified memory, ~48 GB addressable)
RAM	Same pool — unified	64 GB unified
Storage	~0.5 GB (bf16 weights are a single 327 MB shard; misaki G2P data adds a little)	~0.5 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is normally addressable unified memory, not raw capacity — but for an 82M model at bf16 it does not bind at all. The weights are a single kokoro-v1_0.safetensors shard of 327,115,152 bytes (~0.33 GB) per the HF tree API for mlx-community/Kokoro-82M-bf16. Against the ~48 GB the M2 Max's GPU can address by default, that leaves the entire memory budget effectively untouched. Even a 16 GB MacBook Air (~10.5 GB GPU-addressable) clears these weights more than 30× over.
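The headroom figures above are plain division, and a short Python sketch makes them easy to re-check for another machine. The function name headroom is hypothetical, and the addressable-memory values are the approximate figures quoted in this recipe, not measurements.

```python
# Size of kokoro-v1_0.safetensors as quoted in this recipe.
WEIGHTS_BYTES = 327_115_152


def headroom(addressable_gb: float, weights_bytes: int = WEIGHTS_BYTES) -> float:
    """Return how many times the weights fit into the GPU-addressable budget."""
    weights_gb = weights_bytes / 1e9  # decimal GB, matching the ~0.33 GB figure
    return addressable_gb / weights_gb


# Approximate default GPU-addressable budgets quoted in the Requirements table.
for label, addressable_gb in [("16 GB MacBook Air", 10.5), ("64 GB M2 Max", 48.0)]:
    print(f"{label}: about {headroom(addressable_gb):.0f}x headroom")
```

This counts weights only; runtime overhead is small for a model of this size but is not included in the ratio.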

Installation

1. Install MLX-Audio (the Apple-native TTS path)

pip install mlx-audio

MLX is Apple's array framework; mlx-audio is its speech front-end, and it lists Kokoro as a supported TTS model (Blaizzy/mlx-audio). There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel index, no FlashAttention.

2. Install misaki (Kokoro's grapheme-to-phoneme front-end)

Kokoro routes text through the misaki G2P library before synthesis. The MLX-Audio README is explicit that the Kokoro path requires it:

pip install misaki

For non-English languages, misaki exposes optional language extras — install only the ones you need:

pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Mandarin Chinese

misaki's English path also shells out to espeak-ng as a phonemiser fallback; if you hit a phonemisation error on first run, install it system-wide with Homebrew: brew install espeak-ng.

Running

Synthesize speech with the mlx_audio.tts.generate command — the exact invocation from the MLX-Audio README:

mlx_audio.tts.generate \
    --model mlx-community/Kokoro-82M-bf16 \
    --text "Welcome to MLX-Audio!" \
    --voice "af_heart" \
    --lang_code "a"

On first run, MLX-Audio pulls the bf16 weights (~327 MB, single shard) from the mlx-community/Kokoro-82M-bf16 Hugging Face repo and caches them under ~/.cache/huggingface. These mirror weights are an ungated bf16 conversion of hexgrad's original Apache-2.0 release, so no license-acceptance step is needed to download them. The command writes a WAV file to the current directory and prints the output path.
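If you want to drive the command above from a script, one option is to assemble the same arguments in Python and hand them to subprocess. This is a minimal sketch: build_command and synthesize are hypothetical helper names, and only the flags shown in this recipe are used.

```python
import shlex
import shutil
import subprocess


def build_command(text: str, voice: str = "af_heart", lang_code: str = "a",
                  model: str = "mlx-community/Kokoro-82M-bf16") -> list[str]:
    """Assemble the argument list for the mlx_audio.tts.generate command."""
    return [
        "mlx_audio.tts.generate",
        "--model", model,
        "--text", text,
        "--voice", voice,
        "--lang_code", lang_code,
    ]


def synthesize(text: str, **options: str) -> int:
    """Run the CLI and return its exit code; fail early if it is not installed."""
    command = build_command(text, **options)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(
            "mlx_audio.tts.generate is not on PATH; run `pip install mlx-audio` first"
        )
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Print the shell-quoted command instead of running it.
    print(shlex.join(build_command("Welcome to MLX-Audio!")))
```

Passing the arguments as a list avoids shell-quoting problems when the text contains quotes or punctuation.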

Selecting a voice and language

--voice picks one of the 54 voices. af_heart is the American-female default used above; the VOICES.md catalogue lists all 54. The two-letter prefix encodes language + gender (e.g. af_ = American female, bm_ = British male).
--lang_code is a single letter that selects the language family. From the canonical VOICES.md:

Code	Language
`a`	American English
`b`	British English
`e`	Spanish
`f`	French
`h`	Hindi
`i`	Italian
`j`	Japanese (requires `misaki[ja]`)
`p`	Brazilian Portuguese
`z`	Mandarin Chinese (requires `misaki[zh]`)

Match the --voice prefix to the --lang_code — e.g. --voice "bm_george" --lang_code "b" for British English.
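The matching rule can be checked mechanically before calling the CLI. The sketch below assumes that the first letter of the voice prefix equals the --lang_code and the second letter is f or m; this is inferred from the examples in this recipe (af_heart with "a", bm_george with "b") and is not a documented API. The helper name lang_code_for is hypothetical.

```python
# Language codes from the table above.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def lang_code_for(voice: str) -> str:
    """Derive the --lang_code from a voice name such as 'af_heart'.

    Assumption: prefix = <language letter><gender letter>, then '_' and a name.
    """
    prefix, separator, name = voice.partition("_")
    if not separator or not name or len(prefix) != 2:
        raise ValueError(f"unexpected voice name: {voice!r}")
    lang, gender = prefix
    if lang not in LANG_CODES:
        raise ValueError(f"unknown language letter {lang!r} in {voice!r}")
    if gender not in ("f", "m"):
        raise ValueError(f"unknown gender letter {gender!r} in {voice!r}")
    return lang


for voice in ("af_heart", "bm_george", "jf_alpha"):
    code = lang_code_for(voice)
    print(f'--voice "{voice}" --lang_code "{code}"  ({LANG_CODES[code]})')
```

The check only validates the prefix shape; whether a given voice name actually exists still has to be confirmed against VOICES.md.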

Smaller quantized weights (optional)

bf16 is already tiny, but MLX-Audio also lists pre-quantized Kokoro builds if you want an even smaller download: mlx-community/Kokoro-82M-8bit and mlx-community/Kokoro-82M-4bit (Blaizzy/mlx-audio). On a 64 GB M2 Max there is no memory reason to prefer them — bf16 gives the highest fidelity at a footprint that is already negligible — but they exist for constrained setups.

Results

Speed: No first-party Apple M2 Max benchmark for this pair has been recorded yet — /check/kokoro-tts/m2-max currently returns verdict: unknown with no measurements. We deliberately do not quote a real-time-factor (RTF) here: TTS throughput is hardware-specific and an RTF measured on an NVIDIA or AMD card does not transfer to Apple Silicon's Metal/MLX path. If you run this, please contribute your numbers so we can seed a real M2 Max datapoint.
Memory usage: ~0.33 GB resident for the bf16 weights plus a small runtime overhead — negligible against the ~48 GB the M2 Max's GPU can address by default. Memory is in no sense the limiting factor on this hardware.
Quality notes: 24 kHz output, 54 voices across 9 languages, Apache-2.0 license, built on the StyleTTS2 architecture (base model yl4579/StyleTTS2-LJSpeech, per the mlx-community/Kokoro-82M-bf16 model metadata). For the highest fidelity stay on bf16; the 8-bit and 4-bit builds trade a little quality for a smaller download that this hardware doesn't need.

For the full benchmark data (and to be the first to populate it), see /check/kokoro-tts/m2-max.
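If you do measure your own numbers, the real-time factor is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows the arithmetic using Python's standard wave module. It assumes the output is a PCM WAV file, and it uses generated silence plus a hypothetical 0.5 s synthesis time purely as stand-ins; neither is a benchmark result.

```python
import os
import tempfile
import wave


def audio_seconds(wav_path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesis_seconds: float, audio_duration_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_duration_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_duration_seconds


def write_silence(path: str, seconds: float, sample_rate: int = 24_000) -> None:
    """Write mono 16-bit PCM silence; a stand-in for real Kokoro output."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))


# Demo with placeholder values only: 2 s of silence, hypothetical 0.5 s synthesis.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path, seconds=2.0)
duration = audio_seconds(demo_path)
print(f"audio: {duration:.2f} s, RTF: {real_time_factor(0.5, duration):.2f}")
```

For a real measurement, wrap the synthesis call in time.perf_counter() and pass the elapsed time in place of the placeholder. Timing the whole CLI run also counts model loading (and the download on first run), so report a warm run separately.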

Troubleshooting

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX runs Kokoro directly on the M2 Max's GPU through Metal. If a generic Kokoro tutorial tells you to pip install flash-attn, pass --load-in-4bit, or install a cu124/cu128 PyTorch wheel, skip those steps entirely; the pip install mlx-audio + mlx_audio.tts.generate commands above are the complete Apple path.

`espeak-ng not found` or a phonemisation error on first synthesis

Kokoro's misaki G2P front-end can shell out to espeak-ng for phonemisation. The pip install does not bundle the binary, so install it with brew install espeak-ng, then re-run. Before filing an issue, confirm it works on its own: espeak-ng -q --ipa -v en "hello" should print IPA phonemes (without -q and --ipa, espeak-ng speaks the text aloud instead of printing it).
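A quick preflight check can tell you whether the binary is visible before you synthesize anything. This sketch only looks the executables up on PATH with shutil.which; the function name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools=("espeak-ng", "mlx_audio.tts.generate")) -> list[str]:
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("espeak-ng and mlx_audio.tts.generate are both on PATH")
```

An empty result means both commands resolve in the current shell environment; it does not prove that espeak-ng itself runs correctly.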

Non-English voice errors

Japanese (--lang_code "j") and Mandarin (--lang_code "z") require the optional misaki language packs — pip install "misaki[ja]" or pip install "misaki[zh]". Without them you'll see a missing-dependency traceback at synthesis time. Match the --voice prefix to the language (e.g. jf_alpha with --lang_code "j").

Do I need to raise the unified-memory wired limit?

No. Raising the limit with sudo sysctl iogpu.wired_limit_mb matters only when a model's weights plus working set exceed the default-addressable share (roughly 75%, or ~48 GB on a 64 GB Mac), which is a 70B-LLM-class problem. Kokoro at 327 MB sits far below the default ceiling on every Apple Silicon Mac the site covers, so leave the limit alone.

No other widely-reported issues. Report problems via the submission form.