How much VRAM does Kokoro TTS need?

About 2 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Kokoro TTS on RTX 5060 Ti: 82M-Parameter Text-to-Speech, 54 Voices, Under 3 GB VRAM

What You'll Build

A local text-to-speech pipeline using hexgrad/Kokoro-82M — an 82-million-parameter Apache-2.0 TTS model that emits 24 kHz audio across 9 languages and 54 voices (counted from the canonical VOICES.md). The model is small enough that the 16 GB RTX 5060 Ti is wildly over-provisioned for it; you'll have most of your VRAM free for other workloads on the same card.

Hardware data: RTX 5060 Ti (16 GB VRAM) · weights fit under 1 GB at FP16; total inference footprint typically 2–3 GB · See benchmark data

Sizing note: Kokoro is borderline hardware-agnostic — it runs comfortably on a single RTX 3060 or even on CPU. We pair it with the 5060 Ti here because that's a popular Blackwell card to deploy it on, but the steps below apply unchanged to any modern NVIDIA GPU with ≥ 2 GB VRAM.

Requirements

Component	Minimum	Tested
GPU	2 GB VRAM (per Clore.ai guide)	RTX 5060 Ti (16 GB)
RAM	8 GB	—
Storage	~1 GB (weights ~200 MB; rest is the Python wheel + misaki G2P data)	—
Software	Python 3.9+ (per Clore.ai), PyTorch with CUDA, `espeak-ng`	—

Installation

1. Install espeak-ng at the OS level

The misaki G2P (grapheme-to-phoneme) library underneath Kokoro shells out to espeak-ng. Install it the system way for your OS — verbatim from the official GitHub README:

# Linux (Debian / Ubuntu)
sudo apt-get install -y espeak-ng

# macOS (Apple Silicon also needs the MPS fallback env var)
brew install espeak-ng
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Windows — download and run the .msi from
# https://github.com/espeak-ng/espeak-ng/releases
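
Before moving on, it can help to confirm that the binary is discoverable on `PATH`. The following is a minimal sketch using only the Python standard library; the helper names `find_espeak` and `espeak_version` are illustrative and not part of Kokoro or misaki.

```python
import shutil
import subprocess


def find_espeak(binary="espeak-ng"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary)


def espeak_version(binary="espeak-ng"):
    """Return the first line printed by `<binary> --version`, or None if it is missing."""
    path = find_espeak(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else ""


# Report what was found; None means the OS-level install step still needs doing.
print(find_espeak() or "espeak-ng not found on PATH")
```

If this prints a path, the OS-level install succeeded and the Python package should be able to locate the same binary.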

2. Install the Python package

The kokoro PyPI package pulls in misaki, the model loader, and the inference loop. From the Hugging Face model card:

pip install "kokoro>=0.9.4" soundfile

Optional non-English language packs (only install what you need):

pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Mandarin Chinese
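
To confirm what actually got installed, a small standard-library check along these lines may be useful. It is a sketch only: `installed_version`, `version_tuple`, and `meets_minimum` are hypothetical helper names, and the version comparison is deliberately naive (it ignores pre-release suffixes).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.9.4' into (0, 9, 4); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version, minimum="0.9.4"):
    """True if `version` is at least `minimum` (numeric components only)."""
    return version_tuple(version) >= version_tuple(minimum)


# Print the state of the packages this recipe relies on.
for name in ("kokoro", "misaki", "soundfile"):
    print(name, installed_version(name) or "not installed")
```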

3. (Optional) Pre-download the weights

The first call to KPipeline downloads ~200 MB of weights to your Hugging Face cache. Pre-fetch them if you want to control where they land:

from huggingface_hub import snapshot_download
# Fetches the full repo snapshot into the Hugging Face cache (set HF_HOME to relocate it).
snapshot_download("hexgrad/Kokoro-82M")
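
If you would rather not pull the entire repository, `snapshot_download` accepts an `allow_patterns` filter. The sketch below assumes a repo layout of `config.json`, `kokoro-v1_0.pth`, and `voices/<voice_id>.pt`; those file names are an assumption rather than something this guide confirms, so check the repo's file list before relying on them. `patterns_for` and `prefetch` are hypothetical helper names.

```python
def patterns_for(voices):
    """Build allow-patterns for the model files plus the requested voice packs.

    ASSUMPTION: config.json, kokoro-v1_0.pth and voices/<id>.pt are the
    relevant files in the repo; verify against the repo's file list.
    """
    return ["config.json", "kokoro-v1_0.pth"] + [f"voices/{v}.pt" for v in voices]


def prefetch(voices, cache_dir=None, repo_id="hexgrad/Kokoro-82M"):
    """Download only the selected files and return the local snapshot path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id, cache_dir=cache_dir, allow_patterns=patterns_for(voices)
    )
```

Calling `prefetch(["af_heart"])` would fetch the model plus a single voice. Note that a custom `cache_dir` only helps if the pipeline later looks in the same place; setting the `HF_HOME` environment variable before starting Python is usually the simpler way to move the whole cache.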

Running

The minimal Python example, adapted from the Hugging Face model card:

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "Kokoro is an open-weight 82-million-parameter text-to-speech model."
generator = pipeline(text, voice='af_heart')

for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'{i}.wav', audio, 24000)

Note the 24000 sample rate — Kokoro emits 24 kHz audio. gs and ps are the graphemes and phonemes for the chunk, useful for debugging pronunciation.
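
A quick way to sanity-check the output is to read the header back and confirm the sample rate and duration. This sketch uses the standard-library `wave` module and assumes the file is ordinary PCM WAV (which, as far as we know, is what `soundfile` writes for `.wav` by default); `wav_info` is an illustrative helper name.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate
```

For example, `rate, seconds = wav_info("0.wav")` should report a rate of 24000 for files written by the loop above.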

Language codes

From the official GitHub README, lang_code is a single letter:

Code	Language
`a`	American English
`b`	British English
`e`	Spanish
`f`	French
`h`	Hindi
`i`	Italian
`j`	Japanese (requires `misaki[ja]`)
`p`	Brazilian Portuguese
`z`	Mandarin Chinese (requires `misaki[zh]`)

The voice ID (af_heart above) selects one of 54 voices — see the VOICES.md catalogue for the full list. The prefix encodes language + gender (e.g. af_ = American female).
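
The prefix convention can be sketched as a small parser, which is handy for catching a voice that does not match the pipeline's `lang_code`. The language letters come from the table above; treating the second letter as `f` = female and `m` = male is an assumption extrapolated from the `af_` example, and `parse_voice_id` is a hypothetical helper, not part of the kokoro package.

```python
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}
GENDERS = {"f": "female", "m": "male"}  # ASSUMPTION: inferred from af_ = American female


def parse_voice_id(voice):
    """Split a voice ID such as 'af_heart' into language, gender, and name."""
    prefix, sep, name = voice.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"expected '<lang><gender>_<name>', got {voice!r}")
    lang_code, gender_code = prefix[0], prefix[1]
    if lang_code not in LANGUAGES:
        raise ValueError(f"unknown language letter {lang_code!r} in {voice!r}")
    if gender_code not in GENDERS:
        raise ValueError(f"unknown gender letter {gender_code!r} in {voice!r}")
    return {
        "lang_code": lang_code,
        "language": LANGUAGES[lang_code],
        "gender": GENDERS[gender_code],
        "name": name,
    }
```

Passing the returned `lang_code` to `KPipeline` keeps the pipeline language and the voice in agreement.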

Server-style deployment (optional)

If you'd rather expose an OpenAI-compatible HTTP API instead of writing Python, the community-maintained Kokoro-FastAPI wrapper is the most-cited path, per ThinkSmart.Life's local-rig writeup:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
docker compose -f docker-compose.gpu.yml up -d
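
Once the container is up, a client call might look like the sketch below. The request shape follows the OpenAI speech endpoint (`/v1/audio/speech` with `model`, `input`, `voice`, `response_format`); the port `8880` and the model name `kokoro` are assumptions about the wrapper's defaults, so check its README or your compose file. `build_speech_request` and `synthesize` are illustrative names.

```python
import json
import urllib.request


def build_speech_request(text, voice="af_heart", base_url="http://localhost:8880"):
    """Build (but do not send) a POST request for the speech endpoint."""
    payload = {
        "model": "kokoro",  # ASSUMPTION: model name expected by the wrapper
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio_bytes)
    return out_path
```

Separating request construction from sending makes it easy to inspect the payload before pointing it at a live server.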

Results

Speed: No first-party speed numbers exist for the RTX 5060 Ti yet. For reference, an independent PyTorch/ONNX benchmark gist reports 96× real-time on an AWS g5.xlarge instance (NVIDIA A10G, 24 GB Ampere) with PyTorch CUDA, using the af_bella voice on ~3,000 words of English input. The Spheron deployment guide summarises community RTX 4090 measurements at "RTF ~0.04–0.06" (i.e. 1 s of audio in 40–60 ms). The 5060 Ti is in the same consumer-GPU class — expect comparable, faster-than-realtime performance, but track /check/kokoro-tts/rtx-5060-ti for empirical 5060 Ti numbers as they land.
VRAM usage: Weights are under 1 GB at FP16; total GPU memory during inference (including CUDA kernels and buffers) is 2–3 GB, per the Spheron deployment guide. The Clore.ai guide lists 2 GB as the minimum and 4 GB as the recommended VRAM, with an RTX 3060 as their recommended card.
Quality notes: 24 kHz output, 54 voices, 9 languages, Apache-2.0 license. Input is hard-capped at 510 tokens per generation call (per the gist benchmark) — long text gets chunked automatically by the pipeline iterator.
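
The two speed figures above use different conventions: "N× real-time" is a speed-up factor, while RTF (real-time factor) is synthesis time divided by audio duration, so one is the reciprocal of the other. A minimal sketch of the conversion, with illustrative function names:

```python
def rtf_from_speedup(speedup):
    """Convert an 'N x real-time' speed-up factor into a real-time factor."""
    return 1.0 / speedup


def synthesis_seconds(audio_seconds, rtf):
    """Estimate wall-clock synthesis time for a given amount of output audio."""
    return audio_seconds * rtf


# Put the A10G and RTX 4090 figures on the same scale.
print(f"96x real-time corresponds to RTF {rtf_from_speedup(96):.4f}")
for rtf in (0.04, 0.06):
    ms = synthesis_seconds(1.0, rtf) * 1000
    print(f"RTF {rtf}: about {ms:.0f} ms per second of audio")
```

These are arithmetic conversions of the cited numbers only, not measurements of the RTX 5060 Ti.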

For the full benchmark data, see /check/kokoro-tts/rtx-5060-ti.

Troubleshooting

`RuntimeError: espeak-ng not found` on first synthesis call

The Python kokoro package wraps misaki, which in turn calls espeak-ng for phonemisation. The PyPI install does not bundle the binary, so you must install it through your OS package manager (apt-get install espeak-ng / brew install espeak-ng / Windows .msi) as covered in step 1 (per the official GitHub README).

Non-English voice errors

Japanese (lang_code='j') and Mandarin (lang_code='z') require the optional misaki language packs — pip install "misaki[ja]" or pip install "misaki[zh]". Without them you'll see a missing-dependency traceback at pipeline init. Source: Hugging Face model card.

Silent / empty audio output on Windows

Some non-English voices have been reported to return silence on Windows in the upstream GitHub issue tracker (e.g. Spanish em_alex on Windows 11). If you hit this, verify that your espeak-ng install can phonemise the language on its own (espeak-ng -v es -q --ipa "hola" should print IPA phonemes; without -q and --ipa the command plays audio instead of printing anything), then file a fresh issue on the repo if the wrapper still fails.
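
The same standalone check can be scripted, which is convenient when testing several languages. This sketch calls the espeak-ng CLI with `-q` (no audio) and `--ipa` (print IPA phonemes); `espeak_ipa` is a hypothetical helper name, and an empty result suggests the language data is missing or the call failed.

```python
import shutil
import subprocess


def espeak_ipa(text, voice="es", binary="espeak-ng"):
    """Return espeak-ng's IPA output for `text`, or None if the binary is missing."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-v", voice, "-q", "--ipa", text],
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return result.stdout.strip()


# Spanish is the language from the report above; swap in any espeak-ng voice name.
print(espeak_ipa("hola", voice="es"))
```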

Pip install fails with `misaki[en]>=X.Y.Z has no matching distribution`

Reported in the upstream issue tracker — pin kokoro to a version whose declared misaki dependency is actually published on PyPI (the latest tagged release on the GitHub repo is the safe bet). Avoid installing from main if the bumped misaki requirement hasn't shipped yet.