Kokoro TTS on RTX 5090: 82M-Parameter Text-to-Speech, 54 Voices, 30 GB Free for a Multi-Model Server

What You'll Build

A local text-to-speech pipeline using hexgrad/Kokoro-82M, an 82-million-parameter Apache-2.0 TTS model that emits 24 kHz audio across 9 languages and 54 voices. Paired with the 32 GB RTX 5090, Kokoro leaves the card more over-provisioned than any other model in the catalogue: it needs only 2–3 GB total during inference, leaving roughly 30 GB free for one or more additional resident models on the same card. The question for this recipe is not "does it fit" (the sub-1 GB weights fit 32× over, and the full inference footprint more than 10× over) but how to spend the headroom: colocating Kokoro with an 8B-class LLM and a 20B image-gen model on a single-card multi-model production server.

Hardware data: RTX 5090 (32 GB VRAM) · weights fit under 1 GB at FP16; total inference footprint typically 2–3 GB · See benchmark data

Sizing note: Kokoro is close to hardware-agnostic: it runs comfortably on a single RTX 3060 or even on CPU. On its own, this 82M model leaves the 5090 wildly over-provisioned (under 1 GB of weights on a 32 GB card is 32× headroom). The steps below apply unchanged to any modern NVIDIA GPU with >= 2 GB VRAM; the part that is specific to the 5090 is the Multi-model server section near the end. No Blackwell-specific tweaks are required: Kokoro does not depend on FlashAttention-2, so the FA2 sm_120 kernel gap that bites larger Blackwell workloads is a non-issue here.

Requirements

Component	Minimum	Tested
GPU	2 GB VRAM (per Clore.ai guide)	RTX 5090 (32 GB)
RAM	8 GB	—
Storage	~1 GB (weights ~200 MB; rest is the Python wheel + misaki G2P data)	—
Software	Python 3.9+ (per Clore.ai), PyTorch with CUDA (cu128 build for Blackwell), `espeak-ng`	—

Installation

1. Install espeak-ng at the OS level

The misaki G2P (grapheme-to-phoneme) library underneath Kokoro shells out to espeak-ng, so install it with your OS package manager first. The Linux apt-get line is taken from the official GitHub README; the macOS Homebrew and Windows installer lines are the canonical equivalents:

# Linux (Debian / Ubuntu)
sudo apt-get install -y espeak-ng

# macOS
brew install espeak-ng

# Windows — download and run the .msi from
# https://github.com/espeak-ng/espeak-ng/releases
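
Before moving on, it can be worth confirming that the binary is actually reachable from Python. The sketch below only checks PATH with the standard library; `find_espeak_ng` is a hypothetical helper, and a PATH hit is a first approximation rather than proof that misaki will locate espeak-ng the same way (a Windows .msi install in particular may not add itself to PATH).

```python
import shutil


def find_espeak_ng():
    """Return the full path to the espeak-ng executable, or None if it is not on PATH."""
    return shutil.which("espeak-ng")


if __name__ == "__main__":
    path = find_espeak_ng()
    if path:
        print(f"espeak-ng found at {path}")
    else:
        print("espeak-ng is not on PATH; install it with your OS package manager")
```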

2. Install PyTorch with the Blackwell-compatible CUDA build

The RTX 5090 is a Blackwell (sm_120) card. Install the cu128 PyTorch wheel so the kernels match — cu126 and earlier wheels do not ship sm_120 kernels:

pip install torch --index-url https://download.pytorch.org/whl/cu128

Kokoro itself does not call FlashAttention-2, xformers, or sageattention, so the FA2 sm_120 kernel gap does not apply to this recipe.
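
To confirm that the installed wheel really targets the card, you can compare the GPU's compute capability with the architectures the wheel was compiled for. This is a minimal sketch: `arch_tag`, `wheel_supports_device`, and `check_installed_torch` are hypothetical helper names, and the check assumes the wheel lists the device as an exact `sm_XY` entry.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into a tag such as 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_supports_device(arch_list, capability):
    """True if the wheel's compiled architecture list contains the device's tag."""
    return arch_tag(capability) in arch_list


def check_installed_torch():
    """Describe whether the installed PyTorch build targets the first visible GPU."""
    try:
        import torch  # imported lazily so the helpers above work without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch cannot see a CUDA device"
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    verdict = "is" if wheel_supports_device(arch_list, capability) else "is NOT"
    return f"{arch_tag(capability)} {verdict} in {arch_list}"


if __name__ == "__main__":
    print(check_installed_torch())
```

On a 5090 the capability should come back as (12, 0), so the tag to look for is sm_120.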

3. Install the Python package

The kokoro PyPI package pulls in misaki, the model loader, and the inference loop. From the Hugging Face model card:

pip install "kokoro>=0.9.2" soundfile

Optional non-English language packs (only install what you need):

pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Mandarin Chinese

4. (Optional) Pre-download the weights

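The first call to KPipeline downloads ~200 MB of weights to your Hugging Face cache. Pre-fetch them if you want the download to happen up front rather than on first synthesis, and set the HF_HOME environment variable before running it if you also want to control where they land: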

from huggingface_hub import snapshot_download
snapshot_download("hexgrad/Kokoro-82M")

Running

The minimal Python example, verbatim from the Hugging Face model card:

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "Kokoro is an open-weight 82-million-parameter text-to-speech model."
generator = pipeline(text, voice='af_heart')

for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'{i}.wav', audio, 24000)

Note the 24000 sample rate — Kokoro emits 24 kHz audio. gs and ps are the graphemes and phonemes for the chunk, useful for debugging pronunciation.
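
If you would rather end up with one file than one WAV per chunk, the chunks can be joined before writing. The sketch below is one way to do that with only the standard library. `write_chunks_to_wav` is a hypothetical helper, not part of kokoro, and it assumes each chunk is a sequence of floats in [-1.0, 1.0].

```python
import array
import sys
import wave

SAMPLE_RATE = 24000  # Kokoro's output rate, as noted above


def write_chunks_to_wav(path, chunks, sample_rate=SAMPLE_RATE):
    """Join float audio chunks into one 16-bit mono WAV.

    Each chunk is assumed to hold floats in [-1.0, 1.0]; values outside
    that range are clipped. Returns the total duration in seconds.
    """
    pcm = array.array("h")  # signed 16-bit samples
    for chunk in chunks:
        for sample in chunk:
            clipped = max(-1.0, min(1.0, float(sample)))
            pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate
```

With the generator from the example above, a call might look like `write_chunks_to_wav("speech.wav", (audio.tolist() for _, _, audio in generator))`, assuming the yielded audio object supports `.tolist()`. The per-sample loop is slow for long texts; it is meant to show the idea, not to be the fastest route.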

Language codes

From the official GitHub README, lang_code is a single letter:

Code	Language
`a`	American English
`b`	British English
`e`	Spanish
`f`	French
`h`	Hindi
`i`	Italian
`j`	Japanese (requires `misaki[ja]`)
`p`	Brazilian Portuguese
`z`	Mandarin Chinese (requires `misaki[zh]`)

The voice ID (af_heart above) selects one of the 54 voices — see the VOICES.md catalogue for the full list. The prefix encodes language + gender (e.g. af_ = American female).
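
Because the prefix carries the language, a voice ID can be used to pick the matching lang_code instead of tracking both by hand. The sketch below assumes the two-letter prefix is always language code followed by gender code; `parse_voice_id` is a hypothetical helper, and reading `m` as male is an assumption (the text above only shows `af_`).

```python
# Language codes copied from the table above.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}

# 'f' = female comes from the af_ example; 'm' = male is an assumption.
GENDERS = {"f": "female", "m": "male"}


def parse_voice_id(voice_id):
    """Split a voice ID such as 'af_heart' into language, gender, and name."""
    prefix, separator, name = voice_id.partition("_")
    if not separator or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return {
        "lang_code": lang_code,
        "language": LANGUAGES[lang_code],
        "gender": GENDERS[gender_code],
        "name": name,
    }
```

The returned `lang_code` can then be passed to `KPipeline(lang_code=...)` so the pipeline and the voice stay in sync.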

Server-style deployment (optional)

If you'd rather expose an OpenAI-compatible HTTP API instead of writing Python, the community-maintained Kokoro-FastAPI wrapper is the most-cited path, per ThinkSmart.Life's local-rig writeup:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI/docker/gpu
docker compose up --build

The repo also ships a top-level start-gpu.sh helper that wraps the same compose invocation.
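
Once the container is up, any client that can speak the OpenAI speech endpoint should be able to call it. The sketch below builds such a request with the standard library. The `/v1/audio/speech` path and the `model` / `input` / `voice` / `response_format` fields follow the OpenAI API shape; the port 8880 and the model name "kokoro" are assumptions about the wrapper's defaults, so check its README before relying on them. `build_speech_request` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(text, voice="af_heart", base_url="http://localhost:8880"):
    """Build a POST request for an OpenAI-style /v1/audio/speech endpoint."""
    payload = {
        "model": "kokoro",  # assumed model name; verify against the wrapper's docs
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and save the returned audio bytes; returns the byte count."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(audio_bytes)
    return len(audio_bytes)
```

Keeping request construction separate from the network call makes it easy to inspect the payload without a running server.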

Multi-model server (the real 5090 angle)

A 5090 running Kokoro alone leaves ~30 GB of VRAM unused. The 5090-specific angle here is to keep Kokoro resident in its own CUDA context and load other models alongside it, turning the card into a single-box production stack with voice, chat, and image generation all colocated. Concrete combinations that fit a 32 GB envelope:

Stack	Kokoro footprint	Companion(s)	Companion footprint	Total resident
TTS + chat	2–3 GB	Qwen3-8B Q4_K_M GGUF	~6 GB peak (recipe)	~9 GB; ~23 GB free
TTS + image gen	2–3 GB	Qwen-Image 20B FP8 via ComfyUI	~15 GB peak (recipe)	~18 GB; ~14 GB free
TTS + chat + image gen	2–3 GB	Qwen3-8B Q4_K_M + Qwen-Image 20B FP8 (loaded into separate runtimes)	~6 GB + ~15 GB	~24 GB; ~8 GB free for KV cache headroom

Kokoro's own GPU footprint stays at the 2–3 GB that the Spheron deployment guide reports for any of these combinations; it is the other model(s) that determine whether the stack fits in 32 GB. Run each component in its own process and CUDA context (llama.cpp or Ollama for the LLM, ComfyUI for the image-gen model, the FastAPI wrapper above for Kokoro) so that VRAM allocation stays additive and each component can be restarted independently.
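
The arithmetic behind the table can be sketched as a small budget check. The footprints used here are the table's own peak figures, with Kokoro taken at the upper 3 GB bound; `vram_budget` is a hypothetical helper, and real usage will vary with context length, batch size, and resolution.

```python
CARD_VRAM_GB = 32  # RTX 5090


def vram_budget(footprints_gb, card_gb=CARD_VRAM_GB):
    """Sum per-model peak footprints and report what is left on the card."""
    total = sum(footprints_gb.values())
    return {
        "total_gb": total,
        "free_gb": card_gb - total,
        "fits": total <= card_gb,
    }


# Peak figures from the "TTS + chat + image gen" row above.
full_stack = {
    "kokoro": 3,
    "qwen3-8b-q4_k_m": 6,
    "qwen-image-20b-fp8": 15,
}

if __name__ == "__main__":
    print(vram_budget(full_stack))
```

Adding a candidate model to the dictionary before loading it is a cheap way to see whether the stack still fits before anything is allocated.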

Results

Speed: No first-party RTX 5090 speed numbers exist for Kokoro yet. For reference, the Spheron deployment guide reports an A100 real-time factor (RTF, synthesis time divided by audio duration) of ~0.03 (i.e. 1 s of audio in 30 ms), and community RTX 4090 measurements cluster around RTF ~0.04–0.06 (1 s of audio in 40–60 ms). The 5090 is the 4090's successor in the same consumer-GPU class, so expect comfortably faster-than-realtime performance, but track /check/kokoro-tts/rtx-5090 for empirical 5090 numbers once a community benchmark lands via /contribute.
VRAM usage: Weights are under 1 GB at FP16; total GPU memory during inference (including CUDA kernels and buffers) is 2–3 GB, per the Spheron deployment guide. The Clore.ai guide lists 2 GB as the minimum and 4 GB as the recommended VRAM, with an RTX 3060 as their recommended card — which leaves the 5090 with roughly 30 GB of spare headroom for the colocation table above.
Quality notes: 24 kHz output, 54 voices, 9 languages, Apache-2.0 license. Input is hard-capped at 510 tokens per generation call (per an independent PyTorch/ONNX benchmark gist) — long text gets chunked automatically by the pipeline iterator.

For the full benchmark data, see /check/kokoro-tts/rtx-5090.
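
Until 5090 numbers are published, RTF is easy to measure yourself: divide the wall-clock synthesis time by the duration of the audio produced. The sketch below shows the calculation; `real_time_factor` and `timed` are hypothetical helpers, and the 24000 default is Kokoro's output sample rate.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

An RTF below 1.0 means faster than realtime. When timing a real run, discard the first call (it includes model loading and warm-up) and make sure the timed function has consumed the whole generator, since the pipeline yields chunks lazily.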

Troubleshooting

`RuntimeError: espeak-ng not found` on first synthesis call

The Python kokoro package wraps misaki, which in turn calls espeak-ng for phonemisation. The PyPI install does not bundle the binary, so you must install it through your OS package manager (apt-get install espeak-ng / brew install espeak-ng / Windows .msi) as covered in step 1, which follows the official GitHub README.

Non-English voice errors

Japanese (lang_code='j') and Mandarin (lang_code='z') require the optional misaki language packs — pip install "misaki[ja]" or pip install "misaki[zh]". Without them you'll see a missing-dependency traceback at pipeline init. Source: Hugging Face model card.

PyTorch crash on first inference call (Blackwell-specific)

If you installed PyTorch via the default index (pip install torch), you may have landed a cu126 wheel that does not ship sm_120 kernels for the 5090. Reinstall with the explicit cu128 index per step 2:

pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu128

Kokoro itself does not depend on FlashAttention-2, so the FA2 sm_120 kernel gap is not a concern here — only the base PyTorch CUDA build matters.

Silent / empty audio output on Windows

Some non-English voices have been reported to return silence on Windows in the upstream GitHub issue tracker (e.g. Spanish em_alex on Windows 11). If you hit this, verify that your espeak-ng install can phonemise the language on its own (espeak-ng -v es -q --ipa "hola" should print IPA phonemes; without -q and --ipa the command only speaks the text aloud), then file a fresh issue on the repo if the wrapper still fails.
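
To tell a genuinely silent chunk from a playback problem, it helps to look at the samples rather than listen to the file. The sketch below flags a chunk whose peak amplitude is effectively zero. `peak_amplitude` and `is_silent` are hypothetical helpers, the 1e-4 threshold is an arbitrary assumption, and the samples are assumed to be floats (for example `audio.tolist()` on a chunk from the pipeline).

```python
def peak_amplitude(samples):
    """Largest absolute sample value; 0.0 for an empty sequence."""
    return max((abs(float(sample)) for sample in samples), default=0.0)


def is_silent(samples, threshold=1e-4):
    """True if no sample rises above the threshold (assumed value, tune as needed)."""
    return peak_amplitude(samples) < threshold
```

If `is_silent` returns True for every chunk of a non-English voice while the same text works with an English voice, the problem is more likely in phonemisation than in audio output.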

Pip install fails with `misaki[en]>=X.Y.Z has no matching distribution`

Reported in the upstream issue tracker — pin kokoro to a version whose declared misaki dependency is actually published on PyPI (the latest tagged release on the GitHub repo is the safe bet). Avoid installing from main if the bumped misaki requirement hasn't shipped yet.