How much VRAM does Kokoro TTS need?

About 2 GB — the minimum this recipe targets.

How hard is this setup?

Beginner. Follow the installation steps below.

Kokoro TTS on RTX 3060 Ti: 82M-Parameter Text-to-Speech, 47 Voices, Under 3 GB VRAM

What You'll Build

A local text-to-speech pipeline using hexgrad/Kokoro-82M — an 82-million-parameter Apache-2.0 TTS model that emits 24 kHz audio across 9 languages and 47 voices (counted from the canonical VOICES.md). The model is small enough that the 8 GB RTX 3060 Ti has plenty of headroom; you'll have most of your VRAM free for other workloads on the same card.

Hardware data: RTX 3060 Ti (8 GB, GA104-200 / Ampere, sm_86, 256-bit GDDR6, ~448 GB/s) · weights fit under 1 GB at FP16; total inference footprint typically 2–3 GB · benchmark data: /check/kokoro-tts/rtx-3060-ti

Sizing note: Kokoro is close to hardware-agnostic. A hands-on local-rig writeup notes that even a "$200 used RTX 3060 would be more than enough for Kokoro alone", and the model also runs on CPU. The 8 GB RTX 3060 Ti is heavily over-provisioned for this 82M model (2–3 GB used out of 8 GB), so the steps below apply unchanged to any modern NVIDIA GPU with at least 2 GB of VRAM. The RTX 3060 Ti is an Ampere card (GA104-200, sm_86) with 4864 CUDA cores and 8 GB of GDDR6 on a 256-bit bus (~448 GB/s), per Wikipedia's GeForce 30 series table and the ASUS Dual RTX 3060 Ti techspec, so the default pip install torch stable wheels already ship the right kernels. No special CUDA build or FlashAttention step is needed for a model this small.

Requirements

Component	Minimum	Tested
GPU	2 GB VRAM (per Clore.ai guide)	RTX 3060 Ti (8 GB)
RAM	8 GB	—
Storage	~1 GB (weights ~312 MB; rest is the Python wheel + misaki G2P data)	—
Software	Python 3.9+ (per Clore.ai), PyTorch with CUDA, `espeak-ng`	—

The weights are a single file, kokoro-v1_0.pth — ~312 MB on disk per the Hugging Face file listing (327,212,226 bytes exactly). Inference activations for an 82M model stay well under a gigabyte, so total resident VRAM is comfortably 2–3 GB on any modern GPU, including the RTX 3060 Ti.
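As a quick sanity check on those figures, the byte count from the file listing can be converted by hand. The sketch below uses only the number quoted above; treating the card's 8 GB as 8 GiB is an assumption made for the sake of a round comparison.

```python
# Convert the published size of kokoro-v1_0.pth into friendlier units.
WEIGHT_BYTES = 327_212_226      # from the Hugging Face file listing
CARD_BYTES = 8 * 1024**3        # assumption: the 8 GB card treated as 8 GiB


def to_mib(n_bytes: int) -> float:
    """Bytes -> mebibytes (1 MiB = 1,048,576 bytes)."""
    return n_bytes / 1024**2


def to_mb(n_bytes: int) -> float:
    """Bytes -> decimal megabytes (1 MB = 1,000,000 bytes)."""
    return n_bytes / 1_000_000


weights_mib = to_mib(WEIGHT_BYTES)
weights_mb = to_mb(WEIGHT_BYTES)
share_of_card = WEIGHT_BYTES / CARD_BYTES

print(f"weights: {weights_mib:.1f} MiB ({weights_mb:.1f} MB decimal)")
print(f"share of an 8 GiB card: {share_of_card:.1%}")
```

The "~312 MB" figure is therefore a binary (MiB) value; in decimal megabytes the same file is about 327 MB. Either way the weights alone occupy only a few percent of the card.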

Installation

1. Install espeak-ng at the OS level

The misaki G2P (grapheme-to-phoneme) library underneath Kokoro shells out to espeak-ng. Install it with your operating system's package manager or installer. The Linux apt-get line is taken from the HF model card's Colab quickstart cell; the macOS Homebrew and Windows installer lines are the standard equivalents:

# Linux (Debian / Ubuntu)
sudo apt-get install -y espeak-ng

# macOS
brew install espeak-ng

# Windows — download and run the .msi from
# https://github.com/espeak-ng/espeak-ng/releases
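Before moving on, it can help to confirm that the binary is actually reachable from Python. This is a minimal sketch using only the standard library; it assumes the executable is named espeak-ng and is on your PATH, which may not hold on Windows if the installer did not update PATH, so a negative result there is not conclusive.

```python
import shutil
import subprocess


def find_espeak(candidates=("espeak-ng",)):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def espeak_version(path):
    """Ask the binary for its version banner; None if the call fails."""
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    location = find_espeak()
    if location is None:
        print("espeak-ng not found on PATH; install it as described above")
    else:
        print(f"found {location}: {espeak_version(location)}")
```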

2. Install PyTorch (default CUDA wheel)

The RTX 3060 Ti is an Ampere (GA104-200, sm_86) card, fully covered by the default stable PyTorch wheels — no special index URL is required:

pip install torch

Unlike Blackwell (sm_120) GPUs, the 3060 Ti needs no cu128-specific wheel selection: the default pip install torch already includes sm_86 kernels. Kokoro itself does not call FlashAttention-2, xformers, or sageattention, so there is no custom-kernel step for this recipe regardless of the card.
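To confirm that the installed wheel really sees the card and was built with sm_86 kernels, you can ask PyTorch directly. The helper names below are illustrative; the torch.cuda calls are standard PyTorch, and the function falls back to a plain message when PyTorch or a CUDA device is missing.

```python
def sm_label(major: int, minor: int) -> str:
    """Format a compute-capability pair the way this guide writes it."""
    return f"sm_{major}{minor}"


def describe_gpu() -> str:
    """Summarise what PyTorch sees; does not raise if torch or CUDA is absent."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible (CPU only)"
    major, minor = torch.cuda.get_device_capability(0)
    props = torch.cuda.get_device_properties(0)
    return (
        f"{props.name}: {sm_label(major, minor)}, "
        f"{props.total_memory / 1024**3:.1f} GiB, "
        f"wheel built for {torch.cuda.get_arch_list()}"
    )


print(describe_gpu())
```

On an RTX 3060 Ti the capability should come back as (8, 6), and sm_86 should appear in the wheel's architecture list.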

3. Install the Python package

The kokoro PyPI package pulls in misaki, the model loader, and the inference loop. From the Hugging Face model card:

pip install "kokoro>=0.9.2" soundfile

Optional non-English language packs (only install what you need):

pip install "misaki[ja]"   # Japanese
pip install "misaki[zh]"   # Mandarin Chinese

4. (Optional) Pre-download the weights

The first call to KPipeline downloads ~312 MB of weights to your Hugging Face cache. Pre-fetch them if you want to control where they land:

from huggingface_hub import snapshot_download
snapshot_download("hexgrad/Kokoro-82M")
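The call above still writes to the default cache location. One way to steer it elsewhere is the HF_HOME environment variable, which huggingface_hub reads at import time. The sketch below assumes KPipeline later resolves its weights through the same Hugging Face cache, as the download behaviour described above suggests; the function name and the example path are hypothetical.

```python
import os


def prefetch_kokoro(hf_home=None, repo_id="hexgrad/Kokoro-82M"):
    """Download the model snapshot, optionally under a custom HF_HOME.

    HF_HOME must be set before huggingface_hub is first imported in this
    process, which is why the import happens inside the function.
    """
    if hf_home is not None:
        os.environ["HF_HOME"] = os.path.abspath(hf_home)
    from huggingface_hub import snapshot_download

    # Returns the local directory that now holds the snapshot.
    return snapshot_download(repo_id)


# Example (not run here, needs network access; the path is hypothetical):
#   print(prefetch_kokoro("/data/hf-cache"))
```

If you relocate the cache this way, export the same HF_HOME in the shell that later runs Kokoro, otherwise the weights will be downloaded again into the default location.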

Running

The minimal Python example, from the Hugging Face model card:

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "Kokoro is an open-weight 82-million-parameter text-to-speech model."
generator = pipeline(text, voice='af_heart')

for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'{i}.wav', audio, 24000)

Note the 24000 sample rate — Kokoro emits 24 kHz audio. gs and ps are the graphemes and phonemes for the chunk, useful for debugging pronunciation. The first run also downloads the model weights.
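The loop above writes one file per chunk. If you would rather have a single WAV, the chunks can be concatenated first. This is a sketch, assuming each chunk is a one-dimensional CPU array or tensor that NumPy can convert; join_chunks and its gap_seconds parameter are illustrative names, not part of the kokoro package.

```python
import numpy as np

SAMPLE_RATE = 24_000  # Kokoro's output rate, as noted above


def join_chunks(chunks, gap_seconds=0.0, sample_rate=SAMPLE_RATE):
    """Concatenate 1-D audio chunks, optionally with silence between them."""
    arrays = [np.asarray(c, dtype=np.float32).reshape(-1) for c in chunks]
    if not arrays:
        return np.zeros(0, dtype=np.float32)
    gap = np.zeros(int(round(gap_seconds * sample_rate)), dtype=np.float32)
    pieces = []
    for index, array in enumerate(arrays):
        if index and gap.size:
            pieces.append(gap)  # silence only between chunks, not at the ends
        pieces.append(array)
    return np.concatenate(pieces)


# Usage with the pipeline from the example above (not run here):
#   audio = join_chunks(a for _, _, a in pipeline(text, voice='af_heart'))
#   sf.write('full.wav', audio, SAMPLE_RATE)
```

The duration of the result in seconds is simply len(audio) / SAMPLE_RATE, which is handy when checking synthesis speed.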

Language codes

From the official GitHub README, lang_code is a single letter:

Code	Language
`a`	American English
`b`	British English
`e`	Spanish
`f`	French
`h`	Hindi
`i`	Italian
`j`	Japanese (requires `misaki[ja]`)
`p`	Brazilian Portuguese
`z`	Mandarin Chinese (requires `misaki[zh]`)

The voice ID (af_heart above) selects one of the 47 voices — see the VOICES.md catalogue for the descriptions. The prefix encodes language + gender (e.g. af_ = American female).
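Because the first letter of the voice prefix matches the lang_code table, a voice ID can be decoded mechanically. The sketch below is illustrative: the language mapping is copied from the table above, while reading a second letter of m as "male" is an assumption made by analogy with af_ and should be checked against VOICES.md.

```python
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}
# "f" = female comes from the text above; "m" = male is an assumption.
GENDERS = {"f": "female", "m": "male"}


def parse_voice_id(voice_id: str):
    """Split e.g. 'af_heart' into (lang_code, language, gender, name)."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return lang_code, LANGUAGES[lang_code], GENDERS[gender_code], name


print(parse_voice_id("af_heart"))
```

Deriving lang_code from the voice this way keeps the KPipeline language and the chosen voice from drifting apart when you switch voices.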

Server-style deployment (optional)

If you'd rather expose an OpenAI-compatible HTTP API instead of writing Python, the community-maintained Kokoro-FastAPI wrapper is the most-cited path, per ThinkSmart.Life's local-rig writeup. The quickest GPU start is the prebuilt image:

docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

For a build-from-source setup, the repo's README directs you to its docker/gpu compose file:

git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI/docker/gpu
docker compose up --build
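Once the container is up, a client can request speech over HTTP. The sketch below uses only the standard library and assumes the wrapper follows the OpenAI speech API convention: a POST to /v1/audio/speech with model, input, voice and response_format fields. The model name "kokoro" and the "wav" format are assumptions to verify against the Kokoro-FastAPI README; the port comes from the docker run line above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880"  # port published by the docker run line


def build_speech_request(text, voice="af_heart", response_format="wav",
                         base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style speech request."""
    payload = {
        "model": "kokoro",              # assumed model name
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",  # assumed OpenAI-compatible route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio_bytes)
    return len(audio_bytes)


# Example (not run here, needs the server to be running):
#   synthesize_to_file("Hello from Kokoro.", "hello.wav")
```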

Results

Speed: No published benchmark covers Kokoro on the RTX 3060 Ti yet. Kokoro's small 82M architecture is expected to run faster than real time on any modern consumer GPU; the Spheron deployment guide reports a real-time factor (RTF) of about 0.03 measured on an A100, i.e. 1 s of audio synthesised in ~30 ms. At this scale the bottleneck is likely the G2P front-end and per-chunk Python overhead rather than raw tensor throughput, so the RTX 3060 Ti should also be comfortably faster than real time. Until someone measures it on this card, track /check/kokoro-tts/rtx-3060-ti for empirical numbers and contribute your own via /contribute.
VRAM usage: Weights are under 1 GB at FP16; total GPU memory during inference (including CUDA kernels and buffers) is 2–3 GB, per the Spheron deployment guide. The Clore.ai guide lists 2 GB as the minimum and 4 GB as the recommended VRAM — well within the 3060 Ti's 8 GB envelope.
Quality notes: 24 kHz output, 47 voices, 9 languages, Apache-2.0 license. Long text is chunked automatically by the pipeline iterator, which is why the Running example loops over generator.
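The real-time factor quoted under Speed is simply synthesis time divided by the duration of the audio produced, so it is easy to measure on your own card. A minimal sketch follows; measure_rtf is an illustrative helper that expects a zero-argument callable returning the total number of samples generated.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesising / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, sample_rate=24_000):
    """Time a callable that returns the number of samples it produced."""
    start = time.perf_counter()
    n_samples = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, n_samples / sample_rate)


# The A100 figure from the text: 30 ms of compute per 1 s of audio.
print(real_time_factor(0.030, 1.0))

# Usage with the pipeline from the Running section (not run here):
#   rtf = measure_rtf(
#       lambda: sum(len(a) for _, _, a in pipeline(text, voice='af_heart'))
#   )
```

Values below 1.0 mean faster than real time. The first call may include download and warm-up costs, so it is worth discarding it and averaging a few later runs.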

For the full benchmark data, see /check/kokoro-tts/rtx-3060-ti.
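If you want to check the VRAM figure on your own machine, PyTorch's peak-memory counters give a first approximation. This is a sketch with an illustrative helper name; it returns None when PyTorch or a CUDA device is unavailable. Note that these counters only cover PyTorch's own allocator, so nvidia-smi will typically show a higher number because it also includes the CUDA context.

```python
def peak_vram(run):
    """Run a zero-argument callable and report PyTorch's peak CUDA memory.

    Returns a dict with values in GiB, or None without torch or CUDA.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    run()
    torch.cuda.synchronize()  # make sure queued GPU work has finished
    gib = 1024**3
    return {
        "allocated_gib": torch.cuda.max_memory_allocated() / gib,
        "reserved_gib": torch.cuda.max_memory_reserved() / gib,
    }


# Usage with the pipeline from the Running section (not run here):
#   print(peak_vram(lambda: list(pipeline(text, voice='af_heart'))))
```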

Troubleshooting

`RuntimeError: espeak-ng not found` on first synthesis call

The Python kokoro package wraps misaki, which in turn calls espeak-ng for phonemisation. The PyPI install does not bundle the binary — you must install it through your OS package manager (apt-get install espeak-ng / brew install espeak-ng / Windows .msi) as covered in step 1. Source: official GitHub README.

Non-English voice errors

Japanese (lang_code='j') and Mandarin (lang_code='z') require the optional misaki language packs — pip install "misaki[ja]" or pip install "misaki[zh]". Without them you'll see a missing-dependency traceback at pipeline init. Source: Hugging Face model card.
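One way to make that failure friendlier is to wrap pipeline creation and translate the error into the matching install command. The sketch below assumes the missing dependency surfaces as an ImportError (which includes ModuleNotFoundError); if your traceback shows a different exception type, adjust the except clause. The helper names are illustrative.

```python
# Optional misaki extras named in this guide, keyed by lang_code.
EXTRAS = {"j": "misaki[ja]", "z": "misaki[zh]"}


def install_hint(lang_code):
    """Return the pip command for a language's optional pack, or None."""
    extra = EXTRAS.get(lang_code)
    return f'pip install "{extra}"' if extra else None


def make_pipeline(lang_code):
    """Create a KPipeline, adding an install hint if a language pack is missing."""
    from kokoro import KPipeline  # imported lazily so install_hint works alone

    try:
        return KPipeline(lang_code=lang_code)
    except ImportError as exc:  # assumption: missing packs raise ImportError
        hint = install_hint(lang_code)
        if hint is None:
            raise
        raise RuntimeError(
            f"lang_code={lang_code!r} needs an optional language pack: {hint}"
        ) from exc


print(install_hint("j"))
```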

Pip install fails with `misaki[en]>=X.Y.Z has no matching distribution`

Reported in the upstream issue tracker — pin kokoro to a version whose declared misaki dependency is actually published on PyPI (the latest tagged release on the GitHub repo is the safe bet). Avoid installing from main if the bumped misaki requirement hasn't shipped yet.
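Once a version of kokoro does install, you can check which misaki requirement it declares and compare that with what is published on PyPI. This is a small sketch using the standard library's importlib.metadata; the function name is illustrative, and it returns None when the distribution is not installed.

```python
from importlib import metadata


def declared_requirements(dist_name="kokoro", needle="misaki"):
    """Return (installed_version, matching requirement strings), or None."""
    try:
        version = metadata.version(dist_name)
        requires = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
    return version, [req for req in requires if needle in req.lower()]


info = declared_requirements()
if info is None:
    print("kokoro is not installed in this environment")
else:
    print(f"kokoro {info[0]} declares: {info[1]}")
```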