Kokoro TTS on RTX 4080 SUPER: Universal 82M Voice Synthesis with 15 GB to Spare

tts · beginner · 1GB+ VRAM · Jun 2, 2026
prerequisites
  • Any CUDA GPU with 4 GB+ VRAM (this recipe targets an NVIDIA RTX 4080 SUPER 16GB)
  • Python 3.10+
  • espeak-ng (system package)

What You'll Build

A working Kokoro-82M text-to-speech setup on an RTX 4080 SUPER — an open-weight (Apache-2.0) 82-million-parameter TTS model that turns text into natural speech in real time, using under 1 GB of VRAM and leaving the other ~15 GB of the card free for colocated models.

Hardware data: RTX 4080 SUPER (16GB VRAM) · 82M params, <1 GB VRAM · real-time synthesis (RTF ≪ 1) · benchmark data at /check/kokoro-tts/rtx-4080-super

ℹ️ Why this recipe is about headroom, not fit. Kokoro is an 82M-parameter model — it runs comfortably on a 4 GB card and is not GPU-tier-sensitive. On a 16 GB RTX 4080 SUPER the interesting question isn't "does it fit" (it trivially does) but "what else can you run alongside it." This recipe leads with the colocation angle. The 4080 SUPER's raw compute advantage over smaller cards is largely irrelevant here: Kokoro is already real-time on far cheaper hardware, so the extra throughput buys you concurrency, not a faster single stream.

Requirements

Component   Minimum                    Tested
GPU         Any CUDA GPU (4 GB+)       RTX 4080 SUPER (16GB)
RAM         8 GB
Storage     ~312 MB (weights)          ~312 MB
Software    Python 3.10+, espeak-ng

The weights are a single file, kokoro-v1_0.pth — ~312 MB on disk per the Hugging Face file listing. Inference activations for an 82M model stay well under a gigabyte, so total resident VRAM is comfortably under 1 GB on any modern GPU, including the RTX 4080 SUPER.
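
The ~312 MB figure is roughly what 82 million parameters cost at four bytes each. The sketch below is a back-of-envelope check, not a measurement. It assumes the checkpoint stores 32-bit floats (the dtype is not stated here) and treats 82M as a rounded parameter count; weights_size_mib is a hypothetical helper name.

```python
def weights_size_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the raw weights in MiB.

    bytes_per_param=4 assumes float32 storage (an assumption, not a
    documented property of the checkpoint).
    """
    return num_params * bytes_per_param / 2**20


# 82M parameters is the rounded count from the model name.
print(f"{weights_size_mib(82_000_000):.0f} MiB")  # prints "313 MiB"
```

Under the same assumption, loading the model in half precision would roughly halve that number, which is why the sub-1 GB VRAM budget has plenty of slack.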

Installation

1. Install espeak-ng (system dependency)

sudo apt-get install espeak-ng

Kokoro uses espeak-ng for grapheme-to-phoneme fallback on out-of-dictionary words. It is the one non-Python dependency.

2. Install the kokoro package

pip install "kokoro>=0.9.2" soundfile

This is the install command from the model card's Quick Start (latest release on PyPI is 0.9.4), with the version specifier quoted so the shell does not treat > as output redirection. The kokoro package bundles the model loader and the KPipeline inference API. On first use it downloads the hexgrad/Kokoro-82M weights (~312 MB) from Hugging Face automatically.

No special CUDA wheel selection is required for the RTX 4080 SUPER: it is an Ada Lovelace AD103 (sm_89) card, fully covered by the default pip install torch stable wheels. There is no FlashAttention or custom-kernel step for a model this small.
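
To confirm that the installed PyTorch build actually sees the card, a quick check along these lines can help. It is a minimal sketch: describe_gpu is a hypothetical helper name, and it assumes the GPU of interest is device 0.

```python
def describe_gpu() -> str:
    """Return a one-line summary of the first CUDA device PyTorch can see."""
    try:
        import torch  # installed as a dependency of kokoro
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed but cannot see a CUDA device"
    # Device index 0 is an assumption; adjust it on multi-GPU machines.
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return f"{name} (compute capability sm_{major}{minor})"


print(describe_gpu())
```

On an RTX 4080 SUPER the reported compute capability should be sm_89. If the function reports no CUDA device, the usual suspects are a CPU-only torch wheel or a driver problem.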

3. Verify the install

python -c "from kokoro import KPipeline; print('kokoro ok')"

Running

Create a short script that synthesizes speech from text. Kokoro uses a one-letter lang_code to select the language pack (a = American English, b = British English, and so on) and a voice argument to pick a voice. The example below uses af_heart, the voice from the model card's Quick Start; the full set of 54 voices is listed in VOICES.md.

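from kokoro import KPipeline
import soundfile as sf

SAMPLE_RATE = 24000  # Kokoro produces 24 kHz audio

pipeline = KPipeline(lang_code='a')  # 'a' = American English; must match the voice prefix

text = "Kokoro is an open-weight text to speech model with 82 million parameters."

# The pipeline yields one (graphemes, phonemes, audio) tuple per text segment.
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'output_{i}.wav', audio, SAMPLE_RATE)  # one WAV file per segment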

Running the script writes one or more output_*.wav files at a 24 kHz sample rate. The first run also downloads the model weights.
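
If a single file is more convenient than one file per segment, the segments can be concatenated before writing. The sketch below assumes each yielded audio object is a 1-D array-like that NumPy can convert (for example a CPU tensor); join_segments and duration_seconds are hypothetical helper names, not part of the kokoro package.

```python
import numpy as np

SAMPLE_RATE = 24_000  # Hz, the rate used in the script above


def join_segments(segments) -> np.ndarray:
    """Concatenate 1-D audio segments into one float32 array."""
    parts = [np.asarray(seg, dtype=np.float32).reshape(-1) for seg in segments]
    if not parts:
        return np.zeros(0, dtype=np.float32)  # nothing was synthesized
    return np.concatenate(parts)


def duration_seconds(audio: np.ndarray, sample_rate: int = SAMPLE_RATE) -> float:
    """Playback length of the audio in seconds."""
    return len(audio) / sample_rate


# Intended use with the pipeline from the script above:
#   audio = join_segments(a for _, _, a in pipeline(text, voice='af_heart'))
#   sf.write('output.wav', audio, SAMPLE_RATE)
```

Concatenating in memory is fine for short texts. For very long inputs, writing segments incrementally keeps memory use flat.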

Results

  • Speed: Real-time on any modern GPU — Kokoro's real-time factor is well below 1 (it generates audio faster than playback) even on entry-level cards, so the RTX 4080 SUPER's compute headroom translates into concurrency rather than a faster single stream. No 4080 SUPER-specific wall-clock benchmark has been contributed yet; see /check/kokoro-tts/rtx-4080-super and add yours via /contribute.
  • VRAM usage: Under 1 GB resident — the 82M weights are a ~312 MB file and inference activations stay well within a gigabyte, leaving ~15 GB of the card free. See /check/kokoro-tts/rtx-4080-super for community benchmark data.
  • Voices & languages: 54 voices across 9 language packs (American and British English, Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese) per VOICES.md.
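
Real-time factor is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than playback. A minimal sketch for computing it from your own timings follows; real_time_factor is a hypothetical helper, and the numbers in the example are made up for illustration, not measured on any card.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = 24_000) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 0.5 s of compute for 10 s of audio (240,000 samples).
print(real_time_factor(0.5, 240_000))  # prints 0.05
```

When timing a real run (for example with time.perf_counter()), discard the first call: it includes the weight download and model load, which would inflate the figure.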

Colocating with a second model — the real use case

With Kokoro using under 1 GB, the RTX 4080 SUPER has ~15 GB free. Concrete pairings that fit comfortably:

  • A 7B–8B LLM at Q4_K_M (~4.5–5 GB): run a conversational agent and voice its replies with Kokoro on the same card. Pair with Qwen3-8B or similar.
  • Whisper-large-v3 (~3 GB) — build a full speech-to-speech loop (ASR in, TTS out) on one card, with room to spare.
  • Batch TTS — run many KPipeline workers in parallel; each instance is tiny, so you can saturate the card with concurrent synthesis streams. This is where the 4080 SUPER's compute actually pays off versus a smaller card.
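
The headroom arithmetic for a combined stack can be sketched as follows. The footprints are the approximate figures quoted above, taken at their upper bounds. They cover weights only, so an LLM's KV cache, activations, and the per-process CUDA context (none of which are quantified here) will eat into the remainder. remaining_vram_gb is a hypothetical helper name.

```python
CARD_VRAM_GB = 16.0  # RTX 4080 SUPER


def remaining_vram_gb(footprints_gb: dict, total_gb: float = CARD_VRAM_GB) -> float:
    """Return VRAM left after summing the approximate model footprints."""
    used = sum(footprints_gb.values())
    if used > total_gb:
        raise ValueError(f"stack needs {used} GB but the card has {total_gb} GB")
    return total_gb - used


# Approximate upper-bound figures from the pairings listed above.
stack = {"kokoro": 1.0, "llm_q4_k_m": 5.0, "whisper_large_v3": 3.0}
print(remaining_vram_gb(stack))  # prints 7.0
```

Even with all three models resident, the weights-only estimate leaves about 7 GB for caches and batching.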

Troubleshooting

espeak-ng not found

If you see phonemizer errors, confirm that espeak-ng is installed and on PATH by running espeak-ng --version.
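
The same check can be run from the Python environment that will host Kokoro, which catches the case where the binary is installed but not visible to that process. This is a minimal sketch using only the standard library; espeak_ng_version is a hypothetical helper, and a missing binary is only one possible cause of phonemizer errors.

```python
import shutil
import subprocess
from typing import Optional


def espeak_ng_version() -> Optional[str]:
    """Return espeak-ng's version banner, or None if it is not on PATH."""
    exe = shutil.which("espeak-ng")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True,
                            text=True, check=False, timeout=10)
    # Fall back to the executable path if the banner is empty.
    return result.stdout.strip() or result.stderr.strip() or exe


print(espeak_ng_version() or "espeak-ng not found on PATH")
```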

Wrong language output

The lang_code argument must match the voice prefix — af_*/am_* voices need lang_code='a' (American English), bf_*/bm_* voices need lang_code='b' (British English). Mismatches produce garbled prosody. The full prefix-to-language mapping is in VOICES.md.
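
One way to avoid mismatches is to derive lang_code from the voice name instead of passing both by hand. The sketch below relies on the naming pattern described above, where the first letter of the voice name is the language code (af_heart gives 'a', bm_* gives 'b'). That this holds for every voice is an assumption to check against VOICES.md, and lang_code_for is a hypothetical helper name.

```python
def lang_code_for(voice: str) -> str:
    """Derive the lang_code from a voice name such as 'af_heart'.

    Assumes names follow '<language letter><gender letter>_<name>'.
    """
    if len(voice) < 4 or voice[2] != "_":
        raise ValueError(f"unexpected voice name format: {voice!r}")
    return voice[0]


voice = "af_heart"
print(lang_code_for(voice))  # prints "a"
# Intended use: KPipeline(lang_code=lang_code_for(voice)), then pipeline(text, voice=voice)
```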

For the full benchmark data, see /check/kokoro-tts/rtx-4080-super.