How much VRAM does Kokoro TTS need?

About 1 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Kokoro TTS on RX 7800 XT: 82M Voice Synthesis on ROCm (BF16)

What You'll Build

A working Kokoro-82M text-to-speech setup on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Kokoro is an open-weight (Apache-2.0), 82-million-parameter TTS model that turns text into natural speech in real time; it uses under 1 GB of VRAM, leaving the other ~15 GB of the card free for colocated models.

Hardware data: RX 7800 XT (16 GB VRAM) · 82M params, <1 GB VRAM · BF16 on ROCm 7.2 · real-time synthesis (RTF ≪ 1) · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. The ONLY thing that differs from a CUDA Kokoro setup is the PyTorch wheel: you install the ROCm build of torch from the ROCm wheel index, then the rest (pip install kokoro soundfile, espeak-ng) is identical and hardware-agnostic. Kokoro is a pure-PyTorch 82M model — it uses PyTorch SDPA / eager attention, not FlashAttention-2 and not xformers, and there are no custom kernels to build. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), but at 82M params you run the model at native BF16/FP16 with no quantization concern whatsoever. If a guide tells you to pip install xformers, build flash-attn, or pick a cu12x wheel for this card, it's written for the wrong vendor.

ℹ️ Why this recipe is about headroom, not fit. Kokoro is an 82M-parameter model — it runs comfortably on a 4 GB card and is not GPU-tier-sensitive. On a 16 GB RX 7800 XT the interesting question isn't "does it fit" (it trivially does) but "what else can you run alongside it." This recipe leads with the colocation angle. The 7800 XT's compute over smaller cards is largely irrelevant for Kokoro alone: the model is already real-time on far cheaper hardware, so the extra capacity buys you concurrency, not a faster single stream.

✅ AMD covers this on the Radeon family. A dedicated Kokoro AMD Radeon guide lists the officially supported AMD cards as "RX 7900 XTX, RX 7900 XT, RX 7900 GRE" (plus RDNA2 cards) and notes ROCm "can work with other modern AMD GPUs." The 7800 XT belongs to the same RDNA3 generation as those 7900-series cards, and AMD's own ROCm install-on-Linux system-requirements matrix lists the RX 7800 XT (gfx1101) as officially supported. The card therefore runs the same pure-PyTorch Kokoro path as its 7900-series siblings, even though the community Kokoro guide names only the 7900-series cards.

Requirements

Component	Minimum	Tested
GPU	1 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	8 GB	—
Storage	~312 MB (weights)	~312 MB
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.10+, espeak-ng, PyTorch (ROCm 7.2 build)	—

Kokoro-82M is released under the Apache-2.0 license per its Hugging Face model card, and the weights are not gated — no access request or login is required. The weights are a single file, kokoro-v1_0.pth — 327,212,226 bytes (≈312 MiB) per the Hugging Face file listing. Inference activations for an 82M model stay well under a gigabyte, so total resident VRAM is comfortably under 1 GB on any modern GPU, including the 16 GB 7800 XT.
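The size figures above can be checked with a little arithmetic. The sketch below converts the byte count from the file listing to MiB and compares it with a naive parameters-times-bytes estimate. The 4-byte (FP32) and 2-byte (BF16) storage widths are assumptions used for comparison only, and the constant and function names are illustrative.

```python
# Figures quoted in the text above.
WEIGHT_FILE_BYTES = 327_212_226  # kokoro-v1_0.pth, per the file listing
PARAM_COUNT = 82_000_000         # "82M" is a rounded parameter count


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


def estimated_weight_bytes(params: int, bytes_per_param: int) -> int:
    """Naive estimate: every parameter stored at a fixed width."""
    return params * bytes_per_param


weights_mib = bytes_to_mib(WEIGHT_FILE_BYTES)
# Assumed storage widths, for comparison only.
fp32_estimate_mib = bytes_to_mib(estimated_weight_bytes(PARAM_COUNT, 4))
bf16_estimate_mib = bytes_to_mib(estimated_weight_bytes(PARAM_COUNT, 2))

print(f"file on disk:         {weights_mib:.1f} MiB")
print(f"82M params at 4 B:    {fp32_estimate_mib:.1f} MiB")
print(f"82M params at 2 B:    {bf16_estimate_mib:.1f} MiB")
```

The on-disk size sits close to the 4-byte estimate, which suggests the weights take roughly a third of a gigabyte at most once loaded, and less if the model is cast to BF16. Either way there is room for activations inside the 1 GB budget.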

Installation

1. Install espeak-ng (system dependency)

sudo apt-get install espeak-ng

Kokoro uses espeak-ng for grapheme-to-phoneme fallback on out-of-dictionary words. It is the one non-Python, non-GPU dependency and is the same on AMD as on any other platform.

2. Install PyTorch for ROCm

This is the one and only AMD-specific step. The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Install torch from the ROCm wheel index before installing the kokoro package, so pip doesn't pull a default CUDA build:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the current stable ROCm PyTorch wheel is rocm7.2 — but the rocmX.Y tag MOVES over time (6.3 → 6.4 → 7.x). Read the live selector at pytorch.org/get-started/locally and pick the ROCm option to confirm the current tag before running. AMD also ships its own Radeon-tuned wheels at repo.radeon.com if you prefer AMD's build over the community PyTorch one.

3. Install the kokoro package

pip install kokoro soundfile

This is the canonical install command from the model card's Quick Start (pip install -q kokoro>=0.9.2 soundfile; the latest PyPI release is 0.9.4). If you pin the version in a shell, quote the specifier ("kokoro>=0.9.2"), because an unquoted > is treated as an output redirect. The kokoro package bundles the model loader and the KPipeline inference API and is hardware-agnostic: it imports whatever PyTorch is already installed, so with the ROCm torch wheel from step 2 in place it runs on the GPU through the cuda device namespace that HIP exposes. On first use it downloads the hexgrad/Kokoro-82M weights (~312 MB) from Hugging Face automatically.

4. Verify the install

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from kokoro import KPipeline; print('kokoro ok')"

The first line should print a +rocm7.2-style version suffix and True — ROCm masquerades as the cuda device namespace under HIP, so torch.cuda.is_available() returning True on an AMD card is correct and expected.
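If you want a slightly more detailed check than the two one-liners, the following sketch parses the build suffix out of the version string and prints what PyTorch reports about the GPU. The function names are illustrative, and the regular expression assumes the usual "+rocmX.Y", "+cuNNN", or "+cpu" local-version suffixes; a wheel with no suffix is reported as unknown.

```python
import re
from typing import Optional, Tuple


def classify_torch_build(version: str) -> Tuple[str, Optional[str]]:
    """Return (backend, toolkit_version) parsed from a torch version string."""
    match = re.search(r"\+(rocm|cu|cpu)([\d.]*)", version)
    if match is None:
        return ("unknown", None)
    backend, number = match.groups()
    number = number.rstrip(".")  # drop a trailing dot left by longer suffixes
    return (backend, number or None)


def report_installed_torch() -> None:
    """Print the build flavour and GPU visibility of the installed torch."""
    import torch  # imported lazily so the parser above works without torch

    backend, toolkit = classify_torch_build(torch.__version__)
    print("torch version:", torch.__version__)
    print("build flavour:", backend, toolkit)
    print("HIP runtime:  ", torch.version.hip)  # None on non-ROCm builds
    print("GPU visible:  ", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device name:  ", torch.cuda.get_device_name(0))
```

Call report_installed_torch() in the environment you installed into. A ROCm build should show the rocm flavour, a non-empty HIP runtime string, and the Radeon device name.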

Running

Create a short script that synthesizes speech from text. Kokoro uses a one-letter lang_code to select the language pack (a = American English, b = British English, and so on) and a voice argument to pick a voice. The example below uses af_heart, the voice from the model card's Quick Start; the full set of 54 voices is listed in VOICES.md.

from kokoro import KPipeline
import soundfile as sf

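# lang_code 'a' selects American English; it must match the voice prefix.
pipeline = KPipeline(lang_code='a')

text = "Kokoro is an open-weight text to speech model with 82 million parameters."

# The pipeline yields one (graphemes, phonemes, audio) tuple per text segment.
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'output_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio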

Running the script writes one or more output_*.wav files at a 24 kHz sample rate. The first run also downloads the model weights. No special device-selection code is needed — the kokoro package uses the available ROCm-backed GPU through PyTorch's standard cuda/HIP namespace.

Results

Speed: Real-time — Kokoro's real-time factor is well below 1 (it generates audio faster than playback) even on entry-level cards, so the 7800 XT's compute headroom translates into concurrency rather than a faster single stream. No RX-7800-XT-named wall-clock benchmark has been contributed yet; rather than transfer a number from a different GPU, the speed figure is omitted. (In particular, no figure is carried from the higher-end 7900 XTX — that card has more memory bandwidth, 960 vs 624 GB/s, and more WMMA units, so its numbers would not represent the 7800 XT.) If you've measured Kokoro synthesis speed on a 7800 XT, please contribute it so it lands on /check/kokoro-tts/rx-7800-xt.
VRAM usage: Under 1 GB resident — the 82M weights are a ~312 MB file and inference activations stay well within a gigabyte, leaving ~15 GB of the card free. See /check/kokoro-tts/rx-7800-xt for community benchmark data.
Voices & languages: 54 voices across 9 language packs (American and British English, Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese) per VOICES.md. Voice prefixes are case-sensitive (af_*/am_* for American English, bf_*/bm_* for British, ef_*/em_* Spanish, ff_* French, hf_*/hm_* Hindi, if_*/im_* Italian, jf_*/jm_* Japanese, pf_*/pm_* Brazilian Portuguese, zf_*/zm_* Mandarin).
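If you want to measure the speed figure yourself, the real-time factor is synthesis time divided by the duration of the audio produced. The sketch below is one way to compute it. The helper names are illustrative, and time_synthesis accepts any callable that returns audio chunks, so it makes no assumptions about the kokoro API beyond the 24 kHz sample rate used in the script above.

```python
import time
from typing import Callable, Iterable, Sequence

SAMPLE_RATE_HZ = 24_000  # Kokoro's output rate, as used in the script above


def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


def time_synthesis(synthesize: Callable[[], Iterable[Sequence[float]]]) -> float:
    """Run a synthesis callable, count the samples it yields, return the RTF."""
    start = time.perf_counter()
    total_samples = sum(len(chunk) for chunk in synthesize())
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, total_samples)


# Worked example: 2 s of compute for 10 s of audio (240,000 samples).
example_rtf = real_time_factor(2.0, 240_000)
print(f"RTF = {example_rtf:.2f}")
```

With the pipeline from the Running section, the callable could be lambda: (audio for _, _, audio in pipeline(text, voice='af_heart')). Treat the result as approximate: discard the first run, which includes the weight download and warm-up, and note that GPU work can be asynchronous, so wall-clock timing around a generator is only a rough guide.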

Colocating with a second model — the real use case

With Kokoro using under 1 GB, the 16 GB RX 7800 XT has ~15 GB free. Concrete pairings that fit comfortably on ROCm:

A 7B–8B LLM at Q4_K_M (~5 GB) via Ollama or llama.cpp (HIP) — run a conversational agent and voice its replies with Kokoro. Pair with Qwen3-8B or similar; on RDNA3 the LLM path is the cleanest "just works" surface.
Whisper-large-v3 (~3 GB) — build a full speech-to-speech loop (ASR in, TTS out) on one card, with room to spare even alongside a small LLM.
Batch TTS — run many KPipeline workers in parallel; each instance is tiny, so you can saturate the card with concurrent synthesis streams. This is where the 7800 XT's compute actually pays off versus a smaller card.

The 7800 XT's 16 GB is enough to colocate Kokoro with a Q4 8B LLM, with Whisper-large-v3, or with both: at the sizes quoted above the three total roughly 9 GB before KV cache and runtime overhead. That is the practical envelope; the 7900 XTX's 24 GB leaves room for a larger LLM alongside.
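The colocation budget can be checked with simple addition. The sketch below uses the approximate footprints quoted in the pairings above; the reserve_gb parameter is a hypothetical safety margin, and real usage also depends on context length (KV cache), batch size, and runtime overhead, none of which are modeled here.

```python
from typing import Mapping

CARD_VRAM_GB = 16.0

# Approximate resident sizes quoted in the text above.
WORKLOADS_GB = {
    "kokoro-82m": 1.0,        # upper bound; the recipe reports under 1 GB
    "llm-8b-q4_k_m": 5.0,
    "whisper-large-v3": 3.0,
}


def remaining_vram(card_gb: float, workloads: Mapping[str, float],
                   reserve_gb: float = 0.0) -> float:
    """VRAM left after loading every workload and holding back a reserve."""
    return card_gb - reserve_gb - sum(workloads.values())


free_gb = remaining_vram(CARD_VRAM_GB, WORKLOADS_GB)
free_with_margin_gb = remaining_vram(CARD_VRAM_GB, WORKLOADS_GB, reserve_gb=2.0)

print(f"free with all three loaded: {free_gb:.1f} GB")
print(f"free after a 2 GB reserve:  {free_with_margin_gb:.1f} GB")
```

Swapping in your own measured footprints (for example from rocm-smi while the models are loaded) gives a more realistic budget than these quoted figures.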

Troubleshooting

"Torch not compiled with CUDA enabled"

This means the installed PyTorch is not the ROCm build. It usually happens when pip install kokoro pulls in a default torch wheel (CUDA or CPU-only) before you've installed the ROCm wheel. Uninstall and reinstall torch against the ROCm wheel index:

pip uninstall torch torchaudio
pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP). Always install the ROCm torch wheel before pip install kokoro to avoid this entirely.

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python your_script.py

This is a legacy fallback, not a default — the 7800 XT is an officially ROCm-supported card (gfx1101), and Kokoro on the stable ROCm PyTorch wheel runs natively without it. Only reach for it if you hit a "no kernel image is available" / missing-gfx1101-kernel error from a specific library.
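If you need the override for a script you launch in several ways, it can also be set from inside Python. This is a minimal sketch, assuming the variable must be in place before the ROCm runtime initializes, which is why the call belongs above the torch import. The helper name is illustrative, and it leaves any value you already exported untouched.

```python
import os
from typing import MutableMapping


def apply_gfx_override(env: MutableMapping[str, str] = os.environ,
                       version: str = "11.0.0") -> str:
    """Set HSA_OVERRIDE_GFX_VERSION unless a value is already present.

    Returns the value that is in effect after the call.
    """
    return env.setdefault("HSA_OVERRIDE_GFX_VERSION", version)


# Usage in your own script (only if you hit a missing-gfx1101-kernel error):
#
#   apply_gfx_override()   # must run before the line below
#   import torch
```

Keeping the override in one helper makes it easy to remove again once the offending library ships gfx1101 kernels.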

espeak-ng not found

Kokoro uses espeak-ng for grapheme-to-phoneme fallback on out-of-dictionary words. If you see phonemizer errors, confirm espeak-ng is installed (espeak-ng --version) and on PATH. This dependency is unrelated to the GPU and identical across vendors.
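The same check can be done from Python before you construct a pipeline, which gives a clearer message than a phonemizer traceback. This sketch uses only the standard library; the function names are illustrative.

```python
import shutil
import subprocess
from typing import Optional


def find_espeak_ng() -> Optional[str]:
    """Return the full path of espeak-ng if it is on PATH, else None."""
    return shutil.which("espeak-ng")


def check_espeak_ng() -> bool:
    """Print where espeak-ng was found (and its version), or how to install it."""
    path = find_espeak_ng()
    if path is None:
        print("espeak-ng not found on PATH; "
              "install it with: sudo apt-get install espeak-ng")
        return False
    result = subprocess.run([path, "--version"],
                            capture_output=True, text=True, check=False)
    print(f"espeak-ng found at {path}")
    print(result.stdout.strip())
    return True
```

Calling check_espeak_ng() at the top of a synthesis script lets you fail early with an actionable message.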

Wrong language output

The lang_code argument must match the voice prefix — af_*/am_* voices need lang_code='a' (American English), bf_*/bm_* voices need lang_code='b' (British English), and so on. Prefixes are case-sensitive. Mismatches produce garbled prosody. The full prefix-to-language mapping is in VOICES.md.
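One way to avoid the mismatch is to derive lang_code from the voice name instead of typing both. The sketch below assumes the lang_code is always the first letter of the two-letter voice prefix, which matches the prefix list in the Results section; check it against VOICES.md before relying on it. The helper name and the dictionary are illustrative.

```python
# Language packs and their one-letter codes, following the prefix list above.
LANGUAGE_BY_CODE = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def lang_code_for_voice(voice: str) -> str:
    """Derive the lang_code from a voice name such as 'af_heart'.

    Prefixes are case-sensitive: the first letter is the language,
    the second is 'f' or 'm'.
    """
    prefix, separator, name = voice.partition("_")
    if not separator or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice name: {voice!r}")
    code, gender = prefix[0], prefix[1]
    if code not in LANGUAGE_BY_CODE or gender not in ("f", "m"):
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return code


voice = "af_heart"
print(voice, "->", lang_code_for_voice(voice),
      f"({LANGUAGE_BY_CODE[lang_code_for_voice(voice)]})")
```

In a script this becomes KPipeline(lang_code=lang_code_for_voice(voice)), followed by pipeline(text, voice=voice), so the two arguments cannot drift apart.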

Prefer a containerized AMD setup?

If you'd rather not manage the ROCm Python stack by hand, the community Kokoro-FastAPI project ships a prebuilt ROCm Docker image (ghcr.io/remsky/kokoro-fastapi-rocm:latest, run with --device=/dev/kfd --device=/dev/dri). Note it is marked experimental (ROCm 6.4, amd64 only) — for a plain inference script the manual pip path above is the simpler and more current route on a 7800 XT.

For the full benchmark data, see /check/kokoro-tts/rx-7800-xt.