
Kokoro TTS: Lightweight Text-to-Speech That Runs on Anything

tts · beginner · 0GB+ VRAM · May 13, 2026
prerequisites
  • Python 3.10+
  • No GPU required (CPU works perfectly)

What You'll Build

Add high-quality text-to-speech to any project using Kokoro, an 82M-parameter model that generates audio at near real-time speed on a CPU and many times faster than real time on a GPU.

Why Kokoro:

  • 82M parameters — tiny enough to embed in any application
  • Runs on CPU at near real-time speed
  • Multiple built-in voices — American and British English, different styles
  • Open weights — deploy anywhere, no API limits
  • 9M+ Hugging Face downloads (as of 2025)

Requirements

| Component | Value |
|-----------|-------|
| GPU | Optional (any GPU helps) |
| CPU | Any modern CPU |
| RAM | 2GB |
| Storage | 350MB |
| Python | 3.10+ |

Zero GPU requirement is Kokoro's key advantage. Perfect for servers, edge devices, Raspberry Pi 5, or any machine that needs TTS.

Installation

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "kokoro>=0.9.4" soundfile

That's it. Kokoro downloads the model automatically on first use.

For phoneme fallback on out-of-dictionary words (recommended), also install espeak-ng:

# Debian/Ubuntu
sudo apt-get install espeak-ng

# macOS (Homebrew)
brew install espeak-ng

# Windows: download and run the installer from
# https://github.com/espeak-ng/espeak-ng/releases

Basic Usage

Generate Speech

from kokoro import KPipeline
import soundfile as sf
import numpy as np

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz mono audio

# Initialize pipeline (downloads model automatically)
pipeline = KPipeline(lang_code='a')  # 'a' = American English, 'b' = British

# Generate speech
text = "Hello! This is Kokoro TTS running entirely locally on your machine."

# The pipeline yields one (graphemes, phonemes, audio) tuple per text chunk
samples = []
for graphemes, phonemes, audio in pipeline(text, voice='af_heart', speed=1.0):
    samples.append(np.asarray(audio))  # audio may be a torch tensor

# Save to file
audio_out = np.concatenate(samples)
sf.write('output.wav', audio_out, SAMPLE_RATE)
print(f"Generated {len(audio_out)/SAMPLE_RATE:.1f}s of audio")

Available Voices

# American English voices
american_voices = ['af_heart', 'af_bella', 'af_nicole', 'af_sky',
                   'am_adam', 'am_michael']

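# British English voices (use KPipeline(lang_code='b') for these)
british_voices = ['bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis']

# Try different voices
for voice in ['af_heart', 'af_bella', 'am_adam']:
    # Short text yields a single chunk, so each file holds the full sample
    for _, _, audio in pipeline("Testing voice quality.", voice=voice):
        sf.write(f'{voice}_sample.wav', audio, 24000)  # Kokoro outputs 24 kHz audio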

Stream to Speakers (Real-Time)

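# pip install sounddevice
import numpy as np
import sounddevice as sd
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')

# Each iteration yields (graphemes, phonemes, audio); play chunks as they arrive
for _, _, audio in pipeline("Streaming audio directly to speakers!",
                            voice='af_heart'):
    sd.play(np.asarray(audio), 24000)  # Kokoro outputs 24 kHz audio
    sd.wait()  # block until this chunk finishes playing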
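
The loop above synthesizes a chunk, plays it, and only then starts on the next one, which can leave short gaps between chunks. One way to overlap the two steps is a small producer/consumer buffer. The sketch below is illustrative only: `play_pipelined` and its parameters are hypothetical names, and it works on any iterable of chunks and any blocking play function rather than on Kokoro specifically.

```python
import queue
import threading
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the chunk stream


def play_pipelined(chunks: Iterable[T], play: Callable[[T], None],
                   max_buffered: int = 4) -> int:
    """Produce chunks in a background thread while this thread plays them.

    `chunks` is any iterable of audio chunks (hypothetically, the audio
    items produced by the pipeline); `play` must block until one chunk
    has finished playing. Returns the number of chunks played.
    """
    buffer: queue.Queue = queue.Queue(maxsize=max_buffered)

    def producer() -> None:
        try:
            for chunk in chunks:
                buffer.put(chunk)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()

    played = 0
    while True:
        item = buffer.get()
        if item is _DONE:
            return played
        play(item)
        played += 1
```

With sounddevice, `play` could be a small function that calls `sd.play(...)` followed by `sd.wait()`. Errors raised during synthesis are not re-raised in the playing thread here, so a real application would want to add that.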

Performance

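| Hardware | Real-time Factor | 100-word generation |
|----------|------------------|---------------------|
| RTX 4090 | 60–100× | < 0.5 seconds |
| RTX 4070 | 30–60× | ~1 second |
| CPU (modern 8-core) | 3–8× | 3–8 seconds |
| CPU (older 4-core) | 1–3× | 5–15 seconds |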

Real-time factor is the number of seconds of audio produced per second of compute, so anything above 1× is faster than real time. Even on CPU, Kokoro is typically fast enough for live use.
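
To check where your own hardware falls, the real-time factor can be computed from the number of samples generated and the wall-clock time taken. This is a minimal sketch: `real_time_factor` and `measure_rtf` are hypothetical helper names, and `synthesize` stands in for whatever wrapper you write around the pipeline that returns the full sample sequence.

```python
import time
from typing import Callable, Sequence


def real_time_factor(num_samples: int, sample_rate: int, elapsed_s: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (num_samples / sample_rate) / elapsed_s


def measure_rtf(synthesize: Callable[[str], Sequence], text: str,
                sample_rate: int) -> float:
    """Time one call to `synthesize` (a hypothetical wrapper) and return its RTF."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(len(samples), sample_rate, elapsed)


# 10 s of 24 kHz audio generated in 2 s of wall-clock time
print(real_time_factor(240_000, 24_000, 2.0))  # 5.0
```

Run the measurement a few times and discard the first result, since the first call typically includes model loading.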

Advanced Usage

Long Document TTS

from kokoro import KPipeline
import soundfile as sf
import numpy as np

def text_to_audio_file(text: str, output_path: str, voice: str = 'af_heart'):
    pipeline = KPipeline(lang_code='a')
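    chunks = []

    # The pipeline yields (graphemes, phonemes, audio) for each text chunk
    for _, _, audio in pipeline(text, voice=voice, speed=1.0):
        chunks.append(np.asarray(audio))

    sample_rate = 24000  # Kokoro outputs 24 kHz audio
    sf.write(output_path, np.concatenate(chunks), sample_rate)
    return sample_rate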

# Convert an entire article
with open('article.txt') as f:
    text = f.read()

text_to_audio_file(text, 'article.wav')
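
For very long documents it can help to cut the text into bounded pieces yourself, so that each piece can be synthesized and written out before the next one starts. The following is one possible sketch using only the standard library; `split_text` and `max_chars` are hypothetical names, and the sentence-boundary rule is a simple heuristic rather than anything Kokoro prescribes.

```python
import re
from typing import Iterator


def split_text(text: str, max_chars: int = 1000) -> Iterator[str]:
    """Yield pieces of `text` no longer than `max_chars` where possible.

    Sentences are kept whole; a single sentence longer than `max_chars`
    is yielded as-is rather than cut mid-word.
    """
    # Split after sentence-ending punctuation followed by whitespace,
    # and at blank lines between paragraphs.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n{2,}", text)]
    current = ""
    for sentence in filter(None, sentences):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            yield current
            current = sentence
        else:
            current = candidate
    if current:
        yield current


print(list(split_text("One. Two. Three.", max_chars=9)))
# ['One. Two.', 'Three.']
```

Each yielded piece can then be passed to the synthesis function in turn, with the resulting audio appended to the output file instead of held in memory.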

FastAPI Server (Production)

from fastapi import FastAPI, Response
from kokoro import KPipeline
import soundfile as sf
import numpy as np
import io

app = FastAPI()
pipeline = KPipeline(lang_code='a')  # load the model once at startup

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

# Plain `def` lets FastAPI run the blocking synthesis in its thread pool
@app.post("/tts")
def generate_tts(text: str, voice: str = 'af_heart'):
    # The pipeline yields (graphemes, phonemes, audio) for each text chunk
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice=voice)]

    buffer = io.BytesIO()
    sf.write(buffer, np.concatenate(chunks), SAMPLE_RATE, format='WAV')
    return Response(content=buffer.getvalue(), media_type="audio/wav")
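
Because the endpoint declares `text` and `voice` as plain function arguments, FastAPI reads them from the query string. A client can therefore be sketched with the standard library alone. The helper names `build_tts_url` and `fetch_wav` are hypothetical, and the server address in the usage comment is an assumption for illustration.

```python
import urllib.parse
import urllib.request


def build_tts_url(base_url: str, text: str, voice: str = "af_heart") -> str:
    """Build the /tts URL with text and voice encoded as query parameters."""
    query = urllib.parse.urlencode({"text": text, "voice": voice})
    return f"{base_url.rstrip('/')}/tts?{query}"


def fetch_wav(base_url: str, text: str, voice: str = "af_heart",
              timeout: float = 60.0) -> bytes:
    """POST to the endpoint and return the WAV bytes from the response body."""
    request = urllib.request.Request(
        build_tts_url(base_url, text, voice), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Usage, assuming the server is listening on localhost:8000:
#   with open("hello.wav", "wb") as f:
#       f.write(fetch_wav("http://localhost:8000", "Hello from the client."))
```

Query parameters are convenient for short strings; for long texts a JSON request body is usually a better fit, since URLs have practical length limits.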

Batch Processing

# Generate multiple lines efficiently
texts = [
    "Welcome to the tutorial.",
    "Today we'll learn about local AI.",
    "No cloud services required.",
]

for i, text in enumerate(texts):
    # Each result is a (graphemes, phonemes, audio) tuple
    audio = np.concatenate([np.asarray(a) for _, _, a in pipeline(text, voice='am_adam')])
    sf.write(f'line_{i:03d}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio

Merge files:

# Using ffmpeg's concat demuxer (the concat: protocol does not handle WAV headers)
printf "file '%s'\n" line_*.wav > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy output.wav
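
If ffmpeg is not available, the same merge can be sketched in Python with the standard-library `wave` module. This assumes every input is a PCM WAV file with the same channel count, sample width, and sample rate, which should hold for files written by the batch loop above; `merge_wavs` is a hypothetical helper name.

```python
import wave
from typing import Sequence


def merge_wavs(input_paths: Sequence[str], output_path: str) -> int:
    """Concatenate PCM WAV files with identical formats; return total frames."""
    if not input_paths:
        raise ValueError("no input files given")
    total_frames = 0
    expected = None
    with wave.open(output_path, "wb") as out:
        for path in input_paths:
            with wave.open(path, "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if expected is None:
                    # The first file defines the output format
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"{path} has format {fmt}, expected {expected}")
                out.writeframes(src.readframes(src.getnframes()))
                total_frames += src.getnframes()
    return total_frames


# Usage:
#   merge_wavs(["line_000.wav", "line_001.wav", "line_002.wav"], "output.wav")
```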

Voice Characteristics

| Voice | Gender | Accent | Style |
|-------|--------|--------|-------|
| af_heart | Female | American | Warm, natural |
| af_bella | Female | American | Clear, professional |
| af_nicole | Female | American | Energetic |
| af_sky | Female | American | Soft, gentle |
| am_adam | Male | American | Deep, authoritative |
| am_michael | Male | American | Neutral, clear |
| bf_emma | Female | British | Formal, crisp |
| bm_george | Male | British | Classic BBC |
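
The voice IDs in this table follow a visible pattern: the first letter matches the accent (`a` or `b`, the same codes used for `lang_code`) and the second matches the gender. A small helper can make that explicit, for example to pick the matching pipeline for a voice. This is a sketch based only on the pattern in the table; `describe_voice` is a hypothetical name and it deliberately rejects prefixes not listed here.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(voice: str) -> dict:
    """Split a voice ID such as 'bm_george' into its prefix-encoded parts."""
    prefix, sep, name = voice.partition("_")
    if (not sep or len(prefix) != 2
            or prefix[0] not in ACCENTS or prefix[1] not in GENDERS):
        raise ValueError(f"unrecognized voice ID: {voice!r}")
    return {
        "lang_code": prefix[0],  # value to pass as KPipeline(lang_code=...)
        "accent": ACCENTS[prefix[0]],
        "gender": GENDERS[prefix[1]],
        "name": name,
    }


print(describe_voice("bm_george"))
# {'lang_code': 'b', 'accent': 'British', 'gender': 'Male', 'name': 'george'}
```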

Embedding in Applications

Kokoro's small size makes it ideal for embedding:

# Example: Discord bot with TTS
import io

import discord
import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')

async def speak_in_channel(vc, text):
    # Each result is a (graphemes, phonemes, audio) tuple
    chunks = [np.asarray(a) for _, _, a in pipeline(text, voice='af_heart')]
    audio = np.concatenate(chunks)

    buf = io.BytesIO()
    sf.write(buf, audio, 24000, format='WAV')  # Kokoro outputs 24 kHz audio
    buf.seek(0)

    # FFmpeg reads the WAV from stdin and converts it for Discord
    source = discord.FFmpegPCMAudio(buf, pipe=True)
    vc.play(source)

Troubleshooting

No audio output / silent file: Check that soundfile is installed: pip install soundfile

espeak-ng errors on Windows: Download and install espeak-ng before pip install kokoro
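
Both checks above can be scripted. The sketch below looks for the Python packages this guide uses and for an `espeak-ng` executable on the PATH. The function names are hypothetical, and the PATH lookup is only a heuristic: espeak-ng can be installed without being on the PATH, particularly on Windows.

```python
import importlib.util
import shutil
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def espeak_available() -> bool:
    """True if an espeak-ng executable is on PATH (heuristic only)."""
    return shutil.which("espeak-ng") is not None


def report() -> str:
    """Summarize what appears to be missing from the environment."""
    missing = missing_packages(["kokoro", "soundfile", "numpy"])
    lines = [f"missing Python package: {name}" for name in missing]
    if not espeak_available():
        lines.append("espeak-ng not found on PATH")
    return "\n".join(lines) or "environment looks complete"


print(report())
```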

Poor pronunciation of technical terms: Add a phoneme hint using Markdown-link syntax, with the word in brackets and its phonemes between slashes in parentheses: "Install [Kokoro](/kˈOkəɹO/) TTS"

Slow on CPU: This is expected. CPU speed is roughly 1–8× real-time, which is fast enough for most use cases.

Compare with Other TTS Models

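| Model | VRAM | Speed | Voice Cloning | Quality |
|-------|------|-------|---------------|---------|
| Kokoro | None | Very Fast | No | Good |
| F5-TTS | 4GB+ | Fast | Yes (5s) | Excellent |
| XTTS-v2 | 4GB | Medium | Yes (30s) | Good |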

Use Kokoro when: You need fast, reliable TTS with built-in voices, no GPU, or lightweight embedding.

Use F5-TTS when: You need to clone a specific voice.