Kokoro TTS: Lightweight Text-to-Speech That Runs on Anything

What You'll Build

Add high-quality text-to-speech to any project using Kokoro, an 82M-parameter model that generates audio at near real-time speed on CPU and much faster on a GPU. No GPU is required.

Why Kokoro:

82M parameters — tiny enough to embed in any application
Runs on CPU at near real-time speed
Multiple built-in voices — American and British English, different styles
Open weights — deploy anywhere, no API limits
9M+ Hugging Face downloads (as of 2025)

Requirements

Component	Value
GPU	Optional (any GPU helps)
CPU	Any modern CPU
RAM	2GB
Storage	350MB
Python	3.10+

Kokoro's key advantage is that it needs no GPU, which makes it a good fit for servers, edge devices, a Raspberry Pi 5, or any other machine that needs TTS.

Installation

# Quote the requirement so the shell does not treat ">=" as a redirect
pip install "kokoro>=0.9.4" soundfile

That's it. Kokoro downloads the model automatically on first use.

For phoneme support (recommended):

# Linux/Mac (quoted so shells such as zsh do not expand the brackets)
pip install "kokoro[en]"

# Windows: install espeak-ng first
# https://github.com/espeak-ng/espeak-ng/releases
pip install kokoro
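
Because phoneme support depends on espeak-ng, a quick check before the first run can save a confusing error later. The sketch below only looks for an executable on PATH. The helper name find_espeak and the candidate executable names are assumptions for illustration, and on Windows the installer may put the binary in a folder that is not on PATH, so a negative result is a hint rather than proof.

```python
import shutil


def find_espeak(candidates=("espeak-ng", "espeak")):
    """Return the path of the first candidate executable found on PATH, or None.

    The candidate names are an assumption; adjust them for your platform.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_espeak()
    print(location or "espeak-ng not found on PATH")
```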

Basic Usage

Generate Speech

from kokoro import KPipeline
import soundfile as sf
import numpy as np

# Initialize pipeline (downloads model automatically)
pipeline = KPipeline(lang_code='a')  # 'a' = American English, 'b' = British

# Generate speech
text = "Hello! This is Kokoro TTS running entirely locally on your machine."

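# The pipeline yields (graphemes, phonemes, audio) for each text segment.
# Kokoro outputs 24 kHz audio, so the sample rate is a constant.
SAMPLE_RATE = 24000
samples = []
for _, _, audio in pipeline(text, voice='af_heart', speed=1.0):
    samples.append(np.asarray(audio))  # audio may be a torch tensor

# Save to file
audio_out = np.concatenate(samples)
sf.write('output.wav', audio_out, SAMPLE_RATE)
print(f"Generated {len(audio_out) / SAMPLE_RATE:.1f}s of audio")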

Available Voices

# American English voices
american_voices = ['af_heart', 'af_bella', 'af_nicole', 'af_sky',
                   'am_adam', 'am_michael']

# British English voices  
british_voices = ['bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis']

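# Try different voices
for voice in ['af_heart', 'af_bella', 'am_adam']:
    # Each yielded item is (graphemes, phonemes, audio); Kokoro outputs 24 kHz
    segments = [np.asarray(audio)
                for _, _, audio in pipeline("Testing voice quality.", voice=voice)]
    sf.write(f'{voice}_sample.wav', np.concatenate(segments), 24000)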

Stream to Speakers (Real-Time)

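import numpy as np
import sounddevice as sd  # pip install sounddevice
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')

# Play each segment as soon as it is generated (Kokoro outputs 24 kHz audio)
for _, _, audio in pipeline("Streaming audio directly to speakers!",
                            voice='af_heart'):
    sd.play(np.asarray(audio), 24000)
    sd.wait()  # block until this segment has finished playing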

Performance

Hardware	Real-time Factor	100-word generation
RTX 4090	60–100×	< 0.5 seconds
RTX 4070	30–60×	~1 second
CPU (modern 8-core)	3–8×	3–8 seconds
CPU (older 4-core)	1–3×	5–15 seconds

A real-time factor above 1× means audio is generated faster than it plays back. Even on CPU, Kokoro is typically fast enough for live use.
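
The relationship between real-time factor, audio length, and generation time is simple division, and it can be useful to have it as code when benchmarking your own hardware. This is a minimal sketch with hypothetical function names; the 30-second clip in the usage lines is an illustrative value, not a measurement.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (above 1.0 is faster than real time)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long synthesis takes for a clip of the given length."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    # A 30 s clip generated in 6 s corresponds to a real-time factor of 5x
    print(real_time_factor(30.0, 6.0))
    # At 5x, a 30 s clip should take about 6 s to generate
    print(estimated_generation_seconds(30.0, 5.0))
```

To measure the factor on a real machine, time the synthesis loop with time.perf_counter() and divide the length of the resulting audio (samples divided by the sample rate) by the elapsed time.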

Advanced Usage

Long Document TTS

from kokoro import KPipeline
import soundfile as sf
import numpy as np

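def text_to_audio_file(text: str, output_path: str, voice: str = 'af_heart') -> int:
    sample_rate = 24000  # Kokoro outputs 24 kHz audio
    pipeline = KPipeline(lang_code='a')
    chunks = []

    # Each yielded item is (graphemes, phonemes, audio) for one text segment
    for _, _, audio in pipeline(text, voice=voice, speed=1.0):
        chunks.append(np.asarray(audio))  # audio may be a torch tensor

    sf.write(output_path, np.concatenate(chunks), sample_rate)
    return sample_rate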

# Convert an entire article
with open('article.txt') as f:
    text = f.read()

text_to_audio_file(text, 'article.wav')
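
The pipeline already yields audio in segments, so pre-splitting is optional. For very long inputs, though, feeding the text in paragraph-sized batches can make it easier to report progress or to resume after a failure. The helper below is a sketch of one way to do that; the name split_paragraphs and the 2000-character default are assumptions, not part of Kokoro.

```python
def split_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Group blank-line-separated paragraphs into batches of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own batch.
    """
    batches: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty fragments left by repeated blank lines
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            batches.append(current)  # current batch is full, start a new one
            current = para
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for i, batch in enumerate(split_paragraphs(sample, max_chars=40)):
        print(i, repr(batch))
```

Each batch can then be passed to a function such as text_to_audio_file with its own output path, for example article_000.wav, article_001.wav, and so on.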

FastAPI Server (Production)

from fastapi import FastAPI, Response
from kokoro import KPipeline
import soundfile as sf
import numpy as np
import io

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

app = FastAPI()
pipeline = KPipeline(lang_code='a')  # load the model once at startup

# A plain `def` lets FastAPI run the blocking synthesis in its thread pool
@app.post("/tts")
def generate_tts(text: str, voice: str = 'af_heart'):
    # Each yielded item is (graphemes, phonemes, audio)
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice=voice)]

    buffer = io.BytesIO()
    sf.write(buffer, np.concatenate(chunks), SAMPLE_RATE, format='WAV')
    return Response(content=buffer.getvalue(), media_type="audio/wav")
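
Because text and voice are declared as plain function parameters, FastAPI reads them from the query string rather than from a JSON body. A client sketch using only the standard library might look like the following; the base URL http://localhost:8000 and both helper names are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_tts_request(base_url: str, text: str, voice: str = "af_heart") -> Request:
    """Build a POST request whose parameters travel in the query string."""
    query = urlencode({"text": text, "voice": voice})
    # An empty body makes urllib send Content-Length: 0 with the POST
    return Request(f"{base_url}/tts?{query}", data=b"", method="POST")


def fetch_tts(base_url: str, text: str, output_path: str, voice: str = "af_heart") -> None:
    """Call the /tts endpoint and save the returned WAV bytes to output_path."""
    with urlopen(build_tts_request(base_url, text, voice)) as response:
        with open(output_path, "wb") as f:
            f.write(response.read())


if __name__ == "__main__":
    request = build_tts_request("http://localhost:8000", "Hello world")
    print(request.get_method(), request.full_url)
```

Calling fetch_tts("http://localhost:8000", "Hello world", "hello.wav") requires the server to be running. For long texts a request body would be a better fit than a query string, which is a design change to the endpoint rather than to the client.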

Batch Processing

# Generate multiple lines efficiently
texts = [
    "Welcome to the tutorial.",
    "Today we'll learn about local AI.",
    "No cloud services required.",
]

for i, text in enumerate(texts):
    # Each yielded item is (graphemes, phonemes, audio); Kokoro outputs 24 kHz
    segments = [np.asarray(audio) for _, _, audio in pipeline(text, voice='am_adam')]
    sf.write(f'line_{i:03d}.wav', np.concatenate(segments), 24000)

Merge files:

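# Using ffmpeg's concat filter (the "concat:" protocol does not join WAV files correctly)
ffmpeg -i line_000.wav -i line_001.wav -i line_002.wav \
  -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1[out]" -map "[out]" output.wav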
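
If ffmpeg is not available, the same files can be joined in Python with the standard library's wave module. This sketch assumes the inputs are uncompressed PCM WAV files with identical channel count, sample width, and sample rate (16-bit PCM is, to my knowledge, what soundfile writes for .wav by default); the function name merge_wavs is hypothetical.

```python
import wave


def merge_wavs(input_paths, output_path):
    """Concatenate PCM WAV files that share channels, sample width, and sample rate."""
    fmt = None
    frames = []
    for path in input_paths:
        with wave.open(path, "rb") as src:
            current = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if fmt is None:
                fmt = current
            elif current != fmt:
                raise ValueError(f"{path} has format {current}, expected {fmt}")
            frames.append(src.readframes(src.getnframes()))
    if fmt is None:
        raise ValueError("no input files given")

    with wave.open(output_path, "wb") as dst:
        dst.setnchannels(fmt[0])
        dst.setsampwidth(fmt[1])
        dst.setframerate(fmt[2])
        for chunk in frames:
            dst.writeframes(chunk)


if __name__ == "__main__":
    import glob
    inputs = sorted(glob.glob("line_*.wav"))
    if inputs:
        merge_wavs(inputs, "output.wav")
```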

Voice Characteristics

Voice	Gender	Accent	Style
af_heart	Female	American	Warm, natural
af_bella	Female	American	Clear, professional
af_nicole	Female	American	Energetic
af_sky	Female	American	Soft, gentle
am_adam	Male	American	Deep, authoritative
am_michael	Male	American	Neutral, clear
bf_emma	Female	British	Formal, crisp
bm_george	Male	British	Classic BBC
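
The voice IDs in the table follow a visible pattern: the first letter matches the lang_code ('a' for American, 'b' for British) and the second letter indicates gender. The helper below turns that pattern into code. Note that the rule is inferred from the voice names listed in this guide, not taken from an official specification, and describe_voice is a hypothetical name.

```python
LANGUAGES = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(voice_id: str) -> tuple[str, str]:
    """Return (accent, gender) inferred from a voice ID such as 'af_heart'."""
    prefix, sep, name = voice_id.partition("_")
    if len(prefix) != 2 or not sep or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    try:
        return LANGUAGES[prefix[0]], GENDERS[prefix[1]]
    except KeyError:
        raise ValueError(f"unknown voice prefix: {prefix!r}") from None


if __name__ == "__main__":
    for voice in ("af_heart", "am_adam", "bf_emma", "bm_george"):
        print(voice, describe_voice(voice))
```

The first letter of the prefix is also a convenient way to pick the matching lang_code when creating a pipeline for a given voice.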

Embedding in Applications

Kokoro's small size makes it ideal for embedding:

# Example: Discord bot with TTS
import discord
from kokoro import KPipeline
import numpy as np, io, soundfile as sf

pipeline = KPipeline(lang_code='a')

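async def speak_in_channel(vc, text):
    # Each yielded item is (graphemes, phonemes, audio); Kokoro outputs 24 kHz
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice='af_heart')]
    audio = np.concatenate(chunks)
    sr = 24000

    buf = io.BytesIO()
    sf.write(buf, audio, sr, format='WAV')
    buf.seek(0)

    # FFmpeg reads the WAV from stdin and converts it for Discord
    source = discord.FFmpegPCMAudio(buf, pipe=True)
    vc.play(source)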

Troubleshooting

No audio output / silent file: Check that soundfile is installed: pip install soundfile
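
When a file plays back as silence, it helps to find out whether it is actually empty or just very quiet. The sketch below reports the duration and peak level of a WAV file using only the standard library. It assumes 16-bit PCM data, and the function name wav_duration_and_peak is hypothetical.

```python
import sys
import wave
from array import array


def wav_duration_and_peak(path):
    """Return (duration_seconds, peak) for a 16-bit PCM WAV; peak is in 0.0 to 1.0."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = w.getframerate()
        nframes = w.getnframes()
        samples = array("h")
        samples.frombytes(w.readframes(nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0) / 32768
    return nframes / rate, peak


if __name__ == "__main__":
    import os
    if os.path.exists("output.wav"):
        duration, peak = wav_duration_and_peak("output.wav")
        print(f"{duration:.2f}s, peak level {peak:.3f}")
```

A duration of zero suggests that nothing was written, while a non-zero duration with a peak near zero suggests the generated samples themselves are silent.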

espeak-ng errors on Windows: Download and install espeak-ng before running pip install kokoro

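Poor pronunciation of technical terms: Add a phoneme hint with Kokoro's Markdown-style link syntax, putting the word in brackets and its phonemes between slashes in parentheses: "Install [Kokoro](/kˈOkəɹO/) TTS"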

Slow on CPU: This is expected. CPU speed is roughly 1–8× real-time, which is fast enough for most use cases

Compare with Other TTS Models

Model	VRAM	Speed	Voice Cloning	Quality
Kokoro	None	Very Fast	No	Good
F5-TTS	4GB+	Fast	Yes (5s)	Excellent
XTTS-v2	4GB	Medium	Yes (30s)	Good

Use Kokoro when: You need fast, reliable TTS with built-in voices, no GPU, or lightweight embedding.

Use F5-TTS when: You need to clone a specific voice.