What You'll Build
Add high-quality text-to-speech to any project using Kokoro, an 82M-parameter model that generates audio at roughly real-time speed on CPU and much faster on a GPU. No GPU required.
Why Kokoro:
- 82M parameters — tiny enough to embed in any application
- Runs on CPU at near real-time speed
- Multiple built-in voices — American and British English, different styles
- Open weights — deploy anywhere, no API limits
- 9M+ HuggingFace downloads (as of 2025)
Requirements
| Component | Value |
|---|---|
| GPU | Optional (any GPU helps) |
| CPU | Any modern CPU |
| RAM | 2GB |
| Storage | 350MB |
| Python | 3.10+ |
Zero GPU requirement is Kokoro's key advantage. Perfect for servers, edge devices, Raspberry Pi 5, or any machine that needs TTS.
Installation
pip install "kokoro>=0.9.4" soundfile
That's it. Kokoro downloads the model automatically on first use.
For phoneme support (recommended):
# Linux/Mac
pip install "kokoro[en]"
# Windows — install espeak-ng first
# https://github.com/espeak-ng/espeak-ng/releases
pip install kokoro
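Before moving on, it can help to confirm that the pieces installed above are actually visible to Python. The following is a minimal sketch, not part of Kokoro itself: `check_tts_environment` is a hypothetical helper name, and it assumes the espeak-ng executable is discoverable on `PATH` once installed.

```python
import importlib.util
import shutil


def check_tts_environment() -> dict:
    """Report which pieces of the setup above are present on this machine."""
    return {
        # Python packages installed by the pip commands above
        "kokoro": importlib.util.find_spec("kokoro") is not None,
        "soundfile": importlib.util.find_spec("soundfile") is not None,
        # espeak-ng executable; assumed to be discoverable on PATH
        "espeak-ng": shutil.which("espeak-ng") is not None,
    }


if __name__ == "__main__":
    for name, found in check_tts_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as missing points back to the matching install step.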
Basic Usage
Generate Speech
from kokoro import KPipeline
import soundfile as sf
import numpy as np

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

# Initialize pipeline (downloads model automatically)
pipeline = KPipeline(lang_code='a')  # 'a' = American English, 'b' = British

# Generate speech
text = "Hello! This is Kokoro TTS running entirely locally on your machine."
samples = []
# Each result unpacks as (graphemes, phonemes, audio)
for graphemes, phonemes, audio in pipeline(text, voice='af_heart', speed=1.0):
    samples.append(np.asarray(audio))

# Save to file
audio_out = np.concatenate(samples)
sf.write('output.wav', audio_out, SAMPLE_RATE)
print(f"Generated {len(audio_out)/SAMPLE_RATE:.1f}s of audio")
Available Voices
# American English voices
american_voices = ['af_heart', 'af_bella', 'af_nicole', 'af_sky',
                   'am_adam', 'am_michael']
# British English voices (pair these with KPipeline(lang_code='b'))
british_voices = ['bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis']
# Try different voices
for voice in ['af_heart', 'af_bella', 'am_adam']:
    # Results unpack as (graphemes, phonemes, audio); audio is 24 kHz.
    # This short text yields a single chunk, so each file is written once.
    for _, _, audio in pipeline("Testing voice quality.", voice=voice):
        sf.write(f'{voice}_sample.wav', audio, 24000)
Stream to Speakers (Real-Time)
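import numpy as np
import sounddevice as sd  # pip install sounddevice
pipeline = KPipeline(lang_code='a')
# Play each chunk as soon as it is generated.
# Results unpack as (graphemes, phonemes, audio); audio is 24 kHz.
for _, _, audio in pipeline("Streaming audio directly to speakers!",
                            voice='af_heart'):
    sd.play(np.asarray(audio), 24000)
    sd.wait()  # block until this chunk has finished playing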
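The loop above generates a chunk, plays it, and only then starts on the next one, so there can be a short pause between chunks while the model works. One way to hide that pause is to generate in a background thread and play from a small buffer. The sketch below is library-agnostic and makes no claims about Kokoro internals: `stream_with_buffer`, `chunks`, and `play` are hypothetical names introduced for illustration.

```python
import queue
import threading
from typing import Callable, Iterable


def stream_with_buffer(chunks: Iterable, play: Callable[[object], None],
                       max_buffered: int = 4) -> int:
    """Generate chunks in a background thread while `play` consumes them.

    Returns the number of chunks played.
    """
    buffer: queue.Queue = queue.Queue(maxsize=max_buffered)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        try:
            for chunk in chunks:
                buffer.put(chunk)  # blocks while the buffer is full
        finally:
            buffer.put(done)       # always signal the consumer to stop

    threading.Thread(target=producer, daemon=True).start()

    played = 0
    while True:
        chunk = buffer.get()
        if chunk is done:
            return played
        play(chunk)                # blocking playback of one chunk
        played += 1
```

In practice `chunks` would be the audio arrays coming out of the pipeline, and `play` a small function that calls `sd.play` followed by `sd.wait`.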
Performance
| Hardware | Real-time Factor | 100-word generation |
|---|---|---|
| RTX 4090 | 60–100× | < 0.5 seconds |
| RTX 4070 | 30–60× | ~1 second |
| CPU (modern 8-core) | 3–8× | 3–8 seconds |
| CPU (older 4-core) | 1–3× | 5–15 seconds |
Here the real-time factor is audio duration divided by generation time, so values above 1× mean faster than real time. Even on CPU, Kokoro is typically fast enough for live use.
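To see where your own machine falls in this table, you can compute the factor from the length of the generated audio and the time it took to produce. This is a minimal sketch: `real_time_factor` is a hypothetical helper, and it assumes you timed the generation yourself, for example with `time.perf_counter()`.

```python
def real_time_factor(num_samples: int, sample_rate: int, elapsed_s: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    audio_s = num_samples / sample_rate  # duration of the generated audio
    return audio_s / elapsed_s


# Example: 240,000 samples at 24 kHz is 10 s of audio; generated in 2 s
print(real_time_factor(240_000, 24_000, 2.0))  # 5.0
```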
Advanced Usage
Long Document TTS
from kokoro import KPipeline
import soundfile as sf
import numpy as np

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

def text_to_audio_file(text: str, output_path: str, voice: str = 'af_heart') -> int:
    pipeline = KPipeline(lang_code='a')
    chunks = []
    # Results unpack as (graphemes, phonemes, audio)
    for _, _, audio in pipeline(text, voice=voice, speed=1.0):
        chunks.append(np.asarray(audio))
    sf.write(output_path, np.concatenate(chunks), SAMPLE_RATE)
    return SAMPLE_RATE

# Convert an entire article
with open('article.txt', encoding='utf-8') as f:
    text = f.read()
text_to_audio_file(text, 'article.wav')
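For long documents it is useful to know roughly how the text will be broken up before you spend time generating audio. The Kokoro README passes a `split_pattern=r'\n+'` argument when calling the pipeline, which suggests text is split on newlines. The sketch below only previews that newline split with the standard library. It is an approximation under that assumption, `preview_segments` is a hypothetical helper, and the pipeline may subdivide long segments further.

```python
import re
from typing import List


def preview_segments(text: str, split_pattern: str = r"\n+") -> List[str]:
    """Split text on the given pattern and drop empty segments."""
    return [seg.strip() for seg in re.split(split_pattern, text) if seg.strip()]


article = "Title\n\nFirst paragraph.\nSecond paragraph.\n"
for i, segment in enumerate(preview_segments(article)):
    print(i, len(segment), segment)
```

A document that arrives as one giant line is a hint to add paragraph breaks first.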
FastAPI Server (Production)
from fastapi import FastAPI, Response
from kokoro import KPipeline
import soundfile as sf
import numpy as np
import io

app = FastAPI()
pipeline = KPipeline(lang_code='a')  # load the model once at startup

@app.post("/tts")
async def generate_tts(text: str, voice: str = 'af_heart'):
    # Results unpack as (graphemes, phonemes, audio); audio is 24 kHz
    chunks = []
    for _, _, audio in pipeline(text, voice=voice):
        chunks.append(np.asarray(audio))
    buffer = io.BytesIO()
    sf.write(buffer, np.concatenate(chunks), 24000, format='WAV')
    buffer.seek(0)
    return Response(content=buffer.read(), media_type="audio/wav")
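A public endpoint like this will sooner or later receive empty text, huge payloads, or voice names that do not exist. The sketch below shows one way to reject those before they reach the model. `validate_tts_request` and `MAX_CHARS` are hypothetical, the character limit is an arbitrary placeholder, and the voice set is simply the IDs listed earlier in this guide.

```python
from typing import Optional

# Voice IDs listed earlier in this guide; extend as needed.
ALLOWED_VOICES = {
    "af_heart", "af_bella", "af_nicole", "af_sky", "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
}
MAX_CHARS = 5000  # hypothetical limit; tune for your hardware


def validate_tts_request(text: str, voice: str) -> Optional[str]:
    """Return an error message for a bad request, or None if it is acceptable."""
    if not text or not text.strip():
        return "text must not be empty"
    if len(text) > MAX_CHARS:
        return f"text exceeds {MAX_CHARS} characters"
    if voice not in ALLOWED_VOICES:
        return f"unknown voice: {voice}"
    return None
```

Inside the endpoint, a non-`None` result could be turned into a FastAPI `HTTPException` with a 4xx status code.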
Batch Processing
# Generate multiple lines, one WAV file each, reusing the same pipeline
texts = [
    "Welcome to the tutorial.",
    "Today we'll learn about local AI.",
    "No cloud services required.",
]
for i, text in enumerate(texts):
    # Results unpack as (graphemes, phonemes, audio); audio is 24 kHz
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice='am_adam')]
    sf.write(f'line_{i:03d}.wav', np.concatenate(chunks), 24000)
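Merge files:
# Using ffmpeg's concat filter (the concat: protocol is unreliable for WAV, since every file carries its own header)
ffmpeg -i line_000.wav -i line_001.wav -i line_002.wav -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1[out]" -map "[out]" output.wav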
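If you would rather not depend on ffmpeg for this step, the same merge can be done with Python's standard `wave` module. This is a sketch under two assumptions: all inputs share the same channel count, sample width, and sample rate, and they are plain PCM WAV files (which, to my knowledge, is what `soundfile` writes by default for `.wav`). `merge_wavs` is a hypothetical helper name.

```python
import wave
from typing import Sequence


def merge_wavs(inputs: Sequence[str], output: str) -> int:
    """Concatenate PCM WAV files with identical format; return frames written."""
    if not inputs:
        raise ValueError("no input files given")
    with wave.open(inputs[0], "rb") as first:
        fmt = first.getparams()[:3]  # (channels, sample width, frame rate)
    total = 0
    with wave.open(output, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for path in inputs:
            with wave.open(path, "rb") as src:
                if src.getparams()[:3] != fmt:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(src.readframes(src.getnframes()))
                total += src.getnframes()
    return total
```

For the files generated above, the call would be `merge_wavs(['line_000.wav', 'line_001.wav', 'line_002.wav'], 'output.wav')`.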
Voice Characteristics
| Voice | Gender | Accent | Style |
|---|---|---|---|
| af_heart | Female | American | Warm, natural |
| af_bella | Female | American | Clear, professional |
| af_nicole | Female | American | Energetic |
| af_sky | Female | American | Soft, gentle |
| am_adam | Male | American | Deep, authoritative |
| am_michael | Male | American | Neutral, clear |
| bf_emma | Female | British | Formal, crisp |
| bm_george | Male | British | Classic BBC |
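The voice IDs in this table follow a visible pattern: the first letter matches the accent (`a` or `b`), the second the gender (`f` or `m`), and the rest is the voice name. The sketch below decodes an ID on that basis. It is inferred from the table rather than taken from Kokoro's API, and `parse_voice_id` and `VoiceInfo` are hypothetical names.

```python
from typing import NamedTuple

ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}


class VoiceInfo(NamedTuple):
    accent: str
    gender: str
    name: str
    lang_code: str  # first letter of the ID, matching the 'a'/'b' lang codes


def parse_voice_id(voice: str) -> VoiceInfo:
    """Decode an ID such as 'bm_george' using the naming pattern in the table."""
    prefix, _, name = voice.partition("_")
    if (len(prefix) != 2 or not name
            or prefix[0] not in ACCENTS or prefix[1] not in GENDERS):
        raise ValueError(f"unrecognised voice ID: {voice!r}")
    return VoiceInfo(ACCENTS[prefix[0]], GENDERS[prefix[1]], name, prefix[0])


print(parse_voice_id("bm_george"))
```

The `lang_code` field is handy for choosing between `KPipeline(lang_code='a')` and `KPipeline(lang_code='b')` automatically.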
Embedding in Applications
Kokoro's small size makes it ideal for embedding:
# Example: Discord bot with TTS
import discord
from kokoro import KPipeline
import numpy as np, io, soundfile as sf

pipeline = KPipeline(lang_code='a')

async def speak_in_channel(vc, text):
    # Results unpack as (graphemes, phonemes, audio); audio is 24 kHz
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice='af_heart')]
    buf = io.BytesIO()
    sf.write(buf, np.concatenate(chunks), 24000, format='WAV')
    buf.seek(0)
    # FFmpeg reads the WAV from stdin and converts it for Discord
    source = discord.FFmpegPCMAudio(buf, pipe=True)
    vc.play(source)
Troubleshooting
No audio output / silent file: Check soundfile is installed: pip install soundfile
espeak-ng errors on Windows: Download and install espeak-ng before pip install kokoro
Poor pronunciation of technical terms: Supply the phonemes using Markdown-link syntax, with the phonemes between slashes: "Install [Kokoro](/kˈOkəɹO/) TTS"
Slow on CPU: Normal. CPU speed is ~1–8× real-time — fast enough for most use cases
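When chasing a silent output file, a quick first step is to check whether the file really contains silence or whether the problem is on the playback side. The sketch below reads a WAV with the standard library and reports its peak level. It assumes 16-bit PCM audio, which I believe is what `soundfile` writes by default for `.wav`, and `peak_amplitude` is a hypothetical helper name.

```python
import array
import sys
import wave


def peak_amplitude(path: str) -> float:
    """Return the peak level of a 16-bit PCM WAV file, scaled to 0.0-1.0."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()              # WAV data is little-endian
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0
```

A result of `0.0` means the file itself is silent. Anything clearly above zero suggests looking at the player or audio device instead.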
Compare with Other TTS Models
| Model | VRAM | Speed | Voice Cloning | Quality |
|---|---|---|---|---|
| Kokoro | None | Very Fast | No | Good |
| F5-TTS | 4GB+ | Fast | Yes (5s) | Excellent |
| XTTS-v2 | 4GB | Medium | Yes (30s) | Good |
Use Kokoro when: You need fast, reliable TTS with built-in voices, no GPU, or lightweight embedding.
Use F5-TTS when: You need to clone a specific voice.