What You'll Build
Clone any voice from just 5 seconds of audio using F5-TTS — no training required. Generate speech in that cloned voice for as long as you need. Entirely local, no API costs, no data sent to cloud services.
Why F5-TTS stands out:
- Only 5 seconds of reference audio needed (some cloning models ask for 30+ seconds)
- Runs on CPU if you don't have a GPU
- Emotional and expressive voice synthesis
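- Open-source code (MIT license); the pretrained checkpoints are released under CC-BY-NC, so check the terms before commercial use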
Requirements
| Component | Minimum | Notes |
|---|---|---|
| GPU | Any 4GB+ GPU | CPU works too (slower) |
| VRAM | 4GB | RTX 4090 runs 10–20× faster than real time |
| RAM | 8GB | 16GB recommended |
| Reference audio | 5 seconds | WAV, 24kHz recommended |
F5-TTS is lightweight for a voice-cloning model: it fits in 4GB of VRAM, and an RTX 4090 generates audio 10–20× faster than real time.
Installation
Python Package (Easiest)
# Create a virtual environment
python -m venv f5-tts-env
# Activate it (run only the line for your OS)
source f5-tts-env/bin/activate # Linux/Mac
f5-tts-env\Scripts\activate # Windows
# Install F5-TTS
pip install f5-tts
From Source (More Control)
git clone https://github.com/SWivid/F5-TTS
cd F5-TTS
pip install -e .
The models download automatically on first run (~1.5GB).
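To confirm the install before triggering the model download, a quick check from Python can help. This is a minimal sketch: `f5_tts_status` is a hypothetical helper name, and it assumes the distribution is registered as `f5-tts` (source installs may report no version).

```python
import importlib.util
from importlib import metadata


def f5_tts_status():
    """Return (installed, version) for F5-TTS without importing the package."""
    # find_spec only locates the module; it does not load any model weights.
    installed = importlib.util.find_spec("f5_tts") is not None
    version = None
    if installed:
        try:
            # Assumption: the distribution name matches the pip package name.
            version = metadata.version("f5-tts")
        except metadata.PackageNotFoundError:
            version = None
    return installed, version


if __name__ == "__main__":
    ok, ver = f5_tts_status()
    print("f5-tts installed:", ok, "| version:", ver or "unknown")
```

Run it inside the activated virtual environment; if it reports `False`, the package was probably installed into a different interpreter.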
Basic Usage
Command Line
# Clone a voice and generate speech
f5-tts_infer-cli \
--model F5TTS_v1_Base \
--ref_audio "reference.wav" \
--ref_text "This is what the reference audio says." \
--gen_text "This is the text I want to generate in the cloned voice." \
--output_file "output.wav"
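When the same command has to run for many lines of text, it can be easier to assemble it from Python. The sketch below uses only the flags shown above; `build_infer_command` and `run_infer` are hypothetical helper names, and the dry-run default is there so nothing executes until you ask for it.

```python
import shlex
import subprocess


def build_infer_command(ref_audio, ref_text, gen_text, output_file,
                        model="F5TTS_v1_Base"):
    """Build the f5-tts_infer-cli argument list from the flags used in this guide."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--output_file", output_file,
    ]


def run_infer(cmd, dry_run=True):
    """Print the shell-quoted command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_infer_command(
        "reference.wav",
        "This is what the reference audio says.",
        "This is the text I want to generate in the cloned voice.",
        "output.wav",
    )
    run_infer(command)  # dry run: prints the command only
```

Passing the arguments as a list avoids shell-quoting problems when the text contains apostrophes or quotation marks.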
Python API
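from f5_tts.api import F5TTS

# Load the default model (weights download on first run)
tts = F5TTS()

# infer() returns the waveform, its sample rate, and a spectrogram (unused here)
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="Transcription of the reference audio.",
    gen_text="Text to generate in the cloned voice.",
    file_wave="output.wav",  # also write the result to disk
)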
Gradio Web Interface
# Launch the interactive UI
f5-tts_infer-gradio
Then open http://localhost:7860 in your browser.
Preparing Reference Audio
Best results with:
- Clean audio — no background noise
- Single speaker, no music
- 5–15 seconds duration (longer helps marginally)
- WAV format, 24kHz sample rate
- Neutral to slightly expressive speech
Transcription tip: Provide the exact transcript of your reference audio for best results. Even small errors hurt quality.
# Convert audio format if needed (requires ffmpeg)
ffmpeg -i input.mp3 -ar 24000 -ac 1 reference.wav
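The checklist above can be verified before running inference. This sketch uses Python's standard `wave` module, which reads PCM WAV files only; `check_reference` is a hypothetical helper, and its thresholds simply mirror the recommendations in this section.

```python
import wave


def check_reference(path, min_seconds=5.0, max_seconds=15.0, target_rate=24000):
    """Return a list of warnings for a PCM WAV reference clip (empty list = looks fine)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    warnings = []
    if not min_seconds <= duration <= max_seconds:
        warnings.append(
            f"duration is {duration:.1f}s, expected {min_seconds:g}-{max_seconds:g}s"
        )
    if rate != target_rate:
        warnings.append(f"sample rate is {rate} Hz, recommended {target_rate} Hz")
    if channels != 1:
        warnings.append(f"{channels} channels found, mono recommended")
    return warnings


if __name__ == "__main__":
    import sys

    for clip in sys.argv[1:]:
        problems = check_reference(clip)
        print(clip, "OK" if not problems else "; ".join(problems))
```

It cannot judge noise or transcript accuracy, so it complements listening to the clip rather than replacing it.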
Performance
| Hardware | Speed (× real-time) | Notes |
|---|---|---|
| RTX 4090 | 10–20× | 10–20 s of audio per second of compute |
| RTX 4070 | 5–10× | Excellent for production use |
| RTX 3060 (12GB) | 3–5× | Comfortable for most workflows |
| CPU (modern) | 0.5–2× | Slower but works |
A speed above 1× means audio is generated faster than it plays back. At 5–20×, a 60-second clip takes 3–12 seconds on an RTX 4070–4090.
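The arithmetic behind these estimates is just clip length divided by the speed multiplier. A small sketch, with the ranges copied from the table above (`generation_seconds` and `estimate` are hypothetical names):

```python
def generation_seconds(audio_seconds, speed_multiplier):
    """Wall-clock seconds needed to synthesize a clip at a given multiple of real time."""
    if speed_multiplier <= 0:
        raise ValueError("speed_multiplier must be positive")
    return audio_seconds / speed_multiplier


# (slowest, fastest) multipliers, copied from the performance table.
SPEED_RANGES = {
    "RTX 4090": (10, 20),
    "RTX 4070": (5, 10),
    "RTX 3060 (12GB)": (3, 5),
    "CPU (modern)": (0.5, 2),
}


def estimate(audio_seconds, hardware):
    """Return (best_case, worst_case) generation time in seconds."""
    slow, fast = SPEED_RANGES[hardware]
    return (generation_seconds(audio_seconds, fast),
            generation_seconds(audio_seconds, slow))


if __name__ == "__main__":
    for name in SPEED_RANGES:
        best, worst = estimate(60, name)
        print(f"{name}: {best:g}-{worst:g} s for a 60 s clip")
```

These are rough planning numbers; actual throughput depends on text length, sampling steps, and driver setup.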
Multi-Speaker and Batch Generation
from f5_tts.api import F5TTS
tts = F5TTS()
# Generate a long podcast-style script
segments = [
    # (reference clip, transcript of that clip, text to generate)
    ("host_ref.wav", "I have an RTX 4090.", "Welcome to today's episode."),
    ("guest_ref.wav", "I have a 3090.", "We're discussing local AI."),
]
# enumerate() gives each segment its own index, even if two entries are identical
for i, (ref_file, ref_text, gen_text) in enumerate(segments):
    tts.infer(
        ref_file=ref_file,
        ref_text=ref_text,
        gen_text=gen_text,
        file_wave=f"segment_{i}.wav",
    )
Merge segments with ffmpeg:
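# The concat filter decodes each input and joins the audio streams in order
ffmpeg -i segment_0.wav -i segment_1.wav -filter_complex "[0:a][1:a]concat=n=2:v=0:a=1[out]" -map "[out]" output.wav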
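If you would rather stay in Python, the segments can also be joined with the standard `wave` module. This is a sketch under two assumptions: the segment files are PCM WAV, and they all share the same channel count, sample width, and sample rate. `merge_wavs` is a hypothetical helper name.

```python
import wave


def merge_wavs(paths, output_path):
    """Concatenate PCM WAV files that share channels, sample width, and sample rate."""
    if not paths:
        raise ValueError("no input files given")
    fmt = None
    with wave.open(output_path, "wb") as out:
        for path in paths:
            with wave.open(path, "rb") as src:
                current = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if fmt is None:
                    # The first file defines the output format.
                    fmt = current
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif current != fmt:
                    raise ValueError(f"{path} has format {current}, expected {fmt}")
                # Append this segment's audio frames to the output.
                out.writeframes(src.readframes(src.getnframes()))


if __name__ == "__main__":
    merge_wavs(["segment_0.wav", "segment_1.wav"], "output.wav")
```

Segments generated by the same model normally share a format, so the mismatch check mainly guards against mixing in audio from other sources.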
E2 TTS Model Variant
The F5-TTS project ships two model families. Its README describes F5-TTS as the faster of the two in both training and inference, and E2 TTS as a reproduction of the E2 TTS paper's architecture.
| Model | Quality | Speed | Best For |
|---|---|---|---|
| F5-TTS (default) | Best | Faster | Most workloads |
| E2 TTS | Good | Slower | Comparing against the E2 TTS architecture |
# Switch to the E2 TTS checkpoint
f5-tts_infer-cli --model E2TTS_Base --ref_audio ref.wav \
--ref_text "Reference." --gen_text "Output text."
Troubleshooting
Poor voice cloning quality: Check the reference transcript against the audio. Even a one-word mismatch can degrade the result.
Out of memory: Add the --device cpu flag or reduce the nfe_step parameter (default 32).
Robotic output: The reference audio may contain too much background noise. Try a cleaner recording.
Wrong language: The base F5-TTS model is trained on English and Chinese. For other languages, try Kokoro TTS (no voice cloning).
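For the out-of-memory case, the fallback to CPU can be decided in code instead of by hand. This is a sketch, not part of F5-TTS: `pick_device` is a hypothetical helper that only asks PyTorch what is available. Passing the result as `F5TTS(device=...)` is an assumption about the installed API version, so it is left as a comment.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("Selected device:", device)
    # Assumption: the F5TTS constructor accepts a `device` keyword.
    # from f5_tts.api import F5TTS
    # tts = F5TTS(device=device)
```

Calling `pick_device(prefer_gpu=False)` forces CPU, which is the programmatic equivalent of the `--device cpu` advice above.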
Use Cases
- Content creators: Consistent voice for long-form content without recording every word
- Game developers: NPC voice generation without voice actor costs
- Accessibility: Read documents aloud in a familiar voice
- Localization: Prototype dubbed versions in English
Compare TTS Models
| Model | VRAM | Speed | Cloning | Quality |
|---|---|---|---|---|
| F5-TTS | 4GB+ | Fast | 5s ref | Excellent |
| Kokoro | <1GB | Very Fast | No cloning | Good |
| XTTS-v2 | 4GB | Medium | 6s ref | Good |