F5-TTS: Zero-Shot Voice Cloning with 5 Seconds of Audio

tts · beginner · 4GB+ VRAM · May 13, 2026
prerequisites
  • NVIDIA GPU with ≥ 4GB VRAM (or CPU — F5-TTS runs on both)
  • Python 3.10+
  • 5 seconds of reference audio (WAV format)

What You'll Build

Clone any voice from just 5 seconds of audio using F5-TTS — no training required. Generate speech in that cloned voice for as long as you need. Entirely local, no API costs, no data sent to cloud services.

Why F5-TTS stands out:

  • Only 5 seconds of reference audio needed (most models need 30+ seconds)
  • Runs on CPU if you don't have a GPU
  • Emotional and expressive voice synthesis
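  • Open-source code (MIT license); the pretrained checkpoints are released under CC-BY-NC, so check the terms before commercial use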

Requirements

Component | Minimum | Notes
GPU | Any 4GB+ GPU | CPU works too (slower)
VRAM | 4GB | RTX 4090 runs faster than real time
RAM | 8GB | 16GB recommended
Reference audio | 5 seconds | WAV, 24kHz recommended

F5-TTS is extremely lightweight compared to other TTS models. An RTX 4090 generates audio 10–20× faster than real-time.

Installation

Python Package (Easiest)

# Create a virtual environment
python -m venv f5-tts-env

# Activate it (run only the line for your OS)
source f5-tts-env/bin/activate  # Linux/Mac
f5-tts-env\Scripts\activate     # Windows

# Install F5-TTS
pip install f5-tts

From Source (More Control)

git clone https://github.com/SWivid/F5-TTS
cd F5-TTS
pip install -e .

The models download automatically on first run (~1.5GB).
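
To confirm that either install route worked before the first model download, a small standard-library check can help. This is a sketch: it assumes the import name `f5_tts` and the `f5-tts_infer-cli` entry point used elsewhere in this recipe, and the helper name `check_f5_tts_install` is made up for illustration.

```python
import importlib.util
import shutil


def check_f5_tts_install():
    """Report whether the f5_tts package and its CLI entry point are visible.

    Hypothetical helper: it only looks names up, it does not load any model.
    """
    return {
        "package": importlib.util.find_spec("f5_tts") is not None,
        "cli": shutil.which("f5-tts_infer-cli") is not None,
    }


if __name__ == "__main__":
    for name, found in check_f5_tts_install().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry reports missing, the virtual environment is probably not activated in the current shell.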

Basic Usage

Command Line

# Clone a voice and generate speech
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ref_audio "reference.wav" \
  --ref_text "This is what the reference audio says." \
  --gen_text "This is the text I want to generate in the cloned voice." \
  --output_file "output.wav"
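
When the same command has to be repeated for many lines of text, it can be assembled from Python instead of typed by hand. The sketch below only builds the argument list, using the flags shown above; `build_infer_command` is a hypothetical helper name, not part of F5-TTS.

```python
import shlex


def build_infer_command(ref_audio, ref_text, gen_text, output_file,
                        model="F5TTS_v1_Base"):
    """Return the f5-tts_infer-cli argument list for one clip."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--output_file", output_file,
    ]


if __name__ == "__main__":
    cmd = build_infer_command(
        "reference.wav",
        "This is what the reference audio says.",
        "Hello from a script.",
        "hello.wav",
    )
    # Print a copy-pasteable shell line; to execute it instead,
    # pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems when the text contains quotes or apostrophes.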

Python API

from f5_tts.api import F5TTS

tts = F5TTS()

wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="Transcription of the reference audio.",
    gen_text="Text to generate in the cloned voice.",
    file_wave="output.wav",
)

Gradio Web Interface

# Launch the interactive UI
f5-tts_infer-gradio

Navigate to http://localhost:7860

Preparing Reference Audio

Best results with:

  • Clean audio — no background noise
  • Single speaker, no music
  • 5–15 seconds duration (longer helps marginally)
  • WAV format, 24kHz sample rate
  • Neutral to slightly expressive speech

Transcription tip: Provide the exact transcript of your reference audio for best results. Even small errors hurt quality.

# Convert audio format if needed (requires ffmpeg)
ffmpeg -i input.mp3 -ar 24000 -ac 1 reference.wav
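
The checklist above can be partly automated. The following sketch uses Python's built-in `wave` module to flag clips that fall outside the recommended duration, sample rate, or channel count. The thresholds come from this section; the function name `check_reference` is hypothetical, and the `wave` module only reads PCM WAV files, so it cannot judge noise or handle other encodings.

```python
import wave


def check_reference(path, min_s=5.0, max_s=15.0, want_rate=24000):
    """Return a list of warnings for a PCM WAV reference clip (empty if fine)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    warnings = []
    if not min_s <= duration <= max_s:
        warnings.append(f"duration {duration:.1f}s is outside {min_s:g}-{max_s:g}s")
    if rate != want_rate:
        warnings.append(f"sample rate is {rate} Hz, expected {want_rate} Hz")
    if channels != 1:
        warnings.append(f"{channels} channels, expected mono")
    return warnings
```

Calling `check_reference("reference.wav")` after the ffmpeg conversion should return an empty list when the clip is 5 to 15 seconds long.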

Performance

Hardware | Real-time Factor | Notes
RTX 4090 | 10–20× | Generates 10–20s audio per 1s wait
RTX 4070 | 5–10× | Excellent for production use
RTX 3060 (12GB) | 3–5× | Comfortable for most workflows
CPU (modern) | 0.5–2× | Slower but works

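Real-time factor here means seconds of audio generated per second of compute, so a value above 1× is faster than real time (some papers report the inverse ratio, where lower is better). At 5–20×, a 60-second audio clip generates in 3–12 seconds on an RTX 4070–4090.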
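
The arithmetic behind that estimate is a single division. A minimal sketch, using the speed ranges quoted in this section as illustrative inputs rather than measured benchmarks (`generation_seconds` is a hypothetical helper name):

```python
def generation_seconds(audio_seconds, speed_factor):
    """Wall-clock seconds needed to generate a clip.

    speed_factor is seconds of audio produced per second of compute,
    i.e. the multiple of real time used in this section.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


if __name__ == "__main__":
    # (slowest, fastest) speed ranges as quoted in this section.
    speeds = {
        "RTX 4090": (10, 20),
        "RTX 4070": (5, 10),
        "RTX 3060 (12GB)": (3, 5),
        "CPU (modern)": (0.5, 2),
    }
    for name, (slow, fast) in speeds.items():
        best = generation_seconds(60, fast)
        worst = generation_seconds(60, slow)
        print(f"{name}: {best:.0f}-{worst:.0f} s for a 60 s clip")
```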

Multi-Speaker and Batch Generation

from f5_tts.api import F5TTS

tts = F5TTS()

# Generate a long podcast-style script.
# Each tuple is (reference audio, transcript of that reference, text to generate).
segments = [
    ("host_ref.wav", "I have an RTX 4090.", "Welcome to today's episode."),
    ("guest_ref.wav", "I have a 3090.", "We're discussing local AI."),
]

# enumerate() gives every segment its own index, so identical segments
# no longer overwrite each other's output file.
for i, (ref_file, ref_text, gen_text) in enumerate(segments):
    tts.infer(ref_file=ref_file, ref_text=ref_text, gen_text=gen_text,
              file_wave=f"segment_{i}.wav")

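Merge segments with ffmpeg. Use the concat filter, which decodes each input and writes one valid WAV header; the concat: protocol only joins raw bytes and does not produce a correct WAV file:

ffmpeg -i segment_0.wav -i segment_1.wav -filter_complex "[0:a][1:a]concat=n=2:v=0:a=1[out]" -map "[out]" output.wav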
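
If ffmpeg is not available, the merge can also be sketched in pure Python with the built-in `wave` module. This assumes the segments are PCM WAV files with the same channel count, sample width, and sample rate; `merge_wavs` is a hypothetical helper, not part of F5-TTS.

```python
import wave


def merge_wavs(paths, out_path):
    """Concatenate PCM WAV files that share the same audio format."""
    if not paths:
        raise ValueError("no input files")

    fmt = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as wav:
            this_fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{path} has format {this_fmt}, expected {fmt}")
            chunks.append(wav.readframes(wav.getnframes()))

    # Write one header, then every segment's frames in order.
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for chunk in chunks:
            out.writeframes(chunk)
    return out_path
```

For example, `merge_wavs(["segment_0.wav", "segment_1.wav"], "output.wav")` joins the two segments in order. Everything is held in memory, which is fine for short clips but not for hours of audio.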

E2 TTS Model Variant

F5-TTS includes two model variants:

Model | Quality | Speed | Best For
F5-TTS (default) | Best | Moderate | Quality-focused
E2 TTS | Good | Faster | High-volume generation

# Use E2 TTS for faster generation
# (the E2 config is named E2TTS_Base; there is no "v1" E2 variant)
f5-tts_infer-cli --model E2TTS_Base --ref_audio ref.wav \
  --ref_text "Reference." --gen_text "Output text."

Troubleshooting

Poor voice cloning quality: Check the reference audio transcript. Even a one-word difference degrades the result.

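Out of memory: Add the --device cpu flag to run on the CPU. Lowering the --nfe_step parameter (default 32) speeds up generation at some cost in quality, but it mainly saves time rather than memory.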

Robotic output: Reference audio may have too much background noise. Try a cleaner recording.

Wrong language: The base F5-TTS model is trained on English and Chinese and works best with those. For other languages, try Kokoro TTS.
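
For the transcript-mismatch case above, a word-level diff makes small errors easy to spot. The sketch below compares the `ref_text` you typed against a second transcript, for example one written down while re-listening to the clip. It uses only `difflib` from the standard library; the function names are hypothetical and the tokenizer is a simplification that assumes English text.

```python
import difflib
import re


def words(text):
    """Lowercase and drop punctuation so only real word differences remain."""
    return re.findall(r"[a-z0-9']+", text.lower())


def transcript_diff(ref_text, heard_text):
    """Return (operation, ref_words, heard_words) for each mismatch."""
    a, b = words(ref_text), words(heard_text)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    for op, ref_words, heard_words in transcript_diff(
        "I have an RTX 4090.", "I have a RTX 4090"
    ):
        print(op, ref_words, heard_words)
```

An empty result means the two transcripts agree word for word, ignoring case and punctuation.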

Use Cases

  • Content creators: Consistent voice for long-form content without recording every word
  • Game developers: NPC voice generation without voice actor costs
  • Accessibility: Read documents aloud in a familiar voice
  • Localization: Prototype dubbed versions in English

Compare TTS Models

Model | VRAM | Speed | Cloning | Quality
F5-TTS | 4GB+ | Fast | 5s ref | Excellent
Kokoro | <1GB | Very Fast | No cloning | Good
XTTS-v2 | 4GB | Medium | ~6s ref | Good