
Recipes

716 community-tested setups for running open-weights AI models on real consumer GPUs.

Page 7 of 8

multimodal · beginner · 20GB+
Gemma 4 E4B on RTX 3090: Multimodal Inference via BF16 (with 16 GB of Headroom to Spend)
3d · intermediate · 12GB+
Waypoint 1.5 on RTX 3090: Real-Time Interactive World Model at 720p, 30 FPS
image · beginner · 13GB+
Flux.2 Klein 4B on RTX 3090: BFL-Recommended ~13 GB CPU-Offload Path for 4-Step Text-to-Image
image · intermediate · 13GB+
Qwen-Image on RTX 3090: 20B Text-to-Image via ComfyUI GGUF (Ampere Path — No FP8)
video · intermediate · 24GB+
HunyuanVideo-1.5 on RTX 3090: 480p Step-Distilled Image-to-Video on a Razor-Thin 24 GB Envelope
video · intermediate · 24GB+
Wan 2.2 T2V-A14B on RTX 3090: 720p text-to-video in ComfyUI with FP8 weights (Ampere)
video · intermediate · 14GB+
LightX2V on RTX 3090: 4-Step Text-to-Video with Distilled Wan2.1-14B via INT8 / BF16 Offload
video · intermediate · 22GB+
Mochi 1 on RTX 3090: 49-frame 480p Text-to-Video with Diffusers
video · intermediate · 14GB+
CogVideoX 1.5 5B on RTX 3090: 1360x768 Text-to-Video with Diffusers
video · intermediate · 24GB+
Wan 2.2 TI2V-5B on RTX 3090: 720p Text/Image-to-Video in ComfyUI
llm · intermediate · 22GB+
Qwen3-32B on RTX 3090: UD-Q4_K_XL GGUF via llama.cpp
llm · intermediate · 18GB+
Gemma 4 26B A4B-it on RTX 3090: Local Multimodal Chat via Q4_K_M GGUF + llama.cpp
llm · beginner · 10GB+
Qwen3-14B on RTX 3090: Q4_K_M GGUF via Ollama or llama.cpp
llm · beginner · 12GB+
DeepSeek-R1-Distill-Qwen-14B on RTX 3090 via Ollama Q4_K_M GGUF
llm · beginner · 6GB+
Qwen3-8B on RTX 3090: Q4_K_M GGUF with 18 GB of Headroom for Colocation or Long Context
llm · beginner · 10GB+
Llama 3.1 8B on RTX 3090: Local Chat via llama.cpp + Unsloth UD-Q4_K_XL GGUF
llm · beginner · 16GB+
gpt-oss 20B on RTX 3090: MXFP4 Chat at 147 tok/s via Ollama or vLLM
image · intermediate · 20GB+
HiDream-O1-Image Full BF16 on RTX 3090: 2048×2048 Text-to-Image in ComfyUI
image · intermediate · 18GB+
LongCat-Image (base T2I) on RTX 3090: Bilingual 6B Text-to-Image via diffusers BF16 + CPU Offload
image · intermediate · 24GB+
Chroma1-Base (V48) on RTX 3090: Uncensored 8.9B FLUX.1-Schnell De-Distillation via Diffusers BF16
image · intermediate · 16GB+
Juggernaut Z on RTX 3090: Cinematic Photoreal Fine-Tune of Z-Image Base at BF16 via Diffusers or ComfyUI
image · beginner · 16GB+
Z-Image Turbo on RTX 3090: 8-Step 1024x1024 Text-to-Image at BF16 with Diffusers or ComfyUI
multimodal · intermediate · 4GB+
MiniMind-O on RTX 4060 Ti 16GB: 0.1B Omni Model with Headroom to Spare
tts · intermediate · 8GB+
Foundation-1 on RTX 4060 Ti 16GB: Structured Music Sample Generation
tts · intermediate · 12GB+
ACE-Step 1.5 XL on RTX 4060 Ti 16GB: Text-to-Music Generation in ComfyUI
tts · beginner · 2GB+
Kokoro TTS on RTX 4060 Ti 16GB: 82M-Parameter Text-to-Speech, 54 Voices, Under 3 GB VRAM
video · advanced · 16GB+
Sulphur 2 on RTX 4060 Ti 16GB: Uncensored LTX-2.3 Video via ComfyUI GGUF
tts · intermediate · 8GB+
Qwen3-TTS 1.7B-Base on RTX 4060 Ti 16GB: Multilingual Voice Cloning in 10 Languages with FlashAttention-2
tts · beginner · 5GB+
VoxCPM-0.5B on RTX 4060 Ti 16GB: Zero-Shot Voice Cloning TTS in ~5 GB VRAM
tts · intermediate · 4GB+
OmniVoice on RTX 4060 Ti 16GB: Zero-Shot Voice Cloning Across 646 Languages with Room to Spare
tts · intermediate · 5GB+
OpenAudio S1 Mini on RTX 4060 Ti 16GB: 13-Language Distilled TTS in ~5 GB VRAM
image · intermediate · 12GB+
SenseNova U1 (8B-MoT) on RTX 4060 Ti 16GB: VAE-Free Unified Image Gen + Understanding via Q4 GGUF
video · intermediate · 8GB+
LightX2V on RTX 4060 Ti 16GB: 4-Step Text-to-Video via the LightX2V Framework with Distilled Wan2.1-14B
tts · intermediate · 10GB+
Voxtral Mini 3B on RTX 4060 Ti 16GB: local speech understanding in ~9.5 GB
tts · intermediate · 12GB+
MOSS-Audio 4B-Instruct on RTX 4060 Ti 16GB: local audio understanding in ~12 GB
tts · beginner · 8GB+
VoxCPM2 on RTX 4060 Ti 16GB: 30-Language 48kHz Voice Cloning with Headroom to Spare
multimodal · beginner · 20GB+
Gemma 4 E4B on RTX 4090: Multimodal Inference via BF16 (with optional Q8_0 / Q4_K_M GGUF)
3d · intermediate · 12GB+
Waypoint 1.5 on RTX 4090: Real-Time Interactive World Model at 720p
video · intermediate · 24GB+
LightX2V on RTX 4090: 4-Step Text-to-Video with Distilled Wan2.1-14B
video · intermediate · 24GB+
Wan 2.2 TI2V-5B on RTX 4090: 720p Text/Image-to-Video in ComfyUI
image · beginner · 20GB+
Flux.2 Klein 4B on RTX 4090: BF16 Full-Resident 4-Step Text-to-Image via Diffusers or ComfyUI
image · intermediate · 16GB+
Juggernaut Z on RTX 4090: Cinematic Photoreal Fine-Tune of Z-Image Base at BF16 via Diffusers or ComfyUI
llm · beginner · 16GB+
gpt-oss 20B on RTX 4090: MXFP4 chat via Ollama or vLLM
llm · beginner · 10GB+
DeepSeek-R1-Distill-Qwen-14B on RTX 4090 via Ollama Q4_K_M GGUF
llm · intermediate · 18GB+
Gemma 4 26B A4B-it on RTX 4090: Local Multimodal Chat via Q4_K_M GGUF + llama.cpp
llm · intermediate · 22GB+
Qwen3-32B on RTX 4090: UD-Q4_K_XL GGUF via llama.cpp
llm · beginner · 10GB+
Qwen3-14B on RTX 4090: Q4_K_M GGUF via Ollama or llama.cpp
llm · beginner · 6GB+
Qwen3-8B on RTX 4090: Q4_K_M GGUF via Ollama or llama.cpp
image · intermediate · 21GB+
Qwen-Image on RTX 4090: 20B Text-to-Image via ComfyUI FP8 (Native Path)
image · intermediate · 18GB+
LongCat-Image (base T2I) on RTX 4090: Bilingual 6B Text-to-Image via diffusers
image · intermediate · 24GB+
Chroma1-Base (V48) on RTX 4090: Uncensored 8.9B FLUX.1-Schnell De-Distillation via Diffusers BF16
image · intermediate · 20GB+
HiDream-O1-Image on RTX 4090: 2048×2048 Text-to-Image with BF16 in ComfyUI
image · beginner · 16GB+
Z-Image Turbo on RTX 4090: 8-Step 1024x1024 Text-to-Image at BF16 in ~2.3s with Diffusers or ComfyUI
llm · beginner · 10GB+
Llama 3.1 8B on RTX 4090: Local Chat via llama.cpp + Unsloth UD-Q4_K_XL GGUF
video · intermediate · 14GB+
HunyuanVideo-1.5 on RTX 4090: 480p Step-Distilled Text-to-Video in ~75 Seconds
video · intermediate · 24GB+
Wan 2.2 T2V-A14B on RTX 4090: 720p text-to-video in ComfyUI with FP8 scaled weights
video · intermediate · 14GB+
CogVideoX 1.5 5B on RTX 4090: 1360x768 Text-to-Video with Diffusers
video · intermediate · 20GB+
Mochi 1 on RTX 4090: 49-frame 480p Text-to-Video with Diffusers
image · intermediate · 10GB+
HiDream-O1-Image on RTX 4060 Ti 16GB: 2048×2048 Text-to-Image with FP8 in ComfyUI
image · intermediate · 16GB+
Chroma1-Base (V48) on RTX 4060 Ti 16GB: Uncensored 8.9B FLUX.1-Schnell De-Distillation via GGUF in ComfyUI
image · intermediate · 16GB+
Juggernaut Z on RTX 4060 Ti 16GB: Cinematic Photoreal Fine-Tune of Z-Image Base
image · intermediate · 16GB+
LongCat-Image (base T2I) on RTX 4060 Ti 16GB: Bilingual 6B Text-to-Image via ComfyUI GGUF
image · intermediate · 13GB+
Qwen-Image on RTX 4060 Ti 16GB: 20B Text-to-Image via ComfyUI GGUF
video · intermediate · 8GB+
Wan 2.2 TI2V-5B on RTX 4060 Ti 16GB: 720p Text/Image-to-Video in ComfyUI
image · intermediate · 16GB+
Z-Image Turbo on RTX 4060 Ti 16GB: 8-Step Text-to-Image at BF16 with Diffusers or ComfyUI
multimodal · beginner · 6GB+
Gemma 4 E4B on RTX 4060 Ti 16GB: Multimodal Inference via Q4_K_M GGUF (with optional Q8_0 / BF16)
llm · beginner · 16GB+
Qwen3-8B on RTX 4060 Ti 16GB: Q4_K_M GGUF via Ollama or llama.cpp
specialized · beginner · 4GB+
SAM 3 on RTX 4060 Ti 16GB: Promptable Image and Video Segmentation
tts · intermediate · 8GB+
Foundation-1 on RTX 4060: Structured Music Sample Generation at the 8 GB Floor
tts · beginner · 5GB+
VoxCPM on RTX 4060: Zero-Shot Voice Cloning TTS in ~5 GB VRAM
tts · intermediate · 4GB+
OmniVoice on RTX 4060: Zero-Shot Voice Cloning Across 646 Languages in 8 GB
tts · intermediate · 5GB+
OpenAudio S1 Mini on RTX 4060: 13-Language Distilled TTS in ~5 GB VRAM
llm · beginner · 4GB+
Qwen3-4B on RTX 4060: Q4_K_M GGUF via Ollama or llama.cpp
multimodal · intermediate · 4GB+
MiniMind-O on RTX 4060: 0.1B Omni Model (Text + Speech + Image In/Out)
specialized · beginner · 4GB+
SAM 3 on RTX 4060: Promptable Image and Video Segmentation
tts · beginner · 2GB+
Kokoro TTS on RTX 4060: 82M-Parameter Text-to-Speech, 54 Voices, Under 3 GB VRAM
llm · beginner · 16GB+
Qwen3-8B on RTX 5060 Ti: Q4_K_M GGUF via Ollama or llama.cpp
multimodal · beginner · 6GB+
Gemma 4 E4B on RTX 5060 Ti: Multimodal Inference with transformers or llama.cpp
tts · intermediate · 5GB+
OpenAudio S1 Mini on RTX 5060 Ti: 13-Language Distilled TTS in ~5 GB VRAM
tts · intermediate · 4GB+
OmniVoice on RTX 5060 Ti: Zero-Shot Voice Cloning Across 646 Languages
image · intermediate · 12GB+
SenseNova U1 (8B-MoT) on RTX 5060 Ti: VAE-Free Unified Image Gen + Understanding via Q4 GGUF
image · intermediate · 16GB+
LongCat-Image (base T2I) on RTX 5060 Ti: Bilingual 6B Text-to-Image at 16 GB via ComfyUI GGUF
tts · intermediate · 8GB+
Foundation-1 on RTX 5060 Ti: Structured Music Sample Generation
tts · intermediate · 12GB+
ACE-Step 1.5 XL on RTX 5060 Ti: Text-to-Music Generation in ComfyUI
tts · intermediate · 8GB+
Qwen3-TTS 1.7B-Base on RTX 5060 Ti: Multilingual Voice Cloning in 10 Languages
image · intermediate · 16GB+
Chroma V48 on RTX 5060 Ti: Uncensored 8.9B Flux.1-Schnell De-Distillation via GGUF in ComfyUI
tts · intermediate · 12GB+
MOSS-Audio 4B-Instruct on RTX 5060 Ti: local audio understanding in ~12 GB
video · intermediate · 8GB+
Wan 2.2 TI2V-5B on RTX 5060 Ti: 720p Text/Image-to-Video in ComfyUI
image · intermediate · 13GB+
Qwen-Image on RTX 5060 Ti: 20B Text-to-Image via GGUF Quantization
tts · beginner · 2GB+
Kokoro TTS on RTX 5060 Ti: 82M-Parameter Text-to-Speech, 54 Voices, Under 3 GB VRAM
tts · beginner · 8GB+
VoxCPM2 on RTX 5060 Ti: 30-Language 48kHz Voice Cloning in ~8 GB VRAM
video · intermediate · 8GB+
LightX2V on RTX 5060 Ti: 4-Step Text-to-Video with Distilled Wan2.1-14B
video · advanced · 16GB+
Sulphur 2 on RTX 5060 Ti: Uncensored LTX-2.3 Video via GGUF in ComfyUI
image · intermediate · 16GB+
Juggernaut Z on RTX 5060 Ti: Cinematic Photoreal Fine-Tune of Z-Image Base
tts · intermediate · 10GB+
Voxtral Mini 3B on RTX 5060 Ti: local speech understanding in ~9.5 GB
image · intermediate · 10GB+
HiDream-O1-Image on RTX 5060 Ti: 2048×2048 Text-to-Image with FP8 in ComfyUI
3d · intermediate · 10GB+
Hunyuan3D-2.1 on RTX 5060 Ti: Image-to-Mesh 3D Generation (Shape-Only)
multimodal · intermediate · 4GB+
MiniMind-O on RTX 5060 Ti: 0.1B Omni Model (Text + Speech + Image In/Out)
tts · beginner · 5GB+
VoxCPM on RTX 5060 Ti: Zero-Shot Voice Cloning TTS in ~5 GB VRAM
specialized · intermediate · 3GB+
KiMoDo on RTX 5060 Ti: Text-to-3D-Motion Generation Guide
