
Qwen-Image on RTX 5060 Ti: 20B Text-to-Image via GGUF Quantization

image · intermediate · 13GB+ VRAM · May 19, 2026
Prerequisites
  • NVIDIA RTX 5060 Ti (16GB VRAM) or equivalent 16GB-class card
  • Python 3.10+
  • ComfyUI installed and up to date
  • ~14 GB free disk space for the diffusion model alone (~25 GB total with text encoder + VAE)

What You'll Build

A local Qwen-Image text-to-image setup on a 16GB RTX 5060 Ti. The official FP8 build of Qwen-Image — a 20B-parameter MMDiT foundation model from Alibaba's Tongyi Lab — ships as a 20.4 GB diffusion weight (per the ComfyUI native tutorial) and will not fit on a 16GB card. This recipe uses city96's GGUF redistribution (Q4_K_S, 12.1 GB on disk) loaded through the ComfyUI-GGUF custom node, which leaves enough headroom for the Qwen2.5-VL-7B text encoder, the VAE, and activations.

Hardware data: RTX 5060 Ti (16GB VRAM) · 20B-parameter MMDiT at 4-bit · See benchmark data

⚠️ Known issue: the official qwen_image_fp8_e4m3fn.safetensors build is 20.4 GB (ComfyUI native docs) and will OOM on a 16GB card. Use the GGUF path below, not the native FP8 workflow.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (NVIDIA, CUDA-capable) | RTX 5060 Ti (16 GB)
RAM | 32 GB system RAM recommended for text-encoder offload |
Storage | ~14 GB for the diffusion model; ~25 GB total with encoder + VAE + Lightning LoRA |
Software | ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node |

The 20B parameter count is stated explicitly in the Qwen-Image GitHub README ("20B MMDiT image foundation model") and on the city96 GGUF card. At BF16 the weights alone are 40.9 GB, which is why a 16 GB card needs either GGUF quantization or FP8 weights plus CPU offload; this recipe takes the GGUF route.
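As a back-of-the-envelope check, raw weight size is parameter count times bits per weight, divided by eight. The sketch below assumes a round 20e9 parameters and ignores file metadata and any tensors stored at higher precision, so it lands near, not exactly on, the published 40.9 GB (BF16) and 20.4 GB (FP8) figures.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 20e9  # assumption: the round "20B" figure from the README

# Compare the precisions discussed in this recipe.
for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_size_gb(PARAMS, bits):5.1f} GB")
```

The 4-bit line comes out below the 12.1 GB Q4_K_S file, partly because K-quants also store per-block scale factors and conversions typically leave some tensors at higher precision.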

Installation

1. Update ComfyUI

Pull the latest ComfyUI build. Qwen-Image support in core ComfyUI is recent — the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files, so install city96's loader. From the directory that contains your ComfyUI folder:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users run the same clone from the portable root and install the requirements with the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
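To confirm step 2 took, the sketch below checks that the gguf package is importable and that the node folder exists. The check_gguf_install helper is hypothetical (it is not part of ComfyUI-GGUF) and assumes the default ComfyUI/custom_nodes layout. Run it with the same interpreter ComfyUI uses (on the portable build, python_embeded\python.exe), since that is the environment that must see gguf.

```python
"""Sanity-check the ComfyUI-GGUF install (hypothetical helper, not shipped with the node)."""
import importlib.util
import sys
from pathlib import Path


def check_gguf_install(comfyui_root: Path) -> dict[str, bool]:
    """Report whether the gguf package is importable and the node is cloned."""
    node_dir = comfyui_root / "custom_nodes" / "ComfyUI-GGUF"
    return {
        # find_spec returns None when the package is not installed
        "gguf_package": importlib.util.find_spec("gguf") is not None,
        "node_cloned": node_dir.is_dir(),
    }


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("ComfyUI")
    for name, ok in check_gguf_install(root).items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```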

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a quant that fits in 16 GB with room to spare. The relevant rows of the model card's size table are:

Quant | Size on disk
Q4_0 | 11.9 GB
Q4_K_S | 12.1 GB
Q4_K_M | 13.1 GB
Q5_K_S | 14.1 GB
Q5_K_M | 14.9 GB
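Dividing each file size by the 40.9 GB BF16 size and scaling by 16 gives an approximate average number of stored bits per weight. This is a sketch under the assumption that the BF16 and GGUF files hold the same set of tensors; it is an estimate, not a figure from the model card.

```python
BF16_GB = 40.9  # BF16 weight size quoted in the Requirements section

# File sizes from the city96 table above, in GB.
QUANT_GB = {
    "Q4_0": 11.9,
    "Q4_K_S": 12.1,
    "Q4_K_M": 13.1,
    "Q5_K_S": 14.1,
    "Q5_K_M": 14.9,
}


def effective_bits(size_gb: float, bf16_gb: float = BF16_GB) -> float:
    """Average stored bits per weight, scaled from the 16-bit BF16 file."""
    return 16 * size_gb / bf16_gb


for name, size in QUANT_GB.items():
    print(f"{name:<7} {size:5.1f} GB  ~{effective_bits(size):.2f} bits/weight")
```

Q4_K_S works out to roughly 4.7 bits per weight, which is why it sits above a naive 4-bit estimate.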

For a 16 GB card, Q4_K_S (12.1 GB) is the sweet spot. The ComfyUI Wiki Qwen-Image guide reports that qwen-image-Q4_K_S.gguf peaks at "56% VRAM" on a 24 GB RTX 4090D (about 13.4 GB observed), which leaves a 16 GB card roughly 2.6 GB of margin. Q4_K_M (13.1 GB file) also fits if you want a slight quality bump; Q5 quants typically need text-encoder offload to avoid OOM.
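A rough headroom projection for the other quants can be built from that one measurement. The sketch below makes a crude, loudly labeled assumption: everything in the observed peak above the Q4_K_S file size is a fixed overhead that carries over unchanged to other quants and to a 16 GB card. Treat the output as an ordering, not a guarantee.

```python
CARD_GB = 16.0

# ComfyUI Wiki observation: Q4_K_S peaked at 56% of a 24 GB card.
OBSERVED_PEAK_GB = 0.56 * 24   # about 13.4 GB
OBSERVED_FILE_GB = 12.1        # Q4_K_S file size
# Crude assumption: the difference is a fixed overhead shared by all quants.
OVERHEAD_GB = OBSERVED_PEAK_GB - OBSERVED_FILE_GB

QUANTS = {"Q4_K_S": 12.1, "Q4_K_M": 13.1, "Q5_K_S": 14.1, "Q5_K_M": 14.9}


def projected_headroom(file_gb: float, card_gb: float = CARD_GB) -> float:
    """VRAM left after the projected peak; negative suggests an OOM."""
    return card_gb - (file_gb + OVERHEAD_GB)


for name, size in QUANTS.items():
    print(f"{name}: {projected_headroom(size):+.1f} GB")
```

Under this assumption Q5_K_S keeps only about half a gigabyte and Q5_K_M goes slightly negative, which matches the advice to stay on Q4 or offload when trying Q5.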

Place the file at ComfyUI/models/diffusion_models/qwen-image-Q4_K_S.gguf (the destination is per the city96 model card).
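If you prefer a script to the browser, the sketch below fetches the file with the huggingface_hub library (pip install huggingface_hub), using its hf_hub_download function with local_dir pointed at the destination above. Using huggingface_hub is our assumption, not a step from the model card, and the repo file name is assumed to match the destination name; change FILENAME to fetch a different quant.

```python
"""Fetch the Q4_K_S GGUF into ComfyUI's diffusion_models folder (sketch)."""
from pathlib import Path

REPO_ID = "city96/Qwen-Image-gguf"
FILENAME = "qwen-image-Q4_K_S.gguf"  # assumed to match the repo's file name
DEST_DIR = Path("ComfyUI") / "models" / "diffusion_models"


def download_quant(dest_dir: Path = DEST_DIR, filename: str = FILENAME) -> Path:
    """Download one GGUF file with huggingface_hub and return its local path."""
    # Imported lazily so this module still loads where huggingface_hub is absent.
    from huggingface_hub import hf_hub_download

    dest_dir.mkdir(parents=True, exist_ok=True)
    local = hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=dest_dir)
    return Path(local)


if __name__ == "__main__":
    print(download_quant())
```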

4. Download the text encoder and VAE

These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial:

ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
ComfyUI/models/vae/qwen_image_vae.safetensors

Both download from the official Comfy-Org HuggingFace mirror linked from the tutorial.

5. (Recommended) Download the Lightning LoRA

To cut per-image latency from roughly 7 minutes to about 2 minutes on a 16 GB-class card (see the timings under Results), install the official Lightning LoRA:

ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

The 8-step variant is the ComfyUI tutorial's recommended acceleration LoRA for this model.
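Before loading the workflow, a quick preflight can confirm that every file from steps 3 to 5 landed where ComfyUI looks for it. The missing_files helper and its file list are a sketch built from the paths in this recipe; adjust the diffusion-model name if you picked a different quant.

```python
"""Check that every file this recipe expects is in place (sketch)."""
import sys
from pathlib import Path

# Paths relative to the ComfyUI folder, as listed in steps 3-5.
EXPECTED = {
    "diffusion model": "models/diffusion_models/qwen-image-Q4_K_S.gguf",
    "text encoder": "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "VAE": "models/vae/qwen_image_vae.safetensors",
    "Lightning LoRA (optional)": "models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors",
}


def missing_files(comfyui_dir: Path) -> list[str]:
    """Return the labels of expected files that do not exist."""
    return [label for label, rel in EXPECTED.items()
            if not (comfyui_dir / rel).is_file()]


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("ComfyUI")
    gaps = missing_files(root)
    print("all files present" if not gaps else "missing: " + ", ".join(gaps))
```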

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.
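To see which node types a workflow file expects before you load it, you can count the entries in its top-level "nodes" list. This sketch assumes the UI-format workflow layout (a "nodes" list whose items carry a "type" field) and assumes "UnetLoaderGGUF" is the internal type name behind the "Unet Loader (GGUF)" display name; verify both against your files.

```python
"""List node types in a ComfyUI workflow file and flag the GGUF loader (sketch)."""
import json
import sys
from collections import Counter

# Assumption: internal type name of the "Unet Loader (GGUF)" node.
GGUF_LOADER = "UnetLoaderGGUF"


def node_types(workflow: dict) -> Counter:
    """Count node types in a UI-format workflow (top-level "nodes" list)."""
    return Counter(node.get("type", "?") for node in workflow.get("nodes", []))


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "qwen-image_workflow.json"
    with open(path, encoding="utf-8") as f:
        types = node_types(json.load(f))
    for name, count in sorted(types.items()):
        print(f"{count} x {name}")
    if GGUF_LOADER not in types:
        print("this workflow does not use the GGUF loader")
```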

Running

Launch ComfyUI from its root directory (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the safetensors / GGUF load into VRAM; subsequent runs reuse the in-memory model.
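For scripted runs, ComfyUI also accepts jobs over HTTP at its /prompt endpoint. The sketch below assumes you have exported the workflow in ComfyUI's API format (the UI-format JSON from the model card is a different layout) and that the server runs at the address used above; build_prompt_request is a hypothetical helper, and only the standard library is used.

```python
"""Queue an API-format workflow on a local ComfyUI server (sketch)."""
import json
import sys
import urllib.request

SERVER = "http://127.0.0.1:8188"  # matches the --listen/--port flags above


def build_prompt_request(graph: dict, server: str = SERVER) -> urllib.request.Request:
    """Wrap an API-format graph in the JSON body that POST /prompt expects."""
    body = json.dumps({"prompt": graph}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        graph = json.load(f)
    with urllib.request.urlopen(build_prompt_request(graph)) as resp:
        print(json.loads(resp.read()))  # the reply includes a prompt_id
```

Editing the prompt text node in the exported graph before sending lets you batch prompts without touching the canvas.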

Results

  • Speed: the closest 16 GB-class reference card we have a citation for is an NVIDIA RTX A4000 (16 GB) running a 4-bit GGUF, per sandner.art's local-generation writeup. There, a ~1 MP image takes approximately 7 minutes at 50 steps, ~2 minutes at 8 steps (Lightning), and ~1 minute at 4 steps (Lightning). Expect figures of the same order of magnitude on the 5060 Ti; check /check/qwen-image/rtx-5060-ti for empirical 5060 Ti numbers once a community benchmark lands.
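  • VRAM usage: Q4_K_S diffusion weights are 12.1 GB on disk (city96 file table); at runtime the ComfyUI Wiki guide measures Q4_K_S at ~13.4 GB peak on a 24 GB card (56% utilization), which leaves roughly 2.6 GB on a 16 GB card for the VAE and activations. The Qwen2.5-VL-7B text encoder needs about 7 GB or more on its own at FP8 (one byte per weight), so it cannot sit in VRAM alongside the diffusion weights; expect it to be offloaded to system RAM, which is what the 32 GB RAM recommendation covers. See /check/qwen-image/rtx-5060-ti.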
  • Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README); 4-bit quants are widely reported to retain most of that capability, with Q5/Q6 giving incremental quality at proportional VRAM cost.
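The speed bullet's timings translate into speedups over the 50-step baseline as follows. The numbers are the approximate A4000 figures quoted above, not new measurements.

```python
def speedup(baseline_min: float, minutes: float) -> float:
    """How many times faster a run is than the baseline."""
    return baseline_min / minutes


# Approximate RTX A4000 timings quoted above, minutes per ~1 MP image.
TIMINGS_MIN = {
    "50 steps (vanilla)": 7.0,
    "8 steps (Lightning)": 2.0,
    "4 steps (Lightning)": 1.0,
}
BASELINE = TIMINGS_MIN["50 steps (vanilla)"]

for label, minutes in TIMINGS_MIN.items():
    print(f"{label:<20} {minutes:4.1f} min  {speedup(BASELINE, minutes):4.1f}x")
```

The 8-step LoRA works out to about 3.5x faster than the 50-step baseline; only the 4-step LoRA reaches about 7x.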

For the full benchmark data, see /check/qwen-image/rtx-5060-ti.

Troubleshooting

qwen_image_fp8_e4m3fn.safetensors OOMs on load

The FP8 build is 20.4 GB (ComfyUI tutorial) — larger than a 16 GB card's total VRAM. Switch to the GGUF path described above; the FP8 build is for 24 GB+ cards only.

Q5/Q6 GGUF fits on disk but OOMs at runtime

Q5_K_S is 14.1 GB on disk (city96 table), which leaves under 2 GB of a 16 GB card for the VAE and activations, so peak VRAM can exceed 16 GB. Drop to Q4_K_S/Q4_K_M, or enable text-encoder CPU offload in ComfyUI.

Lightning LoRA output is over-smoothed or blurry

Per sandner.art: "When using Lightning with GGUF models, experiment with steps … Try the double of steps (7 or more steps for 4-step lora)." The extra steps presumably offset some of the precision lost to 4-bit quantization.

Per-image time is still around 7 minutes

You are probably running the 50-step schedule without the Lightning LoRA. The same sandner.art write-up measures roughly a 3.5× speedup from 50-step vanilla to 8-step Lightning on the same 16 GB card (about 7 minutes down to 2), and about 7× with the 4-step LoRA. If Lightning is loaded but speed hasn't improved, check that the LoRA node is wired into the model path feeding the sampler and that the sampler's step count was actually lowered to match the LoRA.

Unet Loader (GGUF) node missing

You skipped step 2. Install ComfyUI-GGUF — the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.