Qwen-Image on RTX 5080: 20B Text-to-Image via ComfyUI GGUF (Blackwell sm_120, 16 GB)

What You'll Build

A local Qwen-Image text-to-image setup on a 16 GB RTX 5080 (Blackwell, sm_120). Qwen-Image is Alibaba Tongyi Lab's 20B-parameter MMDiT image foundation model, released August 4, 2025 under Apache 2.0; the GitHub README describes it as "a 20B MMDiT image foundation model that achieves significant advances in complex text rendering and precise image editing". The official FP8 build (qwen_image_fp8_e4m3fn.safetensors, 20.43 GB on the Comfy-Org repackager file listing) is larger than the card's 16 GB on its own, before its 9.38 GB FP8 text encoder is counted. This recipe instead uses city96's GGUF redistribution (Q4_K_M, 13.07 GB on disk) loaded through the ComfyUI-GGUF custom node. On paper that leaves roughly 3 GB for the VAE and activations, provided the 9.38 GB Qwen2.5-VL-7B text encoder is offloaded after prompt encoding, which ComfyUI's default memory management handles.

Hardware data: RTX 5080 (16 GB GDDR7 VRAM, Blackwell, sm_120) · 20B-parameter MMDiT at 4-bit GGUF · Benchmark data: /check/qwen-image/rtx-5080

⚠️ Why GGUF and not the FP8 native path on this card? The RTX 5080 is Blackwell (sm_120) and has native FP8 tensor cores, so FP8 is not a compute dead-end here as it is on the RTX 3090 (Ampere) covered in the sibling recipe. The blocker on the 5080 is purely capacity: the FP8 diffusion build is 20.43 GB on disk (Comfy-Org file listing) and peaks at ~86% of a 24 GB card (~20.6 GB) per the ComfyUI-Wiki Qwen-Image guide. Both figures exceed the 5080's 16 GB before the 9.38 GB text encoder is even counted. The native FP8 path is what the RTX 5090 (32 GB) sibling recipe installs; on 16 GB, GGUF is the path that fits.
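The capacity argument above is plain arithmetic, and a short sketch makes it easy to re-run with other cards or file sizes. All figures are the ones quoted in this recipe; the helper name `shortfall_gb` is illustrative, and disk size is used only as a rough proxy for resident VRAM.

```python
# Figures quoted in this recipe. File size on disk is only a rough
# proxy for VRAM use; activations and buffers come on top.
CARD_VRAM_GB = 16.0
FP8_DIFFUSION_GB = 20.43      # qwen_image_fp8_e4m3fn.safetensors
FP8_TEXT_ENCODER_GB = 9.38    # qwen_2.5_vl_7b_fp8_scaled.safetensors
REFERENCE_CARD_GB = 24.0      # card used for the ComfyUI-Wiki measurement
FP8_PEAK_FRACTION = 0.86      # ~86% peak VRAM reported on that card


def shortfall_gb(required_gb: float, available_gb: float) -> float:
    """Return how many GB are missing, or 0.0 if the requirement fits."""
    return max(0.0, required_gb - available_gb)


# Peak VRAM implied by the 86% figure on the 24 GB reference card.
fp8_peak_gb = FP8_PEAK_FRACTION * REFERENCE_CARD_GB

if __name__ == "__main__":
    print(f"FP8 peak on the reference card: ~{fp8_peak_gb:.1f} GB")
    print(f"FP8 weights vs 16 GB: short by "
          f"{shortfall_gb(FP8_DIFFUSION_GB, CARD_VRAM_GB):.2f} GB")
    print(f"FP8 weights + encoder vs 16 GB: short by "
          f"{shortfall_gb(FP8_DIFFUSION_GB + FP8_TEXT_ENCODER_GB, CARD_VRAM_GB):.2f} GB")
```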

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (NVIDIA, CUDA-capable)	RTX 5080 (16 GB GDDR7, Blackwell, sm_120)
RAM	32 GB system RAM recommended for text-encoder offload	—
Storage	~13 GB for Q4_K_M diffusion weights, ~25 GB total with encoder + VAE + Lightning LoRA	—
Software	ComfyUI (current build), Python 3.10+ (cu128 PyTorch wheel), ComfyUI-GGUF custom node	—

The 20B parameter count is stated explicitly in the Qwen-Image GitHub README ("20B MMDiT image foundation model") and the city96 GGUF card. At BF16 the diffusion weights alone are 40.87 GB (qwen-image-BF16.gguf, verified via the HF tree API on 2026-05-28), and even the FP8 build is 20.43 GB, which is why a 16 GB card needs a quant below 8 bits, such as the 4-bit GGUF used here.
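As a rough cross-check of those sizes, the average bits per weight of any build can be estimated by scaling its file size against the 16-bit BF16 file. This is a sketch under two assumptions: container metadata is negligible, and the result is an average, since quantized files may keep some tensors at higher precision.

```python
BF16_BITS = 16
BF16_GB = 40.87  # qwen-image-BF16.gguf, as listed in this recipe


def effective_bits_per_weight(file_gb: float, bf16_gb: float = BF16_GB) -> float:
    """Average bits per weight implied by a file size, relative to BF16."""
    return BF16_BITS * file_gb / bf16_gb


if __name__ == "__main__":
    for name, size_gb in [("FP8", 20.43), ("Q4_K_M", 13.07), ("Q4_K_S", 12.14)]:
        print(f"{name}: ~{effective_bits_per_weight(size_gb):.2f} bits/weight")
```

By this estimate the FP8 build lands at about 8 bits per weight and Q4_K_M at a little over 5, which is why the nominally 4-bit quant is larger than a quarter of the BF16 file.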

Installation

1. Update ComfyUI and install the cu128 PyTorch wheel

Pull the latest ComfyUI build. Qwen-Image has had native ComfyUI support since August 5, 2025, and the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

The RTX 5080 is Blackwell (sm_120). PyTorch wheels built against CUDA 12.8 (cu128) ship sm_120 kernels — if your ComfyUI Python environment defaults to an older cu121/cu124 wheel, the model will fail at the first inference call. Install the cu128 wheel:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
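Before moving on, it can help to confirm that the installed wheel actually carries sm_120 kernels. The sketch below is one way to check, run with the same Python interpreter ComfyUI uses; the function names are illustrative, while `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are standard PyTorch calls.

```python
def has_sm120_kernels(arch_list):
    """True if a PyTorch arch list (e.g. ['sm_90', 'sm_120']) covers sm_120."""
    return "sm_120" in arch_list


def describe_environment():
    """Inspect the installed PyTorch build and return a short status string."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} cannot see a CUDA device"
    # An RTX 5080 should report capability (12, 0), i.e. sm_120.
    major, minor = torch.cuda.get_device_capability(0)
    ok = has_sm120_kernels(torch.cuda.get_arch_list())
    return (f"PyTorch {torch.__version__} (CUDA {torch.version.cuda}), "
            f"device capability sm_{major}{minor}, sm_120 kernels: {ok}")


if __name__ == "__main__":
    print(describe_environment())
```

If the last field reads `False` on a 5080, the environment is most likely still on an older wheel and the install command above needs to be re-run in that environment.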

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. Run these from the directory that contains your ComfyUI checkout (the paths below start with ComfyUI/):

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users: open a command prompt in the ComfyUI_windows_portable folder and substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a quant that leaves room within 16 GB for the VAE and activations once the Qwen2.5-VL-7B text encoder is offloaded. The per-quant file sizes (verified via the HF tree API on 2026-05-28) are:

Quant	Size on disk
Q2_K	7.06 GB
Q3_K_S	8.95 GB
Q3_K_M	9.68 GB
Q4_0	11.85 GB
Q4_K_S	12.14 GB
Q4_K_M	13.07 GB
Q5_K_S	14.12 GB
Q5_K_M	14.93 GB
Q6_K	16.82 GB
Q8_0	21.76 GB
BF16	40.87 GB

For a 16 GB card, Q4_K_M (13.07 GB) is the recommended starting point. The ComfyUI-Wiki Qwen-Image guide reports that the neighbouring qwen-image-Q4_K_S.gguf peaks at 56% VRAM on a 24 GB RTX 4090D (~13.4 GB observed at runtime), which leaves a 16 GB card a workable margin once the text encoder is offloaded. Q4_K_S (12.14 GB) is the lower-VRAM fallback. Q5_K_S and Q5_K_M (14.12 GB and 14.93 GB) leave little room for activations even with the text encoder on the CPU, and Q6_K (16.82 GB) exceeds 16 GB on weights alone, so it cannot stay fully resident.
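The selection logic can be sketched as "largest file that fits under a budget". The sizes are the ones from the table above; the 2.5 GB reserve for the VAE and activations is an assumed placeholder, not a measured figure, and `largest_quant_within` is an illustrative helper name.

```python
# On-disk sizes from the city96 file table quoted in this recipe (GB).
QUANT_SIZES_GB = {
    "Q2_K": 7.06, "Q3_K_S": 8.95, "Q3_K_M": 9.68, "Q4_0": 11.85,
    "Q4_K_S": 12.14, "Q4_K_M": 13.07, "Q5_K_S": 14.12, "Q5_K_M": 14.93,
    "Q6_K": 16.82, "Q8_0": 21.76, "BF16": 40.87,
}


def largest_quant_within(budget_gb, sizes=QUANT_SIZES_GB):
    """Return the name of the largest quant whose file fits the budget, or None."""
    fitting = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    card_gb = 16.0
    reserve_gb = 2.5  # ASSUMPTION: headroom for VAE + activations, not measured
    print(largest_quant_within(card_gb - reserve_gb))
```

With that assumed reserve the helper lands on Q4_K_M; shrinking the reserve toward zero moves it up to Q5_K_M, which is where runtime OOMs become likely.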

# From the directory containing ComfyUI/, into ComfyUI/models/diffusion_models/
wget https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q4_K_M.gguf \
  -O ComfyUI/models/diffusion_models/qwen-image-Q4_K_M.gguf

The destination is per the city96 model card.

Avoid IQ-quants: per city96's own response on issue #255, "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has 'Q' instead of 'IQ' in the name)."

4. Download the text encoder and VAE

These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial:

# Text encoder (FP8 scaled, 9.38 GB) → ComfyUI/models/text_encoders/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors

# VAE (0.25 GB) → ComfyUI/models/vae/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
  -O ComfyUI/models/vae/qwen_image_vae.safetensors

Both file sizes are verified via the Comfy-Org repackager tree on 2026-05-28.
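With steps 2 through 4 done, a quick layout check catches misplaced files before the first launch. This is a minimal sketch: the relative paths are the destinations used in this recipe, the function name is illustrative, and the script only tests for existence, not file integrity.

```python
import sys
from pathlib import Path

# Paths relative to the ComfyUI root, as used in steps 2-4 of this recipe.
EXPECTED_PATHS = [
    "custom_nodes/ComfyUI-GGUF",
    "models/diffusion_models/qwen-image-Q4_K_M.gguf",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
]


def missing_paths(comfy_root):
    """Return the expected paths that do not exist under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    # Usage: python check_layout.py path/to/ComfyUI  (defaults to ./ComfyUI)
    root = sys.argv[1] if len(sys.argv) > 1 else "ComfyUI"
    missing = missing_paths(root)
    print("All expected files present" if not missing else f"Missing: {missing}")
```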

5. (Recommended) Download the Lightning LoRA

To cut per-image latency, install the 8-step Lightning LoRA from lightx2v, the acceleration LoRA the ComfyUI tutorial recommends for this model. On the RTX 4090D reference timings quoted under Results, it takes subsequent generations from ≈ 77 s to ≈ 45 s:

wget https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-8steps-V1.0.safetensors \
  -O ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.

Running

Launch ComfyUI from the ComfyUI root (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
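Generations can also be queued without the browser by POSTing to ComfyUI's `/prompt` HTTP endpoint. The sketch below assumes the workflow has been exported from the UI in API format; a workflow JSON meant for dragging onto the canvas is typically in the UI format and would not work as-is. The file name `workflow_api.json` and the helper name are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow, host="127.0.0.1", port=8188):
    """Build a POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # HYPOTHETICAL file name: an API-format export of the loaded workflow.
    with open("workflow_api.json", encoding="utf-8") as fh:
        workflow = json.load(fh)
    # Requires a running ComfyUI instance on the host/port above.
    with urllib.request.urlopen(build_queue_request(workflow)) as resp:
        print(json.load(resp))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.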

ComfyUI's Qwen-Image workflow uses PyTorch SDPA (scaled_dot_product_attention) by default, which runs on sm_120 with the cu128 wheel. FlashAttention-2 sm_120 kernels are still tracked as an open issue at Dao-AILab/flash-attention#2168. If you use a custom node or fork that explicitly enables FA2, fall back to SDPA on the 5080 until an upstream wheel ships sm_120 coverage.

Results

Speed: no first-party RTX 5080 Qwen-Image measurement has been published for the GGUF ComfyUI workflow yet, so we deliberately omit a 5080-specific number. The closest published timing is from the ComfyUI-Wiki guide on an RTX 4090D 24 GB (Ada Lovelace, a different architecture): with qwen-image-Q4_K_S.gguf, ≈ 135 s for the first generation and ≈ 77 s for subsequent ones, or ≈ 100 s and ≈ 45 s with the 8-step Lightning LoRA. The 5080 is one generation newer than the 4090D and has a different memory subsystem (960 GB/s, roughly 2× the 5060 Ti's 448 GB/s), so its real numbers will differ and we do not extrapolate them. Check /check/qwen-image/rtx-5080 for an empirical figure once a community benchmark lands, or contribute one via /contribute.
VRAM usage: Q4_K_M diffusion weights are 13.07 GB on disk per the city96 file table; at runtime the ComfyUI-Wiki guide measures the neighbouring Q4_K_S at ~56% of a 24 GB RTX 4090D (~13.4 GB observed), with Q4_K_M sitting incrementally higher on the same hardware class. On a 16 GB 5080 this fits with the text encoder offloaded by ComfyUI's default memory management. See /check/qwen-image/rtx-5080.
Quality notes: Qwen-Image is positioned as a strong text-rendering model in both English and Chinese (GitHub README); 4-bit quants are widely reported to retain most of that capability, with Q5/Q6 giving incremental quality at proportional VRAM cost. The 8-step Lightning LoRA trades a small amount of step-count flexibility for materially lower latency per image.
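For reference, the relative gain from the Lightning LoRA in those RTX 4090D timings works out as below. This is arithmetic on the published reference numbers only and says nothing about absolute 5080 performance.

```python
# RTX 4090D reference timings quoted above (seconds, qwen-image-Q4_K_S.gguf).
TIMINGS_S = {
    "base": {"first": 135.0, "subsequent": 77.0},
    "lightning_8step": {"first": 100.0, "subsequent": 45.0},
}


def speedup(run: str) -> float:
    """Ratio of base time to Lightning time for 'first' or 'subsequent' runs."""
    return TIMINGS_S["base"][run] / TIMINGS_S["lightning_8step"][run]


if __name__ == "__main__":
    for run in ("first", "subsequent"):
        print(f"{run}: {speedup(run):.2f}x faster with the 8-step LoRA")
```

The smaller gain on the first run is consistent with first-run latency being dominated by model loading rather than sampling steps.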

For the full benchmark data, see /check/qwen-image/rtx-5080.

Troubleshooting

`qwen_image_fp8_e4m3fn.safetensors` OOMs on load

The FP8 build is 20.43 GB on disk (Comfy-Org file listing) — larger than a 16 GB card's total VRAM, and that is before the 9.38 GB FP8 text encoder loads. The 5080's Blackwell sm_120 silicon can run FP8 natively, but the file simply does not fit 16 GB. Use the GGUF path described above; the FP8 build is what the RTX 5090 32 GB sibling installs.

"transformer dispatched to CPU" / ~20 minutes per image (Ampere-specific — not a 5080 failure)

Reported on the canonical repo at QwenLM/Qwen-Image#260 by a community user (no maintainer response as of 2026-05-28) running num_inference_steps=40 on an RTX 3090 and seeing repeated transformer-to-CPU offload messages. The report comes from the diffusers BF16 path: the full 40.87 GB of BF16 weights do not fit in 24 GB, so the transformer is offloaded to the CPU, and Ampere (sm_86) has no FP8 tensor cores, so the FP8 build dequantizes on the fly rather than offering a fast alternative. The ComfyUI-GGUF workflow in this recipe does not trigger it on the 5080: the Q4_K_M diffusion weights (13.07 GB) stay on the GPU with no per-step CPU swap. (The 5080's Blackwell FP8 tensor cores would run the FP8 transformer natively, but that build does not fit in 16 GB.) If you see this message on a 5080, you are almost certainly running the diffusers BF16 path instead of the ComfyUI-GGUF workflow above.

Q5/Q6 GGUF fits on disk but OOMs at runtime

Q5_K_S is 14.12 GB on disk and Q6_K is 16.82 GB (city96 file table). Q6_K exceeds 16 GB on weights alone, and Q5_K_S leaves under 2 GB for the VAE and activations even before the Qwen2.5-VL-7B text encoder (FP8-scaled, 9.38 GB) is counted. Drop to Q4_K_M or Q4_K_S; if you stay on a Q5 quant, make sure the text encoder is offloaded to the CPU in ComfyUI.

Model fails at the first inference call on a fresh install

The RTX 5080 needs the cu128 PyTorch wheel for sm_120 kernels. If you installed an older cu121/cu124 wheel, reinstall per step 1: pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128. This is a Blackwell-class requirement shared with the other RTX 50-series cards (5060 Ti, 5070, 5090).

`Unet Loader (GGUF)` node missing

The ComfyUI-GGUF custom node is not installed or did not load. Redo step 2 and restart ComfyUI; the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.

Generation produces a black image

Reported across city96/Qwen-Image-gguf discussions. The most common cause is a mismatched or missing text encoder file. Double-check that qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and is the FP8-scaled variant linked from the official ComfyUI tutorial (not the bare Qwen2.5-VL-7B BF16 weights, which are 16.58 GB).
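One quick way to tell the two encoder variants apart is by file size. The sketch below compares a file against the two sizes quoted in this recipe and reports the closer match; it assumes "GB" means 10^9 bytes, and it cannot detect a truncated or corrupted download. The helper name is illustrative.

```python
import sys
from pathlib import Path

# Sizes quoted in this recipe, in GB (ASSUMPTION: 1 GB = 10**9 bytes).
VARIANT_SIZES_GB = {"fp8_scaled": 9.38, "bf16": 16.58}


def closest_variant(size_bytes: int) -> str:
    """Return the variant whose quoted size is nearest to size_bytes."""
    size_gb = size_bytes / 1e9
    return min(VARIANT_SIZES_GB, key=lambda name: abs(VARIANT_SIZES_GB[name] - size_gb))


if __name__ == "__main__":
    # Usage: python check_encoder.py path/to/text_encoder.safetensors
    path = Path(sys.argv[1])
    print(f"{path.name}: closest match is {closest_variant(path.stat().st_size)}")
```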