Qwen-Image on RTX 4080: 20B Text-to-Image via ComfyUI GGUF (Ada sm_89, 16 GB)

What You'll Build

A local Qwen-Image text-to-image setup on a 16 GB RTX 4080. Qwen-Image is a 20B-parameter MMDiT image foundation model from Alibaba's Tongyi Lab, released August 4, 2025 under Apache 2.0. Its official FP8 build ships as a 20.4 GB diffusion weight (per the ComfyUI native tutorial) and will not fit on a 16 GB card. This recipe uses city96's GGUF redistribution (Q4_K_S, 12.1 GB on disk) loaded through the ComfyUI-GGUF custom node, which leaves roughly 4 GB of the 4080's 16 GB for the VAE and activations during sampling. The 9.4 GB Qwen2.5-VL-7B text encoder cannot share VRAM with the diffusion weights at that size, so expect ComfyUI to load the two in sequence rather than hold both resident.

Hardware data: RTX 4080 (Ada Lovelace, sm_89, 16 GB VRAM) · 20B-parameter MMDiT at 4-bit · Benchmark data: /check/qwen-image/rtx-4080

⚠️ Known issue: the official qwen_image_fp8_e4m3fn.safetensors build is 20.4 GB (ComfyUI native docs) and does not fit in 16 GB of VRAM. Use the GGUF path below, not the native FP8 workflow. See the FP8-on-Ada note in Troubleshooting for why "FP8 is native on the 4080" does not rescue the official FP8 build here.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (NVIDIA, CUDA-capable)	RTX 4080 (Ada, sm_89, 16GB)
RAM	32 GB system RAM recommended for text-encoder offload	—
Storage	~14 GB for diffusion model, ~25 GB total with encoder + VAE + Lightning LoRA	—
Software	ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node	—

The 20B parameter count is stated explicitly on the Qwen-Image GitHub README ("a 20B MMDiT image foundation model") and the city96 GGUF card. At BF16 the weights alone are 40.9 GB per city96's file-size table, which is why a 16 GB card requires GGUF quantization (or FP8 plus CPU offload).
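The file sizes quoted in this guide can be turned into an average bits-per-weight figure, which makes the gap between the builds easier to see. This is a back-of-envelope sketch only: it assumes exactly 20e9 parameters and decimal gigabytes, and it ignores file metadata and any tensors kept at higher precision.

```python
# Back-of-envelope bits-per-weight from the file sizes quoted in this guide.
# Assumptions: exactly 20e9 parameters, 1 GB = 1e9 bytes, no metadata overhead.
PARAMS = 20e9


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average number of bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / params


sizes_gb = {"BF16": 40.9, "FP8 (official)": 20.4, "Q4_K_S": 12.1}
for name, size in sizes_gb.items():
    print(f"{name:>15}: {size:5.1f} GB ~ {bits_per_weight(size):.2f} bits/weight")
```

Under those assumptions Q4_K_S works out to a little under 5 bits per weight, which is why it lands at 12.1 GB rather than the 10 GB a pure 4-bit encoding would give.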

Installation

1. Update ComfyUI

Pull the latest ComfyUI build. Qwen-Image support in core ComfyUI is recent — the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

Unlike Blackwell GPUs (sm_120), the RTX 4080 needs no special CUDA wheel selection: any current CUDA-enabled PyTorch build runs on sm_89, and standard ComfyUI installs work out of the box on Ada Lovelace cards.
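To confirm that the Python environment ComfyUI uses actually sees the card as sm_89, a quick check along these lines can help. It is a sketch, not part of ComfyUI: describe_capability and detect are illustrative helper names, and the script only reports what torch.cuda.get_device_capability returns for device 0.

```python
def describe_capability(capability: tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into a short status line."""
    major, minor = capability
    arch = f"sm_{major}{minor}"
    if (major, minor) == (8, 9):
        return f"{arch}: Ada Lovelace, matches this guide"
    return f"{arch}: not the sm_89 target this guide was written for"


def detect() -> str:
    """Report the compute capability of CUDA device 0, if PyTorch can see one."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch reports no CUDA device (CPU-only build or driver issue)"
    return describe_capability(torch.cuda.get_device_capability(0))


print(detect())
```

Run it with the same interpreter that launches ComfyUI (for the Windows portable build, the embedded python.exe), since a system Python may have a different PyTorch install.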

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. From the directory that contains your ComfyUI folder:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
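A small sanity check after the install can catch the two common slips: the gguf package landing in a different Python environment, or the clone ending up in the wrong folder. The sketch below assumes it is run from the directory that contains the ComfyUI folder; gguf_installed and node_cloned are illustrative names.

```python
import importlib.util
from pathlib import Path


def gguf_installed() -> bool:
    """True if the gguf package is importable by this interpreter."""
    return importlib.util.find_spec("gguf") is not None


def node_cloned(base: Path) -> bool:
    """True if the ComfyUI-GGUF folder exists under base/ComfyUI/custom_nodes."""
    return (base / "ComfyUI" / "custom_nodes" / "ComfyUI-GGUF").is_dir()


print("gguf package importable:", gguf_installed())
print("ComfyUI-GGUF folder present:", node_cloned(Path.cwd()))
```

If the first line prints False while pip reported success, the package was most likely installed into a different interpreter than the one running this script.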

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a quant that fits 16 GB with overhead. The size table (from the model card, verified against the repo's current file tree) is:

Quant	Size on disk
Q4_0	11.9 GB
Q4_K_S	12.1 GB
Q4_K_M	13.1 GB
Q5_K_S	14.1 GB
Q5_K_M	14.9 GB

For a 16 GB card, Q4_K_S (12.1 GB) is the sweet spot: it leaves about 4 GB for the VAE and activations while the diffusion model is resident. Q4_K_M (13.1 GB file) also fits if you want a slight quality bump. Q5 quants leave under 2 GB and typically need text-encoder offload to clear OOM. At any of these sizes the 9.4 GB Qwen2.5-VL-7B text encoder cannot sit in VRAM alongside the diffusion weights, so it has to be unloaded or offloaded before sampling starts.

# run from the directory that contains the ComfyUI folder
wget -O ComfyUI/models/diffusion_models/qwen-image-Q4_K_S.gguf \
  https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q4_K_S.gguf

The destination directory (ComfyUI/models/diffusion_models/) is per the city96 model card.

4. Download the text encoder and VAE

These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial, they come from the official Comfy-Org HuggingFace repackage:

wget -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
wget -O ComfyUI/models/vae/qwen_image_vae.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors

The FP8-scaled text encoder is 9.4 GB and the VAE is 0.25 GB (Comfy-Org repackage file tree).

5. (Recommended) Download the Lightning LoRA

To reduce per-image latency by sampling in 8 steps, download the Lightning LoRA and place it at:

ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

The 8-step variant is the ComfyUI tutorial's recommended acceleration LoRA for this model.
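With four files spread across four folders, a quick presence check saves a failed first run. This sketch uses the file names and destinations from steps 3 to 5 and assumes it is run from the directory that contains the ComfyUI folder; missing_files is an illustrative helper.

```python
from pathlib import Path

# Paths relative to the ComfyUI folder, as used in steps 3 to 5 of this guide.
EXPECTED = [
    "models/diffusion_models/qwen-image-Q4_K_S.gguf",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
    "models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors",
]


def missing_files(comfy_dir: Path) -> list[str]:
    """Return the expected model files that are not present under comfy_dir."""
    return [rel for rel in EXPECTED if not (comfy_dir / rel).is_file()]


for rel in missing_files(Path("ComfyUI")):
    print("missing:", rel)
```

No output means all four files are where the workflow expects them. The check does not verify file integrity, so a truncated download will still pass.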

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.
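Before loading the workflow, you can list which node types it references to see whether any custom nodes are still missing. The sketch assumes the JSON is in ComfyUI's UI save format, with a top-level "nodes" list whose entries carry a "type" field; the sample dictionary is a stand-in, not the contents of the city96 file.

```python
import json
from collections import Counter


def load_workflow(path: str) -> dict:
    """Read a workflow JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def node_types(workflow: dict) -> Counter:
    """Count node types in a UI-format workflow (assumed layout: workflow['nodes'])."""
    return Counter(node.get("type", "?") for node in workflow.get("nodes", []))


# Stand-in data for illustration; replace with load_workflow("<your file>.json").
sample = {"nodes": [{"type": "UnetLoaderGGUF"}, {"type": "VAELoader"}, {"type": "KSampler"}]}
print(node_types(sample))
```

If the real workflow lists a GGUF loader type and ComfyUI flags it as missing on load, the custom node from step 2 is not installed or did not import cleanly.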

Running

Launch ComfyUI from inside the ComfyUI folder, where main.py lives (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
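Generations can also be queued without the browser by posting to the server's /prompt endpoint. The sketch below only builds the request; it assumes the workflow has been exported in ComfyUI's API format (the UI save format from step 6 will not work here), and build_prompt_request is an illustrative helper name.

```python
import json
import urllib.request


def build_prompt_request(
    workflow_api: dict, host: str = "127.0.0.1", port: int = 8188
) -> urllib.request.Request:
    """Build a POST request that queues an API-format workflow on ComfyUI."""
    body = json.dumps({"prompt": workflow_api}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually queue it while the server is running:
#   with open("workflow_api.json", encoding="utf-8") as fh:   # hypothetical file name
#       urllib.request.urlopen(build_prompt_request(json.load(fh)))
```

The host and port defaults match the launch command above; change them if you start ComfyUI with different --listen or --port values.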

Results

Speed: no RTX 4080-specific benchmark that unambiguously identifies the model variant was located for this draft, so no speed figure is quoted. The 4080's ~717 GB/s memory bandwidth sits well above a 4060 Ti (~288 GB/s) and below a 4090 (~1008 GB/s), and the cards differ in compute as well, so neither sibling's number transfers cleanly; extrapolating from a different Ada card would misrepresent this one. Check /check/qwen-image/rtx-4080 for an empirical number once a community submission seeds the benchmark, and please contribute your own measurement if you run this setup.
VRAM usage: Q4_K_S diffusion weights are 12.1 GB on disk per the city96 file table. During sampling the planned envelope is ~13 GB, the same floor documented for the other 16 GB-class GGUF builds of this model. The 9.4 GB Qwen2.5-VL-7B text encoder does not fit alongside the diffusion weights, so that figure assumes the encoder has been unloaded after prompt encoding. See /check/qwen-image/rtx-4080 for the live measurement once a community benchmark lands.
Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README); 4-bit quants are widely reported to retain most of that capability, with Q5/Q6 giving incremental quality at proportional VRAM cost.

For the full benchmark data, see /check/qwen-image/rtx-4080.
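If you want to contribute a VRAM measurement, polling nvidia-smi during a generation is one low-effort way to capture it. The sketch below is an assumption about how you might collect the number, not the benchmark's official method; it reads the first GPU only, and nvidia-smi reports these values in MiB.

```python
import subprocess

# Query flags supported by nvidia-smi; values come back in MiB with nounits.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line from nvidia-smi into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_gpu_memory() -> tuple[int, int]:
    """Run nvidia-smi and return (used_mib, total_mib) for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out.splitlines()[0])
```

Call read_gpu_memory() in a loop while a generation runs and keep the maximum used value; a single reading taken after the run finishes will understate the peak.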

Troubleshooting

"FP8 is native on the 4080 — why not just run the FP8 build?"

The RTX 4080's Ada Lovelace 4th-gen tensor cores do support FP8 (E4M3/E5M2) compute natively, so FP8 is genuinely hardware-accelerated here, unlike on Ampere cards. But that does not help with the official build: qwen_image_fp8_e4m3fn.safetensors is 20.4 GB (ComfyUI tutorial), larger than the 4080's entire 16 GB of VRAM, so the diffusion weights cannot stay resident and the load either fails or spills to system RAM before any FP8 acceleration can pay off. FP8 acceleration is a compute property; it does not shrink a 20.4 GB file to fit a 16 GB card. The native FP8 path is for 24 GB+ cards; on a 16 GB 4080 use the GGUF path above.

Q5/Q6 GGUF fits on disk but OOMs at runtime

Q5_K_S is 14.1 GB on disk (city96 table); with the Qwen2.5-VL-7B text encoder and VAE also resident, peak VRAM can exceed 16 GB. Drop to Q4_K_S / Q4_K_M, or enable text-encoder CPU offload in ComfyUI.
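The arithmetic behind this failure is easy to reproduce from the file sizes quoted in this guide. The sketch treats file size as a rough proxy for resident VRAM and leaves activations out entirely, so the real peak would be higher than the figure it prints.

```python
# Sizes in GB taken from this guide; file size is a rough proxy for VRAM use.
TEXT_ENCODER_GB = 9.4
VAE_GB = 0.25


def all_resident_gb(diffusion_gb: float) -> float:
    """Weights-only total if diffusion model, text encoder, and VAE sat in VRAM at once."""
    return diffusion_gb + TEXT_ENCODER_GB + VAE_GB


def fits(diffusion_gb: float, vram_gb: float = 16.0) -> bool:
    """True if the weights-only total stays within vram_gb."""
    return all_resident_gb(diffusion_gb) <= vram_gb


print(f"Q5_K_S with everything resident: {all_resident_gb(14.1):.2f} GB")
print("fits in 16 GB:", fits(14.1))
```

The total is well past 16 GB, which is why keeping the text encoder out of VRAM during sampling matters more as the diffusion quant grows.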

`Unet Loader (GGUF)` node missing

You skipped step 2. Install ComfyUI-GGUF — the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.

Generation produces a black image

Reported in city96/Qwen-Image-gguf discussions. Most common cause is a mismatched or missing text encoder file — double-check qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and is the FP8-scaled variant linked from the official ComfyUI tutorial.

Generation is extremely slow (minutes per image)

If you skipped the GGUF path and ran a full-precision build on a card with too little VRAM, ComfyUI (or diffusers) will spill the transformer to system RAM and stream it back on every step. One community user reports ~20 minutes per image at 40 steps on a 24 GB-class card under this condition (QwenLM/Qwen-Image issue #260, community report, no maintainer response yet). On a 16 GB 4080 the fix is to remove the cause: use the Q4_K_S GGUF weights from step 3 so the diffusion model stays resident, and add the 8-step Lightning LoRA from step 5. If Lightning is loaded but generation is still slow, confirm the LoRA node is wired into the sampler chain rather than disconnected.
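Checking the LoRA wiring can be done on the saved workflow JSON as well as by eye. The sketch below assumes the UI save format, where each node has an "outputs" list and each output has a "links" entry that is empty or null when nothing is connected. It matches any node whose type contains "lora", which is a heuristic, and it does not detect nodes that are connected but bypassed or muted.

```python
def unwired_lora_nodes(workflow: dict) -> list[str]:
    """List LoRA-like nodes in a UI-format workflow whose outputs feed nothing."""
    unwired = []
    for node in workflow.get("nodes", []):
        if "lora" not in str(node.get("type", "")).lower():
            continue  # heuristic: only inspect nodes with 'lora' in the type name
        outputs = node.get("outputs") or []
        connected = any(out.get("links") for out in outputs)
        if not connected:
            unwired.append(f'{node.get("type")} (id {node.get("id")})')
    return unwired


# Stand-in example: one LoRA loader with no outgoing links.
sample = {"nodes": [{"id": 2, "type": "LoraLoaderModelOnly", "outputs": [{"links": None}]}]}
print(unwired_lora_nodes(sample))
```

An empty list means every LoRA-like node has at least one outgoing connection; it does not prove the connection reaches the sampler, so a visual check of the chain is still worthwhile.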