Qwen-Image on RTX 4090: 20B Text-to-Image via ComfyUI FP8 (Native Path)

What You'll Build

A local Qwen-Image text-to-image setup on an RTX 4090 24GB using the canonical FP8 native path that the official ComfyUI Qwen-Image tutorial is built around. Qwen-Image is Alibaba's 20B-parameter MMDiT image foundation model from Tongyi Lab, released August 4, 2025 under Apache 2.0 (HF model card, Tech Report). On a 24 GB card the qwen_image_fp8_e4m3fn.safetensors weights (20.43 GB, from the Comfy-Org repackager) run with the Qwen2.5-VL-7B text encoder (FP8 scaled, 9.38 GB) and the VAE (0.25 GB). The three files total about 30 GB, so they are not all resident at once: ComfyUI's default memory management offloads the text encoder, and the ComfyUI Wiki Qwen-Image guide cites peak runtime VRAM of 86% of 24 GB (≈ 20.6 GB) on a 4090D.

Hardware data: RTX 4090 (24 GB VRAM, Ada Lovelace, sm_89) · 20B MMDiT at FP8 e4m3fn · See benchmark data

ℹ️ Variant note. This recipe pins the base Qwen-Image model (Qwen/Qwen-Image, the 2025-08-04 text-to-image foundation model). It is distinct from Qwen-Image-Edit and Qwen-Image-Edit-2509 (edit-focused siblings) and from the newer qwen_image_2512_* checkpoints (Comfy-Org repackager files) which are out of scope here.

ℹ️ Runtime pin. The 16 GB sibling recipe (Qwen-Image on RTX 4060 Ti 16GB) uses city96's Q4_K_S GGUF (12.1 GB on disk) because the native FP8 build (20.4 GB) does not fit a 16 GB card. The 4090's 24 GB unlocks the FP8 native path, which is the canonical ComfyUI workflow, so this recipe uses it instead. The Q8_0 GGUF (21.76 GB) is feasible on 24 GB, but it is larger than the native FP8 build (20.43 GB), so it frees no headroom, and it requires the GGUF loader stack; FP8 is the simpler default on this card.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (NVIDIA, CUDA-capable)	RTX 4090 (24 GB, Ada, sm_89)
RAM	32 GB system RAM recommended	—
Storage	~30 GB (20.4 GB FP8 weights + 9.4 GB FP8 text encoder + 0.25 GB VAE + optional LoRA)	—
Software	ComfyUI (current build), Python 3.10+	—

The 20B parameter count is stated explicitly on the Qwen-Image GitHub README and HF card. At BF16 the diffusion weights alone are 40.86 GB (Comfy-Org file listing) — well past 24 GB — which is why even a 4090 runs the FP8 build rather than full-precision weights.
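The arithmetic behind that choice can be sketched in a few lines of Python. The file sizes and the 24 GB capacity are the figures quoted in this recipe; treating on-disk size as a stand-in for resident VRAM is a simplifying assumption, since activations and framework overhead are ignored.

```python
# Rough VRAM budget from the on-disk sizes quoted in this recipe (GB).
# Assumption: on-disk size approximates resident size; activations are ignored.
VRAM_GB = 24.0

SIZES_GB = {
    "diffusion_fp8": 20.43,
    "diffusion_bf16": 40.86,
    "text_encoder_fp8": 9.38,
    "vae": 0.25,
}

def fits(*names: str, capacity_gb: float = VRAM_GB) -> bool:
    """True if the named files together are no larger than the card's VRAM."""
    return sum(SIZES_GB[n] for n in names) <= capacity_gb

print(fits("diffusion_bf16"))                            # BF16 weights alone
print(fits("diffusion_fp8", "vae"))                      # FP8 weights + VAE
print(fits("diffusion_fp8", "text_encoder_fp8", "vae"))  # everything resident at once
```

Only the middle case fits. The last one (about 30 GB) is why the text encoder cannot stay resident next to the FP8 diffusion weights on a 24 GB card.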

Installation

1. Update ComfyUI

Pull the latest ComfyUI build. Native Qwen-Image support is recent — the official ComfyUI Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

2. Download the FP8 diffusion weights

From the Comfy-Org Qwen-Image repackager, pull the FP8 e4m3fn build:

# Run from the directory that contains your ComfyUI checkout; saves into ComfyUI/models/diffusion_models/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/diffusion_models/qwen_image_fp8_e4m3fn.safetensors \
  -O ComfyUI/models/diffusion_models/qwen_image_fp8_e4m3fn.safetensors

File size: 20.43 GB (per the Comfy-Org file listing).

3. Download the text encoder and VAE

Same files as on the smaller-VRAM siblings — pull from the Comfy-Org repackager:

# Text encoder (FP8 scaled, 9.38 GB)
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors

# VAE (0.25 GB)
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
  -O ComfyUI/models/vae/qwen_image_vae.safetensors

Destinations confirmed by the Comfy-Org repackager tree and the ComfyUI Wiki guide's model-storage section.
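A truncated or wrong download is easy to miss, so a quick size check after the three wget calls can help. The sketch below is one possible way to do it, not part of the official tutorial. It assumes the listed sizes are decimal gigabytes (1 GB = 10**9 bytes), and the tolerance value is an arbitrary choice to absorb rounding.

```python
from pathlib import Path

# Expected sizes in GB, as quoted in this recipe.
# Assumption: the listing uses 1 GB = 10**9 bytes; TOLERANCE_GB absorbs rounding.
EXPECTED_GB = {
    "models/diffusion_models/qwen_image_fp8_e4m3fn.safetensors": 20.43,
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors": 9.38,
    "models/vae/qwen_image_vae.safetensors": 0.25,
}
TOLERANCE_GB = 0.05

def size_ok(actual_bytes: int, expected_gb: float, tol_gb: float = TOLERANCE_GB) -> bool:
    """Compare a byte count against an expected size in decimal GB."""
    return abs(actual_bytes / 1e9 - expected_gb) <= tol_gb

def check_models(comfy_root: str) -> dict[str, str]:
    """Return 'ok', 'missing' or 'size mismatch' for each expected file."""
    report = {}
    for rel_path, expected_gb in EXPECTED_GB.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
        elif size_ok(path.stat().st_size, expected_gb):
            report[rel_path] = "ok"
        else:
            report[rel_path] = "size mismatch"
    return report
```

Calling `check_models("ComfyUI")` from the directory that holds the checkout returns one status per file. A "size mismatch" on the diffusion weights usually means an interrupted download or the wrong variant.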

4. (Recommended) Download the Lightning 8-step LoRA

The official ComfyUI tutorial recommends the 8-step Lightning LoRA for accelerated inference; on the 4090 it brings per-image time down materially (see Results below):

wget https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-8steps-V1.0.safetensors \
  -O ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

5. Load the workflow

The official ComfyUI tutorial ships a ready-to-use JSON. Drag the workflow JSON onto the ComfyUI canvas — it pre-wires the Load Diffusion Model node (pointing at qwen_image_fp8_e4m3fn.safetensors), the Qwen2.5-VL FP8 text encoder, and the VAE. No custom nodes are required for the native FP8 path (unlike the GGUF path, which needs city96's ComfyUI-GGUF loader).
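Before queueing anything, it can be useful to confirm which model files the workflow actually references. The following sketch scans a workflow export for .safetensors names. It assumes the UI-format JSON layout (a top-level "nodes" list whose entries carry "widgets_values"), and the filename workflow.json in the usage comment is hypothetical.

```python
import json

def referenced_model_files(workflow: dict) -> set[str]:
    """Collect every .safetensors filename found in node widget values."""
    found = set()
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values") or []
        if isinstance(values, dict):  # some nodes store widgets as a mapping
            values = list(values.values())
        for value in values:
            if isinstance(value, str) and value.endswith(".safetensors"):
                found.add(value)
    return found

# Usage (hypothetical filename):
# with open("workflow.json") as f:
#     print(sorted(referenced_model_files(json.load(f))))
```

For the native FP8 path the result should list the FP8 diffusion weights, the FP8 scaled text encoder, and the VAE, plus the Lightning LoRA if the workflow includes it.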

Running

Launch ComfyUI from the ComfyUI root:

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the FP8 safetensors load into VRAM; subsequent runs reuse the in-memory model and are noticeably faster (see Results).
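Generations can also be queued without the browser. The sketch below builds a POST to the server's /prompt endpoint, which is the route ComfyUI's bundled API example uses as far as I know. It assumes the workflow has been exported in API format, which differs from the drag-and-drop UI JSON; the host and port match the launch command above.

```python
import json
import urllib.request

def build_queue_request(api_workflow: dict, host: str = "127.0.0.1",
                        port: int = 8188) -> urllib.request.Request:
    """Build a POST to ComfyUI's /prompt endpoint for an API-format workflow."""
    body = json.dumps({"prompt": api_workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def queue(api_workflow: dict) -> dict:
    """Send the request to a running server and return its JSON reply."""
    with urllib.request.urlopen(build_queue_request(api_workflow)) as resp:
        return json.loads(resp.read())
```

Timing two consecutive `queue(...)` calls is a simple way to observe the first-run versus warm-run difference described above.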

The RTX 4090 is Ada Lovelace (sm_89), which the default pip install torch wheels support out of the box, so no special CUDA wheel selection is required. Unlike Blackwell GPUs (sm_120, e.g. RTX 5060 Ti / 5090), there is no FlashAttention-2 sm_120 kernel gap to work around on this card.
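To confirm which architecture PyTorch sees, the compute capability can be queried directly. This is a minimal sketch: `arch_name` is a helper introduced here for illustration, and `current_gpu_arch` assumes a CUDA-enabled PyTorch install in the same environment as ComfyUI.

```python
def arch_name(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the sm_XY label used in this recipe."""
    return f"sm_{major}{minor}"

def current_gpu_arch(device_index: int = 0) -> str:
    """Query a CUDA device through PyTorch (imported lazily)."""
    import torch  # requires a CUDA-enabled PyTorch install
    major, minor = torch.cuda.get_device_capability(device_index)
    return arch_name(major, minor)

# Usage: print(current_gpu_arch())
```

An RTX 4090 reports capability (8, 9), which maps to sm_89. A Blackwell card reports (12, 0), which maps to sm_120.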

Results

Speed: the ComfyUI Wiki Qwen-Image guide reports for the native FP8 workflow on an RTX 4090D 24GB (a near-identical Ada Lovelace 24 GB sibling — same VRAM envelope, ~95% of full-4090 compute): first generation ≈ 94 s, second generation ≈ 71 s at default settings (typically 20–50 steps). With the lightx2v 8-step LoRA loaded into the sampler: first generation ≈ 55 s, second generation ≈ 34 s. With the distilled qwen_image_distill_full_fp8_e4m3fn.safetensors variant: first ≈ 69 s, second ≈ 36 s. Expect very slightly better numbers on a full RTX 4090 (more SMs, same memory bandwidth). See /check/qwen-image/rtx-4090 for the live measurement once a community benchmark seeds the backend.
VRAM usage: 86% of 24 GB ≈ 20.6 GB peak for fp8_e4m3fn, measured on RTX 4090D 24GB by the ComfyUI Wiki guide. The 86% figure holds across the base, lightx2v-LoRA, and distilled variants on that card. This is consistent with the on-disk file sizes from the Comfy-Org repackager (FP8 weights 20.43 GB, FP8 text encoder 9.38 GB, VAE 0.25 GB): the diffusion weights account for most of the peak, and ComfyUI's default memory management offloads the text encoder rather than keeping it resident alongside them.
Quality notes: the Qwen-Image GitHub README and HF card position Qwen-Image as a strong text-rendering model for both English (alphabetic) and Chinese (logographic) text. FP8 e4m3fn is the format the ComfyUI core team and Qwen ship as the recommended consumer-GPU build; higher-precision BF16 weights exist (40.86 GB) but are out of scope for a 24 GB card. The optional 8-step Lightning LoRA trades a small amount of step-count flexibility for roughly 40 to 50% lower latency per image (94 s to 55 s on the first generation, 71 s to 34 s on the second, per the 4090D figures above).
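The relative gains can be worked out directly from the 4090D timings quoted in this section. The snippet below is only arithmetic on those reported numbers, not a new measurement.

```python
# Timings in seconds as reported for the RTX 4090D in this section.
TIMINGS_S = {
    "base_fp8": (94, 71),          # (first generation, second generation)
    "lightning_8step": (55, 34),
    "distilled_fp8": (69, 36),
}

def reduction_pct(baseline_s: float, accelerated_s: float) -> float:
    """Percentage latency reduction relative to the baseline."""
    return 100.0 * (baseline_s - accelerated_s) / baseline_s

base_first, base_second = TIMINGS_S["base_fp8"]
for name in ("lightning_8step", "distilled_fp8"):
    first, second = TIMINGS_S[name]
    print(f"{name}: first {reduction_pct(base_first, first):.0f}%, "
          f"second {reduction_pct(base_second, second):.0f}%")
```

This gives about 41% and 52% for the Lightning LoRA, and about 27% and 49% for the distilled checkpoint. The two options are close on warm runs, while the LoRA is clearly ahead on the first generation.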

For the full benchmark data, see /check/qwen-image/rtx-4090.

Troubleshooting

Out-of-memory on first generation

Peak runtime VRAM is around 86% on a 24 GB card per the ComfyUI Wiki measurement. If you hit an out-of-memory error on a 4090, check that: (a) you downloaded qwen_image_fp8_e4m3fn.safetensors (20.43 GB), not qwen_image_bf16.safetensors (40.86 GB), which does not fit; (b) the text encoder is qwen_2.5_vl_7b_fp8_scaled.safetensors (9.38 GB), not the full BF16 qwen_2.5_vl_7b.safetensors (16.58 GB), which leaves no headroom; (c) other ComfyUI workflows aren't holding stale models in VRAM (restart ComfyUI to clear them). File sizes are taken from the Comfy-Org repackager file listing.
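To see how much VRAM is in use before queueing, the running server can be asked for its device statistics. The sketch below assumes ComfyUI exposes a /system_stats route returning a "devices" list with "vram_total" and "vram_free" in bytes. Treat those field names as an assumption to verify against your build.

```python
import json
import urllib.request

def vram_used_fraction(system_stats: dict, device_index: int = 0) -> float:
    """Fraction of VRAM in use, from a /system_stats style payload (assumed layout)."""
    device = system_stats["devices"][device_index]
    total = device["vram_total"]
    return (total - device["vram_free"]) / total

def fetch_system_stats(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Fetch statistics from a running ComfyUI server."""
    with urllib.request.urlopen(f"{base_url}/system_stats") as resp:
        return json.loads(resp.read())

# Usage: print(f"{vram_used_fraction(fetch_system_stats()):.0%} of VRAM in use")
```

If the fraction is already high before the first generation, something else is holding memory (a stale model or another process), and restarting ComfyUI is the quickest fix.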

Considering Q8_0 GGUF instead of FP8

The city96/Qwen-Image-gguf Q8_0 quant is 21.76 GB on disk — slightly larger than the FP8 build's 20.43 GB. It also requires city96's ComfyUI-GGUF custom node, whereas the FP8 build uses only the native ComfyUI loader. On a 24 GB card, FP8 is the simpler default; consider GGUF Q8_0 only if you specifically need GGUF tooling (custom quantization, smaller community variants).

Generation takes over a minute per image without Lightning LoRA

Without acceleration, the native FP8 workflow defaults to standard step counts and takes ≈ 94 s on first generation (subsequent ≈ 71 s) on a 4090D per the ComfyUI Wiki. Loading the Qwen-Image-Lightning 8-step LoRA cuts that to ≈ 55 s / ≈ 34 s. Make sure the LoRA node is wired into the sampler chain — a disconnected LoRA node does nothing.
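A disconnected LoRA node can be spotted from the workflow file itself. The following sketch flags LoRA nodes whose outputs feed nothing. It assumes the UI-format workflow JSON (top-level "nodes", each with "type", "id", and "outputs" entries carrying a "links" list), and it matches on the substring "lora" in the node type, which is a heuristic rather than an official naming rule.

```python
def disconnected_lora_nodes(workflow: dict) -> list:
    """Return ids of LoRA loader nodes whose outputs are not linked to anything."""
    dangling = []
    for node in workflow.get("nodes", []):
        if "lora" not in str(node.get("type", "")).lower():
            continue  # heuristic: only inspect nodes whose type mentions LoRA
        links = [
            link
            for output in node.get("outputs", [])
            for link in (output.get("links") or [])
        ]
        if not links:
            dangling.append(node.get("id"))
    return dangling
```

An empty result means every LoRA node has at least one outgoing connection. It does not prove the connection reaches the sampler, so a visual check of the graph is still worthwhile.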

Black image output

Reported in city96/Qwen-Image-gguf discussions and applies equally to the FP8 native path — the typical cause is a missing or mismatched text encoder. Verify qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and that the workflow's text-encoder loader node points at it specifically.

Lightning LoRA + distilled checkpoint combination

The ComfyUI Wiki guide notes that the distilled qwen_image_distill_full_fp8_e4m3fn.safetensors checkpoint and the 8-step Lightning LoRA do not appear to compose cleanly, so pick one acceleration strategy. The distilled checkpoint alone runs at ≈ 69 s (first generation) / ≈ 36 s (second) in the 4090D measurement.