How much VRAM does Qwen-Image need?

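About 21 GB at peak for the FP8 build (20.6 GB measured on a 24 GB card), so 24 GB is the practical floor. This recipe targets the 32 GB RTX 5090.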

How hard is this setup?

Intermediate. Follow the installation steps below.

Qwen-Image on RTX 5090: 20B Text-to-Image via ComfyUI FP8 (Blackwell Native Path)

What You'll Build

A local Qwen-Image text-to-image setup on an RTX 5090 32GB, using the canonical FP8 native path that the official ComfyUI Qwen-Image tutorial and the ComfyUI examples site are built around. Qwen-Image is Alibaba Tongyi Lab's 20B-parameter MMDiT image foundation model, released August 4, 2025 under Apache 2.0, with strong text rendering in English and Chinese (GitHub README, Tech Report). The qwen_image_fp8_e4m3fn.safetensors diffusion weights (19.03 GB per the Comfy-Org repackager file listing) load alongside the Qwen2.5-VL-7B FP8 text encoder (9.38 GB) and the VAE (0.25 GB).

Hardware data: RTX 5090 (32 GB GDDR7 VRAM, Blackwell, sm_120) · 20B MMDiT at FP8 e4m3fn · See benchmark data

ℹ️ Variant note. This recipe pins the base Qwen-Image model (Qwen/Qwen-Image, the 2025-08-04 text-to-image foundation model). It is distinct from Qwen-Image-Edit and Qwen-Image-Edit-2509/-2511 (edit-focused siblings) and from the newer qwen_image_2512_* checkpoints (Comfy-Org repackager files), all of which are out of scope here. The Comfy-Org repackager keeps qwen_image_* (base) and qwen_image_2512_* (Dec-2025 update) as separate files; pick the one without the 2512_ infix for this recipe.

ℹ️ Why FP8 on this card. The 5090 is Blackwell (sm_120) and has native FP8 tensor cores, as does the RTX 4090 (Ada, sm_89), so the FP8 build is a real speed win and not just a memory-saving fallback. The RTX 3090 sibling recipe, by contrast, uses the GGUF path because Ampere (sm_86) has no FP8 tensor cores. BF16 is not an option on this card: the full-precision diffusion weights are 38.05 GB (Comfy-Org file listing), more than the 5090's 32 GB, so FP8 native is the canonical path here.

Requirements

Component	Minimum	Tested
GPU	24 GB+ VRAM on NVIDIA Ada, Hopper, or Blackwell (32 GB on the RTX 5090)	RTX 5090 (32 GB, Blackwell, sm_120)
RAM	32 GB system RAM recommended	—
Storage	~30 GB (19.0 GB FP8 weights + 9.4 GB FP8 text encoder + 0.25 GB VAE + optional LoRA)	—
Software	ComfyUI (current build, Jan 2026 or later), Python 3.10+	—

The 20B parameter count is stated explicitly in the Qwen-Image GitHub README ("20B MMDiT image foundation model that achieves significant advances in complex text rendering") and on the HF model card. At BF16 the diffusion weights alone are 38.05 GB (Comfy-Org file listing), which exceeds the 5090's 32 GB, so even a 32 GB card runs the FP8 build rather than the full-precision weights.
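The fit argument above can be checked with a few lines of arithmetic. This is a rough sketch, not a measurement: it compares file sizes against the card's nominal VRAM, assumes both figures use the same "GB" unit, and ignores activations and other runtime overhead. The variable names are illustrative only.

```python
# Sizes in GB, copied from the recipe text (Comfy-Org file listing).
VRAM_GB = 32.0
BF16_WEIGHTS_GB = 38.05
FP8_WEIGHTS_GB = 19.03
FP8_TEXT_ENCODER_GB = 9.38
VAE_GB = 0.25


def fits(total_gb: float, vram_gb: float = VRAM_GB) -> bool:
    """Return True if the files alone are smaller than the card's VRAM."""
    return total_gb < vram_gb


# BF16 diffusion weights alone already exceed the card.
bf16_fits = fits(BF16_WEIGHTS_GB)

# FP8 stack: diffusion weights + text encoder + VAE, all resident at once.
fp8_stack_gb = FP8_WEIGHTS_GB + FP8_TEXT_ENCODER_GB + VAE_GB
fp8_fits = fits(fp8_stack_gb)

# Ratio of the two diffusion files; close to 2, consistent with
# 2 bytes per weight (BF16) versus 1 byte per weight (FP8).
size_ratio = BF16_WEIGHTS_GB / FP8_WEIGHTS_GB

print(f"BF16 weights fit in {VRAM_GB:.0f} GB: {bf16_fits}")
print(f"FP8 stack: {fp8_stack_gb:.2f} GB, fits: {fp8_fits}")
print(f"BF16/FP8 size ratio: {size_ratio:.2f}")
```

The weights-only comparison is a necessary condition, not a sufficient one; the measured runtime peak is covered in the Results section.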

Installation

1. Update ComfyUI

Pull the latest ComfyUI build (Jan 2026 or later — the official ComfyUI Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated"). A current build also brings in initial NVFP4 checkpoint support per Comfy-Org/ComfyUI PR #11635 (merged 2026-01-06) and the LoRA-on-NVFP4 follow-up PR #11837 (merged 2026-01-13) — see the optional NVFP4 path below.

2. Download the FP8 diffusion weights

From the Comfy-Org Qwen-Image repackager, pull the FP8 e4m3fn build:

# Run from the directory that contains your ComfyUI/ checkout; saves into ComfyUI/models/diffusion_models/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/diffusion_models/qwen_image_fp8_e4m3fn.safetensors \
  -O ComfyUI/models/diffusion_models/qwen_image_fp8_e4m3fn.safetensors

File size: 19.03 GB (per the Comfy-Org file listing).

3. Download the text encoder and VAE

# Text encoder (FP8 scaled, 9.38 GB)
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors

# VAE (0.25 GB)
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
  -O ComfyUI/models/vae/qwen_image_vae.safetensors

Destinations confirmed by the Comfy-Org repackager tree and the ComfyUI-Wiki Qwen-Image guide.
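Before loading the workflow, it can help to confirm that the three files landed where ComfyUI looks for them. The following is a minimal sketch that uses only the destination paths from the wget commands above; the function name `missing_model_files` is hypothetical and not part of ComfyUI.

```python
from pathlib import Path

# Relative paths taken from the wget commands in steps 2 and 3.
EXPECTED_FILES = [
    "models/diffusion_models/qwen_image_fp8_e4m3fn.safetensors",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
]


def missing_model_files(comfy_root: str) -> list[str]:
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # wget -O creates the file up front, so a zero-byte file can
        # indicate a download that failed before any data arrived.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_model_files("ComfyUI")` from the directory that contains the checkout returns an empty list when all three files are in place.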

4. (Recommended) Download the Lightning 8-step LoRA

The official ComfyUI tutorial recommends the 8-step Lightning LoRA for accelerated inference:

wget https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-8steps-V1.0.safetensors \
  -O ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

5. Load the workflow

The official ComfyUI tutorial ships a ready-to-use JSON. Drag the workflow JSON onto the ComfyUI canvas — it pre-wires the Load Diffusion Model node (pointing at qwen_image_fp8_e4m3fn.safetensors), the Qwen2.5-VL FP8 text encoder, and the VAE. No custom nodes are required for the native FP8 path (unlike the GGUF path, which needs city96's ComfyUI-GGUF loader).

Running

Launch ComfyUI from the ComfyUI root:

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the FP8 safetensors load into VRAM; subsequent runs reuse the in-memory model.
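Generations can also be queued without the browser by posting to the server's /prompt endpoint. The sketch below assumes the workflow has been exported from ComfyUI in API format (the regular canvas JSON is a different layout) and that the server is listening on the address used above. The helper names and any file name you pass in are hypothetical.

```python
import json
import urllib.request


def build_prompt_request(
    workflow: dict, host: str = "127.0.0.1", port: int = 8188
) -> urllib.request.Request:
    """Build (but do not send) a POST to ComfyUI's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str) -> dict:
    """Load an API-format workflow JSON and queue it on the local server."""
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    # Requires a running ComfyUI instance; raises URLError otherwise.
    with urllib.request.urlopen(build_prompt_request(workflow)) as resp:
        return json.loads(resp.read())
```

Splitting request construction from sending keeps the first helper usable for inspection or dry runs when no server is up.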

Blackwell wheel selection. The RTX 5090 is Blackwell (sm_120). PyTorch wheels built against CUDA 12.8 (cu128) ship sm_120 kernels — install with pip install torch --index-url https://download.pytorch.org/whl/cu128 if your current ComfyUI Python environment defaults to an older cu121/cu124 wheel.
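One way to tell whether the installed wheel already covers the 5090 is to compare the GPU's compute capability with the architectures the wheel was compiled for. The sketch below uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`; the helper names are hypothetical, and the exact-match test is a simplifying assumption (it does not account for PTX entries that may allow forward-compatible JIT compilation).

```python
def wheel_covers_gpu(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """True if the wheel's compiled arch list includes the GPU's sm_XY target."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_local_torch() -> str:
    """Report whether the installed PyTorch build covers the local GPU."""
    try:
        import torch  # imported lazily so the helper above works without it
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "CUDA is not available to PyTorch"
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    covered = wheel_covers_gpu(arch_list, capability)
    return (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"GPU sm_{capability[0]}{capability[1]}, covered by wheel: {covered}"
    )
```

Run `print(check_local_torch())` inside the ComfyUI Python environment; if the report says the GPU is not covered, reinstalling from the cu128 index as described above is the likely fix.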

Attention backend. ComfyUI uses PyTorch's SDPA (scaled_dot_product_attention) by default for the Qwen-Image workflow, which works on sm_120. FlashAttention-2 sm_120 kernels are still tracked as an open issue at Dao-AILab/flash-attention#2168. If a custom node or fork explicitly enables FA2, you may need to fall back to SDPA on the 5090 until the upstream wheel adds sm_120 coverage.

Results

Speed: no first-party RTX 5090 measurement has been published yet for the base Qwen-Image FP8 native ComfyUI workflow. The closest published timing comes from the ComfyUI-Wiki Qwen-Image guide on an RTX 4090D 24GB (Ada Lovelace, a different architecture): first generation ≈ 94 s and second generation ≈ 71 s at default settings, or ≈ 55 s / ≈ 34 s with the lightx2v 8-step LoRA. The 5090 is Blackwell (sm_120) with materially higher compute and memory bandwidth than Ada, so timings will differ; we deliberately omit a 5090-specific number here pending a first-party benchmark. See /check/qwen-image/rtx-5090 for the live measurement once a community benchmark seeds the backend, or contribute one via /contribute.
VRAM usage: the FP8 native workflow peaks at 86% of 24 GB ≈ 20.6 GB, measured on an RTX 4090D 24GB by the ComfyUI-Wiki guide. That peak is set by the workload (FP8 diffusion weights 19.0 GB + FP8 text encoder 9.4 GB + VAE 0.25 GB, with the text encoder offloaded by ComfyUI's default memory management) rather than by the card, so it should carry over to the 5090, where the 32 GB envelope leaves ~11 GB of headroom.
Spending the headroom. The 5090's ~11 GB of free VRAM above the FP8 native peak unlocks options the 4090 doesn't have: (a) keep the Qwen2.5-VL-7B text encoder fully on-GPU instead of letting ComfyUI offload it as it does on tight 24 GB cards (cuts encoder-stage latency on long prompts); (b) load multiple LoRAs into the sampler chain concurrently for layered style work (each Qwen-Image LoRA from lightx2v/Qwen-Image-Lightning is a few hundred MB); (c) batch larger images or higher step counts without paging; (d) keep a second ComfyUI workflow loaded in parallel for fast A/B switching.
Quality notes: Qwen-Image is positioned by the Qwen-Image GitHub README and HF card as a strong text-rendering model — both English alphabetic and Chinese logographic. FP8 e4m3fn is the format the ComfyUI core team and Qwen ship as the recommended consumer-GPU build; higher-precision BF16 weights exist (38.05 GB) but exceed the 32 GB envelope. The optional 8-step Lightning LoRA trades a small amount of step-count flexibility for materially lower latency per image.
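The headroom and Lightning LoRA figures above follow from simple arithmetic on the published 4090D numbers. The sketch below only reproduces that arithmetic; it is not a 5090 measurement, and the variable names are illustrative.

```python
CARD_24_GB = 24.0
CARD_32_GB = 32.0
PEAK_FRACTION = 0.86  # peak utilisation reported on the 24 GB card

peak_gb = PEAK_FRACTION * CARD_24_GB   # about 20.6 GB
headroom_gb = CARD_32_GB - peak_gb     # about 11.4 GB

# 4090D timings in seconds: (first generation, second generation)
default_s = (94, 71)
lightning_s = (55, 34)
# Speedup from the 8-step Lightning LoRA, per run.
speedups = tuple(d / l for d, l in zip(default_s, lightning_s))

print(f"peak: {peak_gb:.1f} GB, headroom on 32 GB: {headroom_gb:.1f} GB")
print(
    f"Lightning LoRA speedup on 4090D: "
    f"first {speedups[0]:.2f}x, second {speedups[1]:.2f}x"
)
```

The speedup ratios describe the Ada card only; whether the same ratios hold on Blackwell is an open question until a 5090 benchmark exists.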

For the full benchmark data, see /check/qwen-image/rtx-5090.

Troubleshooting

Out-of-memory on the first generation

The FP8 native workflow peaks at ~20.6 GB on a 24 GB card per the ComfyUI-Wiki measurement; a 32 GB 5090 sees the same workload peak with ~11 GB headroom. If you OOM on the 5090, check: (a) you downloaded qwen_image_fp8_e4m3fn.safetensors (19.03 GB) and not qwen_image_bf16.safetensors (38.05 GB), which does not fit; (b) the text encoder is qwen_2.5_vl_7b_fp8_scaled.safetensors (9.38 GB) — the full BF16 variant is roughly 16 GB and tightens headroom unnecessarily; (c) other ComfyUI workflows are not holding stale models in VRAM (restart ComfyUI). All file sizes verified via the Comfy-Org repackager tree.
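Checks (a) and (b) above come down to file sizes, which can be scanned quickly. This is a minimal sketch: the helper name `oversized_models` is hypothetical, and it assumes the "GB" in the file listing means decimal gigabytes, which the source does not state.

```python
from pathlib import Path

GB = 1_000_000_000  # assumption: the listing's "GB" is decimal gigabytes


def oversized_models(folder: str, limit_gb: float) -> dict[str, float]:
    """Map each .safetensors file in folder larger than limit_gb to its size in GB."""
    found = {}
    for path in sorted(Path(folder).glob("*.safetensors")):
        size_gb = path.stat().st_size / GB
        if size_gb > limit_gb:
            found[path.name] = round(size_gb, 2)
    return found
```

For example, `oversized_models("ComfyUI/models/diffusion_models", 32.0)` would flag a 38.05 GB BF16 file, and a limit of about 10 on `ComfyUI/models/text_encoders` would flag a BF16 text encoder.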

Try NVFP4 for a Blackwell-only speedup (experimental)

The Comfy-Org repackager also ships qwen_image_nvfp4.safetensors (18.41 GB per the file listing, uploaded 2026-01-06 by ComfyUI maintainer Kosinkadink). NVFP4, NVIDIA's 4-bit microscaling format, is hardware-accelerated only on Blackwell sm_120; Ada and Ampere cannot use the format's tensor-core path. ComfyUI core landed initial NVFP4 checkpoint support in PR #11635 (merged 2026-01-06), LoRA-on-NVFP4 support in PR #11837 (merged 2026-01-13), and an adaptive memory loading fix in PR #11845 (merged 2026-02-01). Neither the official ComfyUI Qwen-Image tutorial at docs.comfy.org nor the examples page at comfyanonymous.github.io/ComfyUI_examples/qwen_image/ documents an NVFP4 workflow yet, so this path is experimental as of this recipe's date. The file and the loader both ship with a current ComfyUI build, though, and the format fits the 5090's Blackwell hardware in a way it cannot on Ada or Ampere. If you want the smallest possible diffusion footprint with native sm_120 compute, swap the FP8 file in step 2 for the NVFP4 file and load it through the same Load Diffusion Model node.

Want lower disk footprint via GGUF instead

The city96/Qwen-Image-gguf Q8_0 quant is 20.27 GB on disk — slightly larger than the FP8 build's 19.03 GB. It also requires city96's ComfyUI-GGUF custom node, whereas the FP8 build uses only the native ComfyUI loader. On a 32 GB card, native FP8 is the simpler default; the GGUF path is what the smaller RTX 5060 Ti sibling and RTX 4060 Ti 16 GB sibling use because the FP8 build does not fit a 16 GB card. Consider GGUF on the 5090 only if you specifically need GGUF tooling (custom quantization, smaller community variants).

Black image output

Reported in the city96/Qwen-Image-gguf discussions and applies equally to the FP8 native path — the typical cause is a missing or mismatched text encoder. Verify qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and that the workflow's text-encoder loader node points at it specifically.

Lightning LoRA + distilled checkpoint combination

The ComfyUI-Wiki guide notes that the distilled qwen_image_distill_full_fp8_e4m3fn.safetensors checkpoint and the 8-step Lightning LoRA appear not to compose cleanly — pick one acceleration strategy at a time.