How much VRAM does Qwen-Image need?

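About 13 GB for the Q4_K_M diffusion weights, the minimum this recipe targets; the text encoder, VAE, and activations add to that, which is why the recipe assumes a 24 GB card.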

How hard is this setup?

Intermediate: follow the steps below.

Qwen-Image on RTX 3090: 20B Text-to-Image via ComfyUI GGUF (Ampere Path — No FP8)

What You'll Build

A local Qwen-Image text-to-image setup on an RTX 3090 (Ampere, sm_86). Qwen-Image is Alibaba Tongyi Lab's 20B-parameter MMDiT image foundation model, released August 4, 2025 under Apache 2.0, with strong text rendering in English and Chinese (GitHub README). The diffusers BF16 path needs roughly 40 GB of VRAM for the diffusion weights alone (the city96 GGUF file table lists the BF16 weights at 40.87 GB on disk), which is too large for a 24 GB card without quantization.

The 4090 sibling recipe at /check/qwen-image/rtx-4090 takes the ComfyUI native FP8 path (qwen_image_fp8_e4m3fn.safetensors, 20.4 GB). That path is not practical on the 3090: Ampere sm_86 has no FP8 tensor cores, so the FP8 weights have to be dequantized on the fly, which negates the size advantage.

This recipe instead follows the route of the 4060 Ti 16 GB sibling at /check/qwen-image/rtx-4060-ti-16gb: city96's GGUF redistribution loaded through the ComfyUI-GGUF custom node. With the 3090's extra ~8 GB of headroom you can pick a higher-precision quant (Q4_K_M or Q5_K_M) without offload pressure.

Hardware data: RTX 3090 (24 GB VRAM, Ampere sm_86) · 20B-parameter MMDiT at 4–5 bit · See benchmark data

⚠️ Why not the 4090's FP8 path? The official qwen_image_fp8_e4m3fn.safetensors build (20.4 GB, per the ComfyUI native tutorial) is what the RTX 4090 sibling recipe installs. On Ampere there are no FP8 tensor cores — the weights load but are dequantized to BF16 at runtime, ballooning effective memory back toward the 40 GB BF16 footprint. The GGUF path below is the working 24 GB Ampere route.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (NVIDIA, CUDA-capable, Ampere/Ada/Hopper/Blackwell)	RTX 3090 (Ampere, sm_86)
RAM	32 GB system RAM recommended for text-encoder offload	—
Storage	~13 GB for Q4_K_M, ~15 GB for Q5_K_M, ~25 GB total with text encoder + VAE	—
Software	ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node	—

The 20B parameter count is stated explicitly in the Qwen-Image GitHub README ("20B MMDiT image foundation model that achieves significant advances in complex text rendering and precise image editing").

Installation

1. Update ComfyUI

Pull the latest ComfyUI build. Qwen-Image has had native ComfyUI support since 2025.08.05 — the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. From the directory that contains your ComfyUI folder (the clone path below is relative to it):

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
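A quick way to confirm step 2 took effect is to check for the node folder and the gguf package. This is a minimal sketch, not part of ComfyUI-GGUF; the function name check_gguf_node is hypothetical, and it assumes the script is run with the same Python interpreter that ComfyUI uses (for the portable build, the embedded one).

```python
import importlib.util
from pathlib import Path


def check_gguf_node(comfy_parent: str) -> dict:
    """Report whether the ComfyUI-GGUF folder and the gguf package are present.

    comfy_parent is the directory that contains the ComfyUI folder,
    i.e. the directory the git clone command in step 2 is run from.
    """
    node_dir = Path(comfy_parent) / "ComfyUI" / "custom_nodes" / "ComfyUI-GGUF"
    return {
        "node_dir_exists": node_dir.is_dir(),
        # find_spec returns None when the package cannot be imported
        # by the interpreter running this script.
        "gguf_importable": importlib.util.find_spec("gguf") is not None,
    }
```

Calling check_gguf_node(".") from the clone directory should report both values as True after a successful install; a False on gguf_importable usually means pip ran against a different interpreter.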

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a quant that fits your 24 GB envelope with headroom for the Qwen2.5-VL-7B text encoder, VAE, and activations. The full per-quant file sizes (verified via the HF tree API on 2026-05-22) are:

Quant	Size on disk
Q2_K	7.06 GB
Q3_K_S	8.95 GB
Q3_K_M	9.68 GB
Q4_0	11.85 GB
Q4_K_S	12.14 GB
Q4_K_M	13.07 GB
Q5_K_S	14.12 GB
Q5_K_M	14.93 GB
Q6_K	16.82 GB
Q8_0	21.76 GB
BF16	40.87 GB

For a 24 GB Ampere card, Q4_K_M (13.07 GB) is the recommended starting point. Compared with the tighter envelope of the 4060 Ti 16 GB sibling recipe, the 3090's extra ~8 GB of headroom lets you keep the Qwen2.5-VL-7B text encoder fully on-GPU without offload pressure. Q5_K_M (14.93 GB) also fits and is a small quality bump; Q6_K (16.82 GB) leaves little or no headroom once the text encoder is resident, so you may want text-encoder offload at that tier.
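The headroom reasoning above can be sketched as simple arithmetic. The quant sizes come from the table and the ~7 GB text-encoder figure is the one this recipe uses elsewhere; the VAE and activation reserves are assumptions chosen for illustration, not measurements, so treat the output as a rough guide only.

```python
# Sizes in GB, copied from the quant table above.
QUANT_SIZES_GB = {
    "Q2_K": 7.06, "Q3_K_S": 8.95, "Q3_K_M": 9.68, "Q4_0": 11.85,
    "Q4_K_S": 12.14, "Q4_K_M": 13.07, "Q5_K_S": 14.12, "Q5_K_M": 14.93,
    "Q6_K": 16.82, "Q8_0": 21.76, "BF16": 40.87,
}

TEXT_ENCODER_GB = 7.0  # approximate FP8-scaled Qwen2.5-VL-7B figure from this recipe
VAE_GB = 0.3           # assumption, not a measured value
ACTIVATIONS_GB = 1.0   # assumption; grows with resolution and batch size


def headroom_gb(quant: str, vram_gb: float = 24.0, encoder_on_gpu: bool = True) -> float:
    """Estimated VRAM left after weights, text encoder, VAE, and activations."""
    used = QUANT_SIZES_GB[quant] + VAE_GB + ACTIVATIONS_GB
    if encoder_on_gpu:
        used += TEXT_ENCODER_GB
    return round(vram_gb - used, 2)


def largest_fitting_quant(vram_gb: float = 24.0, min_headroom_gb: float = 0.0):
    """Return the largest quant whose estimated headroom meets the minimum, or None."""
    fitting = [q for q in QUANT_SIZES_GB if headroom_gb(q, vram_gb) >= min_headroom_gb]
    return max(fitting, key=QUANT_SIZES_GB.get) if fitting else None


if __name__ == "__main__":
    for name in ("Q4_K_M", "Q5_K_M", "Q6_K"):
        print(name, headroom_gb(name), headroom_gb(name, encoder_on_gpu=False))
```

Under these assumed reserves Q6_K comes out negative with the encoder resident and positive with it offloaded, which matches the offload advice for that tier. Changing ACTIVATIONS_GB shifts every result, so it is worth adjusting to your own resolution and batch size.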

Avoid IQ-quants: per city96's own response on issue #255, "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has 'Q' instead of 'IQ' in the name)."

Place the file at ComfyUI/models/diffusion_models/qwen-image-Q4_K_M.gguf (the destination is per the city96 model card).
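For scripted setups, the download source and destination can be derived from the quant name. This is a sketch under two assumptions: that the file in the city96/Qwen-Image-gguf repo is named the same as the destination file above (qwen-image-<quant>.gguf), and that the standard Hugging Face resolve URL layout applies. The helper name plan_download is hypothetical; check the repo's file list before relying on the result.

```python
from pathlib import Path
from urllib.parse import quote

REPO_ID = "city96/Qwen-Image-gguf"


def plan_download(quant: str, comfy_root: str = "ComfyUI", revision: str = "main"):
    """Return (url, destination path) for one quant file.

    The file name pattern is an assumption inferred from the destination
    path in this step, not confirmed against the repo listing.
    """
    filename = f"qwen-image-{quant}.gguf"
    url = f"https://huggingface.co/{REPO_ID}/resolve/{revision}/{quote(filename)}"
    dest = Path(comfy_root) / "models" / "diffusion_models" / filename
    return url, dest


if __name__ == "__main__":
    url, dest = plan_download("Q4_K_M")
    print(url)
    print(dest)
```

The printed URL can be handed to any downloader (a browser, curl, or wget), with the printed path as the output location.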

4. Download the text encoder and VAE

These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial:

ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
ComfyUI/models/vae/qwen_image_vae.safetensors

Both download from the official Comfy-Org HuggingFace mirror linked from the tutorial.

5. (Recommended) Download the Lightning LoRA

To bring per-image latency down from minutes to under a minute, install the official Lightning LoRA:

ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

The 8-step variant is the ComfyUI tutorial's recommended acceleration LoRA for this model.
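Before loading the workflow it can help to confirm that every file from steps 3 to 5 landed in the right folder. The following is a minimal sketch using the paths listed in this recipe; the function name missing_model_files is hypothetical, and the diffusion entry assumes the Q4_K_M quant (edit it if you chose another).

```python
from pathlib import Path

# Paths relative to the ComfyUI root, taken from steps 3-5 of this recipe.
EXPECTED_FILES = [
    "models/diffusion_models/qwen-image-Q4_K_M.gguf",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
    "models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors",  # optional, step 5
]


def missing_model_files(comfy_root: str, expected=EXPECTED_FILES) -> list:
    """Return the expected files that are not present under the ComfyUI root."""
    root = Path(comfy_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_model_files("ComfyUI"):
        print("missing:", rel)
```

An empty result means all four files are in place; a missing text encoder in particular is worth fixing first, since the troubleshooting section lists it as the most common cause of black images.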

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.

Running

Launch ComfyUI from the ComfyUI root (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
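Generations can also be queued without the browser through ComfyUI's HTTP /prompt endpoint. The sketch below assumes the server address from the launch command above and a workflow exported in API format; the drag-and-drop workflow JSON from step 6 is in the UI format and would need to be re-exported from ComfyUI first. The function names are hypothetical.

```python
import json
import urllib.request


def build_prompt_request(workflow: dict, host: str = "127.0.0.1", port: int = 8188):
    """Build (but do not send) a POST request for ComfyUI's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_prompt(workflow_path: str) -> str:
    """Send an API-format workflow file to a running ComfyUI and return the reply."""
    with open(workflow_path, "r", encoding="utf-8") as fh:
        request = build_prompt_request(json.load(fh))
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With ComfyUI running, queue_prompt("workflow_api.json") would submit one job, where workflow_api.json stands in for whatever name you gave the API-format export.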

Unlike Blackwell GPUs (sm_120), the 3090 needs no cu128-specific wheel selection: the default pip install torch CUDA wheels already include Ampere sm_86 kernels, and both PyTorch and FlashAttention have supported sm_86 for several releases.
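To confirm that an installed wheel actually covers the card, the compiled architecture list can be compared with the device's compute capability. This is a minimal sketch: it uses torch.cuda.get_arch_list() and torch.cuda.get_device_capability(), and treats an sm_XY entry as covering the device when the major version matches and the entry's minor version is not higher than the device's. The helper names are hypothetical, and PTX entries (compute_XY) are ignored for simplicity.

```python
import re


def covers_device(arch_list, major: int, minor: int) -> bool:
    """True if any sm_XY entry has the same major and a minor <= the device's."""
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", arch)
        if match and int(match.group(1)) == major and int(match.group(2)) <= minor:
            return True
    return False


def report() -> str:
    """Describe whether the local torch build covers GPU 0."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this interpreter"
    if not torch.cuda.is_available():
        return "CUDA is not available to this torch build"
    major, minor = torch.cuda.get_device_capability(0)
    ok = covers_device(torch.cuda.get_arch_list(), major, minor)
    return f"device sm_{major}{minor}: {'covered' if ok else 'not covered'} by this wheel"


if __name__ == "__main__":
    print(report())
```

On an RTX 3090 the device capability is (8, 6), so any wheel listing sm_80 or sm_86 passes the check.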

Results

VRAM usage: Q4_K_M diffusion weights are 13.07 GB on disk per the city96 file table. At runtime, peak VRAM = diffusion weights + Qwen2.5-VL-7B text encoder (the FP8-scaled variant) + VAE + activations; on a 24 GB Ampere card this comfortably stays under the envelope without offload. The ComfyUI Wiki Qwen-Image guide measures Q4_K_S at ~56% of a 24 GB RTX 4090D (~13.4 GB observed at runtime); Q4_K_M and Q5_K_M sit incrementally higher on the same hardware class. See /check/qwen-image/rtx-3090 for empirical 3090-specific measurements once a community submission lands.
Speed: no first-party RTX 3090-specific Qwen-Image speed measurement was located for this draft. The 4060 Ti 16 GB sibling recipe omits a speed line for the same reason (variant ambiguity in the closest available third-party walkthrough). Ampere is one generation older than Ada (4060 Ti / 4090). Expect end-to-end per-image latency roughly in line with the 4060 Ti 16 GB on the same Q4 quant, likely somewhat faster thanks to the 3090's higher memory bandwidth (936 GB/s vs the 4060 Ti's 288 GB/s); the GGUF path gets no FP8 acceleration on either card. Check /check/qwen-image/rtx-3090 for an empirical number once a community benchmark lands.
Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README, benchmarks vs closed-source on the README header image). The 24 GB Ampere envelope unlocks Q5_K_M (14.93 GB) which a 16 GB card would struggle to fit alongside the text encoder — a modest quality lift over the Q4_K_S used in the 16 GB sibling.

For the full benchmark data, see /check/qwen-image/rtx-3090.

Troubleshooting

Generation runs ~20 minutes per image with `transformer dispatched to CPU` log lines

Reported on the canonical repo at QwenLM/Qwen-Image#260 by an RTX 3090 user running num_inference_steps=40 and seeing repeated transformer-to-CPU offload messages. This is the diffusers BF16 path falling back to CPU offload because the full 40 GB BF16 weights don't fit 24 GB. The fix is to switch to the ComfyUI-GGUF path described above — Q4_K_M (13 GB) keeps the entire diffusion stack on-GPU on the 3090, eliminating per-step CPU swap and bringing latency back into the "minutes-and-under" range. The issue thread has no maintainer response as of 2026-05-22.

`qwen_image_fp8_e4m3fn.safetensors` runs slowly or OOMs on Ampere

The FP8 file is 20.4 GB on disk (ComfyUI native tutorial) so it nominally fits 24 GB, but Ampere (sm_86) has no FP8 tensor cores and dequantizes on the fly — runtime memory pressure ends up closer to the BF16 figure, and you lose the speed advantage that motivates the FP8 path on Ada/Hopper/Blackwell. Use the GGUF path instead.

Q5/Q6 GGUF fits on disk but OOMs at runtime

Q6_K is 16.82 GB on disk (city96 file table); with the Qwen2.5-VL-7B text encoder (FP8-scaled, ~7 GB) and VAE also resident, the weights alone total roughly 24 GB, so peak VRAM at Q6 can exceed the 24 GB envelope once activations are added. Drop to Q5_K_M or Q4_K_M, or enable text-encoder CPU offload in ComfyUI.

`Unet Loader (GGUF)` node missing

Step 2 was skipped or did not complete. Install ComfyUI-GGUF, since the native ComfyUI loader cannot read .gguf files, then restart ComfyUI. The node lives under the bootleg category in the node browser.

Generation produces a black image

Reported across city96/Qwen-Image-gguf discussions. Most common cause is a mismatched or missing text encoder file — double-check qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and is the FP8-scaled variant linked from the official ComfyUI tutorial (not the bare Qwen2.5-VL-7B BF16 weights).

Long generation times without Lightning LoRA

Without the Lightning LoRA at 50 steps, generation can take several minutes per image even on a 24 GB Ampere card. The 8-step Lightning LoRA from the official ComfyUI tutorial materially reduces per-image latency; load it into the LoRA node wired into the sampler. If Lightning is loaded but generation is still slow, confirm the LoRA node is in the sampler chain rather than disconnected.