Qwen-Image on RTX 3090 Ti: 20B Text-to-Image via ComfyUI GGUF (Ampere Path — No FP8)

What You'll Build

A local Qwen-Image text-to-image setup on an RTX 3090 Ti (Ampere, sm_86). Qwen-Image is Alibaba Tongyi Lab's 20B-parameter MMDiT image foundation model, released August 4, 2025 under Apache 2.0, with strong text rendering in English and Chinese (GitHub README). The diffusers BF16 path needs roughly 40 GB of VRAM for the diffusion weights alone (per the city96 GGUF file table, BF16 weights total 40.87 GB on disk), which is too large for a 24 GB card without quantization.

The 4090 sibling recipe at /check/qwen-image/rtx-4090 takes the ComfyUI native FP8 path (qwen_image_fp8_e4m3fn.safetensors, 20.4 GB). That path is not practical on the 3090 Ti: Ampere sm_86 has no FP8 tensor cores, so the FP8 weights have to be dequantized on the fly, which negates the size advantage. This recipe instead follows the 4060 Ti 16 GB sibling at /check/qwen-image/rtx-4060-ti-16gb and the 3090 sibling at /check/qwen-image/rtx-3090: city96's GGUF redistribution loaded through the ComfyUI-GGUF custom node. With the 3090 Ti's full 24 GB envelope you can pick a higher-precision quant (Q4_K_M or Q5_K_M) without offload pressure.

Hardware data: RTX 3090 Ti (24 GB VRAM, Ampere sm_86) · 20B-parameter MMDiT at 4–5 bit · See benchmark data

⚠️ Why not the 4090's FP8 path? The official qwen_image_fp8_e4m3fn.safetensors build (20.4 GB, per the ComfyUI native tutorial) is what the RTX 4090 sibling recipe installs. Ampere has no FP8 tensor cores: the weights load, but they are dequantized to BF16 at runtime, which pushes effective memory back toward the 40 GB BF16 footprint. The GGUF path below is the working 24 GB Ampere route. The 3090 Ti uses the same Ampere sm_86 silicon as the plain 3090, with a few more enabled CUDA cores, higher clocks, and faster GDDR6X on the same 384-bit bus (1008 GB/s vs 936 GB/s).

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (NVIDIA, CUDA-capable, Ampere/Ada/Hopper/Blackwell)	RTX 3090 Ti (Ampere, sm_86)
RAM	32 GB system RAM recommended for text-encoder offload	—
Storage	~13 GB for Q4_K_M, ~15 GB for Q5_K_M, ~25 GB total with text encoder + VAE	—
Software	ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node	—

The 20B parameter count is stated explicitly in the Qwen-Image GitHub README ("20B MMDiT image foundation model that achieves significant advances in complex text rendering and precise image editing").

Installation

1. Update ComfyUI

Pull the latest ComfyUI build. Qwen-Image has had native ComfyUI support since 2025.08.05 — the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. From the directory that contains your ComfyUI folder, with the Python environment ComfyUI runs in active:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users run the commands from the ComfyUI_windows_portable folder and substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
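
Before moving on, it can help to confirm that both halves of step 2 took effect. The following is a minimal sketch, not part of the ComfyUI-GGUF project: the function name check_gguf_node and the "ComfyUI" default path are illustrative assumptions. Run it with the same interpreter ComfyUI uses (the embedded Python on the portable build).

```python
import importlib.util
from pathlib import Path


def check_gguf_node(comfy_root):
    """Report whether the ComfyUI-GGUF folder and the gguf package are present."""
    node_dir = Path(comfy_root) / "custom_nodes" / "ComfyUI-GGUF"
    return {
        # Folder name matches the git clone target used in step 2.
        "node_dir_present": node_dir.is_dir(),
        # find_spec returns None when gguf is not installed in this interpreter.
        "gguf_package_present": importlib.util.find_spec("gguf") is not None,
    }


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point it at your own ComfyUI root.
    for name, ok in check_gguf_node("ComfyUI").items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If either entry reports MISSING, repeat the matching command from step 2 and restart ComfyUI.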

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a quant that fits your 24 GB envelope with headroom for the Qwen2.5-VL-7B text encoder, VAE, and activations. The full per-quant file sizes (verified via the HF tree API on 2026-05-28) are:

Quant	Size on disk
Q2_K	7.06 GB
Q3_K_S	8.95 GB
Q3_K_M	9.68 GB
Q4_0	11.85 GB
Q4_K_S	12.14 GB
Q4_K_M	13.07 GB
Q5_K_S	14.12 GB
Q5_K_M	14.93 GB
Q6_K	16.82 GB
Q8_0	21.76 GB
BF16	40.87 GB

For a 24 GB Ampere card, Q4_K_M (13.07 GB) is the recommended starting point — same quant the 4060 Ti 16 GB sibling recipe uses on a tighter envelope, with the 3090 Ti's extra ~8 GB of headroom letting you keep the Qwen2.5-VL-7B text encoder fully on-GPU without offload pressure. Q5_K_M (14.93 GB) also fits comfortably and is a small quality bump; Q6_K (16.82 GB) fits but leaves less headroom for the text encoder, so you may want offload at that tier.
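
The headroom reasoning above can be sketched as simple arithmetic. This is a rough budgeting aid, not a measurement: the ~7 GB text-encoder figure is the one quoted in the troubleshooting section, while the 2 GB reserve for VAE and activations and the function name quants_that_fit are assumptions made for illustration.

```python
# File sizes in GB, copied from the city96 file table above.
QUANT_SIZES_GB = {
    "Q2_K": 7.06, "Q3_K_S": 8.95, "Q3_K_M": 9.68, "Q4_0": 11.85,
    "Q4_K_S": 12.14, "Q4_K_M": 13.07, "Q5_K_S": 14.12, "Q5_K_M": 14.93,
    "Q6_K": 16.82, "Q8_0": 21.76, "BF16": 40.87,
}


def quants_that_fit(vram_gb=24.0, text_encoder_gb=7.0, reserve_gb=2.0):
    """Return quants whose weights fit after the assumed overheads are subtracted."""
    # reserve_gb is an assumed allowance for VAE and activations, not a measured value.
    budget = vram_gb - text_encoder_gb - reserve_gb
    return [name for name, size in QUANT_SIZES_GB.items() if size <= budget]


if __name__ == "__main__":
    print("Text encoder on GPU:  ", quants_that_fit())
    print("Text encoder offloaded:", quants_that_fit(text_encoder_gb=0.0))
```

Under these assumptions the largest quant that fits with the text encoder resident is Q5_K_M, and Q6_K only fits once the text encoder is offloaded, which matches the guidance above. Real peak usage depends on resolution and ComfyUI's memory management, so treat the output as a starting point.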

Avoid IQ-quants: per city96's own response on issue #255, "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has 'Q' instead of 'IQ' in the name)."

Place the file at ComfyUI/models/diffusion_models/qwen-image-Q4_K_M.gguf (the destination is per the city96 model card).
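
If you would rather script the download than use the browser, the sketch below uses hf_hub_download from the huggingface_hub package (install it with pip if it is missing). The helper names are illustrative, and the qwen-image-<quant>.gguf naming pattern is an assumption extrapolated from the Q4_K_M file name above; confirm the exact file name in the repo's file list before relying on it for other quants.

```python
from pathlib import Path

REPO_ID = "city96/Qwen-Image-gguf"


def gguf_filename(quant="Q4_K_M"):
    """Build the file name, assuming the qwen-image-<quant>.gguf pattern."""
    return f"qwen-image-{quant}.gguf"


def download_quant(comfy_root, quant="Q4_K_M"):
    """Download one quant into <comfy_root>/models/diffusion_models."""
    # Imported lazily so gguf_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    target_dir = Path(comfy_root) / "models" / "diffusion_models"
    target_dir.mkdir(parents=True, exist_ok=True)
    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=str(target_dir),
    )


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point it at your own ComfyUI root.
    print(download_quant("ComfyUI"))
```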

4. Download the text encoder and VAE

These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial:

ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
ComfyUI/models/vae/qwen_image_vae.safetensors

Both download from the official Comfy-Org HuggingFace mirror linked from the tutorial.

5. (Recommended) Download the Lightning LoRA

To cut per-image latency (the aim is under a minute rather than several minutes, though no 3090 Ti measurement backs a specific figure yet), install the official Lightning LoRA:

ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

The 8-step variant is the ComfyUI tutorial's recommended acceleration LoRA for this model.
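
With steps 3 to 5 done, a quick file check can catch a misplaced download before the first run. This is a minimal sketch that only tests for the paths named in this recipe; the function name missing_files is illustrative, and the Q4_K_M file name should be changed if you picked another quant.

```python
from pathlib import Path

# Relative paths taken from steps 3 to 5; the LoRA is optional.
REQUIRED = [
    "models/diffusion_models/qwen-image-Q4_K_M.gguf",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
]
OPTIONAL = ["models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors"]


def missing_files(comfy_root, include_optional=False):
    """Return the expected model files that are not present under comfy_root."""
    root = Path(comfy_root)
    wanted = REQUIRED + (OPTIONAL if include_optional else [])
    return [rel for rel in wanted if not (root / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point it at your own ComfyUI root.
    for rel in missing_files("ComfyUI", include_optional=True):
        print("missing:", rel)
```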

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.

Running

Launch ComfyUI from its root directory (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
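
Queueing can also be scripted against the server's /prompt HTTP endpoint instead of the browser. The sketch below is written under several assumptions: the workflow must first be exported from ComfyUI in API format (the drag-and-drop UI JSON is a different layout), the file name workflow_api.json and the node id "6" are hypothetical placeholders, and the text-encode node is assumed to take its prompt in an input named text. Check the ids and input names in your own exported file.

```python
import copy
import json
import urllib.request


def build_payload(api_workflow, node_id, prompt_text):
    """Return a /prompt request body with one text-encode node's text replaced."""
    workflow = copy.deepcopy(api_workflow)  # leave the caller's dict untouched
    workflow[node_id]["inputs"]["text"] = prompt_text
    return {"prompt": workflow}


def queue_prompt(payload, server="http://127.0.0.1:8188"):
    """POST the payload to a running ComfyUI server and return its JSON reply."""
    request = urllib.request.Request(
        f"{server}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Hypothetical file name and node id; take both from your own API export.
    with open("workflow_api.json", encoding="utf-8") as handle:
        exported = json.load(handle)
    payload = build_payload(exported, "6", "a neon sign that reads OPEN 24 HOURS")
    print(queue_prompt(payload))
```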

Unlike Blackwell GPUs (sm_120), the 3090 Ti needs no cu128-specific wheel selection: the standard CUDA builds of PyTorch already include Ampere sm_86 kernels, and both PyTorch and FlashAttention have supported sm_86 for years.
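
To confirm that the installed PyTorch build sees the card as sm_86, a short check along these lines can be run in ComfyUI's Python environment. It is a sketch: the helper names are illustrative, and the FP8 test simply encodes the fact that FP8 tensor cores first appear at compute capability 8.9 (Ada).

```python
def has_fp8_tensor_cores(capability):
    """FP8 tensor cores start at sm_89 (Ada); Ampere sm_80/sm_86 lacks them."""
    return tuple(capability) >= (8, 9)


def describe_gpu():
    """Print the first CUDA device's name and compute capability."""
    import torch  # imported lazily so the helper above has no dependencies

    if not torch.cuda.is_available():
        print("CUDA not available: this PyTorch build cannot see an NVIDIA GPU")
        return None
    capability = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"sm_{capability[0]}{capability[1]}")
    print("FP8 tensor cores:", has_fp8_tensor_cores(capability))
    return capability


if __name__ == "__main__":
    describe_gpu()
```

On a 3090 Ti the capability should come back as (8, 6), which is why this recipe takes the GGUF route rather than the FP8 one.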

Results

VRAM usage: Q4_K_M diffusion weights are 13.07 GB on disk per the city96 file table. At runtime, peak VRAM = diffusion weights + Qwen2.5-VL-7B text encoder (the FP8-scaled variant) + VAE + activations; on a 24 GB Ampere card this comfortably stays under the envelope without offload. The ComfyUI Wiki Qwen-Image guide measures Q4_K_S at ~56% of a 24 GB RTX 4090D (~13.4 GB observed at runtime); Q4_K_M and Q5_K_M sit incrementally higher on the same hardware class, and the 3090 Ti's 24 GB envelope is identical. See /check/qwen-image/rtx-3090-ti for empirical 3090 Ti-specific measurements once a community submission lands.
Speed: no first-party RTX 3090 Ti-specific Qwen-Image speed measurement was located for this draft. The 3090 sibling recipe also omits a speed line for the same reason (the closest available third-party walkthrough is variant-ambiguous). The 3090 Ti shares the plain 3090's Ampere sm_86 architecture with a small memory-bandwidth uplift (1008 GB/s vs 936 GB/s, roughly 8%), so per-image latency on the same Q4 quant should land slightly below the 3090 figure once one is measured. It remains one generation behind the Ada 4090 path and two behind the Blackwell 5090 path, with no FP8 acceleration available. Check /check/qwen-image/rtx-3090-ti for an empirical number once a community benchmark lands.
Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README, benchmarks vs closed-source on the README header image). The 24 GB Ampere envelope unlocks Q5_K_M (14.93 GB), which a 16 GB card would struggle to fit alongside the text encoder; expect a modest quality lift over the Q4-tier quant used in the 16 GB sibling.

For the full benchmark data, see /check/qwen-image/rtx-3090-ti.

Troubleshooting

Generation runs ~20 minutes per image with `transformer dispatched to CPU` log lines

Reported on the canonical repo at QwenLM/Qwen-Image#260 by an RTX 3090 user running num_inference_steps=40 and seeing repeated transformer-to-CPU offload messages. The same failure mode applies to any 24 GB Ampere sm_86 card, including the 3090 Ti: the full 40 GB of BF16 weights do not fit in 24 GB, so the diffusers BF16 path falls back to CPU offload. The fix is to switch to the ComfyUI-GGUF path described above. Q4_K_M (13 GB) keeps the entire diffusion stack on-GPU, which eliminates the per-step CPU swap and brings latency back down to minutes or less per image. The issue thread has no maintainer response as of 2026-05-28.

`qwen_image_fp8_e4m3fn.safetensors` runs slowly or OOMs on Ampere

The FP8 file is 20.4 GB on disk (ComfyUI native tutorial) so it nominally fits 24 GB, but Ampere (sm_86) has no FP8 tensor cores and dequantizes on the fly — runtime memory pressure ends up closer to the BF16 figure, and you lose the speed advantage that motivates the FP8 path on Ada/Hopper/Blackwell. The 3090 Ti and 3090 both share this constraint. Use the GGUF path instead.

Q5/Q6 GGUF fits on disk but OOMs at runtime

Q6_K is 16.82 GB on disk (city96 file table). With the Qwen2.5-VL-7B text encoder (FP8-scaled, ~7 GB) and the VAE also resident, the weights alone come to roughly 24 GB, so activations can push peak VRAM past the card's 24 GB. Drop to Q5_K_M or Q4_K_M, or enable text-encoder CPU offload in ComfyUI.

`Unet Loader (GGUF)` node missing

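The ComfyUI-GGUF custom node is not installed or failed to load, most often because step 2 was skipped. Install ComfyUI-GGUF and restart ComfyUI; the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.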

Generation produces a black image

Reported across city96/Qwen-Image-gguf discussions. Most common cause is a mismatched or missing text encoder file — double-check qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and is the FP8-scaled variant linked from the official ComfyUI tutorial (not the bare Qwen2.5-VL-7B BF16 weights).

Long generation times without Lightning LoRA

Without the Lightning LoRA at 50 steps, generation can take several minutes per image even on a 24 GB Ampere card. The 8-step Lightning LoRA from the official ComfyUI tutorial materially reduces per-image latency: load it in a LoRA loader node placed between the Unet Loader (GGUF) and the sampler, and lower the sampler's step count to 8 to match. If Lightning is loaded but generation is still slow, confirm the LoRA node is actually connected in the model chain rather than left disconnected, and that the step count was reduced.