How much VRAM does Qwen-Image need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

Qwen-Image on RTX 4060 Ti 16GB: 20B Text-to-Image via ComfyUI GGUF

What You'll Build

A local Qwen-Image text-to-image setup on a 16GB RTX 4060 Ti. The official FP8 build of Qwen-Image — a 20B-parameter MMDiT image foundation model from Alibaba's Tongyi Lab, released August 4, 2025 under Apache 2.0 — ships as a 20.4 GB diffusion weight (per the ComfyUI native tutorial) and will not fit on a 16GB card. This recipe uses city96's GGUF redistribution (Q4_K_S, 12.1 GB on disk) loaded through the ComfyUI-GGUF custom node, which leaves enough headroom for the Qwen2.5-VL-7B text encoder, the VAE, and activations on the 4060 Ti's 16 GB envelope.

Hardware data: RTX 4060 Ti 16GB (Ada Lovelace, sm_89) · 20B-parameter MMDiT at 4-bit · See benchmark data

⚠️ Known issue: the official qwen_image_fp8_e4m3fn.safetensors build is 20.4 GB (ComfyUI native docs) and will OOM on a 16GB card. Use the GGUF path below, not the native FP8 workflow.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (NVIDIA, CUDA-capable)	RTX 4060 Ti 16GB (Ada, sm_89)
RAM	32 GB system RAM recommended for text-encoder offload	—
Storage	~14 GB for diffusion model, ~25 GB total with encoder + VAE + Lightning LoRA	—
Software	ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node	—

The 20B parameter count is stated explicitly on the Qwen-Image GitHub README ("a 20B MMDiT image foundation model") and on the city96 GGUF card. At BF16 the weights alone are 40.9 GB per city96's file-size table, which is why a 16 GB card needs either a GGUF quant or the FP8 build combined with CPU offload.
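The size figures quoted in this recipe can be cross-checked with a little arithmetic. The sketch below assumes file sizes are in decimal gigabytes and that the BF16 file is almost entirely weights at 2 bytes each. Both are assumptions, so treat the results as approximate.

```python
# Sizes in GB as quoted in this recipe (city96 table and ComfyUI tutorial).
BF16_SIZE_GB = 40.9
FP8_SIZE_GB = 20.4
Q4_K_S_SIZE_GB = 12.1

# Assumption: BF16 stores 2 bytes per weight and the file is essentially all weights.
PARAMS_BILLION = BF16_SIZE_GB / 2


def effective_bits_per_weight(size_gb: float) -> float:
    """Average number of bits stored per parameter for a file of the given size."""
    return size_gb * 8 / PARAMS_BILLION


for name, size in [("BF16", BF16_SIZE_GB), ("FP8", FP8_SIZE_GB), ("Q4_K_S", Q4_K_S_SIZE_GB)]:
    print(f"{name}: {size} GB, about {effective_bits_per_weight(size):.2f} bits per weight")
```

Under these assumptions Q4_K_S lands somewhat above 4 bits per weight, which is plausible because the quantized file also has to store per-block scale data alongside the 4-bit values.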

Installation

1. Update ComfyUI

Pull the latest ComfyUI build. Qwen-Image support in core ComfyUI is recent — the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. Run the following from the directory that contains your ComfyUI folder:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
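A quick way to confirm the install took effect is to check for the custom node folder and the gguf package. This is a minimal sketch: the helper name check_gguf_install is hypothetical, and it assumes you pass the directory that contains the ComfyUI folder. Run it with the same Python interpreter ComfyUI uses (the embedded one on the portable build), otherwise the package check will look in the wrong environment.

```python
import importlib.util
from pathlib import Path


def check_gguf_install(install_dir: str) -> dict:
    """Report whether the custom node folder and the gguf package are visible."""
    node_dir = Path(install_dir) / "ComfyUI" / "custom_nodes" / "ComfyUI-GGUF"
    return {
        # True when the git clone from step 2 landed in the expected place.
        "custom_node_present": node_dir.is_dir(),
        # True when `import gguf` would succeed in this interpreter.
        "gguf_importable": importlib.util.find_spec("gguf") is not None,
    }


print(check_gguf_install("."))
```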

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a quant that fits 16 GB with overhead. The full size table (from the model card) is:

Quant	Size on disk
Q4_0	11.9 GB
Q4_K_S	12.1 GB
Q4_K_M	13.1 GB
Q5_K_S	14.1 GB
Q5_K_M	14.9 GB

For a 16 GB card, Q4_K_S (12.1 GB) is the sweet spot. The ComfyUI Wiki Qwen-Image guide lists qwen-image-Q4_K_S.gguf at "56% VRAM" in its GGUF reference table. The wiki's native FP8 row is attributed to an "RTX4090D 24GB", but the GGUF row names no GPU, so treat 56% as a rough indicator for a comparable card rather than a precise 13.4 GB peak. On a 16 GB card this should leave enough headroom for the Qwen2.5-VL-7B text encoder, the VAE, and activations without offload. Q4_K_M (13.1 GB file) also fits if you want a slight quality bump; Q5 quants typically need text-encoder offload to avoid OOM.
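The headroom reasoning can be made concrete with the numbers from the size table. The sketch below uses file size as a rough proxy for resident weight size, which is an assumption, and it converts the wiki's 56% figure to GB on the further assumption that it was measured on a 24 GB card. The function names are illustrative.

```python
CARD_VRAM_GB = 16.0

# File sizes in GB from the city96 table quoted in this recipe.
QUANT_SIZES_GB = {
    "Q4_0": 11.9,
    "Q4_K_S": 12.1,
    "Q4_K_M": 13.1,
    "Q5_K_S": 14.1,
    "Q5_K_M": 14.9,
}


def headroom_gb(quant: str, card_vram_gb: float = CARD_VRAM_GB) -> float:
    """VRAM left after the diffusion weights, using file size as a rough proxy."""
    return round(card_vram_gb - QUANT_SIZES_GB[quant], 1)


def percent_to_gb(percent: float, card_vram_gb: float) -> float:
    """Convert a reported VRAM percentage into GB for a given card size."""
    return round(card_vram_gb * percent / 100, 2)


for quant in QUANT_SIZES_GB:
    print(f"{quant}: {headroom_gb(quant)} GB left on a {CARD_VRAM_GB:.0f} GB card")

# Assumption: the wiki's 56% reading came from a 24 GB card.
print(percent_to_gb(56, 24.0))
```

The headroom figure ignores the text encoder, the VAE, and activations, so it is an upper bound on what is left for them, not a prediction of peak usage.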

Place the file at ComfyUI/models/diffusion_models/qwen-image-Q4_K_S.gguf (the destination is per the city96 model card).
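If you prefer scripting the download, the huggingface_hub package can fetch the file straight into the target folder. This is a hedged sketch: it assumes huggingface_hub is installed, and it takes the repository and file names exactly as quoted in this recipe, so confirm both against the model card before running. The helper names are hypothetical.

```python
from pathlib import Path

REPO_ID = "city96/Qwen-Image-gguf"    # repository named in this recipe
FILENAME = "qwen-image-Q4_K_S.gguf"   # file name as quoted here; confirm on the model card


def diffusion_models_dir(comfy_root: str) -> Path:
    """Folder where ComfyUI looks for diffusion model weights."""
    return Path(comfy_root) / "models" / "diffusion_models"


def download_quant(comfy_root: str) -> str:
    """Download the quantized weights into the ComfyUI model folder."""
    # Imported lazily so this file still loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    target = diffusion_models_dir(comfy_root)
    target.mkdir(parents=True, exist_ok=True)
    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=str(target))
```

Calling download_quant("ComfyUI") from the directory that contains your ComfyUI folder starts a roughly 12 GB transfer, so nothing is downloaded until you call it explicitly.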

4. Download the text encoder and VAE

These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial:

ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
ComfyUI/models/vae/qwen_image_vae.safetensors

Both download from the official Comfy-Org HuggingFace mirror linked from the tutorial.

5. (Recommended) Download the Lightning LoRA

To cut per-image latency by sampling in 8 steps instead of 50, install the official Lightning LoRA:

ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors

The 8-step variant is the ComfyUI tutorial's recommended acceleration LoRA for this model.
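Before loading the workflow, it can help to confirm that every file from steps 3 to 5 sits where ComfyUI expects it. The sketch below checks the paths given in this recipe relative to a ComfyUI root; the helper name missing_files is hypothetical, and the LoRA entry is optional.

```python
from pathlib import Path

# Paths relative to the ComfyUI root, as listed in steps 3 to 5.
EXPECTED_FILES = [
    "models/diffusion_models/qwen-image-Q4_K_S.gguf",
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "models/vae/qwen_image_vae.safetensors",
    "models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors",  # optional
]


def missing_files(comfy_root: str) -> list[str]:
    """Return the expected files that are not present under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


for rel in missing_files("ComfyUI"):
    print("missing:", rel)
```

An empty result means all four files are present; a missing text encoder is worth fixing first, since it is also the usual suspect for black-image output.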

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.
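To see what a workflow file wires up before dragging it onto the canvas, you can list its node types. This sketch assumes the file is a UI-format ComfyUI export with a top-level "nodes" list whose entries carry a "type" field; that layout is an assumption about the export format, and the node names in the example are made up for illustration.

```python
import json
from collections import Counter


def node_types(workflow: dict) -> Counter:
    """Count node types in a UI-format ComfyUI workflow dict."""
    return Counter(node.get("type", "?") for node in workflow.get("nodes", []))


def load_workflow(path: str) -> dict:
    """Read a workflow JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Tiny synthetic stand-in for a real workflow file; the type names are illustrative.
example = {
    "nodes": [
        {"id": 1, "type": "LoaderA"},
        {"id": 2, "type": "SamplerB"},
        {"id": 3, "type": "LoaderA"},
    ]
}
print(node_types(example))
```

For the real file, node_types(load_workflow("qwen-image_workflow.json")) should show a GGUF loader entry; if it does not, the workflow is probably not the GGUF variant.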

Running

Launch ComfyUI from the ComfyUI root (portable build users can run the bundled launcher instead):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the safetensors / GGUF load into VRAM; subsequent runs reuse the in-memory model.
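Once the server is up, you can read its view of VRAM without opening the browser. The sketch below assumes the running ComfyUI build exposes a /system_stats endpoint that returns a "devices" list with "name", "vram_total", and "vram_free" fields in bytes. Treat the endpoint and field names as assumptions to verify against your build.

```python
import json
import urllib.request


def fetch_system_stats(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Fetch the stats JSON from a running ComfyUI server."""
    with urllib.request.urlopen(f"{base_url}/system_stats", timeout=5) as resp:
        return json.load(resp)


def vram_summary(stats: dict) -> list[str]:
    """Format free and total VRAM per device, converting bytes to decimal GB."""
    lines = []
    for dev in stats.get("devices", []):
        total_gb = dev.get("vram_total", 0) / 1e9
        free_gb = dev.get("vram_free", 0) / 1e9
        lines.append(f"{dev.get('name', 'unknown')}: {free_gb:.1f} GB free of {total_gb:.1f} GB")
    return lines
```

With the server running, print("\n".join(vram_summary(fetch_system_stats()))) before and after the first generation gives a rough picture of how much of the 16 GB the loaded model occupies.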

Unlike Blackwell GPUs (sm_120), the 4060 Ti needs no special CUDA wheel selection: the default pip install torch already supports sm_89, and standard ComfyUI installs work out of the box on Ada Lovelace cards.
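To confirm which architecture PyTorch sees, you can query the device's compute capability. This is a small sketch with hypothetical helper names; torch is imported inside the function so the rest of the file loads on machines without it.

```python
def arch_name(capability: tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an sm_XY label."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_gpu() -> str:
    """Describe the first CUDA device PyTorch can see, if any."""
    import torch  # imported lazily; requires a PyTorch install

    if not torch.cuda.is_available():
        return "CUDA not available"
    capability = torch.cuda.get_device_capability(0)
    return f"{torch.cuda.get_device_name(0)}: {arch_name(capability)}"
```

On an RTX 4060 Ti, report_gpu() should end in sm_89; a "CUDA not available" result usually points to a CPU-only torch build or a driver problem, not to ComfyUI.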

Results

VRAM usage: Q4_K_S diffusion weights are 12.1 GB on disk per the city96 file table; on a 16 GB card the diffusion model plus the Qwen2.5-VL-7B text encoder plus the VAE plus activations should fit without forced offload. The ComfyUI Wiki guide reports the Q4_K_S workflow at 56% VRAM in its reference table (GPU not attributed on that row), consistent with a sub-15 GB peak. See /check/qwen-image/rtx-4060-ti-16gb for the live measurement once a community benchmark lands.
Speed: no 4060 Ti 16GB-specific figure that clearly applies to this variant was found. The public sandner.art Qwen-Image-and-Edit walkthrough covers both Qwen-Image and Qwen-Image-Edit-2509, and its inference-speed table does not say which variant the figures came from, so quoting it here would be unsafe. Check /check/qwen-image/rtx-4060-ti-16gb for an empirical number once a community submission seeds the benchmark.
Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README); 4-bit quants are widely reported to retain most of that capability, with Q5/Q6 giving incremental quality at proportional VRAM cost.

For the full benchmark data, see /check/qwen-image/rtx-4060-ti-16gb.

Troubleshooting

`qwen_image_fp8_e4m3fn.safetensors` OOMs on load

The FP8 build is 20.4 GB (ComfyUI tutorial) — larger than a 16 GB card's total VRAM. Switch to the GGUF path described above; the FP8 build is for 24 GB+ cards only.

Q5/Q6 GGUF fits on disk but OOMs at runtime

Q5_K_S is 14.1 GB on disk (city96 table); with the Qwen2.5-VL-7B text encoder and VAE also resident, peak VRAM can exceed 16 GB. Drop to Q4_K_S / Q4_K_M, or enable text-encoder CPU offload in ComfyUI.

`Unet Loader (GGUF)` node missing

You skipped step 2. Install ComfyUI-GGUF — the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.

Generation produces a black image

This has been reported in the city96/Qwen-Image-gguf discussions. The most common cause is a mismatched or missing text encoder file: double-check that qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and is the FP8-scaled variant linked from the official ComfyUI tutorial.

Long generation times without Lightning LoRA

Without the Lightning LoRA at 50 steps, generation can take several minutes per image even on a 16 GB Ada card. The 8-step Lightning LoRA from the official ComfyUI tutorial materially reduces per-image latency; load it in a LoRA loader node placed between the model loader and the sampler. If Lightning is loaded but generation is still slow, confirm that the LoRA node is actually connected in the model chain and that the sampler's step count is set to 8.
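The expected gain from Lightning can be estimated from the step counts alone. The sketch below assumes every sampler step costs about the same and ignores model loading, text encoding, and VAE decode, so it gives an upper bound on the speedup, not a measurement.

```python
def step_ratio(baseline_steps: int, accelerated_steps: int) -> float:
    """How many times fewer sampler steps the accelerated run performs."""
    return baseline_steps / accelerated_steps


# 50 steps without the LoRA versus 8 steps with it, as described in this recipe.
print(step_ratio(50, 8))
```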