How much VRAM does Qwen-Image need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

Qwen-Image on RTX 3060: 20B Text-to-Image via ComfyUI GGUF Q3 (Ampere sm_86, 12 GB)

What You'll Build

A local Qwen-Image text-to-image setup on a 12 GB RTX 3060 (Ampere, GA106, sm_86). Qwen-Image is Alibaba Tongyi Lab's "20B MMDiT image foundation model" (GitHub README), released under Apache 2.0, with strong text rendering in English and Chinese. The 12 GB envelope is the binding constraint: the official FP8 diffusion build is 20.43 GB on disk and the BF16 build is 40.87 GB, both far past a 12 GB card. Even the Q4_K_M GGUF build that fits a 16 GB card (13.07 GB on disk, verified via the HF tree API) exceeds the ~11 GB usable on a 12 GB card with a display attached. This recipe leads with city96's Q3_K_M GGUF build (9.68 GB on disk), loaded through the ComfyUI-GGUF custom node. city96 built the low-bitrate tiers with high-precision first and last layers so that they stay usable. GGUF is the only path that fits the 3060, and because Ampere has no FP8 tensor-core support, the FP8 builds would bring no speed advantage even if they did fit (see the FP8 note under Troubleshooting).

Hardware data: RTX 3060 (12 GB GDDR6 VRAM, Ampere GA106, sm_86) · 20B-parameter MMDiT at 3-bit GGUF · See benchmark data

⚠️ Why Q3, not Q4, on this card? A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable VRAM. The 16 GB-class GGUF path leads with Q4_K_M (13.07 GB on disk); on a 12 GB card that quant's diffusion-stage residency leaves no margin once the VAE and activations are counted. Dropping to Q3_K_M (9.68 GB) — or Q3_K_S (8.95 GB) for tighter headroom — keeps the diffusion model well under the 12 GB envelope. Per the city96 GGUF model card, the Q3_K_M, Q3_K_S and Q2_K tiers "use a new dynamic logic where the first/last layer is kept in high precision", with the card noting that "even Q2_K remains somewhat usable" (a per-tier quality comparison is linked from the card).

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (NVIDIA, CUDA-capable)	RTX 3060 (12 GB GDDR6, Ampere GA106, sm_86)
RAM	32 GB system RAM recommended for text-encoder offload	—
Storage	~10 GB for Q3_K_M diffusion weights, ~20 GB total with encoder + VAE (9.68 + 9.38 + 0.25 ≈ 19.3 GB)	—
Software	ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node	—

The RTX 3060 12 GB is the Ampere GA106 die (3584 CUDA cores, 112 third-gen Tensor cores, 360 GB/s GDDR6 over a 192-bit bus, 170 W; specs per TechPowerUp's GPU database). The 20B parameter count is stated on the Qwen-Image GitHub README ("20B MMDiT image foundation model") and the city96 GGUF card. At BF16 the diffusion weights alone are 40.87 GB (qwen-image-BF16.gguf, verified via the HF tree API on 2026-06-14), which is why a 12 GB card requires a low-bitrate GGUF quant.
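As a rough cross-check of these sizes, the effective bits per weight of a build can be backed out of the on-disk figures quoted in this recipe. The sketch below assumes decimal gigabytes, ignores GGUF metadata overhead, and infers the parameter count from the BF16 file at 2 bytes per weight; the helper names are illustrative and not part of any tool.

```python
# Back-of-envelope: effective bits per weight from on-disk sizes.
# ASSUMPTIONS (not from the model card): sizes are decimal GB and
# file metadata is negligible next to the weights.

BF16_GB = 40.87      # qwen-image-BF16.gguf, as quoted above
Q3_K_M_GB = 9.68     # qwen-image-Q3_K_M.gguf, as quoted above


def params_from_bf16(size_gb: float) -> float:
    """BF16 stores 2 bytes per weight, so params = bytes / 2."""
    return size_gb * 1e9 / 2


def bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average storage cost of one weight, in bits."""
    return size_gb * 1e9 * 8 / n_params


n_params = params_from_bf16(BF16_GB)
print(f"implied parameters: {n_params / 1e9:.1f}B")
print(f"Q3_K_M: {bits_per_weight(Q3_K_M_GB, n_params):.2f} bits/weight")
```

Under these assumptions the BF16 file implies about 20.4B parameters, consistent with the "20B" label, and Q3_K_M averages about 3.8 bits per weight. The figure sits above a nominal 3 bits, presumably because some tensors are stored at higher precision.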

Installation

1. Update ComfyUI

Pull the latest ComfyUI build — Qwen-Image has had native ComfyUI support since release, and the official Qwen-Image tutorial walks through the stock workflow ("Make sure your ComfyUI is updated.").

The RTX 3060 is Ampere (sm_86), so the default pip install torch wheel already ships sm_86 kernels and a standard ComfyUI install works out of the box. Nothing here requires cu128: the cu121 and cu124 PyTorch wheels are fine, and cu128 also works because newer CUDA builds still support Ampere. Unlike Blackwell GPUs (sm_120), no special CUDA wheel selection is required.
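To confirm that the installed PyTorch build actually targets the card, a short check along the following lines may help. It is a sketch only: sm_tag, wheel_covers and report are hypothetical helper names, and torch is imported lazily so the script still runs (and says so) where PyTorch is missing.

```python
# Sanity-check that the installed PyTorch build can target this GPU.
# torch is imported lazily so the helpers work without it installed.

def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an arch tag, e.g. (8, 6) -> 'sm_86'."""
    return f"sm_{major}{minor}"


def wheel_covers(arch_list: list[str], major: int, minor: int) -> bool:
    """True if the wheel lists kernels for exactly this arch tag."""
    return sm_tag(major, minor) in arch_list


def report() -> str:
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed but no CUDA device is visible"
    major, minor = torch.cuda.get_device_capability(0)
    ok = wheel_covers(torch.cuda.get_arch_list(), major, minor)
    name = torch.cuda.get_device_name(0)
    return f"{name}: {sm_tag(major, minor)}, kernels in wheel: {ok}"


print(report())
```

On an RTX 3060 the capability should come back as sm_86. A False for the wheel check is not necessarily fatal, since CUDA binaries built for an earlier minor revision of the same major architecture can also run, but it is a cue to reinstall a standard wheel.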

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files, so install city96's loader. From the directory that contains your ComfyUI folder (the paths below are relative to it):

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a Q3 quant that fits the 12 GB envelope. The full per-quant file sizes (verified via the HF tree API on 2026-06-14) include:

Quant	Size on disk	Fits 12 GB?
Q2_K	7.06 GB	Yes (most aggressive)
Q3_K_S	8.95 GB	Yes
Q3_K_M	9.68 GB	Yes (recommended)
Q4_0	11.85 GB	No (too tight with VAE + activations)
Q4_K_S	12.14 GB	No
Q4_K_M	13.07 GB	No (this is the 16 GB-card path)
Q8_0	21.76 GB	No
BF16	40.87 GB	No

For a 12 GB card, Q3_K_M (9.68 GB) is the recommended starting point — it leaves the diffusion stage with workable margin under the ~11 GB usable once ComfyUI offloads the text encoder. Q3_K_S (8.95 GB) is the lower-VRAM fallback; Q2_K (7.06 GB) is the most aggressive tier if you also drive a heavy desktop.
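The fit column can be reproduced with simple arithmetic. The sketch below is a planning aid, not a measurement: the quant sizes come from the table, the 0.25 GB VAE size is the one quoted in step 4, and the 1.0 GB activation allowance is a placeholder assumption that will vary with resolution and batch size.

```python
# Which quants fit a 12 GB card? A planning sketch, not a measurement.
# Sizes (GB on disk) are the ones listed in the table above.
QUANT_GB = {
    "Q2_K": 7.06, "Q3_K_S": 8.95, "Q3_K_M": 9.68, "Q4_0": 11.85,
    "Q4_K_S": 12.14, "Q4_K_M": 13.07, "Q8_0": 21.76, "BF16": 40.87,
}

VAE_GB = 0.25          # qwen_image_vae.safetensors
ACTIVATIONS_GB = 1.0   # ASSUMPTION: placeholder, varies with resolution
USABLE_GB = 11.0       # approximate usable VRAM with a display attached


def diffusion_stage_gb(quant: str) -> float:
    """Weights + VAE + activation allowance for the diffusion stage."""
    return QUANT_GB[quant] + VAE_GB + ACTIVATIONS_GB


def fitting_quants(usable_gb: float = USABLE_GB) -> list[str]:
    """Quants whose estimated diffusion-stage footprint stays under the budget."""
    return [q for q in QUANT_GB if diffusion_stage_gb(q) <= usable_gb]


for quant in QUANT_GB:
    need = diffusion_stage_gb(quant)
    verdict = "fits" if need <= USABLE_GB else "does not fit"
    print(f"{quant:7s} {need:6.2f} GB  {verdict}")
```

With these placeholder numbers only Q2_K, Q3_K_S and Q3_K_M come in under 11 GB, matching the table. Lowering USABLE_GB to model a heavier desktop shows how quickly Q3_K_M loses its margin.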

# From the directory that contains your ComfyUI folder, into ComfyUI/models/diffusion_models/
wget https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q3_K_M.gguf \
  -O ComfyUI/models/diffusion_models/qwen-image-Q3_K_M.gguf

The destination folder is per the city96 model card.
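Where wget is not available (for example on a Windows portable build), the same download can be sketched with Python's standard library. resolve_url and download are hypothetical helpers written for this example; the URL pattern simply mirrors the wget command above, and the actual fetch is left commented out because the file is 9.68 GB.

```python
# A wget alternative using only the standard library (a sketch).
import urllib.request
from pathlib import Path

HF_BASE = "https://huggingface.co"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the same /resolve/ URL that the wget command above uses."""
    return f"{HF_BASE}/{repo_id}/resolve/{revision}/{filename}"


def download(url: str, dest: Path) -> Path:
    """Fetch url to dest, creating parent folders; skip if already present."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        tmp = dest.with_name(dest.name + ".part")
        urllib.request.urlretrieve(url, tmp)  # redirects are followed
        tmp.rename(dest)                      # only a finished file gets the final name
    return dest


url = resolve_url("city96/Qwen-Image-gguf", "qwen-image-Q3_K_M.gguf")
target = Path("ComfyUI/models/diffusion_models/qwen-image-Q3_K_M.gguf")
print(url, "->", target)
# To actually fetch the 9.68 GB file, call: download(url, target)
```

Writing to a .part file first means an interrupted transfer never leaves a truncated file under the final name, which is one less cause of confusing load errors later.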

Avoid the IQ-quants on this loader. Per city96's own response on ComfyUI-GGUF issue #255: "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has "Q" instead of "IQ" in the name)."

4. Download the text encoder and VAE

Per the official ComfyUI tutorial, the stock workflow uses the FP8-scaled Qwen2.5-VL-7B text encoder and the Qwen-Image VAE:

# Text encoder (FP8 scaled, 9.38 GB) → ComfyUI/models/text_encoders/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors

# VAE (0.25 GB) → ComfyUI/models/vae/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
  -O ComfyUI/models/vae/qwen_image_vae.safetensors

Both file sizes are verified via the Comfy-Org repackager tree on 2026-06-14. ComfyUI runs the text encoder first to produce conditioning, then frees it before loading the diffusion model, so the 9.38 GB encoder and the Q3_K_M diffusion weights are not both fully resident at the same peak — this is the same default offload behaviour that lets the 13.07 GB Q4_K_M quant fit a 16 GB card.

Note that the encoder file is named qwen_2.5_vl_7b_fp8_scaled.safetensors, but "FP8" here refers only to the storage format. The RTX 3060's Ampere tensor cores have no FP8 support, so ComfyUI dequantizes the encoder to BF16/FP16 at compute time. The file is still smaller on disk (and in VRAM) than the 16.58 GB BF16 encoder, which is what makes it the right choice on a 12 GB card; you simply don't get FP8 compute acceleration (see the Troubleshooting note).
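Before launching, it can save time to confirm that the three files from steps 3 and 4 landed in the folders ComfyUI reads. The following is a minimal pre-flight sketch: preflight is a hypothetical helper, the expected sizes are the on-disk figures quoted in this recipe (taken as decimal GB), and the 10% tolerance is an arbitrary assumption meant only to catch truncated downloads.

```python
# Pre-flight check: are the three model files where ComfyUI expects them?
from pathlib import Path

# Relative paths and on-disk sizes (GB) as quoted in steps 3 and 4.
EXPECTED = {
    "models/diffusion_models/qwen-image-Q3_K_M.gguf": 9.68,
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors": 9.38,
    "models/vae/qwen_image_vae.safetensors": 0.25,
}


def preflight(comfy_root: Path, tolerance: float = 0.10) -> list[str]:
    """Return human-readable problems; an empty list means all files look right."""
    problems = []
    for rel, want_gb in EXPECTED.items():
        path = comfy_root / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
            continue
        have_gb = path.stat().st_size / 1e9
        # ASSUMPTION: a file more than 10% off is probably a partial download.
        if abs(have_gb - want_gb) > tolerance * want_gb:
            problems.append(
                f"size mismatch: {rel} is {have_gb:.2f} GB, expected ~{want_gb} GB"
            )
    return problems


for line in preflight(Path("ComfyUI")) or ["all model files present"]:
    print(line)
```

Run it from the directory that contains the ComfyUI folder. If you swap in the GGUF text encoder from step 5 or a different quant, adjust the EXPECTED entries to match.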

5. (Optional, for tight display headroom) Use a GGUF text encoder

If your 12 GB card also drives a desktop and you see OOM during the text-encode stage, ComfyUI-GGUF can load the text encoder itself as a GGUF quant via its CLIPLoader (gguf) nodes — the ComfyUI-GGUF README notes these "can be used inplace of the regular ones." Swap the 9.38 GB FP8 encoder for a smaller Qwen2.5-VL-7B GGUF (e.g. the 4.68 GB Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf from unsloth's Qwen2.5-VL-7B-Instruct-GGUF, linked from the city96 card's text-encoder row):

wget https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  -O ComfyUI/models/text_encoders/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf

The RTX 3060 has a PCIe Gen4 x16 host link, so when the text encoder is offloaded to system RAM, encoder-stage transfers run at the same link bandwidth as on a 12 GB Ada card such as the 4070. Offload behaviour on this stage therefore matches a 4070-class card, and there is no Gen5→Gen4 penalty to worry about.
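To put a number on that link, the theoretical one-direction bandwidth of PCIe 4.0 x16 follows from 16 GT/s per lane, 16 lanes, and 128b/130b line coding. The sketch below turns that into a lower bound on the time needed to move each encoder file across the bus. Real transfers are slower, so treat the output as a floor rather than a benchmark; the function names are illustrative.

```python
# Theoretical floor for moving the offloaded encoder across PCIe Gen4 x16.
# Real transfers are slower (protocol overhead, pageable host memory), so
# this is a lower bound on time, not a measurement.

GT_PER_S = 16.0          # PCIe 4.0 raw rate per lane, gigatransfers/s
LANES = 16
ENCODING = 128 / 130     # 128b/130b line coding efficiency


def pcie_gb_per_s(gt_per_s: float = GT_PER_S, lanes: int = LANES) -> float:
    """One-direction bandwidth in GB/s (each transfer carries 1 bit per lane)."""
    return gt_per_s * lanes * ENCODING / 8


def transfer_seconds(size_gb: float) -> float:
    """Minimum time to move size_gb across the link at theoretical bandwidth."""
    return size_gb / pcie_gb_per_s()


print(f"link: {pcie_gb_per_s():.1f} GB/s")
for name, gb in [("FP8 encoder", 9.38), ("Q4_K_M GGUF encoder", 4.68)]:
    print(f"{name}: >= {transfer_seconds(gb):.2f} s")
```

At roughly 31.5 GB/s the 9.38 GB encoder needs at least about 0.3 s per transfer, so the bus itself is unlikely to dominate generation time; the practical cost of offload is usually elsewhere.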

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE. Point the Unet Loader at your qwen-image-Q3_K_M.gguf file.

Running

Launch ComfyUI from inside the ComfyUI folder (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
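Generations can also be queued without the browser, which is handy for timing repeated runs. The sketch below posts a workflow to the server's /prompt endpoint, following the pattern used in ComfyUI's bundled API example scripts. It assumes the workflow has been exported in API format from the ComfyUI interface (the canvas JSON is a different layout), and workflow_api.json is a hypothetical filename.

```python
# Queue a generation over ComfyUI's HTTP API instead of the browser (a sketch).
import json
import urllib.request

SERVER = "http://127.0.0.1:8188"  # matches the --listen/--port flags above


def build_queue_request(workflow: dict, server: str = SERVER) -> urllib.request.Request:
    """Wrap an API-format workflow in the POST /prompt payload."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def queue(workflow_path: str) -> dict:
    """Send the workflow to a running server and return its JSON reply."""
    with open(workflow_path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    with urllib.request.urlopen(build_queue_request(workflow)) as resp:
        return json.load(resp)


# Example (needs a running server and an API-format export of the workflow;
# "workflow_api.json" is a hypothetical filename):
# print(queue("workflow_api.json"))
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the server.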

ComfyUI's Qwen-Image workflow uses SDPA (scaled_dot_product_attention) by default, which works on the RTX 3060 out of the box. Ampere sm_86 also has full prebuilt FlashAttention-2 kernel coverage (FA2 has shipped sm_86 wheels since its 2.x releases), so if a custom node or fork explicitly enables FA2 it runs fine on this card — no eager/sdpa override is needed (that override only applies to Blackwell sm_120 cards, which still lack FA2 kernels).

Results

Speed: no first-party RTX 3060 Qwen-Image benchmark has been published for the GGUF ComfyUI workflow, and our benchmark backend has no measured row for this model and GPU pair, so we deliberately omit a speed number rather than extrapolate one from a faster card. The RTX 3060 is the weakest of the 12 GB-class cards (360 GB/s memory bandwidth and 3584 CUDA cores, well below a 4070's 504 GB/s and 5888 cores), so a 4070 or 3090 figure would be a loose upper bound on throughput only and is not carried over here. Expect generation to be noticeably slower than on those cards. One community RTX 3090 user on QwenLM/Qwen-Image#260 reports ~20 minutes per image at 40 steps on the BF16 path, which does not fit in VRAM and falls back to CPU offload; the GGUF path in this recipe avoids that CPU-offload thrash, but no clean 3060 GGUF timing is published. Check /check/qwen-image/rtx-3060 for an empirical figure once a community benchmark lands, and please contribute your own measurement.

VRAM usage (derived envelope): the Q3_K_M diffusion weights are 9.68 GB on disk per the city96 file table. With the text encoder offloaded by ComfyUI's default memory management, the diffusion-stage resident set (transformer + 0.25 GB VAE + activations) comes to roughly 10–11 GB, inside the ~11 GB usable on a 12 GB card. This is an envelope derived from on-disk sizes, not a measured peak; see /check/qwen-image/rtx-3060 and report a measured figure via /contribute.

Quality notes: Qwen-Image is positioned as a strong text-rendering model in English and Chinese (GitHub README). Q3 is a meaningful step below Q4/Q5 in fidelity, but city96 mitigates this for exactly these low-bitrate tiers: per the model card, Q3_K_M / Q3_K_S / Q2_K "use a new dynamic logic where the first/last layer is kept in high precision", with the card stating "even Q2_K remains somewhat usable". Expect slightly softer fine detail and occasional text-rendering slips versus the Q4_K_M build a 16 GB card runs; if you have headroom, Q3_K_M is the better-quality choice of the two Q3 tiers that fit.

For the full benchmark data, see /check/qwen-image/rtx-3060.
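Since the VRAM figure above is derived rather than measured, one way to collect a real peak is to poll nvidia-smi while a generation runs. The sketch below uses nvidia-smi's --query-gpu=memory.used option with CSV output; parse_used_mib and sample_peak_mib are hypothetical helpers, and the reading covers the whole GPU, including whatever the desktop is using.

```python
# Sample GPU memory while a generation runs, to replace the derived envelope
# with a measured peak. Requires nvidia-smi on PATH.
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output: str) -> list[int]:
    """nvidia-smi prints one integer (MiB) per GPU, one GPU per line."""
    return [int(line) for line in output.splitlines() if line.strip()]


def sample_peak_mib(seconds: float = 60.0, interval: float = 0.5, gpu: int = 0) -> int:
    """Poll nvidia-smi and return the highest memory.used seen on one GPU."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        peak = max(peak, parse_used_mib(out)[gpu])
        time.sleep(interval)
    return peak


# Start a generation in ComfyUI, then in another terminal run e.g.:
# print(sample_peak_mib(seconds=120) / 1024, "GiB peak (whole GPU, incl. desktop)")
```

Polling every half second can miss very short spikes, so a reported peak is best read as a lower bound on the true maximum.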

Troubleshooting

Generation runs ~20 minutes per image with `transformer dispatched to CPU` log lines

Reported on the canonical repo at QwenLM/Qwen-Image#260 by a user running num_inference_steps=40 on an RTX 3090 and seeing the transformer repeatedly offloaded to CPU. This is the diffusers BF16 path falling back to CPU offload because the full 40.87 GB BF16 weights don't fit — and the RTX 3060's 12 GB makes it worse, not better, than the reporter's 24 GB 3090. The same sm_86 Ampere architecture is shared by both cards, so this CPU-offload thrash applies directly here. The fix is to use the ComfyUI-GGUF path described above — Q3_K_M (9.68 GB) keeps the entire diffusion stack on-GPU on the 3060, eliminating per-step CPU swap. The issue thread is open with no maintainer response as of 2026-06-14, so treat the reporter's number as a community datapoint, not an official benchmark.

Why not just run the FP8 native build? FP8 brings no speed-up on Ampere

The RTX 3060's Ampere (sm_86) tensor cores have no FP8 (E4M3/E5M2) support; FP8 compute first shipped on Hopper (sm_90) and consumer Ada (sm_89). So even setting aside file size, FP8 is a memory format here, not a speed feature: an FP8 weight loads, but the runtime dequantizes it to BF16/FP16 at compute time. The official qwen_image_fp8_e4m3fn.safetensors is also 20.43 GB on disk (Comfy-Org repackage file tree), larger than the 3060's entire 12 GB of VRAM, so it OOMs on load regardless. (The NVFP4 single-file build is 19.77 GB and is in any case Blackwell-only; Ampere cannot run NVFP4.) Use the GGUF Q3 path above: it fits, and on Ampere skipping FP8 costs nothing because there is no FP8 acceleration to lose.

Q4_K_M or higher OOMs on a 12 GB card

Q4_K_M is 13.07 GB on disk and Q4_K_S is 12.14 GB (city96 file table) — both exceed the ~11 GB usable on a 12 GB card once the VAE and activations are added. These are the 16 GB-card quants. Use Q3_K_M (9.68 GB) or Q3_K_S (8.95 GB) as described above.

OOM during the text-encode stage

The FP8-scaled Qwen2.5-VL-7B text encoder is 9.38 GB (Comfy-Org file listing). On a 12 GB card driving a desktop, the brief text-encode stage can spike close to the usable ceiling. Switch to a GGUF text encoder via ComfyUI-GGUF's CLIPLoader (gguf) node (step 5): the 4.68 GB Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf cuts the encoder footprint roughly in half. If the encoder is offloaded to system RAM, the 3060's PCIe Gen4 x16 link gives it the same transfer bandwidth as a 12 GB Ada card.

`Unet Loader (GGUF)` node missing

You skipped step 2. Install ComfyUI-GGUF — the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.

Generation produces a black image

Reported across city96/Qwen-Image-gguf discussions. The most common cause is a mismatched or missing text encoder — double-check qwen_2.5_vl_7b_fp8_scaled.safetensors (or your GGUF substitute from step 5) is present in ComfyUI/models/text_encoders/.