Qwen-Image on RTX 4070: 20B Text-to-Image via ComfyUI GGUF Q3 (Ada sm_89, 12 GB)

image · intermediate · 12 GB+ VRAM · Jun 9, 2026
Prerequisites
  • NVIDIA RTX 4070 (12 GB GDDR6X VRAM, Ada Lovelace sm_89) or equivalent 12 GB-class card
  • Python 3.10+
  • ComfyUI installed and up to date
  • ~10 GB free disk space for the Q3_K_M diffusion model alone (~19.3 GB total with the 9.38 GB text encoder and 0.25 GB VAE)

What You'll Build

A local Qwen-Image text-to-image setup on a 12 GB RTX 4070 (Ada Lovelace, sm_89). Qwen-Image is Alibaba Tongyi Lab's "20B MMDiT image foundation model" (GitHub README), released under Apache 2.0. The 12 GB envelope is the binding constraint: the official FP8 diffusion build is 20.43 GB on disk and the BF16 build is 40.87 GB, both far past a 12 GB card. Even the Q4_K_M GGUF build that fits a 16 GB card (13.07 GB on disk, verified via the HF tree API) exceeds the ~11 GB usable on a 12 GB card with a display attached. This recipe therefore leads with city96's Q3_K_M GGUF build (9.68 GB on disk), loaded through the ComfyUI-GGUF custom node. Per the city96 model card, the Q3 and Q2 tiers keep the first and last layers in high precision so that the low-bitrate quants stay usable.

Hardware data: RTX 4070 (12 GB GDDR6X VRAM, Ada Lovelace, sm_89) · 20B-parameter MMDiT at 3-bit GGUF · See benchmark data

⚠️ Why Q3, not Q4, on this card? A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable VRAM. The 16 GB-class GGUF path leads with Q4_K_M (13.07 GB on disk); on a 12 GB card that quant's diffusion-stage residency leaves no margin once the VAE and activations are counted. Dropping to Q3_K_M (9.68 GB) — or Q3_K_S (8.95 GB) for tighter headroom — keeps the diffusion model well under the 12 GB envelope. Per the city96 GGUF model card, the Q3_K_M, Q3_K_S and Q2_K tiers "use a new dynamic logic where the first/last layer is kept in high precision", with the card noting that "even Q2_K remains somewhat usable" (a per-tier quality comparison is linked from the card).

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 12 GB VRAM (NVIDIA, CUDA-capable) | RTX 4070 (12 GB GDDR6X, Ada, sm_89) |
| RAM | 32 GB system RAM recommended for text-encoder offload | |
| Storage | ~10 GB for Q3_K_M diffusion weights, ~19.3 GB total with encoder + VAE | |
| Software | ComfyUI (current build), Python 3.10+, ComfyUI-GGUF custom node | |

The 20B parameter count is stated on the Qwen-Image GitHub README ("20B MMDiT image foundation model") and the city96 GGUF card. At BF16 the diffusion weights alone are 40.87 GB (qwen-image-BF16.gguf, verified via the HF tree API on 2026-06-09), which is why a 12 GB card requires a low-bitrate GGUF quant.
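As a rough cross-check of these sizes, the on-disk figures can be converted to average bits stored per weight. The sketch below assumes exactly 20e9 parameters (the nominal "20B" figure, so the true count may differ) and decimal gigabytes; both are assumptions, not values taken from the model card.

```python
# Rough effective bits per weight from on-disk size.
# Assumptions: 20e9 parameters (nominal "20B") and 1 GB = 1e9 bytes.
NOMINAL_PARAMS = 20e9

SIZES_GB = {
    "BF16": 40.87,
    "Q4_K_M": 13.07,
    "Q3_K_M": 9.68,
}

def bits_per_weight(size_gb: float, params: float = NOMINAL_PARAMS) -> float:
    """Convert a file size in decimal GB to average bits per parameter."""
    return size_gb * 1e9 * 8 / params

for name, size in SIZES_GB.items():
    print(f"{name}: {bits_per_weight(size):.2f} bits/weight")
```

Results a little above the nominal bit width are expected here, since the parameter count is rounded and some layers are stored at higher precision.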

Installation

1. Update ComfyUI

Pull the latest ComfyUI build — Qwen-Image has had native ComfyUI support since release, and the official Qwen-Image tutorial walks through the stock workflow.

Unlike Blackwell GPUs (sm_120), the RTX 4070 needs no special CUDA wheel selection. It is Ada Lovelace (sm_89), which the default pip install torch build already supports, so a standard ComfyUI install works out of the box. There is no cu128-only requirement here.
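To confirm this on your own machine, a small check along the following lines may help. It is a sketch rather than an official diagnostic: the helper names are hypothetical, and the coverage rule it applies (a binary built for the same major compute capability with an equal or lower minor version also runs on the card) is the general CUDA binary-compatibility rule, not something specific to ComfyUI.

```python
def sm_tag(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"

def build_covers(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """True if arch_list holds a binary for the same major version
    whose minor version is less than or equal to the GPU's."""
    major, minor = capability
    for arch in arch_list:
        digits = arch.removeprefix("sm_")
        if digits == arch or not digits.isdigit() or len(digits) < 2:
            continue  # skip PTX entries such as "compute_90" and suffixed tags
        if int(digits[:-1]) == major and int(digits[-1]) <= minor:
            return True
    return False

def report_gpu_support() -> str:
    """Describe whether the local torch build covers the first CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "torch is installed but CUDA is not available"
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    verdict = "covered" if build_covers(capability, arch_list) else "not covered"
    return f"{torch.cuda.get_device_name(0)} ({sm_tag(capability)}): {verdict} by {arch_list}"
```

Calling print(report_gpu_support()) from the Python environment that runs ComfyUI should report sm_89 as covered on an RTX 4070.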

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. From the directory that contains your ComfyUI folder:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
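A quick way to confirm the install took effect is sketched below. The function name is hypothetical, and the check only covers the two things the commands above do: create the custom node folder and install the gguf package into the current interpreter.

```python
from importlib.util import find_spec
from pathlib import Path

def check_gguf_node(parent_dir: str = ".") -> list[str]:
    """Return a list of problems; an empty list means the node looks installed.

    parent_dir is the directory that contains the ComfyUI folder.
    """
    problems = []
    node_dir = Path(parent_dir) / "ComfyUI" / "custom_nodes" / "ComfyUI-GGUF"
    if not node_dir.is_dir():
        problems.append(f"missing folder: {node_dir}")
    # find_spec inspects the interpreter running this script only.
    if find_spec("gguf") is None:
        problems.append("Python package 'gguf' is not importable")
    return problems
```

On the Windows portable build, run this with .\python_embeded\python.exe so that the package lookup inspects the same interpreter ComfyUI uses.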

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a Q3 quant that fits the 12 GB envelope. The full per-quant file sizes (verified via the HF tree API on 2026-06-09) include:

| Quant | Size on disk | Fits 12 GB? |
| --- | --- | --- |
| Q2_K | 7.06 GB | Yes (most aggressive) |
| Q3_K_S | 8.95 GB | Yes |
| Q3_K_M | 9.68 GB | Yes (recommended) |
| Q4_0 | 11.85 GB | No (too tight with VAE + activations) |
| Q4_K_S | 12.14 GB | No |
| Q4_K_M | 13.07 GB | No (this is the 16 GB-card path) |
| Q8_0 | 21.76 GB | No |
| BF16 | 40.87 GB | No |

For a 12 GB card, Q3_K_M (9.68 GB) is the recommended starting point — it leaves the diffusion stage with workable margin under the ~11 GB usable once ComfyUI offloads the text encoder. Q3_K_S (8.95 GB) is the lower-VRAM fallback; Q2_K (7.06 GB) is the most aggressive tier if you also drive a heavy desktop.
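The fit decision can be sketched as simple budget arithmetic. The quant sizes, the 0.25 GB VAE and the ~11 GB usable figure come from this recipe; the 1.0 GB activation margin is a placeholder assumption, not a measured value, so adjust it to your resolution and batch size.

```python
USABLE_VRAM_GB = 11.0        # approximate usable VRAM with a display attached
VAE_GB = 0.25                # Qwen-Image VAE size on disk
ACTIVATION_MARGIN_GB = 1.0   # assumption: placeholder for activations, not measured

QUANT_SIZES_GB = {
    "Q2_K": 7.06,
    "Q3_K_S": 8.95,
    "Q3_K_M": 9.68,
    "Q4_0": 11.85,
    "Q4_K_S": 12.14,
    "Q4_K_M": 13.07,
    "Q8_0": 21.76,
    "BF16": 40.87,
}

def fits(size_gb: float,
         usable_gb: float = USABLE_VRAM_GB,
         vae_gb: float = VAE_GB,
         margin_gb: float = ACTIVATION_MARGIN_GB) -> bool:
    """True if diffusion weights plus VAE plus margin stay within usable VRAM."""
    return size_gb + vae_gb + margin_gb <= usable_gb

fitting = [name for name, size in QUANT_SIZES_GB.items() if fits(size)]
print("Quants that fit:", fitting)
```

With these inputs Q3_K_M is the largest quant that clears the budget, which matches the table above.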

# From the directory that contains ComfyUI/, into ComfyUI/models/diffusion_models/
wget https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q3_K_M.gguf \
  -O ComfyUI/models/diffusion_models/qwen-image-Q3_K_M.gguf

The destination folder is per the city96 model card.
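A truncated download is an easy mistake with a file this large, so it may be worth comparing the size on disk against the listed 9.68 GB. The sketch below assumes the listed sizes are decimal gigabytes (1 GB = 1e9 bytes); the helper name and the 0.05 GB tolerance are hypothetical choices.

```python
from pathlib import Path

def size_matches(path: str, expected_gb: float, tolerance_gb: float = 0.05) -> bool:
    """Compare a file's size in decimal GB against an expected figure."""
    size_gb = Path(path).stat().st_size / 1e9
    return abs(size_gb - expected_gb) <= tolerance_gb

# Example call (path relative to the directory that contains ComfyUI/):
# size_matches("ComfyUI/models/diffusion_models/qwen-image-Q3_K_M.gguf", 9.68)
```

A file that comes out far smaller than expected usually points to an interrupted download rather than a unit mismatch.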

Avoid the IQ-quants on this loader. Per city96's own response on ComfyUI-GGUF issue #255: "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has "Q" instead of "IQ" in the name)." Of those two suggestions, only Q3_K_M fits a 12 GB card.

4. Download the text encoder and VAE

Per the official ComfyUI tutorial, the stock workflow uses the FP8-scaled Qwen2.5-VL-7B text encoder and the Qwen-Image VAE:

# Text encoder (FP8 scaled, 9.38 GB) → ComfyUI/models/text_encoders/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors

# VAE (0.25 GB) → ComfyUI/models/vae/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
  -O ComfyUI/models/vae/qwen_image_vae.safetensors

Both file sizes are verified via the Comfy-Org repackager tree on 2026-06-09. ComfyUI runs the text encoder first to produce conditioning, then frees it before loading the diffusion model, so the 9.38 GB encoder and the Q3_K_M diffusion weights are not both fully resident at the same peak — this is the same default offload behaviour that lets the 13.07 GB Q4_K_M quant fit a 16 GB card.
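The effect of that sequential loading can be illustrated by taking the peak as the largest single stage rather than the sum of all files. This is a weights-only sketch built from the on-disk sizes quoted in this recipe; it ignores activations and is not a measurement of ComfyUI's actual memory manager.

```python
ENCODER_GB = 9.38     # FP8-scaled Qwen2.5-VL-7B text encoder
DIFFUSION_GB = 9.68   # Q3_K_M diffusion weights
VAE_GB = 0.25         # Qwen-Image VAE

def peak_weights_gb(stages: list[list[float]]) -> float:
    """Peak resident weights when each stage is freed before the next loads."""
    return max(sum(stage) for stage in stages)

# Encoder runs first and is freed, then diffusion model and VAE are loaded.
sequential = peak_weights_gb([[ENCODER_GB], [DIFFUSION_GB, VAE_GB]])
# Hypothetical worst case: everything resident at once.
simultaneous = peak_weights_gb([[ENCODER_GB, DIFFUSION_GB, VAE_GB]])

print(f"sequential peak:   {sequential:.2f} GB")
print(f"simultaneous peak: {simultaneous:.2f} GB")
```

The sequential peak stays under the ~11 GB usable on this card, while the simultaneous case would not fit.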

5. (Optional, for tight display headroom) Use a GGUF text encoder

If your 12 GB card also drives a desktop and you see OOM during the text-encode stage, ComfyUI-GGUF can load the text encoder itself as a GGUF quant via its CLIPLoader (gguf) nodes — the ComfyUI-GGUF README notes these "can be used inplace of the regular ones." Swap the 9.38 GB FP8 encoder for a smaller Qwen2.5-VL-7B GGUF (e.g. the 4.68 GB Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf from unsloth's Qwen2.5-VL-7B-Instruct-GGUF, linked from the city96 card's text-encoder row):

wget https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  -O ComfyUI/models/text_encoders/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf

The RTX 4070 uses a PCIe Gen4 x16 host link. When the text encoder is offloaded to system RAM, moving it back is limited by Gen4 bandwidth (about half that of a Gen5 link), so that stage runs a little slower than on a PCIe Gen5 board, but the workflow still fits in VRAM.
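For a sense of scale, the transfer cost can be estimated from theoretical link rates. The peak figures below (about 31.5 GB/s for Gen4 x16 and about 63 GB/s for Gen5 x16, per direction) are standard PCIe numbers rather than values from this recipe, and real-world throughput is lower, so treat the output as a lower bound on transfer time.

```python
# Theoretical per-direction peak bandwidth for an x16 link, in GB/s.
PCIE_X16_PEAK_GBPS = {"Gen4": 31.5, "Gen5": 63.0}

ENCODER_GB = 9.38  # FP8-scaled text encoder size from this recipe

def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Best-case time to move size_gb across a link of link_gbps."""
    return size_gb / link_gbps

for gen, rate in PCIE_X16_PEAK_GBPS.items():
    print(f"{gen} x16: {transfer_seconds(ENCODER_GB, rate):.2f} s (best case)")
```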

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE. Point the Unet Loader at your qwen-image-Q3_K_M.gguf file.

Running

With ComfyUI launched from the ComfyUI root (or the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
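Generations can also be queued without the browser by posting a workflow to the server's /prompt endpoint. The sketch below assumes the workflow has been exported from ComfyUI in API format; the helper names and the workflow_api.json filename are hypothetical, and the host and port match the launch command above.

```python
import json
from urllib import request

def build_request(workflow: dict, host: str = "127.0.0.1", port: int = 8188) -> request.Request:
    """Wrap an API-format workflow in a POST request for ComfyUI's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return request.Request(
        f"http://{host}:{port}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def queue_workflow(path: str) -> str:
    """Load a workflow JSON file and submit it; returns the raw server response."""
    with open(path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    with request.urlopen(build_request(workflow)) as resp:
        return resp.read().decode("utf-8")

# Example (requires a running ComfyUI server):
# print(queue_workflow("workflow_api.json"))
```

A workflow saved in the regular canvas format uses a different JSON layout, so the server is likely to reject it unless it is re-exported in API format first.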

ComfyUI's Qwen-Image workflow uses SDPA (scaled_dot_product_attention) by default, which works on the RTX 4070 out of the box. The Ada sm_89 architecture also has full prebuilt FlashAttention-2 kernel coverage, so if a custom node or fork explicitly enables FA2 it runs fine on this card — no eager/sdpa override is needed (that override only applies to Blackwell sm_120 cards, which still lack FA2 kernels).

Results

  • Speed: no first-party RTX 4070 Qwen-Image benchmark has been published for the GGUF ComfyUI workflow yet, and the backend has no measured row for this pair — so we deliberately omit a speed number rather than forward-extrapolate one from a different card. The RTX 4070 has ~30% fewer CUDA cores (5888 vs 8448) and ~25% less memory bandwidth (504 vs 672 GB/s) than the 4070 Ti SUPER, well outside the ~10% close-sibling band, so a 4070 Ti SUPER figure would only be a loose upper bound and is not transferred here. Check /check/qwen-image/rtx-4070 for an empirical figure once a community benchmark lands, and please contribute your own measurement.
  • VRAM usage (derived envelope): the Q3_K_M diffusion weights are 9.68 GB on disk per the city96 file table. With the text encoder offloaded by ComfyUI's default memory management, the diffusion-stage resident set (transformer + 0.25 GB VAE + activations) is estimated at roughly 10–11 GB, inside the ~11 GB usable on a 12 GB card. This is an envelope derived from on-disk sizes, not a measured peak; see /check/qwen-image/rtx-4070 and report a measured figure via /contribute.
  • Quality notes: Qwen-Image is positioned as a strong text-rendering model in English and Chinese (GitHub README). Q3 is a meaningful step below Q4/Q5 in fidelity, but city96 mitigates this for exactly these low-bitrate tiers: per the model card, Q3_K_M / Q3_K_S / Q2_K "use a new dynamic logic where the first/last layer is kept in high precision", with the card stating "even Q2_K remains somewhat usable". Expect slightly softer fine detail and occasional text-rendering slips versus the Q4_K_M build a 16 GB card runs; if you have the headroom, Q3_K_M is the higher-quality of the two Q3 tiers.
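The spec gaps quoted in the Speed note can be reproduced with a few lines of arithmetic. The core counts, bandwidth figures and the 10% band are taken from that note; the helper name is hypothetical.

```python
def percent_less(smaller: float, larger: float) -> float:
    """How much lower `smaller` is than `larger`, as a percentage of `larger`."""
    return (1 - smaller / larger) * 100

cores_gap = percent_less(5888, 8448)     # CUDA cores: RTX 4070 vs 4070 Ti SUPER
bandwidth_gap = percent_less(504, 672)   # memory bandwidth in GB/s

CLOSE_SIBLING_BAND = 10.0  # percent; threshold quoted in the Speed note
transferable = max(cores_gap, bandwidth_gap) <= CLOSE_SIBLING_BAND

print(f"cores: {cores_gap:.1f}% fewer, bandwidth: {bandwidth_gap:.1f}% less")
print("benchmark transferable:", transferable)
```

Both gaps fall well outside the band, which is why no 4070 Ti SUPER figure is carried over.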

For the full benchmark data, see /check/qwen-image/rtx-4070.

Troubleshooting

Q4_K_M or higher OOMs on a 12 GB card

Q4_K_M is 13.07 GB on disk and Q4_K_S is 12.14 GB (city96 file table) — both exceed the ~11 GB usable on a 12 GB card once the VAE and activations are added. These are the 16 GB-card quants. Use Q3_K_M (9.68 GB) or Q3_K_S (8.95 GB) as described above.

OOM during the text-encode stage

The FP8-scaled Qwen2.5-VL-7B text encoder is 9.38 GB (Comfy-Org file listing). On a 12 GB card driving a desktop, the brief text-encode stage can spike close to the usable ceiling. Switch to a GGUF text encoder via ComfyUI-GGUF's CLIPLoader (gguf) node (step 5) — the 4.68 GB Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf cuts the encoder footprint roughly in half. On the 4070's PCIe Gen4 link the offloaded encoder stage is a little slower than on a Gen5 board, but it clears the OOM.

Why not just run the FP8 native build? FP8 is native on Ada

The RTX 4070's Ada Lovelace 4th-gen tensor cores do support FP8 (E4M3/E5M2) compute natively, so FP8 is genuinely hardware-accelerated on this card. But that does not help with the official build: qwen_image_fp8_e4m3fn.safetensors is 20.43 GB (Comfy-Org repackage file tree) — larger than the 4070's entire 12 GB VRAM — so the diffusion weight alone OOMs on load before any FP8 acceleration can apply. FP8 acceleration is a compute property; it does not shrink a 20.43 GB file to fit a 12 GB card. (The NVFP4 single-file build is 19.77 GB and is in any case Blackwell-only — Ada cannot run NVFP4.) On a 12 GB 4070 use the GGUF path above.

Unet Loader (GGUF) node missing

This usually means step 2 was skipped. Install ComfyUI-GGUF; the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.

Generation produces a black image

This has been reported across the city96/Qwen-Image-gguf discussions. The most common cause is a mismatched or missing text encoder, so double-check that qwen_2.5_vl_7b_fp8_scaled.safetensors (or your GGUF substitute from step 5) is present in ComfyUI/models/text_encoders/.
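A small file-layout check can rule out the missing-file case quickly. The sketch below uses the default filenames and folders from steps 3 and 4; the function name is hypothetical, and the table would need editing if you substituted a different quant or the GGUF text encoder from step 5.

```python
from pathlib import Path

# Default layout from steps 3 and 4, relative to the ComfyUI folder.
EXPECTED = {
    "diffusion model": "models/diffusion_models/qwen-image-Q3_K_M.gguf",
    "text encoder": "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "VAE": "models/vae/qwen_image_vae.safetensors",
}

def missing_files(comfy_root: str, expected: dict[str, str] = EXPECTED) -> list[str]:
    """List the expected model files that are absent under comfy_root."""
    root = Path(comfy_root)
    return [
        f"{role}: {rel}"
        for role, rel in expected.items()
        if not (root / rel).is_file()
    ]

# Example: print(missing_files("ComfyUI"))
```

An empty list means all three files are where the stock workflow expects them; it does not prove the files themselves are intact.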