Qwen-Image on RTX 5070: 20B Text-to-Image via ComfyUI GGUF Q3 (Blackwell sm_120, 12 GB)

image · intermediate · 12GB+ VRAM · Jun 5, 2026
prerequisites
  • NVIDIA RTX 5070 (12 GB GDDR7 VRAM, Blackwell sm_120) or equivalent 12 GB-class card
  • Python 3.10+
  • ComfyUI installed and up to date
  • PyTorch built against CUDA 12.8 (cu128) for sm_120 kernel coverage
  • ~10 GB free disk space for the Q3_K_M diffusion model alone (~22 GB total with text encoder + VAE)

What You'll Build

A local Qwen-Image text-to-image setup on a 12 GB RTX 5070 (Blackwell, sm_120). Qwen-Image is Alibaba Tongyi Lab's 20B-parameter MMDiT image foundation model, released under Apache 2.0. The 12 GB envelope is the binding constraint here: the official FP8 diffusion build is 20.43 GB on disk and the BF16 build is 40.87 GB — both far past a 12 GB card. Even the Q4_K_M GGUF build that fits a 16 GB card (13.07 GB on disk, verified via the HF tree API) exceeds the ~11 GB usable on a 12 GB card with a display attached. This recipe leads with city96's Q3_K_M GGUF build (9.68 GB on disk) loaded through the ComfyUI-GGUF custom node — a quant city96 specifically engineered with high-precision first/last layers so the low-bitrate tiers stay usable.

Hardware data: RTX 5070 (12 GB GDDR7 VRAM, Blackwell, sm_120) · 20B-parameter MMDiT at 3-bit GGUF · Benchmark data: /check/qwen-image/rtx-5070

⚠️ Why Q3, not Q4, on this card? A 12 GB desktop card with a monitor attached exposes only ~10.5–11.3 GB usable VRAM. The 16 GB-class GGUF path leads with Q4_K_M (13.07 GB on disk); on a 12 GB card that quant's diffusion-stage residency leaves no margin once the VAE and activations are counted. Dropping to Q3_K_M (9.68 GB) — or Q3_K_S (8.95 GB) for tighter headroom — keeps the diffusion model well under the 12 GB envelope. Per the city96 GGUF model card, the Q3_K_M, Q3_K_S and Q2_K tiers "use a new dynamic logic where the first/last layer is kept in high precision," with the card noting that "even Q2_K remains somewhat usable" (a per-tier quality comparison is linked from the card).

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (NVIDIA, CUDA-capable) | RTX 5070 (12 GB GDDR7, Blackwell, sm_120) |
| RAM | 32 GB system RAM recommended for text-encoder offload | |
| Storage | ~10 GB for Q3_K_M diffusion weights, ~22 GB total with encoder + VAE | |
| Software | ComfyUI (current build), Python 3.10+ (cu128 PyTorch wheel), ComfyUI-GGUF custom node | |

The 20B parameter count is stated on the Qwen-Image GitHub README and the city96 GGUF card. At BF16 the diffusion weights alone are 40.87 GB (qwen-image-BF16.gguf, verified via the HF tree API on 2026-06-05), which is why a 12 GB card requires a low-bitrate GGUF quant.

Installation

1. Update ComfyUI and install the cu128 PyTorch wheel

Pull the latest ComfyUI build — Qwen-Image has had native ComfyUI support since release, and the official Qwen-Image tutorial walks through the stock workflow.

The RTX 5070 is Blackwell (sm_120). PyTorch wheels built against CUDA 12.8 (cu128) ship sm_120 kernels — if your ComfyUI Python environment defaults to an older cu121/cu124 wheel, the model will fail at the first inference call. Install the cu128 wheel:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
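
To confirm which wheel actually ended up in the ComfyUI environment, a small check along these lines may help. It is a minimal sketch: `cuda_tag`, `covers_sm120` and `report` are illustrative helper names, and the rule "CUDA tag of 12.8 or newer means sm_120 kernels are present" is an assumption taken from the paragraph above, not something the script proves. The `torch` calls used in `report` (`torch.__version__`, `torch.cuda.is_available`, `torch.cuda.get_device_capability`, `torch.cuda.get_arch_list`) are standard PyTorch APIs.

```python
import re


def cuda_tag(torch_version):
    """Return the CUDA build tag of a version string such as '2.7.0+cu128'
    as a (major, minor) tuple, or None if there is no '+cuNNN' suffix."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # 'cu128' -> (12, 8); 'cu118' -> (11, 8)
    return int(digits[:-1]), int(digits[-1])


def covers_sm120(torch_version):
    """Assumption from this recipe: wheels tagged cu128 or newer ship sm_120 kernels."""
    tag = cuda_tag(torch_version)
    return tag is not None and tag >= (12, 8)


def report():
    """Print what the current interpreter's PyTorch build says about itself."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch version:", torch.__version__)
    print("cu128 or newer:", covers_sm120(torch.__version__))
    if torch.cuda.is_available():
        # An RTX 5070 should report compute capability (12, 0).
        print("device capability:", torch.cuda.get_device_capability(0))
        print("compiled arch list:", torch.cuda.get_arch_list())


if __name__ == "__main__":
    report()
```

Run it with the same interpreter ComfyUI uses (for the Windows portable build, that is the embedded `python.exe`), since a system-wide Python may carry a different wheel.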

2. Install the ComfyUI-GGUF custom node

The native ComfyUI loader does not read GGUF files; install city96's loader. Run these from the directory that contains your ComfyUI folder (the paths below are relative to it):

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Windows portable build users substitute the embedded Python:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Source: ComfyUI-GGUF README.
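
A quick way to confirm both halves of this step (the cloned node folder and the `gguf` Python package) is sketched below. `check_gguf_node` is a hypothetical helper, and it assumes you pass it the path of the ComfyUI folder itself, the one that contains `custom_nodes`.

```python
import importlib.util
from pathlib import Path


def check_gguf_node(comfy_dir):
    """Report whether the ComfyUI-GGUF node folder exists under comfy_dir
    and whether the 'gguf' package is importable by this interpreter."""
    root = Path(comfy_dir)
    return {
        "node_dir": (root / "custom_nodes" / "ComfyUI-GGUF").is_dir(),
        "gguf_package": importlib.util.find_spec("gguf") is not None,
    }


if __name__ == "__main__":
    # "ComfyUI" is an assumed relative path; adjust to your install location.
    print(check_gguf_node("ComfyUI"))
```

As with the PyTorch check, the `gguf_package` result only describes the interpreter that runs the script, so use the one ComfyUI launches with.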

3. Download the quantized diffusion weights

From city96/Qwen-Image-gguf, pull a Q3 quant that fits the 12 GB envelope. The full per-quant file sizes (verified via the HF tree API on 2026-06-05) include:

| Quant | Size on disk | Fits 12 GB? |
|---|---|---|
| Q2_K | 7.06 GB | Yes (most aggressive) |
| Q3_K_S | 8.95 GB | Yes |
| Q3_K_M | 9.68 GB | Yes (recommended) |
| Q4_0 | 11.85 GB | No (too tight with VAE + activations) |
| Q4_K_S | 12.14 GB | No |
| Q4_K_M | 13.07 GB | No (this is the 16 GB-card path) |
| Q8_0 | 21.76 GB | No |
| BF16 | 40.87 GB | No |

For a 12 GB card, Q3_K_M (9.68 GB) is the recommended starting point — it leaves the diffusion stage with workable margin under the ~11 GB usable once ComfyUI offloads the text encoder. Q3_K_S (8.95 GB) is the lower-VRAM fallback; Q2_K (7.06 GB) is the most aggressive tier if you also drive a heavy desktop.
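
The selection logic above can be sketched as a small budget calculation. The file sizes are the ones listed in the table; the 1.0 GB reserve for the VAE plus activations is an assumed planning figure, not a measurement, and `largest_fitting_quant` is an illustrative name.

```python
# On-disk sizes in GB, copied from the quant table above.
QUANT_SIZES_GB = {
    "Q2_K": 7.06,
    "Q3_K_S": 8.95,
    "Q3_K_M": 9.68,
    "Q4_0": 11.85,
    "Q4_K_S": 12.14,
    "Q4_K_M": 13.07,
    "Q8_0": 21.76,
    "BF16": 40.87,
}


def largest_fitting_quant(usable_gb, reserve_gb=1.0):
    """Return the largest quant whose file size fits in usable VRAM after
    holding back reserve_gb for the VAE and activations (assumed value)."""
    budget = usable_gb - reserve_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for usable in (10.5, 11.0, 11.3):
        print(usable, "GB usable ->", largest_fitting_quant(usable))
```

With about 11 GB usable this lands on Q3_K_M, and with 10.5 GB it drops to Q3_K_S, which matches the recommendation and fallback described above. On-disk size is only a proxy for resident size, so treat the result as a starting point.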

# Run from the directory that contains your ComfyUI folder; saves into ComfyUI/models/diffusion_models/
wget https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q3_K_M.gguf \
  -O ComfyUI/models/diffusion_models/qwen-image-Q3_K_M.gguf

The destination folder is per the city96 model card.

Avoid the IQ-quants on this loader. Per city96's own response on ComfyUI-GGUF issue #255: "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has 'Q' instead of 'IQ' in the name)."
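
If you keep several quant files around, a filename filter following city96's rule of thumb ("Q" rather than "IQ" in the name) could look like the sketch below. `is_iq_quant` is a hypothetical helper, and the `IQ4_XS` filename is only an illustrative example of the naming pattern, not a claim about which files the repo contains.

```python
import re


def is_iq_quant(filename):
    """True if the quant label in a GGUF filename starts with 'IQ'
    (for example '-IQ4_XS'), as opposed to a plain 'Q' quant."""
    return re.search(r"[-_.]IQ\d", filename) is not None


if __name__ == "__main__":
    for name in ("qwen-image-Q3_K_M.gguf", "qwen-image-IQ4_XS.gguf"):
        label = "skip (slow fallback)" if is_iq_quant(name) else "ok"
        print(name, "->", label)
```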

4. Download the text encoder and VAE

Per the official ComfyUI tutorial, the stock workflow uses the FP8-scaled Qwen2.5-VL-7B text encoder and the Qwen-Image VAE:

# Text encoder (FP8 scaled, 9.38 GB) → ComfyUI/models/text_encoders/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  -O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors

# VAE (0.25 GB) → ComfyUI/models/vae/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
  -O ComfyUI/models/vae/qwen_image_vae.safetensors

Both file sizes are verified via the Comfy-Org repackager tree on 2026-06-05. ComfyUI runs the text encoder first to produce conditioning, then frees it before loading the diffusion model, so the 9.38 GB encoder and the Q3_K_M diffusion weights are not both fully resident at the same peak — this is the same default offload behaviour that lets the 13.07 GB Q4_K_M quant fit a 16 GB card.
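
Before loading a workflow it can be worth confirming that the three files from steps 3 and 4 sit in the folders named above. The following is a minimal sketch; `missing_models` is an illustrative helper, and it assumes you pass the path of the ComfyUI folder that contains `models/`.

```python
from pathlib import Path

# Folder under ComfyUI/models/ -> filename, as used in steps 3 and 4.
EXPECTED_FILES = {
    "diffusion_models": "qwen-image-Q3_K_M.gguf",
    "text_encoders": "qwen_2.5_vl_7b_fp8_scaled.safetensors",
    "vae": "qwen_image_vae.safetensors",
}


def missing_models(comfy_dir, expected=EXPECTED_FILES):
    """Return 'folder/filename' entries that are not present under comfy_dir/models."""
    models = Path(comfy_dir) / "models"
    return [
        f"{folder}/{name}"
        for folder, name in expected.items()
        if not (models / folder / name).is_file()
    ]


if __name__ == "__main__":
    # "ComfyUI" is an assumed relative path; adjust to your install location.
    missing = missing_models("ComfyUI")
    print("all files present" if not missing else f"missing: {missing}")
```

If you use a different quant or the GGUF text encoder from step 5, pass your own mapping as the `expected` argument.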

5. (Optional, for tight display headroom) Use a GGUF text encoder

If your 12 GB card also drives a desktop and you see OOM during the text-encode stage, ComfyUI-GGUF can load the text encoder itself as a GGUF quant via its CLIPLoader (gguf) nodes — the ComfyUI-GGUF README notes these "can be used inplace of the regular ones." Swap the 9.38 GB FP8 encoder for a smaller Qwen2.5-VL-7B GGUF (e.g. the 4.68 GB Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf from unsloth's Qwen2.5-VL-7B-Instruct-GGUF, linked from the city96 card's text-encoder row):

wget https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  -O ComfyUI/models/text_encoders/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
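
To see what the encoder swap does and does not change, the two stages can be compared with a rough per-stage estimate. This is a sketch under stated assumptions: it relies on the encode-then-free behaviour described in step 4, uses on-disk sizes as a stand-in for resident size, ignores encode-stage activations, and the 1.0 GB activation allowance is an assumed planning figure. `stage_peaks` is an illustrative name.

```python
def stage_peaks(encoder_gb, diffusion_gb, vae_gb=0.25, activations_gb=1.0):
    """Estimate per-stage VRAM in GB when the text encoder is freed before
    the diffusion model loads. activations_gb is an assumed allowance."""
    encode = encoder_gb
    diffuse = diffusion_gb + vae_gb + activations_gb
    return {"encode": encode, "diffuse": diffuse, "peak": max(encode, diffuse)}


if __name__ == "__main__":
    fp8 = stage_peaks(encoder_gb=9.38, diffusion_gb=9.68)    # FP8-scaled encoder
    q4 = stage_peaks(encoder_gb=4.68, diffusion_gb=9.68)     # Q4_K_M GGUF encoder
    for label, est in (("FP8 encoder", fp8), ("GGUF encoder", q4)):
        print(label, {k: round(v, 2) for k, v in est.items()})
```

Under these assumptions the GGUF encoder roughly halves the encode-stage footprint, while the estimated overall peak is still set by the diffusion stage. That is why this step is optional and mainly helps when the OOM happens during text encoding.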

6. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE. Point the Unet Loader at your qwen-image-Q3_K_M.gguf file.

Running

Launch ComfyUI from inside the ComfyUI folder, where main.py lives (or use the portable build's launcher):

python main.py --listen 127.0.0.1 --port 8188

Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
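
Generations can also be queued without the browser by posting to ComfyUI's `/prompt` HTTP endpoint. The sketch below makes two assumptions: the workflow has been exported from ComfyUI in API format (the file you drag onto the canvas is normally in UI format and will not work as-is), and it is saved under the hypothetical name `qwen_image_api.json`. `build_prompt_request` and `queue_prompt` are illustrative helpers.

```python
import json
from urllib import request


def build_prompt_request(workflow, host="127.0.0.1", port=8188):
    """Build (but do not send) a POST request that queues an API-format workflow."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def queue_prompt(workflow, host="127.0.0.1", port=8188):
    """Send the workflow to a running ComfyUI server and return its JSON reply."""
    with request.urlopen(build_prompt_request(workflow, host, port)) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Hypothetical filename for an API-format export of the workflow.
    try:
        with open("qwen_image_api.json", encoding="utf-8") as fh:
            print(queue_prompt(json.load(fh)))
    except OSError as exc:
        print("could not queue workflow:", exc)
```

The host and port defaults match the launch command above. Splitting request construction from sending keeps the payload easy to inspect before anything reaches the server.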

ComfyUI's Qwen-Image workflow uses SDPA (scaled_dot_product_attention) by default, which always works on sm_120. FlashAttention-2 sm_120 kernels are still tracked open at Dao-AILab/flash-attention#2168 — if you use a custom node or fork that explicitly enables FA2, fall back to SDPA on the 5070 until the upstream wheel lands sm_120 coverage.

Results

  • Speed: no first-party RTX 5070 Qwen-Image benchmark has been published for the GGUF ComfyUI workflow yet, and the backend has no measured row for this pair — so we deliberately omit a speed number rather than forward-extrapolate one from a different card. The RTX 5070 has roughly 25% less memory bandwidth and 31% fewer cores than the 5070 Ti, so a 5070 Ti or 5080 figure would not transfer. Check /check/qwen-image/rtx-5070 for an empirical figure once a community benchmark lands, and please contribute your own measurement.
  • VRAM usage (derived envelope): the Q3_K_M diffusion weights are 9.68 GB on disk per the city96 file table. With the text encoder offloaded by ComfyUI's default memory management, the diffusion-stage resident set (transformer + 0.25 GB VAE + activations) comes to roughly 10–11 GB, inside the ~11 GB usable on a 12 GB card. This is an envelope derived from on-disk sizes, not a measured peak; see /check/qwen-image/rtx-5070 and report a measured figure via /contribute.
  • Quality notes: Qwen-Image is positioned as a strong text-rendering model in English and Chinese (GitHub README). Q3 is a meaningful step below Q4/Q5 in fidelity, but city96 mitigates this for exactly these low-bitrate tiers: per the model card, Q3_K_M / Q3_K_S / Q2_K "use a new dynamic logic where the first/last layer is kept in high precision," with the card stating "even Q2_K remains somewhat usable." Expect slightly softer fine detail and occasional text-rendering slips versus the Q4_K_M build a 16 GB card runs. If you have the headroom, Q3_K_M is the better-quality choice of the two Q3 tiers.

For the full benchmark data, see /check/qwen-image/rtx-5070.

Troubleshooting

Q4_K_M or higher OOMs on a 12 GB card

Q4_K_M is 13.07 GB on disk and Q4_K_S is 12.14 GB (city96 file table). Both exceed the ~11 GB usable on a 12 GB card even before the VAE and activations are added. These are the 16 GB-card quants. Use Q3_K_M (9.68 GB) or Q3_K_S (8.95 GB) as described above.

OOM during the text-encode stage

The FP8-scaled Qwen2.5-VL-7B text encoder is 9.38 GB (Comfy-Org file listing). On a 12 GB card driving a desktop, the brief text-encode stage can spike close to the usable ceiling. Switch to a GGUF text encoder via ComfyUI-GGUF's CLIPLoader (gguf) node (step 5) — the 4.68 GB Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf cuts the encoder footprint roughly in half.

"transformer dispatched to CPU" / ~20 minutes per image (Ampere-specific — not a 5070 failure)

Reported on the canonical repo at QwenLM/Qwen-Image#260 by a community user (no maintainer response as of 2026-06-05) running on an RTX 3090 and seeing repeated transformer-to-CPU offload messages. That is an Ampere (sm_86) failure mode — the diffusers BF16/FP8 path falls back to CPU offload because Ampere has no native FP8 tensor cores and the full BF16 weights don't fit 24 GB. It does not fire on the 5070's Blackwell sm_120, and in any case this recipe runs the GGUF path (Q3_K_M, 9.68 GB) entirely on-GPU with no per-step CPU swap. If you see this message on a 5070, you are almost certainly running the diffusers BF16 path instead of the ComfyUI-GGUF workflow above.

Model fails at the first inference call on a fresh install

The RTX 5070 needs the cu128 PyTorch wheel for sm_120 kernels. If you installed an older cu121/cu124 wheel, reinstall per step 1: pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128. This is a Blackwell-class requirement shared with the other RTX 50-series cards (5060 Ti, 5070 Ti, 5080, 5090).

Unet Loader (GGUF) node missing

This means ComfyUI-GGUF is not installed or failed to load (step 2). The native ComfyUI loader cannot read .gguf files, so install the custom node and restart ComfyUI. The node lives under the bootleg category in the node browser.

Generation produces a black image

Reported across city96/Qwen-Image-gguf discussions. The most common cause is a mismatched or missing text encoder — double-check qwen_2.5_vl_7b_fp8_scaled.safetensors (or your GGUF substitute from step 5) is present in ComfyUI/models/text_encoders/.