What You'll Build
A local Qwen-Image text-to-image setup on a 16 GB RTX 5070 Ti (Blackwell, sm_120). Qwen-Image is Alibaba Tongyi Lab's 20B-parameter MMDiT image foundation model, released August 4, 2025 under Apache 2.0; the GitHub README describes it as "a 20B MMDiT image foundation model that achieves significant advances in complex text rendering and precise image editing". The official FP8 build (qwen_image_fp8_e4m3fn.safetensors, 20.43 GB on the Comfy-Org repackager file listing) exceeds the card's 16 GB on its own, before its 9.38 GB FP8 text encoder is counted. This recipe instead uses city96's GGUF redistribution (Q4_K_M, 13.07 GB on disk) loaded through the ComfyUI-GGUF custom node. With the Qwen2.5-VL-7B text encoder offloaded by ComfyUI's default memory management, that leaves a little under 3 GB on paper for the VAE and activations.
Hardware data: RTX 5070 Ti (16 GB GDDR7 VRAM, Blackwell, sm_120) · 20B-parameter MMDiT at 4-bit GGUF · See benchmark data
⚠️ Why GGUF and not the FP8 native path on this card? The RTX 5070 Ti is Blackwell (sm_120) and does have native FP8 tensor cores — so unlike the RTX 3090 Ti Ampere sibling, FP8 is not a compute dead-end here. The blocker on the 5070 Ti is purely capacity: the FP8 diffusion build is 20.43 GB on disk (Comfy-Org file listing), past the 5070 Ti's 16 GB envelope before the 9.38 GB text encoder is even counted. The native FP8 path is the one the RTX 5090 32 GB sibling installs; on 16 GB, GGUF is the fitting path. (The Comfy-Org repackager also publishes a Blackwell-only NVFP4 build at 19.77 GB and the BF16 build at 40.86 GB — both also over the 16 GB envelope.)
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (NVIDIA, CUDA-capable) | RTX 5070 Ti (16 GB GDDR7, Blackwell, sm_120) |
| RAM | 32 GB system RAM recommended for text-encoder offload | — |
| Storage | ~13 GB for Q4_K_M diffusion weights, ~25 GB total with encoder + VAE + Lightning LoRA | — |
| Software | ComfyUI (current build), Python 3.10+ (cu128 PyTorch wheel), ComfyUI-GGUF custom node | — |
The 20B parameter count is stated explicitly in the Qwen-Image GitHub README ("20B MMDiT image foundation model") and the city96 GGUF card. At BF16 the diffusion weights alone are 40.87 GB (qwen-image-BF16.gguf, verified via the HF tree API on 2026-06-03), and even the FP8 build is 20.43 GB, which is why a 16 GB card needs a GGUF quant that is smaller than 16 GB rather than the FP8 or BF16 files.
Installation
1. Update ComfyUI and install the cu128 PyTorch wheel
Pull the latest ComfyUI build. Qwen-Image has had native ComfyUI support since 2025.08.05 — the official Qwen-Image tutorial explicitly notes "Make sure your ComfyUI is updated."
The RTX 5070 Ti is Blackwell (sm_120). PyTorch wheels built against CUDA 12.8 (cu128) ship sm_120 kernels — if your ComfyUI Python environment defaults to an older cu121/cu124 wheel, the model will fail at the first inference call. Install the cu128 wheel:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
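To confirm that the wheel you ended up with actually covers the card, a minimal sketch along these lines may help. It compares the GPU's compute capability against the architecture list compiled into the installed PyTorch build. The helper names `supports_capability` and `check_installed_torch` are made up for this example, and the exact-match test is a simplification: a build could in principle still run through PTX forward compatibility even when the `sm_` entry is absent.

```python
def supports_capability(arch_list, capability):
    """Return True if the wheel's compiled arch list has an exact sm_XY entry
    for the given (major, minor) compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch(device_index=0):
    """Describe whether the installed PyTorch build has kernels for the local GPU.

    torch is imported lazily so the pure helper above stays usable without it.
    """
    import torch

    if not torch.cuda.is_available():
        return "CUDA is not available to this PyTorch build"

    capability = torch.cuda.get_device_capability(device_index)  # (12, 0) on sm_120
    archs = torch.cuda.get_arch_list()  # architectures compiled into this wheel
    verdict = "covered" if supports_capability(archs, capability) else "NOT covered"
    return (
        f"torch {torch.__version__} (CUDA {torch.version.cuda}), "
        f"{torch.cuda.get_device_name(device_index)} capability {capability}: "
        f"{verdict} by arch list {archs}"
    )
```

Run `print(check_installed_torch())` with the same Python interpreter that launches ComfyUI. If the result says the capability is not covered, reinstalling the cu128 wheel as shown above is the first thing to try.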
2. Install the ComfyUI-GGUF custom node
The native ComfyUI loader does not read GGUF files; install city96's loader. From the directory that contains your ComfyUI folder:
git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf
Windows portable build users substitute the embedded Python:
git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt
Source: ComfyUI-GGUF README.
3. Download the quantized diffusion weights
From city96/Qwen-Image-gguf, pull a quant that fits 16 GB with overhead for the Qwen2.5-VL-7B text encoder, VAE, and activations. The full per-quant file sizes (verified via the HF tree API on 2026-06-03) are:
| Quant | Size on disk |
|---|---|
| Q2_K | 7.06 GB |
| Q3_K_S | 8.95 GB |
| Q3_K_M | 9.68 GB |
| Q4_0 | 11.85 GB |
| Q4_K_S | 12.14 GB |
| Q4_K_M | 13.07 GB |
| Q5_K_S | 14.12 GB |
| Q5_K_M | 14.93 GB |
| Q6_K | 16.82 GB |
| Q8_0 | 21.76 GB |
| BF16 | 40.87 GB |
For a 16 GB card, Q4_K_M (13.07 GB) is the recommended starting point: it leaves workable margin once the text encoder is offloaded by ComfyUI's default memory management. Q4_K_S (12.14 GB) is the lower-VRAM fallback. Q5_K_S and Q5_K_M leave only about 1 to 2 GB for everything else, so expect OOM unless the text encoder stays on the CPU. Q6_K (16.82 GB) is larger than the card's VRAM before anything else loads.
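The selection logic above can be sketched as simple arithmetic over the file table. The sizes are copied from the table; the 2.5 GB reserve for the VAE, activations, and runtime overhead is an assumed illustrative figure, not a measurement, and on-disk size is only a proxy for resident VRAM.

```python
# On-disk sizes in GB, copied from the city96 quant table above.
QUANT_SIZES_GB = {
    "Q2_K": 7.06, "Q3_K_S": 8.95, "Q3_K_M": 9.68, "Q4_0": 11.85,
    "Q4_K_S": 12.14, "Q4_K_M": 13.07, "Q5_K_S": 14.12, "Q5_K_M": 14.93,
    "Q6_K": 16.82, "Q8_0": 21.76, "BF16": 40.87,
}


def largest_fitting_quant(vram_gb, reserve_gb, sizes=QUANT_SIZES_GB):
    """Return the largest quant whose size plus the reserve fits in VRAM, or None."""
    fitting = {name: size for name, size in sizes.items()
               if size + reserve_gb <= vram_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def headroom_gb(quant, vram_gb, sizes=QUANT_SIZES_GB):
    """VRAM left after the diffusion weights alone, assuming the text encoder is offloaded."""
    return round(vram_gb - sizes[quant], 2)


# Assumed reserve of 2.5 GB for VAE, activations, and overhead (illustrative only).
print(largest_fitting_quant(16, 2.5))  # Q4_K_M
print(headroom_gb("Q4_K_M", 16))       # 2.93
```

Raising or lowering the assumed reserve shifts the pick between Q4_K_S, Q4_K_M, and the Q5 quants, which is why the Q5 files are borderline on this card.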
# From the directory that contains your ComfyUI folder, into ComfyUI/models/diffusion_models/
wget https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q4_K_M.gguf \
-O ComfyUI/models/diffusion_models/qwen-image-Q4_K_M.gguf
The destination is per the city96 model card.
Avoid IQ-quants: per city96's own response on issue #255, "The IQ quants are only supported via a very slow fallback. You could try go for Q4_K_S or Q3_K_M instead (basically anything that just has 'Q' instead of 'IQ' in the name)."
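If you are scripting downloads, the "Q, not IQ" rule of thumb can be expressed as a small filename check. This is only a sketch: `quant_family` is a made-up helper, it relies purely on the naming convention, and the IQ filename in the example is hypothetical rather than a file known to exist in the repo.

```python
import re


def quant_family(filename):
    """Classify a GGUF filename as 'Q', 'IQ', or None based on its quant tag.

    Looks for an uppercase Q or IQ followed by a digit that is not preceded
    by another letter, e.g. '-Q4_K_M' or '-IQ4_XS'.
    """
    stem = filename.rsplit("/", 1)[-1]
    match = re.search(r"(?<![A-Za-z])(I?Q)\d", stem)
    return match.group(1) if match else None


print(quant_family("qwen-image-Q4_K_M.gguf"))  # Q
print(quant_family("example-IQ4_XS.gguf"))     # IQ (hypothetical filename)
print(quant_family("qwen-image-BF16.gguf"))    # None
```

Anything that classifies as `IQ` would fall under the slow fallback described in the quoted issue response.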
4. Download the text encoder and VAE
These are the same files the FP8 native workflow uses. Per the official ComfyUI tutorial:
# Text encoder (FP8 scaled, 9.38 GB) → ComfyUI/models/text_encoders/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
-O ComfyUI/models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
# VAE (0.25 GB) → ComfyUI/models/vae/
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors \
-O ComfyUI/models/vae/qwen_image_vae.safetensors
Both file sizes are verified via the Comfy-Org repackager tree on 2026-06-03.
5. (Recommended) Download the Lightning LoRA
To cut per-image latency, install the 8-step Lightning LoRA, the acceleration LoRA the ComfyUI tutorial recommends for this model. It lets the sampler run in 8 steps; no measured 5070 Ti timing is available yet, so treat the size of the speedup on this card as unverified:
wget https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-8steps-V1.0.safetensors \
-O ComfyUI/models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors
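After the downloads in steps 3 to 5, a quick audit can catch truncated or misplaced files before ComfyUI does. The sketch below assumes the paths used in the commands above and that the quoted sizes are decimal gigabytes (10^9 bytes); `audit_models` and its tolerance are illustrative choices. No size is given for the LoRA in this guide, so only its presence is checked.

```python
from pathlib import Path

# Relative path inside the ComfyUI folder -> expected size in GB (None = presence only).
EXPECTED_FILES = {
    "models/diffusion_models/qwen-image-Q4_K_M.gguf": 13.07,
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors": 9.38,
    "models/vae/qwen_image_vae.safetensors": 0.25,
    "models/loras/Qwen-Image-Lightning-8steps-V1.0.safetensors": None,
}


def audit_models(comfy_dir, expected=EXPECTED_FILES, tolerance_gb=0.1):
    """Return {relative_path: status} where status is 'ok', 'missing', or a size note."""
    report = {}
    for rel_path, size_gb in expected.items():
        path = Path(comfy_dir) / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
            continue
        actual_gb = path.stat().st_size / 1e9  # assumes decimal GB
        if size_gb is not None and abs(actual_gb - size_gb) > tolerance_gb:
            report[rel_path] = (
                f"size mismatch: {actual_gb:.2f} GB, expected about {size_gb} GB"
            )
        else:
            report[rel_path] = "ok"
    return report
```

Calling `audit_models("ComfyUI")` from the directory that contains the ComfyUI folder returns one status per file; a size mismatch usually points to an interrupted download.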
6. Load the workflow
The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.
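Before dragging a workflow onto the canvas, you can list the node types it references to confirm the GGUF loader is wired in. This sketch assumes the two JSON layouts ComfyUI commonly exports (the UI format with a top-level `nodes` list and the API format keyed by node id), and assumes the loader's class name is `UnetLoaderGGUF`, which is how the node appears to be registered in ComfyUI-GGUF; verify against your copy.

```python
import json


def node_types(workflow):
    """Return the sorted set of node type names referenced by a workflow dict."""
    nodes = workflow.get("nodes")
    if isinstance(nodes, list):
        # UI-format export: {"nodes": [{"type": "..."}, ...]}
        return sorted({n["type"] for n in nodes if "type" in n})
    # API-format export: {"<id>": {"class_type": "..."}, ...}
    return sorted({
        n["class_type"] for n in workflow.values()
        if isinstance(n, dict) and "class_type" in n
    })


def load_node_types(path):
    """Read a workflow JSON file and return its node type names."""
    with open(path, encoding="utf-8") as fh:
        return node_types(json.load(fh))
```

For example, `"UnetLoaderGGUF" in load_node_types("qwen-image_workflow.json")` should be true for a GGUF workflow, assuming the file has been saved locally under that name.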
Running
Launch ComfyUI from inside the ComfyUI folder (or use the portable build's launcher):
python main.py --listen 127.0.0.1 --port 8188
Open http://127.0.0.1:8188, load the workflow JSON, enter a prompt, and queue a generation. First-run latency is dominated by the GGUF load into VRAM; subsequent runs reuse the in-memory model.
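To watch VRAM while the model loads, one option is to poll the running server. The sketch below assumes ComfyUI exposes a `/system_stats` endpoint whose response contains a `devices` list with `name`, `vram_total`, and `vram_free` in bytes; those field names reflect my understanding of the server and should be checked against your build. The helper names are made up for this example.

```python
import json
from urllib.request import urlopen


def summarize_devices(stats):
    """Turn a system-stats dict into (name, total_GiB, free_GiB) tuples."""
    rows = []
    for dev in stats.get("devices", []):
        total = dev.get("vram_total", 0) / 2**30
        free = dev.get("vram_free", 0) / 2**30
        rows.append((dev.get("name", "unknown"), round(total, 2), round(free, 2)))
    return rows


def fetch_system_stats(base_url="http://127.0.0.1:8188"):
    """Fetch stats from a running ComfyUI server (endpoint name is an assumption)."""
    with urlopen(f"{base_url}/system_stats", timeout=5) as resp:
        return json.load(resp)
```

With the server from the launch command above running, `summarize_devices(fetch_system_stats())` gives a before-and-after view of free VRAM around the first generation.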
ComfyUI's Qwen-Image workflow uses SDPA (scaled_dot_product_attention) by default, which is part of PyTorch itself and needs no separate sm_120 wheel. FlashAttention-2 sm_120 kernels are still tracked as open at Dao-AILab/flash-attention#2168; if you use a custom node or fork that explicitly enables FA2, fall back to SDPA on the 5070 Ti until the upstream wheel covers sm_120.
Results
- Speed: no first-party RTX 5070 Ti benchmark has been published for the Qwen-Image GGUF ComfyUI workflow yet, and our benchmark data has no measured entry for this model and card, so we deliberately omit a speed number rather than extrapolate one from a different card. Check /check/qwen-image/rtx-5070-ti for an empirical figure once a community benchmark lands.
- VRAM usage: Q4_K_M diffusion weights are 13.07 GB on disk per the city96 file table; on a 16 GB 5070 Ti this fits with the Qwen2.5-VL-7B text encoder offloaded by ComfyUI's default memory management, leaving room for the VAE and activations. See /check/qwen-image/rtx-5070-ti.
- Quality notes: Qwen-Image is positioned as a strong text-rendering model in both English and Chinese (GitHub README); 4-bit quants are widely reported to retain most of that capability, with Q5/Q6 giving incremental quality at proportional VRAM cost. The 8-step Lightning LoRA trades a small amount of step-count flexibility for materially lower latency per image.
For the full benchmark data, see /check/qwen-image/rtx-5070-ti.
Troubleshooting
qwen_image_fp8_e4m3fn.safetensors OOMs on load
The FP8 build is 20.43 GB on disk (Comfy-Org file listing) — larger than a 16 GB card's total VRAM, and that is before the 9.38 GB FP8 text encoder loads. The 5070 Ti's Blackwell sm_120 silicon can run FP8 natively, but the file simply does not fit 16 GB. Use the GGUF path described above; the FP8 build is what the RTX 5090 32 GB sibling installs.
"transformer dispatched to CPU" / ~20 minutes per image (Ampere-specific — not a 5070 Ti failure)
Reported on the canonical repo at QwenLM/Qwen-Image#260 by a community user (no maintainer response as of 2026-06-03) running num_inference_steps=40 on an RTX 3090 and seeing repeated transformer-to-CPU offload messages. On that Ampere (sm_86) card the diffusers BF16 path falls back to CPU offload because the full 40.87 GB BF16 weights do not fit in 24 GB, and Ampere has no FP8 tensor cores, so an FP8 build would be dequantized on the fly rather than run natively. This recipe does not trigger it on the 5070 Ti: the Q4_K_M GGUF (13.07 GB) keeps the diffusion model on the GPU with no per-step CPU swap. The 5070 Ti's native FP8 support is not what avoids the problem, since the 20.43 GB FP8 transformer does not fit in 16 GB either. If you see this message on a 5070 Ti, you are almost certainly running the diffusers BF16 path instead of the ComfyUI-GGUF workflow above.
Q5/Q6 GGUF fits on disk but OOMs at runtime
Q5_K_S is 14.12 GB on disk and Q6_K is 16.82 GB (city96 file table); with the Qwen2.5-VL-7B text encoder (FP8-scaled, 9.38 GB) and VAE also resident, peak VRAM exceeds 16 GB. Drop to Q4_K_M/Q4_K_S, or for the Q5 quants enable text-encoder CPU offload in ComfyUI. Q6_K exceeds 16 GB by itself, so text-encoder offload alone will not make it fit.
Model fails at the first inference call on a fresh install
The RTX 5070 Ti needs the cu128 PyTorch wheel for sm_120 kernels. If you installed an older cu121/cu124 wheel, reinstall per step 1: pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128. This is a Blackwell-class requirement shared with the other RTX 50-series cards (5060 Ti, 5070, 5080, 5090).
Unet Loader (GGUF) node missing
You skipped step 2. Install ComfyUI-GGUF — the native ComfyUI loader cannot read .gguf files. The node lives under the bootleg category in the node browser.
Generation produces a black image
Reported across city96/Qwen-Image-gguf discussions. Most common cause is a mismatched or missing text encoder file — double-check qwen_2.5_vl_7b_fp8_scaled.safetensors is present in ComfyUI/models/text_encoders/ and is the FP8-scaled variant linked from the official ComfyUI tutorial (not the bare Qwen2.5-VL-7B BF16 weights, which are 16.58 GB).