How much VRAM does Qwen-Image need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the installation steps below.

Qwen-Image on RX 7900 XTX: 20B Text-to-Image via ComfyUI on ROCm (GGUF / BF16)

What You'll Build

A local Qwen-Image text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. Qwen-Image is "a 20B MMDiT image foundation model" from Alibaba's Qwen team (Tongyi Lab), released August 4, 2025 under Apache 2.0, recognised for complex text rendering. It is a heavy model: the full BF16 diffusion weights are 40.9 GB on disk (city96 GGUF card; ComfyUI native tutorial), which does not fit in 24 GB. This recipe leads with the city96 GGUF quants (Q5_K_S 14.1 GB / Q8_0 21.8 GB) loaded through the ComfyUI-GGUF custom node, which run cleanly inside the 7900 XTX's 24 GB envelope and leave room for the Qwen2.5-VL-7B text encoder and the VAE.

Hardware data: RX 7900 XTX (24 GB VRAM) · 20B-parameter MMDiT via GGUF on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA, and FP8 is not the answer here. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers, and no FP8/FP4 path. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only; there is no FP8 hardware. So the official qwen_image_fp8_e4m3fn.safetensors build (20.4 GB on disk, ComfyUI tutorial) gives you nothing on this card: the FP8 weights upcast to BF16 at load, restoring the full ~40 GB-class footprint with no compute acceleration, so it OOMs just like the raw BF16 build. The honest path that fits 24 GB is GGUF Q5/Q8 (recommended, with comfortable headroom). If you insist on full precision, there is no option: the un-quantized BF16 diffusion weights cannot stay resident on this card. The attention path is PyTorch SDPA (--use-pytorch-cross-attention), never FlashAttention-2 and never xformers.
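The footprint argument above is bytes-per-parameter arithmetic. A minimal sketch, assuming the nominal 20B parameter count from the README and decimal gigabytes (the helper name is illustrative and not taken from any tool):

```python
# Rough weight footprint: parameters x bytes per parameter.
# Assumptions: exactly 20e9 parameters (the README's nominal "20B"), 1 GB = 1e9 bytes.
PARAMS = 20e9
VRAM_GB = 24.0

BYTES_PER_PARAM = {"BF16": 2, "FP8": 1}


def weight_footprint_gb(dtype: str, params: float = PARAMS) -> float:
    """Return the approximate size of the diffusion weights in decimal GB."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    size = weight_footprint_gb(dtype)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{dtype}: ~{size:.0f} GB of weights, {verdict} in {VRAM_GB:.0f} GB before activations")
```

The ~40 GB and ~20 GB estimates line up with the 40.9 GB and 20.4 GB file sizes quoted in this recipe; the small excess is presumably because the 20B figure is rounded. Note that the FP8 estimate only holds while the weights stay in FP8, which is exactly what the warning above says does not happen on this card.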

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB)
RAM	32 GB system (text-encoder offload)	—
Storage	~24 GB (Q5_K_S diffusion + FP8-scaled encoder + VAE)	per city96 + Comfy-Org trees
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build) + ComfyUI-GGUF, Python 3.10+	—

Qwen-Image is released under Apache 2.0 and the weights are not gated — no access request or login is required. The 20B parameter count is stated on the Qwen-Image GitHub README ("a 20B MMDiT image foundation model"). The diffusion model is paired with a Qwen2.5-VL-7B text encoder and a dedicated VAE (Comfy-Org repackage).

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (--pre ... --index-url https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Install the ComfyUI-GGUF custom node

The native ComfyUI loader cannot read GGUF files. Install city96's loader, which dequantizes GGUF tensors into PyTorch tensors and runs them through ComfyUI's normal model path — so it works on whatever backend ComfyUI is on, ROCm included (it is not CUDA-specific). From your ComfyUI root:

git clone https://github.com/city96/ComfyUI-GGUF custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Source: ComfyUI-GGUF README.

5. Download the quantized diffusion weights

From city96/Qwen-Image-gguf — "a direct GGUF conversion of Qwen/Qwen-Image" — pull a quant. The size table below is verified against the repo's current file tree:

Quant	Size on disk
Q4_K_S	12.1 GB
Q4_K_M	13.1 GB
Q5_K_S	14.1 GB
Q5_K_M	14.9 GB
Q6_K	16.8 GB
Q8_0	21.8 GB
BF16	40.9 GB (does not fit 24 GB)

On the 24 GB 7900 XTX, Q5_K_S (14.1 GB) is the recommended default — high quality with comfortable headroom for the text encoder, VAE, and activations. Q8_0 (21.8 GB) also fits if you offload the text encoder to CPU and want the highest GGUF fidelity. Place the file in ComfyUI/models/diffusion_models/ (per the city96 model card):

# from your ComfyUI root: Q5_K_S recommended (~14.1 GB)
wget -O models/diffusion_models/qwen-image-Q5_K_S.gguf \
  https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q5_K_S.gguf
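The quant table in this step reduces to a headroom calculation. A rough sketch, assuming decimal gigabytes, a 24 GB card, and the nominal 20B parameter count; it counts the diffusion weights only, so the remaining headroom still has to cover the VAE, activations, and the text encoder whenever that is resident:

```python
# Sizes in decimal GB, copied from the quant table in this step.
QUANT_SIZES_GB = {
    "Q4_K_S": 12.1, "Q4_K_M": 13.1, "Q5_K_S": 14.1, "Q5_K_M": 14.9,
    "Q6_K": 16.8, "Q8_0": 21.8, "BF16": 40.9,
}
VRAM_GB = 24.0
PARAMS = 20e9  # assumption: nominal "20B" parameter count


def headroom_gb(quant: str, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after the diffusion weights alone (negative means it cannot fit)."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 1)


def effective_bits_per_weight(quant: str, params: float = PARAMS) -> float:
    """File size spread over the nominal parameter count, in bits."""
    return round(QUANT_SIZES_GB[quant] * 1e9 * 8 / params, 2)


for name in QUANT_SIZES_GB:
    print(f"{name:7s} headroom {headroom_gb(name):6.1f} GB, "
          f"~{effective_bits_per_weight(name):.2f} bits/weight")
```

Q5_K_S leaves roughly 9.9 GB free, while Q8_0 leaves only about 2.2 GB, which is why the Q8_0 path depends on keeping the text encoder off the GPU.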

6. Download the text encoder and VAE

These come from the official Comfy-Org repackage. The FP8-scaled text encoder is 9.4 GB; the full BF16 encoder is 16.6 GB and the VAE is 0.25 GB (verified from the Comfy-Org file tree). On a 24 GB AMD card with CPU offload the FP8-scaled encoder is the smaller download and works fine (it upcasts to BF16 on load, but the text encoder is offloaded to CPU after encoding):

wget -O models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
wget -O models/vae/qwen_image_vae.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors
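Interrupted downloads are a common source of confusing load errors, so it can help to confirm the three files exist with plausible sizes before launching. A minimal sketch, assuming the sizes quoted in steps 5 and 6 are decimal gigabytes and that a 5% tolerance is loose enough to absorb rounding; the helper name and the tolerance are illustrative:

```python
import os

# Expected sizes in decimal GB, taken from the figures quoted in steps 5 and 6.
# Paths are relative to the ComfyUI root.
EXPECTED_GB = {
    "models/diffusion_models/qwen-image-Q5_K_S.gguf": 14.1,
    "models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors": 9.4,
    "models/vae/qwen_image_vae.safetensors": 0.25,
}


def check_downloads(root, expected_gb, tolerance=0.05):
    """Return a list of problems: missing files, or sizes off by more than `tolerance` (fractional)."""
    problems = []
    for rel_path, gb in expected_gb.items():
        path = os.path.join(root, rel_path)
        if not os.path.isfile(path):
            problems.append(f"missing: {rel_path}")
            continue
        actual_gb = os.path.getsize(path) / 1e9
        if abs(actual_gb - gb) > tolerance * gb:
            problems.append(
                f"size mismatch: {rel_path} is {actual_gb:.2f} GB, expected ~{gb} GB"
            )
    return problems


if __name__ == "__main__":
    # Run from the ComfyUI root.
    for line in check_downloads(".", EXPECTED_GB) or ["all files present with plausible sizes"]:
        print(line)
```

A size check only catches truncated files. It is not an integrity check; comparing against the checksums published on the model pages would be stricter.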

7. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it, load the workflow JSON from step 7, select qwen-image-Q5_K_S.gguf in the Unet Loader (GGUF) node, enter a prompt, and queue. Generated PNGs land in ComfyUI/output/.
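Queuing can also be scripted against the running server. A minimal sketch, assuming the default address above and a workflow that has been re-exported from ComfyUI in API format (the UI-format JSON from step 7 is not expected to be accepted by this endpoint as-is); the file name workflow_api.json and the helper names are hypothetical:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8188"  # default address from the Running section


def build_prompt_request(workflow: dict, server: str = SERVER) -> urllib.request.Request:
    """Build a POST to ComfyUI's /prompt endpoint carrying an API-format workflow."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str, server: str = SERVER) -> dict:
    """Load an API-format workflow JSON from disk and queue it; returns the server's JSON reply."""
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_prompt_request(workflow, server)) as resp:
        return json.loads(resp.read())


# Example (requires a running server and an API-format export):
# print(queue_workflow("workflow_api.json"))
```

Splitting request construction from sending keeps the payload easy to inspect before anything reaches the server.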

ComfyUI's default attention backend on this stack is PyTorch scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, force the PyTorch 2.0 cross-attention function explicitly. ComfyUI's cli_args.py describes the flag as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

Do not pass --lowvram on a 24 GB 7900 XTX running the GGUF path — it forces the text encoders onto the CPU when you may not need it, slowing you down. Reach for it only if you are running the larger Q8_0 quant and hit memory pressure.
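The flag guidance in this section can be summarised as a small decision helper. This is a sketch only: the function name and the memory_pressure argument are hypothetical, and the two CLI flags are the ones named in this section.

```python
import sys


def comfyui_launch_args(quant: str, memory_pressure: bool = False, force_sdpa: bool = True) -> list:
    """Return an argv list for launching ComfyUI, following the guidance in this section."""
    args = [sys.executable, "main.py"]
    if force_sdpa:
        # Explicitly select PyTorch scaled-dot-product attention.
        args.append("--use-pytorch-cross-attention")
    if quant == "Q8_0" and memory_pressure:
        # Only the larger Q8_0 quant under memory pressure warrants --lowvram.
        args.append("--lowvram")
    return args


print(" ".join(comfyui_launch_args("Q5_K_S")[1:]))
```

The returned list can be handed to subprocess.run(args, check=True) from the ComfyUI root.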

Results

Speed: No Qwen-Image benchmark on an RX 7900 XTX could be verified against its source page, so no speed figure is quoted. The ComfyUI native tutorial reports a VRAM figure on an RTX 4090D 24 GB (a CUDA card), but that run uses native FP8 acceleration that RDNA3 does not have, so neither its speed nor its VRAM figure transfers to this card. If you have measured Qwen-Image generation time on a 7900 XTX, please contribute it so it lands on /check/qwen-image/rx-7900-xtx.
VRAM usage: The Q5_K_S diffusion weights are 14.1 GB on disk (city96 file tree); with the FP8-scaled Qwen2.5-VL-7B text encoder (offloaded to CPU after encoding), the VAE, and 1024×1024 activations, the resident envelope sits comfortably within the 24 GB 7900 XTX — the GGUF path is what makes Qwen-Image fit, since the full BF16 diffusion weights are 40.9 GB and do not. See /check/qwen-image/rx-7900-xtx for any community-submitted measurement.
Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README); 4-bit and 5-bit GGUF quants are widely reported to retain most of that capability, with Q6_K / Q8_0 giving incremental quality at proportional VRAM cost. On this 24 GB card you have the headroom to run Q5_K_S or Q8_0 rather than dropping to the tight Q4 tier.

For the full benchmark data, see /check/qwen-image/rx-7900-xtx.

Troubleshooting

"FP8 is smaller — why not just run the official FP8 build?"

On NVIDIA Ada/Blackwell cards the qwen_image_fp8_e4m3fn.safetensors build (20.4 GB) is genuinely FP8-accelerated. The RX 7900 XTX has no FP8 hardware — its WMMA units accept FP16/BF16/INT8/INT4 only (AMD GPUOpen, "WMMA on RDNA3"). Loading the FP8 safetensors on this card upcasts the weights to BF16, so you pay the full ~40 GB-class footprint with no compute speedup — it OOMs just like the raw BF16 build. Use the GGUF quants from step 5 instead; GGUF is the real AMD memory path.

Noise / garbage image output from the GGUF path on ROCm

There is a known open issue where the ComfyUI-GGUF node can produce noisy, corrupted output on AMD GPUs under ROCm while the same workflow renders correctly on CPU — reported on an RX 7900 XT with a low-bit (Q2_K) flux quant in city96/ComfyUI-GGUF issue #300 (open, no maintainer fix yet). If you hit this, first prefer a higher-bit quant (Q5_K_S / Q8_0 rather than Q2/Q3 — which is the recommended default here anyway), keep ComfyUI and ComfyUI-GGUF up to date, and as a diagnostic confirm the workflow renders correctly with the node forced to CPU. Track the issue for a runtime fix.

VAE decode crashes or returns a black image (bf16 VAE on ROCm)

A ComfyUI user reports a VAE-decode crash on a 24 GB RX 7900 XTX, with the error "Memobj map does not have ptr", in ComfyUI issue #11551 (open). That report is against Z-Image Turbo, not Qwen-Image. In the thread's own test matrix the crash is triggered by ComfyUI's default attention backend and resolved by launching with --use-pytorch-cross-attention (the flag shown in the Running section), not by changing VAE precision. Do not add --bf16-vae on this card: it is not a confirmed fix for that crash and is contested on RDNA3. If you instead get a black image at the VAE step (a separate, precision-related symptom), force the VAE to FP32 with --fp32-vae or fall back to Tiled VAE Decode.

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the build: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds expose the GPU through the torch.cuda device namespace via HIP).
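The same check can be written as a short script that also names the detected device. A minimal sketch: torch.version.hip is set on ROCm builds and is None otherwise, while the classify_torch_build helper and its labels are illustrative rather than part of PyTorch:

```python
def classify_torch_build(version: str, hip_version) -> str:
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    if hip_version:  # a version string on ROCm builds, None on CUDA and CPU builds
        return "rocm"
    if "+cu" in version:
        return "cuda"
    return "cpu-or-unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        kind = classify_torch_build(torch.__version__, torch.version.hip)
        print(f"torch {torch.__version__} -> {kind} build")
        if torch.cuda.is_available():
            # On ROCm the HIP device is exposed through the torch.cuda namespace.
            print("device 0:", torch.cuda.get_device_name(0))
        else:
            print("no GPU visible to PyTorch")
```

If this reports a cuda build, the uninstall and reinstall commands above apply. If it reports rocm but no GPU is visible, the problem is more likely in the driver or device permissions than in the wheel.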

Do not install xformers or FlashAttention

Hugging Face and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.