How much VRAM does Qwen-Image need?

About 14 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the installation steps below.

Qwen-Image on RX 7800 XT: 20B Text-to-Image via ComfyUI GGUF on ROCm (16 GB)

What You'll Build

A local Qwen-Image text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Qwen-Image is "a 20B MMDiT image foundation model" from Alibaba's Qwen team (Tongyi Lab), released August 4, 2025 under Apache 2.0, recognised for complex text rendering. It is a heavy model: the full BF16 diffusion weights are 40.9 GB on disk (city96 GGUF card), which does not fit even a 24 GB card, let alone 16 GB. This recipe leads with the city96 GGUF quants loaded through the ComfyUI-GGUF custom node — Q4_K_M (13.1 GB) or Q5_K_S (14.1 GB) — with the Qwen2.5-VL-7B text encoder offloaded to CPU, which is what makes a 20B MMDiT fit inside the 7800 XT's 16 GB envelope.

Hardware data: RX 7800 XT (16GB VRAM) · 20B-parameter MMDiT via GGUF Q4/Q5 on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA — and FP8 is not an escape hatch here. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers, and no FP8/FP4 path. RDNA3's WMMA units accept FP16, BF16, INT8, INT4 only — there is no FP8 hardware. So the official qwen_image_fp8_e4m3fn.safetensors build (20.4 GB on disk, ComfyUI tutorial) gives you nothing on this card: the FP8 weights upcast to BF16 at load, restoring the full ~40 GB-class footprint with no compute acceleration. The path that actually fits 16 GB is GGUF Q4/Q5 with the text encoder offloaded to CPU. The attention path is PyTorch SDPA (--use-pytorch-cross-attention), never FlashAttention-2 and never xformers.
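The upcast arithmetic behind that warning can be sketched in a few lines. This is a back-of-envelope estimate only: it assumes every weight takes one byte as FP8 and two bytes as BF16, ignores any tensors stored at other precisions, and uses an illustrative function name.

```python
# Back-of-envelope: what an FP8 checkpoint costs once upcast to BF16.
# Assumption: every weight is 1 byte as FP8 and 2 bytes as BF16.
FP8_BYTES = 1
BF16_BYTES = 2


def upcast_footprint_gb(fp8_size_gb: float) -> float:
    """Estimate the in-memory size of FP8 weights after upcasting to BF16."""
    return fp8_size_gb * BF16_BYTES / FP8_BYTES


fp8_on_disk_gb = 20.4   # qwen_image_fp8_e4m3fn.safetensors, as quoted above
card_vram_gb = 16.0     # RX 7800 XT

estimate = upcast_footprint_gb(fp8_on_disk_gb)
print(f"~{estimate:.1f} GB after upcast vs {card_vram_gb:.0f} GB of VRAM")
print("fits" if estimate <= card_vram_gb else "does not fit")
```

The estimate lands within about 0.1 GB of the 40.9 GB BF16 file size quoted for the full weights, which is why the FP8 build is treated as a ~40 GB-class load on this card.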

ℹ️ Why a lower quant than the 24 GB Radeon recipe. On the 24 GB RX 7900 XTX, Qwen-Image runs comfortably at GGUF Q5/Q8 with headroom. The 16 GB RX 7800 XT is a tighter fit: Q8_0 (21.8 GB) and Q6_K (16.8 GB) both overflow 16 GB, so this recipe drops to Q4_K_M / Q5_K_S (~13–14 GB) and offloads the text encoder. There is no FP8 lever to fall back on (no AMD FP8 hardware), so GGUF + CPU offload is the squeeze.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	32 GB system (text-encoder CPU offload)	—
Storage	~20 GB (Q5_K_S diffusion + GGUF encoder + VAE)	per city96 + Comfy-Org trees
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build) + ComfyUI-GGUF, Python 3.10+	—

Qwen-Image is released under Apache 2.0 ("Qwen-Image is licensed under Apache 2.0" on the model card) and the weights are not gated — no access request or login is required. The 20B parameter count is stated on the Qwen-Image GitHub README ("a 20B MMDiT image foundation model"). The diffusion model is paired with a Qwen2.5-VL-7B text encoder and a dedicated VAE (Comfy-Org repackage).

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (--pre ... --index-url https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. Unlike a build-from-source flow you do not need to set PYTORCH_ROCM_ARCH — the stable wheel ships gfx1101 kernels. If a tool ever ships only gfx1100 kernels, the legacy fallback HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerades the 7800 XT as gfx1100, but it is not required on the current stable ROCm/ComfyUI path.
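To see which architecture the installed wheel reports before deciding whether the override matters, a small check along these lines may help. It is a sketch under assumptions: it assumes a ROCm build of PyTorch exposes the architecture string as `gcnArchName` on the device properties (read defensively with `getattr`), and `base_arch` is a helper made up for this example.

```python
def base_arch(gcn_arch_name: str) -> str:
    """Strip feature suffixes, e.g. 'gfx90a:sramecc+:xnack-' -> 'gfx90a'."""
    return gcn_arch_name.split(":")[0].strip()


def detected_arch():
    """Return the GPU architecture PyTorch reports, or None if unavailable."""
    try:
        import torch  # imported lazily so base_arch works without torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    name = getattr(props, "gcnArchName", None)  # assumed present on ROCm builds
    return base_arch(name) if name else None


if __name__ == "__main__":
    arch = detected_arch()
    print(f"detected architecture: {arch}")
    if arch == "gfx1101":
        print("device reports gfx1101, so no override appears to be active")
```

If the script prints gfx1100 on a 7800 XT, the HSA_OVERRIDE_GFX_VERSION variable is probably set somewhere in your environment.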

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Install the ComfyUI-GGUF custom node

The native ComfyUI loader cannot read GGUF files. Install city96's loader, which dequantizes GGUF tensors into PyTorch tensors and runs them through ComfyUI's normal model path — so it works on whatever backend ComfyUI is on, ROCm included (it is not CUDA-specific). From your ComfyUI root:

git clone https://github.com/city96/ComfyUI-GGUF custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Source: ComfyUI-GGUF README.
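A quick sanity check after this step might look like the following. It is a minimal sketch, assuming it is run from the ComfyUI root and that the node lives in custom_nodes/ComfyUI-GGUF; the function name is hypothetical.

```python
import importlib.util
from pathlib import Path


def gguf_node_problems(comfy_root: Path) -> list:
    """List anything missing for the GGUF loader; an empty list means ready."""
    problems = []
    node_dir = comfy_root / "custom_nodes" / "ComfyUI-GGUF"
    if not node_dir.is_dir():
        problems.append(f"missing custom node folder: {node_dir}")
    if importlib.util.find_spec("gguf") is None:
        problems.append("Python package 'gguf' is not installed")
    return problems


if __name__ == "__main__":
    # Run from the ComfyUI root so Path(".") points at the checkout.
    for line in gguf_node_problems(Path(".")) or ["ComfyUI-GGUF looks installed"]:
        print(line)
```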

5. Download the quantized diffusion weights

From city96/Qwen-Image-gguf — "a direct GGUF conversion of Qwen/Qwen-Image" — pull a quant. The size table below is verified against the repo's current file tree:

Quant	Size on disk	Fits 16 GB?
Q4_K_S	12.1 GB	yes (most headroom)
Q4_K_M	13.1 GB	yes (recommended)
Q5_K_S	14.1 GB	yes (best quality that fits)
Q5_K_M	14.9 GB	tight
Q6_K	16.8 GB	no — overflows 16 GB
Q8_0	21.8 GB	no — overflows 16 GB
BF16	40.9 GB	no
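The "Fits 16 GB?" column can be approximated with simple arithmetic. The sketch below takes the table's sizes at face value and subtracts a reserve for the VAE and activations; the 1.5 GB reserve is a hypothetical allowance chosen for illustration, not a measured figure.

```python
# Sizes on disk in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q4_K_S": 12.1,
    "Q4_K_M": 13.1,
    "Q5_K_S": 14.1,
    "Q5_K_M": 14.9,
    "Q6_K": 16.8,
    "Q8_0": 21.8,
    "BF16": 40.9,
}


def quants_that_fit(vram_gb: float, reserve_gb: float) -> list:
    """Quants whose weights leave at least reserve_gb free, smallest first."""
    budget = vram_gb - reserve_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return [name for _size, name in sorted(fitting)]


# reserve_gb is a hypothetical allowance for the VAE and activations.
fits = quants_that_fit(vram_gb=16.0, reserve_gb=1.5)
print(fits)  # ['Q4_K_S', 'Q4_K_M', 'Q5_K_S']
```

With that reserve, the three "yes" rows qualify and Q5_K_M drops out, consistent with its "tight" label. Adjust the reserve to match what you observe on your own system.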

On the 16 GB 7800 XT, Q4_K_M (13.1 GB) is the recommended default: it leaves about 1 GB more room than Q5_K_S for the VAE and 1024×1024 activations once the encoder is offloaded, and (per the city96 model card) the K_M quants keep the first/last layer in high precision via a dynamic-precision scheme. Q4_K_S (12.1 GB) leaves the most headroom if you need it. Q5_K_S (14.1 GB) is the highest-quality quant that still fits; use it if you can spare the headroom. Place the file in ComfyUI/models/diffusion_models/:

# from your ComfyUI root (the directory containing main.py); Q4_K_M recommended (~13.1 GB)
wget -O models/diffusion_models/qwen-image-Q4_K_M.gguf \
  https://huggingface.co/city96/Qwen-Image-gguf/resolve/main/qwen-image-Q4_K_M.gguf

6. Download the text encoder and VAE

The Qwen2.5-VL-7B text encoder and the VAE come from the official Comfy-Org repackage. On a 16 GB card the encoder must be offloaded to CPU (it loads, encodes the prompt, then frees before the diffusion transformer dominates VRAM). The FP8-scaled encoder is 9.4 GB on disk; note that on AMD it upcasts to BF16 in memory, so keep it on CPU rather than resident:

# from your ComfyUI root (the directory containing main.py)
wget -O models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors
wget -O models/vae/qwen_image_vae.safetensors \
  https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors
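Multi-gigabyte downloads sometimes truncate silently, so it can be worth comparing file sizes against the figures quoted in this recipe. The sketch below assumes those figures are decimal gigabytes and allows a 0.1 GB tolerance for rounding; both are assumptions, and the paths are taken relative to the ComfyUI root. No size is quoted here for the VAE, so it is left out.

```python
from pathlib import Path

GB = 1000 ** 3  # decimal gigabytes; assumed to match how the sizes are quoted


def size_ok(path: Path, expected_gb: float, tolerance_gb: float = 0.1) -> bool:
    """True if the file exists and its size is within tolerance of expected_gb."""
    if not path.is_file():
        return False
    actual_gb = path.stat().st_size / GB
    return abs(actual_gb - expected_gb) <= tolerance_gb


# Expected sizes quoted in this recipe; paths are relative to the ComfyUI root.
EXPECTED = {
    Path("models/diffusion_models/qwen-image-Q4_K_M.gguf"): 13.1,
    Path("models/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors"): 9.4,
}

if __name__ == "__main__":
    for path, expected_gb in EXPECTED.items():
        status = "ok" if size_ok(path, expected_gb) else "missing or unexpected size"
        print(f"{path}: {status}")
```

A size mismatch is only a hint; re-download the file if in doubt.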

ℹ️ Tighter option — a GGUF text encoder. If you want to keep the encoder smaller still, the city96 card links a GGUF build of the encoder at unsloth/Qwen2.5-VL-7B-Instruct-GGUF (e.g. Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf, 4.7 GB), loaded via the CLIPLoader (GGUF) node from ComfyUI-GGUF. This is the lightest-footprint encoder path and pairs well with the Q5_K_S diffusion quant on a 16 GB card.

7. Load the workflow

The city96 GGUF model card ships a ready-to-use workflow at media/qwen-image_workflow.json in the repo. Drag the JSON onto the ComfyUI canvas — it pre-wires the Unet Loader (GGUF) node (from the bootleg category, per the ComfyUI-GGUF README), the Qwen2.5-VL text encoder, and the VAE.
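Before loading the file, you can list which node types a workflow expects. The sketch below assumes the canvas export format (a top-level "nodes" list whose entries carry a "type" field) and assumes that UnetLoaderGGUF is the internal name behind the Unet Loader (GGUF) node; the sample workflow is a tiny stand-in, not the real file.

```python
import json
from collections import Counter


def node_types(workflow: dict) -> Counter:
    """Count node types in a ComfyUI workflow exported from the canvas."""
    return Counter(node.get("type", "?") for node in workflow.get("nodes", []))


def load_workflow(path: str) -> dict:
    """Read a workflow JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Tiny stand-in for illustration; the real workflow's node list will differ.
sample = {"nodes": [{"id": 1, "type": "UnetLoaderGGUF"},
                    {"id": 2, "type": "KSampler"},
                    {"id": 3, "type": "VAEDecode"}]}

counts = node_types(sample)
print(dict(counts))
print("GGUF loader present:", "UnetLoaderGGUF" in counts)
```

To inspect the real workflow, pass the result of `load_workflow("qwen-image_workflow.json")` to `node_types`. A type you do not recognise usually means a custom node is missing.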

Running

Launch ComfyUI from the repo root with the PyTorch SDPA attention backend forced on (the recommended path on ROCm) and the text encoder offloaded to CPU:

python main.py --use-pytorch-cross-attention --lowvram

Per ComfyUI's cli_args.py, --use-pytorch-cross-attention selects "Use the new pytorch 2.0 cross attention function." — this keeps attention on PyTorch SDPA rather than the unavailable FlashAttention/xformers path. On this 16 GB card --lowvram is appropriate: it pushes the text encoder onto the CPU so only the GGUF diffusion weights, the VAE, and the activations stay resident — the offload that makes the 20B model fit.
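If you prefer a small launcher script over retyping the flags, something like the following could work. It is a sketch: the function names are hypothetical, and the optional environment variable is the legacy override mentioned in step 2, left unset by default.

```python
import os
import sys
from typing import Optional


def launch_command(lowvram: bool = True) -> list:
    """Assemble the ComfyUI launch command used in this recipe."""
    cmd = [sys.executable, "main.py", "--use-pytorch-cross-attention"]
    if lowvram:
        cmd.append("--lowvram")
    return cmd


def launch_env(gfx_override: Optional[str] = None) -> dict:
    """Copy the current environment, optionally adding the legacy gfx override."""
    env = dict(os.environ)
    if gfx_override:
        env["HSA_OVERRIDE_GFX_VERSION"] = gfx_override  # only if a tool needs it
    return env


if __name__ == "__main__":
    # Print rather than run, so nothing starts until you decide to.
    print(" ".join(launch_command()))
```

To start the server from Python, pass both results to `subprocess.run(launch_command(), env=launch_env(), check=True)` from the repo root.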

This starts the server (default http://127.0.0.1:8188). Open it, load the workflow JSON from step 7, select qwen-image-Q4_K_M.gguf in the Unet Loader (GGUF) node, enter a prompt, and queue. Generated PNGs land in ComfyUI/output/.

Results

Speed: No RX-7800-XT-named Qwen-Image benchmark exists yet. The backend reports verdict: unknown with no benchmarks for this pair (/check/qwen-image/rx-7800-xt), and the 24 GB RX 7900 XTX recipe's speed is also unmeasured — and would not transfer anyway, since the 7800 XT has lower memory bandwidth (624 vs 960 GB/s) and fewer compute units. So no speed figure is quoted. If you've measured Qwen-Image generation time on a 7800 XT, please contribute it so it lands on /check/qwen-image/rx-7800-xt.
VRAM usage: The Q4_K_M diffusion weights are 13.1 GB on disk and Q5_K_S is 14.1 GB (city96 file tree). Because Qwen-Image is a single-pass pipeline, the text encoder loads, encodes the prompt, and is freed (offloaded to CPU) before the diffusion transformer becomes the resident peak — so the GGUF diffusion weights plus the VAE and 1024×1024 activations are the binding load, and Q4/Q5 fit the 16 GB 7800 XT. The full BF16 diffusion weights (40.9 GB) and the FP8 build (which upcasts to BF16, ~40 GB-class) do not fit, which is why the GGUF + offload path is the only one documented here. See /check/qwen-image/rx-7800-xt for any community-submitted measurement.
Quality notes: Qwen-Image is positioned as a strong text-rendering model (GitHub README); 4-bit and 5-bit GGUF quants are widely reported to retain most of that capability, and the city96 K_M quants keep the first/last layer in high precision (city96 card notes). On this 16 GB card Q5_K_S is the highest-fidelity quant that fits; the Q6_K/Q8_0 tiers that the 24 GB Radeon can run overflow 16 GB.
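If you want to contribute a VRAM figure, one way to read it is from the running server. The sketch below assumes ComfyUI serves a /system_stats endpoint whose payload has a "devices" list with vram_total and vram_free in bytes; treat those field names as assumptions and check the live response. The sample payload uses made-up numbers, not a measurement.

```python
import json
from urllib.request import urlopen

GIB = 1024 ** 3


def vram_report(stats: dict) -> list:
    """Summarise used/total VRAM per device from a /system_stats payload."""
    lines = []
    for dev in stats.get("devices", []):
        total = dev["vram_total"]
        used = total - dev["vram_free"]
        lines.append(f"{dev.get('name', 'gpu')}: {used / GIB:.1f} / {total / GIB:.1f} GiB used")
    return lines


def fetch_stats(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Fetch live stats from a running ComfyUI server."""
    with urlopen(f"{base_url}/system_stats", timeout=5) as resp:
        return json.load(resp)


# Illustrative payload with made-up numbers, not a measurement.
sample = {"devices": [{"name": "example-gpu",
                       "vram_total": 16 * GIB,
                       "vram_free": 2 * GIB}]}
print("\n".join(vram_report(sample)))
```

A single call is a snapshot, not a peak. Polling `vram_report(fetch_stats())` while a generation is running gives a better picture of the resident load.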

For the full benchmark data, see /check/qwen-image/rx-7800-xt.

Troubleshooting

"FP8 is smaller — why not just run the official FP8 build?"

On NVIDIA Ada/Blackwell cards the qwen_image_fp8_e4m3fn.safetensors build (20.4 GB) is genuinely FP8-accelerated. The RX 7800 XT has no FP8 hardware — its WMMA units accept FP16/BF16/INT8/INT4 only (AMD GPUOpen, "WMMA on RDNA3"). Loading the FP8 safetensors on this card upcasts the weights to BF16, so you pay the full ~40 GB-class footprint with no compute speedup — it OOMs hard on 16 GB. Use the GGUF Q4/Q5 quants from step 5 instead; GGUF is the real AMD memory path.

Out of memory at the diffusion or VAE-decode step

On 16 GB, headroom is tighter than on the 24 GB Radeon. If you OOM: (a) drop from Q5_K_S to Q4_K_M (13.1 GB) or Q4_K_S (12.1 GB); (b) confirm --lowvram is set so the text encoder is offloaded to CPU and only the diffusion weights stay resident; (c) switch the encoder to the lighter GGUF build via CLIPLoader (GGUF); and (d) if the VAE-decode stage spikes, use Tiled VAE Decode to cap decode-time VRAM.
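Option (a) amounts to walking down a short ladder of quants. As a minimal illustration, using the sizes from step 5 and a hypothetical helper name:

```python
# Quants that fit 16 GB, largest first; sizes in GB from the step 5 table.
LADDER = [("Q5_K_S", 14.1), ("Q4_K_M", 13.1), ("Q4_K_S", 12.1)]


def step_down(current: str):
    """Return (next_quant, gb_saved), or None when already on the smallest."""
    names = [name for name, _ in LADDER]
    idx = names.index(current)  # raises ValueError for a quant not on the ladder
    if idx + 1 >= len(LADDER):
        return None
    next_name, next_size = LADDER[idx + 1]
    return next_name, round(LADDER[idx][1] - next_size, 1)


print(step_down("Q5_K_S"))  # ('Q4_K_M', 1.0)
print(step_down("Q4_K_S"))  # None
```

Each step frees roughly 1 GB of weights. Once the ladder returns None, the remaining levers are options (b) through (d).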

Noise / garbage image output from the GGUF path on ROCm

There is a known open issue where the ComfyUI-GGUF node can produce noisy, corrupted output on AMD GPUs under ROCm while the same workflow renders correctly on CPU — reported on an RX 7900 XT with a low-bit (Q2_K) flux quant in city96/ComfyUI-GGUF issue #300 (open, no maintainer fix yet). If you hit this, first prefer a higher-bit quant (Q4_K_M / Q5_K_S rather than Q2/Q3 — which is the recommended default here anyway), keep ComfyUI and ComfyUI-GGUF up to date, and as a diagnostic confirm the workflow renders correctly with the node forced to CPU. Track the issue for a runtime fix.

VAE decode crashes or returns a black image (VAE on ROCm)

A ComfyUI user reports a VAE-decode crash on a 24 GB RX 7900 XTX, with the error "Memobj map does not have ptr", in ComfyUI issue #11551 (open). That report is against Z-Image Turbo, a different model from Qwen-Image. In the thread's own test matrix the crash is triggered by ComfyUI's default attention backend and resolved by launching with --use-pytorch-cross-attention, the flag this recipe already uses; changing VAE precision was not the fix. Do not add --bf16-vae on this card: it is not a confirmed fix for that crash and is contested on RDNA3. If you instead get a black image at the VAE step (a separate, precision-related symptom), force the VAE to FP32 with --fp32-vae or fall back to Tiled VAE Decode.

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the build: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP).
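The same check can be scripted. This sketch classifies a build from its version string plus `torch.version.hip` and `torch.version.cuda`; the classification rules and function names are this example's own, and torch is imported lazily so the pure function works without it.

```python
from typing import Optional


def classify_build(version: str, hip: Optional[str], cuda: Optional[str]) -> str:
    """Classify a PyTorch build from its version string and backend versions."""
    if hip or "+rocm" in version:
        return "rocm"
    if cuda or "+cu" in version:
        return "cuda"
    return "cpu"


def installed_build() -> Optional[str]:
    """Classify the installed PyTorch, or return None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return classify_build(torch.__version__, torch.version.hip, torch.version.cuda)


if __name__ == "__main__":
    build = installed_build()
    print(f"installed build: {build}")
    if build not in (None, "rocm"):
        print("not a ROCm build: reinstall from the ROCm wheel index")
```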

Do not install xformers or FlashAttention

Hugging Face and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.
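To check whether either package has already crept into your environment, a lookup like this is enough. It is a sketch that assumes the import names are `xformers` and `flash_attn`, and it only detects packages without removing anything.

```python
import importlib.util

# Packages this recipe recommends leaving out on RDNA3 (import names assumed).
UNWANTED = ("xformers", "flash_attn")


def unwanted_installed(names=UNWANTED) -> list:
    """Return the subset of names that are importable in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    found = unwanted_installed()
    print("found: " + ", ".join(found) if found else "neither package is installed")
```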