How much VRAM does Stable Diffusion XL need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the installation steps below.

Stable Diffusion XL on RX 7900 XTX: ComfyUI on ROCm (BF16/FP16)

What You'll Build

A local Stable Diffusion XL text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack, producing 1024×1024 images from text prompts using the unquantized FP16/BF16 base model plus the optional refiner. With 24 GB of VRAM the SDXL base checkpoint (~6.94 GB of weights) is never VRAM-constrained: you run the native FP16/BF16 weights with room to spare for the dual CLIP text encoders, the VAE, and a refiner pass, and no quantization is needed.

Hardware data: RX 7900 XTX (24 GB VRAM) · FP16/BF16 · ComfyUI on ROCm 7.2

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to FP16 with no memory saving — and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	6.94 GB (base only) or ~13 GB (base + refiner)	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

The base model is released under the CreativeML Open RAIL++-M License and the weights are not gated on Hugging Face — no access request or login is required to download them. SDXL is a latent diffusion model that uses two pretrained text encoders (OpenCLIP-ViT/bigG and CLIP-ViT/L), confirmed by the base repo's model_index.json (text_encoder: CLIPTextModel + text_encoder_2: CLIPTextModelWithProjection).

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) that the README lists for Windows+Linux RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.
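Because the tag changes between releases, it can help to build the install command from a single variable rather than retyping the URL. The sketch below is only an illustration: the helper names are hypothetical, the URL patterns are the stable and nightly ones quoted above, and the `--pre` switch for nightly wheels is an assumption based on how PyTorch nightlies are usually installed.

```python
import re

# Root of the PyTorch wheel indexes quoted in this recipe.
PYTORCH_WHEEL_ROOT = "https://download.pytorch.org/whl"


def rocm_index_url(tag: str, nightly: bool = False) -> str:
    """Return the pip --index-url for a ROCm wheel tag such as 'rocm7.2'."""
    # Reject anything that is not rocm<major>.<minor>, e.g. a CUDA tag like 'cu128'.
    if not re.fullmatch(r"rocm\d+\.\d+", tag):
        raise ValueError(f"not a ROCm wheel tag: {tag!r}")
    channel = "nightly/" if nightly else ""
    return f"{PYTORCH_WHEEL_ROOT}/{channel}{tag}"


def pip_install_command(tag: str, nightly: bool = False) -> str:
    """Assemble the pip command line for the given tag (hypothetical helper)."""
    # Assumption: nightly wheels are pre-releases, so pip needs --pre to pick them up.
    prefix = "pip install --pre" if nightly else "pip install"
    return (
        f"{prefix} torch torchvision torchaudio "
        f"--index-url {rocm_index_url(tag, nightly)}"
    )
```

Passing the tag you read from the live README, for example `pip_install_command("rocm7.2")`, reproduces the stable command shown in step 2, while a CUDA-style tag raises an error instead of silently producing the wrong index.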

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the SDXL checkpoints

Place the single-file checkpoints in ComfyUI/models/checkpoints/. File sizes are verified from the Hugging Face Files tree (base 6,938,078,334 bytes ≈ 6.94 GB; refiner 6,075,981,930 bytes ≈ 6.08 GB):

# Base model (required) — ~6.94 GB
wget -P models/checkpoints/ \
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors

# Refiner (optional, improves fine detail) — ~6.08 GB
wget -P models/checkpoints/ \
  https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/sd_xl_refiner_1.0.safetensors

The refiner is a specialized refinement module that processes the final denoising steps after the base model. It is optional — the base model alone produces complete 1024×1024 images — but at 24 GB you have ample room to run both in a base→refiner workflow.
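An interrupted download leaves a truncated checkpoint that ComfyUI will fail to load, so a quick size check is worth doing. The sketch below compares a file's size against the byte counts quoted above and converts them to decimal gigabytes. The function names are hypothetical, and a size match is only a sanity check, not a hash verification.

```python
from pathlib import Path

# Byte counts quoted in this recipe from the Hugging Face Files tree.
EXPECTED_BYTES = {
    "sd_xl_base_1.0.safetensors": 6_938_078_334,
    "sd_xl_refiner_1.0.safetensors": 6_075_981_930,
}


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes), the unit used here."""
    return num_bytes / 1e9


def check_download(path, expected_bytes=None) -> bool:
    """Return True if the file exists and has exactly the expected size.

    When expected_bytes is omitted, the file name is looked up in
    EXPECTED_BYTES (a KeyError means the name is not one of the two above).
    """
    path = Path(path)
    if expected_bytes is None:
        expected_bytes = EXPECTED_BYTES[path.name]
    return path.is_file() and path.stat().st_size == expected_bytes


def total_storage_gb() -> float:
    """Disk space for base plus refiner, in decimal gigabytes."""
    return to_gb(sum(EXPECTED_BYTES.values()))
```

Run from the ComfyUI repo root, `check_download("models/checkpoints/sd_xl_base_1.0.safetensors")` returns False for a missing or partial file. The two byte counts sum to roughly 13 GB, which is where the storage figure in the Requirements table comes from.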

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser, load the default SDXL workflow (drag in the built-in "Load Checkpoint → SDXL" template, or build a Load Checkpoint → CLIP Text Encode (×2, positive and negative) → KSampler → VAE Decode → Save Image graph, with an Empty Latent Image node feeding the KSampler), select sd_xl_base_1.0.safetensors, set the resolution to 1024×1024 (SDXL's native training resolution), enter a prompt, and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.
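The same graph can also be queued without the browser by posting it to the server's /prompt endpoint in ComfyUI's API format. The sketch below is a minimal illustration, not an official client: the function names are hypothetical, and the step count, CFG scale, sampler, scheduler, and filename prefix are placeholder assumptions you would tune yourself.

```python
import json
import urllib.request


def build_sdxl_graph(prompt, negative="", ckpt="sd_xl_base_1.0.safetensors",
                     width=1024, height=1024, seed=0, steps=30, cfg=7.0):
    """Return an API-format graph: node id -> {class_type, inputs}.

    A link to another node's output is written as [node_id, output_index].
    The checkpoint loader exposes MODEL, CLIP and VAE as outputs 0, 1 and 2.
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt}},
        "2": {"class_type": "CLIPTextEncode",          # positive prompt
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",          # negative prompt
              "inputs": {"text": negative, "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         # Assumed defaults; any sampler/scheduler ComfyUI lists works.
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "sdxl"}},
    }


def queue_prompt(graph, server="http://127.0.0.1:8188"):
    """POST the graph to a running ComfyUI server and return its JSON reply."""
    body = json.dumps({"prompt": graph}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server from `python main.py` running, `queue_prompt(build_sdxl_graph("a lighthouse at dusk"))` should enqueue one 1024×1024 generation, and the result appears in ComfyUI/output/ like any browser-queued job.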

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, you can force the PyTorch-2.0 cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

At 24 GB you should not need the memory-saving --use-split-cross-attention fallback (documented as "Use the split cross attention optimization.") — that is for VRAM-constrained cards. Likewise, do not pass --lowvram on a 7900 XTX; per the README it forces the text encoders onto the CPU, which only slows you down when you have memory to spare.

Results

Speed: No verifiable SDXL iterations-per-second benchmark measured on the RX 7900 XTX itself was found. AMD's own ComfyUI on Radeon RX 9000 blog lists the RX 7900 XTX (RDNA3) among supported cards and reports an SDXL figure of 4.6 it/s, but that number was measured on an AMD Radeon AI PRO R9700 (32 GB), a different card. Rather than transfer a number from a different GPU, the Speed figure is omitted. If you've measured SDXL it/s on a 7900 XTX, please contribute it so it lands on /check/sdxl/rx-7900-xtx. As a general note, you can try PYTORCH_TUNABLEOP_ENABLED=1 (see Troubleshooting), which the ComfyUI README says "might speed things up at the cost of a very slow initial run."
VRAM usage: The base checkpoint is ~6.94 GB of weights (HF Files tree); with the dual CLIP text encoders, the VAE, and 1024×1024 activations, an SDXL base run sits comfortably in single-digit-to-low-teens GB — trivially within the 24 GB 7900 XTX, leaving headroom for the refiner stage or a higher batch size. See /check/sdxl/rx-7900-xtx for any community-submitted measurement.
Quality notes: SDXL's native resolution is 1024×1024 — generate at that size (or SDXL-aspect variants like 896×1152) for best results; 512×512 produces noticeably worse composition. The optional refiner sharpens fine detail on the final denoising steps but is not required. There is no quantization tradeoff to consider on this card: run the native FP16/BF16 weights.
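To pick a non-square size that stays close to SDXL's native pixel budget, you can search for the width and height nearest a given aspect ratio. The sketch below is a heuristic rather than an official bucket list: the 64-pixel grid, the roughly one-megapixel target, and the 512 to 2048 search range are assumptions drawn from commonly used SDXL resolutions.

```python
def sdxl_resolution(aspect: float, target_pixels: int = 1024 * 1024,
                    step: int = 64):
    """Pick (width, height), both multiples of `step`, with an area near
    `target_pixels` and a width/height ratio as close as possible to `aspect`."""
    best = None
    for width in range(512, 2048 + step, step):
        # Height that keeps the area near the target, snapped to the grid.
        height = max(step, round(target_pixels / width / step) * step)
        error = abs(width / height - aspect)
        # Keep the first candidate with the smallest aspect-ratio error.
        if best is None or error < best[0]:
            best = (error, width, height)
    return best[1], best[2]
```

For a 7:9 portrait ratio this search lands on the 896×1152 variant mentioned above, and for a square ratio it returns 1024×1024.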

For the full benchmark data and other-GPU comparisons, see /check/sdxl/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP, so the "cuda" device name is expected here).
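The check above can be wrapped in a small script that reports which kind of PyTorch build is installed. This is a sketch under stated assumptions: the function names are hypothetical, and it relies on torch.version.hip being set on ROCm builds and None otherwise.

```python
def describe_torch_build(version: str, hip_version=None) -> str:
    """Classify a PyTorch build from torch.__version__ and torch.version.hip."""
    # The local tag follows '+': e.g. 'rocm7.2', 'cu128', 'cpu', or '' if absent.
    local_tag = version.partition("+")[2]
    if hip_version or local_tag.startswith("rocm"):
        return "rocm"
    if local_tag.startswith("cu"):
        return "cuda"
    if local_tag == "cpu":
        return "cpu"
    return "unknown"


def inspect_installed_torch() -> dict:
    """Report on the PyTorch installed in the current environment."""
    import torch  # imported lazily so the classifier above works without torch

    available = torch.cuda.is_available()
    return {
        "version": torch.__version__,
        "build": describe_torch_build(torch.__version__, torch.version.hip),
        "hip": torch.version.hip,
        "gpu_available": available,
        # On ROCm the Radeon card is reported through the torch.cuda API.
        "device": torch.cuda.get_device_name(0) if available else None,
    }
```

Printing `inspect_installed_torch()` inside the ComfyUI environment should show a "rocm" build with gpu_available set to True. A "cuda" or "cpu" build is the situation that produces the error in this section's heading.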

Generation feels slow: try TunableOp (expect a slow first run)

Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp benchmarks candidate GEMM kernels for your card on the first pass (slow), then reuses the recorded winners for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py
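If you start ComfyUI from a wrapper script rather than a shell, the same variable can be set on the child process only. The sketch below is one way to do that. The helper names and the "ComfyUI" default directory are hypothetical, and the only setting it touches is the PYTORCH_TUNABLEOP_ENABLED variable quoted above.

```python
import os
import subprocess
import sys


def tunableop_env(base_env=None) -> dict:
    """Return a copy of the environment with TunableOp switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    return env


def launch_comfyui(repo_dir="ComfyUI", extra_args=()):
    """Run main.py with the current interpreter and TunableOp enabled.

    Blocks until the server exits; the parent environment is left unchanged.
    """
    command = [sys.executable, "main.py", *extra_args]
    return subprocess.run(command, cwd=repo_dir, env=tunableop_env(), check=False)
```

Calling `launch_comfyui(extra_args=["--use-pytorch-cross-attention"])` would be equivalent to prefixing the shell command with the variable, and it keeps the setting out of your login environment.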

Out of memory on a smaller card (not the 7900 XTX)

The 24 GB 7900 XTX has no memory pressure for SDXL, but if you adapt this recipe to a smaller ROCm card, the AMD Radeon ComfyUI install docs suggest the --lowvram and --disable-pinned-memory launch flags. Do not use them on the 7900 XTX — they trade speed for memory you don't need.

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.