Stable Diffusion XL on RX 7800 XT: ComfyUI on ROCm (BF16/FP16)

What You'll Build

A local Stable Diffusion XL text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack, producing 1024×1024 images from text prompts with the unquantized base model plus the optional refiner. SDXL is a small model by modern standards: the base checkpoint (~6.94 GB of weights) fits in this card's VRAM without any tricks, so you run the native FP16/BF16 weights with room to spare for the dual CLIP text encoders, the VAE, and a refiner pass. No quantization is needed.

Hardware data: RX 7800 XT (16GB VRAM) · FP16/BF16 · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would be upcast to FP16 for compute with no speed benefit, and at 16 GB the native BF16/FP16 SDXL weights fit comfortably anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	8 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	6.94 GB (base only) or ~13 GB (base + refiner)	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

The base model is released under the CreativeML Open RAIL++-M License and the weights are not gated on Hugging Face — no access request or login is required to download them. SDXL is a latent diffusion model that uses two pretrained text encoders (OpenCLIP-ViT/G and CLIP-ViT/L), confirmed by the base repo's model_index.json (text_encoder: CLIPTextModel + text_encoder_2: CLIPTextModelWithProjection).

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index — the README lists pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ for RX 7000-series cards. On officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path. (The gfx110X-all index name covers the whole RDNA3 family, gfx1100/gfx1101/gfx1102 — so it is the right experimental index for the 7800 XT's gfx1101 if you ever need it.)

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the SDXL checkpoints

Place the single-file checkpoints in ComfyUI/models/checkpoints/. File sizes are verified from the Hugging Face Files tree (base 6,938,078,334 bytes ≈ 6.94 GB; refiner 6,075,981,930 bytes ≈ 6.08 GB):

# Base model (required) — ~6.94 GB
wget -P models/checkpoints/ \
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors

# Refiner (optional, improves fine detail) — ~6.08 GB
wget -P models/checkpoints/ \
  https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/sd_xl_refiner_1.0.safetensors

The refiner is a specialized refinement module that processes the final denoising steps after the base model. It is optional — the base model alone produces complete 1024×1024 images. On the 16 GB 7800 XT you have room to run both in a base→refiner workflow: ComfyUI swaps the refiner checkpoint in for the final steps, so peak residency stays close to a single-checkpoint run rather than the sum of both.
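A truncated download is an easy mistake to miss with files this large. The following is a minimal sketch, not part of ComfyUI, that compares the files on disk against the byte counts quoted in this step. The helper name check_checkpoints is hypothetical, and the script assumes it is run from the ComfyUI repo root.

```python
from pathlib import Path

# Byte counts quoted in this step (from the Hugging Face Files tree).
EXPECTED_BYTES = {
    "sd_xl_base_1.0.safetensors": 6_938_078_334,
    "sd_xl_refiner_1.0.safetensors": 6_075_981_930,
}


def check_checkpoints(checkpoint_dir, expected=EXPECTED_BYTES):
    """Return {filename: status}; status is 'ok', 'missing', or 'size mismatch'."""
    results = {}
    for name, size in expected.items():
        path = Path(checkpoint_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif path.stat().st_size != size:
            results[name] = "size mismatch"
        else:
            results[name] = "ok"
    return results


if __name__ == "__main__":
    # Assumes the current directory is the ComfyUI repo root.
    for name, status in check_checkpoints("models/checkpoints").items():
        print(f"{name}: {status}")
```

A "missing" status for the refiner is expected if you skipped the optional download. A size check only catches incomplete transfers; it is not a substitute for a checksum.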

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load an SDXL workflow: either drag in the built-in "Load Checkpoint → SDXL" template, or build a Load Checkpoint → CLIP Text Encode (×2, one for the positive and one for the negative prompt) → KSampler → VAE Decode → Save Image graph. Select sd_xl_base_1.0.safetensors in the Load Checkpoint node, set the latent resolution to 1024×1024 (SDXL's native training resolution), enter a prompt, and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.
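The same graph can also be queued without the browser, through the server's HTTP endpoint. The sketch below builds the graph described above in ComfyUI's API format and posts it to /prompt, following the pattern of the basic API example script bundled with ComfyUI. The function names, node ids, sampler settings (20 steps, cfg 8.0, euler/normal) and the "sdxl" filename prefix are illustrative assumptions; check the node names and inputs against your installed ComfyUI version.

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8188"  # ComfyUI's default address from this recipe


def build_sdxl_workflow(prompt, negative="", seed=0, steps=20, cfg=8.0,
                        ckpt="sd_xl_base_1.0.safetensors"):
    """Build an API-format graph: checkpoint -> 2x CLIP encode -> KSampler -> VAE decode -> save.

    Each link is [source_node_id, output_index]. The checkpoint loader
    exposes MODEL, CLIP and VAE as outputs 0, 1 and 2.
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt}},
        "2": {"class_type": "CLIPTextEncode",          # positive prompt
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",          # negative prompt
              "inputs": {"text": negative, "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",        # SDXL native resolution
              "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "sdxl"}},
    }


def queue_prompt(workflow, server=SERVER):
    """POST the graph to a running ComfyUI server and return its JSON reply."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Usage (requires the server started with `python main.py`):
#   queue_prompt(build_sdxl_workflow("a lighthouse at dusk", seed=42))
```

Building the graph is kept separate from sending it, so the dictionary can be inspected or saved as JSON before anything is queued.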

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, you can force the PyTorch-2.0 cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

At 16 GB you should not need the memory-saving --use-split-cross-attention fallback (documented as "Use the split cross attention optimization.") for SDXL — that is for tighter-VRAM cards or much larger models. Likewise, do not pass --lowvram on a 7800 XT for SDXL; per the README it forces the text encoders onto the CPU, which only slows you down when you have memory to spare.

Results

Speed: No SDXL iterations-per-second benchmark measured on an RX 7800 XT could be found and verified against its source page. AMD's own ComfyUI on Radeon RX 9000 blog reports an SDXL figure of 4.6 it/s, but that number was measured on an AMD Radeon AI PRO R9700 (32 GB), a different and much higher-end card, so it is not quoted here as the 7800 XT's speed. Borrowing an RDNA3 sibling's number would mislead as well: the 7800 XT has roughly two-thirds of the 7900 XTX's memory bandwidth (624 vs 960 GB/s) and fewer WMMA units. The Speed figure is therefore omitted rather than carried over from a different GPU. If you've measured SDXL it/s on a 7800 XT, please contribute it so it lands on /check/sdxl/rx-7800-xt. As a general note, you can try PYTORCH_TUNABLEOP_ENABLED=1 (see Troubleshooting), which the ComfyUI README says "might speed things up at the cost of a very slow initial run."
VRAM usage: The base checkpoint is ~6.94 GB of weights (HF Files tree); with the dual CLIP text encoders, the VAE, and 1024×1024 activations, an SDXL base run sits comfortably in single-digit-to-low-teens GB — well within the 16 GB 7800 XT, leaving headroom for the refiner stage or a higher batch size. See /check/sdxl/rx-7800-xt for any community-submitted measurement.
Quality notes: SDXL's native resolution is 1024×1024 — generate at that size (or SDXL-aspect variants like 896×1152) for best results; 512×512 produces noticeably worse composition. The optional refiner sharpens fine detail on the final denoising steps but is not required. There is no quantization tradeoff to consider on this card: run the native FP16/BF16 weights.
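The weight sizes behind the VRAM note can be turned into a rough budget. The sketch below is arithmetic only, not a measurement: it converts the checkpoint byte counts from the Hugging Face Files tree to GB and subtracts them from the card's nominal 16 GB, treated here as decimal GB for simplicity. The constant and function names are hypothetical.

```python
BASE_BYTES = 6_938_078_334      # sd_xl_base_1.0.safetensors
REFINER_BYTES = 6_075_981_930   # sd_xl_refiner_1.0.safetensors
CARD_GB = 16                    # RX 7800 XT, nominal (assumed decimal GB)


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


def headroom_gb(resident_bytes, card_gb=CARD_GB):
    """VRAM left after the given weights are resident; ignores activations."""
    return card_gb - to_gb(resident_bytes)


if __name__ == "__main__":
    print(f"base weights:      {to_gb(BASE_BYTES):.2f} GB")
    print(f"refiner weights:   {to_gb(REFINER_BYTES):.2f} GB")
    print(f"headroom, base:    {headroom_gb(BASE_BYTES):.2f} GB")
    print(f"headroom, both:    {headroom_gb(BASE_BYTES + REFINER_BYTES):.2f} GB")
```

The last line shows why swapping checkpoints matters: keeping both sets of weights resident at once would leave only a few GB for activations, whereas a single resident checkpoint leaves most of the card free.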

For the full benchmark data and other-GPU comparisons, see /check/sdxl/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means the installed PyTorch was built without GPU support (typically a CPU-only wheel) instead of the ROCm build. The ComfyUI README's troubleshooting note for this error is to uninstall torch and reinstall it from the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds expose the GPU through the torch.cuda namespace via HIP).
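The checks above can be combined into one small script. This is a sketch under the assumption that torch.version.hip is None on non-ROCm builds and a version string on ROCm builds; the function name describe_torch_build and its messages are hypothetical.

```python
def describe_torch_build(version, hip_version, gpu_available):
    """Classify a PyTorch install from three values read off the torch module."""
    if hip_version is None:
        return "not a ROCm build: reinstall from the ROCm wheel index"
    if not gpu_available:
        return "ROCm build, but no GPU visible: check the ROCm driver install"
    return f"ok: ROCm build {version} (HIP {hip_version})"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        available = torch.cuda.is_available()
        print(describe_torch_build(torch.__version__, torch.version.hip, available))
        if available:
            # On ROCm the AMD card is reported through the cuda namespace.
            print("device:", torch.cuda.get_device_name(0))
```

Run it inside the same Python environment that launches ComfyUI, since a second environment may hold a different torch build.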

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python main.py

This is a legacy fallback, not a default — current ComfyUI on the stable ROCm wheel runs natively on gfx1101 without it. Only reach for it if you hit a "no kernel image is available" / missing-gfx1101-kernel error from a specific library.
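To see where the 11.0.0 value comes from, the sketch below maps a gfx target name to the major.minor.stepping form that HSA_OVERRIDE_GFX_VERSION expects. It assumes the usual naming convention, in which the last two characters of the target are the minor and stepping fields written in hexadecimal and everything before them is the major version. The function name gfx_to_override is hypothetical.

```python
def gfx_to_override(gfx_name):
    """Map a target name such as 'gfx1100' to 'major.minor.stepping'."""
    if not gfx_name.startswith("gfx"):
        raise ValueError(f"not a gfx target name: {gfx_name!r}")
    digits = gfx_name.removeprefix("gfx")
    if len(digits) < 3:
        raise ValueError(f"target name too short: {gfx_name!r}")
    # Everything except the last two characters is the major version;
    # minor and stepping are single hexadecimal digits.
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"


if __name__ == "__main__":
    for target in ("gfx1100", "gfx1101"):
        print(target, "->", gfx_to_override(target))
```

Under that convention the 7800 XT's own identity is 11.0.1, so setting the override to 11.0.0 is what makes the runtime treat it as a gfx1100 device.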

Generation is slower than expected: try TunableOp

Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches the tuned kernels for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.
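The launch variants used in this recipe (plain, forced SDPA, TunableOp, gfx override) can be collected in one small wrapper so they are not retyped each time. This is a convenience sketch, not something ComfyUI ships; build_launch and launch are hypothetical names, and the script assumes it is run from the ComfyUI repo root.

```python
import os
import subprocess
import sys


def build_launch(tunableop=False, gfx_override=None, force_sdpa=False):
    """Return (env, argv) for starting ComfyUI with the options from this recipe."""
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    if tunableop:
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    if gfx_override:
        # Legacy fallback only, e.g. "11.0.0" to present the card as gfx1100.
        env["HSA_OVERRIDE_GFX_VERSION"] = gfx_override
    argv = [sys.executable, "main.py"]
    if force_sdpa:
        argv.append("--use-pytorch-cross-attention")
    return env, argv


def launch(**options):
    """Start ComfyUI and block until the server process exits."""
    env, argv = build_launch(**options)
    return subprocess.run(argv, env=env, check=False)


# Usage, from the ComfyUI repo root:
#   launch()                                  # defaults, as in "Running"
#   launch(tunableop=True, force_sdpa=True)   # TunableOp plus explicit SDPA
```

Leaving every option off by default matches the recipe's advice: the stable ROCm wheel should run natively on gfx1101, and the extra settings are only for troubleshooting.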