How much VRAM does Chroma V48 need?

About 10 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

Chroma1-Base (V48) on RX 7800 XT: Uncensored 8.9B FLUX.1-Schnell De-Distillation via Q8_0 GGUF in ComfyUI on ROCm

What You'll Build

A local Chroma1-Base text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Chroma1-Base is the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock. The official HF card states verbatim that "Chroma1-Base is Chroma-v.48", so these are the literal V48 weights. On 16 GB the native BF16 checkpoint (~17.8 GB on disk) does not fit, and RDNA3 has no FP8 hardware to shrink it, so this recipe leads with the Q8_0 GGUF (9.74 GB), a near-lossless quant that runs comfortably on the card through ComfyUI's GGUF loader, alongside the T5 XXL text encoder and the FLUX VAE.

Hardware data: RX 7800 XT (16GB VRAM) · Q8_0 GGUF (9.74 GB on disk) · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only — there is no FP8 hardware — so the FP8-scaled Chroma file that an NVIDIA Ada/Blackwell recipe leads with would just upcast to BF16 on this card, doubling its memory with no compute acceleration. On a 16 GB card you also cannot use FP8 as the memory-saving squeeze that an NVIDIA 16 GB recipe would reach for; the AMD-native squeeze is GGUF. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers, pick a cu12x wheel, or load an fp8_e4m3fn checkpoint "for the tensor cores," it was written for the wrong vendor.

ℹ️ Why GGUF and not the BF16 single-file here. The full BF16 Chroma1-Base checkpoint is ~17.8 GB on disk (HF Files tree), so its resident weights plus the T5 XXL encoder, the FLUX VAE, and 1024×1024 activations exceed the 16 GB budget. Running BF16 is a path for 24 GB cards (e.g. an RX 7900 XTX). On the 16 GB RX 7800 XT, lead with the Q8_0 GGUF (9.74 GB): it is the highest-quality GGUF quant, near-lossless versus BF16, and leaves headroom on the card.
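
The fit argument above can be sketched as simple arithmetic. This is a rough illustration only: it uses the on-disk sizes quoted in this recipe as a stand-in for resident weight size (an assumption), and it ignores the text encoder, VAE, and activations, which also need room. The function name headroom_gb is hypothetical.

```python
# On-disk sizes quoted in this recipe (GB). On-disk size is only a proxy
# for resident VRAM use; that equivalence is an assumption of this sketch.
CHECKPOINT_GB = {"BF16": 17.8, "Q8_0": 9.74, "Q6_K": 7.65}
VRAM_GB = 16.0  # RX 7800 XT


def headroom_gb(checkpoint_gb: float, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after the diffusion weights alone (no encoder, VAE, or activations)."""
    return round(vram_gb - checkpoint_gb, 2)


if __name__ == "__main__":
    for name, size in CHECKPOINT_GB.items():
        left = headroom_gb(size)
        status = "fits" if left > 0 else "does not fit"
        print(f"{name}: {size} GB on disk, {left:+.2f} GB headroom ({status})")
```

A negative headroom for BF16 is the whole case for GGUF on this card; the positive figures for Q8_0 and Q6_K are an upper bound on what is left for everything else.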

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships several current variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card. For Chroma1-HD or Chroma1-Radiance, follow their own respective HF cards — weights and workflows differ.

⚠️ The original lodestones/Chroma repo is deprecated. Its README now opens with "THIS REPO IS DEPRECATED!" and directs users to Chroma1-HD, Chroma1-Base or Chroma1-Flash instead. Use Chroma1-Base for V48. (The deprecated repo is still the canonical host for the shared FLUX ae.safetensors VAE, which the Chroma1-Base card itself links — that one file is fine to pull from there.)

Requirements

Component	Minimum	Tested
GPU	10 GB VRAM (ROCm-supported AMD card) for the Q8_0 GGUF path	RX 7800 XT (16 GB)
RAM	32 GB system	—
Storage	~20 GB (Q8_0 GGUF 9.74 GB + T5 XXL fp16 ~9.8 GB + FLUX VAE ~0.3 GB)	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build) + ComfyUI-GGUF, Python 3.10+	—

Chroma1-Base is released under the Apache 2.0 license per the HF card and the weights are not gated — no access request or login is required to download them. It is built on the FLUX.1-Schnell architecture (8.9B parameters, reduced from 12B by replacing an oversized 3.3B-parameter timestep-encoding layer with a 250M-parameter FFN, per the card's architecture summary) and uses the FLUX-ecosystem T5 XXL text encoder plus the FLUX VAE.
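
As a quick sanity check on the parameter count, the arithmetic implied by the card's summary can be written out. The figures are the rounded values quoted above, so the result is approximate; the constant and function names are hypothetical.

```python
# Rounded figures from the architecture summary quoted above, in billions.
FLUX_SCHNELL_B = 12.0      # nominal size of the starting architecture
REMOVED_LAYER_B = 3.3      # oversized timestep-encoding layer that was removed
REPLACEMENT_FFN_B = 0.25   # 250M-parameter FFN that replaced it


def chroma_params_b() -> float:
    """Approximate Chroma1-Base parameter count in billions."""
    return FLUX_SCHNELL_B - REMOVED_LAYER_B + REPLACEMENT_FFN_B


if __name__ == "__main__":
    print(f"approx. {chroma_params_b():.2f}B parameters")
```

The result lands near 8.95B, in line with the 8.9B headline figure given that all three inputs are rounded.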

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. The README also lists a nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) that may carry performance improvements, and a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) for RDNA3 — on officially-supported Linux you do not need the experimental index; the stable whl/rocm7.2 wheel above is the canonical path.

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Install the ComfyUI-GGUF custom node

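The Q8_0 GGUF loads through the GGUF loader provided by the ComfyUI-GGUF custom node by city96. GGUF runs through ComfyUI's standard PyTorch-on-ROCm path here, with no special quant kernels (RDNA3 has no FP8/FP4 silicon). The commands in steps 4 to 6 use paths relative to the directory that contains the ComfyUI checkout, so run cd .. first if you are still inside it from step 1. Clone the node into ComfyUI/custom_nodes: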

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install -r ComfyUI/custom_nodes/ComfyUI-GGUF/requirements.txt

5. Download the Chroma1-Base (V48) Q8_0 GGUF

On a 16 GB card, pull the Q8_0 GGUF — the highest-quality quant — from the silveroxides/Chroma1-Base-GGUF repo. The file is 9,735,409,824 bytes ≈ 9.74 GB, verified from the HF Files tree. The .gguf goes in ComfyUI/models/diffusion_models/ (the same folder the Chroma1-Base card specifies for the Chroma checkpoint):

wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/silveroxides/Chroma1-Base-GGUF/resolve/main/Chroma1-Base-Q8_0.gguf
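
A truncated download is a common cause of loader errors, so it may be worth comparing the file size against the byte count quoted above. The following is a minimal sketch, assuming the file sits at the path used in this step; the helper name size_matches is hypothetical, and a size match is not a substitute for a checksum.

```python
import os

# Byte count quoted in this recipe for Chroma1-Base-Q8_0.gguf (HF Files tree).
EXPECTED_Q8_0_BYTES = 9_735_409_824


def size_matches(path: str, expected_bytes: int = EXPECTED_Q8_0_BYTES) -> bool:
    """Return True when the file exists and its size equals the expected byte count."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


if __name__ == "__main__":
    gguf = "ComfyUI/models/diffusion_models/Chroma1-Base-Q8_0.gguf"
    if size_matches(gguf):
        print("size OK")
    else:
        print("missing or truncated; re-run the download (wget -c resumes a partial file)")
```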

If you need even more headroom (to colocate a second model or run very large batches), the Q6_K quant is 7,650,971,808 bytes ≈ 7.65 GB — also verified from the HF Files tree:

# Optional smaller-footprint fallback
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/silveroxides/Chroma1-Base-GGUF/resolve/main/Chroma1-Base-Q6_K.gguf

6. Download the T5 XXL text encoder and FLUX VAE

The Chroma1-Base card requires the FLUX-ecosystem T5 XXL encoder and the FLUX VAE. The card links the fp16 T5 variant, and the ComfyUI Chroma example page notes "fp16 is recommended, if you don’t have that much memory fp8" — with the Q8_0 GGUF freeing ~8 GB versus BF16, you have room for the fp16 encoder:

# T5 XXL fp16 text encoder → goes in ComfyUI/models/clip/
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors

# FLUX VAE (ae.safetensors) → goes in ComfyUI/models/vae/
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

Per the card's ComfyUI instructions: place the T5 model in ComfyUI/models/clip, the FLUX VAE in ComfyUI/models/vae, and the Chroma checkpoint (here the GGUF) in ComfyUI/models/diffusion_models. If the fp16 T5 encoder ever feels tight alongside the GGUF, the t5xxl_fp8_e4m3fn variant from the same flux_text_encoders repo loads in roughly half the memory — that fp8 is for the text encoder weights on disk, unrelated to the GPU-FP8-compute path RDNA3 lacks.
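
Before launching, the folder layout described above can be checked with a short script. This is a sketch that assumes the file names from steps 5 and 6 and a ComfyUI checkout in the current directory; missing_files is a hypothetical helper, not part of ComfyUI.

```python
from pathlib import Path

# Relative paths used in steps 5 and 6 of this recipe.
REQUIRED_FILES = [
    "models/diffusion_models/Chroma1-Base-Q8_0.gguf",
    "models/clip/t5xxl_fp16.safetensors",
    "models/vae/ae.safetensors",
]


def missing_files(comfy_root: str) -> list[str]:
    """Return the required files that are absent under the given ComfyUI root."""
    root = Path(comfy_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("ComfyUI")
    print("all model files present" if not missing else f"missing: {missing}")
```

If you swapped in the Q6_K quant or the fp8 text encoder, adjust the file names in REQUIRED_FILES accordingly.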

Running

Launch ComfyUI from the repo root with the SDPA attention flag. Per the ComfyUI README "Running" section, the base command is python main.py; on RDNA3 add the explicit cross-attention flag:

python main.py --use-pytorch-cross-attention

ComfyUI's attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). The --use-pytorch-cross-attention flag — documented in ComfyUI's cli_args.py as "Use the new pytorch 2.0 cross attention function." — forces that path explicitly; it is the most reliable attention path on RDNA3 (the default ROCm attention can crash the VAE Decode step on some setups — see Troubleshooting).

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load the Chroma workflow from the ComfyUI Chroma example page: drag its example image onto the canvas to populate the graph, or build a Unet Loader (GGUF) → Load CLIP (pointed at the T5 encoder) → CLIP Text Encode → KSampler → VAE Decode → Save Image graph manually. This recipe downloads only the T5 XXL encoder, so the graph needs no second CLIP encoder. Then:

Use the Unet Loader (GGUF) node (from ComfyUI-GGUF) and point it at Chroma1-Base-Q8_0.gguf — this replaces the standard Load Diffusion Model node that a BF16 workflow would use.
Point the T5 loader at t5xxl_fp16.safetensors.
Point the VAE loader at ae.safetensors.

Set the latent to 1024×1024 for the first run. The Chroma1-Base diffusers snippet uses num_inference_steps=40 and guidance_scale=3.0 as a starting point; in ComfyUI set the sampler's steps and CFG similarly (20–40 steps work). Generated PNGs land in ComfyUI/output/ with the full workflow embedded.

Results

Speed: Omitted. The backend has no benchmark for this pair yet — /check/chroma-v48/rx-7800-xt currently returns verdict: unknown with no measurements, and no RX-7800-XT-named Chroma1-Base generation-time figure was found in research that could be verified on its source page. Rather than transfer a number from a different GPU, the Speed figure is left out. If you've measured Chroma1-Base it/s or seconds-per-image on a 7800 XT, please contribute it so it lands on /check/chroma-v48/rx-7800-xt.
VRAM usage: Plan for ≥ 10 GB. The Q8_0 GGUF is ~9.74 GB on disk (HF Files tree); the resident weights plus the T5 XXL fp16 encoder, the FLUX VAE, and 1024×1024 activations fit within the 16 GB RX 7800 XT with headroom. The native BF16 path (~17.8 GB on disk) does not fit 16 GB — it is a 24 GB-card tier path. Once a measured number lands, /check/chroma-v48/rx-7800-xt will replace this envelope.
Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Use 20–40 steps; don't expect Schnell-tier 4-step speed. Q8_0 is near-lossless versus BF16; if you drop to Q6_K for more headroom, expect a small, usually-imperceptible quality reduction.

For the full benchmark data, see /check/chroma-v48/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP).
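
The same check can be wrapped in a small script. This is a sketch, not an official diagnostic: is_rocm_build is a hypothetical helper that only inspects the version string, and the script falls back gracefully when torch is not installed.

```python
def is_rocm_build(version: str) -> bool:
    """True when a torch version string carries a '+rocm' local tag, e.g. '2.x.y+rocm7.2'."""
    _, _, local = version.partition("+")
    return local.startswith("rocm")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("version:", torch.__version__)
        print("ROCm build:", is_rocm_build(torch.__version__))
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        print("HIP runtime:", torch.version.hip)
        print("GPU visible:", torch.cuda.is_available())
```

A CUDA wheel typically reports a +cu suffix and a HIP runtime of None, which is the situation that produces the error above on an AMD card.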

ComfyUI crashes at "VAE Decode" on RDNA3

On RDNA3 the default ComfyUI attention path can crash at the VAE Decode stage. The confirmed fix is to force the PyTorch SDPA path with --use-pytorch-cross-attention (also the best-performing attention path on this card):

python main.py --use-pytorch-cross-attention

If repeated generations still destabilize the run, add --disable-smart-memory (documented in cli_args.py as "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can.") and/or --disable-pinned-memory ("Disable pinned memory use."), which stabilize repeated runs on ROCm:

python main.py --use-pytorch-cross-attention --disable-smart-memory --disable-pinned-memory

Do not reach for --use-split-cross-attention to fix this — it is not the working fix on RDNA3.
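
If you switch between the plain and the stabilized launch often, the two command lines above can be assembled by a small helper. This is a convenience sketch only: comfy_command is a hypothetical name, it uses just the flags already listed in this section, and it prints the command instead of running it.

```python
import sys


def comfy_command(stabilize: bool = False) -> list[str]:
    """Build the ComfyUI launch command; 'stabilize' adds the two memory flags."""
    cmd = [sys.executable, "main.py", "--use-pytorch-cross-attention"]
    if stabilize:
        cmd += ["--disable-smart-memory", "--disable-pinned-memory"]
    return cmd


if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to run anywhere.
    print(" ".join(comfy_command(stabilize=True)))
```

To actually launch, pass the list to subprocess.run with the working directory set to the ComfyUI repo root.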

gfx1101 kernel-mismatch errors

The RX 7800 XT is gfx1101 (Navi 32), not gfx1100 (the 7900 XTX) and not gfx1102 (the RX 7600). It is officially ROCm-supported, so the stable wheel above works directly. If a library ships only gfx1100 kernels and refuses to launch, the legacy fallback is to masquerade as gfx1100 with HSA_OVERRIDE_GFX_VERSION=11.0.0 (Linux-only) before launching:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python main.py --use-pytorch-cross-attention

This is a legacy workaround only — on current ROCm 7.x with the stable PyTorch wheel, gfx1101 is recognized natively and the override should not be needed.
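
To keep the override from leaking into your shell profile, it can be scoped to a single launch. The sketch below builds an environment for one child process; launch_env is a hypothetical helper, and the override value is the one given above.

```python
import os


def launch_env(force_gfx1100: bool = False) -> dict[str, str]:
    """Copy the current environment, adding the gfx1100 override only when asked."""
    env = dict(os.environ)
    if force_gfx1100:
        env["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
    return env


if __name__ == "__main__":
    # Report whether the override is already set globally, which is worth knowing
    # on ROCm 7.x, where it should normally be absent.
    print("override currently set to:", os.environ.get("HSA_OVERRIDE_GFX_VERSION"))
```

Pass the result as env=launch_env(True) to subprocess.run when starting ComfyUI, and drop the argument once the library in question ships gfx1101 kernels.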

Don't load an FP8 checkpoint expecting a speed-up

An NVIDIA Ada/Blackwell Chroma recipe leads with an fp8_e4m3fn file because those cards have FP8 tensor cores. RDNA3 does not (its WMMA units do FP16/BF16/INT8/INT4 only), so an FP8 safetensors loaded on the 7800 XT upcasts to BF16 — it saves no memory and gains no speed, and at 16 GB the BF16-equivalent footprint will not fit. Use the Q8_0 GGUF (step 5), which is the actual AMD-native way to shrink the model.

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited (no FP32, head-dim ≤ 256) and upstream FlashAttention's compiled kernels target CDNA/MI accelerators, not consumer gfx1101. ComfyUI already routes attention through PyTorch SDPA on this stack — stick with the default, or force it explicitly with --use-pytorch-cross-attention.

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.