What You'll Build
A local Flux.2 Klein 4B text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — generating 1024×1024 images from text prompts with the Apache-2.0, 4-billion-parameter, step-distilled member of Black Forest Labs' Flux.2 family. The Klein 4B model card states the model "fits in ~13GB VRAM"; with 24 GB on the 7900 XTX you keep the full BF16 transformer, the Qwen3-4B text encoder, and the Flux.2 VAE all resident on the GPU with ample headroom — no quantization, no CPU offload required.
Hardware data: RX 7900 XTX (24GB VRAM) · full BF16 · ComfyUI on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so the FP8 single-file that the NVIDIA recipes use for this model would just upcast to BF16 on this card, with no memory saving and no speed-up, and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to download a *-fp8.safetensors, pip install xformers, or pick a cu12x wheel for this card, it's written for the wrong vendor.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 13 GB VRAM (per the BFL card) | RX 7900 XTX (24 GB) |
| RAM | 16 GB system | — |
| Storage | ~16 GB (BF16 transformer + Qwen3-4B encoder + VAE; per HF tree API) | — |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+ | — |
The model is released under the Apache 2.0 license per the model card (unlike the non-commercial 9B Klein variant), and the weights are not gated on Hugging Face: no access request or login is required. Klein 4B is a latent diffusion model whose text encoder is Qwen3-4B (not the T5 family used by Flux.1, and not the Mistral3 family used by Flux.2 dev), confirmed by the repo's model_index.json ("text_encoder": ["transformers", "Qwen3ForCausalLM"]). The BF16 file set is ~16 GB on disk per the HF tree API: the consolidated transformer flux-2-klein-4b.safetensors is 7,751,105,712 bytes (~7.75 GB), the Qwen3-4B text encoder is ~8.05 GB across two shards, and the VAE is ~0.17 GB.
Installation
1. Install ComfyUI
Per the ComfyUI README, clone the repo:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
2. Install PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for RDNA3 support; on officially-supported Linux you do not need it, and the stable whl/rocm7.2 wheel above is the canonical path.
3. Install ComfyUI dependencies
Per the ComfyUI README "Dependencies" section:
pip install -r requirements.txt
4. Download the Flux.2 Klein 4B BF16 files
Place each file in the ComfyUI folder shown. These are the full BF16 weights — the path that fits the 7900 XTX's 24 GB with room to spare (the FP8 single-file the NVIDIA tutorial leads with gives no benefit on RDNA3, which has no FP8 hardware). Sizes are verified from the HF tree API.
# Diffusion model (transformer) — ~7.75 GB BF16
wget -P models/diffusion_models/ \
https://huggingface.co/black-forest-labs/FLUX.2-klein-4B/resolve/main/flux-2-klein-4b.safetensors
# Qwen3-4B text encoder — ~8.05 GB
wget -P models/text_encoders/ \
https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors
# Flux.2 family VAE — ~0.34 GB
wget -P models/vae/ \
https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors
The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B; the Comfy-Org repackaged qwen_3_4b.safetensors above is the full-precision encoder (it is also used by the official ComfyUI workflow). Make sure ComfyUI is recent enough to carry the Klein nodes (git pull && pip install -r requirements.txt if you cloned a while ago).
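A truncated download or a saved HTML error page is an easy mistake to miss with three multi-gigabyte files. The following is a minimal sketch of a sanity check, not part of ComfyUI: the helper name `check_model_files` is hypothetical, the transformer threshold is the byte count quoted in this recipe, and the other two thresholds are loose lower bounds chosen as assumptions purely to catch obviously broken files.

```python
from pathlib import Path

# Minimum plausible sizes in bytes, keyed by path relative to the ComfyUI root.
# The transformer figure is the byte count quoted in this recipe; the encoder
# and VAE figures are loose lower bounds (assumptions), not exact sizes.
EXPECTED = {
    "models/diffusion_models/flux-2-klein-4b.safetensors": 7_751_105_712,
    "models/text_encoders/qwen_3_4b.safetensors": 7_000_000_000,
    "models/vae/flux2-vae.safetensors": 100_000_000,
}


def check_model_files(comfy_root, expected=EXPECTED):
    """Return a list of problems; an empty list means every file looks plausible."""
    problems = []
    root = Path(comfy_root)
    for rel_path, min_bytes in expected.items():
        path = root / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        size = path.stat().st_size
        if size < min_bytes:
            problems.append(f"too small: {rel_path} ({size} < {min_bytes} bytes)")
    return problems
```

Run from the directory that contains the ComfyUI checkout, `check_model_files("ComfyUI")` should return an empty list once all three downloads have finished.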
Running
Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:
python main.py
This starts the server (default http://127.0.0.1:8188). Open it in a browser and load one of the official Flux.2 Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) from the docs.comfy.org Klein tutorial; wire the three files above into the Load Diffusion Model / Load CLIP (Qwen3) / Load VAE nodes. For the distilled 4B variant — which is what flux-2-klein-4b.safetensors is (is_distilled: true in model_index.json) — use 4 steps at CFG 1.0; the base (undistilled) variant uses 25–50 steps at CFG 5.0 instead. Generate at 1024×1024 and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.
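If you would rather queue generations from a script than from the browser, the running server also accepts workflows over HTTP. The sketch below assumes the `/prompt` endpoint used by the example scripts bundled with ComfyUI, and it assumes the workflow file was exported in API format (the regular UI workflow JSON has a different shape and will not work here). The function names and the workflow path are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow, server="http://127.0.0.1:8188"):
    """Build (but do not send) a POST request that queues one workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(workflow_path, server="http://127.0.0.1:8188"):
    """Load an API-format workflow JSON from disk and queue it on a running server."""
    with open(workflow_path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_queue_request(workflow, server)) as resp:
        return json.loads(resp.read())
```

With the server from `python main.py` running, a call such as `queue_workflow("klein_4b_t2i_api.json")` (a hypothetical file name) queues the job; the resulting PNG is written to the same output folder as a browser-queued run.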
ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, force the PyTorch-2.0 cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":
python main.py --use-pytorch-cross-attention
At 24 GB you should not need a CPU-offload path: keep the full BF16 transformer, Qwen3-4B encoder, and VAE resident. Do not pass --lowvram on a 7900 XTX — per the README it forces the text encoders onto the CPU, which only slows you down when you have memory to spare.
Results
- Speed: No generation-time benchmark for Flux.2 Klein 4B that names the RX 7900 XTX and could be verified on its source page was found, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rx-7900-xtx returns verdict: unknown). The only published GPU-named figures for this model are on the NVIDIA RTX 5090 (the docs.comfy.org Klein tutorial lists distilled ~1.2s / base ~17s at FP8), and those do not transfer to the 7900 XTX (different vendor, different architecture, FP8-resident vs BF16). Rather than transfer a number from another GPU or invent one, the Speed figure is omitted. If you've measured Klein 4B generation time on a 7900 XTX, please contribute it so it lands on /check/flux-2-klein-4b/rx-7900-xtx.
- VRAM usage: Klein 4B "fits in ~13GB VRAM" per BFL's official model card, which would leave roughly 11 GB free on the 24 GB 7900 XTX. Keeping the full BF16 transformer (~7.75 GB on disk), the Qwen3-4B encoder (~8.05 GB), and the VAE all resident puts the weights alone at roughly 16 GB, so plan on about 8 GB of headroom in that configuration. That is still comfortably within 24 GB, with room for higher batch sizes or image-editing workflows. See /check/flux-2-klein-4b/rx-7900-xtx for any community-submitted measurement.
- Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4 billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the larger Flux.2 base). The 4 distilled steps make iteration fast. There is no quantization tradeoff to weigh on this card: run the native BF16 weights.
For the full benchmark data, see /check/flux-2-klein-4b/rx-7900-xtx.
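The headroom figures are simple subtraction, and it helps to see both budgets side by side. This is a rough sketch under two stated assumptions: on-disk size is used as a proxy for resident size (activations and runtime overhead are ignored), and the card's 24 GB is compared directly against decimal gigabytes. The file sizes are the ones listed next to the download commands in this recipe.

```python
VRAM_GB = 24.0          # RX 7900 XTX
BFL_ENVELOPE_GB = 13.0  # "fits in ~13GB VRAM" per the model card

# Sizes of the three downloaded files, in GB, as quoted in the download step.
WEIGHTS_GB = {"transformer": 7.75, "text_encoder": 8.05, "vae": 0.34}


def headroom(total_gb, used_gb):
    """Free memory left after subtracting a usage figure, rounded to 2 decimals."""
    return round(total_gb - used_gb, 2)


resident_gb = round(sum(WEIGHTS_GB.values()), 2)

print(f"all weights resident: {resident_gb} GB")
print(f"headroom vs model-card envelope: {headroom(VRAM_GB, BFL_ENVELOPE_GB)} GB")
print(f"headroom vs fully resident weights: {headroom(VRAM_GB, resident_gb)} GB")
```

The gap between the two headroom numbers is the reason to treat the model-card figure as a minimum rather than as the footprint of a fully resident BF16 setup.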
Troubleshooting
"Torch not compiled with CUDA enabled"
This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds expose the GPU through PyTorch's cuda device namespace via HIP).
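The same check can be scripted so it reports which of the two failure modes you are in. This is a small sketch, not an official diagnostic: `describe_torch_build` and `check_installed_torch` are hypothetical helper names, and the sketch relies on `torch.version.hip` being a version string on ROCm builds and `None` on CUDA or CPU builds.

```python
def describe_torch_build(version, hip_version, cuda_available):
    """Classify a PyTorch install from three values read off the torch module."""
    if hip_version is None:
        return "not a ROCm build: reinstall from the ROCm wheel index"
    if not cuda_available:
        return "ROCm build, but no GPU visible to PyTorch"
    return f"ROCm build {version} (HIP {hip_version}), GPU visible"


def check_installed_torch():
    """Read the three values from the installed torch and classify them."""
    import torch  # imported lazily so the classifier above works without torch

    return describe_torch_build(
        torch.__version__, torch.version.hip, torch.cuda.is_available()
    )
```

Calling `print(check_installed_torch())` inside the ComfyUI environment gives a one-line answer; the first message corresponds to the "Torch not compiled with CUDA enabled" case described in this section.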
Black image or driver timeout at VAE decode
On RDNA3 + ROCm the VAE-decode stage is the most commonly-reported failure point for diffusion workflows — users report a black/garbage image or a driver timeout, especially when decoding above 1024 px. This is tracked upstream in ComfyUI Issue #9547 ("VAE decoding on AMD works first time only before switching to tiled or black image") and ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"). It is a VAE-precision interaction, not a VRAM shortage (the reporter on #4729 notes it fails "well under VRAM limit"). Two things to try if you hit it:
- Generate at 1024×1024 first to confirm the rest of the pipeline is healthy — #4729 reports 1024 px decode works while 1536 px times out.
- Try running the VAE in bf16 via ComfyUI's --bf16-vae flag (per cli_args.py, documented as "Run the VAE in bf16."); some RDNA3 users report it helps, others get cleaner results forcing fp32. Try both and keep whichever produces a correct image on your driver/ROCm version. Track the upstream issues above for a permanent fix.
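Comparing the VAE precisions means relaunching ComfyUI with different flags, which is easy to get wrong by hand. The sketch below only assembles the command line; `launch_command` is a hypothetical helper, and it assumes your checkout's cli_args.py still defines `--fp32-vae` alongside the `--bf16-vae` and `--use-pytorch-cross-attention` flags mentioned in this recipe.

```python
import sys

# VAE precision choices mapped to ComfyUI launch flags.
# "--fp32-vae" is assumed to exist in your checkout's cli_args.py; verify it.
VAE_FLAGS = {
    "default": [],
    "bf16": ["--bf16-vae"],
    "fp32": ["--fp32-vae"],
}


def launch_command(vae="default", force_sdpa=False):
    """Return the argument list for launching ComfyUI with the chosen options."""
    if vae not in VAE_FLAGS:
        raise ValueError(f"unknown VAE mode: {vae!r}")
    cmd = [sys.executable, "main.py"]
    if force_sdpa:
        cmd.append("--use-pytorch-cross-attention")
    cmd.extend(VAE_FLAGS[vae])
    return cmd
```

Pass the result to `subprocess.run(launch_command("bf16"), cwd="ComfyUI")`, generate one 1024×1024 image, then repeat with `"fp32"` and keep whichever mode decodes correctly.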
Do not install xformers or FlashAttention
HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.
"Distorted colors / washed-out output"
You're loading the wrong VAE. Klein must use flux2-vae.safetensors (the Flux.2 family VAE per model_index.json) — loading a Flux.1, SDXL, or SD1.5 VAE will produce broken output. Likewise the text encoder must be Qwen3-4B (qwen_3_4b.safetensors), not a Flux.1 T5 file and not a Flux.2-dev Mistral3 encoder.
Note on FLUX.2-dev OOM reports (Issue #11)
flux2 Issue #11 ("3090 24G: cuda out of memory") is sometimes cited for Flux.2 memory trouble, but it reproduces on FLUX.2-dev with the Mistral3 text encoder under a CPU-offload path — a different model and a different encoder than this recipe's Klein 4B / Qwen3-4B, on NVIDIA hardware. The encoder-specific advice in that thread does not apply here. The model-class-independent part (VAE-decode memory pressure) is covered by the ROCm-specific VAE-decode section above, which is the relevant failure mode on this card; on the 7900 XTX you run Klein 4B fully resident at 24 GB, so the offload path that triggers Issue #11 isn't part of this recipe at all.