
Flux.2 Klein 4B on RX 7900 XTX: ComfyUI on ROCm, Full BF16 4-Step Text-to-Image

image · intermediate · 13GB+ VRAM · Jun 17, 2026

This intermediate recipe sets up Flux.2-Klein-4B, which needs about 13 GB of VRAM, on the RX 7900 XTX.

Prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.10+
  • ~16 GB free disk for the BF16 weights (7.75 GB transformer + 8.05 GB Qwen3-4B text encoder + 0.34 GB VAE)
  • ComfyUI installed (git clone) with PyTorch built for ROCm
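
The disk-space prerequisite can be checked before downloading anything. The following is a minimal sketch, assuming a POSIX shell with df and awk; the helper names free_gb and has_free_gb are made up for this example, and the figure is computed in binary gigabytes (GiB), which is slightly stricter than the decimal ~16 GB quoted above.

```shell
# Print free space, in whole GiB, on the filesystem that holds directory $1.
# (Hypothetical helper; not part of ComfyUI or ROCm.)
free_gb() {
  # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed if directory $1 has at least $2 GiB free.
has_free_gb() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Example: warn if the current directory has less than 16 GiB free.
if ! has_free_gb . 16; then
  echo "warning: less than 16 GiB free in $(pwd)" >&2
fi
```

Run it from the directory where ComfyUI will live, since the model files end up under its models/ folder.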

What You'll Build

A local Flux.2 Klein 4B text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — generating 1024×1024 images from text prompts with the Apache-2.0, 4-billion-parameter, step-distilled member of Black Forest Labs' Flux.2 family. The Klein 4B model card states the model "fits in ~13GB VRAM"; with 24 GB on the 7900 XTX you keep the full BF16 transformer, the Qwen3-4B text encoder, and the Flux.2 VAE all resident on the GPU with ample headroom — no quantization, no CPU offload required.

Hardware data: RX 7900 XTX (24GB VRAM) · full BF16 · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so the FP8 single-file that the NVIDIA recipes use for this model would just upcast to BF16 on this card — no memory saving, no speed-up — and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to download a *-fp8.safetensors, pip install xformers, or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component | Minimum | Tested
GPU | 13 GB VRAM (per the BFL card) | RX 7900 XTX (24 GB)
RAM | 16 GB system |
Storage | ~16 GB (BF16 transformer + Qwen3-4B encoder + VAE) | per HF tree API
Driver | AMD ROCm 7.2.x on Linux |
Software | ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+ |

The model is released under the Apache 2.0 license (per the model card; the 4B Klein weights are open under Apache 2.0, distinct from the non-commercial 9B Klein variant) and the weights are not gated on Hugging Face — no access request or login is required. Klein 4B is a latent diffusion model whose text encoder is Qwen3-4B (not the T5 family used by Flux.1, and not the Mistral3 family used by Flux.2 dev), confirmed by the repo's model_index.json ("text_encoder": ["transformers", "Qwen3ForCausalLM"]). The BF16 file set is ~16 GB on disk per the HF tree API: the consolidated transformer flux-2-klein-4b.safetensors is 7,751,105,712 bytes (~7.75 GB), the Qwen3-4B text encoder is ~8.05 GB across two shards, and the VAE is ~0.17 GB.
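
The byte-to-gigabyte conversion quoted above uses decimal gigabytes (1 GB = 10^9 bytes). As a small sketch, assuming awk is available, the helper below (bytes_to_gb is a name invented for this example) reproduces the ~7.75 GB figure from the byte count:

```shell
# Convert a byte count to decimal gigabytes (1 GB = 10^9 bytes), two decimals.
# (Hypothetical helper for illustration.)
bytes_to_gb() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1e9 }'
}

bytes_to_gb 7751105712   # transformer byte count from the text; prints 7.75

# After downloading, the same helper can be pointed at a real file:
#   bytes_to_gb "$(wc -c < models/diffusion_models/flux-2-klein-4b.safetensors)"
```

Comparing the printed value with the size quoted here is a quick way to spot a truncated download.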

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.
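
Because the ROCm tag changes over time, it can help to keep it in one variable rather than pasting a full URL. This is only a sketch: rocm_index_url is a hypothetical helper, the ROCM_TAG value is an assumption to be replaced with whatever the live README shows, and the script prints the pip command instead of running it.

```shell
# Build the PyTorch wheel index URL for a ROCm tag such as rocm7.2.
# Pass "nightly" as the second argument for the nightly index.
# (Hypothetical helper; the URL patterns are the two quoted in this recipe.)
rocm_index_url() {
  case "${2:-}" in
    nightly) echo "https://download.pytorch.org/whl/nightly/$1" ;;
    *)       echo "https://download.pytorch.org/whl/$1" ;;
  esac
}

ROCM_TAG="rocm7.2"   # assumption: replace with the tag in the current README

# Print the install command for review rather than executing it.
echo "pip install torch torchvision torchaudio --index-url $(rocm_index_url "$ROCM_TAG")"
```

Printing first and copying the line afterwards gives a chance to compare the tag with the README before anything is installed.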

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the Flux.2 Klein 4B BF16 files

Place each file in the ComfyUI folder shown. These are the full BF16 weights — the path that fits the 7900 XTX's 24 GB with room to spare (the FP8 single-file the NVIDIA tutorial leads with gives no benefit on RDNA3, which has no FP8 hardware). Sizes are verified from the HF tree API.

# Diffusion model (transformer) — ~7.75 GB BF16
wget -P models/diffusion_models/ \
  https://huggingface.co/black-forest-labs/FLUX.2-klein-4B/resolve/main/flux-2-klein-4b.safetensors

# Qwen3-4B text encoder — ~8.05 GB
wget -P models/text_encoders/ \
  https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

# Flux.2 family VAE — ~0.34 GB
wget -P models/vae/ \
  https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B; the Comfy-Org repackaged qwen_3_4b.safetensors above is the full-precision encoder (it is also used by the official ComfyUI workflow). Make sure ComfyUI is recent enough to carry the Klein nodes (git pull && pip install -r requirements.txt if you cloned a while ago).
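
A quick way to confirm that all three files landed in the folders used by the wget commands above is sketched below. The function name check_klein_files is invented for this example; the relative paths are the ones from this recipe, and the check only tests that each file exists, not that its contents are intact.

```shell
# Report any of the three Klein files missing under a ComfyUI root ($1).
# Returns non-zero if at least one file is absent.
# (Hypothetical helper; paths match the download commands in this recipe.)
check_klein_files() {
  missing=0
  for rel in \
    models/diffusion_models/flux-2-klein-4b.safetensors \
    models/text_encoders/qwen_3_4b.safetensors \
    models/vae/flux2-vae.safetensors
  do
    if [ ! -f "$1/$rel" ]; then
      echo "missing: $rel" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (run from the ComfyUI repo root):
#   check_klein_files . && echo "all model files present"
```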

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load one of the official Flux.2 Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) from the docs.comfy.org Klein tutorial; wire the three files above into the Load Diffusion Model / Load CLIP (Qwen3) / Load VAE nodes. For the distilled 4B variant — which is what flux-2-klein-4b.safetensors is (is_distilled: true in model_index.json) — use 4 steps at CFG 1.0; the base (undistilled) variant uses 25–50 steps at CFG 5.0 instead. Generate at 1024×1024 and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, force the PyTorch-2.0 cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

At 24 GB you should not need a CPU-offload path: keep the full BF16 transformer, Qwen3-4B encoder, and VAE resident. Do not pass --lowvram on a 7900 XTX — per the README it forces the text encoders onto the CPU, which only slows you down when you have memory to spare.

Results

  • Speed: No generation-time benchmark for Flux.2 Klein 4B that names the RX 7900 XTX and could be verified on its source page was found, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rx-7900-xtx returns verdict: unknown). The only published GPU-named figures for this model are for the NVIDIA RTX 5090 (the docs.comfy.org Klein tutorial lists distilled ~1.2s / base ~17s at FP8). Those do not transfer to the 7900 XTX: the vendor and architecture differ, and the 5090 figures use FP8 weights where this recipe runs BF16. Rather than borrow a number from another GPU or invent one, this recipe omits the Speed figure. If you've measured Klein 4B generation time on a 7900 XTX, please contribute it so it lands on /check/flux-2-klein-4b/rx-7900-xtx.
  • VRAM usage: Klein 4B "fits in ~13GB VRAM" per BFL's official model card, which on the 24 GB 7900 XTX leaves roughly 11 GB of headroom. Even if the full BF16 transformer (~7.75 GB on disk), the Qwen3-4B encoder (~8.05 GB), and the VAE all stay resident at once, the weights total about 16 GB, which still leaves roughly 8 GB for activations, higher batch sizes, or image-editing workflows. See /check/flux-2-klein-4b/rx-7900-xtx for any community-submitted measurement.
  • Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4 billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the larger Flux.2 base). The 4 distilled steps make iteration fast. There is no quantization tradeoff to weigh on this card: run the native BF16 weights.
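
The headroom arithmetic in the VRAM bullet can be reproduced with a few lines of shell. This is a rough sketch: headroom_gb is a name made up for this example, and using on-disk file sizes as a stand-in for resident VRAM is an assumption, since real usage also includes activations and framework overhead.

```shell
# Headroom = first argument (card VRAM, GB) minus every following argument (GB).
# (Hypothetical helper for illustration.)
headroom_gb() {
  awk 'BEGIN {
    total = ARGV[1]
    for (i = 2; i < ARGC; i++) total -= ARGV[i]
    printf "%.2f\n", total
  }' "$@"
}

# Against the model card's ~13 GB envelope:
headroom_gb 24 13               # prints 11.00

# Against the on-disk sizes from the prerequisites (transformer, encoder, VAE):
headroom_gb 24 7.75 8.05 0.34   # prints 7.86
```

Either way the result stays well above zero, which is the point the bullet makes about not needing quantization or offload on this card.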

For the full benchmark data, see /check/flux-2-klein-4b/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm masquerades as the cuda device namespace under HIP).
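
The suffix check can be scripted so that it fails loudly when a CUDA wheel slipped in. The sketch below assumes the version string has the usual +rocmX.Y or +cuXYZ local-version suffix; is_rocm_build is a hypothetical helper, and the version strings used to exercise it are illustrative, not taken from a real install.

```shell
# Succeed if a torch version string carries a +rocm local-version suffix.
# (Hypothetical helper; operates on the string only, does not import torch.)
is_rocm_build() {
  case "$1" in
    *+rocm*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Example (requires PyTorch to be installed):
#   if is_rocm_build "$(python -c 'import torch; print(torch.__version__)')"; then
#     echo "ROCm build detected"
#   else
#     echo "not a ROCm build: reinstall from the ROCm wheel index" >&2
#   fi
```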

Black image or driver timeout at VAE decode

On RDNA3 + ROCm the VAE-decode stage is the most commonly-reported failure point for diffusion workflows — users report a black/garbage image or a driver timeout, especially when decoding above 1024 px. This is tracked upstream in ComfyUI Issue #9547 ("VAE decoding on AMD works first time only before switching to tiled or black image") and ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"). It is a VAE-precision interaction, not a VRAM shortage (the reporter on #4729 notes it fails "well under VRAM limit"). Two things to try if you hit it:

  • Generate at 1024×1024 first to confirm the rest of the pipeline is healthy — #4729 reports 1024 px decode works while 1536 px times out.
  • Try running the VAE in bf16 via ComfyUI's --bf16-vae flag (per cli_args.py, documented as "Run the VAE in bf16."); some RDNA3 users report it helps, others get cleaner results forcing fp32 — try both and keep whichever produces a correct image on your driver/ROCm version. Track the upstream issues above for a permanent fix.
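
To compare the two VAE precisions side by side, the launch command can be parameterised. This is a sketch under one assumption worth confirming with python main.py --help: the text above names only --bf16-vae, and the fp32 counterpart is taken here to be --fp32-vae. The helper comfy_vae_cmd is invented for this example and prints the command rather than starting the server.

```shell
# Print the ComfyUI launch command for a given VAE precision.
# (Hypothetical helper; --fp32-vae is assumed, verify with: python main.py --help)
comfy_vae_cmd() {
  case "$1" in
    bf16) echo "python main.py --bf16-vae" ;;
    fp32) echo "python main.py --fp32-vae" ;;
    *)    echo "usage: comfy_vae_cmd bf16|fp32" >&2; return 2 ;;
  esac
}

comfy_vae_cmd bf16   # prints: python main.py --bf16-vae
comfy_vae_cmd fp32   # prints: python main.py --fp32-vae
```

Run each printed command in turn, generate the same 1024×1024 prompt, and keep whichever precision decodes correctly on your driver and ROCm version.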

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.

Distorted colors or washed-out output

You're loading the wrong VAE. Klein must use flux2-vae.safetensors (the Flux.2 family VAE per model_index.json) — loading a Flux.1, SDXL, or SD1.5 VAE will produce broken output. Likewise the text encoder must be Qwen3-4B (qwen_3_4b.safetensors), not a Flux.1 T5 file and not a Flux.2-dev Mistral3 encoder.

Note on FLUX.2-dev OOM reports (Issue #11)

flux2 Issue #11 ("3090 24G: cuda out of memory") is sometimes cited for Flux.2 memory trouble, but it reproduces on FLUX.2-dev with the Mistral3 text encoder under a CPU-offload path — a different model and a different encoder than this recipe's Klein 4B / Qwen3-4B, on NVIDIA hardware. The encoder-specific advice in that thread does not apply here. The model-class-independent part (VAE-decode memory pressure) is covered by the ROCm-specific VAE-decode section above, which is the relevant failure mode on this card; on the 7900 XTX you run Klein 4B fully resident at 24 GB, so the offload path that triggers Issue #11 isn't part of this recipe at all.

Common Questions
How much VRAM does Flux.2-Klein-4B need?

About 13 GB — the minimum this recipe targets.

Which GPUs is Flux.2-Klein-4B tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Intermediate — follow the steps above.