How much VRAM does Flux.2 Klein 4B need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

Flux.2 Klein 4B on RX 7900 XTX: ComfyUI on ROCm, Full BF16 4-Step Text-to-Image

What You'll Build

A local Flux.2 Klein 4B text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — generating 1024×1024 images from text prompts with the Apache-2.0, 4-billion-parameter, step-distilled member of Black Forest Labs' Flux.2 family. The Klein 4B model card states the model "fits in ~13GB VRAM"; with 24 GB on the 7900 XTX you keep the full BF16 transformer, the Qwen3-4B text encoder, and the Flux.2 VAE all resident on the GPU with ample headroom — no quantization, no CPU offload required.

Hardware data: RX 7900 XTX (24GB VRAM) · full BF16 · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so the FP8 single-file that the NVIDIA recipes use for this model would just upcast to BF16 on this card — no memory saving, no speed-up — and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to download a *-fp8.safetensors, pip install xformers, or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	13 GB VRAM (per the BFL card)	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	~16 GB (BF16 transformer + Qwen3-4B encoder + VAE)	per HF tree API
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

The model is released under the Apache 2.0 license per the model card (unlike the non-commercial 9B Klein variant), and the weights are not gated on Hugging Face: no access request or login is required. Klein 4B is a latent diffusion model whose text encoder is Qwen3-4B (not the T5 family used by Flux.1, and not the Mistral3 family used by Flux.2 dev), as confirmed by the repo's model_index.json ("text_encoder": ["transformers", "Qwen3ForCausalLM"]). The BF16 file set is ~16 GB on disk per the HF tree API: the consolidated transformer flux-2-klein-4b.safetensors is 7,751,105,712 bytes (~7.75 GB), the Qwen3-4B text encoder is ~8.05 GB across two shards, and the VAE is ~0.17 GB.
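The size figures above can be sanity-checked with a little shell arithmetic. This is only a sketch: the helper names bytes_to_gb and sum_gb are made up for illustration, and "GB" is assumed to mean decimal gigabytes (10^9 bytes), which is what the ~7.75 GB figure implies.

```shell
# Convert a byte count to decimal GB (1 GB = 10^9 bytes), two decimals.
bytes_to_gb() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1e9 }'
}

# Sum any number of GB figures, two decimals.
sum_gb() {
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

bytes_to_gb 7751105712     # transformer byte count quoted above
sum_gb 7.75 8.05 0.17      # transformer + text encoder + VAE
```

The two calls print 7.75 and 15.97, which matches the "~16 GB on disk" total.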

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.
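Because the tag moves, it can help to keep it in one variable and build the install command from it. The sketch below only prints the command rather than running pip; ROCM_TAG and torch_index_url are illustrative names, and rocm7.2 is just the value quoted above, so substitute whatever the live README shows.

```shell
# Set ROCM_TAG to the tag the current ComfyUI README pins; rocm7.2 is the value quoted above.
ROCM_TAG="${ROCM_TAG:-rocm7.2}"

# Build the stable PyTorch wheel index URL for a given ROCm tag.
torch_index_url() {
  printf 'https://download.pytorch.org/whl/%s\n' "$1"
}

# Print (do not execute) the install command so it can be reviewed first.
echo "pip install torch torchvision torchaudio --index-url $(torch_index_url "$ROCM_TAG")"
```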

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the Flux.2 Klein 4B BF16 files

Place each file in the ComfyUI folder shown. These are the full BF16 weights — the path that fits the 7900 XTX's 24 GB with room to spare (the FP8 single-file the NVIDIA tutorial leads with gives no benefit on RDNA3, which has no FP8 hardware). Sizes are verified from the HF tree API.

# Diffusion model (transformer) — ~7.75 GB BF16
wget -P models/diffusion_models/ \
  https://huggingface.co/black-forest-labs/FLUX.2-klein-4B/resolve/main/flux-2-klein-4b.safetensors

# Qwen3-4B text encoder — ~8.05 GB
wget -P models/text_encoders/ \
  https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

# Flux.2 family VAE — ~0.34 GB
wget -P models/vae/ \
  https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B; the Comfy-Org repackaged qwen_3_4b.safetensors above is the full-precision encoder (it is also used by the official ComfyUI workflow). Make sure ComfyUI is recent enough to carry the Klein nodes (git pull && pip install -r requirements.txt if you cloned a while ago).
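A truncated download is an easy mistake to miss, so one option is to check each file's size before launching. The helper below is a hypothetical sketch (check_min_size is not part of ComfyUI). Only the transformer byte count comes from this recipe; the other two thresholds in the usage comments are rough floors assumed for illustration.

```shell
# Return 0 if FILE exists and is at least MIN_BYTES long; print a diagnostic otherwise.
check_min_size() {
  file="$1"
  min_bytes="$2"
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  # wc -c reports the size in bytes; tr strips padding some platforms add.
  actual=$(wc -c < "$file" | tr -d ' ')
  if [ "$actual" -lt "$min_bytes" ]; then
    echo "too small: $file ($actual bytes, expected at least $min_bytes)" >&2
    return 1
  fi
  echo "ok: $file ($actual bytes)"
}

# Run from the ComfyUI repo root after the downloads finish, for example:
#   check_min_size models/diffusion_models/flux-2-klein-4b.safetensors 7751105712
#   check_min_size models/text_encoders/qwen_3_4b.safetensors 8000000000   # assumed floor
#   check_min_size models/vae/flux2-vae.safetensors 100000000              # assumed floor
```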

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load one of the official Flux.2 Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) from the docs.comfy.org Klein tutorial; wire the three files above into the Load Diffusion Model / Load CLIP (Qwen3) / Load VAE nodes. For the distilled 4B variant — which is what flux-2-klein-4b.safetensors is (is_distilled: true in model_index.json) — use 4 steps at CFG 1.0; the base (undistilled) variant uses 25–50 steps at CFG 5.0 instead. Generate at 1024×1024 and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, force the PyTorch-2.0 cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

At 24 GB you should not need a CPU-offload path: keep the full BF16 transformer, Qwen3-4B encoder, and VAE resident. Do not pass --lowvram on a 7900 XTX — per the README it forces the text encoders onto the CPU, which only slows you down when you have memory to spare.

Results

Speed: No generation-time benchmark for Flux.2 Klein 4B that names the RX 7900 XTX could be verified at its source, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rx-7900-xtx returns verdict: unknown). The only published GPU-named figures for this model are for the NVIDIA RTX 5090 (the docs.comfy.org Klein tutorial lists distilled ~1.2s / base ~17s at FP8), and those do not transfer to the 7900 XTX (different vendor, different architecture, FP8-resident vs BF16). Rather than borrow a number from another GPU or invent one, this recipe omits the speed figure. If you have measured Klein 4B generation time on a 7900 XTX, please contribute it so it lands on /check/flux-2-klein-4b/rx-7900-xtx.
VRAM usage: Klein 4B "fits in ~13GB VRAM" per BFL's official model card, which leaves roughly 11 GB of headroom on the 24 GB 7900 XTX. Keeping the full BF16 transformer (~7.75 GB on disk), the Qwen3-4B encoder (~8.05 GB), and the VAE all resident puts about 16 GB of weights on the GPU before activations, which still fits within 24 GB with room for higher batch sizes or image-editing workflows. See /check/flux-2-klein-4b/rx-7900-xtx for any community-submitted measurement.
Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4 billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the larger Flux.2 base). The 4 distilled steps make iteration fast. There is no quantization tradeoff to weigh on this card: run the native BF16 weights.
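The headroom arithmetic behind the VRAM note can be reproduced as follows. This is a back-of-the-envelope sketch with an illustrative helper name (headroom_gb). It assumes on-disk weight size is a fair proxy for resident VRAM and ignores activations, so treat the second figure as an upper bound on free memory, not a measurement.

```shell
# Remaining VRAM in GB (one decimal) given total and used figures.
headroom_gb() {
  awk -v total="$1" -v used="$2" 'BEGIN { printf "%.1f\n", total - used }'
}

headroom_gb 24 13       # against BFL's ~13 GB figure
headroom_gb 24 15.97    # against the summed on-disk sizes of all three files

# To observe actual usage while a generation runs (requires the ROCm tools):
#   rocm-smi --showmeminfo vram
```

The two calls print 11.0 and 8.0 respectively; both leave the model comfortably inside 24 GB.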

For the full benchmark data, see /check/flux-2-klein-4b/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a PyTorch build without GPU support (typically the CPU-only wheel) got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
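The version-suffix check above can be wrapped in a small script. This is a sketch, not an official diagnostic: is_rocm_build is an illustrative helper that only pattern-matches the version string, and the query step is skipped silently if Python or torch is not available in the current environment.

```shell
# Succeeds if a torch.__version__ string carries a ROCm local-version suffix.
is_rocm_build() {
  case "$1" in
    *+rocm*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Query the installed build, if Python and torch are available in this environment.
if ver=$(python -c "import torch; print(torch.__version__)" 2>/dev/null); then
  if is_rocm_build "$ver"; then
    echo "ROCm build detected: $ver"
  else
    echo "Not a ROCm build: $ver (reinstall from the ROCm wheel index)"
  fi
fi
```

A string match is deliberately conservative: it confirms which wheel is installed, while torch.cuda.is_available() confirms that the wheel can actually see the GPU.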

Black image or driver timeout at VAE decode

On RDNA3 + ROCm, the VAE-decode stage is the most commonly reported failure point for diffusion workflows: users report a black or garbage image, or a driver timeout, especially when decoding above 1024 px. This is tracked upstream in ComfyUI Issue #9547 ("VAE decoding on AMD works first time only before switching to tiled or black image") and ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"). It is a VAE-precision interaction, not a VRAM shortage (the reporter on #4729 notes it fails "well under VRAM limit"). Two things to try if you hit it:

Generate at 1024×1024 first to confirm the rest of the pipeline is healthy — #4729 reports 1024 px decode works while 1536 px times out.
Try running the VAE in bf16 via ComfyUI's --bf16-vae flag (per cli_args.py, documented as "Run the VAE in bf16."); some RDNA3 users report it helps, others get cleaner results forcing fp32 — try both and keep whichever produces a correct image on your driver/ROCm version. Track the upstream issues above for a permanent fix.
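The two precision settings can be compared with one launch per mode. The sketch below only prints the commands; vae_launch_cmd is an illustrative helper. Note the assumption: the text above names only --bf16-vae, while --fp32-vae is assumed here to be its fp32 counterpart in cli_args.py, so confirm both with python main.py --help on your checkout.

```shell
# Print the launch command for a given VAE precision: bf16, fp32, or default.
vae_launch_cmd() {
  case "$1" in
    bf16)    echo "python main.py --bf16-vae" ;;
    fp32)    echo "python main.py --fp32-vae" ;;   # assumed flag name, verify with --help
    default) echo "python main.py" ;;
    *)       echo "unknown VAE mode: $1" >&2; return 1 ;;
  esac
}

vae_launch_cmd bf16
vae_launch_cmd fp32
```

Run one mode at a time against the same prompt and seed, and keep whichever produces a correct image.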

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.

"Distorted colors / washed-out output"

You're loading the wrong VAE. Klein must use flux2-vae.safetensors (the Flux.2 family VAE per model_index.json) — loading a Flux.1, SDXL, or SD1.5 VAE will produce broken output. Likewise the text encoder must be Qwen3-4B (qwen_3_4b.safetensors), not a Flux.1 T5 file and not a Flux.2-dev Mistral3 encoder.

Note on FLUX.2-dev OOM reports (Issue #11)

flux2 Issue #11 ("3090 24G: cuda out of memory") is sometimes cited for Flux.2 memory trouble, but it reproduces on FLUX.2-dev with the Mistral3 text encoder under a CPU-offload path — a different model and a different encoder than this recipe's Klein 4B / Qwen3-4B, on NVIDIA hardware. The encoder-specific advice in that thread does not apply here. The model-class-independent part (VAE-decode memory pressure) is covered by the ROCm-specific VAE-decode section above, which is the relevant failure mode on this card; on the 7900 XTX you run Klein 4B fully resident at 24 GB, so the offload path that triggers Issue #11 isn't part of this recipe at all.