How much VRAM does Flux.2-Klein-4B need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

Flux.2 Klein 4B on RX 7800 XT: ComfyUI on ROCm, BF16 4-Step Text-to-Image (16 GB)

What You'll Build

A local Flux.2 Klein 4B text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack — generating 1024×1024 images from text prompts with the Apache-2.0, 4-billion-parameter, step-distilled member of Black Forest Labs' Flux.2 family. The Klein 4B model card states the model "fits in ~13GB VRAM" and is accessible on cards like the RTX 3090/4070 "and above" — so the full BF16 transformer, the Qwen3-4B text encoder, and the Flux.2 VAE fit on the 7800 XT's 16 GB, but the margin is tight: ~13 GB resident leaves only a few GB of headroom for activations and VAE decode. This recipe leads with the BF16 path and gives you a GGUF fallback to drop to if a workflow pushes peak VRAM past 16 GB.

Hardware data: RX 7800 XT (16 GB VRAM) · BF16, fits-with-care · ComfyUI on ROCm 7.2

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4/NVFP4 path here. RDNA3 has no FP8 or FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 or NVFP4 single-file that the NVIDIA recipes use for this model would just upcast to BF16 on this card — no memory saving, no speed-up — which is exactly why those squeeze-paths don't help you at 16 GB. To go below the BF16 envelope on this card you use GGUF (via llama.cpp-style quantization), not FP8. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to download a *-fp8.safetensors, *-nvfp4.safetensors, pip install xformers, or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	13 GB VRAM (per the BFL card)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	~16 GB BF16 (or ~4.3 GB transformer at GGUF Q8_0)	per HF tree API
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

The model is released under the Apache 2.0 license (per the model card; the 4B Klein weights are open under Apache 2.0, distinct from the non-commercial 9B Klein variant) and the weights are not gated on Hugging Face — no access request or login is required. Klein 4B is a latent diffusion model whose text encoder is Qwen3-4B (not the T5 family used by Flux.1, and not the Mistral3 family used by Flux.2 dev), confirmed by the repo's model_index.json ("text_encoder": ["transformers", "Qwen3ForCausalLM"]). The BF16 file set is ~16 GB on disk per the HF tree API: the consolidated transformer flux-2-klein-4b.safetensors is 7,751,105,712 bytes (~7.75 GB), the Qwen3-4B text encoder is ~8.05 GB (8,044,982,048 bytes), and the VAE is ~0.34 GB (336,211,292 bytes).
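
The "~16 GB on disk" total can be reproduced by summing the three byte counts quoted above. This is a minimal sketch that uses only the figures from this section; the dictionary keys are illustrative labels, not file paths.

```python
# Byte counts as quoted above (per the HF tree API).
FILE_BYTES = {
    "transformer": 7_751_105_712,
    "text_encoder": 8_044_982_048,
    "vae": 336_211_292,
}

total_bytes = sum(FILE_BYTES.values())
total_gb = total_bytes / 1e9     # decimal gigabytes, the unit used in this recipe
total_gib = total_bytes / 2**30  # binary gibibytes, for comparison

print(f"total on disk: {total_gb:.2f} GB ({total_gib:.2f} GiB)")
```

The decimal and binary totals differ by about 7%, which is worth keeping in mind when comparing download sizes with what a monitoring tool reports.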

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) the README lists for RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path. The RX 7800 XT is gfx1101; if a library ever ships only gfx1100 kernels, the legacy fallback is to export HSA_OVERRIDE_GFX_VERSION=11.0.0 to masquerade as gfx1100 — but on a current ROCm 7.2 stack this is not required and should not be set by default.
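
Because the rocmX.Y tag changes over time, it can help to treat it as a single parameter rather than editing the URL by hand. The sketch below rebuilds the two pip commands quoted above from a tag; the helper name `rocm_install_command` is hypothetical, and the tag you pass in should still come from the live ComfyUI README.

```python
import re

PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"

def rocm_install_command(tag: str, nightly: bool = False) -> str:
    """Build the pip install command for a ROCm wheel tag such as 'rocm7.2'."""
    # Reject anything that is not a rocmX.Y (or rocmX.Y.Z) tag, e.g. a CUDA tag.
    if not re.fullmatch(r"rocm\d+\.\d+(\.\d+)?", tag):
        raise ValueError(f"unexpected ROCm tag: {tag!r}")
    if nightly:
        return (f"pip install --pre torch torchvision torchaudio "
                f"--index-url {PYTORCH_WHEEL_BASE}/nightly/{tag}")
    return (f"pip install torch torchvision torchaudio "
            f"--index-url {PYTORCH_WHEEL_BASE}/{tag}")

print(rocm_install_command("rocm7.2"))
print(rocm_install_command("rocm7.2", nightly=True))
```

The helper only formats a string; it does not check that a wheel index for the given tag actually exists.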

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the Flux.2 Klein 4B BF16 files

Place each file in the ComfyUI folder shown. These are the full BF16 weights — the path that fits the 7800 XT's 16 GB with care (the FP8/NVFP4 single-files the NVIDIA tutorials lead with give no benefit on RDNA3, which has no FP8/FP4 hardware). Sizes are verified from the HF tree API.

# Diffusion model (transformer) — ~7.75 GB BF16
wget -P models/diffusion_models/ \
  https://huggingface.co/black-forest-labs/FLUX.2-klein-4B/resolve/main/flux-2-klein-4b.safetensors

# Qwen3-4B text encoder — ~8.05 GB
wget -P models/text_encoders/ \
  https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

# Flux.2 family VAE — ~0.34 GB
wget -P models/vae/ \
  https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B; the Comfy-Org repackaged qwen_3_4b.safetensors above is the full-precision encoder (it is also used by the official ComfyUI workflow). Make sure ComfyUI is recent enough to carry the Klein nodes (git pull && pip install -r requirements.txt if you cloned a while ago).
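
A truncated download is easy to miss until a loader fails, so a quick size check after the three wget commands can save time. The sketch below is run from the ComfyUI repo root and compares each file against the approximate sizes quoted in this recipe. The 5% tolerance is an arbitrary assumption, and the function name `check_downloads` is hypothetical.

```python
from pathlib import Path

# Approximate sizes quoted in this recipe, in decimal GB.
EXPECTED_GB = {
    "models/diffusion_models/flux-2-klein-4b.safetensors": 7.75,
    "models/text_encoders/qwen_3_4b.safetensors": 8.05,
    "models/vae/flux2-vae.safetensors": 0.34,
}

def check_downloads(root, expected_gb=EXPECTED_GB, rel_tol=0.05):
    """Return a list of problems; an empty list means every file looks plausible."""
    problems = []
    for rel_path, want_gb in expected_gb.items():
        path = Path(root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        got_gb = path.stat().st_size / 1e9
        # Flag files whose size is far from the quoted figure (likely truncated).
        if abs(got_gb - want_gb) > rel_tol * want_gb:
            problems.append(
                f"size mismatch: {rel_path} is {got_gb:.2f} GB, expected ~{want_gb} GB"
            )
    return problems

if __name__ == "__main__":
    for line in check_downloads(".") or ["all three files present with plausible sizes"]:
        print(line)
```

A size check only catches missing or truncated files; it does not verify file contents the way a checksum would.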

5. (Optional) GGUF fallback for tight 16 GB workflows

If a workflow pushes peak VRAM past 16 GB on this card — most likely at the VAE-decode stage above 1024 px — drop the transformer to a GGUF quant instead of BF16. Unsloth publishes a per-tier GGUF set, unsloth/FLUX.2-klein-4B-GGUF, explicitly built from the canonical base_model: black-forest-labs/FLUX.2-klein-4B and loaded via city96's ComfyUI-GGUF node. Per the GGUF repo's file table, the transformer drops from 7.75 GB at BF16 to ~4.3 GB at Q8_0 (4,300,644,928 bytes) or ~3.41 GB at Q6_K (3,409,273,408 bytes) — buying several GB of headroom for the resident Qwen3-4B encoder, activations, and VAE decode. Install the loader node and swap the BF16 Load Diffusion Model node for the GGUF loader:

# run from the ComfyUI repo root
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF.git
cd ComfyUI-GGUF && pip install -r requirements.txt
cd ../..  # back to the ComfyUI repo root, where main.py lives
# then place flux-2-klein-4b-Q8_0.gguf in models/diffusion_models/ and load it with the
# "Unet Loader (GGUF)" node instead of "Load Diffusion Model"

The text encoder and VAE stay the same files as the BF16 path; only the diffusion transformer is quantized.
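
The headroom each quant buys can be worked out from the byte counts quoted in this step. This sketch assumes the on-disk size difference is a reasonable proxy for the VRAM difference, which holds only if the loader keeps the weights quantized in memory.

```python
BF16_BYTES = 7_751_105_712  # flux-2-klein-4b.safetensors
GGUF_BYTES = {"Q8_0": 4_300_644_928, "Q6_K": 3_409_273_408}

def saved_gb(quant: str) -> float:
    """Decimal GB saved on disk by swapping the BF16 transformer for a GGUF quant."""
    return (BF16_BYTES - GGUF_BYTES[quant]) / 1e9

for quant, size in GGUF_BYTES.items():
    print(f"{quant}: {size / 1e9:.2f} GB, about {saved_gb(quant):.2f} GB smaller than BF16")
```

Both results fall in the 3 to 4.5 GB range, consistent with the "several GB of headroom" described above.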

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser and load one of the official Flux.2 Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) from the docs.comfy.org Klein tutorial; wire the three files above into the Load Diffusion Model / Load CLIP (Qwen3) / Load VAE nodes. For the distilled 4B variant — which is what flux-2-klein-4b.safetensors is (is_distilled: true in model_index.json) — use 4 steps at CFG 1.0; the base (undistilled) variant uses 25–50 steps at CFG 5.0 instead. Generate at 1024×1024 and queue. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.
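
The step and CFG values depend only on whether the checkpoint is distilled, so they can be derived from the is_distilled flag. The sketch below assumes is_distilled is a top-level key in model_index.json and treats a missing key as the base variant; both are assumptions, and the function name is hypothetical.

```python
import json

# Sampler values as stated in this recipe.
DISTILLED = {"steps_min": 4, "steps_max": 4, "cfg": 1.0}
BASE = {"steps_min": 25, "steps_max": 50, "cfg": 5.0}

def sampler_settings(model_index_text: str) -> dict:
    """Pick starting sampler settings from the is_distilled flag."""
    index = json.loads(model_index_text)
    # Assumption: a missing or false flag means the base (undistilled) variant.
    return dict(DISTILLED if index.get("is_distilled") is True else BASE)

print(sampler_settings('{"is_distilled": true}'))
```

These are starting values to type into the sampler node, not settings the helper applies to ComfyUI itself.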

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). If the auto-selected path misbehaves, force the PyTorch-2.0 cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

At 16 GB the BF16 path fits but the margin is small. If you see an out-of-memory error rather than a successful decode, the cleanest fixes in order are: (a) keep generations at 1024×1024 (avoid 1536 px+ on this card), (b) let ComfyUI offload the text encoder when it isn't in use (its default low-memory handling), or (c) switch the transformer to the GGUF Q8_0/Q6_K fallback from Installation step 5, which frees ~3–4 GB.
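
The advice to stay at 1024×1024 follows from how quickly pixel count grows with side length. The sketch below computes that ratio for a few square sizes. It assumes activation and decode memory grow at least in proportion to pixel count, which is a simplification; it does not measure VRAM.

```python
def pixel_ratio(side_px: int, baseline_px: int = 1024) -> float:
    """How many times more pixels a square image has than the baseline square."""
    return (side_px * side_px) / (baseline_px * baseline_px)

for side in (1024, 1280, 1536, 2048):
    print(f"{side}x{side}: {pixel_ratio(side):.2f}x the pixels of 1024x1024")
```

A 1536 px square has 2.25 times the pixels of a 1024 px square, which is why a workflow that fits at 1024 px can run out of memory at 1536 px.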

Results

Speed: No published generation-time benchmark for Flux.2 Klein 4B names the RX 7800 XT, and the benchmark backend has no entry for this pair yet (/check/flux-2-klein-4b/rx-7800-xt returns verdict: unknown). The only published GPU-named figures for this model are on NVIDIA hardware (the docs.comfy.org Klein tutorial lists distilled/base timings at FP8), and they do not transfer to the 7800 XT: the vendor and architecture differ, and those runs keep FP8 weights resident where this card runs BF16. No speed figure is given here rather than one borrowed from another GPU. If you've measured Klein 4B generation time on a 7800 XT, please contribute it so it lands on /check/flux-2-klein-4b/rx-7800-xt.
VRAM usage: Klein 4B "fits in ~13GB VRAM" per BFL's official model card, which names the RTX 3090/4070 "and above" as accessible cards. On the 16 GB 7800 XT that ~13 GB envelope fits, but with only a few GB of headroom, so treat 16 GB as fits-with-care rather than comfortable. The on-disk sizes of the BF16 transformer (~7.75 GB), the Qwen3-4B encoder (~8.05 GB), and the VAE (~0.34 GB) sum to about 16.1 GB, more than the ~13 GB figure, so that figure presumably assumes the three are not all resident at the same time. If a workflow peaks over 16 GB, the GGUF Q8_0/Q6_K fallback (Installation step 5) drops the transformer to ~4.3/~3.41 GB. See /check/flux-2-klein-4b/rx-7800-xt for any community-submitted measurement.
Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4 billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the larger Flux.2 base). The 4 distilled steps make iteration fast. If you move to a GGUF quant for headroom, lower tiers (Q4 and below) trade some fidelity for memory; Q8_0/Q6_K stay close to BF16.

For the full benchmark data, see /check/flux-2-klein-4b/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means the installed PyTorch wheel was built without GPU support (a CPU-only build) instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
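
The same check can be scripted so it gives a readable answer even when torch is missing. The sketch below relies on the +rocm version suffix described above and on torch.version.hip, which is set on ROCm builds and None otherwise. The helper names are hypothetical.

```python
def looks_like_rocm_build(version: str, hip_version=None) -> bool:
    """True if the version has a +rocm local tag or a HIP version is reported."""
    local_tag = version.partition("+")[2]
    return local_tag.startswith("rocm") or bool(hip_version)

def describe_torch() -> str:
    """Summarize the installed torch build without crashing if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    hip = getattr(torch.version, "hip", None)
    kind = "ROCm" if looks_like_rocm_build(torch.__version__, hip) else "non-ROCm"
    return (f"torch {torch.__version__} ({kind} build), hip={hip}, "
            f"gpu_available={torch.cuda.is_available()}")

print(describe_torch())
```

A "non-ROCm build" result, or gpu_available=False on a ROCm build, points back to the reinstall step above or to the driver install respectively.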

Out of memory at 16 GB

Unlike the 24 GB Radeon cards, the 7800 XT has little spare margin once the BF16 transformer + Qwen3-4B encoder are resident. If you hit an OOM: generate at 1024×1024 (not 1536 px+), confirm ComfyUI is offloading the text encoder when idle, and — the most reliable fix — switch the diffusion transformer to the GGUF Q8_0 or Q6_K quant from Installation step 5, which frees ~3–4 GB versus BF16 while keeping quality close. Do not reach for an FP8 or NVFP4 single-file to save memory here: RDNA3 has no FP8/FP4 hardware, so those upcast to BF16 on load and save nothing.

Black image or driver timeout at VAE decode

On RDNA3 + ROCm the VAE-decode stage is the most commonly-reported failure point for diffusion workflows — users report a black/garbage image or a driver timeout, especially when decoding above 1024 px. This is tracked upstream in ComfyUI Issue #9547 ("VAE decoding on AMD works first time only before switching to tiled or black image") and ROCm Issue #4729 ("VAE decode defaults to FP32 causing driver timeout above 1024 pixel"). It is a VAE-precision interaction, not only a VRAM shortage (the reporter on #4729 notes it fails "well under VRAM limit"). Two things to try if you hit it:

Generate at 1024×1024 first to confirm the rest of the pipeline is healthy — #4729 reports 1024 px decode works while 1536 px times out (and on this 16 GB card, 1536 px also risks a true OOM, so 1024 px is doubly preferred).
Try running the VAE in bf16 via ComfyUI's --bf16-vae flag (per cli_args.py, documented as "Run the VAE in bf16."). Some RDNA3 users report it helps; others get cleaner results forcing fp32 with --fp32-vae. Try both and keep whichever produces a correct image on your driver/ROCm version. Track the upstream issues above for a permanent fix.

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.

"Distorted colors / washed-out output"

You're loading the wrong VAE. Klein must use flux2-vae.safetensors (the Flux.2 family VAE per model_index.json) — loading a Flux.1, SDXL, or SD1.5 VAE will produce broken output. Likewise the text encoder must be Qwen3-4B (qwen_3_4b.safetensors), not a Flux.1 T5 file and not a Flux.2-dev Mistral3 encoder.

Note on FLUX.2-dev OOM reports (Issue #11)

flux2 Issue #11 ("3090 24G: cuda out of memory") is sometimes cited for Flux.2 memory trouble, but it reproduces on FLUX.2-dev with the Mistral3 text encoder under a CPU-offload path — a different model and a different encoder than this recipe's Klein 4B / Qwen3-4B, and on NVIDIA hardware. The encoder-specific advice in that thread does not apply here, and whether the Mistral3-encoder workaround would even behave the same under ROCm is unverified — don't assume it transfers. The model-class-independent part (VAE-decode memory pressure) is covered by the ROCm-specific VAE-decode section above, which is the relevant failure mode on this card. On the 7800 XT the practical memory lever is the BF16-vs-GGUF transformer choice (Installation step 5), not the offload path that triggers Issue #11.