Flux.2 Klein 4B on RTX 5090: FP8 1.2-Second Generation, Blackwell-Native Speed Win

image · beginner · 9GB+ VRAM · May 24, 2026
prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM) — Blackwell sm_120
  • Python 3.10+ (Python 3.12 for the official BFL repo)
  • ComfyUI (latest nightly with Klein nodes) or the `diffusers` Python package
  • PyTorch built with cu128 wheels (Blackwell sm_120 support)

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B (the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family) on an RTX 5090. The official ComfyUI Flux.2 Klein tutorial publishes RTX 5090 numbers directly: distilled ~1.2 s · 8.4 GB VRAM at FP8. The recipe leads with the FP8 ComfyUI path because the Blackwell sm_120 architecture has native FP8 tensor-core acceleration, and the ComfyUI docs measured that path on this exact card. At ~8.4 GB peak, the 32 GB card keeps more than 23 GB free for parallel pipelines, larger resolutions, or model colocation.

Hardware data: RTX 5090 (32 GB VRAM) · ~8.4 GB peak VRAM at FP8 distilled · ~1.2 s per 1024×1024 image · See benchmark data

ℹ️ Why the 5090 leads with FP8 instead of BF16. On Ampere (RTX 3090), FP8 weights work but the compute path dequantizes to BF16 at runtime, so there is no speed win, only VRAM savings (see the 3090 sibling recipe, which uses CPU offload to land at ~13 GB). On Ada (RTX 4090), full-resident BF16 is the simplest path and fits within the card's 24 GB (see the 4090 sibling recipe, which lands at ~20 GB). On Blackwell sm_120 the calculus flips: FP8 tensor cores are hardware-accelerated, so the FP8 path is genuinely faster than BF16 here, and the 8.4 GB envelope is so small that the 5090's 32 GB budget could host the model three times over. This recipe pins FP8 as the primary path; BF16 full-resident is documented as the alternative for users who want maximum quality.
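The per-architecture reasoning above can be sketched as a small lookup. This is only an illustration of this recipe family's recommendations for the three consumer cards it names; the function name and the returned labels are hypothetical, and other GPUs (data-center parts, for example) are not covered by the source.

```python
def recommended_path(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the path the recipes suggest.

    Thresholds mirror the note above: sm_120 (RTX 5090), sm_89 (RTX 4090),
    and everything older such as sm_86 (RTX 3090). Labels are illustrative.
    """
    capability = (major, minor)
    if capability >= (12, 0):
        return "fp8-comfyui"        # Blackwell: FP8 is the fast path
    if capability >= (8, 9):
        return "bf16-resident"      # Ada: full-resident BF16 (~20 GB)
    return "bf16-cpu-offload"       # Ampere: BF16 with CPU offload (~13 GB)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        print("PyTorch not installed; pass a capability manually")
    else:
        if torch.cuda.is_available():
            print(recommended_path(*torch.cuda.get_device_capability(0)))
        else:
            print("No CUDA device visible")
```

On an RTX 5090 with a working cu128 build, the capability reported by PyTorch should be (12, 0), which maps to the FP8 ComfyUI path.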

Requirements

Component | Minimum | Tested
GPU | 13 GB VRAM with enable_model_cpu_offload() (BF16); 9 GB for the FP8 ComfyUI path | RTX 5090 (32 GB)
RAM | 16 GB |
Storage | ~8 GB (FP8 ComfyUI single-file + Qwen3-4B encoder + VAE) or ~16 GB (full BF16 diffusers checkout) |
Software | ComfyUI (latest nightly with Klein nodes) OR diffusers (main) + transformers + accelerate |

The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is one consolidated 7.75 GB transformer file plus the Qwen3-4B text encoder shards and the Flux.2 VAE, per the HF tree API. The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships a 4.07 GB FP8 single-file used by the ComfyUI path.

Installation

Two supported paths — pick one. The ComfyUI FP8 path (Path A) is the recommended default for an RTX 5090 because docs.comfy.org's published 5090 measurement targets exactly this path. The diffusers full-resident BF16 path (Path B) is for users who prefer Python scripting and don't mind a larger VRAM footprint.

Path A — ComfyUI FP8 (recommended for RTX 5090)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

For Blackwell sm_120 PyTorch support, install the cu128 wheels:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

Per the official ComfyUI Flux.2 Klein tutorial, the recommended ComfyUI build is the latest nightly (or for Desktop/Cloud, wait for the next stable that includes Klein nodes).
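To confirm that the installed PyTorch wheel was actually built with sm_120 kernels, a quick check along these lines may help. It is a minimal sketch: the helper names are hypothetical, while `torch.cuda.get_arch_list()`, `torch.version.cuda`, and `torch.cuda.get_device_capability()` are standard PyTorch calls.

```python
def build_supports_arch(arch_list, arch="sm_120"):
    """Return True if the PyTorch build lists kernels for `arch`."""
    return arch in arch_list


def describe_torch_build():
    """Collect version and architecture info from the local PyTorch install."""
    import torch  # imported lazily so the helper above works without PyTorch

    info = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,            # expected to start with "12.8" for cu128
        "archs": torch.cuda.get_arch_list(),   # empty list if CUDA is unavailable
    }
    if torch.cuda.is_available():
        info["device_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    try:
        build = describe_torch_build()
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(build)
        print("sm_120 kernels present:", build_supports_arch(build["archs"]))
```

If `sm_120` is missing from the list, the wheel most likely came from a different index URL than the cu128 one shown above.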

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

File | Folder
flux-2-klein-4b-fp8.safetensors (4.07 GB, distilled), or flux-2-klein-base-4b-fp8.safetensors (base) | ComfyUI/models/diffusion_models/
qwen_3_4b.safetensors | ComfyUI/models/text_encoders/
flux2-vae.safetensors | ComfyUI/models/vae/

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
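Before launching ComfyUI, the file placement can be verified with a short script. This sketch checks only the distilled variant's three filenames listed above; the function name is hypothetical and the ComfyUI root path is whatever your checkout uses.

```python
from pathlib import Path

# Folder (under ComfyUI/models/) -> filename, as listed in the table above.
EXPECTED_FILES = {
    "diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "text_encoders": "qwen_3_4b.safetensors",
    "vae": "flux2-vae.safetensors",
}


def missing_klein_files(comfy_root):
    """Return the relative paths of expected Klein files that are absent."""
    models_dir = Path(comfy_root) / "models"
    return [
        f"{folder}/{name}"
        for folder, name in EXPECTED_FILES.items()
        if not (models_dir / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_klein_files(".")  # run from the ComfyUI checkout
    print("All files present" if not missing else f"Missing: {missing}")
```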

Path B — Diffusers (Python, full-resident BF16)

1. Install dependencies

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.

2. Run the official example

This is the snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified except that the enable_model_cpu_offload() line is commented out: on a 32 GB RTX 5090 the entire BF16 pipeline fits resident (about 20 GB peak per the 4090 sibling recipe, leaving roughly 12 GB free). Save it as flux_klein.py.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)
pipe.to(device)
# On 16 GB-class cards, re-enable the next line to keep peak near the
# BFL-stated ~13 GB envelope. On a 32 GB RTX 5090 it is not needed.
# pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the recommended values for the distilled variant per the BFL model card. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — uses 25–50 steps at CFG 5.0 instead, per the Next Diffusion Klein walkthrough.
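The two sampler presets can be kept in one place so a script cannot mix them up. This is a minimal sketch: the helper name and the choice of 25 as the default base step count are assumptions (the source only gives the 25–50 range), and the keyword names simply mirror the snippet above.

```python
# Presets quoted in this recipe; "steps_range" is (min, max) inclusive.
PRESETS = {
    "distilled": {"steps_range": (4, 4), "guidance_scale": 1.0},
    "base": {"steps_range": (25, 50), "guidance_scale": 5.0},
}


def sampler_kwargs(variant, steps=None):
    """Build pipeline keyword arguments for the given Klein variant."""
    if variant not in PRESETS:
        raise ValueError(f"unknown variant: {variant!r}")
    low, high = PRESETS[variant]["steps_range"]
    steps = low if steps is None else steps  # default: lower bound (assumption)
    if not low <= steps <= high:
        raise ValueError(f"{variant} expects {low}-{high} steps, got {steps}")
    return {
        "num_inference_steps": steps,
        "guidance_scale": PRESETS[variant]["guidance_scale"],
    }
```

Usage would look like `pipe(prompt=prompt, height=1024, width=1024, **sampler_kwargs("distilled"))`.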

Running

ComfyUI (Path A)

python main.py --listen

Open http://localhost:8188 and load one of the six official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant: 4 steps at CFG 1.0. For the base variant: 25–50 steps at CFG 5.0.
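For scripted runs, a workflow can also be queued through ComfyUI's HTTP API instead of the browser. The sketch below builds a POST to the `/prompt` endpoint; it assumes the workflow has been re-exported from the UI in API format (the downloadable UI-format JSONs are not accepted by this endpoint as-is), and the helper name and the file name in the usage comment are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow, host="127.0.0.1", port=8188):
    """Build (but do not send) a request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, with ComfyUI already running (file name is a placeholder):
#   with open("klein_distilled_api.json") as fh:
#       request = build_queue_request(json.load(fh))
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```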

Diffusers (Path B)

python flux_klein.py

Output flux-klein.png lands in your working directory. First run downloads the BF16 weights from the Hub into ~/.cache/huggingface/. The first generation also includes one-time warm-up work (CUDA context creation and kernel initialization; the snippet above does not call torch.compile), so steady-state generations are faster.
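To separate the slower first generation from steady-state speed, a timing helper along these lines can be wrapped around any generation call. It is a sketch with hypothetical names; on a GPU, pass `torch.cuda.synchronize` as `sync` so the timer waits for queued kernels to finish.

```python
import time
from statistics import mean


def time_steady_state(generate, warmup=1, runs=3, sync=None):
    """Return the mean wall-clock seconds per call, excluding warm-up calls."""
    for _ in range(warmup):
        generate()                 # discarded: includes one-time setup cost
    if sync is not None:
        sync()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()                 # wait for asynchronous GPU work
        samples.append(time.perf_counter() - start)
    return mean(samples)


# Usage with the Path B pipeline (assumes `pipe` and `prompt` from the script):
#   import torch
#   seconds = time_steady_state(
#       lambda: pipe(prompt=prompt, height=1024, width=1024,
#                    guidance_scale=1.0, num_inference_steps=4),
#       sync=torch.cuda.synchronize,
#   )
```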

Results

  • Speed (FP8 ComfyUI, recommended path): Per the official ComfyUI Flux.2 Klein tutorial, the distilled 4B variant runs at "~1.2s (5090)" per 1024×1024 image. The base 4B variant runs at "~17s (5090)" per image (more steps, more compute). These are the only first-party hardware-named timing numbers published for Klein at the time of writing.
  • VRAM usage (FP8 ComfyUI, recommended path): ~8.4 GB peak for the distilled variant on RTX 5090 per the official docs.comfy.org tutorial ("distilled ~1.2s (5090) · 8.4GB VRAM"); ~9.2 GB peak for the base variant ("base ~17s (5090) · 9.2GB VRAM"). Either path leaves more than 23 GB of headroom on the 32 GB envelope — see "Spending the headroom" below. Cross-checks: the FP8 single-file on disk is 4.07 GB per the HF tree API, consistent with the 8.4 GB runtime peak once the Qwen3-4B encoder, VAE, and activations are also resident.
  • VRAM usage (BF16 diffusers, Path B alternative): ~13 GB peak with enable_model_cpu_offload() enabled per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above" — RTX 5090 is covered by the "and above" wording); higher (~20 GB peak as measured on the RTX 4090 sibling) without offload, which is the simpler default on a 32 GB card.
  • Quality notes: Klein is the small/distilled member of the Flux.2 family. Expect strong prompt adherence at 4 billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev 32B base, which requires an H100-equivalent GPU per the official Flux.2 README). The 4 distilled steps complete in ~1.2 s total per image, which makes iteration extremely fast on Blackwell.
  • License: Apache 2.0 — commercial use permitted (per the model card).
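To compare your own card against the published VRAM figures on the diffusers path, PyTorch's allocator statistics can be read around a generation call. This is a sketch with hypothetical helper names; note that `torch.cuda.max_memory_allocated()` counts only tensors allocated through PyTorch in the current process, so it will usually read lower than what nvidia-smi shows, and the published numbers do not state which measure they use.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1024**3 bytes)."""
    return num_bytes / 1024**3


def measure_peak_vram(generate):
    """Run `generate` once and return peak PyTorch-allocated VRAM in GiB."""
    import torch  # imported lazily; requires a CUDA-enabled build

    torch.cuda.reset_peak_memory_stats()
    generate()
    torch.cuda.synchronize()
    return bytes_to_gib(torch.cuda.max_memory_allocated())


# Usage (assumes `pipe` and `prompt` from the Path B script):
#   peak = measure_peak_vram(lambda: pipe(prompt=prompt, num_inference_steps=4,
#                                         guidance_scale=1.0))
#   print(f"peak allocated: {peak:.1f} GiB")
```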

For the full benchmark data, see /check/flux-2-klein-4b/rtx-5090.

Spending the headroom

At ~8.4 GB peak for the FP8 distilled path, the RTX 5090's 32 GB budget leaves roughly 23 GB unused — concrete options:

  • Colocate a 14B-class LLM. A Q4_K_M quant of a 14B-class model (Qwen3-14B, DeepSeek-R1-Distill-Qwen-14B) runs at ~9-10 GB resident. Klein 4B FP8 (~8.4 GB) + a 14B-Q4 LLM (~10 GB) + 8K KV cache comes to ~22 GB, leaving ~10 GB for batch activations. Useful for image-gen + caption-rewrite pipelines on a single card.
  • Higher resolution. Moving from 1024×1024 to 2048×2048 quadruples the pixel count and roughly quadruples activation memory. With the FP8 weights at only 4.07 GB on disk, the encoder, VAE, and larger activations should still fit well within 32 GB.
  • Batch generation. Process multiple prompts concurrently. With ~23 GB headroom and ~6 GB per additional 1024² latent + activation set, a batch of 3-4 images per inference call is feasible without VRAM thrash.
  • Two pipelines resident. Load both flux-2-klein-4b-fp8 (distilled, fast iteration) and flux-2-klein-base-4b-fp8 (base, higher quality) simultaneously — combined ~18 GB peak — and route prompts to either at runtime without reload latency.
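The arithmetic behind these options can be checked with a small budget calculator. The figures are the ones quoted in the list above, except the KV-cache size, which is back-solved here as ~3.6 GB so that the colocation total matches the stated ~22 GB; treat that value and the helper names as assumptions.

```python
TOTAL_VRAM_GB = 32.0  # RTX 5090


def headroom_gb(*components_gb, total=TOTAL_VRAM_GB):
    """VRAM left after summing the resident components (all in GB)."""
    return total - sum(components_gb)


def max_extra_images(peak_gb, per_image_gb, total=TOTAL_VRAM_GB):
    """How many additional per-image activation sets fit beside the peak."""
    return int((total - peak_gb) // per_image_gb)


if __name__ == "__main__":
    # Klein FP8 + 14B Q4 LLM + assumed ~3.6 GB KV cache
    print(f"colocation headroom: {headroom_gb(8.4, 10.0, 3.6):.1f} GB")
    # Distilled + base FP8 pipelines resident together
    print(f"two-pipeline headroom: {headroom_gb(8.4, 9.2):.1f} GB")
    # Extra 1024x1024 images at ~6 GB each on top of the 8.4 GB peak
    print(f"extra batch images: {max_extra_images(8.4, 6.0)}")
```

Three extra images on top of the first gives a batch of four, which is the upper end of the 3-4 range quoted above.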

Troubleshooting

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses (Issue #11 on the flux2 repo is a Flux.2-dev bnb-4bit report on RTX 3090 — its encoder-specific workaround does not transfer to Klein, which has a different encoder family entirely). Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
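If you are unsure which encoder a local diffusers checkout expects, the model_index.json entry quoted above can be inspected directly. This is a minimal sketch; the helper names and the path in the usage comment are hypothetical.

```python
import json


def text_encoder_class(model_index):
    """Return the text encoder class name from a parsed model_index.json."""
    entry = model_index.get("text_encoder")
    # Entries have the form [library, class_name], e.g. ["transformers", "..."]
    return entry[1] if entry else None


def expects_qwen3_encoder(model_index):
    """True when the checkout declares the Qwen3 encoder that Klein uses."""
    return text_encoder_class(model_index) == "Qwen3ForCausalLM"


# Usage (path is a placeholder for your local checkout):
#   with open("FLUX.2-klein-4B/model_index.json") as fh:
#       print(expects_qwen3_encoder(json.load(fh)))
```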

"FlashAttention-2 crash on first inference"

If you've manually set attn_implementation="flash_attention_2" somewhere in a wrapper or custom node, that override is the likely cause: FA2 support for sm_120 is still tracked as open at Dao-AILab/flash-attention#2168 (confirmed open as of this writing). Remove the override to fall back to the diffusers default attention, scaled_dot_product_attention (SDPA), which has native sm_120 kernels in any cu128 PyTorch build and is the always-works fallback. The official Klein snippet does not request FA2; only a custom override would trigger this.
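For wrappers that pass an attention setting through, a guard like the following can downgrade the FA2 request on Blackwell. It is a heuristic sketch with a hypothetical function name; the capability cutoff reflects the open issue mentioned above and would need revisiting once FA2 ships sm_120 kernels.

```python
def safe_attn_implementation(requested, capability):
    """Fall back to SDPA when FA2 is requested on sm_120 or newer.

    `capability` is a (major, minor) tuple such as the one returned by
    torch.cuda.get_device_capability().
    """
    if requested == "flash_attention_2" and tuple(capability) >= (12, 0):
        return "sdpa"
    return requested
```

For example, `safe_attn_implementation("flash_attention_2", (12, 0))` returns `"sdpa"`, while the same request on an RTX 4090 at `(8, 9)` is passed through unchanged.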

"Should I use the BF16 path instead of FP8 for maximum quality on 32 GB?"

The FP8 single-file is what docs.comfy.org's RTX 5090 measurement was taken on, and the quality delta vs full-resident BF16 is small for a step-distilled model. If you specifically want BF16: use Path B (diffusers), leave pipe.enable_model_cpu_offload() commented out as in the snippet above, and expect a peak around 20 GB (the figure measured in the 4090 sibling recipe). Wall-clock time per image will likely be slightly worse than FP8 on Blackwell because the FP8 path uses the sm_120 FP8 tensor cores. Measure your own throughput and pick based on your quality bar.

"Can I use the 9B variant on a 5090?"

Yes — the docs.comfy.org Klein tutorial lists Klein 9B files alongside 4B. The 9B variant ships a qwen_3_8b_fp8mixed.safetensors text encoder (larger than the 4B variant's qwen_3_4b.safetensors) plus a 9B diffusion model file. The 5090's 32 GB envelope comfortably fits the 9B path, but timing/VRAM numbers for the 9B variant aren't published on the same docs.comfy.org page — report measurements via /contribute once you have them.

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all six variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.