Flux.2 Klein 4B on RTX 4090: BF16 Full-Resident 4-Step Text-to-Image via Diffusers or ComfyUI

image · beginner · 20GB+ VRAM · May 20, 2026
Prerequisites
  • NVIDIA RTX 4090 (24 GB VRAM) — Ada sm_89
  • Python 3.10+ (Python 3.12 for the official BFL repo)
  • ComfyUI (latest nightly with Klein nodes) or the `diffusers` Python package

What You'll Build

Generate 1024×1024 images locally on an RTX 4090 with Black Forest Labs' Flux.2 Klein 4B, the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family. With 24 GB of VRAM you can skip CPU offload entirely and keep the BF16 transformer, the Qwen3-4B text encoder, and the VAE resident on the GPU, which is the simplest reproducible path for this card.

Hardware data: RTX 4090 (24 GB VRAM) · ~20 GB peak VRAM at full-resident BF16 (see Results) · See benchmark data

Requirements

Component | Minimum | Tested
GPU | 13 GB VRAM with enable_model_cpu_offload(); 24 GB for the full-resident path | RTX 4090 (24 GB)
RAM | 16 GB |
Storage | ~24 GB (full diffusers checkout) or ~8 GB (FP8 ComfyUI single-file) |
Software | diffusers + transformers + accelerate, OR ComfyUI (latest nightly with Klein nodes) |

The full diffusers checkout from black-forest-labs/FLUX.2-klein-4B totals ~16 GB of BF16 weights on disk (7.75 GB transformer + 8.05 GB Qwen3-4B text encoder shards + 168 MB VAE per the HF tree API), plus ancillary files. The single-file consolidated weight is also 7.75 GB. The ComfyUI FP8 single-file path uses flux-2-klein-4b-fp8.safetensors instead — see Path B.
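
The size arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in this section; the dictionary keys are illustrative labels, not file names from the repository.

```python
# Component sizes in GB, as quoted above from the HF tree API.
COMPONENT_GB = {
    "transformer": 7.75,
    "text_encoder": 8.05,  # Qwen3-4B shards
    "vae": 0.168,          # 168 MB
}
SINGLE_FILE_GB = 7.75      # consolidated single-file weight


def pipeline_weights_gb(components=COMPONENT_GB):
    """Total BF16 weights the diffusers pipeline loads, in GB."""
    return sum(components.values())


total = pipeline_weights_gb()
print(f"pipeline weights: {total:.2f} GB")
print(f"with single-file copy: {total + SINGLE_FILE_GB:.2f} GB")
```

The second figure assumes the consolidated file sits in the same checkout as the per-component weights, which would account for the ~24 GB storage estimate in the Requirements table.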

Installation

Two supported paths — pick one. The diffusers path is the most direct reproduction of the official Quick Start; the ComfyUI path is preferred if you already have a Flux.1 workflow set up.

Path A — Diffusers (Python, official example)

1. Install dependencies

pip install -U diffusers transformers accelerate

The Flux.2 family pipelines (including Flux2KleinPipeline) live in diffusers mainline; an up-to-date install is sufficient. If you want the bleeding-edge build that the BFL HF card pins, install directly from main: pip install git+https://github.com/huggingface/diffusers.git.

2. Run the official example

Save the snippet below as flux_klein.py, the filename the Running section uses. It is the example published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B with one change: the enable_model_cpu_offload() call is commented out and the pipeline is moved to the GPU with pipe.to(device) instead. On a 24 GB RTX 4090 there is no need to offload, and keeping everything resident is faster.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)
pipe.to(device)
# On 16 GB-class cards, re-enable the next line to keep peak VRAM near the
# BFL-stated ~13 GB envelope. On a 24 GB RTX 4090 it is not needed.
# pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the recommended values for the distilled variant per the BFL model card.
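
The choice between the full-resident path and CPU offload can be written as a small helper. This is a sketch, not part of the official example: the function name is hypothetical, and the cutoffs are assumptions taken from the figures in this guide (~20 GB measured peak for full-resident BF16, ~13 GB with offload).

```python
# Cutoffs in GB, assumed from the figures quoted in this guide.
FULL_RESIDENT_PEAK_GB = 20  # community-measured peak, full-resident BF16
OFFLOAD_PEAK_GB = 13        # BFL-stated envelope with CPU offload


def pick_memory_mode(total_vram_gb: float) -> str:
    """Return 'resident', 'offload', or 'unsupported' for a given VRAM size."""
    if total_vram_gb >= FULL_RESIDENT_PEAK_GB:
        return "resident"     # pipe.to("cuda"), no offload
    if total_vram_gb >= OFFLOAD_PEAK_GB:
        return "offload"      # pipe.enable_model_cpu_offload()
    return "unsupported"


for vram in (24, 16, 8):
    print(vram, pick_memory_mode(vram))
```

With PyTorch installed, torch.cuda.get_device_properties(0).total_memory / 1024**3 gives the card's size in GiB to pass in. A card that only just clears a cutoff has no headroom, so treat the result as a starting point rather than a guarantee.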

Path B — ComfyUI

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B variant needs three files: one diffusion model (distilled or base), the text encoder, and the VAE.

File | Folder
flux-2-klein-4b-fp8.safetensors (distilled) or flux-2-klein-base-4b-fp8.safetensors (base) | ComfyUI/models/diffusion_models/
qwen_3_4b.safetensors | ComfyUI/models/text_encoders/
flux2-vae.safetensors | ComfyUI/models/vae/

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
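
Before launching ComfyUI, it can help to confirm that every file is in the folder the table lists. The following is a minimal sketch for the distilled variant; the function name and the "ComfyUI" root path are assumptions, so adjust them to your install.

```python
from pathlib import Path

# Relative paths from the table above (distilled 4B variant).
REQUIRED_FILES = [
    "models/diffusion_models/flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/flux2-vae.safetensors",
]


def missing_klein_files(comfy_root, required=REQUIRED_FILES):
    """Return the required files that are absent under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed checkout location.
    for rel in missing_klein_files("ComfyUI"):
        print("missing:", rel)
```

An empty result means all three files are present by name; it does not verify that their contents are the right weights.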

Running

Diffusers

python flux_klein.py

Output flux-klein.png lands in your working directory. First run downloads the weights from the Hub (~16 GB BF16) into ~/.cache/huggingface/.

ComfyUI

python main.py --listen

Open http://localhost:8188 and load one of the six official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) — downloadable from the docs.comfy.org tutorial page. For the distilled variant: 4 steps at CFG 1.0. For the base variant: 25–50 steps at CFG 5.0, per Next Diffusion's Klein walkthrough.
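
If you script generations for both variants, the sampler settings above can live in one lookup so the two are never mixed up. This is an illustrative sketch; the class and function names are hypothetical, and the base entry picks the low end of the quoted 25-50 step range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerSettings:
    steps: int
    cfg: float


# Values quoted in this guide; base uses the low end of the 25-50 range.
KLEIN_SETTINGS = {
    "distilled": SamplerSettings(steps=4, cfg=1.0),
    "base": SamplerSettings(steps=25, cfg=5.0),
}


def settings_for(variant: str) -> SamplerSettings:
    """Look up sampler settings, rejecting unknown variant names."""
    try:
        return KLEIN_SETTINGS[variant]
    except KeyError:
        raise ValueError(f"unknown Klein variant: {variant!r}") from None


print(settings_for("distilled"))
```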

Results

  • VRAM usage: ~20 GB peak when running BF16 full-resident on RTX 4090, per the community benchmark (works=true, 20 GB peak). This is consistent with the on-disk BF16 weight envelope from the HF tree API (~16 GB for transformer + text encoder + VAE, plus activation headroom). With pipe.enable_model_cpu_offload() enabled, peak drops to BFL's stated "~13GB VRAM" envelope per the official model card, corroborated by the Spheron deploy guide, which lists "RTX 4090 ... FLUX.2-klein-4B (~13GB)" verbatim.
  • Speed: BFL advertises "Sub-second inference" for Klein 4B ("Generate or edit images under a second on modern hardware") in the official Flux.2 GitHub README, but does not name the RTX 4090 or give a measured time for it. The official ComfyUI Flux.2 Klein tutorial lists distilled ~1.2s · 8.4GB VRAM for the RTX 5090 at FP8. That figure does not transfer to the RTX 4090: the 5090 is a newer architecture generation with roughly 78% more memory bandwidth (1,792 GB/s vs 1,008 GB/s), and FP8 is not the BF16 path used here. No measured generation time naming the RTX 4090 had surfaced in published sources at the time of writing. See /check/flux-2-klein-4b/rtx-4090 for community speed measurements as they land via /contribute.
  • Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4-billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev 32B base). The 4 distilled steps make iteration extremely fast.
  • License: Apache 2.0 — commercial use permitted (per the model card).

For the full benchmark data, see /check/flux-2-klein-4b/rtx-4090.

Troubleshooting

"Distorted colors / washed-out output"

You're loading the wrong VAE. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein/Dev/Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used. Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file you might still have on disk.
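
To confirm which text encoder a checkout declares, you can read the entry straight out of model_index.json. This is a minimal sketch with a hypothetical helper name; the sample string is a stand-in containing only the entry quoted above, not the full file.

```python
import json


def text_encoder_class(model_index_text: str) -> str:
    """Return the class name declared for the text encoder in model_index.json."""
    index = json.loads(model_index_text)
    _library, class_name = index["text_encoder"]
    return class_name


# Stand-in for Klein's model_index.json with only the quoted entry.
sample = '{"text_encoder": ["transformers", "Qwen3ForCausalLM"]}'
print(text_encoder_class(sample))
```

For a real checkout, pass the contents of the model_index.json at the repository root. Any class name other than Qwen3ForCausalLM suggests the wrong model folder.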

"OOM unexpectedly on a 24 GB card"

Two common causes:

  1. You re-enabled pipe.enable_model_cpu_offload() and also tried to push a second pipeline to the GPU. The offload path holds weights in CPU RAM and streams to GPU — fine on its own, but if a second resident model is loaded the streaming buffers compete for the leftover budget. Either stay fully resident on the 4090 (the recommended path here) or restart Python before loading anything else.
  2. You bumped resolution past 1024×1024. Activation memory grows quadratically with side length; 2048×2048 BF16 full-resident may exceed 24 GB. Drop back to 1024×1024, or re-enable pipe.enable_model_cpu_offload() for the higher-resolution run.

If neither applies and you still see OOM, re-enable pipe.enable_model_cpu_offload() — it brings peak back into the BFL-stated ~13 GB envelope at a modest throughput cost.
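
The quadratic growth mentioned in the second cause is easy to quantify. This sketch assumes activation memory scales with pixel count and ignores the fixed weight footprint, so it gives a rough multiplier rather than a peak-VRAM prediction; the function name is illustrative.

```python
def activation_scale(side: int, baseline: int = 1024) -> float:
    """Relative activation memory for a square image, assuming it scales with pixel count."""
    return (side / baseline) ** 2


for side in (1024, 1536, 2048):
    print(f"{side}x{side}: {activation_scale(side):.2f}x the 1024x1024 activations")
```

Doubling the side length to 2048 quadruples the activation term. Actual peak usage also depends on the attention implementation and on any tiling the VAE applies.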

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all six variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.