Flux.2 Klein 4B on RTX 3090: BFL-Recommended ~13 GB CPU-Offload Path for 4-Step Text-to-Image

image · beginner · 13GB+ VRAM · May 22, 2026
Prerequisites
  • NVIDIA RTX 3090 (24 GB VRAM) — Ampere sm_86
  • Python 3.10+ (Python 3.12 for the official BFL repo)
  • ComfyUI (latest nightly with Klein nodes) or the `diffusers` Python package

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B, the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family, on an RTX 3090. The Klein 4B model card and the official Flux.2 GitHub README both explicitly name the RTX 3090 as a supported card. In the model card's words: "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". This recipe leads with the CPU-offload BF16 path that lands at that ~13 GB envelope, which is BFL's documented recommendation for this card, and leaves roughly 11 GB of headroom on the 3090's 24 GB budget.

Hardware data: RTX 3090 (24 GB VRAM) · ~13 GB peak VRAM with enable_model_cpu_offload() per the BFL model card · See benchmark data

ℹ️ Why the 3090 uses a different path than the 4090 sibling. The 4090 sibling recipe keeps everything resident in BF16 (~20 GB peak) because the Ada architecture and 24 GB budget make full residency the simplest path. The Ampere 3090 has the same 24 GB of VRAM, but the compute-bound DiT denoising loop runs noticeably slower per clock than on Ada, and the same full-resident path uses ~20 GB of the 24 GB budget. The CPU-offload path documented here is what BFL's own model card recommends for the RTX 3090: same model, same Apache-2.0 weights, same 4-step distilled output, with a small amount of extra CPU↔GPU transfer latency in exchange for substantial VRAM headroom.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 13 GB VRAM with enable_model_cpu_offload() (RTX 3090 / 4070 and above per the BFL card) | RTX 3090 (24 GB) |
| RAM | 32 GB recommended | |
| Storage | ~16 GB (full diffusers BF16 checkout) or ~4.1 GB (FP8 ComfyUI single-file) | |
| Software | diffusers (main) + transformers + accelerate, OR ComfyUI (latest nightly with Klein nodes) | |

The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is ~16 GB on disk per the HF tree API (7.75 GB consolidated transformer + 8.05 GB Qwen3-4B text encoder shards in text_encoder/, plus VAE and config). The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships an FP8 single-file (4.07 GB) for the ComfyUI path.
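Before downloading, it can help to confirm that the target drive has room for the checkout. The sketch below is a minimal standard-library check. The helper name `has_room` and the `"."` path are illustrative assumptions, and the size constants are simply the approximate figures quoted above.

```python
import shutil

# Approximate on-disk sizes quoted in this recipe, in GB (10**9 bytes).
BF16_CHECKOUT_GB = 16.0   # full diffusers BF16 checkout
FP8_SINGLE_FILE_GB = 4.1  # FP8 ComfyUI single-file


def has_room(path: str, needed_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `needed_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 10**9


# Hypothetical usage: check the current directory's filesystem.
if has_room(".", BF16_CHECKOUT_GB):
    print("Enough free space for the BF16 checkout")
else:
    print("Not enough free space for the BF16 checkout")
```

Point `path` at whichever drive will actually hold the weights (for the diffusers path, that is wherever your Hugging Face cache lives).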

Installation

Two supported paths — pick one. Path A is the BFL-recommended Python flow for a 3090; Path B is for users who already have ComfyUI installed.

Path A — Diffusers (Python, BFL-recommended for RTX 3090)

1. Install dependencies

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card explicitly pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.
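A quick way to tell whether the installed diffusers is new enough is to check that the pipeline class can be found. This is a minimal sketch, not an official check; the helper name `pipeline_available` is made up for illustration.

```python
import importlib


def pipeline_available(module_name: str = "diffusers",
                       class_name: str = "Flux2KleinPipeline") -> bool:
    """Return True if `module_name` imports and exposes `class_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Package is not installed at all.
        return False
    return hasattr(module, class_name)


if __name__ == "__main__":
    if pipeline_available():
        print("diffusers exposes Flux2KleinPipeline")
    else:
        print("Flux2KleinPipeline not found; reinstall diffusers from main")
```

If this reports the class as missing, rerun the pip command from step 1.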

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is kept enabled for the 3090 path, which is what holds peak VRAM at the documented ~13 GB.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)
pipe.enable_model_cpu_offload()  # keeps peak near the BFL-stated ~13 GB envelope

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the recommended values for the distilled variant per the BFL model card. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — uses 25–50 steps at CFG 5.0 instead.
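If you switch between the two variants, keeping the settings in one place avoids running the distilled model at base settings by accident. The sketch below is illustrative only: `SamplerSettings` and `settings_for` are hypothetical names, and the base default of 25 steps is just the low end of the 25–50 range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerSettings:
    repo_id: str
    num_inference_steps: int
    guidance_scale: float


# Values taken from the paragraph above; 25 is an assumed default
# picked from the quoted 25-50 step range for the base variant.
SETTINGS = {
    "distilled": SamplerSettings("black-forest-labs/FLUX.2-klein-4B", 4, 1.0),
    "base": SamplerSettings("black-forest-labs/FLUX.2-klein-base-4B", 25, 5.0),
}


def settings_for(variant: str) -> SamplerSettings:
    """Look up sampler settings for 'distilled' or 'base'."""
    try:
        return SETTINGS[variant]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None


s = settings_for("distilled")
print(s.repo_id, s.num_inference_steps, s.guidance_scale)
```

The fields map onto the `num_inference_steps` and `guidance_scale` arguments used in the snippet above.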

Path B — ComfyUI (FP8 single-file)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files (a diffusion model, the text encoder, and the VAE):

| File | Folder |
| --- | --- |
| flux-2-klein-4b-fp8.safetensors (4.07 GB, distilled), or flux-2-klein-base-4b-fp8.safetensors (base) | ComfyUI/models/diffusion_models/ |
| qwen_3_4b.safetensors | ComfyUI/models/text_encoders/ |
| flux2-vae.safetensors | ComfyUI/models/vae/ |

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json, which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
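To confirm the files landed in the right folders before launching ComfyUI, a small path check is enough. This sketch assumes the distilled FP8 filename from the table above; `missing_files` is a hypothetical helper, and you pass it the path to your own ComfyUI checkout.

```python
from pathlib import Path

# Folder (relative to the ComfyUI root) -> expected filename, per the table above.
EXPECTED_FILES = {
    "models/diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders": "qwen_3_4b.safetensors",
    "models/vae": "flux2-vae.safetensors",
}


def missing_files(comfy_root: str) -> list[str]:
    """Return the relative paths of expected files not found under `comfy_root`."""
    root = Path(comfy_root)
    return [
        (Path(folder) / name).as_posix()
        for folder, name in EXPECTED_FILES.items()
        if not (root / folder / name).is_file()
    ]


# Hypothetical usage: adjust "ComfyUI" to wherever your checkout lives.
for path in missing_files("ComfyUI"):
    print("missing:", path)
```

An empty result means all three files are where the workflow expects them.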

Running

Diffusers

python flux_klein.py

Output flux-klein.png lands in your working directory. The first run downloads the weights from the Hub (~16 GB BF16) into ~/.cache/huggingface/. The first generation also pays one-time warm-up costs such as CUDA context creation and kernel loading; steady-state generations are faster.
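Because the first generation is slower than the rest, any timing you record should skip it. Here is a minimal timing sketch under that assumption; `time_generations` is a hypothetical helper, and `generate` stands for any zero-argument callable you supply that wraps your own `pipe(...)` call.

```python
import time


def time_generations(generate, runs: int = 3, warmup: int = 1) -> list[float]:
    """Call `generate()` warmup + runs times; return seconds for the timed runs only."""
    for _ in range(warmup):
        generate()  # discarded: includes one-time start-up cost
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return timings
```

For example, `time_generations(lambda: pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1.0))` would time three steady-state generations after one warm-up pass. The pipeline hands back a finished image, so each call should cover the whole generation rather than just the kernel launches.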

ComfyUI

python main.py --listen

Open http://localhost:8188 and load one of the six official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant: 4 steps at CFG 1.0. For the base variant: 25–50 steps at CFG 5.0.

Results

  • VRAM usage: ~13 GB peak with the diffusers enable_model_cpu_offload() path, per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above"). The official Flux.2 GitHub README quotes a lower figure ("Klein 4B fits in ~8GB VRAM (RTX 3090/4070 and up)"); that ~8 GB number refers to the FP8 ComfyUI single-file path, while ~13 GB refers to the offloaded BF16 diffusers path documented here. The full-resident BF16 path used by the 4090 sibling recipe is also feasible on the 3090's 24 GB tier (peak ~20 GB), but it leaves only ~4 GB of headroom for ComfyUI overhead, second pipelines, or higher resolutions, so the offload path is preferred on this card. See /check/flux-2-klein-4b/rtx-3090 for community benchmark data as it lands.
  • Speed: A first-party RTX-3090-named generation-time number for Klein 4B has not surfaced in published sources at the time of writing. BFL describes Klein 4B as "sub-second inference — Generate or edit images under a second on modern hardware" in the official Flux.2 GitHub README, but the only measured GPU-named figures published so far are on RTX 5090 (the official ComfyUI Flux.2 Klein tutorial lists distilled ~1.2s · 8.4 GB VRAM for the RTX 5090 at FP8). The 5090 number does not transfer to the RTX 3090 — different architecture generation (Blackwell sm_120 vs Ampere sm_86), different quantization (FP8 vs offloaded BF16), and the Ampere DiT denoising loop runs materially slower than Blackwell per step on this workload. Report your measured 3090 generation time via /contribute so /check/flux-2-klein-4b/rtx-3090 gets a real benchmark.
  • Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4-billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev 32B base). The 4 distilled steps make iteration fast even on Ampere.
  • License: Apache 2.0 — commercial use permitted (per the model card).

For the full benchmark data, see /check/flux-2-klein-4b/rtx-3090.
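To compare your own card against the ~13 GB figure above, you can read PyTorch's peak-allocation counter around a single generation. This is a rough sketch rather than a benchmark method: `peak_vram_gib` is a hypothetical helper, and it only counts memory handed out by PyTorch's allocator, so nvidia-smi will usually show a somewhat higher number.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 1024**3


def peak_vram_gib(run) -> float:
    """Run `run()` once and return PyTorch's peak allocated CUDA memory in GiB."""
    import torch  # imported lazily so to_gib() works on machines without PyTorch

    torch.cuda.reset_peak_memory_stats()
    run()
    torch.cuda.synchronize()  # make sure all queued GPU work has finished
    return to_gib(torch.cuda.max_memory_allocated())
```

Pass it a zero-argument callable that wraps your own generation call, for example `peak_vram_gib(lambda: pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1.0))`.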

Troubleshooting

"OOM during VAE decode after the diffusion steps finish"

This failure mode was reported by a 4090 user on flux2 Issue #11: the process was killed during VAE decode even though diffusion itself had completed. It applies equally to a 3090 paired with 16 GB of system RAM when CPU offload is enabled, because offload trades GPU VRAM for system RAM. The diffusers maintainer recommends offloading to disk. A working workaround the same reporter posted (issuecomment-3596576394) is to split the pipeline into a text-to-latent stage, then call gc.collect() and torch.cuda.empty_cache() explicitly, and only then run VAE decode as a separate step. 32 GB of system RAM is the realistic floor for the offload path; 16 GB users should expect to need either disk offload or the staged-decode workaround.
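The staged-decode idea can be sketched as a generic pattern: produce latents, free what you can, then decode. The code below is not the reporter's code and does not show Klein-specific decode calls. `make_latents` and `decode_latents` are hypothetical callables you would write yourself, and `free_memory` and `staged_generate` are illustrative names.

```python
import gc


def free_memory() -> None:
    """Collect unreachable Python objects and release PyTorch's cached CUDA blocks."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return  # no PyTorch installed; nothing GPU-side to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def staged_generate(make_latents, decode_latents):
    """Run generation in two stages with an explicit cleanup in between.

    Both arguments are caller-supplied (hypothetical) callables:
    make_latents() returns latents, decode_latents(latents) returns the image.
    """
    latents = make_latents()
    free_memory()  # cleanup happens before the memory-hungry decode stage
    return decode_latents(latents)
```

The cleanup only helps if the first stage has dropped its references to whatever it no longer needs; objects that are still referenced will not be freed.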

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses (Issue #11 is a Flux.2-dev report — its workaround code does not transfer to Klein). Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
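If you are unsure which encoder a local diffusers checkout expects, you can read it straight from model_index.json. This is a minimal sketch; `text_encoder_class` is a hypothetical helper, and the path to the file depends on where your copy of the repository was downloaded.

```python
import json
from pathlib import Path


def text_encoder_class(model_index_path: str) -> str:
    """Return the class name declared for text_encoder in a model_index.json."""
    index = json.loads(Path(model_index_path).read_text(encoding="utf-8"))
    # The entry is a [library, class_name] pair, e.g. ["transformers", "..."].
    _library, class_name = index["text_encoder"]
    return class_name
```

For Klein, the expected result is `Qwen3ForCausalLM`, matching the declaration quoted above. Any other value suggests you are looking at a different model's checkout.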

"Can I just keep everything resident on the 24 GB 3090?"

Technically yes — comment out pipe.enable_model_cpu_offload() and the 4090 sibling's full-resident BF16 path (~20 GB peak) will run on the 3090's same 24 GB budget. The tradeoff: only ~4 GB headroom for activations, ComfyUI overhead, second pipelines, or any resolution above 1024×1024. Compute throughput on Ampere is also lower than Ada per-clock on this workload, so the wall-clock win from skipping offload is smaller on the 3090 than on the 4090. For most users on this card the offload path is the right default; reach for full-resident only if you have measured your system has the headroom and need the throughput.
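The headroom numbers follow directly from the peak figures quoted in this recipe. The snippet below only spells out that arithmetic; the names are illustrative and the peaks are the documented estimates, not measurements.

```python
TOTAL_VRAM_GB = 24.0  # RTX 3090

# Approximate peak VRAM per path, as quoted in this recipe.
PEAK_GB = {
    "cpu_offload": 13.0,
    "full_resident": 20.0,
}


def headroom_gb(path: str, total_gb: float = TOTAL_VRAM_GB) -> float:
    """Return the VRAM left over after the given path's approximate peak."""
    return total_gb - PEAK_GB[path]


for name in PEAK_GB:
    print(f"{name}: ~{headroom_gb(name):.0f} GB headroom")
```

Swap in your own measured peaks to see how much room is really left on your system.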

"How do I generate images larger than 1024×1024?"

Pixel count grows with the square of the side length, and activation memory grows at least as fast: 2048×2048 has 4× the pixels of 1024×1024, so a 2048×2048 BF16 generation can push past the ~13 GB envelope even with offload. Stay at 1024×1024 for the standard path, or split the work into a low-res Klein generation followed by an upscaler (Real-ESRGAN, SwinIR, etc.). Report success / failure cases via the submission form.
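To see how quickly the pixel count grows relative to the 1024×1024 baseline, the ratio is just the square of the side-length ratio. The helper below is a sketch of that arithmetic only; it does not predict actual VRAM use, and `pixel_scale` is an illustrative name.

```python
def pixel_scale(side: int, base_side: int = 1024) -> float:
    """Return how many times more pixels a side x side image has than the baseline."""
    return (side / base_side) ** 2


for side in (1024, 1536, 2048):
    print(f"{side}x{side}: {pixel_scale(side):.2f}x the pixels of 1024x1024")
```

Treat the ratio as a lower bound on the growth in activation memory when planning a larger render.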

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all six variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.