What You'll Build
Generate 1024×1024 images locally on an RTX 3090 with Black Forest Labs' Flux.2 Klein 4B, the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family. The Klein 4B model card and the official Flux.2 GitHub README both name the RTX 3090 as a supported card; in the model card's words: "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". The recipe leads with the CPU-offload BF16 path that lands at that ~13 GB envelope. This is BFL's documented recommendation for this card, and it leaves roughly 11 GB of headroom on the 3090's 24 GB budget.
Hardware data: RTX 3090 (24 GB VRAM) · ~13 GB peak VRAM with enable_model_cpu_offload() per the BFL model card · See benchmark data
ℹ️ Why the 3090 uses a different path than the 4090 sibling. The 4090 sibling recipe keeps everything resident in BF16 (~20 GB peak) because the Ada architecture and 24 GB budget make full residency the simplest path. The Ampere 3090 has the same 24 GB of VRAM, but its compute-bound DiT denoising loop runs noticeably slower than on Ada per clock, and the same full-resident path consumes almost all of the 24 GB budget. The CPU-offload path documented here is what BFL's own model card recommends for the RTX 3090: same model, same Apache-2.0 weights, same 4-step distilled output, with a small extra latency from CPU↔GPU transfers in exchange for substantial VRAM headroom.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 13 GB VRAM with enable_model_cpu_offload() (RTX 3090 / 4070 and above per the BFL card) | RTX 3090 (24 GB) |
| RAM | 32 GB recommended | — |
| Storage | ~16 GB (full diffusers BF16 checkout) or ~4.1 GB (FP8 ComfyUI single-file) | — |
| Software | diffusers (main) + transformers + accelerate, OR ComfyUI (latest nightly with Klein nodes) | — |
The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is ~16 GB on disk per the HF tree API (7.75 GB consolidated transformer + 8.05 GB Qwen3-4B text encoder shards in text_encoder/, plus VAE and config). The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships an FP8 single-file (4.07 GB) for the ComfyUI path.
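Before pulling either checkpoint, it can be worth confirming that the target drive has room. The sketch below is a minimal standard-library check. The 16 GB and 4.07 GB figures are the sizes quoted above; the helper name `has_room`, the 10% safety margin, and the choice of the home directory as the target are assumptions made for illustration.

```python
import shutil
from pathlib import Path

# Sizes quoted in this recipe, in decimal GB.
BF16_CHECKOUT_GB = 16.0
FP8_SINGLE_FILE_GB = 4.07


def has_room(path: str, needed_gb: float, margin: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` has needed_gb plus a margin free.

    The 10% default margin is an assumption, not a documented requirement.
    """
    free_bytes = shutil.disk_usage(path).free
    required_bytes = needed_gb * (1.0 + margin) * 1e9
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # Hypothetical target: the drive that holds the home directory,
    # where ~/.cache/huggingface/ lives by default.
    target = str(Path.home())
    print("BF16 checkout fits:", has_room(target, BF16_CHECKOUT_GB))
    print("FP8 single file fits:", has_room(target, FP8_SINGLE_FILE_GB))
```

If your ComfyUI models folder sits on a different drive, pass that folder's path instead.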
Installation
Two supported paths — pick one. Path A is the BFL-recommended Python flow for a 3090; Path B is for users who already have ComfyUI installed.
Path A — Diffusers (Python, BFL-recommended for RTX 3090)
1. Install dependencies
pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate
The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card explicitly pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.
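A quick way to confirm the install picked up a new enough diffusers is to check that the package exposes the pipeline class before running the full example. This is a minimal sketch; the helper name `has_attribute` is hypothetical, and the only library name it relies on is `Flux2KleinPipeline` as given above.

```python
import importlib


def has_attribute(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports and exposes `attr`.

    Only ImportError and AttributeError are treated as "not available";
    any other error raised while importing is left to surface.
    """
    try:
        module = importlib.import_module(module_name)
        getattr(module, attr)
    except (ImportError, AttributeError):
        return False
    return True


if __name__ == "__main__":
    if has_attribute("diffusers", "Flux2KleinPipeline"):
        print("diffusers exposes Flux2KleinPipeline")
    else:
        print("Flux2KleinPipeline not found; reinstall diffusers from the git main branch")
```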
2. Run the official example
This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is kept enabled for the 3090 path, which is what holds peak VRAM at the documented ~13 GB.
import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)
pipe.enable_model_cpu_offload()  # keeps peak near the BFL-stated ~13 GB envelope

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux-klein.png")
guidance_scale=1.0 and num_inference_steps=4 are the recommended values for the distilled variant per the BFL model card. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — uses 25–50 steps at CFG 5.0 instead.
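If you switch between the two variants, keeping the sampler settings in one place avoids running the distilled model with base settings by accident. The sketch below only encodes the values stated above. The names `SamplerPreset`, `PRESETS`, and `call_kwargs` are hypothetical, and the base step count of 25 is simply the low end of the 25 to 50 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerPreset:
    repo_id: str
    num_inference_steps: int
    guidance_scale: float


# Values taken from this recipe; 25 is the low end of the base 25-50 step range.
PRESETS = {
    "distilled": SamplerPreset("black-forest-labs/FLUX.2-klein-4B", 4, 1.0),
    "base": SamplerPreset("black-forest-labs/FLUX.2-klein-base-4B", 25, 5.0),
}


def call_kwargs(variant: str) -> dict:
    """Return the sampler keyword arguments for the named variant."""
    preset = PRESETS[variant]
    return {
        "num_inference_steps": preset.num_inference_steps,
        "guidance_scale": preset.guidance_scale,
    }
```

With the pipeline from the example loaded, the call would then look like `pipe(prompt=prompt, height=1024, width=1024, **call_kwargs("distilled"))`.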
Path B — ComfyUI (FP8 single-file)
1. Update ComfyUI to a build with Klein support
Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:
git pull
pip install -r requirements.txt
2. Download the required files
Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:
| File | Folder |
|---|---|
| flux-2-klein-4b-fp8.safetensors (4.07 GB, distilled), or flux-2-klein-base-4b-fp8.safetensors (base) | ComfyUI/models/diffusion_models/ |
| qwen_3_4b.safetensors | ComfyUI/models/text_encoders/ |
| flux2-vae.safetensors | ComfyUI/models/vae/ |
The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json, which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
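A misplaced file is an easy mistake here, so a small check of the three locations from the table above can save a failed workflow load. This is a sketch for the distilled variant only. The helper name `missing_files` is hypothetical, and it assumes the default ComfyUI folder layout shown in the table.

```python
from pathlib import Path

# Locations from the table above, relative to the ComfyUI checkout (distilled variant).
REQUIRED_FILES = [
    Path("models/diffusion_models/flux-2-klein-4b-fp8.safetensors"),
    Path("models/text_encoders/qwen_3_4b.safetensors"),
    Path("models/vae/flux2-vae.safetensors"),
]


def missing_files(comfy_root: str) -> list[str]:
    """Return the required files that are not present under `comfy_root`."""
    root = Path(comfy_root)
    return [rel.as_posix() for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical location: run from inside the ComfyUI checkout.
    for name in missing_files("."):
        print("missing:", name)
```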
Running
Diffusers
python flux_klein.py
This assumes you saved the Path A snippet as flux_klein.py. Output flux-klein.png lands in your working directory. The first run downloads the weights from the Hub (~16 GB BF16) into ~/.cache/huggingface/. The first generation also pays one-time warm-up costs such as CUDA context creation and kernel loading, so steady-state generations are faster.
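To see the gap between the first and later generations, time a few calls after a warm-up run. The helper below is a generic, standard-library sketch. The name `time_steady_state` and the defaults of one warm-up and three timed runs are assumptions, not a benchmark protocol from BFL.

```python
import statistics
import time
from typing import Callable


def time_steady_state(fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> float:
    """Call `fn` `warmup` times untimed, then return the median of `runs` timed calls (seconds)."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)
```

With the pipeline from Path A loaded, a call such as `time_steady_state(lambda: pipe(prompt=prompt, height=1024, width=1024, guidance_scale=1.0, num_inference_steps=4).images[0])` should give a wall-clock figure per image. Because the pipeline returns decoded images, each call finishes its GPU work before the timer stops.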
ComfyUI
python main.py --listen
Open http://localhost:8188 and load one of the six official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant: 4 steps at CFG 1.0. For the base variant: 25–50 steps at CFG 5.0.
Results
- VRAM usage: ~13 GB peak with the diffusers enable_model_cpu_offload() path, per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above"). The official Flux.2 GitHub README quotes a lower figure ("Klein 4B fits in ~8GB VRAM (RTX 3090/4070 and up)"); that 8 GB figure refers to the FP8 ComfyUI single-file path, while the ~13 GB figure refers to the offloaded BF16 diffusers path documented here. The full-resident BF16 path used by the 4090 sibling recipe is also feasible on the 3090's 24 GB tier (peak ~20 GB) but leaves only ~4 GB headroom for ComfyUI overhead, second pipelines, or higher resolutions, so the offload path is preferred on this card. See /check/flux-2-klein-4b/rtx-3090 for community benchmark data as it lands.
- Speed: A first-party generation-time number for Klein 4B that names the RTX 3090 has not surfaced in published sources at the time of writing. BFL describes Klein 4B as "sub-second inference — Generate or edit images under a second on modern hardware" in the official Flux.2 GitHub README, but the only measured GPU-named figures published so far are for the RTX 5090 (the official ComfyUI Flux.2 Klein tutorial lists "distilled ~1.2s · 8.4 GB VRAM" for the RTX 5090 at FP8). The 5090 number does not transfer to the RTX 3090: the architecture generation differs (Blackwell sm_120 vs Ampere sm_86), the quantization differs (FP8 vs offloaded BF16), and the Ampere DiT denoising loop runs materially slower than Blackwell per step on this workload. Report your measured 3090 generation time via /contribute so /check/flux-2-klein-4b/rtx-3090 gets a real benchmark.
- Quality notes: Klein is the small, distilled member of the Flux.2 family. Expect strong prompt adherence at 4 billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev 32B base). The 4 distilled steps make iteration fast even on Ampere.
- License: Apache 2.0 — commercial use permitted (per the model card).
For the full benchmark data, see /check/flux-2-klein-4b/rtx-3090.
Troubleshooting
"OOM during VAE decode after the diffusion steps finish"
The failure mode a 4090 user reported on flux2 Issue #11 (the process was killed during VAE decode after diffusion itself had completed) applies equally to a 3090 paired with 16 GB of system RAM when CPU offload is enabled, because offload trades GPU VRAM for system RAM. The diffusers maintainer recommends offloading to disk. A working workaround the same reporter posted (issuecomment-3596576394) is to decouple the pipeline into a text-to-latent stage, then call gc.collect() and torch.cuda.empty_cache() explicitly before running VAE decode separately. That issue concerns Flux.2 dev, so treat its code as a pattern, not a drop-in fix for Klein. 32 GB of system RAM is the realistic floor for the offload path; 16 GB users should expect to need either disk offload or the staged decode workaround.
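Since the offload path leans on system RAM, it helps to know what the machine reports before the first run. The sketch below reads total physical memory through POSIX `os.sysconf`, which works on Linux but is not available on Windows. The helper names and the 0.9 tolerance (a nominal 32 GB machine usually reports slightly less) are assumptions for illustration.

```python
import os

RECOMMENDED_RAM_GIB = 32.0  # floor suggested in this recipe for the offload path
TOLERANCE = 0.9  # assumption: allow for RAM reserved by firmware and the kernel


def total_ram_gib() -> float:
    """Total physical RAM in GiB via POSIX sysconf (Linux; not available on Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


def meets_offload_floor(ram_gib: float) -> bool:
    """Return True if `ram_gib` is at or above the recommended floor, within tolerance."""
    return ram_gib >= RECOMMENDED_RAM_GIB * TOLERANCE


if __name__ == "__main__":
    ram = total_ram_gib()
    if meets_offload_floor(ram):
        print(f"{ram:.1f} GiB RAM: the offload path should have room")
    else:
        print(f"{ram:.1f} GiB RAM: expect to need disk offload or staged decode")
```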
"Distorted colors / washed-out output"
You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.
"Text encoder shape mismatch / config error"
Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses (Issue #11 is a Flux.2-dev report — its workaround code does not transfer to Klein). Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
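For the diffusers checkout, you can confirm which text encoder a downloaded snapshot declares by reading its model_index.json directly. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the expected value is the one quoted above.

```python
import json

# Entry quoted in this recipe from Klein's model_index.json.
EXPECTED_TEXT_ENCODER = ["transformers", "Qwen3ForCausalLM"]


def text_encoder_entry(model_index_path: str) -> list:
    """Return the text_encoder entry from a model_index.json, or [] if absent."""
    with open(model_index_path, encoding="utf-8") as handle:
        index = json.load(handle)
    return index.get("text_encoder", [])


def is_klein_text_encoder(model_index_path: str) -> bool:
    """Return True if the snapshot declares the Qwen3 text encoder Klein expects."""
    return text_encoder_entry(model_index_path) == EXPECTED_TEXT_ENCODER
```

Point it at the model_index.json inside the downloaded snapshot, which typically sits in a per-model folder under ~/.cache/huggingface/. A result of False suggests the folder holds a different model's files.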
"Can I just keep everything resident on the 24 GB 3090?"
Technically yes — comment out pipe.enable_model_cpu_offload() and the 4090 sibling's full-resident BF16 path (~20 GB peak) will run on the 3090's same 24 GB budget. The tradeoff: only ~4 GB headroom for activations, ComfyUI overhead, second pipelines, or any resolution above 1024×1024. Compute throughput on Ampere is also lower than Ada per-clock on this workload, so the wall-clock win from skipping offload is smaller on the 3090 than on the 4090. For most users on this card the offload path is the right default; reach for full-resident only if you have measured your system has the headroom and need the throughput.
"How do I generate images larger than 1024×1024?"
Activation memory scales quadratically with side length — 2048×2048 BF16 even with offload can push past the 13 GB envelope. Drop back to 1024×1024 for the standard path, or split the work into a low-res Klein generation followed by an upscaler (Real-ESRGAN, SwinIR, etc.). Report success / failure cases via the submission form.
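The quadratic scaling is easy to put numbers on. The sketch below computes only the pixel-count ratio against the 1024×1024 baseline. Treat it as a rough proxy: actual activation memory also depends on the attention implementation and on the offload settings, which this arithmetic does not model. The function name `pixel_ratio` is hypothetical.

```python
def pixel_ratio(side: int, reference_side: int = 1024) -> float:
    """How many times more pixels a square image of `side` has than the reference."""
    return (side / reference_side) ** 2


for side in (1024, 1536, 2048):
    print(f"{side}x{side}: {pixel_ratio(side):.2f}x the pixels of 1024x1024")
# 1024x1024: 1.00x the pixels of 1024x1024
# 1536x1536: 2.25x the pixels of 1024x1024
# 2048x2048: 4.00x the pixels of 1024x1024
```

Doubling the side length therefore quadruples the pixel count, which is why 2048×2048 is a much larger jump than it looks.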
"Where do I find the official workflow JSONs?"
The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all six variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.