
Flux.2 Klein 4B on RTX 3090 Ti: BFL-Recommended ~13 GB CPU-Offload Path for 4-Step Text-to-Image

image · beginner · 13GB+ VRAM · May 28, 2026
prerequisites
  • NVIDIA RTX 3090 Ti (24 GB VRAM) — Ampere sm_86
  • Python 3.10+ (Python 3.12 for the official BFL repo)
  • ComfyUI (latest nightly with Klein nodes) or the `diffusers` Python package

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B (the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family) on an RTX 3090 Ti. The Klein 4B model card names the RTX 3090 and above as supported: "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". The 3090 Ti falls under "and above": it shares the Ampere sm_86 architecture and 24 GB VRAM tier of the plain 3090, with slightly more memory bandwidth (1008 GB/s vs 936 GB/s) and a small clock-speed bump. This recipe leads with the BF16 CPU-offload path, which lands at the BFL-stated ~13 GB envelope and leaves roughly 11 GB of headroom on the 3090 Ti's 24 GB budget.

Hardware data: RTX 3090 Ti (24 GB VRAM) · ~13 GB peak VRAM with enable_model_cpu_offload() per the BFL model card · See benchmark data

ℹ️ Why the 3090 Ti uses a different path than the 4090 sibling. The 4090 sibling recipe keeps everything resident in BF16 (~20 GB peak) because Ada's higher compute throughput makes the full-resident path the simplest choice there. The Ampere 3090 Ti has the same 24 GB of VRAM, but the compute-bound DiT denoising loop runs noticeably slower than on the 4090, and the full-resident path consumes most of the 24 GB budget. The CPU-offload path documented here is what BFL's own model card recommends for the RTX 3090 family: same model, same Apache-2.0 weights, same 4-step distilled output, with a small extra latency for CPU↔GPU transfers in exchange for substantial VRAM headroom.
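The headroom comparison between the two BF16 paths is simple subtraction, and a minimal sketch makes it easy to redo for a different card. The peak figures are the approximate ones quoted in this recipe, not measurements; the dictionary keys and the function name are illustrative only.

```python
# Approximate peak-VRAM figures quoted in this recipe (GB); not measured here.
PEAK_VRAM_GB = {
    "bf16_cpu_offload": 13.0,    # figure from the BFL model card
    "bf16_full_resident": 20.0,  # figure quoted for the 4090 sibling recipe
}


def headroom_gb(total_vram_gb: float, peak_gb: float) -> float:
    """VRAM left over once the pipeline reaches its peak allocation."""
    return total_vram_gb - peak_gb


if __name__ == "__main__":
    for path, peak in PEAK_VRAM_GB.items():
        free = headroom_gb(24.0, peak)
        print(f"{path}: ~{free:.0f} GB free on a 24 GB card")
```

Swapping 24.0 for another card's VRAM size shows quickly whether the full-resident path is even an option there.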

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 13 GB VRAM with enable_model_cpu_offload() (RTX 3090 / 4070 and above per the BFL card) | RTX 3090 Ti (24 GB) |
| RAM | 32 GB recommended | |
| Storage | ~8 GB (BF16 transformer + VAE) plus the Qwen3-4B text encoder, OR ~4.1 GB (FP8 ComfyUI single-file) | |
| Software | diffusers (main) + transformers + accelerate, OR ComfyUI (latest nightly with Klein nodes) | |

The Klein 4B BF16 transformer at black-forest-labs/FLUX.2-klein-4B is 7.75 GB on disk per the HF tree API; per Klein's model_index.json the pipeline also pulls a Qwen3-4B text encoder (Qwen3ForCausalLM + Qwen2TokenizerFast) and the Flux.2-family VAE (AutoencoderKLFlux2). The dedicated single-file black-forest-labs/FLUX.2-klein-4b-fp8 repo ships a 4.07 GB FP8 file for the ComfyUI path.
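As a rough cross-check on those file sizes, checkpoint size is approximately parameter count times bytes per parameter. The sketch below takes "4B" at face value as 4e9 parameters, which is an assumption (the exact count is not stated here), and uses decimal gigabytes.

```python
def weights_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: the "4B" label is treated as exactly 4e9 parameters.
NOMINAL_PARAMS = 4e9

print(weights_size_gb(NOMINAL_PARAMS, 2))  # BF16 stores 2 bytes per parameter
print(weights_size_gb(NOMINAL_PARAMS, 1))  # FP8 stores 1 byte per parameter
```

The nominal estimates (8 GB and 4 GB) sit close to the 7.75 GB and 4.07 GB files listed above, which is consistent with the BF16 file holding roughly twice the bytes of the FP8 one.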

Installation

Two supported paths — pick one. Path A is the BFL-recommended Python flow for a 3090-class card; Path B is for users who already have ComfyUI installed.

Path A — Diffusers (Python, BFL-recommended for RTX 3090 Ti)

1. Install dependencies

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card explicitly pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.
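Before running the example, it can help to confirm that the installed diffusers actually exposes the pipeline class. This is a small sketch using only the standard library; the helper name has_klein_pipeline is made up for illustration.

```python
import importlib
import importlib.util


def has_klein_pipeline() -> bool:
    """Return True if the installed diffusers exposes Flux2KleinPipeline."""
    if importlib.util.find_spec("diffusers") is None:
        return False  # diffusers is not installed at all
    try:
        diffusers = importlib.import_module("diffusers")
        return hasattr(diffusers, "Flux2KleinPipeline")
    except Exception:
        # A broken install (for example a missing dependency) counts as "no".
        return False


print("Flux2KleinPipeline available:", has_klein_pipeline())
```

If this prints False after installing, the environment is probably still picking up an older released wheel rather than the git build.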

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is kept enabled for the 3090 Ti path, which is what holds peak VRAM at the documented ~13 GB.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)
pipe.enable_model_cpu_offload()  # keeps peak near the BFL-stated ~13 GB envelope

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the recommended values for the distilled variant per the BFL model card. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — uses 25–50 steps at CFG 5.0 instead.
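To keep the two variants' settings from getting mixed up, they can be held in one lookup table. This is only a sketch: the step count of 50 for the base variant is an arbitrary pick from the 25 to 50 range above, and reusing the same keyword names for the base checkpoint is an assumption.

```python
# Settings quoted in this recipe, keyed by Hugging Face repo id.
KLEIN_4B_SETTINGS = {
    "black-forest-labs/FLUX.2-klein-4B": {
        "num_inference_steps": 4,
        "guidance_scale": 1.0,
    },
    "black-forest-labs/FLUX.2-klein-base-4B": {
        "num_inference_steps": 50,  # assumption: upper end of the 25-50 range
        "guidance_scale": 5.0,
    },
}


def call_kwargs(repo_id: str) -> dict:
    """Return a copy of the sampling kwargs for the given checkpoint."""
    return dict(KLEIN_4B_SETTINGS[repo_id])


print(call_kwargs("black-forest-labs/FLUX.2-klein-4B"))
```

The returned dictionary can be splatted into the pipeline call, for example pipe(prompt=prompt, height=1024, width=1024, **call_kwargs(repo_id)).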

Path B — ComfyUI (FP8 single-file)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

| File | Folder |
|---|---|
| flux-2-klein-4b-fp8.safetensors (4.07 GB, distilled), or flux-2-klein-base-4b-fp8.safetensors (base) | ComfyUI/models/diffusion_models/ |
| qwen_3_4b.safetensors | ComfyUI/models/text_encoders/ |
| flux2-vae.safetensors | ComfyUI/models/vae/ |

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json, which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
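A quick way to confirm the three files landed in the right folders is a short path check. The sketch below assumes the distilled FP8 file and the folder layout listed above; the function name and the example root path are illustrative.

```python
from pathlib import Path

# Folder under ComfyUI/models/ -> expected filename (distilled 4B variant).
REQUIRED_FILES = {
    "diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "text_encoders": "qwen_3_4b.safetensors",
    "vae": "flux2-vae.safetensors",
}


def missing_klein_files(comfy_root: str) -> list:
    """Return the expected model paths that do not exist under comfy_root."""
    models = Path(comfy_root) / "models"
    expected = [models / folder / name for folder, name in REQUIRED_FILES.items()]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point this at your own checkout.
    for path in missing_klein_files("ComfyUI"):
        print("missing:", path)
```

An empty result means all three files are where the workflow expects them; it does not verify that the files are complete or uncorrupted.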

Running

Diffusers

python flux_klein.py

This assumes the Path A snippet is saved as flux_klein.py. The output, flux-klein.png, lands in your working directory. The first run downloads the weights from the Hub (BF16 transformer ~7.75 GB + Qwen3-4B encoder + VAE) into ~/.cache/huggingface/. The first generation also pays one-time warm-up costs such as CUDA context creation and kernel loading; steady-state generations are faster.
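Because the first generation is slower than the rest, a fair timing should discard at least one warm-up run. Here is a minimal, library-agnostic sketch; time_generation is a made-up helper, and generate stands for any zero-argument callable that produces one image.

```python
import time
from statistics import median


def time_generation(generate, warmup: int = 1, runs: int = 3) -> float:
    """Call generate() warmup + runs times; return the median seconds of the timed runs."""
    for _ in range(warmup):
        generate()  # untimed: absorbs first-run warm-up costs
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return median(timings)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; swap in the real pipeline call.
    seconds = time_generation(lambda: sum(range(100_000)))
    print(f"median: {seconds:.4f} s")
```

With the diffusers script, the callable would wrap the pipeline call, for example lambda: pipe(prompt=prompt, height=1024, width=1024, guidance_scale=1.0, num_inference_steps=4).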

ComfyUI

python main.py --listen

Open http://localhost:8188 and load one of the official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant: 4 steps at CFG 1.0. For the base variant: 25–50 steps at CFG 5.0.

Results

  • VRAM usage: ~13 GB peak with the diffusers enable_model_cpu_offload() path, per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above"; "and above" covers the 3090 Ti, which shares the 3090's Ampere sm_86 architecture and 24 GB tier). The official Flux.2 GitHub README additionally states "Klein 4B fits in ~8GB VRAM (RTX 3090/4070 and up)". The lower 8 GB figure refers to the FP8 ComfyUI single-file path; the ~13 GB figure refers to the offloaded BF16 diffusers path documented here. The full-resident BF16 path used by the 4090 sibling recipe also fits in the 3090 Ti's 24 GB (peak ~20 GB) but leaves only ~4 GB of headroom for ComfyUI overhead, second pipelines, or higher resolutions, so the offload path is preferred on this card. See /check/flux-2-klein-4b/rtx-3090-ti for community benchmark data as it lands.
  • Speed: No first-party generation-time figure for Klein 4B on the RTX 3090 Ti has been published at the time of writing. BFL describes Klein 4B as a sub-second-inference model (the official Flux.2 GitHub README says "Generate or edit images under a second on modern hardware"), but the only measured, GPU-named figures published so far are for the RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists distilled ~1.2s (5090) · 8.4GB VRAM and base ~17s (5090) · 9.2GB VRAM at FP8. The 5090 numbers do not transfer to the RTX 3090 Ti: the architecture generation differs (Blackwell sm_120 vs Ampere sm_86), the quantization differs (FP8 with tensor-core acceleration on Blackwell vs offloaded BF16 on Ampere), and the DiT denoising loop runs materially slower per step on Ampere than on Blackwell for this workload. Report your measured 3090 Ti generation time via /contribute so /check/flux-2-klein-4b/rtx-3090-ti gets a real benchmark.
  • Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4-billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev 32B base). The 4 distilled steps make iteration fast even on Ampere.
  • License: Apache 2.0 — commercial use permitted (per the model card).

For the full benchmark data, see /check/flux-2-klein-4b/rtx-3090-ti.
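If you want a peak-VRAM number of your own to compare against the ~13 GB figure, PyTorch's allocator statistics give a reasonable first reading. The sketch below is hedged: peak_vram_gib is a made-up helper, it reports only memory tracked by PyTorch's caching allocator (tools such as nvidia-smi will show more), and it returns None when torch or CUDA is unavailable.

```python
def peak_vram_gib(generate):
    """Run generate() once; return peak PyTorch-allocated VRAM in GiB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None  # torch is not installed
    if not torch.cuda.is_available():
        return None  # no CUDA device visible
    torch.cuda.reset_peak_memory_stats()  # start counting from this point
    generate()
    torch.cuda.synchronize()  # make sure all queued GPU work has finished
    return torch.cuda.max_memory_allocated() / 1024**3


if __name__ == "__main__":
    result = peak_vram_gib(lambda: None)  # replace the lambda with the pipeline call
    print("peak GiB:", result)
```

Pass the same zero-argument callable you would use for timing, so the VRAM and speed numbers describe the same generation settings.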

Troubleshooting

"OOM during VAE decode after the diffusion steps finish"

This is a system-RAM OOM, not a VRAM OOM: CPU offload trades GPU VRAM for system RAM, so a 3090 Ti paired with 16 GB of system RAM can hit the same failure a 4090 user reported on flux2 Issue #11. That issue was filed against FLUX.2-dev (the 32B variant with the Mistral3 encoder), not Klein, but the failure occurs at the VAE-decode stage, which the whole Flux.2 family shares under enable_model_cpu_offload(), so it transfers. The diffusers maintainer recommends offloading to disk. A workaround the same reporter posted (issuecomment-3596576394) is to split the pipeline into a text-to-latent stage, call gc.collect() and torch.cuda.empty_cache(), and then run the VAE decode separately. Treat 32 GB of system RAM as the realistic floor for the offload path; with 16 GB, expect to need either disk offload or the staged-decode workaround. The Mistral3-specific workaround code earlier in the same thread (the diffusers/FLUX.2-dev-bnb-4bit 4-bit text encoder snippet) does NOT transfer to Klein, which uses a Qwen3-4B encoder, not Mistral3.
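The ordering of the staged-decode workaround can be sketched without committing to any particular pipeline API. In the sketch below, make_latents and decode_latents are hypothetical callables you would supply yourself; this is not the reporter's code, only an illustration of where the cleanup calls sit between the two stages.

```python
import gc


def run_staged(make_latents, decode_latents):
    """Run the latent stage, free memory, then run the decode stage."""
    latents = make_latents()  # stage 1: text-to-latent (hypothetical callable)

    # Release Python garbage and cached CUDA blocks before decoding.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing further to free

    return decode_latents(latents)  # stage 2: VAE decode (hypothetical callable)


if __name__ == "__main__":
    # Dummy stages so the sketch runs anywhere.
    print(run_staged(lambda: [1, 2, 3], lambda values: sum(values)))
```

How the two stages are actually split for Klein depends on the pipeline internals, so treat the callables as placeholders until you have checked the linked comment.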

"FP8 saves VRAM but doesn't run faster on the 3090 Ti — is that expected?"

Yes, and it's a property of the Ampere architecture, not a Klein-specific issue. The FP8 single-file at black-forest-labs/FLUX.2-klein-4b-fp8 loads on the 3090 Ti (the file format is universal), but Ampere sm_86 has no FP8 tensor cores: hardware-accelerated FP8 compute first shipped on Hopper sm_90 and consumer Ada sm_89. At inference time the runtime dequantizes the FP8 weights to BF16 (or FP16) on the fly. You get the storage and VRAM savings (the ~8 GB FP8 envelope vs the ~13 GB BF16-with-offload envelope) but not the speed boost an RTX 4090 / 5090 user gets from the same FP8 file. For 3090 / 3090 Ti users, choosing between Path A (BF16 + offload) and Path B (FP8 ComfyUI) is therefore a VRAM-versus-system-RAM tradeoff, not a speed tradeoff: pick FP8 to free up roughly 5 GB more VRAM for other workloads; pick BF16 + offload if VRAM is not tight and you prefer the simpler diffusers Python flow.
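The architecture rule above reduces to a compute-capability comparison. This is a minimal sketch of that check, assuming the sm_89 threshold stated in the text; has_fp8_tensor_cores is an illustrative name, not a library function.

```python
def has_fp8_tensor_cores(major: int, minor: int) -> bool:
    """True for compute capability 8.9 (Ada) and newer, per the rule described above."""
    return (major, minor) >= (8, 9)


print("sm_86 (RTX 3090 Ti):", has_fp8_tensor_cores(8, 6))
print("sm_89 (RTX 4090):", has_fp8_tensor_cores(8, 9))
print("sm_120 (RTX 5090):", has_fp8_tensor_cores(12, 0))
```

On a machine with PyTorch and a GPU, the (major, minor) pair for the active device comes from torch.cuda.get_device_capability().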

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses (Issue #11 is a Flux.2-dev report — its encoder workaround code does not transfer to Klein). Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
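To confirm which encoder a downloaded checkpoint declares, you can read the text_encoder entry from its model_index.json. The sketch below parses an inline sample containing only the entry quoted above; a real file has more keys, and the helper name is illustrative.

```python
import json


def text_encoder_class(model_index_text: str) -> str:
    """Return the class name declared for the text encoder in a model_index.json string."""
    index = json.loads(model_index_text)
    _library, class_name = index["text_encoder"]  # stored as [library, class]
    return class_name


# Inline sample with just the entry quoted in this recipe.
sample = '{"text_encoder": ["transformers", "Qwen3ForCausalLM"]}'
print(text_encoder_class(sample))  # prints Qwen3ForCausalLM
```

For a local snapshot, pass the contents of the model_index.json file instead of the sample string; anything other than Qwen3ForCausalLM suggests a checkpoint from a different model family.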

"Can I just keep everything resident on the 24 GB 3090 Ti?"

Technically yes. Replace pipe.enable_model_cpu_offload() with pipe.to("cuda") (commenting out the offload call alone leaves the pipeline on the CPU), and the 4090 sibling's full-resident BF16 path (~20 GB peak) will run within the 3090 Ti's 24 GB budget. The tradeoff: only ~4 GB of headroom for activations, ComfyUI overhead, second pipelines, or any resolution above 1024×1024. Because the 3090 Ti's compute throughput is lower than the 4090's on this workload, transfer overhead is a smaller share of each generation, so the relative wall-clock win from skipping offload is smaller on the 3090 Ti than on the 4090. For most users on this card the offload path is the right default; reach for full-resident only if you have measured that your system has the headroom and you need the throughput.
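Switching between the two placements comes down to which of two pipeline methods is called after loading. A small sketch, assuming a diffusers-style pipeline object that provides both to() and enable_model_cpu_offload(); place_pipeline is a made-up helper.

```python
def place_pipeline(pipe, full_resident: bool, device: str = "cuda"):
    """Keep the whole pipeline on the GPU, or enable model-level CPU offload."""
    if full_resident:
        pipe.to(device)  # everything resident in VRAM
    else:
        pipe.enable_model_cpu_offload()  # components moved to the GPU only when needed
    return pipe
```

Call it once, right after from_pretrained, for example place_pipeline(pipe, full_resident=False) for this recipe's default; the two calls are alternatives and should not be combined on the same pipeline.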

"How do I generate images larger than 1024×1024?"

Activation memory scales quadratically with side length — 2048×2048 BF16 even with offload can push past the 13 GB envelope. Drop back to 1024×1024 for the standard path, or split the work into a low-res Klein generation followed by an upscaler (Real-ESRGAN, SwinIR, etc.). Report success / failure cases via the submission form.
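The quadratic scaling is easy to put numbers on. In the sketch below, the 16x total downsampling factor used for the token count is an assumption for illustration (it is not taken from the Klein documentation); the pixel ratio itself needs no assumptions.

```python
def pixel_scale(side: int, base_side: int = 1024) -> float:
    """How many times more pixels a square image has than the base resolution."""
    return (side / base_side) ** 2


def latent_tokens(side: int, downsample: int = 16) -> int:
    """Token count for a square image. Assumption: 16x total spatial downsampling."""
    return (side // downsample) ** 2


print(pixel_scale(2048))    # 4.0: doubling the side quadruples the pixel count
print(latent_tokens(1024))  # 4096 under the assumed 16x factor
print(latent_tokens(2048))  # 16384 under the assumed 16x factor
```

So a 2048×2048 request carries about four times the activation load of the 1024×1024 default, which is why it can exceed an envelope that 1024×1024 fits comfortably.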

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for each variant (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.