Flux.2 Klein 4B on RTX 4070 Ti SUPER: BFL-Recommended ~13 GB CPU-Offload Path for 4-Step Text-to-Image

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B — the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family — on an RTX 4070 Ti SUPER. The Klein 4B model card explicitly buckets the supported cards by VRAM tier: "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". The RTX 4070 Ti SUPER's 16 GB sits cleanly inside that "and above" range — it is a higher SKU than the RTX 4070 named in the card, with more VRAM (16 GB vs 12 GB) — so the recipe leads with the CPU-offload BF16 path that lands at that ~13 GB envelope, leaving roughly 3 GB of headroom on the card's 16 GB budget.

Hardware data: RTX 4070 Ti SUPER (16 GB VRAM) · ~13 GB peak VRAM with enable_model_cpu_offload() per the BFL model card · Benchmark data: /check/flux-2-klein-4b/rtx-4070-ti-super

ℹ️ Why this card uses the offload path, not full-residency. The 4090 sibling recipe keeps the full BF16 transformer, the Qwen3-4B text encoder, and the VAE all resident on the GPU (~20 GB peak) because its 24 GB budget allows it. The RTX 4070 Ti SUPER has the same Ada sm_89 architecture but only 16 GB, so the full-resident BF16 path's ~20 GB peak does not fit. The path documented here is BFL's own model-card recommendation for cards in this tier: the same Apache-2.0 weights and the same 4-step distilled output, run through enable_model_cpu_offload() so peak VRAM stays near ~13 GB. The costs are some extra latency from CPU↔GPU weight transfers, which the RTX 4070 Ti SUPER's PCIe Gen4 x16 link keeps modest, and more pressure on system RAM (see the VAE-decode entry under Troubleshooting).
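The tier logic in the note above can be sketched as a small selector. This is only an illustration, not BFL tooling: `pick_path`, `headroom_gb`, and the dictionary are hypothetical names, and the ~20 GB and ~13 GB peaks are the figures quoted in this recipe rather than values measured here.

```python
# Peak-VRAM figures quoted in this recipe (GB); assumptions, not measurements.
# Ordered from most to least demanding, so the first fit is the fastest path.
PEAK_VRAM_GB = {
    "full_resident_bf16": 20.0,  # 4090 sibling recipe
    "cpu_offload_bf16": 13.0,    # BFL model card, enable_model_cpu_offload()
}


def pick_path(vram_gb: float) -> str:
    """Return the first path whose quoted peak fits within vram_gb."""
    for name, peak in PEAK_VRAM_GB.items():
        if peak <= vram_gb:
            return name
    raise ValueError(f"{vram_gb} GB is below every quoted peak")


def headroom_gb(vram_gb: float, path: str) -> float:
    """VRAM left over after the quoted peak of the chosen path."""
    return vram_gb - PEAK_VRAM_GB[path]


path = pick_path(16.0)
print(path, headroom_gb(16.0, path))
```

With a 16 GB budget the selector lands on the offload path, and with 24 GB it picks the full-resident one, which mirrors the split between this recipe and the 4090 sibling.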

Requirements

Component	Minimum	Tested
GPU	~13 GB VRAM with `enable_model_cpu_offload()` (RTX 3090 / 4070 and above per the BFL card)	RTX 4070 Ti SUPER (16 GB)
RAM	32 GB recommended	—
Storage	~16 GB (full diffusers BF16 checkout) or ~4.1 GB (FP8 ComfyUI single-file)	—
Software	`diffusers` (main) + `transformers` + `accelerate`, OR ComfyUI (latest nightly with Klein nodes)	—

The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is ~16 GB on disk per the HF tree API (7.75 GB consolidated transformer + 8.05 GB Qwen3-4B text encoder shards in text_encoder/ + 168 MB VAE, plus config). The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships an FP8 single-file (flux-2-klein-4b-fp8.safetensors, 4.07 GB per the HF tree API) for the ComfyUI path. FP8 runs natively on the RTX 4070 Ti SUPER's 4th-gen Ada tensor cores (E4M3/E5M2) — no special build is required.
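The ~16 GB figure is the sum of the component sizes quoted above. A minimal sketch of that arithmetic, with a free-space check before downloading, might look like this. The helper names are hypothetical, and the check assumes "GB" means 10^9 bytes and ignores the small config files.

```python
import shutil

# Component sizes in GB, as quoted above from the HF tree API.
COMPONENTS_GB = {
    "transformer": 7.75,
    "text_encoder": 8.05,
    "vae": 0.168,
}


def checkout_size_gb(components=COMPONENTS_GB):
    """Sum the quoted component sizes (config files not counted)."""
    return sum(components.values())


def has_room(path=".", needed_gb=None):
    """True if the filesystem holding `path` has at least needed_gb free."""
    needed = checkout_size_gb() if needed_gb is None else needed_gb
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed


print(f"checkout ~{checkout_size_gb():.2f} GB, room here: {has_room()}")
```

Point `path` at the drive that holds your Hugging Face cache rather than the current directory if the two differ.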

Installation

Two supported paths — pick one. Path A is the BFL-recommended Python flow for a 16 GB card; Path B is for users who already have ComfyUI installed. On Ada (sm_89) the default PyTorch CUDA wheels already ship the right kernels — pip install torch works as-is, and FlashAttention-2 prebuilt wheels cover sm_89, so no special index-URL or attention-implementation override is needed.

Path A — Diffusers (Python, BFL-recommended for RTX 4070 Ti SUPER)

1. Install dependencies

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is kept enabled for the 4070 Ti SUPER path, which is what holds peak VRAM at the documented ~13 GB envelope so it fits the 16 GB budget.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=dtype,
)
pipe.enable_model_cpu_offload()  # keeps peak near the BFL-stated ~13 GB envelope

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the recommended values for the distilled variant per the BFL model card. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — uses 25–50 steps at CFG 5.0 instead.
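To avoid mixing up the two sets of sampler values, they can be kept in one place and looked up by repository id. The sketch below is illustrative: `SamplerSettings` and `settings_for` are hypothetical names, the "-base-" substring test is an assumption based on the two repository ids named above, and 25 steps is simply the low end of the quoted 25–50 range.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SamplerSettings:
    num_inference_steps: int
    guidance_scale: float


# Values quoted above from the BFL model card.
DISTILLED = SamplerSettings(num_inference_steps=4, guidance_scale=1.0)
BASE = SamplerSettings(num_inference_steps=25, guidance_scale=5.0)  # 25-50 steps


def settings_for(repo_id: str) -> SamplerSettings:
    """Pick sampler settings from the repo id (assumes '-base-' marks base)."""
    return BASE if "-base-" in repo_id.lower() else DISTILLED


print(asdict(settings_for("black-forest-labs/FLUX.2-klein-4B")))
```

The resulting dictionary uses the same keyword names as the pipeline call in the snippet above, so it could be passed as `pipe(prompt=prompt, **asdict(settings_for(repo_id)))`.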

Path B — ComfyUI (FP8 single-file)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files, one per folder:

File	Folder
`flux-2-klein-4b-fp8.safetensors` (4.07 GB, distilled) — or `flux-2-klein-base-4b-fp8.safetensors` (base)	`ComfyUI/models/diffusion_models/`
`qwen_3_4b.safetensors`	`ComfyUI/models/text_encoders/`
`flux2-vae.safetensors`	`ComfyUI/models/vae/`

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json, which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
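Before launching ComfyUI, it can help to confirm that each file sits in the folder listed in the table. The following is a minimal sketch for the distilled variant; `missing_files` is a hypothetical helper, and the `ComfyUI` root path is an assumption about where your checkout lives.

```python
from pathlib import Path

# Folder -> filename, taken from the table above (distilled 4B variant).
REQUIRED_FILES = {
    "models/diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders": "qwen_3_4b.safetensors",
    "models/vae": "flux2-vae.safetensors",
}


def missing_files(comfy_root):
    """List required files that are absent under the ComfyUI root."""
    root = Path(comfy_root)
    return [
        f"{folder}/{name}"
        for folder, name in REQUIRED_FILES.items()
        if not (root / folder / name).is_file()
    ]


for entry in missing_files("ComfyUI"):
    print("missing:", entry)
```

An empty result means all three files are where the workflow expects them. For the base variant, swap in the base filename from the table.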

Running

Diffusers

Save the Path A snippet as flux_klein.py, then run it:

python flux_klein.py

Output flux-klein.png lands in your working directory. The first run downloads the weights from the Hub (~16 GB BF16) into ~/.cache/huggingface/. The first generation is also slower than later ones because it includes one-time warm-up work such as CUDA context creation and kernel loading; steady-state generations are faster.
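Because the first generation is not representative, any timing you report should separate it from the steady-state runs. A small, library-agnostic sketch of that measurement follows. `time_runs` is a hypothetical helper, and the `time.sleep` call is only a stand-in for the real pipeline call.

```python
import statistics
import time


def time_runs(generate, runs=3):
    """Call generate() once as warm-up, then `runs` more times.

    Returns (first_call_seconds, median_steady_state_seconds).
    """
    def timed():
        start = time.perf_counter()
        generate()
        return time.perf_counter() - start

    first = timed()
    steady = [timed() for _ in range(runs)]
    return first, statistics.median(steady)


# Stand-in workload; replace the lambda with your own pipeline call, e.g.
# lambda: pipe(prompt=prompt, num_inference_steps=4, guidance_scale=1.0)
first, steady = time_runs(lambda: time.sleep(0.01))
print(f"first: {first:.3f}s, steady-state median: {steady:.3f}s")
```

Reporting the median of several steady-state runs rather than a single run makes the number less sensitive to one-off stalls.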

ComfyUI

python main.py --listen

Open http://localhost:8188 and load one of the six official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant: 4 steps at CFG 1.0. For the base variant: 25–50 steps at CFG 5.0.

Results

VRAM usage: ~13 GB peak with the diffusers enable_model_cpu_offload() path, per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above" — the RTX 4070 Ti SUPER is a higher SKU than the RTX 4070 named there, so it is covered by "and above"). The full-resident BF16 path used by the 4090 sibling recipe peaks at ~20 GB and does not fit the 4070 Ti SUPER's 16 GB — the offload path is the path that fits this card. See /check/flux-2-klein-4b/rtx-4070-ti-super for community benchmark data as it lands.
Speed: No first-party generation time measured on the RTX 4070 Ti SUPER has been published for Klein 4B at the time of writing, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rtx-4070-ti-super returns verdict: unknown). The official Klein model card describes Klein 4B as delivering "end-to-end inference in as low as under a second" but gives no measured time for this card. The only published GPU-named figure is for the RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists distilled ~1.2s · 8.4 GB VRAM at FP8. That figure does not transfer to the RTX 4070 Ti SUPER, which differs in architecture generation (Blackwell sm_120 vs Ada sm_89), memory bandwidth, and quantization path (FP8-resident vs offloaded BF16). Report your measured 4070 Ti SUPER generation time via /contribute so /check/flux-2-klein-4b/rtx-4070-ti-super gets a real benchmark.
Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4-billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev 32B base). The 4 distilled steps make iteration fast.
License: Apache 2.0 — commercial use permitted (per the model card; the 4B Klein weights are open under Apache 2.0, distinct from the non-commercial 9B Klein variant).

For benchmark data on this pair as it is submitted, see /check/flux-2-klein-4b/rtx-4070-ti-super.

Troubleshooting

"OOM during VAE decode after the diffusion steps finish"

When CPU offload is enabled, the binding constraint shifts from GPU VRAM to system RAM. A community user (neuhsm, who holds no role on the repo) running an RTX 4090 with 32 GB of system RAM reported on flux2 Issue #11 that the diffusion steps complete but the process is then killed by a system-RAM OOM during the final VAE-decode stage, when latents are converted back to an image. This sub-failure is model-class-independent: it is a property of the offload runtime path (the VAE-decode peak lands in host RAM), so it can affect the Klein 4B offload path documented here too. The same reporter posted a working fix in comment issuecomment-3596576394: split the pipeline into a text-to-latent stage (run the DiT, return latents, release the text encoder and DiT from memory) followed by a separate latent-to-image stage that loads the VAE after memory has been cleared. The diffusers docs also document offloading to disk as an alternative. Treat 32 GB of system RAM as the realistic floor for the offload path, and note that the report above hit the OOM at exactly that size; 16 GB users should expect to need either disk offload or the staged-decode workaround.
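A quick way to see which side of that floor a machine falls on is to read total physical RAM before running the pipeline. The sketch below is an assumption-laden illustration: `total_ram_gib` and `ram_verdict` are hypothetical helpers, the sysconf names are POSIX ones (so this targets Linux, not Windows), and the 1.5 GB slack is a guess at how much less the OS reports than the installed amount.

```python
import os

RAM_FLOOR_GB = 32.0  # floor suggested in this recipe for the offload path
SLACK_GB = 1.5       # assumption: the OS reports a bit less than installed RAM


def total_ram_gib() -> float:
    """Total physical RAM in GiB via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


def ram_verdict(ram_gib: float) -> str:
    """Classify a RAM size against the recipe's suggested floor."""
    if ram_gib + SLACK_GB >= RAM_FLOOR_GB:
        return "meets floor"
    return "below floor: plan for disk offload or staged decode"


# Only query the OS where the POSIX names are available.
if "SC_PHYS_PAGES" in getattr(os, "sysconf_names", {}):
    ram = total_ram_gib()
    print(f"{ram:.1f} GiB -> {ram_verdict(ram)}")
```

Meeting the floor is not a guarantee, since the report above ran into the OOM on a 32 GB machine; it only tells you whether the workarounds are likely to be mandatory.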

⚠️ Issue #11 is a FLUX.2-dev report, not Klein. Issue #11 is titled "3090 24G: cuda out of memory" and reproduces with FLUX.2-dev and the Mistral3 text encoder (the reporter loads FLUX.2-dev-bnb-4bit via Flux2Pipeline). Klein 4B uses a Qwen3-4B encoder (Qwen3ForCausalLM per model_index.json). Only the model-class-independent VAE-decode-RAM sub-failure above transfers to Klein — the thread's encoder-specific (Mistral3 / bnb-4bit) discussion does not apply to this recipe's variant.

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses. Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
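If you are unsure which checkout you have on disk, the text-encoder entry in its model_index.json settles it. A minimal sketch of that check is shown here. `text_encoder_matches` is a hypothetical helper, and the sample JSON is a trimmed stand-in containing only the entry quoted above, not the full file.

```python
import json

# Entry quoted above from Klein's model_index.json.
EXPECTED_TEXT_ENCODER = ["transformers", "Qwen3ForCausalLM"]


def text_encoder_matches(model_index_text: str) -> bool:
    """True if the JSON text declares the Qwen3 text encoder."""
    index = json.loads(model_index_text)
    return index.get("text_encoder") == EXPECTED_TEXT_ENCODER


# Trimmed stand-in; in practice, read the model_index.json of your checkout.
sample = '{"text_encoder": ["transformers", "Qwen3ForCausalLM"]}'
print(text_encoder_matches(sample))
```

A result of False suggests the directory belongs to a different model family, which would be consistent with the shape or config errors described in this entry.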

"Can I run the full-resident BF16 path on the 4070 Ti SUPER like the 4090 does?"

No. The 4090 sibling's full-resident BF16 path peaks at ~20 GB, which exceeds the RTX 4070 Ti SUPER's 16 GB. Keep the pipe.enable_model_cpu_offload() call from the snippet above in place to stay near the ~13 GB envelope. If you want to drop GPU memory further at a small quality/throughput cost, use the FP8 ComfyUI single-file path (Path B), which loads a 4.07 GB FP8 transformer instead of the 7.75 GB BF16 one.

"How do I generate images larger than 1024×1024?"

Activation memory scales quadratically with side length — 2048×2048 even with offload can push past the 13 GB envelope toward the 4070 Ti SUPER's 16 GB ceiling. Drop back to 1024×1024 for the standard path, or split the work into a low-res Klein generation followed by an upscaler (Real-ESRGAN, SwinIR, etc.). Report success / failure cases via the submission form.
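The quadratic relation is easy to put numbers on. The sketch below assumes activation memory grows in proportion to pixel area, which is the relation stated above; `area_scale` is a hypothetical helper, and the result is a relative factor for activations only, since the weights take the same memory at any resolution.

```python
def area_scale(side_px: int, base_side_px: int = 1024) -> float:
    """Relative pixel area of a square image versus the 1024 baseline."""
    return (side_px / base_side_px) ** 2


for side in (1024, 1536, 2048):
    print(f"{side}x{side}: {area_scale(side):.2f}x the activations of 1024x1024")
```

Doubling the side length to 2048 quadruples the activation footprint, which is why generating at 1024×1024 and upscaling by 2× afterwards is the safer route on a 16 GB card.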

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all six variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.