Flux.2 Klein 4B on RTX 5070: Blackwell-Native FP8 4-Step Text-to-Image at ~8.4 GB

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B — the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family — on an RTX 5070. Klein was explicitly built for consumer hardware: BFL's model card states "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". The RTX 4070 named there is a 12 GB card — the same VRAM class as the RTX 5070 — so the 5070 is squarely inside the hardware band the model authors target. The recipe leads with the FP8 ComfyUI path because the 5070's Blackwell GB205 sm_120 architecture has native FP8 tensor-core acceleration, and the FP8 distilled variant runs at a measured ~8.4 GB peak — leaving comfortable headroom inside the 5070's 12 GB budget.

Hardware data: RTX 5070 (12 GB VRAM) · ~8.4 GB peak VRAM at FP8 distilled (Blackwell sm_120 runtime property) · 4-step distilled generation · See benchmark data

ℹ️ This recipe targets Klein 4B (Apache-2.0), not the 9B variants. The official Flux.2 README ships Klein in several sizes — only the 4B models (distilled and base) are Apache-2.0 and free for commercial use. The 9B and 9B-KV variants are released under the FLUX Non-Commercial License. Everything below pins the 4B distilled variant; the file names and VRAM figures are 4B-specific.

⚠️ 12 GB budget: lead with the FP8 path, not BF16. Unlike a 16 GB card, the RTX 5070's 12 GB budget (≈10.5–11.3 GB usable once a display is attached) does not comfortably hold the ~13 GB BF16 + CPU-offload diffusers path documented on the model card. On the 5070, use Path A (FP8 ComfyUI, ~8.4 GB) — that is why min_vram_gb is set to 8 here. The ~13 GB BF16 path is documented below only as a 16 GB-card / headless alternative, not as a 12 GB path.

Requirements

Component	Minimum	Tested
GPU	~9 GB VRAM for the FP8 ComfyUI path (peak ~8.4 GB)	RTX 5070 (12 GB)
RAM	16 GB (32 GB recommended if you try the BF16 + offload path)	—
Storage	~8 GB (FP8 ComfyUI single-file + Qwen3-4B encoder + VAE)	—
Software	ComfyUI (latest nightly with Klein nodes) — OR — `diffusers` (main) + `transformers` + `accelerate`	—

The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships a 4.07 GB FP8 single-file (flux-2-klein-4b-fp8.safetensors) used by the ComfyUI path, per the HF tree API. The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is one consolidated 7.75 GB transformer file (flux-2-klein-4b.safetensors) plus the Qwen3-4B text encoder and the Flux.2-family VAE — that BF16 path is the higher-VRAM alternative below, not the recommended 12 GB path.

Installation

The recommended path on the RTX 5070 is Path A (ComfyUI FP8) — it leverages the card's native Blackwell FP8 tensor cores and lands at ~8.4 GB peak, well inside the 12 GB budget. Path B (BF16 diffusers) is documented as an alternative for users with a 16 GB card or a headless box and a preference for Python scripting; on a 12 GB display card it is too tight (see the warning above).
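The budget reasoning behind this recommendation can be written out as a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this recipe (12 GB total, roughly 10.5 to 11.3 GB usable with a display attached, ~8.4 GB FP8 peak, ~13 GB BF16 peak). The names `PEAK_GB`, `headroom_gb`, and `fits` are illustrative, not part of any tool.

```python
# Figures quoted in this recipe, in GB. The usable range assumes a display is attached.
USABLE_RANGE_GB = (10.5, 11.3)
PEAK_GB = {"fp8_comfyui": 8.4, "bf16_offload": 13.0}


def headroom_gb(peak_gb, usable_gb):
    """Return free VRAM after the peak; a negative value means the path does not fit."""
    return round(usable_gb - peak_gb, 2)


def fits(path, usable_gb=USABLE_RANGE_GB[0]):
    """Check a path against the pessimistic (lowest) usable figure by default."""
    return headroom_gb(PEAK_GB[path], usable_gb) >= 0


for name, peak in PEAK_GB.items():
    low = headroom_gb(peak, USABLE_RANGE_GB[0])
    high = headroom_gb(peak, USABLE_RANGE_GB[1])
    print(f"{name}: headroom {low} to {high} GB, fits={fits(name)}")
```

Checking against the lower end of the usable range is a deliberately conservative choice: a path that only fits at the optimistic end is likely to fail once a browser or compositor takes its share.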

Path A — ComfyUI FP8 (recommended for RTX 5070)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

For Blackwell sm_120 PyTorch support, install the cu128 wheels:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

The default pip install torch already includes sm_120 kernels via cu128 on current releases; the explicit --index-url above guarantees it if you have an older index pinned. The RTX 5070's GB205 die is the same Blackwell compute capability (sm_120) as the rest of the 50-series, so the same cu128 wheel applies.
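To confirm that the installed wheel actually carries sm_120 kernels, you can compare the device's compute capability against the architectures the build was compiled for. The sketch below relies on `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; the helper names `arch_supported` and `check_local_gpu` are made up for illustration, and the script returns `None` instead of failing when PyTorch or a CUDA device is absent.

```python
def arch_supported(arch_list, capability):
    """True if the build's kernel list contains the device's sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_local_gpu():
    """Return (device_tag, supported), or None when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    cap = torch.cuda.get_device_capability(0)
    tag = f"sm_{cap[0]}{cap[1]}"
    return tag, arch_supported(torch.cuda.get_arch_list(), cap)


if __name__ == "__main__":
    # On an RTX 5070 the tag is expected to be sm_120, per this recipe.
    print(check_local_gpu())
```

If the second value comes back `False`, reinstalling from the cu128 index shown above is the first thing to try.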

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

File	Folder
`flux-2-klein-4b-fp8.safetensors` (4.07 GB, distilled)	`ComfyUI/models/diffusion_models/`
`qwen_3_4b.safetensors`	`ComfyUI/models/text_encoders/`
`flux2-vae.safetensors`	`ComfyUI/models/vae/`

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
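A misplaced file is an easy mistake to make here, so a quick existence check can save a failed workflow load. The following is a small sketch that encodes the three file/folder pairs from the table; the `ComfyUI` root path and the helper name `missing_files` are assumptions, so pass the location of your own checkout.

```python
from pathlib import Path

# Folder -> filename pairs, taken from the table in this step.
REQUIRED_FILES = {
    "models/diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders": "qwen_3_4b.safetensors",
    "models/vae": "flux2-vae.safetensors",
}


def missing_files(comfy_root):
    """Return the expected paths that do not exist under comfy_root."""
    root = Path(comfy_root)
    expected = [root / folder / name for folder, name in REQUIRED_FILES.items()]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed checkout location relative to the current directory.
    for path in missing_files("ComfyUI"):
        print("missing:", path)
```

The script prints nothing when all three files are in place.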

Path B — Diffusers (BF16 + CPU offload) — 16 GB-card / headless alternative

On a 12 GB display card this path is too tight (peak ~13 GB vs ~10.5–11.3 GB usable). Use it only on a 16 GB card, or on a headless 12 GB box where the full VRAM is free. It is included for completeness and because the install steps are the canonical BFL example.

1. Install dependencies

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified. Save it as flux_klein.py (the filename the Running section uses). The pipe.enable_model_cpu_offload() call is what holds peak VRAM near the documented ~13 GB envelope.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4B", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the values published in the BFL model-card example for the distilled variant. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — is a 50-step model per the official Flux.2 README, which recommends Base for fine-tuning, LoRA training, and maximum flexibility.

Running

ComfyUI (Path A — recommended)

python main.py --listen

Open http://localhost:8188 and load one of the official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B), downloadable from the docs.comfy.org tutorial page. The --listen flag also exposes the server to other machines on your network; omit it if you only need access from the same machine. For the distilled variant, use 4 steps; the base variant is a 50-step model per the official Flux.2 README. Peak VRAM stays at ~8.4 GB on the FP8 distilled path, so the 12 GB card has headroom for the display and a browser during generation.

Diffusers (Path B — alternative)

python flux_klein.py

The output flux-klein.png lands in your working directory. The first run downloads the BF16 weights from the Hub (consolidated transformer ~7.75 GB + Qwen3-4B encoder + VAE) into ~/.cache/huggingface/. The first generation also pays one-time warm-up costs (loading weights, creating the CUDA context, initializing kernels); steady-state generations are faster. On a 12 GB display card this path will sit very close to (or over) the usable VRAM ceiling, so prefer Path A.

Results

Speed: No RTX-5070-named generation-time number for Klein 4B has been published at the time of writing, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rtx-5070 is unknown). The only measured GPU-named timing published so far is on RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists "~1.2s (5090) · 8.4GB VRAM" for the distilled variant. That RTX 5090 figure does not transfer to the RTX 5070 — the 5070 has far fewer CUDA cores and much lower memory bandwidth than the 5090, so per-image time will be materially slower. Rather than publish an extrapolated number, this recipe omits a 5070 speed figure. Report your measured RTX 5070 generation time via /contribute so /check/flux-2-klein-4b/rtx-5070 gets a real benchmark.
VRAM usage (FP8 ComfyUI, recommended path): ~8.4 GB peak for the distilled variant per the official docs.comfy.org tutorial ("~1.2s (5090) · 8.4GB VRAM"). That measurement was taken on an RTX 5090, but the resident set is the same on the 5070: the same FP8 single-file, Qwen3-4B encoder, VAE, and activations, on the same sm_120 architecture. The ~8.4 GB envelope is therefore expected to carry over (it has not yet been independently measured on a 5070) and sits comfortably inside the 5070's 12 GB budget. The FP8 single-file on disk is 4.07 GB per the HF tree API, consistent with a ~8.4 GB runtime peak once the encoder, VAE, and activations are also resident.
VRAM usage (BF16 diffusers alternative — NOT the 12 GB path): ~13 GB peak with enable_model_cpu_offload() enabled per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above"). At ~13 GB this exceeds the RTX 5070's ~10.5–11.3 GB usable budget on a display card, so it is documented as a 16 GB-card / headless alternative only. The min_vram_gb: 8 in the frontmatter reflects the recommended FP8 Path A that this recipe actually leads with. See /check/flux-2-klein-4b/rtx-5070 for community benchmark data as it lands.
Quality notes: Klein is the small/distilled member of the Flux.2 family. Expect strong prompt adherence for a 4-billion-parameter model, with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev base). The 4-step distilled schedule makes iteration fast on Blackwell.
License: Apache 2.0 — commercial use permitted for the 4B variant (per the model card).

Benchmark data for this pair will appear at /check/flux-2-klein-4b/rtx-5070 as submissions land.
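If you plan to contribute a measurement, a consistent timing method makes the number more useful. Below is a minimal sketch of a timing harness that discards warm-up runs and reports the median. The function name `time_runs` and its parameters are illustrative; `generate` stands for whatever callable produces one image in your setup.

```python
import statistics
import time


def time_runs(generate, runs=3, warmup=1, sync=None):
    """Time generate() over several runs, skipping the warm-up runs.

    sync is an optional callable (for example torch.cuda.synchronize) invoked
    before each clock read so that queued GPU work is included in the timing.
    """
    def wait():
        if sync is not None:
            sync()

    for _ in range(warmup):
        generate()
    wait()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        wait()
        samples.append(time.perf_counter() - start)
    return {"runs": runs, "median_s": statistics.median(samples), "min_s": min(samples)}


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call that generates one image.
    print(time_runs(lambda: time.sleep(0.01)))
```

For the diffusers path, passing `sync=torch.cuda.synchronize` should keep asynchronous kernel launches from making runs look faster than they are. Reporting the median rather than the first run keeps one-time warm-up cost out of the figure.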

Troubleshooting

"OOM during VAE decode after the diffusion steps finish (Path B, offload enabled)"

If you run the BF16 diffusers path (Path B) with enable_model_cpu_offload() and limited system RAM, the failure mode reported in flux2 Issue #11 can apply: the diffusion steps complete, but the OS kills the process (system RAM OOM) during the final VAE-decode stage, because offload trades GPU VRAM for system RAM.

Note which variant the report covers. Issue #11 describes the failure on FLUX.2-dev (the larger Mistral3-encoder variant running bnb-4bit), not Klein. The failing stage, however, is VAE decode, which is shared across the Flux.2 family runtime path under enable_model_cpu_offload(), so the failure transfers to Klein. The community reporter neuhsm (not a maintainer) posted a working fix in issuecomment-3596576394: split the pipeline into a text-to-latent stage, then explicitly call gc.collect() and torch.cuda.empty_cache() before running VAE decode separately. The Mistral3-encoder-specific workaround earlier in that same thread (the FLUX.2-dev-bnb-4bit 4-bit text-encoder snippet) does NOT transfer to Klein, because Klein uses a Qwen3-4B encoder, not Mistral3.

On the RTX 5070 the cleanest fix is to avoid the offload path entirely and use Path A (FP8 ComfyUI, no CPU offload needed at ~8.4 GB).
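The shape of that two-stage workaround can be sketched as follows. This is not the code from the issue thread: `make_latents` and `decode_latents` are hypothetical callables you would supply (the first runs the pipeline up to latents, the second runs VAE decode), and only the `gc.collect()` and `torch.cuda.empty_cache()` calls are taken from the description above.

```python
import gc


def free_memory():
    """Collect Python garbage and, if PyTorch with CUDA is present, drop cached VRAM.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


def run_two_stage(make_latents, decode_latents):
    """Run latent generation, release memory, then decode as a separate step."""
    latents = make_latents()          # stage 1: hypothetical text-to-latent call
    free_memory()                     # release what stage 1 no longer needs
    return decode_latents(latents)    # stage 2: hypothetical VAE-decode call
```

The cleanup only helps if stage 1 has dropped its own references to large objects first, so keep intermediate tensors out of long-lived variables.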

"FlashAttention-2 crash on first inference"

If you've manually set attn_implementation="flash_attention_2" somewhere in a wrapper or custom node, note that FlashAttention-2 support for sm_120 is still tracked as an open issue at Dao-AILab/flash-attention#2168, and RTX 50-series (Blackwell sm_120) cards crash at the first inference call. The diffusers default attention is scaled_dot_product_attention (SDPA), which has native sm_120 kernels in any cu128 PyTorch build and is the always-works fallback. The official Klein snippet does not request FA2; only a custom override would trigger this. Switch the override to "sdpa" (or "eager") if you hit it.
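If the override lives in your own wrapper code, one option is to guard it by compute capability so the same script runs on both older cards and the 50-series. The helper below is a hypothetical sketch, not an API of any library: it only encodes the rule stated above (FA2 requested on compute capability 12.x falls back to "sdpa").

```python
def pick_attn_implementation(capability, requested="flash_attention_2"):
    """Fall back to SDPA when FA2 is requested on sm_120 (compute capability 12.x)."""
    major, _minor = capability
    if requested == "flash_attention_2" and major == 12:
        return "sdpa"
    return requested


if __name__ == "__main__":
    # (12, 0) is the capability this recipe gives for the RTX 5070.
    print(pick_attn_implementation((12, 0)))
```

The capability tuple has the same form that `torch.cuda.get_device_capability()` returns, so the result can be passed wherever your wrapper currently hard-codes the attention setting.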

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses. Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
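On the diffusers path you can read the declared encoder class straight from the downloaded model_index.json rather than guessing from filenames. This is a small sketch under the assumption that the file follows the `"component": ["library", "ClassName"]` layout quoted above; the helper name `component_class` is illustrative, and the path to the file depends on where your Hub cache or local checkout lives.

```python
import json
from pathlib import Path


def component_class(model_index_path, component="text_encoder"):
    """Return the class name declared for a component in model_index.json."""
    index = json.loads(Path(model_index_path).read_text(encoding="utf-8"))
    _library, class_name = index[component]
    return class_name


def is_klein_encoder(model_index_path):
    """True if the declared text encoder is the Qwen3 class this recipe expects."""
    return component_class(model_index_path) == "Qwen3ForCausalLM"
```

A result other than `Qwen3ForCausalLM` suggests the directory belongs to a different model, such as a Flux.1 or Flux.2 dev checkout still on disk.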

"Can I run the BF16 diffusers path on my 12 GB RTX 5070?"

It's tight and not recommended. The BF16 + offload path peaks near ~13 GB per the BFL model card, which exceeds the ~10.5–11.3 GB usable on a 12 GB display card. Stay on Path A (FP8 ComfyUI, ~8.4 GB). If you have a second, headless 12 GB card with no display attached (so the full VRAM is free) you may have enough room, but the FP8 path is faster on Blackwell anyway because it uses the native FP8 tensor cores. If you measure the BF16 path fitting on your 5070, report it via the submission form.

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.