Flux.2 Klein 4B on RTX 5070 Ti: Blackwell-Native FP8 4-Step Text-to-Image at ~8.4 GB

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B (the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family) on an RTX 5070 Ti. Klein is explicitly built for consumer hardware: BFL's model card states "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above", and the 16 GB RTX 5070 Ti falls under "and above". The recipe leads with the FP8 ComfyUI path because the 5070 Ti's Blackwell GB203 (sm_120) architecture has native FP8 tensor-core acceleration, and the FP8 distilled variant runs at a measured ~8.4 GB peak, leaving roughly 7.6 GB of headroom within the 5070 Ti's 16 GB.

Hardware data: RTX 5070 Ti (16 GB VRAM) · ~8.4 GB peak VRAM at FP8 distilled (Blackwell-measured) · 4-step distilled generation · See benchmark data

ℹ️ This recipe targets Klein 4B (Apache-2.0), not the 9B variants. The official Flux.2 README ships Klein in several sizes — only the 4B models (distilled and base) are Apache-2.0 and free for commercial use. The 9B and 9B-KV variants are released under the FLUX Non-Commercial License. Everything below pins the 4B distilled variant; the file names and VRAM figures are 4B-specific.

Requirements

Component	Minimum	Tested
GPU	13 GB VRAM with `enable_model_cpu_offload()` (BF16); ~9 GB for the FP8 ComfyUI path	RTX 5070 Ti (16 GB)
RAM	16 GB (32 GB recommended for the BF16 + offload path)	—
Storage	~8 GB (FP8 ComfyUI single-file + Qwen3-4B encoder + VAE) or ~16 GB (full BF16 diffusers checkout)	—
Software	ComfyUI (latest nightly with Klein nodes) — OR — `diffusers` (main) + `transformers` + `accelerate`	—

The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships a 4.07 GB FP8 single-file (flux-2-klein-4b-fp8.safetensors) used by the ComfyUI path, per the HF tree API. The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is one consolidated 7.75 GB transformer file (flux-2-klein-4b.safetensors) plus the Qwen3-4B text encoder and the Flux.2-family VAE.
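As a rough sanity check on those two file sizes, dividing each by the storage width of its dtype should land near 4 billion parameters. The sketch below assumes 1 byte per FP8 weight and 2 bytes per BF16 weight, treats 1 GB as 10^9 bytes, and ignores file metadata and any layers stored at a different precision, so the results are approximate only.

```python
def implied_params_billions(file_size_gb, bytes_per_param):
    """Estimate the parameter count (in billions) implied by a weight file's size."""
    return file_size_gb / bytes_per_param


# File sizes quoted above; bytes-per-parameter values are assumptions about the dtypes.
fp8_estimate = implied_params_billions(4.07, 1)   # FP8: 1 byte per weight
bf16_estimate = implied_params_billions(7.75, 2)  # BF16: 2 bytes per weight

print(f"FP8 file implies about {fp8_estimate} B parameters")
print(f"BF16 file implies about {bf16_estimate} B parameters")
```

Both estimates (about 4.07 and 3.875 billion) sit close to the nominal 4B, which suggests the two files hold the same transformer at different precisions.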

Installation

Two supported paths — pick one. Path A is the recommended default for an RTX 5070 Ti because the FP8 path leverages the card's native Blackwell FP8 tensor cores and lands at the smallest VRAM footprint. Path B is for users who prefer Python scripting and the diffusers ecosystem.

Path A — ComfyUI FP8 (recommended for RTX 5070 Ti)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

For Blackwell sm_120 PyTorch support, install the cu128 wheels:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

On current releases, a plain pip install torch already pulls a cu128 build that includes sm_120 kernels; the explicit --index-url above guarantees this if an older index is pinned in your environment.
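To confirm that the installed PyTorch build actually targets Blackwell, a short check along these lines may help. It is a minimal sketch: build_targets_sm120 and report are hypothetical helper names, while torch.cuda.get_arch_list() and torch.cuda.get_device_capability() are existing PyTorch calls.

```python
def build_targets_sm120(arch_list):
    """Return True if a PyTorch arch list contains sm_120 (consumer Blackwell) kernels."""
    return "sm_120" in arch_list


def report():
    """Print the CUDA build, the device capability, and whether sm_120 is compiled in."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available in this environment")
        return
    major, minor = torch.cuda.get_device_capability(0)
    print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")
    print(f"device capability: sm_{major}{minor}")
    print(f"build includes sm_120 kernels: {build_targets_sm120(torch.cuda.get_arch_list())}")
```

Call report() from the same Python environment that ComfyUI uses. On an RTX 5070 Ti the device capability should read sm_120, and the last line should report True.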

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

File	Folder
`flux-2-klein-4b-fp8.safetensors` (4.07 GB, distilled) — or `flux-2-klein-base-4b-fp8.safetensors` (base)	`ComfyUI/models/diffusion_models/`
`qwen_3_4b.safetensors`	`ComfyUI/models/text_encoders/`
`flux2-vae.safetensors`	`ComfyUI/models/vae/`

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
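A misplaced file is an easy mistake to make here, so a small placement check can save a failed workflow load. The sketch below assumes the distilled FP8 variant and the folder layout from the table above; missing_klein_files is a hypothetical helper name, and the ComfyUI root path is whatever your checkout uses.

```python
from pathlib import Path

# Relative locations taken from the table above (distilled FP8 variant).
EXPECTED_FILES = {
    "diffusion model": "models/diffusion_models/flux-2-klein-4b-fp8.safetensors",
    "text encoder": "models/text_encoders/qwen_3_4b.safetensors",
    "VAE": "models/vae/flux2-vae.safetensors",
}


def missing_klein_files(comfy_root):
    """Return the labels of expected files that are absent under comfy_root."""
    root = Path(comfy_root)
    return [label for label, rel in EXPECTED_FILES.items() if not (root / rel).is_file()]
```

For example, missing_klein_files("ComfyUI") returns an empty list once all three files are in place, and otherwise names what is still missing.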

Path B — Diffusers (Python, BF16 + CPU offload)

1. Install dependencies

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.
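Before running anything heavy, it can be worth checking that the installed diffusers actually exposes the pipeline class. The sketch below uses only importlib and hasattr; module_exposes is a hypothetical helper name, and it does not catch every failure mode (a broken optional dependency can still raise at import time).

```python
import importlib


def module_exposes(module_name, attr_name):
    """Return True if module_name imports cleanly and has an attribute attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)
```

If module_exposes("diffusers", "Flux2KleinPipeline") returns False, the installed diffusers likely predates Klein support and the git install above needs to be rerun.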

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified. Save it as flux_klein.py, the filename used in the Running section. pipe.enable_model_cpu_offload() is kept enabled for the 16 GB 5070 Ti, which is what holds peak VRAM near the documented ~13 GB envelope.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4B", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the values published in the BFL model-card example for the distilled variant. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — is a 50-step model per the official Flux.2 README, which recommends Base for fine-tuning, LoRA training, and maximum flexibility.

Running

ComfyUI (Path A)

python main.py --listen

Open http://localhost:8188 and load one of the official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant, use 4 steps; the base variant is a 50-step model per the official Flux.2 README.

Diffusers (Path B)

python flux_klein.py

Output flux-klein.png lands in your working directory. First run downloads the BF16 weights from the Hub (consolidated transformer ~7.75 GB + Qwen3-4B encoder + VAE) into ~/.cache/huggingface/. The first generation includes one-time pipeline compilation / shader-cache build; steady-state generations are faster.
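Because the first generation carries one-time overhead, a fair timing should discard it. The following is a minimal sketch of such a harness using only the standard library; time_generations is a hypothetical helper, and generate stands for any zero-argument callable that produces one image.

```python
import statistics
import time


def time_generations(generate, runs=3, warmup=1):
    """Time generate() over several runs, discarding the warm-up calls."""
    for _ in range(warmup):
        generate()  # absorbs one-time setup cost; not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(timings),
        "min_s": min(timings),
    }
```

With the Path B script, generate could be a lambda wrapping the pipe(...) call from the official example. The median of the timed runs is the steady-state number worth reporting.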

Results

Speed: No first-party generation-time figure for Klein 4B on the RTX 5070 Ti had been published at the time of writing, and no benchmark exists yet for this pair (/check/flux-2-klein-4b/rtx-5070-ti is unknown). BFL describes Klein as a sub-second model (the official Flux.2 GitHub README says "Generate or edit images under a second on modern hardware"), but that is a capability statement, not a measurement on named hardware. The only published GPU-specific timing so far is for the RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists "~1.2s (5090) · 8.4GB VRAM" for distilled and "~17s (5090) · 9.2GB VRAM" for base, both at FP8. That timing does not transfer to the RTX 5070 Ti. Both cards are Blackwell sm_120, but the 5070 Ti has roughly 59% fewer CUDA cores (8,960 vs 21,760) and half the memory bandwidth (896 GB/s vs 1,792 GB/s), so per-image time will be materially slower than the 5090's ~1.2 s. Report your measured 5070 Ti generation time via /contribute so /check/flux-2-klein-4b/rtx-5070-ti gets a real benchmark.
VRAM usage (FP8 ComfyUI, recommended path): ~8.4 GB peak for the distilled variant and ~9.2 GB peak for the base variant, per the official docs.comfy.org tutorial. Those figures were measured on an RTX 5090, but peak VRAM is determined by what is resident (the FP8 single-file, the Qwen3-4B encoder, the VAE, and activations), not by compute throughput, and the 5070 Ti shares the same sm_120 architecture. The ~8.4 GB envelope should therefore carry over, and it sits comfortably inside the 5070 Ti's 16 GB. The FP8 single-file is 4.07 GB on disk per the HF tree API, which is consistent with an 8.4 GB runtime peak once the encoder, VAE, and activations are also resident.
VRAM usage (BF16 diffusers, Path B alternative): ~13 GB peak with enable_model_cpu_offload() enabled, per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above"; the RTX 5070 Ti falls under "and above"). This is the higher-envelope path, and the frontmatter min_vram_gb: 13 reflects it so the compatibility math stays honest for users who pick Path B. The official Flux.2 GitHub README separately states "Klein 4B fits in ~8GB VRAM (RTX 3090/4070 and up)". The lower ~8 GB figure refers to the FP8 ComfyUI single-file path (Path A above), while the ~13 GB figure refers to the offloaded BF16 diffusers path. Both fit the 5070 Ti: Path A leaves about 7.6 GB free and Path B about 3 GB. See /check/flux-2-klein-4b/rtx-5070-ti for community benchmark data as it lands.
Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4-billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev base). The 4 distilled steps make iteration fast on Blackwell.
License: Apache 2.0 — commercial use permitted for the 4B variant (per the model card).

For the full benchmark data, see /check/flux-2-klein-4b/rtx-5070-ti.
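The headroom each path leaves on a 16 GB card follows directly from the peak figures quoted in this section. A short worked calculation, with headroom_gb as a hypothetical helper name:

```python
CARD_VRAM_GB = 16.0  # RTX 5070 Ti

# Peak VRAM figures quoted in this section.
PEAK_VRAM_GB = {
    "Path A, FP8 distilled": 8.4,
    "Path A, FP8 base": 9.2,
    "Path B, BF16 + CPU offload": 13.0,
}


def headroom_gb(peak_gb, card_gb=CARD_VRAM_GB):
    """Return free VRAM in GB, rounded to one decimal place."""
    return round(card_gb - peak_gb, 1)


for label, peak in PEAK_VRAM_GB.items():
    print(f"{label}: {headroom_gb(peak)} GB free")
```

This yields 7.6 GB, 6.8 GB, and 3.0 GB respectively. Path B fits, but with far less margin for other GPU workloads running at the same time.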

Troubleshooting

"OOM during VAE decode after the diffusion steps finish (Path B, offload enabled)"

If you run the BF16 diffusers path (Path B) with enable_model_cpu_offload() and limited system RAM, the failure mode reported in flux2 Issue #11 can apply: the diffusion steps complete, but the OS kills the process (a system RAM OOM, not a VRAM OOM) during the final VAE-decode stage, because offload trades GPU VRAM for system RAM. Note which variant the report covers: Issue #11 was filed against FLUX.2-dev (the larger Mistral3-encoder variant running bnb-4bit), not Klein. The VAE-decode stage, however, is shared across the Flux.2 family under enable_model_cpu_offload(), so this specific sub-failure can also affect Klein. The community reporter neuhsm posted a working fix in issuecomment-3596576394: split the pipeline into a text-to-latent stage, then call gc.collect() and torch.cuda.empty_cache() explicitly before running VAE decode separately. 32 GB of system RAM is a comfortable floor for the offload path. The Mistral3-specific workaround code earlier in the same thread (the FLUX.2-dev-bnb-4bit 4-bit text-encoder snippet) does NOT transfer to Klein, which uses a Qwen3-4B encoder. Using Path A instead (FP8 ComfyUI, no CPU offload needed at ~8.4 GB) sidesteps the issue entirely on the 5070 Ti.
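The shape of that fix can be sketched generically. This is not the code from the issue thread: make_latents and decode_latents are hypothetical callables you would supply, and the Klein-specific latent handling between the two stages is deliberately left out because it depends on pipeline internals not covered here.

```python
import gc


def release_memory():
    """Drop unreachable Python objects, then release cached CUDA blocks if possible."""
    gc.collect()
    try:
        import torch  # optional: without PyTorch there is nothing more to free
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def generate_then_decode(make_latents, decode_latents):
    """Run generation and VAE decode as separate stages with cleanup in between.

    make_latents: hypothetical zero-argument callable returning latents.
    decode_latents: hypothetical callable turning those latents into an image.
    """
    latents = make_latents()
    release_memory()  # free what the first stage no longer needs before decoding
    return decode_latents(latents)
```

In practice make_latents would wrap the pipeline call with latent output requested (many diffusers pipelines accept output_type="latent"; whether Flux2KleinPipeline does is an assumption to verify), and decode_latents would wrap the VAE decode step.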

"FlashAttention-2 crash on first inference"

This only happens if attn_implementation="flash_attention_2" has been set manually somewhere in a wrapper or custom node. FA2 kernels for sm_120 are still tracked as open at Dao-AILab/flash-attention#2168, and RTX 50-series (Blackwell sm_120) cards crash at the first inference call. The diffusers default attention is scaled_dot_product_attention (SDPA), which has native sm_120 kernels in any cu128 PyTorch build and is the always-works fallback. The official Klein snippet does not request FA2; only a custom override would trigger this. Switch the override to "sdpa" (or "eager") if you hit it.
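If the override lives in your own wrapper code, a small guard can make the fallback automatic. The helper below is hypothetical and purely illustrative; it assumes the capability tuple comes from torch.cuda.get_device_capability(), which returns (12, 0) on sm_120 cards.

```python
def safe_attn_implementation(requested, capability):
    """Fall back to SDPA when FlashAttention-2 is requested on an sm_120 device."""
    if requested == "flash_attention_2" and tuple(capability) == (12, 0):
        return "sdpa"
    return requested
```

For example, safe_attn_implementation("flash_attention_2", (12, 0)) returns "sdpa", while any other request, or the same request on a different architecture, passes through unchanged.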

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses. Issue #11 on the flux2 repo is a Flux.2-dev bnb-4bit report, so its encoder-specific workaround does not transfer to Klein. Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
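For the diffusers path, the encoder declaration can be checked directly against a local copy of model_index.json. A minimal sketch, with encoder_matches as a hypothetical helper and the expected value taken from the declaration quoted above:

```python
import json

# Expected declaration, as quoted from Klein's model_index.json.
EXPECTED_ENCODER = ["transformers", "Qwen3ForCausalLM"]


def encoder_matches(model_index_text):
    """Return True if a model_index.json string declares the Qwen3 text encoder."""
    index = json.loads(model_index_text)
    return index.get("text_encoder") == EXPECTED_ENCODER
```

Pass it the contents of model_index.json from your downloaded checkout (its location depends on your Hugging Face cache layout). A False result means the checkout belongs to a different model family.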

"Want the lowest VRAM / can I run the 9B variant?"

On the 5070 Ti, Path A (FP8 ComfyUI, ~8.4 GB distilled) is already well within the 16 GB budget, so there's rarely a need to go lower. The docs.comfy.org Klein tutorial also lists Klein 9B files, but the 9B variants are under the FLUX Non-Commercial License (not Apache-2.0), and their timing/VRAM figures aren't published on the same tutorial page — stay on the 4B distilled variant unless you have a research-only use case and have measured the 9B footprint yourself.

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.