
Flux.2 Klein 4B on RTX 5070 Ti: Blackwell-Native FP8 4-Step Text-to-Image at ~8.4 GB

image · beginner · 13GB+ VRAM · Jun 3, 2026
prerequisites
  • NVIDIA RTX 5070 Ti (16 GB VRAM) — Blackwell GB203 sm_120
  • Python 3.10+ (Python 3.12 for the official BFL repo)
  • ComfyUI (latest nightly with Klein nodes) or the `diffusers` Python package
  • PyTorch built with cu128 wheels (Blackwell sm_120 support)

What You'll Build

Generate 1024×1024 images locally on an RTX 5070 Ti with Black Forest Labs' Flux.2 Klein 4B, the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family. Klein is explicitly built for consumer hardware: BFL's model card states "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above", and the 16 GB RTX 5070 Ti falls under "and above". The recipe leads with the FP8 ComfyUI path because the 5070 Ti's Blackwell GB203 (sm_120) architecture has native FP8 tensor-core acceleration, and the FP8 distilled variant runs at a ~8.4 GB peak (measured on an RTX 5090), which leaves roughly 7.6 GB of headroom in the 5070 Ti's 16 GB.

Hardware data: RTX 5070 Ti (16 GB VRAM) · ~8.4 GB peak VRAM at FP8 distilled (Blackwell-measured) · 4-step distilled generation · See benchmark data

ℹ️ This recipe targets Klein 4B (Apache-2.0), not the 9B variants. The official Flux.2 README ships Klein in several sizes — only the 4B models (distilled and base) are Apache-2.0 and free for commercial use. The 9B and 9B-KV variants are released under the FLUX Non-Commercial License. Everything below pins the 4B distilled variant; the file names and VRAM figures are 4B-specific.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 13 GB VRAM with enable_model_cpu_offload() (BF16); ~9 GB for the FP8 ComfyUI path | RTX 5070 Ti (16 GB) |
| RAM | 16 GB (32 GB recommended for the BF16 + offload path) | |
| Storage | ~8 GB (FP8 ComfyUI single-file + Qwen3-4B encoder + VAE) or ~16 GB (full BF16 diffusers checkout) | |
| Software | ComfyUI (latest nightly with Klein nodes), or diffusers (main) + transformers + accelerate | |

The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships a 4.07 GB FP8 single-file (flux-2-klein-4b-fp8.safetensors) used by the ComfyUI path, per the HF tree API. The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is one consolidated 7.75 GB transformer file (flux-2-klein-4b.safetensors) plus the Qwen3-4B text encoder and the Flux.2-family VAE.
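
Before downloading, it can help to confirm the target drive has room for whichever path you pick. The following is a minimal sketch using only the standard library. The 8 GB and 16 GB thresholds are the approximate storage figures from the requirements above, and the names `REQUIRED_GB` and `has_room` are made up for this example.

```python
import shutil
from typing import Optional

# Approximate download sizes from the requirements above, in GB.
REQUIRED_GB = {"comfyui_fp8": 8, "diffusers_bf16": 16}


def has_room(path: str, route: str, free_bytes: Optional[int] = None) -> bool:
    """Return True if `path` has enough free space for the chosen route.

    `free_bytes` can be passed explicitly (useful for testing); otherwise
    the free space of the filesystem containing `path` is queried.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= REQUIRED_GB[route] * 1024**3


# Check the current directory's drive for both routes.
for route in REQUIRED_GB:
    print(f"{route}: enough free space = {has_room('.', route)}")
```

Point `path` at the drive that holds your ComfyUI models folder or your Hugging Face cache, since those are where the weights actually land.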

Installation

Two supported paths — pick one. Path A is the recommended default for an RTX 5070 Ti because the FP8 path leverages the card's native Blackwell FP8 tensor cores and lands at the smallest VRAM footprint. Path B is for users who prefer Python scripting and the diffusers ecosystem.

Path A — ComfyUI FP8 (recommended for RTX 5070 Ti)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

For Blackwell sm_120 PyTorch support, install the cu128 wheels:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

The default pip install torch already includes sm_120 kernels via cu128 on current releases; the explicit --index-url above guarantees it if you have an older index pinned.
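
To confirm the installed PyTorch build can actually target the card, a quick check along these lines may help. It is a sketch, not an official diagnostic: `supports_sm120` is a helper name invented here, and the expectation that a 5070 Ti reports compute capability 12.0 follows from the sm_120 designation used throughout this recipe.

```python
def supports_sm120(arch_list):
    """True if a PyTorch build's architecture list includes sm_120 kernels."""
    return "sm_120" in arch_list


try:
    import torch
except ImportError:  # keep the script usable where PyTorch is not installed
    torch = None

if torch is not None and torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    print("PyTorch built against CUDA:", torch.version.cuda)
    print("sm_120 kernels present:", supports_sm120(torch.cuda.get_arch_list()))
else:
    print("PyTorch with a working CUDA device is not available here.")
```

If the last line prints False on a 5070 Ti, reinstalling from the cu128 index as shown above is the first thing to try.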

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

| File | Folder |
|---|---|
| flux-2-klein-4b-fp8.safetensors (4.07 GB, distilled), or flux-2-klein-base-4b-fp8.safetensors (base) | ComfyUI/models/diffusion_models/ |
| qwen_3_4b.safetensors | ComfyUI/models/text_encoders/ |
| flux2-vae.safetensors | ComfyUI/models/vae/ |

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
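
A misplaced file is an easy mistake to make with three separate folders, so a small check like the one below can save a failed workflow load. This is a sketch under two assumptions: your checkout lives in a folder named `ComfyUI` (change the path if not), and you downloaded the distilled file rather than the base one. `EXPECTED_FILES` and `missing_files` are names invented for this example.

```python
from pathlib import Path

# Relative locations of the three files listed above (distilled 4B variant).
EXPECTED_FILES = [
    "models/diffusion_models/flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/flux2-vae.safetensors",
]


def missing_files(comfy_root):
    """Return the expected files that are not present under `comfy_root`."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# "ComfyUI" is an assumed checkout location; adjust to your own path.
missing = missing_files("ComfyUI")
if missing:
    for rel in missing:
        print("missing:", rel)
else:
    print("all three Klein files are in place")
```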

Path B — Diffusers (Python, BF16 + CPU offload)

1. Install dependencies

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel.

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is kept enabled for the 16 GB 5070 Ti, which is what holds peak VRAM near the documented ~13 GB envelope.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4B", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the values published in the BFL model-card example for the distilled variant. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — is a 50-step model per the official Flux.2 README, which recommends Base for fine-tuning, LoRA training, and maximum flexibility.

Running

ComfyUI (Path A)

python main.py --listen

Open http://localhost:8188 and load one of the official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant, use 4 steps; the base variant is a 50-step model per the official Flux.2 README.

Diffusers (Path B)

Save the Path B example as flux_klein.py, then run it:

python flux_klein.py

Output flux-klein.png lands in your working directory. First run downloads the BF16 weights from the Hub (consolidated transformer ~7.75 GB + Qwen3-4B encoder + VAE) into ~/.cache/huggingface/. The first generation is also slower because of one-time warm-up (loading weights, initializing the CUDA context and kernels); steady-state generations are faster.

Results

  • Speed: No first-party generation-time figure for Klein 4B on the RTX 5070 Ti has been published at the time of writing, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rtx-5070-ti is unknown). BFL describes Klein as a sub-second model (the official Flux.2 GitHub README says "Generate or edit images under a second on modern hardware"), but that is a capability statement, not a hardware-named measurement. The only GPU-named timing published so far is for the RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists "~1.2s (5090) · 8.4GB VRAM" for distilled and "~17s (5090) · 9.2GB VRAM" for base, both at FP8. That timing does not transfer to the RTX 5070 Ti. Both cards are Blackwell sm_120, but the 5070 Ti has roughly 59% fewer CUDA cores (8,960 vs 21,760) and half the memory bandwidth (896 GB/s vs 1,792 GB/s), so expect per-image time to be materially slower than the 5090's ~1.2 s. Report your measured 5070 Ti generation time via /contribute so /check/flux-2-klein-4b/rtx-5070-ti gets a real benchmark.
  • VRAM usage (FP8 ComfyUI, recommended path): ~8.4 GB peak for the distilled variant and ~9.2 GB peak for the base variant, per the official docs.comfy.org tutorial ("~1.2s (5090) · 8.4GB VRAM" distilled, "~17s (5090) · 9.2GB VRAM" base). Those measurements were taken on an RTX 5090, but peak VRAM is driven by what is resident rather than by how fast the card is: the same FP8 single-file, Qwen3-4B encoder, VAE, and activations are loaded on the 5070 Ti, which shares the sm_120 architecture, so the ~8.4 GB envelope should carry over and sit comfortably inside the 5070 Ti's 16 GB. The FP8 single-file is 4.07 GB on disk per the HF tree API, which is consistent with an 8.4 GB runtime peak once the encoder, VAE, and activations are also resident.
  • VRAM usage (BF16 diffusers, Path B alternative): ~13 GB peak with enable_model_cpu_offload() enabled per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above" — RTX 5070 Ti is covered by the "and above" wording). This is the higher-envelope path; the frontmatter min_vram_gb: 13 reflects it so the compatibility math stays honest for users who pick Path B. The official Flux.2 GitHub README separately states "Klein 4B fits in ~8GB VRAM (RTX 3090/4070 and up)" — the lower ~8 GB figure refers to the FP8 ComfyUI single-file path (Path A above), while the ~13 GB figure refers to the offloaded BF16 diffusers path. Both fit the 5070 Ti with room to spare. See /check/flux-2-klein-4b/rtx-5070-ti for community benchmark data as it lands.
  • Quality notes: Klein is the small, distilled member of the Flux.2 family. Expect strong prompt adherence for a 4-billion-parameter model, with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev base). The 4-step distilled schedule makes iteration fast on Blackwell.
  • License: Apache 2.0 — commercial use permitted for the 4B variant (per the model card).
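
Since no 5070 Ti timing exists yet, a consistent way to measure one is useful before reporting it. The sketch below is one possible harness, not an official benchmark method: `time_generations` and `summarize` are names invented here, the warm-up count and run count are arbitrary defaults, and `generate` stands for whatever callable produces one image in your setup.

```python
import statistics
import time


def time_generations(generate, runs=5, warmup=1, sync=None):
    """Call `generate()` `warmup` times untimed, then time `runs` calls.

    `sync` is an optional zero-argument callable invoked before starting and
    after finishing each timed call, so queued GPU work is included.
    Returns a list of per-run durations in seconds.
    """
    for _ in range(warmup):
        generate()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of durations to median, fastest, and slowest run."""
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }
```

With Path B, `generate` could be a lambda that calls `pipe(...)` with the same arguments as the example script, and `sync` could be `torch.cuda.synchronize`. Reporting the median rather than the first run keeps one-time warm-up cost out of the number.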

For the full benchmark data, see /check/flux-2-klein-4b/rtx-5070-ti.

Troubleshooting

"OOM during VAE decode after the diffusion steps finish (Path B, offload enabled)"

If you run the BF16 diffusers path (Path B) with enable_model_cpu_offload() and limited system RAM, the failure mode reported in flux2 Issue #11 can apply: the diffusion steps complete, but the OS kills the process (system RAM OOM) during the final VAE-decode stage, because offload trades GPU VRAM for system RAM.

Note which variant the report covers. Issue #11 describes the failure on FLUX.2-dev (the larger Mistral3-encoder variant, running bnb-4bit), not Klein. The failing step is VAE decode, though, which is shared across the Flux.2 family runtime path under enable_model_cpu_offload(), so it transfers to Klein. The Mistral3-encoder-specific workaround code earlier in that same thread (the FLUX.2-dev-bnb-4bit 4-bit text-encoder snippet) does NOT transfer, because Klein uses a Qwen3-4B encoder, not Mistral3.

The community reporter neuhsm posted a working fix in issuecomment-3596576394: split the pipeline into a text-to-latent stage, then explicitly call gc.collect() and torch.cuda.empty_cache() before running VAE decode separately. 32 GB of system RAM is a comfortable floor for the offload path. Using Path A instead (FP8 ComfyUI, no CPU offload needed at ~8.4 GB) sidesteps the issue entirely on the 5070 Ti.
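
The shape of that fix can be sketched without committing to pipeline internals. The code below is an illustration only, not the code from the issue comment: `release_memory` and `run_in_stages` are names invented here, and `make_latents` and `decode` are placeholder callables standing in for the text-to-latent stage and the separate VAE decode, whose exact Klein calls are not shown in this recipe.

```python
import gc


def release_memory():
    """Collect Python garbage, then release PyTorch's cached CUDA blocks.

    Returns True if the CUDA cache was cleared, False if PyTorch or a CUDA
    device is unavailable. Objects that are still referenced are not freed,
    so drop references to finished stages before calling this.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


def run_in_stages(make_latents, decode):
    """Run the latent stage, free memory, then run decode as its own step."""
    latents = make_latents()   # placeholder: the text-to-latent stage
    release_memory()           # the explicit cleanup between the two stages
    return decode(latents)     # placeholder: the separate VAE decode
```

The point of the split is ordering: memory held by the denoising stage is given back before the decode stage asks for its own working set, instead of both peaking together.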

"FlashAttention-2 crash on first inference"

This only applies if you've manually set attn_implementation="flash_attention_2" somewhere in a wrapper or custom node. FlashAttention-2 kernels for sm_120 are still an open item at Dao-AILab/flash-attention#2168, and RTX 50-series (Blackwell sm_120) cards crash at the first inference call. The diffusers default attention is scaled_dot_product_attention (SDPA), which has native sm_120 kernels in any cu128 PyTorch build and is the always-works fallback. The official Klein snippet does not request FA2, so only a custom override would trigger this. If you hit it, switch the override to "sdpa" (or "eager").
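
If the override lives in your own wrapper, one way to make it safe is to downgrade the request automatically on sm_120 cards. This is a hypothetical helper, not part of diffusers or transformers: `safe_attn_implementation` is a made-up name, and the capability tuple is assumed to be what `torch.cuda.get_device_capability()` returns, with (12, 0) corresponding to sm_120.

```python
def safe_attn_implementation(requested, compute_capability):
    """Return "sdpa" when FlashAttention-2 is requested on an sm_120 GPU.

    `requested` is the attention implementation string your wrapper would
    pass along; `compute_capability` is a (major, minor) tuple. Any other
    combination is returned unchanged.
    """
    if requested == "flash_attention_2" and tuple(compute_capability) == (12, 0):
        return "sdpa"
    return requested


# Example: a wrapper asked for FA2 on a Blackwell consumer card.
print(safe_attn_implementation("flash_attention_2", (12, 0)))
```

Once the upstream issue is resolved, the special case can simply be removed.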

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses (Issue #11 on the flux2 repo is a Flux.2-dev bnb-4bit report — its encoder-specific workaround does not transfer to Klein, which has a different encoder family entirely). Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
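
When several encoder files share a folder, reading a file's header is a cheap way to see what it contains without loading any weights. The sketch below relies on the safetensors layout (an 8-byte little-endian length followed by a JSON header). The function names are invented for this example, and which tensor names a Klein encoder file uses is not something this recipe documents, so the code only summarizes dtypes and leaves the names for you to inspect.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len))


def dtype_counts(header):
    """Count tensors per dtype string, skipping the optional __metadata__ entry."""
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

Calling `dtype_counts(read_safetensors_header(path))` on each candidate file and printing a few keys of the header is usually enough to tell a Qwen3 encoder from a leftover T5 or Mistral3 file.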

"Want the lowest VRAM / can I run the 9B variant?"

On the 5070 Ti, Path A (FP8 ComfyUI, ~8.4 GB distilled) is already well within the 16 GB budget, so there's rarely a need to go lower. The docs.comfy.org Klein tutorial also lists Klein 9B files, but the 9B variants are under the FLUX Non-Commercial License (not Apache-2.0), and their timing/VRAM figures aren't published on the same tutorial page — stay on the 4B distilled variant unless you have a research-only use case and have measured the 9B footprint yourself.

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.