Flux.2 Klein 4B on RTX 4070: FP8 ComfyUI 4-Step Text-to-Image at ~8.4 GB

What You'll Build

Generate 1024×1024 images locally on an RTX 4070 with Black Forest Labs' Flux.2 Klein 4B, the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family. The RTX 4070 is the consumer card that Black Forest Labs names explicitly: the Klein 4B model card's Hardware section states "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". This recipe leads with the FP8 ComfyUI path because the 4070's Ada Lovelace AD104 (sm_89) has 4th-generation tensor cores with FP8 support (E4M3/E5M2), and the FP8 distilled variant runs at a ~8.4 GB peak, which leaves comfortable headroom inside the 4070's 12 GB budget.

Hardware data: RTX 4070 (12 GB VRAM) · ~8.4 GB peak VRAM at FP8 distilled (FP8 runtime-graph property) · 4-step distilled generation · See benchmark data

ℹ️ This recipe targets Klein 4B (Apache-2.0), not the 9B variants. The official Flux.2 README ships Klein in several sizes — only the 4B models (distilled and base) are Apache-2.0 and free for commercial use. The 9B and 9B-KV variants are released under the FLUX Non-Commercial License. Everything below pins the 4B distilled variant; the file names and VRAM figures are 4B-specific.

⚠️ 12 GB budget: lead with the FP8 path, not BF16. The RTX 4070's 12 GB budget (≈10.5–11.3 GB usable once a display is attached) does not comfortably hold the ~13 GB BF16 + CPU-offload diffusers path documented on the BFL model card. On the 4070, use Path A (FP8 ComfyUI, ~8.4 GB) — that is why min_vram_gb is set to 8 here. The ~13 GB BF16 + offload path is documented below only as a 16 GB-card / headless alternative, not as a 12 GB display-card path.
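The budget argument in the warning above reduces to subtracting a path's peak VRAM from what the card has free. A minimal sketch of that arithmetic follows, using only the figures quoted in this recipe; the names `PEAK_GB`, `headroom_gb`, and `fits` are illustrative and not part of any tool.

```python
# Peak-VRAM figures and the usable-VRAM range are the ones quoted in this recipe.
# All names here are illustrative.
PEAK_GB = {"fp8_comfyui": 8.4, "bf16_offload": 13.0}
USABLE_GB_WITH_DISPLAY = (10.5, 11.3)  # 12 GB card with a display attached


def headroom_gb(peak_gb: float, usable_gb: float) -> float:
    """Return usable VRAM minus peak VRAM; a negative value means no fit."""
    return round(usable_gb - peak_gb, 2)


def fits(peak_gb: float, usable_gb: float) -> bool:
    """True when the path's peak stays at or below the usable VRAM."""
    return headroom_gb(peak_gb, usable_gb) >= 0


if __name__ == "__main__":
    low, high = USABLE_GB_WITH_DISPLAY
    for name, peak in PEAK_GB.items():
        # Pessimistic case: the low end of the usable range.
        print(f"{name}: headroom {headroom_gb(peak, low):+.1f} GB at {low} GB usable, "
              f"{headroom_gb(peak, high):+.1f} GB at {high} GB usable")
```

With these inputs the FP8 path keeps positive headroom at both ends of the usable range, while the BF16 + offload path is negative at both ends, which is the reason Path A leads on this card.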

Requirements

Component	Minimum	Tested
GPU	~9 GB VRAM for the FP8 ComfyUI path (peak ~8.4 GB)	RTX 4070 (12 GB)
RAM	16 GB (32 GB recommended if you try the BF16 + offload path)	—
Storage	~8 GB (FP8 ComfyUI single-file + Qwen3-4B encoder + VAE)	—
Software	ComfyUI (latest nightly with Klein nodes) — OR — `diffusers` (main) + `transformers` + `accelerate`	—

The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships a 4.07 GB FP8 single-file (flux-2-klein-4b-fp8.safetensors) used by the ComfyUI path, per the HF tree API. FP8 runs natively on the RTX 4070's 4th-gen Ada tensor cores (E4M3/E5M2) — no special build is required. The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is one consolidated 7.75 GB transformer file (flux-2-klein-4b.safetensors) plus the 8.05 GB Qwen3-4B text encoder shards (in text_encoder/) and the 168 MB Flux.2-family VAE — that BF16 path is the higher-VRAM alternative below, not the recommended 12 GB path.
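To make the size comparison concrete, the sketch below simply sums the file sizes quoted in the paragraph above. It covers only sizes stated in this recipe; the size of the ComfyUI `qwen_3_4b.safetensors` encoder file is not stated here, so it is deliberately left out rather than guessed. The dictionary keys and function name are illustrative.

```python
# File sizes (GB) as quoted in this recipe; keys and names are illustrative.
BF16_CHECKOUT_GB = {
    "transformer (flux-2-klein-4b.safetensors)": 7.75,
    "text_encoder/ (Qwen3-4B shards)": 8.05,
    "vae": 0.168,
}
FP8_SINGLE_FILE_GB = 4.07


def total_gb(parts: dict) -> float:
    """Sum component sizes and round to two decimals."""
    return round(sum(parts.values()), 2)


if __name__ == "__main__":
    bf16_total = total_gb(BF16_CHECKOUT_GB)
    saving = round(
        BF16_CHECKOUT_GB["transformer (flux-2-klein-4b.safetensors)"] - FP8_SINGLE_FILE_GB, 2
    )
    print(f"BF16 diffusers checkout: ~{bf16_total} GB on disk")
    print(f"FP8 transformer is ~{saving} GB smaller than the BF16 transformer")
```

The BF16 checkout comes to roughly 16 GB on disk, and the FP8 single-file is a little over half the size of the BF16 transformer.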

Installation

The recommended path on the RTX 4070 is Path A (ComfyUI FP8) — it leverages the card's native Ada FP8 tensor cores and lands at ~8.4 GB peak, well inside the 12 GB budget. Path B (BF16 diffusers + offload) is documented as an alternative for users with a 16 GB card or a headless box; on a 12 GB display card it is too tight (see the warning above).

On Ada (sm_89) the default PyTorch CUDA wheels (cu124) already ship the right kernels — pip install torch works as-is, and FlashAttention-2 prebuilt wheels cover sm_89, so no cu128 index-URL and no attn_implementation override are required. (Those are Blackwell sm_120 workarounds; they do not apply to the 4070.)

Path A — ComfyUI FP8 (recommended for RTX 4070)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

As noted above, the default pip install torch build (cu124) already includes the sm_89 kernels the RTX 4070 needs, so the requirements install does not need --index-url https://download.pytorch.org/whl/cu128 (that flag is a Blackwell sm_120 requirement).

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

File	Folder
`flux-2-klein-4b-fp8.safetensors` (4.07 GB, distilled)	`ComfyUI/models/diffusion_models/`
`qwen_3_4b.safetensors`	`ComfyUI/models/text_encoders/`
`flux2-vae.safetensors`	`ComfyUI/models/vae/`

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
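A misplaced file is an easy mistake at this step, so a quick placement check can save a failed workflow load. The sketch below is a minimal example that looks for the three files from the table in their folders; the function name `missing_files` is hypothetical, and the ComfyUI root is assumed to be a directory named `ComfyUI` unless you pass another path.

```python
from pathlib import Path

# Folder -> file name, taken from the table above (paths relative to the ComfyUI root).
REQUIRED_FILES = {
    "models/diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders": "qwen_3_4b.safetensors",
    "models/vae": "flux2-vae.safetensors",
}


def missing_files(comfyui_root: str) -> list[str]:
    """Return the relative paths of required files that are not present."""
    root = Path(comfyui_root)
    missing = []
    for folder, filename in REQUIRED_FILES.items():
        relative = Path(folder) / filename
        if not (root / relative).is_file():
            missing.append(relative.as_posix())
    return missing


if __name__ == "__main__":
    gaps = missing_files("ComfyUI")  # assumed checkout location; adjust as needed
    print("all files in place" if not gaps else "missing: " + ", ".join(gaps))
```

The check only confirms that each file exists under the expected name; it does not verify size or integrity.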

Path B — Diffusers (BF16 + CPU offload) — 16 GB-card / headless alternative

On a 12 GB display card this path is too tight (peak ~13 GB vs ~10.5–11.3 GB usable). Use it only on a 16 GB card, or on a headless 12 GB box where the full VRAM is free. It is included for completeness and because the install steps are the canonical BFL example.

1. Install dependencies

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel. On Ada the default pip install torch (cu124) already ships sm_89 kernels, so no cu128 index-URL is needed here either.

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is what holds peak VRAM near the documented ~13 GB envelope.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4B", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the values published in the BFL model-card example for the distilled variant. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — is a 50-step model per the official Flux.2 README, which recommends Base for fine-tuning, LoRA training, and maximum flexibility.

Running

ComfyUI (Path A — recommended)

python main.py --listen

Open http://localhost:8188 and load one of the official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant, use 4 steps at CFG 1.0; the base variant is a 50-step model per the official Flux.2 README. Peak VRAM stays near ~8.4 GB on the FP8 distilled path, so the 12 GB card has headroom for the display and a browser during generation.

Diffusers (Path B — alternative)

python flux_klein.py

Output flux-klein.png lands in your working directory. First run downloads the BF16 weights from the Hub (consolidated transformer ~7.75 GB + Qwen3-4B encoder ~8.05 GB + VAE) into ~/.cache/huggingface/. The first generation includes one-time pipeline compilation / shader-cache build; steady-state generations are faster. On a 12 GB display card this path will sit very close to (or over) the usable VRAM ceiling — prefer Path A.
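Because the first generation carries one-time overhead, a useful timing separates it from the later runs. The following is a minimal sketch of such a harness using only the standard library; `time_runs` is a hypothetical helper, and it assumes the callable you pass returns only after the image is finished (which should hold for a pipeline call that returns a decoded image, though that is an assumption here).

```python
import statistics
import time
from typing import Callable


def time_runs(generate: Callable[[], object], runs: int = 5) -> dict:
    """Call `generate` `runs` times; report the first run and the median of the rest."""
    if runs < 2:
        raise ValueError("need at least 2 runs to separate warm-up from steady state")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    return {
        "first_run_s": durations[0],  # includes one-time setup cost
        "steady_state_median_s": statistics.median(durations[1:]),
        "runs": runs,
    }
```

For Path B, one way to use it would be to wrap the existing call from flux_klein.py, for example `time_runs(lambda: pipe(prompt=prompt, height=1024, width=1024, guidance_scale=1.0, num_inference_steps=4).images[0])`, and report the steady-state median rather than the first run.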

Results

Speed: No generation-time figure measured on an RTX 4070 has been published for Klein 4B at the time of writing, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rtx-4070 is unknown). The only published timing tied to a named GPU is for the RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists ~1.2s (5090) · 8.4GB VRAM for the distilled variant. That figure does not transfer to the RTX 4070: the 5090 is a different architecture generation (Blackwell sm_120 vs Ada sm_89) with far more CUDA cores and much higher memory bandwidth, so per-image time on the 4070 will be materially slower. BFL's model card describes Klein 4B as delivering "end-to-end inference in as low as under a second" but gives no measured time for the RTX 4070. Rather than publish an extrapolated number, this recipe omits a 4070 speed figure. Report your measured RTX 4070 generation time via /contribute so /check/flux-2-klein-4b/rtx-4070 gets a real benchmark.
VRAM usage (FP8 ComfyUI, recommended path): ~8.4 GB peak for the distilled variant, per the official docs.comfy.org tutorial (distilled ~1.2s (5090) · 8.4GB VRAM). That measurement was taken on an RTX 5090, but the figure is expected to carry over to the 4070: the same 4.07 GB FP8 single-file (size per the HF tree API), Qwen3-4B encoder, VAE, and activations are resident regardless of GPU, and FP8 is hardware-native on the 4070's Ada sm_89 tensor cores. The ~8.4 GB envelope therefore sits inside the 4070's 12 GB budget.
VRAM usage (BF16 diffusers alternative — NOT the 12 GB path): ~13 GB peak with enable_model_cpu_offload() enabled per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above" — the RTX 4070 is the consumer card named there). At ~13 GB this exceeds the RTX 4070's ~10.5–11.3 GB usable budget on a display card, so it is documented as a 16 GB-card / headless alternative only. The min_vram_gb: 8 in the frontmatter reflects the recommended FP8 Path A that this recipe actually leads with. See /check/flux-2-klein-4b/rtx-4070 for community benchmark data as it lands.
Quality notes: Klein is the small/distilled member of the Flux.2 family. Expect strong prompt adherence for a 4-billion-parameter model, with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev base). The 4 distilled steps make iteration fast.
License: Apache 2.0 — commercial use permitted for the 4B variant (per the model card; gated: False, weights download freely).

For the full benchmark data, see /check/flux-2-klein-4b/rtx-4070.

Troubleshooting

"OOM during VAE decode after the diffusion steps finish (Path B, offload enabled)"

If you run the BF16 diffusers path (Path B) with enable_model_cpu_offload() on a machine with limited system RAM, the failure mode reported in flux2 Issue #11 can apply: the diffusion steps complete, but the OS kills the process (system RAM OOM) during the final VAE-decode stage, because offload trades GPU VRAM for system RAM.

Variant note: Issue #11 (titled "3090 24G: cuda out of memory") reports the failure on FLUX.2-dev (the larger Mistral3-encoder variant, running bnb-4bit), not Klein. The sub-failure described here occurs at the VAE-decode stage, which is shared across the Flux.2 family runtime path under enable_model_cpu_offload(), so it transfers to Klein.

Fix: the community reporter neuhsm (role NONE on the repo, not a maintainer) posted a working fix in issuecomment-3596576394. Split the pipeline into a text-to-latent stage (run the DiT, return latents, release the text encoder and DiT from memory), then explicitly clear memory before running VAE decode separately. The diffusers docs also document offloading to disk as an alternative.

The Mistral3-encoder-specific workaround code earlier in the same thread (the FLUX.2-dev-bnb-4bit 4-bit text-encoder snippet) does NOT transfer to Klein, which uses a Qwen3-4B encoder rather than Mistral3. On the RTX 4070 the cleanest fix is to avoid the offload path entirely and use Path A (FP8 ComfyUI, no CPU offload needed at ~8.4 GB).
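Since this failure is driven by system RAM rather than VRAM, it can help to compare the machine's installed RAM against the 16 GB minimum and 32 GB recommendation from the Requirements table before trying Path B. The sketch below is one way to do that on POSIX systems with the standard library; `total_ram_gib` and `ram_tier` are hypothetical helpers, and the 1 GiB tolerance is an assumption to allow for memory the firmware or OS reserves.

```python
import os


def total_ram_gib():
    """Total physical RAM in GiB via POSIX sysconf, or None where unsupported."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf is unavailable
    return pages * page_size / 1024**3


def ram_tier(gib: float, tolerance_gib: float = 1.0) -> str:
    """Classify RAM against this recipe's 16 GB minimum / 32 GB recommendation."""
    # tolerance_gib is an assumed allowance for reserved memory, not a measured value.
    if gib + tolerance_gib >= 32:
        return "meets the 32 GB recommendation for Path B"
    if gib + tolerance_gib >= 16:
        return "meets the 16 GB minimum"
    return "below the 16 GB minimum"


if __name__ == "__main__":
    ram = total_ram_gib()
    print("could not determine RAM" if ram is None else f"{ram:.1f} GiB: {ram_tier(ram)}")
```

This reports installed RAM, not free RAM; other running applications reduce what the offload path can actually use.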

"FlashAttention-2 on the RTX 4070"

Unlike Blackwell (sm_120) cards, the RTX 4070's Ada sm_89 architecture is fully covered by the prebuilt FlashAttention-2 wheels — there is no sm_120 kernel gap to work around, and the default cu124 PyTorch build already ships the right kernels. You do not need to override attn_implementation to "sdpa" or "eager" on the 4070; the diffusers default attention works as-is. (The FA2 sm_120 crash and the cu128 wheel requirement are Blackwell-only concerns and do not apply to this card.)

"Distorted colors / washed-out output"

The most likely cause is the wrong VAE loaded in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json); loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm that the VAE file in ComfyUI/models/vae/ matches the filename above and that the workflow's VAE loader node points to it.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 dev uses. Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2-dev Mistral3 encoder you might still have on disk.
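When it is unclear which file is actually on disk, the header of a .safetensors file can be read without loading any tensor data: the format starts with an 8-byte little-endian length followed by a JSON header that lists each tensor's dtype and shape. The sketch below counts tensors per dtype; the helper names are hypothetical, and which dtype strings or tensor names the Klein files contain is not stated in this recipe, so treat the output as something to inspect rather than a pass/fail check.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file (no tensor data is loaded)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 length prefix
        return json.loads(f.read(header_len))


def dtype_counts(path: str) -> Counter:
    """Count tensors per dtype string, skipping the optional __metadata__ entry."""
    header = read_safetensors_header(path)
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


def first_tensor_names(path: str, limit: int = 5) -> list[str]:
    """Return a few tensor names, which often hint at the model family."""
    header = read_safetensors_header(path)
    return [name for name in header if name != "__metadata__"][:limit]
```

For example, `dtype_counts("ComfyUI/models/text_encoders/qwen_3_4b.safetensors")` shows the precision mix of the encoder file, and `first_tensor_names(...)` gives a rough view of its layer naming so you can compare it against an encoder file left over from another model.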

"Can I run the BF16 diffusers path on my 12 GB RTX 4070?"

It's tight and not recommended. The BF16 + offload path peaks near ~13 GB per the BFL model card, which exceeds the ~10.5–11.3 GB usable on a 12 GB display card. Stay on Path A (FP8 ComfyUI, ~8.4 GB). If you have a second, headless 12 GB card with no display attached (so the full VRAM is free), you may have enough room, but the FP8 path should still be the better fit on Ada because it uses the native FP8 tensor cores and avoids CPU offload. If you measure the BF16 path fitting on your 4070, report it via the submission form.
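If you do want to measure whether a path fits, one approach is to poll nvidia-smi while a generation runs and keep the highest reading. The sketch below is a minimal example of that; `parse_memory_line`, `sample_used_mib`, and `watch_peak` are hypothetical helpers, and the polling interval is an assumption. The reading covers everything on the GPU (display and browser included), and short spikes between polls can be missed.

```python
import shutil
import subprocess
import time

# memory.used and memory.total are reported in MiB when units are suppressed.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line from nvidia-smi into integers (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def sample_used_mib(gpu_index: int = 0) -> int:
    """Return current used VRAM in MiB for one GPU; raises if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.strip().splitlines()[gpu_index])[0]


def watch_peak(seconds: float, interval: float = 0.5) -> int:
    """Poll for `seconds` and return the highest used-VRAM reading in MiB."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        peak = max(peak, sample_used_mib())
        time.sleep(interval)
    return peak
```

Start `watch_peak(60)` in a second terminal or thread, trigger a generation, and include the peak alongside the path (FP8 ComfyUI or BF16 diffusers) when you report it.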

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.