Flux.2 Klein 4B on RTX 3060: FP8 ComfyUI 4-Step Text-to-Image at ~8.4 GB

image · beginner · 8 GB+ VRAM · Jun 14, 2026

This beginner recipe sets up Flux.2-Klein-4B on the RTX 3060, needing about 8.4 GB of VRAM at peak on the recommended FP8 ComfyUI path.

prerequisites
  • NVIDIA RTX 3060 (12 GB VRAM) — Ampere GA106 sm_86
  • Python 3.10+ (Python 3.12 for the official BFL repo)
  • ComfyUI (latest nightly with Klein nodes) or the `diffusers` Python package

What You'll Build

Generate 1024×1024 images locally with Black Forest Labs' Flux.2 Klein 4B — the Apache-2.0, 4-billion-parameter, step-distilled member of the Flux.2 family — on an RTX 3060 12 GB. This recipe leads with the FP8 ComfyUI path because it lands at a ~8.4 GB peak, which fits comfortably inside the 3060's 12 GB budget. On the 3060's Ampere GA106 (sm_86) the FP8 weight file is a memory escape hatch, not a speed boost — see the architecture note below.

Hardware data: RTX 3060 (12 GB VRAM) · ~8.4 GB peak VRAM at FP8 distilled (FP8 runtime-graph property) · 4-step distilled generation

ℹ️ This recipe targets Klein 4B (Apache-2.0), not the 9B variants. The official Flux.2 README ships Klein in several sizes — only the 4B models (distilled and base) are Apache-2.0 and free for commercial use. The 9B and 9B-KV variants are released under the FLUX Non-Commercial License. Everything below pins the 4B distilled variant; the file names and VRAM figures are 4B-specific.

⚠️ The RTX 3060 sits just below the card Black Forest Labs names. The Klein 4B model card's Hardware section states "The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above". The RTX 3060 is below that named floor: it meets the 12 GB VRAM bar but has a slower GA106 GPU and lower memory bandwidth (360 GB/s, per TechPowerUp) than the RTX 3090 (936 GB/s) or RTX 4070 (504 GB/s) that BFL lists. The FP8 ComfyUI path below fits the 12 GB budget at ~8.4 GB, so the model runs — but expect materially slower per-image generation than on the cards BFL names, and do not read BFL's statement as an endorsement of the 3060.
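
To put the bandwidth gap in numbers, here is a minimal sketch that computes how much more memory bandwidth each BFL-named card has than the 3060. It uses only the TechPowerUp figures quoted above; the dictionary and function names are illustrative. Bandwidth ratio is a rough proxy at best, not a prediction of generation time.

```python
# Memory bandwidth in GB/s, as quoted above (TechPowerUp figures).
BANDWIDTH_GBPS = {
    "RTX 3060": 360,
    "RTX 4070": 504,
    "RTX 3090": 936,
}


def bandwidth_ratio(card: str, baseline: str = "RTX 3060") -> float:
    """Return how many times more memory bandwidth `card` has than `baseline`."""
    return BANDWIDTH_GBPS[card] / BANDWIDTH_GBPS[baseline]


for name in ("RTX 4070", "RTX 3090"):
    print(f"{name}: {bandwidth_ratio(name):.1f}x the RTX 3060's bandwidth")
```

The gap to the RTX 4070 is modest while the gap to the RTX 3090 is large, which is one reason to treat the "3090/4070 and above" floor as a performance expectation rather than a pure VRAM requirement.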

🧩 FP8 on Ampere is a memory trick, not a speed-up. The 3060's Ampere sm_86 has no FP8 tensor cores (FP8 first shipped on Hopper sm_90 and Ada sm_89). The FP8 weight file loads on the 3060 and keeps VRAM low, but at compute time the runtime dequantizes the weights to BF16/FP16 per operation — so you get the VRAM saving without the speed boost an Ada or Blackwell card enjoys. The drbaph HiDream-O1 FP8 card documents this behavior verbatim for the whole class of older GPUs: "On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." See the architecture note in Troubleshooting.

Requirements

Component | Minimum | Tested
GPU | ~9 GB VRAM for the FP8 ComfyUI path (peak ~8.4 GB) | RTX 3060 (12 GB)
RAM | 16 GB (32 GB recommended if you try the BF16 + offload path) |
Storage | ~8 GB (FP8 ComfyUI single-file + Qwen3-4B encoder + VAE) |
Software | ComfyUI (latest nightly with Klein nodes), or diffusers (main) + transformers + accelerate |

The official black-forest-labs/FLUX.2-klein-4b-fp8 repository ships a 4.07 GB FP8 single-file (flux-2-klein-4b-fp8.safetensors) used by the ComfyUI path, per the HF tree API. The FP8 weights load on the 3060's Ampere sm_86 cores (as a memory escape hatch — see the architecture note above), so no special build is required. The full BF16 diffusers checkout from black-forest-labs/FLUX.2-klein-4B is one consolidated 7.75 GB transformer file (flux-2-klein-4b.safetensors) plus the Qwen3-4B text encoder shards (in text_encoder/) and the Flux.2-family VAE — that BF16 + offload path peaks near ~13 GB and does not fit a 12 GB display card; it is documented below only as a 16 GB-card / headless alternative.
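
The two file sizes above follow roughly from bytes per parameter. The sketch below is a back-of-the-envelope check, assuming "4B" means exactly 4 billion parameters (the real count may differ slightly, and an FP8 checkpoint may keep some tensors at higher precision, which is an assumption here rather than something the repository states).

```python
def weight_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 4e9  # assumption: "4B" taken as exactly 4 billion parameters

fp8_gb = weight_size_gb(NOMINAL_PARAMS, 1)   # FP8 stores 1 byte per weight
bf16_gb = weight_size_gb(NOMINAL_PARAMS, 2)  # BF16 stores 2 bytes per weight

print(f"FP8  estimate: {fp8_gb:.1f} GB (published file: 4.07 GB)")
print(f"BF16 estimate: {bf16_gb:.1f} GB (published file: 7.75 GB)")
```

The estimates land close to the published sizes, which is the whole appeal of FP8 on a 12 GB card: the transformer weights take about half the space.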

Installation

The recommended path on the RTX 3060 is Path A (ComfyUI FP8) — it lands at ~8.4 GB peak, well inside the 12 GB budget. Path B (BF16 diffusers + offload) is documented as an alternative for users with a 16 GB card or a headless box; on a 12 GB display card it is too tight (see the warning above).

On Ampere (sm_86) the default PyTorch CUDA wheels (cu124 / cu121) already ship the right kernels — pip install torch works as-is, and FlashAttention-2 prebuilt wheels cover sm_86 (the oldest FA2-supported consumer arch), so no cu128 index-URL and no attn_implementation override are required. (Those are Blackwell sm_120 workarounds; they do not apply to the 3060.)

Path A — ComfyUI FP8 (recommended for RTX 3060)

1. Update ComfyUI to a build with Klein support

Klein support landed in ComfyUI's nightly nodes; an older build will fail to load the workflow. From your ComfyUI checkout:

git pull
pip install -r requirements.txt

The default pip install torch (cu124) already includes the sm_86 kernels the RTX 3060 needs — unlike Blackwell (sm_120) cards, no special --index-url https://download.pytorch.org/whl/cu128 is required for the 3060.

2. Download the required files

Per the official ComfyUI Flux.2 Klein tutorial, the 4B distilled variant needs three files:

File | Folder
flux-2-klein-4b-fp8.safetensors (4.07 GB, distilled) | ComfyUI/models/diffusion_models/
qwen_3_4b.safetensors | ComfyUI/models/text_encoders/
flux2-vae.safetensors | ComfyUI/models/vae/

The VAE (flux2-vae.safetensors) is the Flux.2 family VAE — shared across Klein / Dev / Pro and distinct from the Flux.1 VAE. The text encoder is Qwen3-4B, not the T5 family used by Flux.1 — confirmed by Klein's model_index.json which declares "text_encoder": ["transformers", "Qwen3ForCausalLM"].
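
Misplaced files are a common cause of load errors, so a quick check before launching ComfyUI can save time. The following is a minimal sketch that verifies the three files sit in the folders listed above; the function name and the idea of passing the ComfyUI checkout path as an argument are illustrative choices, not part of ComfyUI itself.

```python
from pathlib import Path

# Expected layout from the table above, relative to the ComfyUI checkout.
REQUIRED_FILES = {
    "models/diffusion_models": "flux-2-klein-4b-fp8.safetensors",
    "models/text_encoders": "qwen_3_4b.safetensors",
    "models/vae": "flux2-vae.safetensors",
}


def missing_klein_files(comfy_root: str) -> list[str]:
    """Return the relative paths of required files that are not present."""
    root = Path(comfy_root)
    missing = []
    for folder, filename in REQUIRED_FILES.items():
        if not (root / folder / filename).is_file():
            missing.append(f"{folder}/{filename}")
    return missing
```

For example, `missing_klein_files("ComfyUI")` returns an empty list when all three files are in place, and otherwise lists what still needs to be downloaded or moved.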

Path B — Diffusers (BF16 + CPU offload) — 16 GB-card / headless alternative

On a 12 GB display card this path is too tight (peak ~13 GB vs ~10.5–11.3 GB usable). Use it only on a 16 GB card, or on a headless 12 GB box where the full VRAM is free. It is included for completeness and because the install steps are the canonical BFL example.
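
The "too tight" verdict is simple arithmetic. Here is a small sketch that subtracts each path's documented peak from the usable range on a 12 GB display card. All figures come from this recipe; the names are illustrative.

```python
def vram_headroom_gb(peak_gb: float, usable_gb: float) -> float:
    """Positive result means spare VRAM; negative means the peak does not fit."""
    return usable_gb - peak_gb


USABLE_RANGE_GB = (10.5, 11.3)  # usable VRAM on a 12 GB display card, per this recipe
PEAKS_GB = {"Path A (FP8 ComfyUI)": 8.4, "Path B (BF16 + offload)": 13.0}

for path, peak in PEAKS_GB.items():
    worst = vram_headroom_gb(peak, USABLE_RANGE_GB[0])
    best = vram_headroom_gb(peak, USABLE_RANGE_GB[1])
    verdict = "fits" if worst > 0 else "does not fit"
    print(f"{path}: headroom {worst:+.1f} to {best:+.1f} GB -> {verdict}")
```

Path A keeps a couple of gigabytes free even in the worst case, while Path B is short in both cases, which is why it is listed only as a 16 GB-card or headless alternative.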

1. Install dependencies

pip install -U git+https://github.com/huggingface/diffusers.git transformers accelerate

The Flux.2 Klein pipelines (Flux2KleinPipeline) require an up-to-date diffusers — the BFL model card pins git+https://github.com/huggingface/diffusers.git rather than the released wheel. On Ampere the default pip install torch (cu124) already ships sm_86 kernels, so no cu128 index-URL is needed here either.

2. Run the official example

This is the exact snippet published on the model card at huggingface.co/black-forest-labs/FLUX.2-klein-4B, unmodified — pipe.enable_model_cpu_offload() is what holds peak VRAM near the documented ~13 GB envelope.

import torch
from diffusers import Flux2KleinPipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2KleinPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4B", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=1.0,
    num_inference_steps=4,
    generator=torch.Generator(device=device).manual_seed(0)
).images[0]
image.save("flux-klein.png")

guidance_scale=1.0 and num_inference_steps=4 are the values published in the BFL model-card example for the distilled variant. The base (undistilled) variant — black-forest-labs/FLUX.2-klein-base-4B — is a multi-step model per the official Flux.2 README, which recommends Base for fine-tuning, LoRA training, and maximum flexibility.

Running

ComfyUI (Path A — recommended)

python main.py --listen

Open http://localhost:8188 and load one of the official Klein workflow JSONs (text-to-image and image-editing, base/distilled, 4B/9B) downloadable from the docs.comfy.org tutorial page. For the distilled variant, use 4 steps at CFG 1.0 (the values published in the BFL model-card example). Peak VRAM stays near ~8.4 GB on the FP8 distilled path, so the 12 GB card has headroom for the display and a browser during generation. Per-image time will be slower than on the RTX 3090/4070 BFL names — the FP8 weights dequantize to BF16 at compute time on the 3060's Ampere cores (see the architecture note below).

Diffusers (Path B — alternative)

python flux_klein.py

This assumes you saved the Path B snippet as flux_klein.py. The output, flux-klein.png, lands in your working directory. The first run downloads the BF16 weights from the Hub (consolidated transformer ~7.75 GB + Qwen3-4B encoder + VAE) into ~/.cache/huggingface/. The first generation also pays one-time warm-up costs such as CUDA context and kernel initialization, so steady-state generations are faster. On a 12 GB display card this path will sit very close to (or over) the usable VRAM ceiling, so prefer Path A.
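
Because the first generation is slower than later ones, a fair timing should discard it. The helper below is a minimal sketch of that idea using only the standard library; `median_seconds` and its parameters are hypothetical names, and the callable you pass in is whatever triggers one generation on your setup.

```python
import statistics
import time
from typing import Callable


def median_seconds(generate: Callable[[], object], warmup: int = 1, runs: int = 3) -> float:
    """Time `generate`, discarding warm-up calls, and return the median in seconds."""
    for _ in range(warmup):
        generate()  # warm-up call pays one-time costs and is not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

With the Path B script, one way to use it would be to wrap the pipeline call in a lambda, for example `median_seconds(lambda: pipe(prompt=prompt, height=1024, width=1024, guidance_scale=1.0, num_inference_steps=4))`. The median of a few steady-state runs is a more useful number to report than a single first-run time.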

Results

  • Speed: No generation-time figure for Klein 4B on the RTX 3060 has been published at the time of writing, and the backend has no benchmark for this pair yet (/check/flux-2-klein-4b/rtx-3060 is unknown). The only measured, GPU-named timing published so far is for the RTX 5090: the official ComfyUI Flux.2 Klein tutorial lists "distilled ~1.2s (5090) · 8.4GB VRAM" for the distilled variant. That figure does not transfer to the RTX 3060. The 5090 is a far more powerful card from a different generation (Blackwell sm_120 vs Ampere sm_86) with vastly more CUDA cores and roughly 5× the memory bandwidth, and the 3060 additionally lacks FP8 tensor cores, so per-image time on the 3060 will be much slower. BFL's model card describes Klein 4B as delivering "end-to-end inference in as low as under a second", but that claim names no hardware and certainly does not describe the RTX 3060. Rather than publish an extrapolated number, this recipe omits a 3060 speed figure. Report your measured RTX 3060 generation time via /contribute so /check/flux-2-klein-4b/rtx-3060 gets a real benchmark.
  • VRAM usage (FP8 ComfyUI, recommended path): ~8.4 GB peak for the distilled variant per the official docs.comfy.org tutorial ("distilled ~1.2s (5090) · 8.4GB VRAM"). That measurement was taken on an RTX 5090, but the ~8.4 GB figure is a property of the FP8 runtime graph (the same 4.07 GB FP8 single-file, Qwen3-4B encoder, VAE, and activations are resident regardless of GPU) — and the FP8 weights load on the 3060's Ampere sm_86 cores at the same footprint, so the envelope holds and sits comfortably inside the 3060's 12 GB budget. The FP8 single-file on disk is 4.07 GB per the HF tree API, consistent with the ~8.4 GB runtime peak once the encoder, VAE, and activations are also resident.
  • VRAM usage (BF16 diffusers alternative — NOT the 12 GB path): ~13 GB peak with enable_model_cpu_offload() enabled per BFL's official Klein 4B model card ("The FLUX.2 [klein] 4B model fits in ~13GB VRAM and is accessible on NVIDIA RTX 3090/4070 and above"). At ~13 GB this exceeds the RTX 3060's ~10.5–11.3 GB usable budget on a display card, so it is documented as a 16 GB-card / headless alternative only. The min_vram_gb: 8 in the frontmatter reflects the recommended FP8 Path A that this recipe actually leads with and installs end-to-end. See /check/flux-2-klein-4b/rtx-3060 for community benchmark data as it lands.
  • Quality notes: Klein is the small/distilled member of the Flux.2 family — expect strong prompt adherence at 4-billion parameters with the usual distillation tradeoffs (less stylistic flexibility than the Flux.2 Dev base). The 4 distilled steps keep iteration tolerable even on the 3060, though each step is slower than on a card with FP8 tensor cores.
  • License: Apache 2.0 — commercial use permitted for the 4B variant (per the model card; gated: False, weights download freely).

For the full benchmark data, see /check/flux-2-klein-4b/rtx-3060.

Troubleshooting

FP8 runs but isn't faster on the RTX 3060

This is expected, not a misconfiguration. The 3060's Ampere GA106 (sm_86) has no FP8 tensor cores; FP8 compute is hardware-accelerated only on Hopper (sm_90), Ada Lovelace (RTX 40xx, sm_89), and Blackwell (RTX 50xx, sm_120). On the 3060 the FP8 weight file loads (keeping VRAM near ~8.4 GB), but the runtime dequantizes those weights to BF16/FP16 on the fly at compute time, so you get the memory saving without the speed boost. The drbaph HiDream-O1 FP8 card describes this for the whole class of older GPUs: "On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." The takeaway: keep the FP8 path because it is what fits in 12 GB, but don't expect the sub-second generation that BFL's model card advertises, or the ~1.2 s that the ComfyUI tutorial measured on an RTX 5090.
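
The architecture rule above can be written as a one-line check on CUDA compute capability. This is a simplified sketch based only on the architectures named in this note (sm_89 and newer); the function name is hypothetical.

```python
def has_fp8_tensor_cores(major: int, minor: int) -> bool:
    """True for compute capability 8.9 (Ada) and newer, per the list in this note."""
    return (major, minor) >= (8, 9)


# With PyTorch installed, torch.cuda.get_device_capability() returns a
# (major, minor) tuple for the active GPU, e.g. (8, 6) on an RTX 3060.
EXAMPLES = {
    "RTX 3060 (Ampere, sm_86)": (8, 6),
    "Ada Lovelace (sm_89)": (8, 9),
    "Hopper (sm_90)": (9, 0),
    "Blackwell (sm_120)": (12, 0),
}

for name, capability in EXAMPLES.items():
    status = "FP8 tensor cores" if has_fp8_tensor_cores(*capability) else "dequantize on the fly"
    print(f"{name}: {status}")
```

If the check returns False for your card, FP8 weights still load and still save VRAM; only the speed-up is missing.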

"OOM during VAE decode after the diffusion steps finish (Path B, offload enabled)"

If you run the BF16 diffusers path (Path B) with enable_model_cpu_offload() and limited system RAM, the failure mode reported in flux2 Issue #11 can apply: the diffusion steps complete, but the OS kills the process (system RAM OOM) during the final VAE-decode stage, because offload trades GPU VRAM for system RAM.

Note which variant the report covers. Issue #11 (titled "3090 24G: cuda out of memory") describes the failure on FLUX.2-dev (the larger Mistral3-encoder variant, running bnb-4bit), not Klein. The VAE-decode stage, however, is shared across the Flux.2 family runtime path under enable_model_cpu_offload(), so this specific sub-failure transfers to Klein. The Mistral3-specific loading code earlier in that thread (the FLUX.2-dev-bnb-4bit 4-bit text-encoder snippet) does NOT transfer, because Klein uses a Qwen3-4B encoder, not Mistral3.

A community reporter, neuhsm (not a maintainer of the repo), posted a working fix in issuecomment-3596576394: split the pipeline into a text-to-latent stage (run the DiT, return latents, release the text encoder and DiT from memory), then explicitly run gc.collect() and torch.cuda.empty_cache() before running VAE decode separately. The diffusers docs also document offloading to disk as an alternative. On the RTX 3060 the cleanest fix is to avoid the offload path entirely and use Path A (FP8 ComfyUI, no CPU offload needed at ~8.4 GB).
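
The shape of that two-stage fix can be sketched without tying it to any particular pipeline API. The code below is not the code from the issue comment; `free_memory`, `run_in_stages`, and the two stage callables are hypothetical names. It only illustrates the ordering: produce latents, drop the heavy models, reclaim memory, then decode.

```python
import gc
from typing import Callable, TypeVar

L = TypeVar("L")
R = TypeVar("R")


def free_memory() -> None:
    """Collect unreachable Python objects, then release cached CUDA blocks if possible."""
    gc.collect()
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_in_stages(make_latents: Callable[[], L], decode: Callable[[L], R]) -> R:
    # Stage 1: heavy models live only as locals inside make_latents,
    # so they become unreachable as soon as it returns the latents.
    latents = make_latents()
    # Between stages: reclaim that memory before the decoder allocates.
    free_memory()
    # Stage 2: decode with only the latents (and the decoder) resident.
    return decode(latents)
```

In practice `make_latents` would load the text encoder and DiT and return latents, and `decode` would load only the VAE. Whether memory is actually freed depends on nothing else holding references to the stage-one models, so treat this as a pattern to adapt rather than a drop-in fix.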

"FlashAttention-2 on the RTX 3060"

The RTX 3060's Ampere sm_86 architecture is fully covered by the prebuilt FlashAttention-2 wheels — sm_86 is the oldest consumer arch with stock FA2 support, so there is no kernel gap to work around, and the default cu124 / cu121 PyTorch build already ships the right kernels. You do not need to override attn_implementation to "sdpa" or "eager" on the 3060. (The FA2 sm_120 crash and the cu128 wheel requirement are Blackwell-only concerns and do not apply to this card.)

"Distorted colors / washed-out output"

You're loading the wrong VAE in ComfyUI. Klein must use flux2-vae.safetensors (the Flux.2 family VAE, shared across Klein / Dev / Pro per model_index.json) — loading any other VAE (Flux.1, SDXL, SD1.5) will produce broken output. Confirm the VAE file in ComfyUI/models/vae/ matches the filename above.

"Text encoder shape mismatch / config error"

Klein uses a Qwen3-4B text encoder per its model_index.json (text_encoder = ['transformers', 'Qwen3ForCausalLM']), not the T5 family that Flux.1 used and not the Mistral3 family that Flux.2 Dev uses. Make sure you downloaded qwen_3_4b.safetensors (or the equivalent diffusers shards from the text_encoder/ subfolder), not a Flux.1 T5 file or a Flux.2 Dev Mistral3 encoder you might still have on disk.

"Can I run the BF16 diffusers path on my 12 GB RTX 3060?"

It's too tight and not recommended. The BF16 + offload path peaks near ~13 GB per the BFL model card, which exceeds the ~10.5–11.3 GB usable on a 12 GB display card. Stay on Path A (FP8 ComfyUI, ~8.4 GB). If you have a second, headless 12 GB card with no display attached (so the full VRAM is free) you may have just enough room, but the FP8 path is the one this recipe recommends for the 12 GB tier; BFL's own model card names the RTX 3090/4070 and above. If you measure the BF16 path fitting on your 3060, report it via the submission form.

"Where do I find the official workflow JSONs?"

The docs.comfy.org Klein tutorial links downloadable workflow JSONs for all variants (text-to-image and image-editing, base/distilled, 4B/9B). The BFL GitHub repo is the canonical home of the official command-line tooling. If neither has what you need, report your setup via the submission form.

common questions
How much VRAM does Flux.2-Klein-4B need?

About 8.4 GB at peak on the FP8 ComfyUI path this recipe targets (listed as 8 GB+). The BF16 diffusers alternative peaks near ~13 GB.

Which GPUs is Flux.2-Klein-4B tested on?

RTX 3060 (12 GB).

How hard is this setup?

Beginner — follow the steps above.