
Krea 2 Turbo (FP8) on RTX 3090 via ComfyUI: 8-Step Text-to-Image, with the Raw Tier in Reach

image · intermediate · 16GB+ VRAM · Jun 24, 2026

This intermediate recipe sets up Krea 2 on the RTX 3090, needing about 16 GB of VRAM.

prerequisites
  • NVIDIA RTX 3090 (24GB VRAM) or any consumer GPU with 16GB+ VRAM
  • ComfyUI 0.25.0 or newer (native Krea 2 support, no custom nodes)
  • Recent PyTorch with a CUDA build (cu121 or cu124) — the RTX 3090's Ampere sm_86 is stock-supported, no special wheel
  • ~18GB free disk for the FP8 diffusion model, FP8 text encoder, and VAE

What You'll Build

A local install of Krea 2 Turbo, the distilled, few-step variant of Krea AI's from-scratch, aesthetic-first text-to-image foundation model (released 2026-06-23), running 8-step text-to-image at up to 1280×720 on a 24GB RTX 3090, entirely inside native ComfyUI with no custom nodes. The lead configuration is the community FP8 (float8_e4m3fn) Turbo build, which shrinks the transformer from 24.76 GiB (BF16) to 12.01 GiB and fits the RTX 3090's 24GB with wide headroom. A clearly labelled section below covers how close the full-quality Krea 2 Raw (undistilled, CFG) tier gets on 24GB: it is more reachable than on a 16GB card, but full-precision BF16 still doesn't fit resident.

Hardware data: RTX 3090 (24GB VRAM) · Krea 2 Turbo FP8, 8 steps at 1280×720 · See benchmark data

ℹ️ This is Krea 2, not FLUX.1-Krea-dev. Krea 2 is Krea AI's own from-scratch ~12.9B-parameter DiT released 2026-06-23 — a different model from the 2025 black-forest-labs/FLUX.1-Krea-dev (a BFL×Krea collaboration built on FLUX). Don't mix their weights, sizes, or workflows.

⚠️ Two variants, two very different fits. Krea 2 ships as Turbo (distilled, 8 steps, this recipe's lead) and Raw / Base (undistilled, CFG, 52 steps, 24.76 GiB BF16). On 24GB the FP8 Turbo build runs comfortably; full-precision BF16 Raw does not fit resident (the 24.76 GiB transformer alone exceeds 24GB) — see "The Raw quality tier" below for the routes that do work. Pin the variant before you download.

ℹ️ Where the weights come from. Krea published the official weights as gated repos under its verified org — krea/Krea-2-Raw and krea/Krea-2-Turbo (access-restricted; license approval required). An ungated community mirror of the same raw.safetensors + turbo.safetensors (plus reference inference.py) is the practical download today: the krea-community/krea-2 bucket. Neither is a ComfyUI checkpoint — the ComfyUI-loadable FP8 build used in this recipe is a community conversion of the official Turbo weights. Model identity and license come from krea.ai; read the license before any commercial use (see Requirements).

Requirements

Component | Minimum | Tested
GPU | 16GB VRAM consumer card | RTX 3090 (24GB, Ampere GA102, sm_86)
RAM | 16GB system RAM (32GB comfortable) |
Storage | ~18GB (12.01 GiB FP8 transformer + 4.88 GiB FP8 text encoder + 0.24 GiB VAE) |
Software | ComfyUI 0.25.0+, PyTorch cu121/cu124 (sm_86) | ComfyUI native Krea2 nodes

The FP8 Turbo build is documented as runnable on "standard consumer hardware (such as 16GB and 24GB GPUs)" per the AlperKTS/Krea2_FP8 model card. The RTX 3090's 24GB sits at the top of that documented range, so Turbo runs with roughly 12GB of headroom over the FP8 transformer — room for higher resolutions, larger batches, or keeping the text encoder resident.
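
The storage figure is plain addition over the three downloads. As a minimal sketch, the snippet below totals the file sizes quoted in this recipe, converts GiB to the decimal GB most disk tools report, and checks free space with the standard library. The function names and the path argument are illustrative and not part of ComfyUI.

```python
import shutil

GIB = 2**30  # bytes per GiB

# File sizes in GiB, as quoted in this recipe's download step.
FILE_SIZES_GIB = {
    "FP8 transformer": 12.01,
    "FP8 text encoder": 4.88,
    "VAE": 0.24,
}


def total_download_gib(sizes=FILE_SIZES_GIB):
    """Sum the model file sizes, in GiB."""
    return round(sum(sizes.values()), 2)


def gib_to_gb(gib):
    """Convert binary GiB to decimal GB."""
    return round(gib * GIB / 1e9, 2)


def has_room(path, needed_gib):
    """Return True if the filesystem holding `path` has `needed_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gib * GIB
```

The total comes to 17.13 GiB, or about 18.39 GB, which is where the ~18GB figure comes from. Calling has_room("ComfyUI/models", total_download_gib()) before downloading is one way to use it, assuming that is where your models folder lives.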

Licensing — read before commercial use. Krea 2 is released under the Krea 2 Community License. Key terms: you own the Outputs you generate; commercial use is free only if your company's total annual revenue is under $1,000,000 USD (above that requires an Enterprise License); any derivative AI model name must begin with "Krea"; you must implement reasonable content-filtering; and you may not circumvent or remove the model's content-provenance or watermarking mechanisms.

Installation

1. Install / update ComfyUI to 0.25.0+

ComfyUI 0.25.0 and newer have built-in Krea 2 support — no custom nodes needed, per the AlperKTS/Krea2_FP8 model card. Update via ComfyUI Manager → "Update ComfyUI", or pull the latest and reinstall requirements:

cd ComfyUI
git pull
pip install -r requirements.txt

Note: Ampere FP8 caveat. The RTX 3090 is Ampere (GA102, sm_86) and runs on any recent PyTorch CUDA build (cu121 or cu124); no special wheel is required. One architecture caveat is worth understanding: FP8 tensor cores were introduced with Ada (sm_89) and Hopper, and Ampere has none, per NVIDIA's Ada tuning guide (which describes Ada's "Fourth Generation Tensor Cores featuring the Hopper FP8 Transformer Engine"). The FP8 Turbo build still loads and runs on the 3090 because ComfyUI stores the weights in fp8 and upcasts them for compute. You get the file-size and VRAM benefit, but not the fp8 throughput gain of a 40- or 50-series card; sampling speed is governed by Ampere's bf16/fp16 tensor cores. On a 24GB card the VRAM saving isn't even needed to fit Turbo; the FP8 build is simply what this recipe's workflow ships. If torch.cuda.is_available() is False, reinstall a current CUDA wheel: pip install --upgrade torch torchvision.
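
To see which side of the FP8 line a card falls on, a small preflight check can help. This is a sketch under stated assumptions: the threshold of compute capability 8.9 follows the Ada/Hopper statement above, the helper names are made up for illustration, and the report only describes the hardware. It does not inspect what ComfyUI actually does at runtime.

```python
def has_fp8_tensor_cores(major, minor):
    """FP8 tensor cores arrived with Ada (sm_89) and Hopper (sm_90)."""
    return (major, minor) >= (8, 9)


def describe_gpu():
    """Return a short report on the first CUDA device, or why none was found."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment."
    if not torch.cuda.is_available():
        return "torch.cuda.is_available() is False: reinstall a CUDA build of PyTorch."
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    if has_fp8_tensor_cores(major, minor):
        fp8_note = "FP8 tensor cores present"
    else:
        fp8_note = "no FP8 tensor cores (fp8 weights are upcast for compute)"
    return (
        f"{name} (sm_{major}{minor}), CUDA {torch.version.cuda}, "
        f"{free_bytes / 2**30:.1f} of {total_bytes / 2**30:.1f} GiB free, {fp8_note}"
    )
```

Run print(describe_gpu()) from the same Python environment ComfyUI uses. On an RTX 3090 the capability is (8, 6), so has_fp8_tensor_cores returns False.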

2. Download the three model files

Place each file in the indicated ComfyUI/models/ subfolder. File-to-folder mapping and sources are from the AlperKTS/Krea2_FP8 model card:

# from your ComfyUI root

# FP8 Turbo diffusion model (12.01 GiB) → unet/
cd models/unet
wget https://huggingface.co/AlperKTS/Krea2_FP8/resolve/main/krea2_turbo_fp8.safetensors

# FP8-scaled Qwen3-VL 4B text encoder (4.88 GiB) → text_encoders/
cd ../text_encoders
wget https://huggingface.co/Comfy-Org/Qwen3-VL/resolve/main/text_encoders/qwen3vl_4b_fp8_scaled.safetensors

# Qwen-Image VAE (242 MiB) → vae/
cd ../vae
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors

Krea 2's text encoder is Qwen/Qwen3-VL-4B-Instruct and its VAE is the Qwen-Image autoencoder (AutoencoderKLQwenImage, f8, 16 latent channels), per the Krea-2-Base-Diffusers model card. The Comfy-Org repackaged files above are the ComfyUI-loader-compatible versions of those two components.
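
A truncated download or a file in the wrong folder is the most common reason a loader node comes up empty. The sketch below checks the three paths from the commands above against the quoted sizes. The 5% tolerance is an arbitrary choice meant only to catch partial downloads, and check_models is a hypothetical helper, not a ComfyUI function.

```python
from pathlib import Path

# Relative paths follow the download commands above; sizes are the quoted GiB values.
EXPECTED_FILES = {
    "unet/krea2_turbo_fp8.safetensors": 12.01,
    "text_encoders/qwen3vl_4b_fp8_scaled.safetensors": 4.88,
    "vae/qwen_image_vae.safetensors": 0.24,
}


def check_models(models_dir, expected=EXPECTED_FILES, tolerance=0.05):
    """Return a list of problems; an empty list means every file looks right.

    `tolerance` is the allowed relative size difference (5% by default).
    """
    problems = []
    for rel_path, size_gib in expected.items():
        path = Path(models_dir) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        actual_gib = path.stat().st_size / 2**30
        if abs(actual_gib - size_gib) > tolerance * size_gib:
            problems.append(
                f"size mismatch: {rel_path} is {actual_gib:.2f} GiB, "
                f"expected ~{size_gib} GiB"
            )
    return problems
```

For example, check_models("ComfyUI/models") returns an empty list when all three files are in place and roughly the expected size.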

24GB headroom — you can use the full-precision encoder. Because the RTX 3090 has 24GB, you can optionally swap the FP8 text encoder for the full BF16 Qwen3-VL-4B encoder (~8 GiB) for slightly higher prompt-encode fidelity — even held resident alongside the 12.01 GiB FP8 transformer that totals ~20 GiB, inside 24GB. On 16GB cards the FP8 encoder is mandatory; here it is a choice.

3. Load the workflow

The FP8 repo ships native ComfyUI workflow JSONs. Drag workflows/Krea 2 simple workflow.json (or krea2_native_workflow.json) from the AlperKTS/Krea2_FP8 repo onto your ComfyUI canvas. The workflow wires the unet/, text_encoders/, and vae/ files into the native Krea 2 sampler graph.
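
Before queueing, it can be useful to confirm that the workflow points at the same filenames you downloaded. The sketch below walks any workflow JSON and collects strings ending in .safetensors. It makes no assumption about the internal layout of the shipped workflow file, which is why it searches recursively rather than reading specific node fields.

```python
import json


def referenced_model_files(workflow):
    """Collect every string ending in .safetensors anywhere in a workflow object."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node.endswith(".safetensors"):
            found.add(node)

    walk(workflow)
    return sorted(found)


def load_workflow(path):
    """Read a workflow JSON file saved from ComfyUI."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Comparing referenced_model_files(load_workflow(path)) against the files in your models folders shows at a glance whether a filename in the workflow differs from what is on disk.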

Running

Edit the prompt node and click Queue Prompt. The Turbo defaults shipped in the workflow, per the AlperKTS/Krea2_FP8 model card, are:

  • Steps: 8
  • CFG: 1.0
  • Sampler: er_sde
  • Scheduler: simple
  • Resolution: 1280×720

ComfyUI loads and runs the FP8 Qwen3-VL text encoder to encode your prompt, then frees it before the diffusion sampling stage. On a 24GB card you have enough headroom to keep both resident if you prefer, but the default sequential encode-then-sample pattern keeps the sampling-stage footprint near the 12 GiB transformer plus VAE and activations — leaving plenty of room on the RTX 3090. Output PNGs land in ComfyUI/output/.
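
If you script generations instead of clicking Queue Prompt, the same defaults can be applied to a workflow exported in ComfyUI's API format and posted to the server's /prompt endpoint. The sketch below is illustrative: it assumes the sampler node exposes inputs named steps, cfg, sampler_name, and scheduler (as the stock KSampler does), which has not been verified against the shipped Krea 2 workflow, and it assumes the default local address 127.0.0.1:8188.

```python
import copy
import json
import urllib.request

# Turbo defaults listed above (from the AlperKTS/Krea2_FP8 model card).
TURBO_DEFAULTS = {"steps": 8, "cfg": 1.0, "sampler_name": "er_sde", "scheduler": "simple"}


def apply_sampler_settings(workflow, settings=TURBO_DEFAULTS):
    """Return a copy of an API-format workflow with sampler inputs overridden.

    Assumption: the sampler node has inputs with the same names as the keys
    in `settings`. Nodes that lack any of those keys are left untouched.
    """
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        inputs = node.get("inputs", {}) if isinstance(node, dict) else {}
        if all(key in inputs for key in settings):
            inputs.update(settings)
    return patched


def build_queue_request(workflow, host="127.0.0.1", port=8188):
    """Build (but do not send) a POST request for ComfyUI's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of build_queue_request to urllib.request.urlopen sends the job to a running ComfyUI instance. The input workflow is never modified, so the same exported file can be reused with different settings.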

Tip — natural-language prompts. Krea 2 is prompted in natural language; long, detailed descriptions yield the best results, and words to be rendered as text in the image are wrapped in quotes (per the Krea-2-Base-Diffusers model card).

The Raw quality tier (full-quality, undistilled)

Krea 2 Raw / Base is the undistilled foundation checkpoint — no step or guidance distillation, run with classifier-free guidance. Its recommended settings are 52 steps, CFG 3.5, up to 1024×1024, which trades much longer generation time for maximum diversity and malleability (it is also the checkpoint intended for LoRA training — LoRAs trained on Base apply cleanly to Turbo).

On a 24GB card the Raw tier is closer than on 16GB — but full-precision BF16 still does not fit resident. The Raw transformer is 24.76 GiB in BF16 (six diffusion_pytorch_model-0000X-of-00006.safetensors shards at CalamitousFelicitousness/Krea-2-Base-Diffusers, verified totalling 26,585,322,200 bytes; the official single-file raw.safetensors in the krea-community/krea-2 bucket is the same weights at 26.6 GB BF16). That transformer alone exceeds 24GB before the VAE, encoder, and activations, so you cannot hold BF16 Raw resident on an RTX 3090. Three cited routes:

  1. Experimental INT8 Raw quant (resident, ~12 GiB). The Winnougan/Krea-2-Base-Turbo-NVFP4-FP8-INT8 repo now ships a Raw/Base quant — Krea2_Raw_convrot_int8mixed.safetensors (12.02 GiB INT8) — alongside its five Turbo quants (Krea2_Turbo_*: fp8 12.01 GiB, int8 12.02 GiB, convrot-int8 12.02 GiB, mxfp8 12.60 GiB, nvfp4 7.15 GiB). At ~12 GiB the Raw INT8 file sits comfortably inside the RTX 3090's 24GB with ample room for the 52-step CFG activation peak — the headroom a 16GB card lacks. The catch: per its model card it needs a comfy_quant-enabled ComfyUI build plus custom nodes, so it is not the no-custom-node native path this recipe's lead uses, and we have not benchmarked its output parity against full BF16 Raw. (INT8 weight quants like this are also a better Ampere fit than fp8, which the 3090 cannot accelerate — see the install note.) Treat it as the most practical "resident Raw" route on 24GB today, with that experimental caveat.

  2. On-the-fly fp8 cast (resident, ~12.4 GiB, once a single-file Raw checkpoint exists). ComfyUI's Load Diffusion Model node exposes a weight_dtype setting; as the official ComfyUI examples document, setting it to fp8 lowers memory usage by about half. The equivalent launch flag is --fp8_e4m3fn-unet, whose help text in ComfyUI's cli_args.py reads "Store unet weights in fp8_e4m3fn". Casting the 24.76 GiB BF16 Raw transformer to fp8_e4m3fn halves it to a resident footprint of roughly 12.4 GiB. Caveat: the node needs a ComfyUI-format single-file checkpoint (the official raw.safetensors is bound to Krea's reference inference.py layout), so this path is fully turnkey only once a ComfyUI-format Raw checkpoint is published. On Ampere the fp8 weights are also upcast for compute, making this a VRAM win, not a speed one.

  3. Diffusers with CPU offload (full BF16, slow). Run the BF16 Raw checkpoint through diffusers with enable_model_cpu_offload() / sequential offload. The RTX 3090 holds more of the pipeline resident than a 16GB card, but the 24.76 GiB transformer still cannot stay fully on-GPU, so the transformer forward pass streams from system RAM and the run is substantially slower than a resident FP8/INT8 pass.

If you want full BF16 Raw resident, that needs a 32GB card (e.g. the RTX 5090); on the RTX 3090's 24GB, the experimental INT8 quant is the most practical resident-Raw path today, with native fp8-cast arriving once a single-file ComfyUI Raw checkpoint lands.
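
The size figures in this section all follow from the quoted byte count. A minimal sketch of that arithmetic is below. It treats the card as a flat 24 GiB and ignores quantization scales, activations, and any layers a real quant keeps in higher precision, so read it as a lower bound on what is needed rather than a guarantee of fit.

```python
GIB = 2**30

RAW_BF16_BYTES = 26_585_322_200  # shard total quoted above


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def cast_size_gib(bf16_bytes, bits):
    """Approximate weight size after storing each 16-bit value in `bits` bits."""
    return to_gib(bf16_bytes * bits / 16)


def fits_resident(weights_gib, vram_gib=24.0, reserve_gib=0.0):
    """True if weights plus a caller-chosen reserve fit in `vram_gib`."""
    return weights_gib + reserve_gib <= vram_gib
```

The BF16 total works out to 24.76 GiB (26.6 GB), which fails fits_resident even with no reserve. Storing the same weights in 8 bits gives about 12.4 GiB, which passes with room left over for the encoder, VAE, and activations.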

Results

  • Speed: A community benchmark on a single RTX 3090 (open methodology + data; discussion thread) measured Krea 2 Turbo FP8 at ~14.75 s/image (0.65 it/s, 4.07 img/min) at 8 steps and 1024×1024, with ~18.75 GB peak VRAM. The figure is the mean of 9 clean runs (one warmup discarded), timed over the ComfyUI websocket progress stream. Note that 1024×1024 has about 14% more pixels than this recipe's 1280×720 default. As expected for Ampere, the 3090 does not get the fp8 throughput gains a 40/50-series card does (see the Ampere note above). The benchmark's author observes that int8, not fp8, is the low-precision format the 3090 can actually accelerate (≈2× faster), and lists an int8 run as a follow-up. If you have your own numbers, please submit them so they appear on /check/krea-2/rtx-3090.
  • VRAM usage: The Turbo FP8 transformer is 12.01 GiB on disk (down from 24.76 GiB BF16) per the AlperKTS/Krea2_FP8 model card; the FP8-scaled text encoder is 4.88 GiB and the VAE 0.24 GiB (verified via the HuggingFace tree API). On the RTX 3090's 24GB the sampling-stage peak leaves wide headroom — a community benchmark measured ~18.75 GB peak at 1024×1024, higher than the ~12 GiB minimum-to-fit because on a 24GB card ComfyUI keeps more resident rather than aggressively freeing the encoder. Either way it stays comfortably within 24GB. See live data at /check/krea-2/rtx-3090.
  • Quality notes: Turbo is distilled for 8-step CFG-1.0 generation; for maximum fidelity and diversity use the Raw tier (52 steps, CFG 3.5) — see above. Architecture is a single-stream DiT, 12.9B parameters, 28 blocks at width 6144, with grouped-query attention and flow-matching sampling, per the Krea-2-Base-Diffusers model card.
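
The speed figures above can be cross-checked with two one-line conversions. This sketch only restates the reported numbers. It does not predict timing at 1280×720, since pixel count is at best a rough proxy for sampling cost.

```python
def images_per_minute(seconds_per_image):
    """Convert seconds per image into images per minute."""
    return 60.0 / seconds_per_image


def pixel_ratio(res_a, res_b):
    """How many times more pixels res_a has than res_b; each is (width, height)."""
    return (res_a[0] * res_a[1]) / (res_b[0] * res_b[1])
```

At 14.75 s/image, images_per_minute gives 4.07, matching the benchmark. pixel_ratio((1024, 1024), (1280, 720)) is about 1.14, so the benchmarked resolution carries roughly 14% more pixels than the recipe default.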

For the full benchmark data, see /check/krea-2/rtx-3090.

Troubleshooting

torch.cuda.is_available() is False

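If ComfyUI falls back to CPU (extremely slow) or reports no CUDA device, your PyTorch install is missing CUDA support. The RTX 3090 (Ampere sm_86) is stock-supported by any recent CUDA PyTorch build, so reinstall a current wheel. On Windows a plain pip install of torch pulls a CPU-only build, so point pip at the CUDA wheel index explicitly (cu124 shown; use cu121 if that matches your setup):

pip install --upgrade torch torchvision --index-url https://download.pytorch.org/whl/cu124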

ComfyUI's native sampling uses PyTorch SDPA, so you do not need FlashAttention for this recipe. Skip any pip install flash-attn step; it is not required for the ComfyUI Krea 2 path.

Out of memory during sampling

On a 24GB RTX 3090 the FP8 Turbo build (12.01 GiB transformer + small VAE) leaves wide headroom, so Turbo OOMs are unlikely unless another app is holding significant VRAM or you push resolution/batch far beyond the 1280×720 default. If you do hit OOM, close other GPU apps first. Note the OOM risk rises sharply if you attempt the BF16 Raw tier resident — the 24.76 GiB Raw transformer overflows 24GB (see "The Raw quality tier" for the routes that fit).
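
To find out how much VRAM is already taken before ComfyUI starts, nvidia-smi's CSV query mode is easy to parse. The sketch below wraps it with the standard library. The dictionary keys are illustrative names, and the parser assumes the GPU name contains no commas and that the driver reports numeric memory values.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into dicts; memory values are in MiB."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return parsed rows (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

If used_mib is already several thousand before ComfyUI launches, another application is holding VRAM, and closing it is the first thing to try.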

ComfyUI doesn't recognize the Krea 2 nodes

Native Krea 2 support requires ComfyUI 0.25.0 or newer per the AlperKTS/Krea2_FP8 model card. Update via ComfyUI Manager → "Update ComfyUI" → restart. No custom nodes are needed; if you installed a third-party "Krea" node pack, remove it to avoid conflicts.
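
Version strings are easy to misread by eye, because a plain string comparison would rank a hypothetical "0.3.40" above "0.25.0". A minimal sketch of a numeric comparison against the 0.25.0 floor is below. How you obtain the version string (startup log, ComfyUI Manager, or elsewhere) is left to you, and the helper names are illustrative.

```python
MIN_COMFYUI = (0, 25, 0)


def parse_version(text):
    """Turn '0.25.1' or 'v0.25.1' into (0, 25, 1); missing parts count as 0."""
    parts = text.strip().lstrip("vV").split(".")
    numbers = []
    for part in parts[:3]:
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        numbers.append(int(digits or 0))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(version_text, minimum=MIN_COMFYUI):
    """True if the version is at least `minimum`, compared numerically."""
    return parse_version(version_text) >= minimum
```

With this, meets_minimum("0.24.9") is False and meets_minimum("0.25.0") is True, so an install that fails the check needs the update step described above.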

Which Winnougan quant is which — Turbo vs Raw

The Winnougan/Krea-2-Base-Turbo-NVFP4-FP8-INT8 repo mixes both variants, and the filename is the only reliable tell. The Turbo quants — the variant this recipe's lead path is about — are the Krea2_Turbo_* / krea2_turbo_* files (fp8mixed 12.01 GiB, int8mixed 12.02 GiB, convrot-int8 12.02 GiB, mxfp8 12.60 GiB, nvfp4 7.15 GiB). There is exactly one Raw/Base file: Krea2_Raw_convrot_int8mixed.safetensors (12.02 GiB) — the undistilled 52-step CFG model, which per its model card needs a comfy_quant-enabled ComfyUI build plus custom nodes, so don't load it expecting native 8-step Turbo behaviour. Note this recipe's lead does not pull Turbo from here — it uses the documented AlperKTS/Krea2_FP8 repo, which ships the native workflow JSONs. For the Raw tier, see "The Raw quality tier" above.
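
Since the filename prefix is the tell, the check is simple to automate. The sketch below classifies a file by prefix, case-insensitively, and pairs each variant with the settings this recipe quotes for it. The function and dictionary names are hypothetical, and any filename that matches neither prefix is reported as unknown rather than guessed.

```python
# Recommended settings quoted in this recipe for each variant.
RECOMMENDED = {
    "turbo": {"steps": 8, "cfg": 1.0},
    "raw": {"steps": 52, "cfg": 3.5},
}


def classify_quant(filename):
    """Return 'turbo', 'raw', or 'unknown' from a Krea 2 quant filename."""
    name = filename.rsplit("/", 1)[-1].lower()
    if name.startswith("krea2_turbo_"):
        return "turbo"
    if name.startswith("krea2_raw_"):
        return "raw"
    return "unknown"


def settings_for(filename):
    """Look up the quoted settings for a file, or None if the variant is unknown."""
    return RECOMMENDED.get(classify_quant(filename))
```

For example, settings_for("Krea2_Raw_convrot_int8mixed.safetensors") returns the 52-step, CFG 3.5 settings, a quick reminder not to run that file with the Turbo defaults.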

common questions
How much VRAM does Krea 2 need?

About 16 GB — the minimum this recipe targets.

Which GPUs is Krea 2 tested on?

RTX 3090 (24 GB).

How hard is this setup?

Intermediate — follow the steps above.