How much VRAM does Krea 2 need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the steps below.

Krea 2 Turbo on RX 7800 XT via ComfyUI + ROCm: GGUF Text-to-Image in 16GB

What You'll Build

A local install of Krea 2 Turbo — the distilled, few-step variant of Krea AI's from-scratch aesthetic-first text-to-image foundation model (released 2026-06-23) — running 8-step text-to-image at up to 1280×720 on an AMD RX 7800 XT (16GB), inside ComfyUI on ROCm. Because RDNA 3 has no FP8 hardware, this recipe does not use the NVIDIA FP8 build; it leads with a community GGUF quantization loaded through the ComfyUI-GGUF node, which is the reliable AMD path. On the RX 7800 XT's 16GB the Q5_K_M tier (8.87 GB) is the sweet spot, with Q6_K available if you keep other VRAM use low.

Hardware data: RX 7800 XT (16GB VRAM, RDNA 3, gfx1101) · Krea 2 Turbo GGUF, 8 steps at 1280×720 · See benchmark data

ℹ️ This is Krea 2, not FLUX.1-Krea-dev. Krea 2 is Krea AI's own from-scratch ~12.9B-parameter DiT released 2026-06-23 — a different model from the 2025 black-forest-labs/FLUX.1-Krea-dev (a BFL×Krea collaboration built on FLUX). Don't mix their weights, sizes, or workflows.

⚠️ On AMD, use GGUF — not the FP8 build. RDNA 3 (gfx1101) has no FP8 tensor hardware (FP8 matrix ops arrived with RDNA 4 / CDNA 3). An FP8 safetensors loads on ROCm but upcasts to BF16 for compute — so it gives no memory saving, and neither it nor the full BF16 Turbo transformer (24.76 GiB) comes close to fitting a 16GB card. The path that fits and runs is a GGUF quant (this recipe's lead) via the ComfyUI-GGUF node. Pin the format before you download.

ℹ️ Where the weights come from. Krea published the official weights as gated repos under its verified org — krea/Krea-2-Raw and krea/Krea-2-Turbo (access-restricted). An ungated community mirror of the official turbo.safetensors is the krea-community/krea-2 bucket. The GGUF quants used here are a community conversion of the official Turbo weights, published at vantagewithai/Krea-2-Turbo-GGUF (which also ships a ready ComfyUI workflow JSON). Model identity and license come from krea.ai; read the license before any commercial use (see Requirements).

Requirements

Component	Minimum	Tested
GPU	16GB VRAM RDNA 3 card	RX 7800 XT (16GB, RDNA 3 Navi 32, gfx1101)
OS / driver	Linux + ROCm 6.2+ (gfx1101 is officially supported)	—
RAM	16GB system RAM (32GB comfortable)	—
Storage	~18GB (Q5_K_M GGUF 8.87GB + ~8.9GB BF16 encoder + 0.25GB VAE)	—
Software	ComfyUI 0.25.0+ · ComfyUI-GGUF (city96) · PyTorch for ROCm	ComfyUI native Krea2 + UnetLoaderGGUF
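The storage figure is simply the sum of the three downloads. A minimal sketch of that arithmetic, plus a free-space check, using the sizes quoted in this recipe. It assumes the GGUF size is decimal GB (as in the tier table) and converts the encoder (8.26 GiB) and VAE (242 MiB) to the same unit; the helper names and the 10% margin are illustrative choices, not part of any tool.

```python
import shutil

GIB = 2**30
MIB = 2**20

# Sizes quoted in this recipe, all converted to bytes.
DOWNLOADS = {
    "krea2_turbo-Q5_K_M.gguf": 8.87e9,          # GGUF tier, assumed decimal GB
    "qwen3vl_4b_bf16.safetensors": 8.26 * GIB,  # BF16 text encoder
    "qwen_image_vae.safetensors": 242 * MIB,    # VAE
}

def required_gb(downloads=DOWNLOADS, margin=0.10):
    """Total download size in decimal GB plus a safety margin (the margin is an assumption)."""
    total_bytes = sum(downloads.values())
    return total_bytes * (1 + margin) / 1e9

def has_room(path=".", downloads=DOWNLOADS):
    """True if the filesystem holding `path` has enough free space for the downloads."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(downloads)

if __name__ == "__main__":
    print(f"need about {required_gb():.1f} GB, enough free space: {has_room()}")
```

Run it from the ComfyUI root before downloading; picking a different GGUF tier only changes the first entry.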

The RX 7800 XT is an officially ROCm-supported card (gfx1101) per AMD's ROCm system-requirements matrix — so HSA_OVERRIDE_GFX_VERSION is not required for it. ROCm is Linux-first — on Windows, use WSL2.

Licensing — read before commercial use. Krea 2 is released under the Krea 2 Community License. Key terms: you own the Outputs you generate; commercial use is free only if your company's total annual revenue is under $1,000,000 USD (above that requires an Enterprise License); any derivative AI model name must begin with "Krea"; you must implement reasonable content-filtering; and you may not circumvent or remove the model's content-provenance or watermarking mechanisms.

Installation

1. Install ComfyUI on ROCm

Install a ROCm PyTorch build (not CUDA) and run ComfyUI on it. The exact rocmX.Y wheel tag moves over time — read the live selector at pytorch.org/get-started/locally and pick the ROCm option, e.g.:

# Linux, inside your ComfyUI venv — pick the current ROCm tag from the selector
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
# then ComfyUI itself
git clone https://github.com/Comfy-Org/ComfyUI && cd ComfyUI
pip install -r requirements.txt

Launch ComfyUI with PyTorch's cross-attention backend — on RDNA 3 this is the stable attention path and avoids a known ComfyUI ROCm VAE-decode crash (ComfyUI #11551):

python main.py --use-pytorch-cross-attention

ROCm notes. Do not install FlashAttention (pip install flash-attn) — upstream CK FlashAttention does not build on gfx1101; ComfyUI's PyTorch SDPA path is what you use. The RX 7800 XT is officially supported, so the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade (gfx1101→gfx1100) is a legacy fallback you do not need here. If a large model load stalls, add --disable-smart-memory.

2. Install the ComfyUI-GGUF custom node

The GGUF diffusion model loads through ComfyUI-GGUF (city96). Install it via ComfyUI Manager ("Install Custom Nodes" → search "GGUF"), or by hand:

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install --upgrade gguf

Restart ComfyUI; you should now have the Unet Loader (GGUF) node.

3. Download the model files

Pick one GGUF tier (see the table under "Running" — Q5_K_M is the 16GB lead), plus the text encoder and VAE. File-to-folder mapping follows the vantagewithai/Krea-2-Turbo-GGUF workflow:

# from your ComfyUI root

# GGUF diffusion model (Q5_K_M = 8.87 GB, the 16GB lead) → unet/
cd models/unet
wget https://huggingface.co/vantagewithai/Krea-2-Turbo-GGUF/resolve/main/krea2_turbo-Q5_K_M.gguf

# Qwen3-VL 4B text encoder, BF16 (8.26 GiB) → text_encoders/
cd ../text_encoders
wget https://huggingface.co/Comfy-Org/Qwen3-VL/resolve/main/text_encoders/qwen3vl_4b_bf16.safetensors

# Qwen-Image VAE (242 MiB) → vae/
cd ../vae
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors

Krea 2's text encoder is Qwen/Qwen3-VL-4B-Instruct and its VAE is the Qwen-Image autoencoder (AutoencoderKLQwenImage, f8, 16 latent channels), per the Krea-2-Base-Diffusers model card. The VAE file is precision-independent (same file used on NVIDIA). On AMD the BF16 encoder is the clean choice — the workflow's qwen3vl_4b_fp8_scaled file also loads, but on RDNA 3 it upcasts to BF16 anyway (no memory saving), so there's no reason to prefer it here.
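Before opening the workflow, it can help to confirm that each file landed in the folder the loaders read from. The sketch below checks the three paths used in the download commands above; the `missing_files` helper is hypothetical, and the paths assume the Q5_K_M tier and a script run from the ComfyUI root.

```python
from pathlib import Path

# Relative paths follow the download commands in this recipe.
EXPECTED_FILES = [
    "models/unet/krea2_turbo-Q5_K_M.gguf",
    "models/text_encoders/qwen3vl_4b_bf16.safetensors",
    "models/vae/qwen_image_vae.safetensors",
]

def missing_files(comfy_root, expected=EXPECTED_FILES):
    """Return the expected model files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in expected:
        path = root / rel
        # A zero-byte file usually means an interrupted or rejected download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing

if __name__ == "__main__":
    # "." assumes the script is run from the ComfyUI root.
    for rel in missing_files("."):
        print(f"missing or empty: {rel}")
```

An empty result means all three files are present; swap the first entry if you downloaded a different tier.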

4. Load the workflow

Drag the Vantage_Krea-2-Turbo.json workflow onto the ComfyUI canvas. Set the Unet Loader (GGUF) node to the .gguf tier you downloaded, the encoder loader to qwen3vl_4b_bf16.safetensors, and the VAE to qwen_image_vae.safetensors.

Running

Edit the prompt node and click Queue Prompt. The Turbo defaults baked into the workflow are:

Steps: 8
CFG: 1.0
Sampler: er_sde
Scheduler: simple
Resolution: 1280×720
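If you script ComfyUI rather than clicking through the canvas, these defaults can be written into a workflow dictionary. The sketch below assumes a workflow exported in ComfyUI's API format (node id mapped to `class_type` and `inputs`) whose sampler exposes `steps`, `cfg`, `sampler_name` and `scheduler`, as the stock KSampler node does. The shipped Vantage workflow may use different node types, so treat the key names as assumptions to check against your own export.

```python
import copy

# Turbo defaults listed above.
TURBO_DEFAULTS = {"steps": 8, "cfg": 1.0, "sampler_name": "er_sde", "scheduler": "simple"}
RESOLUTION = {"width": 1280, "height": 720}

def apply_defaults(workflow, sampler=TURBO_DEFAULTS, resolution=RESOLUTION):
    """Return a copy of an API-format workflow with matching input keys overwritten.

    Only keys a node already exposes are touched; the input workflow is not modified.
    """
    patched = copy.deepcopy(workflow)
    wanted = {**sampler, **resolution}
    for node in patched.values():
        inputs = node.get("inputs", {})
        for key, value in wanted.items():
            if key in inputs:
                inputs[key] = value
    return patched

if __name__ == "__main__":
    # Hypothetical two-node fragment for illustration, not the shipped workflow.
    demo = {
        "3": {"class_type": "KSampler",
              "inputs": {"steps": 20, "cfg": 7.0, "sampler_name": "euler", "scheduler": "normal"}},
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    }
    print(apply_defaults(demo))
```

Because the match is by key name, any other node that exposes `width` or `height` would be overwritten too; filter on `class_type` if your workflow has such nodes.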

Sequential encode is what makes 16GB work. ComfyUI runs the ~8 GiB Qwen3-VL encoder to encode your prompt, then frees it before the diffusion sampling stage — so the encoder and the GGUF transformer are never both resident. The sampling-stage peak is therefore near the GGUF tier you chose plus the VAE and activations, not the sum of both. Output PNGs land in ComfyUI/output/.
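The difference between the two memory models can be put into numbers. This is a rough sketch, not a measurement: the file sizes come from this recipe, while the activation figure is a placeholder assumption that you should replace with what you observe on your card.

```python
GIB = 2**30

def gb_to_gib(gb):
    """Convert decimal gigabytes to binary gibibytes."""
    return gb * 1e9 / GIB

def peak_gib(transformer_gib, encoder_gib, vae_gib, activations_gib, sequential=True):
    """Rough peak VRAM in GiB.

    sequential=True models the encoder being freed before sampling, so the peak is
    the larger of the two stages; sequential=False keeps everything resident at once.
    """
    sampling = transformer_gib + vae_gib + activations_gib
    if sequential:
        return max(encoder_gib, sampling)
    return encoder_gib + sampling

if __name__ == "__main__":
    q5 = gb_to_gib(8.87)   # Q5_K_M file size from this recipe
    encoder = 8.26         # BF16 Qwen3-VL encoder, GiB
    vae = 242 / 1024       # 242 MiB VAE
    activations = 2.0      # placeholder assumption, not a measurement
    print(f"sequential:   {peak_gib(q5, encoder, vae, activations):.2f} GiB")
    print(f"all resident: {peak_gib(q5, encoder, vae, activations, sequential=False):.2f} GiB")
```

With the placeholder activation value, the sequential estimate stays well under 16 GiB while the all-resident estimate does not, which is the point of the paragraph above.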

GGUF quant tiers (byte sizes verified via the HuggingFace tree API; pick by VRAM and quality target):

Tier	File size	Notes
Q8_0	13.71 GB	Near-BF16 — borderline on 16GB (tight after activations)
Q6_K	10.58 GB	Excellent quality; fits 16GB with care
Q5_K_M	8.87 GB	Strong quality/size balance — lead on 16GB
Q4_K_M	7.49 GB	Lighter/faster; good headroom on 16GB
Q3_K_M	6.01 GB	Visible degradation; only if tight
Q2_K	4.89 GB	Lowest tier; quality drops noticeably
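One way to read the table is as a lookup: take the card's VRAM, subtract what you want to keep free for the VAE, activations and the desktop, and choose the largest tier that still fits. The sketch below does exactly that. The `reserve_gb` value is your own estimate (an assumption, not a figure from this recipe), and file size is only a proxy for resident size.

```python
# File sizes in GB, copied from the table above (largest first).
TIERS = [
    ("Q8_0", 13.71),
    ("Q6_K", 10.58),
    ("Q5_K_M", 8.87),
    ("Q4_K_M", 7.49),
    ("Q3_K_M", 6.01),
    ("Q2_K", 4.89),
]

def pick_tier(vram_gb, reserve_gb):
    """Return the largest tier whose file fits in vram_gb minus reserve_gb, or None."""
    budget = vram_gb - reserve_gb
    for name, size_gb in TIERS:
        if size_gb <= budget:
            return name
    return None

if __name__ == "__main__":
    for reserve in (2, 4, 6):
        print(f"reserve {reserve} GB -> {pick_tier(16, reserve)}")
```

Reserving 6 GB on a 16GB card selects Q5_K_M and reserving 4 GB selects Q6_K, which matches the guidance in the Notes column.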

Tip — natural-language prompts. Krea 2 is prompted in natural language; long, detailed descriptions yield the best results, and words to be rendered as text in the image are wrapped in quotes (per the Krea-2-Base-Diffusers model card).

The Raw quality tier (full-quality, undistilled)

Krea 2 Raw / Base is the undistilled foundation checkpoint — no step or guidance distillation, run with classifier-free guidance (recommended settings: 52 steps, CFG 3.5, up to 1024×1024). It is the LoRA-training base (LoRAs trained on Base apply to Turbo). For the same no-FP8 reason, run Raw on AMD as GGUF too — community Raw GGUFs are published at vantagewithai/Krea-2-Raw-GGUF, loaded through the same Unet Loader (GGUF) node. On 16GB, stick to the lower Raw GGUF tiers and expect substantially longer generations (52 vs 8 steps, plus CFG doubling the per-step work); verify your chosen tier fits before a long run.
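The "substantially longer" estimate follows from counting transformer forward passes. A back-of-the-envelope sketch, assuming that guidance above 1.0 costs two passes per step, that CFG 1.0 costs one, and that each pass takes the same time (same tier and a similar resolution):

```python
def model_evaluations(steps, cfg):
    """Transformer forward passes per image; CFG above 1.0 adds an unconditional pass per step."""
    passes_per_step = 2 if cfg > 1.0 else 1
    return steps * passes_per_step

turbo = model_evaluations(8, 1.0)    # 8 passes
raw = model_evaluations(52, 3.5)     # 104 passes
ratio = raw / turbo                  # 13.0

if __name__ == "__main__":
    print(f"Raw needs about {ratio:.0f}x the forward passes of Turbo")
```

Under those assumptions a Raw image costs roughly 13 times the sampling work of a Turbo image, before any difference in resolution or quant tier.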

Results

Speed: No community benchmark exists for Krea 2 on the RX 7800 XT yet; the /check/krea-2/rx-7800-xt endpoint currently returns verdict: unknown with no benchmark rows. GGUF weights are dequantized to FP16/BF16 at compute time (RDNA 3 has no FP8/INT acceleration for diffusion here), so expect throughput to be governed by the 7800 XT's FP16/BF16 path. No vendor figure names this card, so we omit a number rather than quote one measured on different hardware. If you run it, please submit your numbers so they appear on /check/krea-2/rx-7800-xt.
VRAM usage: The lead Q5_K_M GGUF transformer is 8.87 GB on disk; tiers run 4.89–13.71 GB (verified via the HuggingFace tree API). Because ComfyUI frees the ~8 GiB BF16 encoder before sampling (see "Running"), the sampling-stage peak sits near the chosen GGUF tier plus the VAE and activations — within the RX 7800 XT's 16GB at the Q4–Q6 tiers. Q8_0 is borderline at 16GB once activations are added. Live measurements will land at /check/krea-2/rx-7800-xt.
Quality notes: Q5_K_M/Q6_K are strong; Q8_0 is near-BF16 but tight on 16GB; below Q4 expect visible degradation. Turbo is distilled for 8-step CFG-1.0 generation; for maximum fidelity use the Raw tier (52 steps, CFG 3.5) — see above. Architecture is a single-stream DiT, 12.9B parameters, 28 blocks at width 6144, with grouped-query attention and flow-matching sampling, per the Krea-2-Base-Diffusers model card.

For the full benchmark data, see /check/krea-2/rx-7800-xt.

Troubleshooting

ComfyUI ROCm VAE-decode crash / black or garbled output

Launch ComfyUI with --use-pytorch-cross-attention (PyTorch SDPA). On RDNA 3 this is the attention path confirmed in ComfyUI #11551; the alternative --bf16-vae is not the fix (it is contested and can inflate decode VRAM). Do not install FlashAttention — it does not build on gfx1101, and ComfyUI does not need it.

"Unet Loader (GGUF) node not found"

Install ComfyUI-GGUF (city96) in custom_nodes/, run pip install --upgrade gguf, and restart ComfyUI. The base ComfyUI install cannot load .gguf diffusion models without it.

Out of memory during sampling

On 16GB, drop to a lighter GGUF tier (Q6_K → Q5_K_M → Q4_K_M). Each step down frees roughly the difference in file size from the tier table under "Running": about 1.7 GB from Q6_K to Q5_K_M, and about 1.4 GB from Q5_K_M to Q4_K_M. Q8_0 is borderline here; prefer Q5_K_M or Q6_K. Close other GPU apps. If a large load stalls rather than running out of memory, add --disable-smart-memory to the launch command (a known ROCm memory-management workaround).

Verify ROCm sees the GPU

If ComfyUI falls back to CPU, confirm python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)" prints True and a HIP version. If torch.version.hip is None, you installed a CUDA wheel — reinstall the ROCm build from the pytorch.org selector. The RX 7800 XT (gfx1101) is officially supported, so the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade is not needed unless a specific library ships only gfx1100 kernels.
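The one-liner above can be expanded into a small script that says which of the two failure cases you are in. This is a sketch: `diagnose` is a hypothetical helper, and the only PyTorch calls used are `torch.cuda.is_available()`, `torch.version.hip` and `torch.cuda.get_device_name()`.

```python
def diagnose(cuda_available, hip_version):
    """Map the two values printed by the one-liner to a suggested next step."""
    if hip_version is None:
        return "not a ROCm build: reinstall PyTorch using the ROCm option of the selector"
    if not cuda_available:
        return "ROCm build installed but no GPU visible: check the ROCm driver install"
    return f"OK: ROCm/HIP {hip_version} sees a GPU"

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(diagnose(torch.cuda.is_available(), torch.version.hip))
        if torch.cuda.is_available():
            # On ROCm builds the GPU is still addressed through the torch.cuda namespace.
            print("device:", torch.cuda.get_device_name(0))
```

Run it inside the same venv that launches ComfyUI, since a second Python environment can carry a different PyTorch build.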