What You'll Build
A local install of Krea 2 Turbo — the distilled, few-step variant of Krea AI's from-scratch aesthetic-first text-to-image foundation model (released 2026-06-23) — running 8-step text-to-image at up to 1280×720 on an AMD RX 7800 XT (16GB), inside ComfyUI on ROCm. Because RDNA 3 has no FP8 hardware, this recipe does not use the NVIDIA FP8 build; it leads with a community GGUF quantization loaded through the ComfyUI-GGUF node, which is the reliable AMD path. On the RX 7800 XT's 16GB the Q5_K_M tier (8.87 GB) is the sweet spot, with Q6_K available if you keep other VRAM use low.
Hardware data: RX 7800 XT (16GB VRAM, RDNA 3, gfx1101) · Krea 2 Turbo GGUF, 8 steps at 1280×720 · See benchmark data
ℹ️ This is Krea 2, not FLUX.1-Krea-dev. Krea 2 is Krea AI's own from-scratch ~12.9B-parameter DiT, released 2026-06-23. It is a different model from the 2025 black-forest-labs/FLUX.1-Krea-dev (a BFL×Krea collaboration built on FLUX). Don't mix their weights, sizes, or workflows.
⚠️ On AMD, use GGUF, not the FP8 build. RDNA 3 (gfx1101) has no FP8 tensor hardware (FP8 matrix ops arrived with RDNA 4 / CDNA 3). An FP8 safetensors file loads on ROCm but is upcast to BF16 for compute, so it gives no memory saving; neither it nor the full BF16 Turbo transformer (24.76 GiB) comes close to fitting a 16GB card. The path that fits and runs is a GGUF quant (this recipe's lead) loaded via the ComfyUI-GGUF node. Pin the format before you download.
ℹ️ Where the weights come from. Krea published the official weights as gated (access-restricted) repos under its verified org: krea/Krea-2-Raw and krea/Krea-2-Turbo. An ungated community mirror of the official turbo.safetensors is the krea-community/krea-2 bucket. The GGUF quants used here are a community conversion of the official Turbo weights, published at vantagewithai/Krea-2-Turbo-GGUF (which also ships a ready ComfyUI workflow JSON). Model identity and license come from krea.ai; read the license before any commercial use (see Requirements).
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16GB VRAM RDNA 3 card | RX 7800 XT (16GB, RDNA 3 Navi 32, gfx1101) |
| OS / driver | Linux + ROCm 6.2+ (gfx1101 is officially supported) | — |
| RAM | 16GB system RAM (32GB comfortable) | — |
| Storage | ~18GB (Q5_K_M GGUF 8.87GB + 8.26GB BF16 encoder + 0.24GB VAE) | — |
| Software | ComfyUI 0.25.0+ · ComfyUI-GGUF (city96) · PyTorch for ROCm | ComfyUI native Krea2 + UnetLoaderGGUF |
The RX 7800 XT is an officially ROCm-supported card (gfx1101) per AMD's ROCm system-requirements matrix — so HSA_OVERRIDE_GFX_VERSION is not required for it. ROCm is Linux-first — on Windows, use WSL2.
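To confirm that a leftover override is not set in your shell, a small check along these lines can help. This is a minimal sketch: the function name `override_status` is hypothetical, and it only reads an environment mapping without changing anything.

```python
import os
from typing import Mapping

# Environment variable named in this recipe as a legacy fallback.
OVERRIDE_VAR = "HSA_OVERRIDE_GFX_VERSION"


def override_status(env: Mapping[str, str]) -> str:
    """Describe whether the legacy gfx override is present in `env`."""
    value = env.get(OVERRIDE_VAR)
    if value is None:
        return "not set (expected on an officially supported gfx1101 card)"
    return f"set to {value}; this recipe does not require it on the RX 7800 XT"


if __name__ == "__main__":
    # Inspect the current process environment.
    print(override_status(os.environ))
```

Run it inside the same shell and virtual environment you use to start ComfyUI, since that is the environment the override would affect.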
Licensing — read before commercial use. Krea 2 is released under the Krea 2 Community License. Key terms: you own the Outputs you generate; commercial use is free only if your company's total annual revenue is under $1,000,000 USD (above that requires an Enterprise License); any derivative AI model name must begin with "Krea"; you must implement reasonable content-filtering; and you may not circumvent or remove the model's content-provenance or watermarking mechanisms.
Installation
1. Install ComfyUI on ROCm
Install a ROCm PyTorch build (not CUDA) and run ComfyUI on it. The exact rocmX.Y wheel tag moves over time — read the live selector at pytorch.org/get-started/locally and pick the ROCm option, e.g.:
# Linux, inside your ComfyUI venv — pick the current ROCm tag from the selector
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
# then ComfyUI itself
git clone https://github.com/Comfy-Org/ComfyUI && cd ComfyUI
pip install -r requirements.txt
Launch ComfyUI with PyTorch's cross-attention backend — on RDNA 3 this is the stable attention path and avoids a known ComfyUI ROCm VAE-decode crash (ComfyUI #11551):
python main.py --use-pytorch-cross-attention
ROCm notes. Do not install FlashAttention (pip install flash-attn): upstream CK FlashAttention does not build on gfx1101, and ComfyUI's PyTorch SDPA path is what you use instead. The RX 7800 XT is officially supported, so the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade (gfx1101→gfx1100) is a legacy fallback you do not need here. If a large model load stalls, add --disable-smart-memory.
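If you prefer to start ComfyUI from a script, the launch flags from this step can be assembled as in the sketch below. The helper names `build_launch_args` and `launch` are hypothetical. The only flags used are the two named in this recipe, and the script assumes it is pointed at your ComfyUI checkout.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_launch_args(disable_smart_memory: bool = False) -> List[str]:
    """Return the argv for starting ComfyUI with the flags named in this recipe."""
    args = [sys.executable, "main.py", "--use-pytorch-cross-attention"]
    if disable_smart_memory:
        # Optional workaround from the ROCm notes for stalled large model loads.
        args.append("--disable-smart-memory")
    return args


def launch(comfyui_dir: Path, disable_smart_memory: bool = False) -> None:
    """Start ComfyUI from its checkout directory and block until it exits."""
    subprocess.run(build_launch_args(disable_smart_memory), cwd=comfyui_dir, check=True)
```

Calling `launch(Path("ComfyUI"))` from the parent directory of the checkout is equivalent to running the command above by hand. Pass `disable_smart_memory=True` only if you hit the stall described in the notes.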
2. Install the ComfyUI-GGUF custom node
The GGUF diffusion model loads through ComfyUI-GGUF (city96). Install it via ComfyUI Manager ("Install Custom Nodes" → search "GGUF"), or by hand:
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install --upgrade gguf
Restart ComfyUI; you should now have the Unet Loader (GGUF) node.
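Before restarting, you can sanity-check the install with a sketch like the one below. It assumes the manual `git clone` layout (a `custom_nodes/ComfyUI-GGUF` folder); a ComfyUI Manager install may use a different folder name. The function name `gguf_node_ready` is hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import Dict


def gguf_node_ready(comfyui_root: Path) -> Dict[str, bool]:
    """Check the two things this step installs: the node folder and the gguf package."""
    node_dir = comfyui_root / "custom_nodes" / "ComfyUI-GGUF"
    return {
        # True if the cloned custom node directory exists.
        "node_dir_present": node_dir.is_dir(),
        # True if the `gguf` Python package can be found in this environment.
        "gguf_package_importable": importlib.util.find_spec("gguf") is not None,
    }


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains the ComfyUI checkout.
    print(gguf_node_ready(Path("ComfyUI")))
```

Run it with the same Python interpreter that runs ComfyUI. Otherwise the package check reflects a different environment.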
3. Download the model files
Pick one GGUF tier (see the table under "Running" — Q5_K_M is the 16GB lead), plus the text encoder and VAE. File-to-folder mapping follows the vantagewithai/Krea-2-Turbo-GGUF workflow:
# from your ComfyUI root
# GGUF diffusion model (Q5_K_M = 8.87 GiB, the 16GB lead) → unet/
cd models/unet
wget https://huggingface.co/vantagewithai/Krea-2-Turbo-GGUF/resolve/main/krea2_turbo-Q5_K_M.gguf
# Qwen3-VL 4B text encoder, BF16 (8.26 GiB) → text_encoders/
cd ../text_encoders
wget https://huggingface.co/Comfy-Org/Qwen3-VL/resolve/main/text_encoders/qwen3vl_4b_bf16.safetensors
# Qwen-Image VAE (242 MiB) → vae/
cd ../vae
wget https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors
Krea 2's text encoder is Qwen/Qwen3-VL-4B-Instruct and its VAE is the Qwen-Image autoencoder (AutoencoderKLQwenImage, f8, 16 latent channels), per the Krea-2-Base-Diffusers model card. The VAE file is precision-independent (same file used on NVIDIA). On AMD the BF16 encoder is the clean choice — the workflow's qwen3vl_4b_fp8_scaled file also loads, but on RDNA 3 it upcasts to BF16 anyway (no memory saving), so there's no reason to prefer it here.
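A quick way to confirm the three downloads landed in the right folders is sketched below. The relative paths mirror the commands in this step and assume the Q5_K_M tier; swap in your own filename if you chose another tier. The function name `missing_model_files` is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# Paths relative to the ComfyUI root, taken from the download commands in this step.
EXPECTED_FILES = [
    "models/unet/krea2_turbo-Q5_K_M.gguf",
    "models/text_encoders/qwen3vl_4b_bf16.safetensors",
    "models/vae/qwen_image_vae.safetensors",
]


def missing_model_files(comfyui_root: Path, expected: Sequence[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected model files that are not present under `comfyui_root`."""
    return [rel for rel in expected if not (comfyui_root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the ComfyUI root.
    missing = missing_model_files(Path("."))
    print("all model files present" if not missing else f"missing: {missing}")
```

This checks presence only. It does not verify file sizes or checksums, so a truncated download would still pass.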
4. Load the workflow
Drag the Vantage_Krea-2-Turbo.json workflow onto the ComfyUI canvas. Set the Unet Loader (GGUF) node to the .gguf tier you downloaded, the encoder loader to qwen3vl_4b_bf16.safetensors, and the VAE to qwen_image_vae.safetensors.
Running
Edit the prompt node and click Queue Prompt. The Turbo defaults baked into the workflow are:
- Steps: 8
- CFG: 1.0
- Sampler: er_sde
- Scheduler: simple
- Resolution: 1280×720
Sequential encode is what makes 16GB work. ComfyUI runs the ~8 GiB Qwen3-VL encoder to encode your prompt, then frees it before the diffusion sampling stage — so the encoder and the GGUF transformer are never both resident. The sampling-stage peak is therefore near the GGUF tier you chose plus the VAE and activations, not the sum of both. Output PNGs land in ComfyUI/output/.
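The effect of freeing the encoder can be illustrated with rough arithmetic. The sketch below uses the file sizes quoted in this recipe as stand-ins for resident memory and ignores the GB/GiB difference. The activation figure is a placeholder assumption, not a measurement, and the function names are hypothetical.

```python
ENCODER_GB = 8.26  # Qwen3-VL 4B BF16 encoder, size quoted in this recipe
VAE_GB = 0.24      # Qwen-Image VAE, size quoted in this recipe


def sampling_peak_gb(gguf_gb: float, activations_gb: float) -> float:
    """Estimated peak once the encoder has been freed before sampling."""
    return gguf_gb + VAE_GB + activations_gb


def everything_resident_gb(gguf_gb: float, activations_gb: float) -> float:
    """Hypothetical peak if the encoder stayed loaded during sampling."""
    return ENCODER_GB + sampling_peak_gb(gguf_gb, activations_gb)


if __name__ == "__main__":
    ASSUMED_ACTIVATIONS_GB = 2.0  # placeholder assumption, NOT a measured value
    for tier, size_gb in [("Q5_K_M", 8.87), ("Q6_K", 10.58)]:
        sequential = sampling_peak_gb(size_gb, ASSUMED_ACTIVATIONS_GB)
        resident = everything_resident_gb(size_gb, ASSUMED_ACTIVATIONS_GB)
        print(f"{tier}: sequential ~{sequential:.2f}, both resident ~{resident:.2f}")
```

Under that placeholder, keeping the encoder resident alongside Q5_K_M would exceed 16GB while the sequential estimate would not. Treat this as an illustration only and check real usage with a monitoring tool.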
GGUF quant tiers (byte sizes verified via the HuggingFace tree API; pick by VRAM and quality target):
| Tier | File size | Notes |
|---|---|---|
| Q8_0 | 13.71 GB | Near-BF16 — borderline on 16GB (tight after activations) |
| Q6_K | 10.58 GB | Excellent quality; fits 16GB with care |
| Q5_K_M | 8.87 GB | Strong quality/size balance — lead on 16GB |
| Q4_K_M | 7.49 GB | Lighter/faster; good headroom on 16GB |
| Q3_K_M | 6.01 GB | Visible degradation; only if tight |
| Q2_K | 4.89 GB | Lowest tier; quality drops noticeably |
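Choosing from the table can be expressed as "the largest tier whose file fits a weight budget". The sketch below encodes the sizes from the table. The budget is whatever you decide to leave for weights after reserving headroom for the VAE and activations; that headroom is your own assumption, not a figure from this recipe. The function name `largest_tier_within` is hypothetical.

```python
from typing import Optional

# File sizes in GB, copied from the tier table above.
TIERS_GB = {
    "Q8_0": 13.71,
    "Q6_K": 10.58,
    "Q5_K_M": 8.87,
    "Q4_K_M": 7.49,
    "Q3_K_M": 6.01,
    "Q2_K": 4.89,
}


def largest_tier_within(budget_gb: float) -> Optional[str]:
    """Return the largest tier whose file size fits `budget_gb`, or None if none fit."""
    fitting = {tier: size for tier, size in TIERS_GB.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    # Example budget only: 10 GB left for weights after a self-chosen headroom.
    print(largest_tier_within(10.0))
```

File size is only a proxy for resident memory, so confirm the chosen tier with a short test generation before committing to a long batch.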
Tip — natural-language prompts. Krea 2 is prompted in natural language; long, detailed descriptions yield the best results, and words to be rendered as text in the image are wrapped in quotes (per the Krea-2-Base-Diffusers model card).
The Raw quality tier (full-quality, undistilled)
Krea 2 Raw / Base is the undistilled foundation checkpoint — no step or guidance distillation, run with classifier-free guidance (recommended settings: 52 steps, CFG 3.5, up to 1024×1024). It is the LoRA-training base (LoRAs trained on Base apply to Turbo). For the same no-FP8 reason, run Raw on AMD as GGUF too — community Raw GGUFs are published at vantagewithai/Krea-2-Raw-GGUF, loaded through the same Unet Loader (GGUF) node. On 16GB, stick to the lower Raw GGUF tiers and expect substantially longer generations (52 vs 8 steps, plus CFG doubling the per-step work); verify your chosen tier fits before a long run.
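The "substantially longer" estimate can be made concrete by counting model forward passes. This is a simplified sketch that assumes one pass per step for Turbo at CFG 1.0 and two per step when classifier-free guidance is active. It ignores resolution, quant tier, and encoder or VAE time, so it is a rough ratio rather than a timing prediction.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Count transformer forward passes for one image under the stated assumption."""
    passes_per_step = 2 if uses_cfg else 1
    return steps * passes_per_step


turbo_passes = forward_passes(8, uses_cfg=False)  # Turbo defaults: 8 steps, CFG 1.0
raw_passes = forward_passes(52, uses_cfg=True)    # Raw settings: 52 steps, CFG 3.5
ratio = raw_passes / turbo_passes                 # 104 / 8 = 13.0

if __name__ == "__main__":
    print(f"Turbo: {turbo_passes} passes, Raw: {raw_passes} passes, ratio: {ratio}")
```

Under these assumptions a Raw image costs about 13 times the transformer work of a Turbo image.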
Results
- Speed: No community benchmark exists for Krea 2 on the RX 7800 XT yet; the /check/krea-2/rx-7800-xt endpoint currently returns verdict: unknown with no benchmark rows. GGUF inference on ROCm is dequantized to the card's native compute (RDNA 3 has no FP8/INT acceleration for diffusion here), so expect throughput governed by the 7800 XT's FP16/BF16 path. No vendor figure names this card, so we omit a measured number rather than quote different hardware. If you run it, please submit your numbers so they appear on /check/krea-2/rx-7800-xt.
- VRAM usage: The lead Q5_K_M GGUF transformer is 8.87 GB on disk; tiers run 4.89–13.71 GB (verified via the HuggingFace tree API). Because ComfyUI frees the ~8 GiB BF16 encoder before sampling (see "Running"), the sampling-stage peak sits near the chosen GGUF tier plus the VAE and activations, which is within the RX 7800 XT's 16GB at the Q4–Q6 tiers. Q8_0 is borderline at 16GB once activations are added. Live measurements will land at /check/krea-2/rx-7800-xt.
- Quality notes: Q5_K_M/Q6_K are strong; Q8_0 is near-BF16 but tight on 16GB; below Q4 expect visible degradation. Turbo is distilled for 8-step CFG-1.0 generation; for maximum fidelity use the Raw tier (52 steps, CFG 3.5) — see above. Architecture is a single-stream DiT, 12.9B parameters, 28 blocks at width 6144, with grouped-query attention and flow-matching sampling, per the Krea-2-Base-Diffusers model card.
For the full benchmark data, see /check/krea-2/rx-7800-xt.
Troubleshooting
ComfyUI ROCm VAE-decode crash / black or garbled output
Launch ComfyUI with --use-pytorch-cross-attention (PyTorch SDPA). On RDNA 3 this is the attention path confirmed in ComfyUI #11551; the alternative --bf16-vae is not the fix (it is contested and can inflate decode VRAM). Do not install FlashAttention — it does not build on gfx1101, and ComfyUI does not need it.
"Unet Loader (GGUF) node not found"
Install ComfyUI-GGUF (city96) in custom_nodes/, run pip install --upgrade gguf, and restart ComfyUI. The base ComfyUI install cannot load .gguf diffusion models without it.
Out of memory during sampling
On 16GB, drop to a lighter GGUF tier (Q6_K → Q5_K_M → Q4_K_M); each step down frees roughly the difference between the two file sizes in the tier table under "Running". Q8_0 is borderline here; prefer Q5_K_M or Q6_K. Close other GPU apps. If a large load stalls rather than OOMs, add --disable-smart-memory to the launch command (a known ROCm memory-management workaround).
Verify ROCm sees the GPU
If ComfyUI falls back to CPU, confirm python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)" prints True and a HIP version. If torch.version.hip is None, you installed a CUDA wheel — reinstall the ROCm build from the pytorch.org selector. The RX 7800 XT (gfx1101) is officially supported, so the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade is not needed unless a specific library ships only gfx1100 kernels.
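The one-liner above can be expanded into a slightly more readable check. In this sketch the function name `diagnose` and its messages are hypothetical; the only PyTorch calls used are `torch.cuda.is_available()` and `torch.version.hip`, the same two as in the one-liner.

```python
from typing import Optional


def diagnose(cuda_available: bool, hip_version: Optional[str]) -> str:
    """Turn the two values from the one-liner into a short diagnosis."""
    if hip_version is None:
        # No HIP version means the installed wheel is not a ROCm build.
        return "not a ROCm build: reinstall PyTorch using the ROCm option on the pytorch.org selector"
    if not cuda_available:
        return f"ROCm build (HIP {hip_version}) but no GPU is visible to PyTorch"
    return f"OK: ROCm build (HIP {hip_version}) with a visible GPU"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(diagnose(torch.cuda.is_available(), torch.version.hip))
```

Run it inside the ComfyUI virtual environment so it reports on the PyTorch build ComfyUI actually uses.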