What You'll Build
A working ComfyUI setup that runs Chroma1-Base on an RTX 4080 (Ada Lovelace, sm_89, 16 GB GDDR6X). Chroma1-Base is the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock; the official HF card states explicitly that "Chroma1-Base is Chroma-v.48". The RTX 4080's 4th-generation tensor cores run FP8 (e4m3fn) natively, so this recipe leads with the pre-scaled FP8 redistribution by Clybius, which keeps the V48 weight lineage intact while fitting comfortably on a 16 GB card.
Hardware data: RTX 4080 (16 GB VRAM, ~716.8 GB/s, Ada sm_89) · runs at FP8-scaled e4m3fn (9.19 GB on disk) with the FLUX VAE and T5 XXL fp8 encoder · See benchmark data
ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships three current variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), and Chroma1-Radiance (a different output head: no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card. For Chroma1-HD or Chroma1-Radiance, follow their respective HF cards, since the install paths differ.
⚠️ The original lodestones/Chroma repo is deprecated. Its README now directs users to Chroma1-HD, Chroma1-Base, or Chroma1-Flash instead. Use Chroma1-Base for V48.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 16 GB VRAM (BF16 weights are 17.8 GB on disk per silveroxides/Chroma1-Base-GGUF, so an FP8-scaled or GGUF Q8_0-or-smaller path is required to keep weights + T5 + VAE + activations resident on 16 GB) | RTX 4080 (16 GB) |
| RAM | 16 GB system | — |
| Storage | ~14 GB (FP8-scaled weights 9.19 GB + T5 XXL fp8 ~4.5 GB + FLUX VAE ae.safetensors ~0.33 GB) | — |
| Software | ComfyUI + ComfyUI-GGUF custom node by city96 (only needed for the GGUF fallback path) | — |
Installation
1. Update ComfyUI
Update to a recent ComfyUI release. The standard ComfyUI Chroma example workflow at comfyanonymous.github.io/ComfyUI_examples/chroma drives the diffusion-model path this recipe uses. The Clybius FP8-scaled README notes it requires an up-to-date ComfyUI.
2. Download the Chroma1-Base (V48) FP8-scaled weights
FP8 e4m3fn is a native datatype on the RTX 4080's Ada 4th-gen tensor cores, so the pre-scaled FP8 redistribution is the recommended path. Download the standard V48 file from Clybius/Chroma-fp8-scaled. Its /v48/ subdirectory carries the FP8-scaled e4m3fn V48 weights, and the repo's base_model field links back to the canonical lodestones/Chroma:
wget -P ComfyUI/models/diffusion_models/ \
https://huggingface.co/Clybius/Chroma-fp8-scaled/resolve/main/v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors
This file is 9.19 GB. It uses pre-scaled FP8 e4m3fn weights rather than on-the-fly casting with ComfyUI's --fp8_e5m2-unet flag, which avoids the noise artifacts that on-the-fly casting introduces on the Chroma family (see Troubleshooting).
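The wget command above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `hf_resolve_url` and `download` are illustrative and not part of any Hugging Face tooling, and the URL pattern is simply the one visible in the wget line.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Repo and file path copied from the wget command in this step.
REPO = "Clybius/Chroma-fp8-scaled"
FILENAME = "v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors"


def hf_resolve_url(repo_id: str, path_in_repo: str, revision: str = "main") -> str:
    """Build a direct-download URL in the same shape as the wget line (hypothetical helper)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(path_in_repo)}"


def download(url: str, dest_dir: str) -> Path:
    """Download url into dest_dir, keeping the remote file name (hypothetical helper)."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / url.rsplit("/", 1)[-1]
    urlretrieve(url, target)  # blocking call; the V48 file is about 9.19 GB
    return target
```

Calling `download(hf_resolve_url(REPO, FILENAME), "ComfyUI/models/diffusion_models")` would place the file where the wget command does. Nothing is downloaded until `download` is called.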
ℹ️ License note for the Clybius repo. The /v48/ FP8-scaled files redistribute the Apache-2.0 Chroma1-Base weights. The Clybius repo README flags that files in its debug/, merges/, and hybrid_merges/ folders are CC BY-NC-SA 4.0 (non-commercial) because they incorporate components from a non-commercial development repo. Stay in /v48/ for the Apache-2.0 lineage; avoid the debug/merges folders if you need commercial use.
3. Download the T5 XXL text encoder and FLUX VAE
The Chroma1-Base HF card requires the FLUX-ecosystem T5 XXL encoder and the FLUX VAE:
# T5 XXL — use fp8 on a 16 GB card (the fp16 variant doubles the encoder footprint)
wget -P ComfyUI/models/clip/ \
https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors
# FLUX VAE (ae.safetensors from the FLUX.1 release)
wget -P ComfyUI/models/vae/ \
https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors
The Chroma1-Base card links t5xxl_fp16.safetensors (in the same comfyanonymous/flux_text_encoders repo) and the FLUX VAE ae.safetensors; the fp8_e4m3fn T5 variant above is the lower-footprint encoder for a 16 GB card. The fp16 T5 variant is also supported if you have spare VRAM.
4. Load the Chroma workflow
The Chroma1-HD T2I workflow JSON ships at ComfyUI_Chroma1-HD_T2I-workflow.json in both the lodestones/Chroma and lodestones/Chroma1-HD repos. Download it, drag it onto the ComfyUI canvas, then:
- Point the Load Diffusion Model node at the downloaded chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors, and set its weight_dtype to default (as documented in the Clybius README).
- Confirm the workflow's T5 loader points at t5xxl_fp8_e4m3fn.safetensors from step 3.
- Confirm the VAE loader points at ae.safetensors.
The Chroma1-HD workflow JSON is the canonical ComfyUI workflow for the V48 lineage — the difference between Chroma1-Base and Chroma1-HD is the weights file, not the workflow topology.
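Before queueing anything, it can help to confirm that the three files from steps 2 and 3 landed in the folders the loader nodes read from. This is a small sketch under the assumption that ComfyUI lives in a folder you pass in as `comfy_root`; the function name `missing_files` is illustrative.

```python
from pathlib import Path

# Relative paths taken from the wget commands in steps 2 and 3.
EXPECTED_FILES = [
    "models/diffusion_models/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors",
    "models/clip/t5xxl_fp8_e4m3fn.safetensors",
    "models/vae/ae.safetensors",
]


def missing_files(comfy_root, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in expected:
        path = root / rel
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

An empty list from `missing_files("ComfyUI")` means all three files are present; it does not verify that the downloads are complete or uncorrupted.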
Alternative: GGUF path (lower VRAM, more granular quant tiers)
If you prefer GGUF (e.g. to drop below the FP8 footprint, or to stack acceleration LoRAs at lower resolution), install the ComfyUI-GGUF custom node by city96 into ComfyUI/custom_nodes:
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt
Restart ComfyUI, then pick one quantization from silveroxides/Chroma1-Base-GGUF — file sizes verbatim from the model card:
| Quant | Size |
|---|---|
| Q2_K | 3.41 GB |
| Q3_K_S | 4.29 GB |
| Q4_0 / Q4_K_S | 5.43 GB |
| Q4_K_M | 5.57 GB |
| Q4_1 | 5.97 GB |
| Q5_0 / Q5_K_S | 6.51 GB |
| Q5_K_M | 6.65 GB |
| Q5_1 | 7.05 GB |
| Q6_K | 7.65 GB |
| Q8_0 | 9.74 GB |
| BF16 | 17.8 GB |
Drop the .gguf into ComfyUI/models/diffusion_models/ and swap the Load Diffusion Model node for the Unet Loader (GGUF) node from ComfyUI-GGUF. Q8_0 (9.74 GB) is the highest in-family quality that comfortably fits 16 GB; drop to Q4_K_M (5.57 GB) if you stack LoRAs or push past 1024×1024.
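The choice of tier can be expressed as a lookup over the table above. The sketch below picks the largest quant whose file fits a weights-only budget that you choose; using on-disk size as a stand-in for VRAM footprint is an approximation, and `largest_quant_within` is an illustrative name.

```python
# On-disk sizes in GB, copied from the quant table above.
GGUF_SIZES_GB = {
    "Q2_K": 3.41, "Q3_K_S": 4.29, "Q4_0": 5.43, "Q4_K_S": 5.43,
    "Q4_K_M": 5.57, "Q4_1": 5.97, "Q5_0": 6.51, "Q5_K_S": 6.51,
    "Q5_K_M": 6.65, "Q5_1": 7.05, "Q6_K": 7.65, "Q8_0": 9.74, "BF16": 17.8,
}


def largest_quant_within(budget_gb, sizes=GGUF_SIZES_GB):
    """Return the largest quant whose file size fits in budget_gb, or None."""
    fitting = {name: gb for name, gb in sizes.items() if gb <= budget_gb}
    if not fitting:
        return None
    # When two quants share a size, the one listed first in the table wins.
    return max(fitting, key=fitting.get)
```

For example, a 10 GB weights budget selects Q8_0, while a 7 GB budget selects Q5_K_M.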
Running
Use a 1024×1024 latent for the first run. The Chroma1-Base diffusers snippet on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly (20–40 steps work; the example workflow defaults are a safe start).
Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/
The first generation pays a cold-load cost (weights → VRAM, text encoder → VRAM). Subsequent generations with the same model reuse the loaded weights.
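Queue Prompt can also be triggered from a script through ComfyUI's HTTP endpoint. The sketch below assumes a server running at the default local address 127.0.0.1:8188 and a workflow exported in ComfyUI's API format (the drag-and-drop workflow JSON uses a different layout and would not be accepted as-is). The function names are illustrative.

```python
import json
from urllib import request


def build_prompt_request(workflow: dict, server: str = "127.0.0.1:8188") -> request.Request:
    """Wrap an API-format workflow in a POST request for the /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return request.Request(
        f"http://{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_prompt(workflow: dict, server: str = "127.0.0.1:8188") -> dict:
    """Send the workflow to a running ComfyUI server and return its JSON reply."""
    with request.urlopen(build_prompt_request(workflow, server)) as resp:
        return json.loads(resp.read())
```

Splitting request construction from sending keeps the payload easy to inspect before anything is submitted; `queue_prompt` only succeeds when a ComfyUI server is actually listening.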
ℹ️ No Blackwell-specific wheel selection needed. The RTX 4080 is Ada Lovelace (sm_89), not Blackwell (RTX 50-series, sm_120). The default pip install torch shipped with ComfyUI already includes sm_89 kernels, and FlashAttention-2 has full sm_89 coverage. No cu128-specific wheel pinning or attn_implementation overrides are required, and the FP8 e4m3fn datatype the recommended path uses is hardware-native on Ada's 4th-gen tensor cores.
Results
- Speed: Omitted. No first-party generation-time data point on Chroma1-Base specifically is published for the RTX 4080. The RTX 4080's ~716.8 GB/s memory bandwidth sits between the slower 16 GB Ada cards and the 24 GB RTX 4090 (~1008 GB/s), so a generation time scaled from any of those would be a guess, not a measurement. The only first-party speed thread in the family (Chroma1-HD discussion #25) measures Chroma1-HD (not Base) on an RTX 5090 at 1152×1152, 40 steps, 10 LoRAs — different variant, different card, different configuration — so it is not quotable for this recipe. Once community measurements land via /contribute, the /check/ endpoint will surface them.
- VRAM usage: Plan for ≥ 16 GB. The on-disk BF16 size is 17.8 GB per silveroxides/Chroma1-Base-GGUF, so unquantized BF16 already overflows 16 GB before the text encoder and VAE load — the FP8-scaled file (9.19 GB on disk per Clybius/Chroma-fp8-scaled) or the Q8_0 GGUF (9.74 GB on disk) is the path that fits the RTX 4080's 16 GB with headroom for T5 fp8 + ae.safetensors + activations. Once a measured number for Chroma1-Base on a 16 GB card lands, /check/ will replace this envelope.
- Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed.
For the full benchmark data, see /check/chroma-v48/rtx-4080.
Troubleshooting
"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?
Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a successor retrained from v.48 as a finetune-ready base — adjacent lineage, not the same weights. Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder) — close cousin, distinct architecture. The canonical, currently-maintained V48 distribution is Chroma1-Base; the FP8-scaled file in step 2 (chroma-unlocked-v48...) is that lineage in Ada-native FP8.
Noise artifacts with --fp8_e5m2-unet
The FP8-scaled file in step 2 uses pre-scaled e4m3fn weights, which is the clean path. Avoid ComfyUI's on-the-fly --fp8_e5m2-unet flag, which produces noise artifacts on Chroma1-family models. Stick to the pre-scaled e4m3fn file (or the Q8_0 GGUF from the alternative path), or use --fp8_e4m3fn-unet if you need an on-the-fly fp8 cast.
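If you are unsure which precision a downloaded .safetensors file actually stores, the file's header can be read without loading any weights. The sketch below relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header) and counts tensors per dtype tag; safetensors labels e4m3 tensors as F8_E4M3 and e5m2 tensors as F8_E5M2. What mix of dtypes the Clybius file contains is not stated in this recipe, so treat the output as a diagnostic rather than a pass/fail check.

```python
import json
import struct
from collections import Counter


def safetensors_dtype_counts(path):
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

A result dominated by F8_E4M3 is consistent with the pre-scaled e4m3fn path, while F8_E5M2 entries would suggest a different file.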
Out-of-memory at high resolution
The FP8-scaled file (9.19 GB) leaves roughly 6.8 GB of headroom on a 16 GB card for the T5 XXL fp8 encoder (~4.5 GB), the FLUX VAE (~330 MB), and intermediate activations. If 1280×1280 or larger pushes the card to OOM, switch to the GGUF alternative and drop to Q5_K_M (6.65 GB) or Q4_K_M (5.57 GB). Quality degrades gracefully across the GGUF tiers, and the per-quant size table in the alternative-path section gives you the exact trade.
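The headroom arithmetic can be made explicit. The sketch below subtracts component sizes from the card's VRAM, using on-disk file size as a proxy for resident footprint; that proxy and the assumption that every component stays on the GPU at once are simplifications, since ComfyUI may move components off the GPU between stages.

```python
def vram_headroom_gb(vram_gb, weights_gb, text_encoder_gb=0.0, vae_gb=0.0):
    """Return VRAM left after the listed components, using file size as a proxy."""
    return round(vram_gb - weights_gb - text_encoder_gb - vae_gb, 2)


# Figures from this recipe: 16 GB card, T5 XXL fp8 ~4.5 GB, FLUX VAE ~0.33 GB.
fp8_all_resident = vram_headroom_gb(16, 9.19, 4.5, 0.33)    # FP8-scaled weights
q5_k_m_all_resident = vram_headroom_gb(16, 6.65, 4.5, 0.33)  # GGUF Q5_K_M
q4_k_m_all_resident = vram_headroom_gb(16, 5.57, 4.5, 0.33)  # GGUF Q4_K_M
```

With everything resident, the FP8-scaled file leaves about 2 GB for activations, Q5_K_M about 4.5 GB, and Q4_K_M about 5.6 GB, which is why dropping a tier helps at higher resolutions.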
Did the workflow load the right encoder?
Chroma1-Base uses the standard FLUX T5 XXL encoder (not Qwen3-4B / Gemma / etc.), so the safetensors t5xxl_fp8_e4m3fn.safetensors path works directly. If you see garbled prompts or a CLIP-vs-T5 mismatch error, confirm the workflow points the text-encoder node at the T5 file from step 3, not a CLIP file.
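One way to see which files a workflow points at, without opening every node, is to scan the workflow JSON for strings that look like model file names. The sketch below walks any JSON structure and makes no assumption about ComfyUI's node schema; `referenced_files` is an illustrative name.

```python
import json


def referenced_files(workflow, suffixes=(".safetensors", ".gguf")):
    """Collect every string value in a workflow JSON that ends with a model-file suffix."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node.endswith(suffixes):
            found.add(node)

    walk(workflow)
    return found


# Usage (path is wherever you saved the workflow):
# with open("ComfyUI_Chroma1-HD_T2I-workflow.json") as f:
#     print(referenced_files(json.load(f)))
```

If t5xxl_fp8_e4m3fn.safetensors is missing from the result, the text-encoder node is still pointing at another file.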
If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.