Chroma1-Base (V48) on RTX 4080: Uncensored 8.9B FLUX.1-Schnell De-Distillation via FP8-Scaled in ComfyUI

What You'll Build

A working ComfyUI setup that runs Chroma1-Base — the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock and explicitly labeled "Chroma1-Base is Chroma-v.48" on the official HF card — on an RTX 4080 (Ada Lovelace, sm_89, 16 GB GDDR6X). The RTX 4080's 4th-generation tensor cores run FP8 (e4m3fn) natively, so this recipe leads with the pre-scaled FP8 redistribution by Clybius, which keeps the V48 weight lineage intact while fitting comfortably on a 16 GB card.

Hardware data: RTX 4080 (16 GB VRAM, ~716.8 GB/s, Ada sm_89) · runs at FP8-scaled e4m3fn (9.19 GB on disk) with the FLUX VAE and T5 XXL fp8 encoder

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family includes several variants from the same author. The three relevant here are Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), and Chroma1-Radiance (a different output head: no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card. For Chroma1-HD or Chroma1-Radiance, follow their own HF cards: the weight files differ, and Radiance also drops the FLUX VAE.

⚠️ The original lodestones/Chroma repo is deprecated. Its README now directs users to Chroma1-HD, Chroma1-Base, or Chroma1-Flash instead. Use Chroma1-Base for V48.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (BF16 weights are 17.8 GB on disk per silveroxides/Chroma1-Base-GGUF, so an FP8-scaled or GGUF Q8_0-or-smaller path is required to keep weights + T5 + VAE + activations resident on 16 GB)	RTX 4080 (16 GB)
RAM	16 GB system	—
Storage	~14 GB (FP8-scaled weights 9.19 GB + T5 XXL fp8 ~4.5 GB + FLUX VAE ae.safetensors ~330 MB)	—
Software	ComfyUI + ComfyUI-GGUF custom node by city96 (only needed for the GGUF fallback path)	—

Installation

1. Update ComfyUI

Update to a recent ComfyUI release. The standard ComfyUI Chroma example workflow at comfyanonymous.github.io/ComfyUI_examples/chroma drives the diffusion-model path this recipe uses. The Clybius FP8-scaled README notes it requires an up-to-date ComfyUI.
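
For a git-based install, updating usually amounts to pulling the latest commit and refreshing the Python dependencies. The sketch below assumes ComfyUI was installed with git clone and that pip belongs to the environment ComfyUI runs in. update_comfyui is a hypothetical helper name, and portable or desktop builds ship their own updaters instead.

```shell
# Hypothetical helper: fast-forward a git checkout of ComfyUI and refresh its dependencies.
# $1 is the ComfyUI directory (default: ./ComfyUI).
update_comfyui() {
  dir=${1:-ComfyUI}
  if [ ! -d "$dir/.git" ]; then
    echo "error: $dir is not a git checkout of ComfyUI" >&2
    return 1
  fi
  # --ff-only refuses to merge, so local modifications are never silently combined.
  git -C "$dir" pull --ff-only &&
    pip install -r "$dir/requirements.txt"
}

# Example: update_comfyui ~/ComfyUI
```

Restart ComfyUI afterwards so the updated code is loaded.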

2. Download the Chroma1-Base (V48) FP8-scaled weights

On the RTX 4080's Ada 4th-gen tensor cores, FP8 e4m3fn is a native datatype, so the pre-scaled FP8 redistribution is the recommended path. Download the standard V48 file from Clybius/Chroma-fp8-scaled. Its /v48/ subdirectory carries the FP8-scaled e4m3fn V48 weights, and the repo's base_model metadata points back to the canonical lodestones/Chroma:

wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/Clybius/Chroma-fp8-scaled/resolve/main/v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors

This file is 9.19 GB. Its FP8 e4m3fn weights are pre-scaled rather than cast on the fly with --fp8_e5m2-unet, which avoids the noise artifacts that on-the-fly e5m2 casting introduces on the Chroma family (see Troubleshooting).
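
A truncated download is a common source of load errors, so it can help to compare the on-disk size with the 9.19 GB figure. The helper below is a sketch, not part of the Clybius instructions: size_gb is a hypothetical name, and it assumes the quoted figure uses decimal gigabytes (10^9 bytes).

```shell
# Hypothetical helper: print a file's size in decimal GB (10^9 bytes), two decimals.
size_gb() {
  # wc -c reports the byte count; fail early if the file cannot be read.
  bytes=$(wc -c < "$1") || return 1
  awk -v b="$bytes" 'BEGIN { printf "%.2f\n", b / 1e9 }'
}

# Example (path assumes wget was run from the directory that contains ComfyUI/):
# size_gb ComfyUI/models/diffusion_models/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors
```

A result well below 9.19 suggests the transfer was interrupted and should be repeated.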

ℹ️ License note for the Clybius repo. The /v48/ FP8-scaled files redistribute the Apache-2.0 Chroma1-Base weights. The Clybius repo README flags that files in its debug/, merges/, and hybrid_merges/ folders are CC BY-NC-SA 4.0 (non-commercial) because they incorporate components from a non-commercial development repo. Stay in /v48/ for the Apache-2.0 lineage; avoid the debug/merges folders if you need commercial use.

3. Download the T5 XXL text encoder and FLUX VAE

The Chroma1-Base HF card requires the FLUX-ecosystem T5 XXL encoder and the FLUX VAE:

# T5 XXL — use fp8 on a 16 GB card (the fp16 variant doubles the encoder footprint)
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

# FLUX VAE (ae.safetensors from the FLUX.1 release)
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

The Chroma1-Base card links t5xxl_fp16.safetensors (in the same comfyanonymous/flux_text_encoders repo) and the FLUX VAE ae.safetensors. The fp8_e4m3fn T5 file above is the lower-footprint choice for a 16 GB card; the fp16 file also works if you have VRAM to spare.
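
Before opening ComfyUI, a quick check that all three files from steps 2 and 3 landed in the expected folders can save a round of red nodes. This is a minimal sketch: check_chroma_files is a hypothetical helper, and the paths are simply the wget targets used above.

```shell
# Hypothetical helper: report which of the three files from steps 2 and 3 are missing.
# $1 is the ComfyUI root directory (default: ./ComfyUI).
# The return status is the number of missing (or empty) files.
check_chroma_files() {
  root=${1:-ComfyUI}
  missing=0
  for f in \
    models/diffusion_models/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors \
    models/clip/t5xxl_fp8_e4m3fn.safetensors \
    models/vae/ae.safetensors
  do
    # -s is true only for files that exist and are non-empty.
    if [ -s "$root/$f" ]; then
      echo "ok       $f"
    else
      echo "MISSING  $f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Example: check_chroma_files ComfyUI
```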

4. Load the Chroma workflow

The Chroma1-HD T2I workflow JSON ships at ComfyUI_Chroma1-HD_T2I-workflow.json in both the lodestones/Chroma and lodestones/Chroma1-HD repos. Download it, drag it onto the ComfyUI canvas, then:

Point the Load Diffusion Model node at the downloaded chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors, and set its weight_dtype to default (as documented in the Clybius README).
Confirm the workflow's T5 loader points at t5xxl_fp8_e4m3fn.safetensors from step 3.
Confirm the VAE loader points at ae.safetensors.

The Chroma1-HD workflow JSON is the canonical ComfyUI workflow for the V48 lineage — the difference between Chroma1-Base and Chroma1-HD is the weights file, not the workflow topology.
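
Because the workflow was authored for Chroma1-HD, it may reference model filenames you have not downloaded. One way to see what it expects before loading it is to list the model filenames embedded in the JSON. The sketch below assumes filenames appear as plain quoted strings in the workflow file; workflow_models is a hypothetical helper name.

```shell
# Hypothetical helper: list every distinct .safetensors or .gguf filename
# that appears as a quoted string in a workflow JSON file.
workflow_models() {
  grep -oE '"[^"]*\.(safetensors|gguf)"' "$1" | tr -d '"' | sort -u
}

# Example: workflow_models ComfyUI_Chroma1-HD_T2I-workflow.json
```

Any name in the output that differs from the files in steps 2 and 3 is a loader you will need to repoint.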

Alternative: GGUF path (lower VRAM, more granular quant tiers)

If you prefer GGUF (e.g. to drop below the FP8 footprint, or to stack acceleration LoRAs at lower resolution), install the ComfyUI-GGUF custom node by city96 into ComfyUI/custom_nodes:

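# Clone into ComfyUI's custom_nodes folder, not the current directory
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
# Run pip from the same Python environment that ComfyUI uses
pip install -r requirements.txt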

Restart ComfyUI, then pick one quantization from silveroxides/Chroma1-Base-GGUF — file sizes verbatim from the model card:

Quant	Size
Q2_K	3.41 GB
Q3_K_S	4.29 GB
Q4_0 / Q4_K_S	5.43 GB
Q4_K_M	5.57 GB
Q4_1	5.97 GB
Q5_0 / Q5_K_S	6.51 GB
Q5_K_M	6.65 GB
Q5_1	7.05 GB
Q6_K	7.65 GB
Q8_0	9.74 GB
BF16	17.8 GB

Drop the .gguf into ComfyUI/models/diffusion_models/ and swap the Load Diffusion Model node for the Unet Loader (GGUF) node from ComfyUI-GGUF. Q8_0 (9.74 GB) is the highest-quality quant in the table that still fits a 16 GB card alongside the encoder and VAE; drop to Q4_K_M (5.57 GB) if you stack LoRAs or push past 1024×1024.
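
If you would rather choose a tier from a size budget than by feel, the table can be turned into a small lookup. The sketch below hard-codes the sizes from the table above; pick_quant is a hypothetical helper, and on-disk file size is only a rough proxy for the VRAM the model will occupy.

```shell
# Hypothetical helper: print the largest quant from the size table whose file
# size in GB does not exceed the budget passed as $1 (prints "none" if nothing fits).
# Q4_0 and Q5_0 share sizes with Q4_K_S and Q5_K_S, so only the K_S names are listed.
pick_quant() {
  awk -v budget="$1" 'BEGIN {
    table = "Q2_K:3.41 Q3_K_S:4.29 Q4_K_S:5.43 Q4_K_M:5.57 Q4_1:5.97 Q5_K_S:6.51"
    table = table " Q5_K_M:6.65 Q5_1:7.05 Q6_K:7.65 Q8_0:9.74 BF16:17.8"
    n = split(table, rows, " ")
    best = "none"
    for (i = 1; i <= n; i++) {
      split(rows[i], kv, ":")
      # Rows are sorted by ascending size, so the last one that fits wins.
      if (kv[2] + 0 <= budget + 0) best = kv[1]
    }
    print best
  }'
}

# Examples:
# pick_quant 10   prints Q8_0
# pick_quant 6    prints Q4_1
```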

Running

Use a 1024×1024 latent for the first run. The Chroma1-Base diffusers snippet on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly (20–40 steps work; the example workflow defaults are a safe start).

Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/

The first generation pays a cold-load cost (weights → VRAM, text encoder → VRAM). Subsequent generations with the same model reuse the loaded weights.

ℹ️ No Blackwell-specific wheel selection needed. The RTX 4080 is Ada Lovelace (sm_89), not Blackwell (RTX 50-series, sm_120). The default PyTorch wheels that a standard ComfyUI install pulls in already include sm_89 kernels, and FlashAttention-2 has full sm_89 coverage, so no cu128-specific wheel pinning or attn_implementation overrides are required. The FP8 e4m3fn datatype used by the recommended path is hardware-native on Ada's 4th-gen tensor cores.
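
To confirm which architecture a machine actually has, nvidia-smi can report the compute capability directly. This is a sketch under two assumptions: the installed driver is recent enough to support the compute_cap query field, and arch_name is a hypothetical helper that only knows the two architectures named above.

```shell
# Print name, compute capability, and total VRAM for each GPU.
# compute_cap is available in recent NVIDIA drivers; older drivers may reject the field.
gpu_info() {
  nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader
}

# Hypothetical helper: map a compute capability string to the architecture names used above.
arch_name() {
  case "$1" in
    8.9)  echo "Ada Lovelace" ;;   # sm_89, e.g. RTX 4080
    12.0) echo "Blackwell" ;;      # sm_120, RTX 50-series
    *)    echo "other" ;;
  esac
}

# Example: arch_name 8.9
```

On an RTX 4080, gpu_info should report a compute capability of 8.9.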

Results

Speed: Omitted. No first-party generation-time data point on Chroma1-Base specifically is published for the RTX 4080. The RTX 4080's ~716.8 GB/s memory bandwidth sits between the slower 16 GB Ada cards and the 24 GB RTX 4090 (~1008 GB/s), so a generation time scaled from any of those would be a guess, not a measurement. The only first-party speed thread in the family (Chroma1-HD discussion #25) measures Chroma1-HD (not Base) on an RTX 5090 at 1152×1152, 40 steps, 10 LoRAs — different variant, different card, different configuration — so it is not quotable for this recipe. Once community measurements land via /contribute, the /check/ endpoint will surface them.
VRAM usage: Plan for ≥ 16 GB. The on-disk BF16 size is 17.8 GB per silveroxides/Chroma1-Base-GGUF, so unquantized BF16 already overflows 16 GB before the text encoder and VAE load — the FP8-scaled file (9.19 GB on disk per Clybius/Chroma-fp8-scaled) or the Q8_0 GGUF (9.74 GB on disk) is the path that fits the RTX 4080's 16 GB with headroom for T5 fp8 + ae.safetensors + activations. Once a measured number for Chroma1-Base on a 16 GB card lands, /check/ will replace this envelope.
Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed.

For the full benchmark data, see /check/chroma-v48/rtx-4080.

Troubleshooting

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a successor retrained from v.48 as a finetune-ready base — adjacent lineage, not the same weights. Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder) — close cousin, distinct architecture. The canonical, currently-maintained V48 distribution is Chroma1-Base; the FP8-scaled file in step 2 (chroma-unlocked-v48...) is that lineage in Ada-native FP8.

Noise artifacts with `--fp8_e5m2-unet`

The FP8-scaled file in step 2 uses pre-scaled e4m3fn weights, which is the clean path. Avoid ComfyUI's on-the-fly --fp8_e5m2-unet flag, which produces noise artifacts on Chroma1-family models. Use the pre-scaled e4m3fn file or the Q8_0 GGUF from the alternative path; if you must cast on the fly, use --fp8_e4m3fn-unet instead.
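
If ComfyUI is started from a wrapper script, a small guard can catch the problematic flag before launch. This is a sketch only: check_launch_flags is a hypothetical helper, and it inspects whatever arguments you pass it rather than a running ComfyUI process.

```shell
# Hypothetical guard: return 1 with a warning if the argument list
# contains the on-the-fly e5m2 cast flag.
check_launch_flags() {
  for arg in "$@"; do
    case "$arg" in
      --fp8_e5m2-unet)
        echo "warning: --fp8_e5m2-unet causes noise artifacts on Chroma; use the pre-scaled file or --fp8_e4m3fn-unet" >&2
        return 1 ;;
    esac
  done
  return 0
}

# Example wrapper usage (assumes ComfyUI is launched with python main.py):
# check_launch_flags "$@" && python main.py "$@"
```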

Out-of-memory at high resolution

The FP8-scaled file (9.19 GB) leaves roughly 6.8 GB of headroom on a 16 GB card for the T5 XXL fp8 encoder (~4.5 GB), the FLUX VAE (~330 MB), and intermediate activations, which works out to about 2 GB for activations if everything stays resident at once. If 1280×1280 or larger pushes the card to OOM, switch to the GGUF alternative and drop to Q5_K_M (6.65 GB) or Q4_K_M (5.57 GB). Quality degrades gracefully across the GGUF tiers, and the per-quant size table in the alternative-path section gives you the exact trade.
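
The headroom arithmetic can be reproduced for any combination of files. The sketch below treats on-disk sizes as a stand-in for resident VRAM, which is an assumption: real usage depends on resolution, and ComfyUI may offload the text encoder after the prompt is encoded. vram_headroom is a hypothetical helper name.

```shell
# Hypothetical helper: remaining VRAM (GB) after subtracting each component from the card total.
# Usage: vram_headroom TOTAL COMPONENT...
vram_headroom() {
  total=$1; shift
  # One component per line; awk subtracts each from the running total.
  printf '%s\n' "$@" | awk -v t="$total" '{ t -= $1 } END { printf "%.2f\n", t }'
}

# FP8-scaled path:  vram_headroom 16 9.19 4.5 0.33   prints 1.98
# Q4_K_M GGUF path: vram_headroom 16 5.57 4.5 0.33   prints 5.60
```

The second line shows why dropping to Q4_K_M frees room for larger latents: the same card keeps roughly 3.6 GB more for activations.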

Did the workflow load the right encoder?

Chroma1-Base uses the standard FLUX T5 XXL encoder (not Qwen3-4B / Gemma / etc.), so the safetensors t5xxl_fp8_e4m3fn.safetensors path works directly. If you see garbled prompts or a CLIP-vs-T5 mismatch error, confirm the workflow points the text-encoder node at the T5 file from step 3, not a CLIP file.

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.