Chroma1-Base (V48) on RTX 4090: Uncensored 8.9B FLUX.1-Schnell De-Distillation via Diffusers BF16

What You'll Build

A working setup that runs Chroma1-Base — the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock and explicitly labeled as "Chroma1-Base is Chroma-v.48" on the official HF card — on an RTX 4090 24GB (Ada Lovelace, sm_89). With 24 GB of VRAM you can drop the GGUF quantization the 16 GB-tier recipes require and run the canonical BF16 weights directly through the official ChromaPipeline in the diffusers library (or the ComfyUI equivalent), keeping the V48 weight lineage at full precision.

Hardware data: RTX 4090 (24 GB VRAM, 1008 GB/s memory bandwidth, Ada sm_89) · runs at BF16 (17.8 GB single-file checkpoint per silveroxides/Chroma1-Base-GGUF and the canonical HF Files tab) with T5 XXL fp16 and the FLUX VAE · See benchmark data

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships several variants from the same author; the three discussed here are Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), and Chroma1-Radiance (a different output head: no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card. For Chroma1-HD or Chroma1-Radiance, follow their respective HF cards, because the install paths differ (Radiance does not use the FLUX VAE at all).

ℹ️ Why diffusers BF16 and not GGUF. On 16 GB-tier cards (RTX 4060 Ti 16GB, RTX 5060 Ti) the 17.8 GB BF16 single-file checkpoint does not fit in VRAM even before the T5 encoder and FLUX VAE are loaded, so those recipes use Chroma1-Base GGUF quants from silveroxides/Chroma1-Base-GGUF in ComfyUI. The RTX 4090's 24 GB makes the GGUF detour unnecessary: the canonical BF16 path documented on the Chroma1-Base HF card and in the HF diffusers Chroma pipeline docs is the recommended setup for this card.

Requirements

Component	Minimum	Tested
GPU	24 GB VRAM (BF16 weights are 17.8 GB on disk per silveroxides/Chroma1-Base-GGUF and the HF Files tab; the T5 XXL fp16 encoder and FLUX VAE load on top)	RTX 4090 (24 GB)
RAM	16 GB system (32 GB recommended for `enable_model_cpu_offload`)	—
Storage	~28 GB (BF16 checkpoint 17.8 GB + T5 XXL fp16 ~9.5 GB + FLUX VAE ae.safetensors ~330 MB)	—
Software	Python 3.10+, `diffusers`, `transformers`, `sentencepiece`, `accelerate` (or ComfyUI for the workflow path)	—
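Before downloading, it can help to confirm the target disk has room. The following is a minimal sketch that sums the approximate component sizes quoted in the Storage row and compares the total against free space. The helper names (`required_gb`, `has_room`) and the 2 GB safety margin are illustrative assumptions, not part of any library.

```python
import shutil

# Approximate on-disk sizes in GB, as quoted in the Requirements table.
COMPONENT_GB = {
    "Chroma1-Base BF16 checkpoint": 17.8,
    "T5 XXL fp16 encoder": 9.5,
    "FLUX VAE (ae.safetensors)": 0.33,
}

def required_gb(components=COMPONENT_GB):
    """Total download size in GB for the listed components."""
    return sum(components.values())

def has_room(path=".", margin_gb=2.0):
    """True if the filesystem holding `path` has enough free space.

    `margin_gb` is an arbitrary safety margin (an assumption; tune it).
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb() + margin_gb

if __name__ == "__main__":
    print(f"need ~{required_gb():.1f} GB, enough free space: {has_room()}")
```

Pass the directory you plan to download into (for example your ComfyUI folder) as `path`.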

Installation

1. Install the diffusers stack

ChromaPipeline is part of mainline diffusers. Install the runtime per the Chroma1-Base HF card Quickstart:

pip install transformers diffusers sentencepiece accelerate

You also need PyTorch with CUDA support. Unlike Blackwell GPUs (sm_120), the RTX 4090 is Ada Lovelace (sm_89): the default pip install torch wheel already supports sm_89, so no special cu128 wheel selection is required.
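It is worth confirming that the installed PyTorch build actually sees the GPU before loading 17.8 GB of weights. The sketch below uses only standard `torch.cuda` queries. The helper names (`is_ada`, `describe_gpu`) are illustrative, and the script reports a missing install or a CPU-only build instead of crashing.

```python
try:
    import torch
except ImportError:  # report a missing install instead of crashing
    torch = None

def is_ada(capability):
    """True for compute capability 8.9 (Ada Lovelace, sm_89)."""
    return tuple(capability) == (8, 9)

def describe_gpu():
    """One-line summary of the first CUDA device, or why none is usable."""
    if torch is None:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed but CUDA is not available"
    name = torch.cuda.get_device_name(0)
    cap = torch.cuda.get_device_capability(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    bf16 = torch.cuda.is_bf16_supported()
    return (f"{name}: sm_{cap[0]}{cap[1]}, {total_gb:.1f} GB, "
            f"bf16={bf16}, ada={is_ada(cap)}")

if __name__ == "__main__":
    print(describe_gpu())
```

If the output says CUDA is not available, reinstall PyTorch from a CUDA-enabled wheel before continuing.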

2. Run the canonical diffusers Quickstart

The Chroma1-Base HF card ships a complete Quickstart that loads the BF16 weights directly from the Hub. This is the path the 24 GB of an RTX 4090 unlocks:

import torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-Base", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = [
    "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done."
]
negative_prompt = ["low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"]

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    generator=torch.Generator("cpu").manual_seed(433),
    num_inference_steps=40,
    guidance_scale=3.0,
    num_images_per_prompt=1,
).images[0]
image.save("chroma.png")

Snippet verbatim from the Chroma1-Base HF card Quickstart. pipe.enable_model_cpu_offload() keeps the transformer in VRAM only while it is active and moves the T5 encoder back to CPU after prompt encoding. The HF diffusers Chroma docs note that "Chroma can use all the same optimizations as Flux," so the standard Flux memory options (enable_vae_slicing(), enable_vae_tiling()) are also available if you want to lower peak VRAM further or push past 1024×1024.

3. (Alternative) Use the ComfyUI workflow path

If you prefer ComfyUI, the Chroma1-Base HF card lists the three assets you need and where to put them. On a 24 GB card you can use the unquantized .safetensors directly — no GGUF or fp8 cast required:

# Chroma1-Base BF16 single-file checkpoint (17.8 GB) → diffusion_models folder
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/lodestones/Chroma1-Base/resolve/main/Chroma1-Base.safetensors

# T5 XXL fp16 text encoder → text_encoders folder
#   (HF card recommends the fp16 variant; on 24 GB you have ample headroom for it)
wget -P ComfyUI/models/text_encoders/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors

# FLUX VAE (from the Chroma repo's ae.safetensors) → vae folder
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

# Chroma ComfyUI simple workflow JSON (from the lodestones/Chroma repo)
wget https://huggingface.co/lodestones/Chroma/resolve/main/simple_workflow.json

The checkpoint, T5 encoder, and VAE URLs are adapted from the Chroma1-Base HF card "ComfyUI Setup" section — the checkpoint link is upgraded to the canonical Chroma1-Base.safetensors since the originally-linked lodestones/Chroma repo is now deprecated as a weights source. The workflow JSON itself still lives under lodestones/Chroma; drag simple_workflow.json onto the ComfyUI canvas and verify the loader nodes point at the three files above.
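A quick way to catch an interrupted or misplaced download is to check that the three files exist where the wget commands above put them. This is a minimal sketch: the relative paths mirror step 3, while the helper name `missing_assets` and the default `ComfyUI` root are assumptions to adjust for your install.

```python
from pathlib import Path

# Relative locations used by the wget commands in step 3.
EXPECTED_ASSETS = (
    "models/diffusion_models/Chroma1-Base.safetensors",
    "models/text_encoders/t5xxl_fp16.safetensors",
    "models/vae/ae.safetensors",
)

def missing_assets(comfy_root="ComfyUI"):
    """Return the expected files that are absent or empty under `comfy_root`."""
    root = Path(comfy_root)
    missing = []
    for rel in EXPECTED_ASSETS:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing

if __name__ == "__main__":
    gaps = missing_assets()
    print("all assets present" if not gaps else f"missing: {gaps}")
```

The check only looks at presence and non-zero size; it does not verify checksums.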

Running

For the diffusers path, save the snippet in step 2 as run.py and:

python run.py

The first run pays a cold-load cost (weights → VRAM, T5 encoder → VRAM, then offloaded back to CPU after prompt encoding). Subsequent generations reuse the loaded transformer. The default 1024×1024 is the resolution Chroma was trained at and the safest first-run target; the ChromaPipeline API docs document enable_vae_tiling() for pushing past it without OOM.
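To see the cold-versus-warm difference on your own card, you can wrap a generation call in a small timer. The sketch below is an illustrative helper (`measure` is a hypothetical name) that records wall-clock time and, when CUDA is available, the peak memory reported by PyTorch's allocator. That figure excludes the CUDA context and other processes, so it will read lower than nvidia-smi.

```python
import time

try:
    import torch
except ImportError:  # the timer still works without PyTorch
    torch = None

def measure(fn, *args, **kwargs):
    """Run fn once; return (result, seconds, peak_vram_gb or None)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    seconds = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if use_cuda else None
    return result, seconds, peak_gb

# Usage with the `pipe` and `prompt` objects from step 2 (not executed here):
#   _, cold_s, _ = measure(pipe, prompt=prompt, num_inference_steps=40)
#   _, warm_s, peak = measure(pipe, prompt=prompt, num_inference_steps=40)
#   print(f"cold {cold_s:.1f}s, warm {warm_s:.1f}s, peak {peak:.1f} GB")
```

Run the same call twice in one process; the second measurement is the one to compare across settings.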

For the ComfyUI path: load simple_workflow.json (downloaded in step 3) and press Queue Prompt. PNGs land in ComfyUI/output/.

Results

Speed: Omitted. No first-party generation-time data point on Chroma1-Base specifically is published for the RTX 4090. The only first-party speed thread in the Chroma family (Chroma1-HD discussion #25) measures Chroma1-HD (a separate variant — see the variant admonition above) on an RTX 5090, not Chroma1-Base on a 4090, so it is not quotable for this recipe. Once community measurements land via /contribute, the /check/chroma-v48/rtx-4090 endpoint will surface them.
VRAM usage: Plan for the full 24 GB envelope on the RTX 4090. The BF16 transformer checkpoint is 17.8 GB on disk per both the HF Files tab (single Chroma1-Base.safetensors, 17.8 GB) and the silveroxides/Chroma1-Base-GGUF per-quant-tier table (BF16 row). The T5 XXL fp16 encoder (~9.5 GB) and FLUX VAE (~330 MB) load on top, plus per-step activations and latents at 1024×1024. enable_model_cpu_offload() from the canonical Quickstart keeps the encoder off-GPU outside the prompt-encoding window, which is what makes the BF16 path fit cleanly in 24 GB on the 4090. The Chroma1-Base HF card does not publish a measured runtime peak for any specific card — once an empirical 4090 measurement lands, /check/chroma-v48/rtx-4090 will replace this derived envelope.
Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed; the canonical Quickstart uses num_inference_steps=40 and guidance_scale=3.0 as the starting point.
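The VRAM reasoning above comes down to simple arithmetic, sketched here using the approximate weight sizes quoted in this recipe. It only counts weights: activations, latents, and the CUDA context come out of the remaining headroom, so treat the result as a derived estimate rather than a measurement. The function names are illustrative.

```python
VRAM_GB = 24.0

# Approximate weight sizes in GB, as quoted in this recipe.
TRANSFORMER_GB = 17.8
T5_GB = 9.5
VAE_GB = 0.33

def resident_gb():
    """Weights on the GPU if every component stays loaded at once."""
    return TRANSFORMER_GB + T5_GB + VAE_GB

def offloaded_peak_gb():
    """Weights on the GPU when only one component is resident at a time,
    which is the idea behind enable_model_cpu_offload()."""
    return max(TRANSFORMER_GB, T5_GB, VAE_GB)

def headroom_gb(weights_gb, vram_gb=VRAM_GB):
    """VRAM left over for activations, latents, and the CUDA context."""
    return vram_gb - weights_gb

if __name__ == "__main__":
    for label, gb in (("all resident", resident_gb()),
                      ("offloaded", offloaded_peak_gb())):
        print(f"{label}: {gb:.2f} GB weights, headroom {headroom_gb(gb):+.2f} GB")
```

Keeping everything resident overshoots 24 GB, while the offloaded case leaves roughly 6 GB for everything that is not weights.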

For the full benchmark data, see /check/chroma-v48/rtx-4090.

Troubleshooting

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a separate model in the same author's lineup, distributed under lodestones/Chroma1-HD — adjacent lineage, not the same weights. Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder) — close cousin, distinct architecture. The deprecated lodestones/Chroma repo's README now opens with "THIS REPO IS DEPRECATED!" and directs users to Chroma1-HD, Chroma1-Base, or Chroma1-Flash — the canonical, currently-maintained V48 distribution is Chroma1-Base.

Noise artifacts with `--fp8_e5m2-unet`

Reported in the Chroma1-Radiance ComfyUI discussion thread: the --fp8_e5m2-unet ComfyUI flag produces noise artifacts on Chroma1-family models. On a 24 GB RTX 4090 you don't need an fp8 cast at all; the canonical BF16 path documented in steps 2 and 3 above is the recommended setup. If you added the flag for some other reason (e.g. you are running another large model in parallel and want to free a few GB), remove it and stay on the unquantized BF16 path. The cited thread does not recommend the alternative --fp8_e4m3fn-unet cast either.

Quality regressions from acceleration LoRAs

Same thread as above: standard acceleration LoRAs "impart unwanted styles and compromises to the image and seem to negatively affect prompt adherence" on the Chroma1 family. Because the BF16 path documented here runs the unquantized weights with no fp8 cast, you can still experiment with acceleration LoRAs on top of it, but first establish what unaccelerated output looks like so you have a quality baseline to compare against.

No Blackwell-specific wheel selection needed

Unlike Blackwell-class GPUs (RTX 50-series, sm_120), the RTX 4090 is Ada Lovelace (sm_89). The default pip install torch wheel already supports sm_89, and FlashAttention-2 has full sm_89 coverage. No cu128-specific wheel pinning or attn_implementation overrides are required.

Out-of-memory at higher resolutions or with batch size > 1

The BF16 path with enable_model_cpu_offload() fits 1024×1024 batch=1 in 24 GB. If you push to 1536×1536 or batch_size > 1 and hit OOM, the HF diffusers Chroma docs note that all Flux memory optimizations apply — add pipe.enable_vae_tiling() and/or pipe.enable_vae_slicing() to drop VAE peak. If that's still not enough, fall back to the Chroma1-Base GGUF Q8_0 (9.74 GB on disk per the silveroxides per-quant-tier table) and use the ComfyUI-GGUF custom-node path that the 16 GB-tier recipes document.
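As a sketch of the first fallback, the helper below switches on VAE tiling and slicing for an already-built pipeline. Note one substitution: it calls the VAE-level methods (`pipe.vae.enable_tiling()` and `pipe.vae.enable_slicing()`, which exist on diffusers' AutoencoderKL) rather than the pipeline-level methods named above, on the assumption that `pipe.vae` is an AutoencoderKL. The helper name is hypothetical.

```python
def apply_vae_memory_savers(pipe, tiling=True, slicing=True):
    """Enable VAE tiling and/or slicing on a diffusers pipeline.

    Returns the names of the options that were switched on.
    """
    enabled = []
    if tiling:
        pipe.vae.enable_tiling()   # decode the image in tiles
        enabled.append("tiling")
    if slicing:
        pipe.vae.enable_slicing()  # decode a batch one image at a time
        enabled.append("slicing")
    return enabled

# Usage with the `pipe` and `prompt` from step 2 (not executed here;
# 1536 is the example size from the paragraph above):
#   apply_vae_memory_savers(pipe)
#   image = pipe(prompt=prompt, height=1536, width=1536,
#                num_inference_steps=40, guidance_scale=3.0).images[0]
```

These options only reduce the VAE decode peak; they do not shrink the transformer's footprint during denoising.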

Did the workflow load the right encoder?

Chroma1-Base uses the standard FLUX T5 XXL encoder (not Qwen3-4B / Gemma / etc.). The HF card recommends t5xxl_fp16.safetensors (the unquantized variant), which on 24 GB you can run as-is. If you see garbled prompts or a CLIP-vs-T5 mismatch error, confirm the workflow points its text-encoder node at the T5 file from step 3, not a CLIP file.
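One way to confirm which files a workflow references without opening the ComfyUI canvas is to scan the JSON for model filenames. The sketch below walks the parsed JSON generically, so it makes no assumption about ComfyUI's node schema. The function name and the suffix list are illustrative.

```python
import json

def referenced_model_files(workflow, suffixes=(".safetensors", ".gguf")):
    """Collect every string in a parsed workflow that looks like a model file."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node.lower().endswith(suffixes):
            found.append(node)

    walk(workflow)
    return sorted(set(found))

# Usage with the workflow file downloaded in step 3 (not executed here):
#   with open("simple_workflow.json", encoding="utf-8") as fh:
#       for name in referenced_model_files(json.load(fh)):
#           print(name)
```

If the output lists a CLIP file where you expect t5xxl_fp16.safetensors, repoint the text-encoder node before queueing.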

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.