
Chroma1-Base (V48) on RTX 5090: Uncensored 8.9B FLUX.1-Schnell De-Distillation via Diffusers BF16

image · intermediate · 24GB+ VRAM · May 24, 2026
prerequisites
  • NVIDIA RTX 5090 (32GB VRAM, Blackwell sm_120) or equivalent 24GB+ consumer card
  • Python 3.10+
  • PyTorch with CUDA 12.8+ support and the `diffusers` library (or ComfyUI updated to a recent release)
  • ~28 GB free disk for the 17.8 GB BF16 checkpoint + T5 XXL fp16 (~9.5 GB) + FLUX VAE (~330 MB)

What You'll Build

A working setup that runs Chroma1-Base on an RTX 5090 32GB (Blackwell, sm_120). Chroma1-Base is the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock; the official HF card states that "Chroma1-Base is Chroma-v.48". With 32 GB of VRAM, the canonical BF16 weights run through the official ChromaPipeline in the diffusers library (or the ComfyUI equivalent) with comfortable headroom, leaving roughly 8 GB to spare for colocating a second model or pushing past the standard 1024×1024 resolution.

Hardware data: RTX 5090 (32 GB GDDR7, ~1792 GB/s memory bandwidth, Blackwell sm_120) · runs at BF16 (17.8 GB single-file checkpoint per silveroxides/Chroma1-Base-GGUF and the canonical HF Files tab) with T5 XXL fp16 and the FLUX VAE · See benchmark data

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships three current variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card. For Chroma1-HD or Chroma1-Radiance, follow their own respective HF cards — the install paths differ (Radiance does not use the FLUX VAE at all).

ℹ️ Why diffusers BF16 and not GGUF on a 5090. On 16 GB-tier cards (RTX 4060 Ti 16GB, RTX 5060 Ti) the BF16 17.8 GB single-file checkpoint cannot stay resident alongside the T5 encoder and FLUX VAE, so those recipes use Chroma1-Base GGUF quants from silveroxides/Chroma1-Base-GGUF in ComfyUI. The RTX 5090's 32 GB makes the GGUF detour unnecessary — the canonical BF16 path documented on the Chroma1-Base HF card and in the HF diffusers Chroma pipeline docs is the recommended setup for this card. The same BF16 path also fits the 24 GB Ada/Ampere siblings (see the /check/chroma-v48/rtx-4090 and /check/chroma-v48/rtx-3090 recipes); the 5090 is the over-provisioned end of the BF16-fits tier, not a tighter one.

ℹ️ No FP8 cast required (and noise artifacts when tried). Blackwell sm_120 has native FP8 (E4M3/E5M2) tensor-core acceleration, so an FP8 path would be a real speedup on this GPU class in principle — but the Chroma1 family has reported FP8 noise artifacts (see Troubleshooting below). On 32 GB you don't need the FP8 cast at all; the canonical BF16 path keeps the V48 weight lineage at full precision and avoids the known quality regression.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM (BF16 weights are 17.8 GB on disk per silveroxides/Chroma1-Base-GGUF and the HF Files tab; the T5 XXL fp16 encoder and FLUX VAE load on top) | RTX 5090 (32 GB) |
| RAM | 16 GB system (32 GB recommended for enable_model_cpu_offload) | |
| Storage | ~28 GB (BF16 checkpoint 17.8 GB + T5 XXL fp16 ~9.5 GB + FLUX VAE ae.safetensors ~330 MB) | |
| Software | Python 3.10+, diffusers, transformers, sentencepiece, accelerate (or ComfyUI for the workflow path) | |
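
As a sanity check on the disk budget, the component sizes quoted above can be summed directly. This is a minimal sketch: the sizes are the approximate figures from this recipe, and the helper names (`total_download_gb`, `has_room`) are hypothetical, not part of any library.

```python
import shutil

# Component sizes in GB, as quoted in this recipe (T5 and VAE are approximate).
COMPONENT_SIZES_GB = {
    "Chroma1-Base BF16 checkpoint": 17.8,
    "T5 XXL fp16 encoder": 9.5,
    "FLUX VAE (ae.safetensors)": 0.33,
}


def total_download_gb(sizes):
    """Sum the per-component sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_room(path, needed_gb):
    """Return True if the filesystem holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


needed = total_download_gb(COMPONENT_SIZES_GB)
print(f"Approximate download size: {needed} GB")
print("Enough free space here:", has_room(".", needed))
```

The sum covers the model files only; leaving a few extra GB for the Python environment and download cache is a reasonable precaution.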

Installation

1. Install the diffusers stack

The Chroma pipeline ships in mainline diffusers. Install the runtime per the Chroma1-Base HF card Quickstart:

pip install transformers diffusers sentencepiece accelerate

You also need PyTorch with CUDA support. The RTX 5090 is Blackwell (sm_120) — a newer GPU architecture that requires CUDA 12.8+ kernels. Use the official CUDA 12.8 wheel index:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

The cu128 wheel ships sm_120 kernels for diffusers' default attention backend (scaled_dot_product_attention, SDPA). The recipe does not require FlashAttention-2 — SDPA is the default and is fully kernel-supported on sm_120 via PyTorch's stock CUDA 12.8 wheel. (FA2 sm_120 wheel coverage is tracked at Dao-AILab/flash-attention#2168; diffusers does not depend on FA2 for the Chroma pipeline.)
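
Before downloading any weights, it can help to confirm that the installed PyTorch wheel actually carries kernels for your GPU. The sketch below compares the device's compute capability against the architectures the wheel was compiled for, using `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. The helper names are hypothetical, and the exact-match check is a simplification (it ignores PTX forward compatibility).

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_has_kernels(capability, arch_list):
    """True if the wheel lists a native kernel target for this capability."""
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        if not torch.cuda.is_available():
            print("PyTorch cannot see a CUDA device.")
        else:
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("Device:", torch.cuda.get_device_name(0), arch_tag(cap))
            print("Wheel targets:", archs)
            print("Native kernels present:", wheel_has_kernels(cap, archs))
```

On an RTX 5090 the device tag should read `sm_120`; if that tag is absent from the wheel targets, reinstalling from the cu128 index above is the first thing to try.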

2. Run the canonical diffusers Quickstart

The Chroma1-Base HF card ships a complete Quickstart that loads the BF16 weights directly from the Hub. This is the path the 32 GB of an RTX 5090 unlocks comfortably:

import torch
from diffusers import ChromaPipeline

pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-Base", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = [
    "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done."
]
negative_prompt = ["low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"]

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    generator=torch.Generator("cpu").manual_seed(433),
    num_inference_steps=40,
    guidance_scale=3.0,
    num_images_per_prompt=1,
).images[0]
image.save("chroma.png")

The snippet is verbatim from the Chroma1-Base HF card Quickstart. pipe.enable_model_cpu_offload() keeps the transformer in VRAM only when active and offloads the T5 encoder back to CPU after prompt encoding. The HF diffusers Chroma docs state that "Chroma can use all the same optimizations as Flux," so the standard Flux memory tricks (enable_vae_slicing(), enable_vae_tiling()) are also available if you want to lower peak VRAM further or push past 1024×1024. With 32 GB on the 5090 you can also drop the enable_model_cpu_offload() call entirely and keep the T5 resident for faster repeated generations; see "Spending the headroom" below.

3. (Alternative) Use the ComfyUI workflow path

If you prefer ComfyUI, the Chroma1-Base HF card lists the three assets you need and where to put them. On a 32 GB card you can use the unquantized .safetensors directly — no GGUF or FP8 cast required:

# Chroma1-Base BF16 single-file checkpoint (17.8 GB) → diffusion_models folder
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/lodestones/Chroma1-Base/resolve/main/Chroma1-Base.safetensors

# T5 XXL fp16 text encoder → text_encoders folder
#   (HF card recommends the fp16 variant; on 32 GB you have ample headroom for it)
wget -P ComfyUI/models/text_encoders/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors

# FLUX VAE (from the Chroma repo's ae.safetensors) → vae folder
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

# Chroma ComfyUI simple workflow JSON (from the lodestones/Chroma repo)
wget https://huggingface.co/lodestones/Chroma/resolve/main/simple_workflow.json

The checkpoint, T5 encoder, and VAE URLs are adapted from the Chroma1-Base HF card "ComfyUI Setup" section. The checkpoint link is updated to the canonical Chroma1-Base.safetensors because the README of the originally linked lodestones/Chroma repo now opens with "THIS REPO IS DEPRECATED!" and directs users to Chroma1-Base / Chroma1-HD / Chroma1-Flash for the current weights. The workflow JSON itself still lives under lodestones/Chroma; drag simple_workflow.json onto the ComfyUI canvas and verify that the loader nodes point at the three files above.
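
A quick way to confirm the three downloads landed where the loader nodes expect them is to check the paths directly. This is a small sketch that assumes ComfyUI lives in a `ComfyUI/` directory relative to where you run it, as in the wget commands above; `missing_assets` is a hypothetical helper name.

```python
from pathlib import Path

# Relative paths taken from the wget targets in this step.
EXPECTED_ASSETS = [
    "models/diffusion_models/Chroma1-Base.safetensors",
    "models/text_encoders/t5xxl_fp16.safetensors",
    "models/vae/ae.safetensors",
]


def missing_assets(comfy_root):
    """Return the expected asset paths that are not present under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_ASSETS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_assets("ComfyUI")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  ", rel)
    else:
        print("All three assets are in place.")
```

The check only confirms that the files exist; an interrupted download can still leave a truncated file, so comparing sizes against the HF Files tab is a sensible follow-up.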

Running

For the diffusers path, save the snippet in step 2 as run.py and:

python run.py

The first run pays a cold-load cost (weights → VRAM, T5 encoder → VRAM, then offloaded back to CPU after prompt encoding). Subsequent generations within the same Python process reuse the already-loaded pipeline; rerunning python run.py reloads the weights from disk each time. The default 1024×1024 is the resolution Chroma was trained at and the safest first-run target; the ChromaPipeline API docs document enable_vae_tiling() for pushing past it without OOM.

For the ComfyUI path: open simple_workflow.json, press Queue Prompt. PNGs land in ComfyUI/output/.

Spending the headroom

The 5090's 32 GB envelope leaves roughly 8 GB above what the BF16 Chroma1-Base + T5 XXL fp16 + FLUX VAE pipeline needs with enable_model_cpu_offload(). Two concrete ways to use that headroom:

  • Drop the encoder offload for faster repeated generations. The canonical Quickstart calls pipe.enable_model_cpu_offload() to keep the T5 encoder off-GPU outside the prompt-encoding window — that's the 24 GB-tier safety net. On 32 GB you can replace it with pipe.to("cuda") and keep the T5 resident; every subsequent prompt skips the CPU↔GPU encoder shuffle. The trade-off is ~9.5 GB of permanent T5 residency in exchange for one-shot prompt encoding on each generation.
  • Push past 1024×1024 without VAE tiling. With the encoder resident, the weights alone occupy about 27.6 GB (17.8 GB transformer + ~9.5 GB T5 + ~0.33 GB VAE), which leaves roughly 4 GB for activations and larger latents. The HF diffusers Chroma docs note that all Flux memory optimizations apply (enable_vae_slicing(), enable_vae_tiling()) if you want to go further. On 32 GB those become "if I want to push to 2048×2048" knobs rather than "if I want this to run at all" knobs.

For comparison, the 4090 sibling recipe and 3090 sibling recipe both run the same BF16 path but need the offload — they don't have the slack to drop it.
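
The choice between keeping everything resident and using the offload can be sketched as a simple budget check. The weight sizes are the figures quoted in this recipe; the 3 GB activation margin is an assumption picked for illustration, not a measured value, and `choose_placement` / `load_pipeline` are hypothetical helper names. The pipeline calls themselves (`from_pretrained`, `to("cuda")`, `enable_model_cpu_offload()`) are the ones used in the Quickstart and in this section.

```python
# Weight sizes in GB as quoted in this recipe: transformer + T5 XXL fp16 + VAE.
WEIGHTS_GB = 17.8 + 9.5 + 0.33
# Assumed margin for activations and latents at 1024x1024 (illustrative only).
ACTIVATION_MARGIN_GB = 3.0


def choose_placement(total_vram_gb, margin_gb=ACTIVATION_MARGIN_GB):
    """Return 'resident' if all weights plus the margin fit, else 'offload'."""
    return "resident" if total_vram_gb >= WEIGHTS_GB + margin_gb else "offload"


def load_pipeline(placement):
    """Load Chroma1-Base and apply the chosen placement strategy."""
    import torch
    from diffusers import ChromaPipeline

    pipe = ChromaPipeline.from_pretrained(
        "lodestones/Chroma1-Base", torch_dtype=torch.bfloat16
    )
    if placement == "resident":
        pipe.to("cuda")  # keep transformer, T5 and VAE on the GPU
    else:
        pipe.enable_model_cpu_offload()  # move components in only when needed
    return pipe


print("32 GB card:", choose_placement(32))
print("24 GB card:", choose_placement(24))
```

Calling `load_pipeline(choose_placement(32))` would then give the resident setup described in the first bullet; treat the margin as a starting guess and adjust it after measuring real peak usage.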

Results

  • Speed: Omitted. No first-party generation-time data point on Chroma1-Base specifically is published for the RTX 5090. The only first-party speed thread in the Chroma family (Chroma1-HD discussion #25) measures Chroma1-HD (a separate variant — see the variant admonition above) on an RTX 5090 at 1152×1152 with 10 LoRAs in a community reply by user NeonScreams (not a Lodestone Rock team member). That thread is on a 5090, but it is not Chroma1-Base, not the 1024×1024 baseline configuration this recipe pins, and not canonical-org guidance — so it cannot be cited as a Chroma1-Base × RTX 5090 measurement here. Once community measurements on the matching variant + configuration land via /contribute, the /check/chroma-v48/rtx-5090 endpoint will surface them.
  • VRAM usage: Plan for ~24 GB peak with enable_model_cpu_offload() — the same envelope as the 4090 and 3090 siblings. The BF16 transformer checkpoint is 17.8 GB on disk per both the HF Files tab (single Chroma1-Base.safetensors, 17.8 GB) and the silveroxides/Chroma1-Base-GGUF per-quant-tier table (BF16 row). The T5 XXL fp16 encoder (~9.5 GB) and FLUX VAE (~330 MB) load on top, plus per-step activations and latents at 1024×1024. enable_model_cpu_offload() from the canonical Quickstart keeps the encoder off-GPU outside the prompt-encoding window. On the 5090 the 32 GB envelope leaves roughly 8 GB of headroom over the 24 GB-tier BF16 ceiling — see "Spending the headroom" above for two ways to use it. The Chroma1-Base HF card does not publish a measured runtime peak for any specific card — once an empirical 5090 measurement lands, /check/chroma-v48/rtx-5090 will replace this derived envelope.
  • Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation. It restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed; the canonical Quickstart uses num_inference_steps=40 and guidance_scale=3.0 as the starting point. Output quality does not depend on GPU architecture: a 3090, 4090, and 5090 running the same BF16 weights with the same seed start from the same CPU-generated noise and should produce visually equivalent images. Bit-identical output is not guaranteed, because different architectures can select different kernels and floating-point accumulation orders. What differs materially is the per-step throughput.

For the full benchmark data, see /check/chroma-v48/rtx-5090.
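
Since no measured peak or generation time is published for this pairing, one way to collect your own data point is to wrap a generation call with PyTorch's peak-memory counters. This is a sketch under assumptions: `measure` and `bytes_to_gib` are hypothetical helper names, and `torch.cuda.max_memory_allocated()` reports memory held by PyTorch tensors, which is typically lower than what nvidia-smi shows for the whole process.

```python
import time


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 1024**3


def measure(fn):
    """Run fn() and return (result, seconds, peak_gib or None without CUDA)."""
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    result = fn()
    if cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    peak = bytes_to_gib(torch.cuda.max_memory_allocated()) if cuda else None
    return result, elapsed, peak


# Stand-in workload; replace the lambda with a call such as
# lambda: pipe(prompt=prompt, num_inference_steps=40, guidance_scale=3.0)
_, seconds, peak_gib = measure(lambda: sum(range(1000)))
print(f"elapsed: {seconds:.4f} s, peak: {peak_gib}")
```

Run the measurement on a second generation rather than the first, so that the cold-load cost does not dominate the timing.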

Troubleshooting

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a separate model in the same author's lineup, distributed under lodestones/Chroma1-HD — adjacent lineage, not the same weights. Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder) — close cousin, distinct architecture. The deprecated lodestones/Chroma repo's README now opens with "THIS REPO IS DEPRECATED!" and directs users to Chroma1-HD, Chroma1-Base, or Chroma1-Flash — the canonical, currently-maintained V48 distribution is Chroma1-Base.

Noise artifacts with --fp8_e5m2-unet (and why not to chase FP8 on the 5090)

Reported on the Chroma family in the Chroma1-Radiance ComfyUI discussion thread: the --fp8_e5m2-unet ComfyUI flag produces noise artifacts on Chroma1-family models. On the RTX 5090 you have 32 GB and the BF16 path fits comfortably, so there is no VRAM reason to cast to FP8 in the first place. Blackwell sm_120 does have native FP8 tensor-core acceleration (unlike Ampere, which has no FP8 tensor cores and has to upcast for compute), so in principle an FP8 cast would be a real speedup on this card. The cited noise artifacts, however, are a Chroma1-family quality issue, not a hardware capability gap. Stick with the canonical BF16 path for now. If a clean Chroma1-Base FP8 redistribution lands later and shows that the artifact issue is resolved, it becomes a candidate speed-up path.

Quality regressions from acceleration LoRAs

Same thread as above: standard acceleration LoRAs "impart unwanted styles and compromises to the image and seem to negatively affect prompt adherence" on the Chroma1 family. The BF16 path documented here avoids any FP8 weight cast and runs the unquantized weights, so acceleration LoRAs can be experimented with on top once you have a quality baseline — but don't reach for them before establishing what unaccelerated output looks like.

CUDA 12.8 wheel needed for sm_120 (Blackwell)

Unlike Ada Lovelace (RTX 4090, sm_89) and Ampere (RTX 3090, sm_86), where the default pip install torch already includes the relevant kernels, Blackwell sm_120 needs the CUDA 12.8 PyTorch wheel: pip install torch --index-url https://download.pytorch.org/whl/cu128. If you skip the index URL and grab the default wheel, you may hit either a "no kernel image is available for execution" error or a silent fall-through to a slow generic path. The diffusers pipeline does not require FlashAttention-2: its default attention backend is SDPA, which is fully kernel-supported on sm_120 in the cu128 wheel. (FA2 sm_120 wheel coverage is still in progress at Dao-AILab/flash-attention#2168; you can ignore that issue tracker for this recipe because you don't need FA2.)

Out-of-memory at higher resolutions or with batch size > 1

The BF16 path with enable_model_cpu_offload() fits 1024×1024 batch=1 in ~24 GB; on the 5090 you have ~8 GB of additional headroom for larger latents or batches. If you push to a regime that still hits OOM (e.g. 2048×2048 batch_size > 1), the HF diffusers Chroma docs note that all Flux memory optimizations apply — add pipe.enable_vae_tiling() and/or pipe.enable_vae_slicing() to drop VAE peak. If that's still not enough, fall back to the Chroma1-Base GGUF Q8_0 (9.74 GB on disk per the silveroxides per-quant-tier table) and use the ComfyUI-GGUF custom-node path that the 16 GB-tier recipes document.
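
The escalation described above (try as-is, then add VAE slicing, then VAE tiling) can be sketched as a retry loop. This is illustrative only: `generate_with_fallback` and `is_oom` are hypothetical helpers, and the OOM check relies on PyTorch's out-of-memory error being a RuntimeError whose message contains "out of memory". The two method names are the ones this recipe cites from the diffusers Chroma docs.

```python
def is_oom(exc):
    """Heuristic: treat a RuntimeError mentioning 'out of memory' as a CUDA OOM."""
    return "out of memory" in str(exc).lower()


def generate_with_fallback(pipe, **kwargs):
    """Call pipe(**kwargs); on OOM, enable one more memory saver and retry.

    Returns (output, applied), where applied lists the savers that were enabled.
    """
    remaining = ["enable_vae_slicing", "enable_vae_tiling"]
    applied = []
    while True:
        try:
            return pipe(**kwargs), applied
        except RuntimeError as exc:
            if not is_oom(exc) or not remaining:
                raise  # unrelated error, or nothing left to try
            name = remaining.pop(0)
            getattr(pipe, name)()  # e.g. pipe.enable_vae_slicing()
            applied.append(name)
```

A call would look like `output, applied = generate_with_fallback(pipe, prompt=prompt, height=2048, width=2048)`. If both savers are applied and the call still fails, the GGUF fallback above is the next step.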

Did the workflow load the right encoder?

Chroma1-Base uses the standard FLUX T5 XXL encoder (not Qwen3-4B / Gemma / etc.). The HF card recommends t5xxl_fp16.safetensors (the unquantized variant), which you can run as-is on 32 GB, even with the encoder permanently resident. If you see images that ignore the prompt or a CLIP-vs-T5 mismatch error, confirm that the workflow points its text-encoder node at the T5 file from step 3, not a CLIP file.

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.