
Chroma1-Base (V48) on RTX 4060 Ti 16GB: Uncensored 8.9B FLUX.1-Schnell De-Distillation via GGUF in ComfyUI

image · intermediate · 16GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4060 Ti 16GB (Ada sm_89) or equivalent 16GB consumer card
  • Python 3.10+
  • ComfyUI installed and updated to a recent release
  • ~15 GB free disk for the Q8_0 checkpoint (9.74 GB) + T5 XXL fp8 (~4.5 GB) + FLUX VAE (~0.33 GB)

What You'll Build

A working ComfyUI setup that runs Chroma1-Base on an RTX 4060 Ti 16GB (Ada Lovelace, sm_89). Chroma1-Base is the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock; the official HF card states that "Chroma1-Base is Chroma-v.48". The GGUF redistribution by silveroxides fits comfortably on a 16GB card while keeping the V48 weight lineage intact.

Hardware data: RTX 4060 Ti 16GB (16 GB VRAM, 288 GB/s, Ada sm_89) · runs at Q8_0 GGUF (9.74 GB on disk) with the FLUX VAE and T5 XXL fp8 encoder · See benchmark data

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships three current variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card. For Chroma1-HD or Chroma1-Radiance, follow their own respective HF cards — install paths differ.

⚠️ The original lodestones/Chroma repo is deprecated. Its README now opens with "THIS REPO IS DEPRECATED!" and directs users to Chroma1-HD, Chroma1-Base, or Chroma1-Flash instead. Use Chroma1-Base for V48.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 16 GB VRAM (BF16 weights are 17.8 GB on disk per silveroxides/Chroma1-Base-GGUF, so Q8_0 or smaller is required to keep weights + T5 + VAE + activations resident on 16 GB) | RTX 4060 Ti 16GB (16 GB) |
| RAM | 16 GB system | |
| Storage | ~15 GB (Q8_0 weights 9.74 GB + T5 XXL fp8 ~4.5 GB + FLUX VAE ae.safetensors ~0.33 GB) | |
| Software | ComfyUI + ComfyUI-GGUF custom node by city96 | |

Installation

1. Update ComfyUI

Update to a recent ComfyUI release. The standard ComfyUI Chroma example workflow at comfyanonymous.github.io/ComfyUI_examples/chroma drives the safetensors path; ComfyUI-GGUF substitutes a GGUF Unet loader for the diffusion model node.

2. Install the GGUF custom node

From your ComfyUI/custom_nodes directory:

git clone https://github.com/city96/ComfyUI-GGUF
cd ComfyUI-GGUF
pip install -r requirements.txt

Restart ComfyUI after installation. ComfyUI-GGUF supports the FLUX architecture family that Chroma1-Base is based on, per the project README, and also ships GGUF T5 XXL loader support.

3. Download the Chroma1-Base (V48) GGUF weights

Pick one quantization from the silveroxides/Chroma1-Base-GGUF repository — file sizes verbatim from the model card:

| Quant | Size |
| --- | --- |
| Q2_K | 3.41 GB |
| Q3_K_S | 4.29 GB |
| Q4_0 / Q4_K_S | 5.43 GB |
| Q4_K_M | 5.57 GB |
| Q4_1 | 5.97 GB |
| Q5_0 / Q5_K_S | 6.51 GB |
| Q5_K_M | 6.65 GB |
| Q5_1 | 7.05 GB |
| Q6_K | 7.65 GB |
| Q8_0 | 9.74 GB |
| BF16 | 17.8 GB |

Recommendation for the 4060 Ti 16GB: Q8_0 (9.74 GB) for the highest in-family quality that still leaves comfortable headroom for the T5 XXL text encoder, FLUX VAE, and intermediate activations. Drop to Q4_K_M (5.57 GB) if you plan to stack acceleration LoRAs or push past 1024×1024.

Drop the downloaded .gguf into ComfyUI/models/diffusion_models/.
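The recommendation above amounts to picking the largest file that still leaves room for everything else. A minimal Python sketch of that rule, using the sizes from the table; the function name and the `reserved_gb` allowance are illustrative assumptions, not values from the model card, and on-disk size is only a rough proxy for VRAM footprint:

```python
from typing import Optional

# File sizes in GB, copied from the quantization table in this step.
QUANT_SIZES_GB = {
    "Q2_K": 3.41,
    "Q3_K_S": 4.29,
    "Q4_0": 5.43,
    "Q4_K_S": 5.43,
    "Q4_K_M": 5.57,
    "Q4_1": 5.97,
    "Q5_0": 6.51,
    "Q5_K_S": 6.51,
    "Q5_K_M": 6.65,
    "Q5_1": 7.05,
    "Q6_K": 7.65,
    "Q8_0": 9.74,
    "BF16": 17.8,
}


def largest_quant_that_fits(vram_gb: float, reserved_gb: float) -> Optional[str]:
    """Return the largest quant whose file fits in vram_gb - reserved_gb.

    reserved_gb is an assumed allowance for everything that is not model
    weights (text encoder, VAE, activations, LoRAs); it is not a measurement.
    """
    budget = vram_gb - reserved_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # Pick the entry with the largest file size among those that fit.
    return max(fitting, key=fitting.get)
```

With 6 GB reserved on a 16 GB card, `largest_quant_that_fits(16.0, 6.0)` selects Q8_0; raising the reservation to 10 GB selects Q4_1.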

4. Download the T5 XXL text encoder and FLUX VAE

The Chroma1-Base HF card requires the FLUX-ecosystem T5 XXL encoder and the FLUX VAE:

# T5 XXL — use fp8 on a 16 GB card (the fp16 variant doubles the encoder footprint)
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

# FLUX VAE (ae.safetensors from the FLUX.1 release)
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

URLs verbatim from the lodestones/Chroma README, which the Chroma1-Base card defers to for FLUX-shared assets. The fp16 T5 variant at t5xxl_fp16.safetensors (same repo) is also supported if you have spare VRAM.
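If wget is not available, the same two downloads can be scripted with the Python standard library. This is a sketch under assumptions: the helper names are hypothetical, the folder layout mirrors the wget commands above, and an interrupted download leaves a partial file that this code would then treat as already present, so delete it before retrying.

```python
import urllib.request
from pathlib import Path


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL in the same form as the wget commands."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


# (repo, file, ComfyUI subfolder) triples taken from the wget commands.
ASSETS = [
    ("comfyanonymous/flux_text_encoders", "t5xxl_fp8_e4m3fn.safetensors", "models/clip"),
    ("lodestones/Chroma", "ae.safetensors", "models/vae"),
]


def download_assets(comfy_root: str) -> list:
    """Download each asset into its ComfyUI folder, skipping files that exist."""
    saved = []
    for repo_id, filename, subdir in ASSETS:
        target = Path(comfy_root) / subdir / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            urllib.request.urlretrieve(hf_resolve_url(repo_id, filename), str(target))
        saved.append(target)
    return saved
```

Calling `download_assets("ComfyUI")` from the directory that contains your ComfyUI checkout places both files where the workflow loaders look for them.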

5. Load the Chroma workflow

The Chroma1-HD T2I workflow JSON ships at ComfyUI_Chroma1-HD_T2I-workflow.json in both the lodestones/Chroma and lodestones/Chroma1-HD repos. Download it, drag it onto the ComfyUI canvas, then:

  1. Swap the default Load Diffusion Model node for the Unet Loader (GGUF) node from ComfyUI-GGUF, pointing it at your downloaded Chroma1-Base .gguf.
  2. Confirm the workflow's T5 loader points at t5xxl_fp8_e4m3fn.safetensors from step 4.
  3. Confirm the VAE loader points at ae.safetensors.

The Chroma1-HD workflow JSON is the canonical ComfyUI workflow for the V48 lineage — the difference between Chroma1-Base and Chroma1-HD is the weights file, not the workflow topology.
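Before queuing the first prompt, it can help to confirm that the three files from steps 3 and 4 are where the loader nodes expect them. A small pre-flight sketch; the function name is hypothetical, and `gguf_name` must be whatever filename you actually downloaded (this recipe does not pin one):

```python
from pathlib import Path


def missing_model_files(comfy_root: str, gguf_name: str) -> list:
    """Return the expected model paths that do not exist under comfy_root."""
    root = Path(comfy_root)
    expected = [
        root / "models" / "diffusion_models" / gguf_name,           # step 3
        root / "models" / "clip" / "t5xxl_fp8_e4m3fn.safetensors",  # step 4
        root / "models" / "vae" / "ae.safetensors",                 # step 4
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means all three files are in place; any returned path points at the download or copy step that needs repeating.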

Running

Use a 1024×1024 latent for the first run. The Chroma1-Base diffusers snippet on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly (20–40 steps work; the example workflow defaults are a safe start).

Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/

The first generation pays a cold-load cost (weights → VRAM, text encoder → VRAM). Subsequent generations with the same model reuse the loaded weights.
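For scripted runs, the step count and guidance values mentioned above can be written into a workflow before it is queued. The sketch below assumes the workflow was exported in ComfyUI's API format and that sampling happens in a `KSampler` node; the shipped Chroma workflow may use different sampler nodes, in which case the class name and input keys need adjusting. Mapping the diffusers `guidance_scale` onto the sampler's `cfg` input is also an assumption.

```python
import copy
import json


def apply_sampler_settings(workflow: dict, steps: int = 40, cfg: float = 3.0) -> dict:
    """Return a copy of an API-format workflow with steps/cfg set on KSampler nodes."""
    patched = copy.deepcopy(workflow)  # leave the caller's workflow untouched
    for node in patched.values():
        if isinstance(node, dict) and node.get("class_type") == "KSampler":
            node["inputs"]["steps"] = steps
            node["inputs"]["cfg"] = cfg
    return patched


def build_prompt_payload(workflow: dict) -> bytes:
    """Encode the JSON body for ComfyUI's POST /prompt endpoint."""
    return json.dumps({"prompt": workflow}).encode("utf-8")
```

The resulting bytes can be sent as the body of a POST request to /prompt on a running ComfyUI instance (http://127.0.0.1:8188 by default for a local install), which is the scripted equivalent of pressing Queue Prompt.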

ℹ️ No Blackwell-specific wheel selection needed. The RTX 4060 Ti is Ada Lovelace (sm_89), not Blackwell (RTX 50-series, sm_120): the default PyTorch wheel installed alongside ComfyUI already includes sm_89 kernels, and FlashAttention-2 has full sm_89 coverage. No cu128-specific wheel pinning or attn_implementation overrides are required.

Results

  • Speed: Omitted. No first-party generation-time data point on Chroma1-Base specifically is published for a card in the 16 GB Ada class. The only first-party speed thread in the family (Chroma1-HD discussion #25) measures Chroma1-HD (not Base) on an RTX 5090 at 1152×1152, 40 steps, 10 LoRAs — different variant, different card, different setup — so it is not quotable for this recipe. Once community measurements land via /contribute, the /check/ endpoint will surface them.
  • VRAM usage: Plan for ≥ 16 GB. The on-disk BF16 size is 17.8 GB per silveroxides/Chroma1-Base-GGUF, so unquantized BF16 already overflows 16 GB before the text encoder and VAE load — the Q8_0 GGUF (9.74 GB on disk) is the highest-quality path that comfortably fits the 4060 Ti 16GB with headroom for T5 fp8 + ae.safetensors + activations. A close-cousin family signal: in the Chroma1-Radiance ComfyUI discussion thread, a 12 GB RTX 5070 user reports "98–99% VRAM usage" with quality degradation, and the thread's general advice is "I'm not sure there's a viable way to run this on anything less than 16GB VRAM." That comment is about Chroma1-Radiance / Chroma1-HD, not Chroma1-Base — treat as adjacent evidence, not a Chroma1-Base citation. Once a measured number for Chroma1-Base on a 16 GB card lands, /check/ will replace this envelope.
  • Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed.

For the full benchmark data, see /check/chroma-v48/rtx-4060-ti-16gb.

Troubleshooting

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is explicitly "retrained from v.48" as a finetune-ready base — adjacent lineage, not the same weights. Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder) — close cousin, distinct architecture. The deprecated lodestones/Chroma repo's chroma-unlocked-v48-detail-calibrated.safetensors is the original V48 weight file, but the canonical, currently-maintained V48 distribution is Chroma1-Base.

Noise artifacts with --fp8_e5m2-unet

Reported on the family in the Chroma1-Radiance ComfyUI thread: the --fp8_e5m2-unet ComfyUI flag produces noise artifacts on Chroma1-family models. Stick to the default loader (Q8_0 GGUF as recommended above), or --fp8_e4m3fn-unet if you need an fp8 path that isn't GGUF.

Quality regressions from acceleration LoRAs

Same thread: standard acceleration LoRAs "impart unwanted styles and compromises to the image and seem to negatively affect prompt adherence" on the Chroma1 family. Establish a quality baseline with the plain Q8_0 GGUF from step 3 first (it also avoids the fp8 weight cast described above), then add acceleration LoRAs one at a time so any regression can be traced to a specific LoRA.

Out-of-memory at high resolution

Q8_0 (9.74 GB) leaves roughly 6.3 GB on a 16 GB card for the T5 XXL fp8 encoder (~4.5 GB), the FLUX VAE (~330 MB), and intermediate activations, so only about 1.4 GB remains for activations if all three stay resident at once. If 1280×1280 or larger pushes the card to OOM, drop to Q5_K_M (6.65 GB) or Q4_K_M (5.57 GB). Quality degrades gracefully across the GGUF tiers, and the quant table in step 3 lists the size of each option.
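The headroom figures above can be turned into a quick budget. This is back-of-envelope arithmetic, not a measurement: it treats on-disk size as VRAM footprint, ignores memory used by the desktop and by PyTorch itself, and the `encoder_resident` switch is an assumption about whether the text encoder stays in VRAM during sampling.

```python
CARD_VRAM_GB = 16.0
T5_FP8_GB = 4.5   # approximate figure from the paragraph above
VAE_GB = 0.33     # ~330 MB, from the paragraph above


def activation_headroom_gb(weights_gb: float, encoder_resident: bool = True) -> float:
    """Estimate VRAM left for activations after weights, VAE, and (optionally) T5."""
    used = weights_gb + VAE_GB
    if encoder_resident:
        used += T5_FP8_GB
    return round(CARD_VRAM_GB - used, 2)
```

With everything resident, Q8_0 (9.74 GB) leaves about 1.43 GB, Q5_K_M (6.65 GB) about 4.52 GB, and Q4_K_M (5.57 GB) about 5.6 GB, which is why stepping down a tier is the first thing to try at higher resolutions.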

Did the workflow load the right encoder?

Chroma1-Base uses the standard FLUX T5 XXL encoder (not Qwen3-4B, Gemma, or similar), so both the ComfyUI-GGUF CLIPLoader (GGUF) node and the safetensors t5xxl_fp8_e4m3fn.safetensors path work. If the output looks garbled, ignores the prompt, or you see a CLIP-vs-T5 mismatch error, confirm that the workflow's text-encoder node points at the T5 file from step 4, not a CLIP file.
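One way to check which kind of encoder a .safetensors file holds is to read its header, which lists tensor names without loading any weights. The sketch below relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the name prefixes used to tell T5 from CLIP are assumptions based on common checkpoint naming and may not match every file, and the approach does not apply to .gguf encoders.

```python
import json
import struct


def read_safetensors_keys(path: str) -> list:
    """Return tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    return [key for key in header if key != "__metadata__"]


def guess_encoder_family(keys: list) -> str:
    """Heuristic classification by tensor-name prefix (assumed naming)."""
    if any(key.startswith("encoder.block.") for key in keys):
        return "t5"
    if any("text_model." in key for key in keys):
        return "clip"
    return "unknown"
```

Running `guess_encoder_family(read_safetensors_keys(path))` on the file your text-encoder node points at should report "t5" for the step 4 download; "clip" or "unknown" suggests the node is loading a different file.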

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.