Chroma1-Base (V48) on RTX 3060: Uncensored 8.9B FLUX.1-Schnell De-Distillation via FP8-Weight + CPU-Offload T5 in ComfyUI

image · intermediate · 11GB+ VRAM · Jun 14, 2026

This intermediate recipe sets up Chroma V48 on the RTX 3060, needing about 11 GB of VRAM.

prerequisites
  • NVIDIA RTX 3060 (12GB VRAM, Ampere sm_86) — the 12GB GA106 variant, not the 8GB cut-down
  • Python 3.10+
  • ComfyUI installed and updated to a recent release (May 2025 or newer)
  • 32 GB system RAM recommended (the T5 encoder is CPU-offloaded on this card)
  • ~15 GB free disk for the FP8-scaled checkpoint (9.19 GB) + T5 XXL fp8 (4.89 GB) + FLUX VAE (0.33 GB)

What You'll Build

A working ComfyUI setup that runs Chroma1-Base — the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock and explicitly labeled "Chroma1-Base is Chroma-v.48" on the official HF card — on an RTX 3060 12GB (Ampere, sm_86). On a 12 GB card the full FP8-resident path the 16 GB siblings use (a 9.19 GB FP8 checkpoint with the 4.89 GB T5 encoder and FLUX VAE all resident at once, ~14 GB total) does not fit. This recipe leads with the FP8 transformer + CPU-offloaded T5 path: the 9.19 GB scaled-FP8 e4m3fn transformer stays resident on the GPU while ComfyUI runs the T5 text encoder on the CPU and frees it before sampling, dropping the resident footprint to a ~9.5–10.8 GB envelope that fits 12 GB with display headroom.

Hardware data: RTX 3060 (12 GB GDDR6, ~360 GB/s memory bandwidth, Ampere sm_86, PCIe Gen4 x16) · runs the FP8-scaled e4m3fn transformer (9.19 GB resident) with a CPU-offloaded T5 XXL fp8 encoder (4.89 GB) and the FLUX VAE (0.33 GB) · See benchmark data
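The fit argument above is plain addition over the quoted file sizes. A minimal sketch of that arithmetic follows; it counts weight files only, so activations and display overhead (which the recipe's ~9.5–10.8 GB envelope allows for) are deliberately left out, and the function name is illustrative rather than anything from ComfyUI.

```python
# Component sizes in GB, taken from the figures quoted in this recipe.
TRANSFORMER_FP8_GB = 9.19
T5_FP8_GB = 4.89
VAE_GB = 0.33
CARD_VRAM_GB = 12.0


def resident_weights_gb(t5_on_gpu: bool) -> float:
    """Sum the weight files that stay in VRAM during sampling."""
    total = TRANSFORMER_FP8_GB + VAE_GB
    if t5_on_gpu:
        total += T5_FP8_GB
    return round(total, 2)


for t5_on_gpu in (True, False):
    total = resident_weights_gb(t5_on_gpu)
    verdict = "fits" if total < CARD_VRAM_GB else "does not fit"
    label = "T5 resident" if t5_on_gpu else "T5 offloaded"
    print(f"{label}: {total:.2f} GB of weights, {verdict} in {CARD_VRAM_GB:.0f} GB")
```

The offloaded total is the floor of the resident envelope; whatever the sampler needs for activations sits on top of it.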

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships several variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48), Chroma1-Flash (a CFG-baked fast variant), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the HF card. For HD, Flash, or Radiance, follow their own HF cards — install paths differ (Radiance does not use the FLUX VAE at all).

⚠️ The original lodestones/Chroma repo is deprecated. Its README header reads "THIS REPO IS DEPRECATED!" and directs users to "use Chroma1-HD, Chroma1-Base or Chroma1-Flash instead". The deprecated repo still hosts the shared FLUX VAE (ae.safetensors) and the original chroma-unlocked-v48.safetensors weight file, but the canonical V48 distribution is Chroma1-Base.

ℹ️ FP8 on this card is a memory trick, not a speed trick. The RTX 3060 is Ampere (sm_86), which has no FP8 tensor cores — those first shipped on Hopper (sm_90) and consumer Ada (sm_89). The FP8 weight file still loads on the 3060 and keeps its smaller on-disk footprint resident in VRAM (which is exactly why the fit holds on 12 GB), but at compute time the runtime dequantizes the FP8 weights to BF16/FP16 per operation. You get the VRAM savings; you do not get the FP8 throughput boost an Ada or Blackwell card enjoys. See the Architecture note in Troubleshooting.
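To see which side of that line a given GPU falls on, the compute capability can be compared against sm_89. The sketch below assumes every capability at or above (8, 9) has native FP8 matmul, which matches the Ada and Hopper figures quoted above; `has_native_fp8` is an illustrative helper, and the PyTorch query only runs if `torch` is installed and a CUDA device is visible.

```python
def has_native_fp8(capability: tuple[int, int]) -> bool:
    """Assumption: native FP8 tensor-core math starts at sm_89 (Ada)."""
    return capability >= (8, 9)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        mode = "native FP8 compute" if has_native_fp8(cap) else "FP8 storage only"
        print(f"sm_{cap[0]}{cap[1]}: {mode}")
    else:
        print("No CUDA device visible; skipping the local check.")
```

An RTX 3060 reports (8, 6), so the helper returns False for it.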

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (the canonical BF16 single-file checkpoint is 17.8 GB on disk per the Chroma1-Base Files tab, and even the 9.19 GB scaled-FP8 transformer peaks at ~14 GB if the T5 encoder is kept resident alongside it, so on 12 GB the T5 must be CPU-offloaded or a GGUF quant used) | RTX 3060 (12 GB, Ampere sm_86)
RAM | 32 GB system recommended (T5 fp8 ~4.9 GB lives in host RAM when offloaded) |
Storage | ~15 GB (FP8-scaled transformer 9.19 GB + T5 XXL fp8 4.89 GB + FLUX VAE 0.33 GB = 14.41 GB) |
Software | ComfyUI (May 2025 or newer) + ComfyUI-GGUF custom node by city96 (only needed for the GGUF alternative path) |
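Before starting the downloads it can help to confirm the target drive has room for all three files. This is a small sketch using only the standard library; the sizes are the ones listed in the Storage row, the helper names are illustrative, and "GB" is assumed to mean 10^9 bytes.

```python
import shutil

# File sizes in GB as quoted in this recipe.
REQUIRED_GB = {
    "FP8-scaled transformer": 9.19,
    "T5 XXL fp8": 4.89,
    "FLUX VAE": 0.33,
}


def total_required_gb(sizes: dict[str, float]) -> float:
    """Add up the download sizes."""
    return sum(sizes.values())


def has_room(path: str, required_gb: float) -> bool:
    """Compare free space on the filesystem holding `path` with the requirement."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


needed = total_required_gb(REQUIRED_GB)
print(f"Need {needed:.2f} GB; enough free space here: {has_room('.', needed)}")
```

Run it from the directory that will hold ComfyUI/models, since `disk_usage` reports on whichever filesystem contains the path.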

Installation

1. Update ComfyUI

Update to a recent ComfyUI release. The Clybius scaled-FP8 README states it "Requires an up-to-date ComfyUI as of May 1, 2025." The RTX 3060 is Ampere (sm_86) — a mature architecture with stable kernel coverage in every recent PyTorch wheel — so the default pip install torch shipped with ComfyUI already includes the right kernels. No cu128-specific wheel pinning is needed (that is a Blackwell / RTX 50-series requirement only); the standard cu124/cu121 wheels work fine on sm_86.

2. Download the Chroma1-Base (V48) FP8-scaled transformer

Download the standard V48 file from Clybius/Chroma-fp8-scaled — the /v48/ subdirectory carries the FP8-scaled e4m3fn V48 weights, and the repo declares license: apache-2.0 and base_model: lodestones/Chroma (link-back to the canonical V48 lineage):

wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/Clybius/Chroma-fp8-scaled/resolve/main/v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors

This file is 9.19 GB. FP8 is chosen here purely for its smaller VRAM footprint: the 3060's Ampere sm_86 has no FP8 tensor cores, so the weights dequantize to BF16 at compute time (memory savings, no speed win; see the Architecture note in Troubleshooting). The file uses pre-scaled FP8 e4m3fn weights rather than an on-the-fly --fp8_e5m2-unet cast, which avoids the noise artifacts that on-the-fly casting introduces on the Chroma family (see Troubleshooting).

ℹ️ License note for the Clybius repo. The /v48/ FP8-scaled files redistribute the Apache-2.0 Chroma1-Base weights. The Clybius README flags that files in its debug, merges, and hybrid-merges folders are CC BY-NC-SA 4.0 (non-commercial) because they incorporate components from a non-commercial development repo. Stay in /v48/ for the Apache-2.0 lineage; avoid the debug/merges folders if you need commercial use.

3. Download the T5 XXL text encoder and FLUX VAE

Chroma1-Base uses the standard FLUX-ecosystem T5 XXL encoder and the FLUX VAE. On a 12 GB card, use the fp8 T5 variant — it will be CPU-offloaded, but the smaller file also means less host RAM and a faster cold load:

# T5 XXL (fp8 — 4.89 GB; the fp16 variant is 9.79 GB) → clip folder
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

# FLUX VAE (ae.safetensors, 0.33 GB) → vae folder
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

The Chroma1-Base HF card ComfyUI section links the T5 XXL encoder and the FLUX VAE ae.safetensors; the VAE still lives in the deprecated lodestones/Chroma repo, which is where the canonical card's link resolves.
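Once steps 2 and 3 are done, a quick check that every file landed in the right folder and was not cut short can save a confusing load error later. The sketch below rebuilds the same URLs used in the wget commands and flags files that are missing or noticeably smaller than the quoted size; the 5% tolerance and the helper names are assumptions for illustration, not part of ComfyUI.

```python
from pathlib import Path

HF = "https://huggingface.co"

# (repo, path inside repo, ComfyUI models subfolder, quoted size in GB)
DOWNLOADS = [
    ("Clybius/Chroma-fp8-scaled",
     "v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors",
     "diffusion_models", 9.19),
    ("comfyanonymous/flux_text_encoders", "t5xxl_fp8_e4m3fn.safetensors", "clip", 4.89),
    ("lodestones/Chroma", "ae.safetensors", "vae", 0.33),
]


def resolve_url(repo: str, repo_path: str) -> str:
    """Build the direct-download URL in the form used by the wget commands."""
    return f"{HF}/{repo}/resolve/main/{repo_path}"


def missing_or_truncated(models_dir: Path, tolerance: float = 0.05) -> list[str]:
    """List files that are absent or clearly smaller than their quoted size."""
    problems = []
    for _repo, repo_path, subfolder, size_gb in DOWNLOADS:
        local = models_dir / subfolder / Path(repo_path).name
        if not local.is_file():
            problems.append(f"missing: {local}")
        elif local.stat().st_size < size_gb * 1e9 * (1 - tolerance):
            problems.append(f"truncated: {local}")
    return problems


for line in missing_or_truncated(Path("ComfyUI/models")) or ["all files present"]:
    print(line)
```

The size check is only a lower bound, so it catches interrupted downloads but is not a substitute for a checksum.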

4. Load the Chroma ComfyUI workflow

Download the simple Chroma text-to-image workflow JSON from the lodestones/Chroma repo:

# Simple Chroma text-to-image workflow → drag onto the ComfyUI canvas
wget https://huggingface.co/lodestones/Chroma/resolve/main/simple_workflow.json

⚠️ Workflow filename note. The Chroma1-Base card's ComfyUI section links a workflow named ChromaSimpleWorkflow20250507.json, but that exact filename returns 404 (verified live). The canonical file present in the repo is simple_workflow.json (verified live, returns 200). Use the URL above, not the one printed on the card.

Drag the workflow onto the ComfyUI canvas, then:

  1. Point the Load Diffusion Model node at the downloaded chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors, and set its weight_dtype to default (as documented in the Clybius README: "Load the model using Load Diffusion Model in ComfyUI / Set weight_dtype to default").
  2. Confirm the text-encoder node points at t5xxl_fp8_e4m3fn.safetensors from step 3.
  3. Confirm the VAE loader points at ae.safetensors.
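The three checks above can also be done outside the UI by listing which model files the workflow JSON mentions. The sketch below makes no assumption about ComfyUI's workflow schema: it simply walks the parsed JSON and collects any string that ends in a model-file extension. The helper names are illustrative.

```python
import json
from pathlib import Path

MODEL_SUFFIXES = (".safetensors", ".gguf")


def referenced_models(node) -> set[str]:
    """Recursively collect strings in parsed JSON that look like model filenames."""
    found: set[str] = set()
    if isinstance(node, str):
        if node.endswith(MODEL_SUFFIXES):
            found.add(node)
    elif isinstance(node, dict):
        for value in node.values():
            found |= referenced_models(value)
    elif isinstance(node, list):
        for value in node:
            found |= referenced_models(value)
    return found


def load_workflow_models(path: str) -> set[str]:
    """Parse a workflow file and return the model filenames it references."""
    return referenced_models(json.loads(Path(path).read_text(encoding="utf-8")))


if __name__ == "__main__" and Path("simple_workflow.json").is_file():
    for name in sorted(load_workflow_models("simple_workflow.json")):
        print(name)
```

Any filename printed here that does not match what was downloaded in steps 2 and 3 is a node that still needs repointing.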

5. Run ComfyUI so the T5 encoder offloads to CPU

On a 12 GB card, the 9.19 GB FP8 transformer plus a resident 4.89 GB T5 encoder would overflow VRAM. ComfyUI's native smart-memory management runs the text encoder, frees it, then loads the diffusion model — but on a tight 12 GB card you should force the offload explicitly with --lowvram so the T5 conditioning is computed and then evicted to host RAM before the diffusion transformer takes the GPU:

python main.py --lowvram

With --lowvram, ComfyUI streams components off the GPU when memory is tight, keeping the FP8 transformer + VAE resident during sampling while the T5 lives in system RAM.

Alternative: GGUF path (lower VRAM, more granular quant tiers)

If you prefer to drop below the FP8 footprint (e.g. to stack acceleration LoRAs or push resolution), install the ComfyUI-GGUF custom node by city96:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install -r ComfyUI/custom_nodes/ComfyUI-GGUF/requirements.txt

Restart ComfyUI, then pick one quantization from silveroxides/Chroma1-Base-GGUF — file sizes verbatim from the Files tab:

Quant | Size
Q2_K | 3.41 GB
Q3_K_S | 4.29 GB
Q4_K_S | 5.43 GB
Q4_K_M | 5.57 GB
Q5_K_S | 6.51 GB
Q5_K_M | 6.65 GB
Q6_K | 7.65 GB
Q8_0 | 9.74 GB
BF16 | 17.8 GB

Drop the .gguf into ComfyUI/models/diffusion_models/ and swap the Load Diffusion Model node for the Unet Loader (GGUF) node from ComfyUI-GGUF. On a 12 GB card the Q4_K_M (5.57 GB) or Q5_K_M (6.65 GB) tiers leave enough room to keep the T5 fp8 encoder resident instead of offloading it. GGUF quantization is a llama.cpp-style format handled by the loader's own quantized matmul kernels — it is arch-independent and does not depend on FP8 tensor-core hardware, so it runs identically on Ampere as on Ada.
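Choosing a tier comes down to subtracting whatever else must stay resident from the usable VRAM and taking the largest quant that still fits. A rough sketch of that selection, using the sizes from the table and counting weight files only (activation memory is not modelled, so treat the result as an upper bound on what to try):

```python
# (quant name, file size in GB) copied from the table above.
TIERS = [
    ("Q2_K", 3.41), ("Q3_K_S", 4.29), ("Q4_K_S", 5.43), ("Q4_K_M", 5.57),
    ("Q5_K_S", 6.51), ("Q5_K_M", 6.65), ("Q6_K", 7.65), ("Q8_0", 9.74),
    ("BF16", 17.8),
]

T5_FP8_GB = 4.89
VAE_GB = 0.33


def largest_tier_within(budget_gb: float):
    """Return the biggest (name, size) tier whose file fits the budget, or None."""
    fitting = [tier for tier in TIERS if tier[1] <= budget_gb]
    return max(fitting, key=lambda tier: tier[1]) if fitting else None


# Transformer budget when the T5 fp8 encoder and the VAE stay on the GPU.
for usable_gb in (12.0, 11.3):
    budget = usable_gb - T5_FP8_GB - VAE_GB
    print(f"{usable_gb} GB usable -> {largest_tier_within(budget)}")
```

The 11.3 GB case reflects the upper end of what a desktop card with a monitor attached exposes, per the VRAM note in Results; if sampling still runs out of memory, step down one tier.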

Running

Use a 1024×1024 latent for the first run (the resolution Chroma was trained at). The Chroma1-Base diffusers Quickstart on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly (20–40 steps work; the example workflow defaults are a safe start).

Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/

The first generation pays a cold-load cost (T5 → CPU, conditioning computed, then weights → VRAM). Subsequent generations with the same model reuse the loaded transformer.

ℹ️ PCIe Gen4 throughput note. The RTX 3060 uses a PCIe Gen4 x16 link. Because the T5 encoder is CPU-offloaded on this card, the conditioning step and any --lowvram component streaming move data across that Gen4 link, so per-image throughput on the offloaded portion is gated by host-bus bandwidth. The path still fits 12 GB — only throughput is affected, and only on the offloaded stages. (The 3060 shares the same Gen4 x16 link as the 12 GB Ada 4070, so the offload-streaming behavior is comparable on that axis; raw compute and memory bandwidth are lower on the 3060 — see Results.)
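For a sense of scale, the theoretical ceiling of a Gen4 x16 link can be worked out from the PCIe 4.0 signalling rate (16 GT/s per lane with 128b/130b encoding). The sketch below turns that into a best-case transfer time for a component of a given size; real transfers are slower, and the function names are illustrative.

```python
def pcie_peak_gb_per_s(gt_per_s: float = 16.0, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s for a 128b/130b-encoded link."""
    return gt_per_s * lanes * (128 / 130) / 8


def min_transfer_seconds(size_gb: float) -> float:
    """Best-case time to move `size_gb` across a Gen4 x16 link."""
    return size_gb / pcie_peak_gb_per_s()


print(f"Link ceiling: {pcie_peak_gb_per_s():.1f} GB/s")
for name, size_gb in (("T5 XXL fp8", 4.89), ("FP8 transformer", 9.19)):
    print(f"{name}: at least {min_transfer_seconds(size_gb):.2f} s per full transfer")
```

These are floors, not predictions: a card that negotiates fewer lanes, or a host that cannot feed the link from pageable memory at full rate, will take longer.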

Results

  • Speed: Omitted. No RTX 3060-named generation-time data point on Chroma1-Base exists. The RTX 3060 is the lowest-bandwidth 12 GB consumer card (~360 GB/s GDDR6, 3584 CUDA cores per TechPowerUp's GA106 spec page) — far below the 16 GB Ada/Ampere siblings — so a figure scaled from a faster card would overstate both the memory-bound (VAE decode) and compute-bound (diffusion) throughput by far more than the ~10% forward-statement threshold. The CPU-offloaded T5 on this card's Gen4 bus makes any cross-card extrapolation worse still, and Chroma1-Base is un-distilled (it runs the full ~40-step diffusion schedule), so per-step compute throughput dominates total time. Once a community measurement lands via /contribute, the /check/chroma-v48/rtx-3060 endpoint will surface it.
  • VRAM usage: Plan for a ~9.5–10.8 GB resident envelope on the FP8 transformer + CPU-offload-T5 path documented above. The FP8-scaled transformer is 9.19 GB on disk per Clybius/Chroma-fp8-scaled; with the T5 XXL fp8 encoder (4.89 GB per the flux_text_encoders Files tab) offloaded to host RAM, only the transformer, the FLUX VAE (0.33 GB), and per-step activations stay resident — fitting inside the RTX 3060's 12 GB envelope (a desktop 12 GB card with a monitor attached exposes ~10.5–11.3 GB usable). The FP8 weight file keeps its smaller footprint resident on Ampere even though sm_86 has no FP8 compute (the bytes are smaller regardless of how they get used at compute time), which is what makes the fit hold. This figure is derived from the cited on-disk sizes, not a measured runtime peak; once an empirical 3060 number lands, /check/chroma-v48/rtx-3060 will replace it. As an independent same-architecture fit check, a community ComfyUI workflow by gabrielx (published Aug 29, 2025) runs a Chroma V48 Q4_0 GGUF transformer with an FP8 T5 encoder on an "RTX 3070 with 8GB VRAM and 40GB RAM" — and the RTX 3070 is the same Ampere sm_86 generation as the 3060, so this confirms the V48 path runs comfortably below 12 GB on a real Ampere consumer card (the 8 GB card even adds an upscaler/refiner pipeline; the plain text-to-image path on the 3060's 12 GB has more headroom).
  • Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation: it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed; the canonical Quickstart uses 40 steps. Output quality does not depend on GPU architecture: the same FP8 weights with the same seed and settings should produce a visually equivalent image on any card (small floating-point differences between kernels aside); only per-step throughput differs.
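Since the VRAM figure above is derived rather than measured, a reading taken while a generation is running is worth having. One way to get it, sketched here, is to query nvidia-smi for used and total memory in MiB; the parsing helper is illustrative, and the query is skipped quietly on machines without the NVIDIA driver.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Turn a line such as '9876, 12288' into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_gpu_memory_mib() -> tuple[int, int]:
    """Query the first GPU reported by nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.splitlines()[0])


if __name__ == "__main__":
    try:
        used, total = read_gpu_memory_mib()
        print(f"{used} MiB used of {total} MiB")
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        print("nvidia-smi not available; no reading taken.")
```

A single reading is a snapshot; polling this in a loop during sampling and keeping the maximum gives a closer approximation of the peak.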

For the full benchmark data, see /check/chroma-v48/rtx-3060.

Troubleshooting

Out-of-memory loading the model on 12 GB

If the FP8 transformer + T5 path OOMs, the T5 encoder is most likely staying resident on the GPU. Start ComfyUI with --lowvram (step 5) so the T5 conditioning is computed on the CPU and evicted before the diffusion transformer loads. If you still hit OOM at 1024×1024, switch to the GGUF alternative and load a smaller tier — Q4_K_M 5.57 GB or Q4_K_S 5.43 GB per the silveroxides Files tab — which leaves enough room to keep the T5 resident and skip offloading entirely.

Architecture note: FP8 weights load on Ampere, but compute is BF16

The RTX 3060 is Ampere (sm_86), which supports FP16 / BF16 / INT8 / TF32 tensor-core math but not FP8 — native FP8 matmul first shipped on Hopper (sm_90) and consumer Ada (sm_89), per NVIDIA's CUDA C++ Programming Guide on per-compute-capability tensor-core types. The FP8 e4m3fn weight file in step 2 still loads and keeps its compact footprint resident in VRAM — that is the whole reason it fits 12 GB — but at compute time the runtime dequantizes the FP8 weights to BF16/FP16 on the fly. So on this card FP8 is a VRAM escape hatch, not a speed win: you trade a small per-op dequantization overhead for the smaller resident footprint. The drbaph HiDream-O1 FP8 card documents the same behavior verbatim for non-FP8 architectures: "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated. On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." If you would rather not pay the dequantization overhead, the GGUF alternative path uses llama.cpp-style quantized matmul kernels that are arch-independent and run natively on Ampere.
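To make "dequantize on the fly" concrete, here is a pure-Python sketch that decodes single e4m3fn bytes (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities) into ordinary floats and applies a scale. The single per-tensor `scale` is a simplifying assumption for illustration; it is not a description of how ComfyUI or the Clybius file actually stores or applies its scales, and real runtimes do this in GPU kernels rather than in Python.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 e4m3fn value stored in a byte."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN encoding; e4m3fn has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize(raw: bytes, scale: float) -> list[float]:
    """Expand stored FP8 bytes to floats, applying a hypothetical per-tensor scale."""
    return [decode_e4m3fn(b) * scale for b in raw]


stored = bytes([0x38, 0x40, 0xB8, 0x7E])
print(f"{len(stored)} bytes stored ->", dequantize(stored, scale=0.5))
```

Each weight occupies one byte at rest, which is where the VRAM saving comes from; the expansion to a wider type happens only when the value is needed for math.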

Need even more headroom? GGUF-quantize the T5 encoder too

city96's t5-v1_1-xxl-encoder-gguf ships the encoder as GGUF (e.g. Q5_K_M, Q4_K_M), loaded with the CLIPLoader (gguf) node from ComfyUI-GGUF. Pairing a GGUF transformer tier with a GGUF T5 lets you run the whole graph resident on 12 GB without any CPU-offload streaming — useful if the Gen4 bus makes the offloaded path feel slow.

Noise artifacts with --fp8_e5m2-unet

The FP8-scaled file in step 2 uses pre-scaled e4m3fn weights, which is the clean path. Avoid ComfyUI's on-the-fly --fp8_e5m2-unet flag — it produces noise artifacts on Chroma1-family models. Stick to the pre-scaled e4m3fn file (or a GGUF tier from the alternative path), or --fp8_e4m3fn-unet if you need an on-the-fly fp8 cast.

No Blackwell-specific wheel selection needed; FlashAttention-2 works on Ampere

Unlike Blackwell-class GPUs (RTX 50-series, sm_120) which need a cu128-targeted PyTorch wheel, the RTX 3060 is Ampere (sm_86) — the default pip install torch shipped with ComfyUI already includes sm_86 kernels. FlashAttention-2 explicitly lists Ampere GPUs as fully supported alongside Ada and Hopper, with stock pip wheels since 2.x and full BF16 coverage, so no attn_implementation overrides are required — keep the defaults.

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a separate model (retrained from v.48). Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder). The deprecated lodestones/Chroma repo's chroma-unlocked-v48.safetensors is the original V48 weight file, but the canonical, currently-maintained V48 distribution is Chroma1-Base.

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.

common questions
How much VRAM does Chroma V48 need?

About 11 GB — the minimum this recipe targets.

Which GPUs is Chroma V48 tested on?

RTX 3060 (12 GB).

How hard is this setup?

Intermediate — follow the steps above.