
Chroma1-Base (V48) on RTX 4070: Uncensored 8.9B FLUX.1-Schnell De-Distillation via FP8 + CPU-Offload T5 in ComfyUI

image · intermediate · 11GB+ VRAM · Jun 9, 2026
prerequisites
  • NVIDIA RTX 4070 (12GB VRAM, Ada sm_89) or equivalent 12GB consumer card
  • Python 3.10+
  • ComfyUI installed and updated to a recent release (May 2025 or newer)
  • 32 GB system RAM recommended (the T5 encoder is CPU-offloaded on this card)
  • ~15 GB free disk for the FP8-scaled checkpoint (9.19 GB) + T5 XXL fp8 (4.89 GB) + FLUX VAE (0.33 GB)

What You'll Build

A working ComfyUI setup that runs Chroma1-Base on an RTX 4070 12GB (Ada Lovelace, sm_89). Chroma1-Base is the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock, and the official HF card labels it explicitly: "Chroma1-Base is Chroma-v.48". The full FP8-resident path used on 16 GB cards keeps the 9.19 GB FP8 checkpoint, the 4.89 GB T5 encoder, and the FLUX VAE on the GPU at once (~14 GB total), which does not fit in 12 GB. This recipe therefore leads with the FP8 transformer + CPU-offloaded T5 path: the 9.19 GB scaled-FP8 e4m3fn transformer stays resident on the GPU while ComfyUI runs the T5 text encoder on the CPU and frees it before sampling, dropping the resident footprint to a ~9.5–10.8 GB envelope that fits 12 GB with display headroom.

Hardware data: RTX 4070 (12 GB GDDR6X, ~504 GB/s memory bandwidth, Ada sm_89, native FP8 tensor cores, PCIe Gen4 x16) · runs at FP8-scaled e4m3fn (9.19 GB transformer resident) with a CPU-offloaded T5 XXL fp8 encoder (4.89 GB) and the FLUX VAE (0.33 GB) · See benchmark data
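The fit argument above is plain addition over the quoted on-disk sizes. A minimal sketch of that arithmetic follows; it assumes on-disk size is a fair stand-in for resident weight size, and the names (SIZES_GB, static_footprint_gb) are illustrative, not part of ComfyUI.

```python
# On-disk sizes in GB, as quoted in this recipe.
SIZES_GB = {
    "transformer_fp8": 9.19,  # scaled-FP8 e4m3fn Chroma1-Base (V48)
    "t5_xxl_fp8": 4.89,       # T5 XXL fp8 text encoder
    "flux_vae": 0.33,         # ae.safetensors
}

def static_footprint_gb(resident):
    """Sum the on-disk sizes of the components kept on the GPU."""
    return round(sum(SIZES_GB[name] for name in resident), 2)

all_resident = static_footprint_gb(["transformer_fp8", "t5_xxl_fp8", "flux_vae"])
t5_offloaded = static_footprint_gb(["transformer_fp8", "flux_vae"])

print(f"everything resident: {all_resident} GB")
print(f"T5 offloaded:        {t5_offloaded} GB")
# The gap between the static weights and the top of the quoted ~10.8 GB
# envelope is what remains for activations and buffers.
print(f"allowance under 10.8 GB: {round(10.8 - t5_offloaded, 2)} GB")
```

These are estimates from file sizes, not measured runtime peaks, so treat the allowance as a rough guide only.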

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships several variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48), Chroma1-Flash (a CFG-baked fast variant), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the HF card. For HD, Flash, or Radiance, follow their own HF cards — install paths differ (Radiance does not use the FLUX VAE at all).

⚠️ The original lodestones/Chroma repo is deprecated. Its README header reads "THIS REPO IS DEPRECATED!" and directs users to "use Chroma1-HD, Chroma1-Base or Chroma1-Flash instead". The deprecated repo still hosts the shared FLUX VAE (ae.safetensors) and the original chroma-unlocked-v48.safetensors weight file, but the canonical V48 distribution is Chroma1-Base.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (the canonical BF16 single-file checkpoint is 17.8 GB on disk per the Chroma1-Base Files tab, and even the 9.19 GB scaled-FP8 transformer peaks ~14 GB if the T5 encoder is kept resident alongside it; on 12 GB the T5 must therefore be CPU-offloaded or a GGUF quant used) | RTX 4070 (12 GB, Ada sm_89)
RAM | 32 GB system recommended (T5 fp8 ~4.9 GB lives in host RAM when offloaded) |
Storage | ~15 GB (FP8-scaled transformer 9.19 GB + T5 XXL fp8 4.89 GB + FLUX VAE 0.33 GB) |
Software | ComfyUI (May 2025 or newer) + ComfyUI-GGUF custom node by city96 (only needed for the GGUF alternative path) |

Installation

1. Update ComfyUI

Update to a recent ComfyUI release. The Clybius scaled-FP8 README states it "Requires an up-to-date ComfyUI as of May 1, 2025." The RTX 4070 is Ada Lovelace (sm_89), so the default torch build that pip install torch provides for ComfyUI already includes the right kernels. No cu128-specific wheel pinning is needed; that is a Blackwell / RTX 50-series requirement only. FP8 e4m3fn is a hardware-native datatype on Ada's 4th-generation tensor cores.
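The architecture reasoning above can be written as a small check on the CUDA compute capability tuple. This is a sketch under stated assumptions: the helper names are hypothetical, and the tuple is hard-coded for the RTX 4070 so the snippet runs without a GPU. With PyTorch installed, the same tuple would come from torch.cuda.get_device_capability(0).

```python
def has_native_fp8(capability):
    """True for Ada (sm_89) and newer, which have FP8 tensor-core support."""
    return tuple(capability) >= (8, 9)

def is_blackwell_consumer(capability):
    """True for RTX 50-series (sm_120), the case this recipe says needs cu128 wheels."""
    return tuple(capability) == (12, 0)

def describe(capability):
    """Summarize what a (major, minor) compute capability implies for this recipe."""
    major, minor = capability
    return {
        "sm": f"sm_{major}{minor}",
        "native_fp8": has_native_fp8(capability),
        "needs_cu128_wheel": is_blackwell_consumer(capability),
    }

# (8, 9) is the RTX 4070's compute capability, hard-coded for illustration.
print(describe((8, 9)))
```

An Ampere card such as an RTX 3060 reports (8, 6), for which has_native_fp8 returns False; FP8 files can still load there, but without hardware FP8 math.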

2. Download the Chroma1-Base (V48) FP8-scaled transformer

On the RTX 4070's Ada 4th-gen tensor cores, FP8 e4m3fn is native, so the pre-scaled FP8 redistribution is the recommended transformer. Download the standard V48 file from Clybius/Chroma-fp8-scaled — the /v48/ subdirectory carries the FP8-scaled e4m3fn V48 weights, and the repo declares license: apache-2.0 and base_model: lodestones/Chroma (link-back to the canonical V48 lineage):

wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/Clybius/Chroma-fp8-scaled/resolve/main/v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors

This file is 9.19 GB. It uses pre-scaled FP8 e4m3fn weights rather than on-the-fly casting with --fp8_e5m2-unet, which avoids the noise artifacts that on-the-fly casting introduces on the Chroma family (see Troubleshooting).

ℹ️ License note for the Clybius repo. The /v48/ FP8-scaled files redistribute the Apache-2.0 Chroma1-Base weights. The Clybius README flags that files in its debug, merges, and hybrid-merges folders are CC BY-NC-SA 4.0 (non-commercial) because they incorporate components from a non-commercial development repo. Stay in /v48/ for the Apache-2.0 lineage; avoid the debug/merges folders if you need commercial use.

3. Download the T5 XXL text encoder and FLUX VAE

Chroma1-Base uses the standard FLUX-ecosystem T5 XXL encoder and the FLUX VAE. On a 12 GB card, use the fp8 T5 variant — it will be CPU-offloaded, but the smaller file also means less host RAM and a faster cold load:

# T5 XXL (fp8 — 4.89 GB; the fp16 variant is 9.79 GB) → clip folder
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

# FLUX VAE (ae.safetensors, 0.33 GB) → vae folder
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

The Chroma1-Base HF card ComfyUI section links the T5 XXL encoder and the FLUX VAE ae.safetensors; the VAE still lives in the deprecated lodestones/Chroma repo, which is where the canonical card's link resolves.
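The three downloads from steps 2 and 3 can be summarized as a manifest that maps each file to its ComfyUI folder. The sketch below only builds URLs and destination paths from the values given in this recipe; the helper names (plan, missing) are hypothetical, and it performs no network access.

```python
from pathlib import Path

HF = "https://huggingface.co"

# (repo, path inside repo, ComfyUI models subfolder, size in GB from this recipe)
MANIFEST = [
    ("Clybius/Chroma-fp8-scaled",
     "v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors",
     "diffusion_models", 9.19),
    ("comfyanonymous/flux_text_encoders", "t5xxl_fp8_e4m3fn.safetensors", "clip", 4.89),
    ("lodestones/Chroma", "ae.safetensors", "vae", 0.33),
]

def plan(comfy_root="ComfyUI"):
    """Return (url, destination path, size_gb) for each file in the manifest."""
    out = []
    for repo, path, folder, size_gb in MANIFEST:
        url = f"{HF}/{repo}/resolve/main/{path}"
        dest = Path(comfy_root) / "models" / folder / Path(path).name
        out.append((url, dest, size_gb))
    return out

def missing(comfy_root="ComfyUI"):
    """List destination paths that do not exist on disk yet."""
    return [dest for _, dest, _ in plan(comfy_root) if not dest.exists()]

for url, dest, size_gb in plan():
    print(f"{size_gb:>5} GB  {dest}")
print("total GB:", round(sum(size for *_, size in plan()), 2))
```

Calling missing() after the wget commands finish is a quick way to confirm that each file landed in the folder ComfyUI scans.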

4. Load the Chroma ComfyUI workflow

Download the simple Chroma text-to-image workflow JSON from the lodestones/Chroma repo:

# Simple Chroma text-to-image workflow → drag onto the ComfyUI canvas
wget https://huggingface.co/lodestones/Chroma/resolve/main/simple_workflow.json

⚠️ Workflow filename note. The Chroma1-Base card's ComfyUI section links a workflow named ChromaSimpleWorkflow20250507.json, but that exact filename returns 404 (verified live). The canonical file present in the repo is simple_workflow.json (verified live, returns 200). Use the URL above, not the one printed on the card.

Drag the workflow onto the ComfyUI canvas, then:

  1. Point the Load Diffusion Model node at the downloaded chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors, and set its weight_dtype to default (as documented in the Clybius README: "Load the model using Load Diffusion Model in ComfyUI / Set weight_dtype to default").
  2. Confirm the text-encoder node points at t5xxl_fp8_e4m3fn.safetensors from step 3.
  3. Confirm the VAE loader points at ae.safetensors.
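The three checks above can also be done by reading the workflow JSON directly. The sketch below assumes the file is in ComfyUI's UI export format (a top-level "nodes" list whose entries carry "type" and "widgets_values") and that the loaders use the class names UNETLoader, CLIPLoader, and VAELoader. Both are assumptions to verify against your copy of simple_workflow.json, and the sample data is hypothetical.

```python
import json

# Node class names assumed for the three loaders; check them against your workflow file.
LOADER_TYPES = {"UNETLoader", "CLIPLoader", "VAELoader"}

def loader_settings(workflow):
    """Map loader node type -> widgets_values for a UI-format workflow dict."""
    return {
        node["type"]: node.get("widgets_values", [])
        for node in workflow.get("nodes", [])
        if node.get("type") in LOADER_TYPES
    }

def load_workflow(path):
    """Read a workflow JSON file from disk (defined for reuse, not called here)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Hypothetical miniature workflow standing in for simple_workflow.json.
SAMPLE = {"nodes": [
    {"id": 1, "type": "UNETLoader", "widgets_values": [
        "chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors",
        "default"]},
    {"id": 2, "type": "CLIPLoader", "widgets_values": ["t5xxl_fp8_e4m3fn.safetensors"]},
    {"id": 3, "type": "VAELoader", "widgets_values": ["ae.safetensors"]},
    {"id": 4, "type": "KSampler", "widgets_values": []},
]}

for node_type, values in loader_settings(SAMPLE).items():
    print(node_type, values)
```

To inspect the real file, pass loader_settings(load_workflow("simple_workflow.json")) and compare the printed filenames with the ones downloaded in steps 2 and 3.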

5. Run ComfyUI so the T5 encoder offloads to CPU

On a 12 GB card, the 9.19 GB FP8 transformer plus a resident 4.89 GB T5 encoder would overflow VRAM. ComfyUI's native smart-memory management runs the text encoder, frees it, then loads the diffusion model — but on a tight 12 GB card you should force the offload explicitly with --lowvram so the T5 conditioning is computed and then evicted to host RAM before the diffusion transformer takes the GPU:

python main.py --lowvram

With --lowvram, ComfyUI streams components off the GPU when memory is tight, keeping the FP8 transformer + VAE resident during sampling while the T5 lives in system RAM.
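Before launching, it can help to see how much VRAM the desktop is already using. The sketch below parses the output of nvidia-smi's CSV query (memory.total and memory.used, reported in MiB) and converts to decimal GB so the numbers are comparable with the file sizes quoted in this recipe. The 9.52 GB and 1.28 GB defaults are derived from this recipe's figures, not measured, and the helper names are hypothetical.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"]

def parse_mib(line):
    """Parse a 'total, used' line in MiB into (total_gb, free_gb) in decimal GB."""
    total_mib, used_mib = (int(field) for field in line.split(","))
    to_gb = 1024 ** 2 / 1e9
    return round(total_mib * to_gb, 2), round((total_mib - used_mib) * to_gb, 2)

def query_gpu():
    """Run nvidia-smi for the first GPU (not called here; needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_mib(out.splitlines()[0])

def fits(free_gb, weights_gb=9.52, activation_allowance_gb=1.28):
    """True if free VRAM covers transformer + VAE plus an assumed activation allowance."""
    return free_gb >= weights_gb + activation_allowance_gb

# Parsing a made-up sample line instead of calling the driver.
total_gb, free_gb = parse_mib("1000, 250")
print(total_gb, free_gb)
```

On a machine with the driver installed, fits(query_gpu()[1]) gives a rough go/no-go before starting ComfyUI.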

Alternative: GGUF path (lower VRAM, more granular quant tiers)

If you prefer to drop below the FP8 footprint (e.g. to stack acceleration LoRAs or push resolution), install the ComfyUI-GGUF custom node by city96:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install -r ComfyUI/custom_nodes/ComfyUI-GGUF/requirements.txt

Restart ComfyUI, then pick one quantization from silveroxides/Chroma1-Base-GGUF — file sizes verbatim from the Files tab:

Quant | Size
Q2_K | 3.41 GB
Q3_K_S | 4.29 GB
Q4_K_S | 5.43 GB
Q4_K_M | 5.57 GB
Q5_K_S | 6.51 GB
Q5_K_M | 6.65 GB
Q6_K | 7.65 GB
Q8_0 | 9.74 GB
BF16 | 17.8 GB

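Drop the .gguf into ComfyUI/models/diffusion_models/ and swap the Load Diffusion Model node for the Unet Loader (GGUF) node from ComfyUI-GGUF. On a 12 GB card the Q4_K_M tier (5.57 GB) leaves enough room to keep the T5 fp8 encoder (4.89 GB) resident instead of offloading it. Q5_K_M (6.65 GB) is borderline: with the T5 and the VAE resident the on-disk total is about 11.9 GB, which is above the ~10.5–11.3 GB a desktop card with a monitor attached typically exposes, so expect to offload the T5 at that tier.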
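Choosing a tier comes down to comparing the table's sizes against the VRAM you actually have free. The sketch below picks the largest quant whose on-disk size fits next to the VAE and, optionally, a resident T5 fp8 encoder. It counts on-disk sizes only, so real runtime usage will differ, and the function name and parameters are hypothetical.

```python
# GGUF sizes in GB, copied from the table above.
GGUF_GB = {"Q2_K": 3.41, "Q3_K_S": 4.29, "Q4_K_S": 5.43, "Q4_K_M": 5.57,
           "Q5_K_S": 6.51, "Q5_K_M": 6.65, "Q6_K": 7.65, "Q8_0": 9.74, "BF16": 17.8}
T5_FP8_GB, VAE_GB = 4.89, 0.33

def largest_fitting(usable_gb, keep_t5_resident=True, activations_gb=0.0):
    """Return the largest quant that fits in usable_gb, or None if none does."""
    fixed = VAE_GB + activations_gb + (T5_FP8_GB if keep_t5_resident else 0.0)
    fitting = [(size, name) for name, size in GGUF_GB.items()
               if size + fixed <= usable_gb]
    return max(fitting)[1] if fitting else None

for usable in (10.5, 11.3):
    print(usable, "GB usable, T5 resident: ", largest_fitting(usable))
    print(usable, "GB usable, T5 offloaded:", largest_fitting(usable, keep_t5_resident=False))
```

Passing a non-zero activations_gb makes the estimate more cautious; the right value depends on resolution and is not given in the source.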

Running

Use a 1024×1024 latent for the first run (the resolution Chroma was trained at). The Chroma1-Base diffusers Quickstart on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly (20–40 steps work; the example workflow defaults are a safe start).
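For a sense of what a 1024×1024 image means to the model, the sketch below converts pixel dimensions into latent dimensions and transformer token counts. It assumes the usual FLUX-family conventions (8× VAE downscale, 16 latent channels, 2×2 latent patches per token); none of these constants are stated in this recipe, so treat them as assumptions.

```python
VAE_DOWNSCALE = 8     # assumed FLUX VAE spatial compression
LATENT_CHANNELS = 16  # assumed FLUX VAE latent channel count
PATCH = 2             # assumed 2x2 latent patch per transformer token

def latent_shape(width, height):
    """Return (channels, latent_height, latent_width) for a pixel resolution."""
    step = VAE_DOWNSCALE * PATCH
    if width % step or height % step:
        raise ValueError(f"width and height should be multiples of {step}")
    return (LATENT_CHANNELS, height // VAE_DOWNSCALE, width // VAE_DOWNSCALE)

def image_tokens(width, height):
    """Number of image tokens the transformer attends over."""
    _, latent_h, latent_w = latent_shape(width, height)
    return (latent_h // PATCH) * (latent_w // PATCH)

for width, height in ((1024, 1024), (768, 1344)):
    print(f"{width}x{height}: latent {latent_shape(width, height)}, "
          f"{image_tokens(width, height)} tokens")
```

Token count grows with pixel area, which is why raising the resolution increases activation memory and per-step time together.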

Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/

The first generation pays a cold-load cost (T5 → CPU, conditioning computed, then weights → VRAM). Subsequent generations with the same model reuse the loaded transformer.

ℹ️ PCIe Gen4 throughput note. The RTX 4070 uses a PCIe Gen4 x16 link, about half the host-bandwidth of a PCIe Gen5 card. Because the T5 encoder is CPU-offloaded on this card, the conditioning step and any --lowvram component streaming move data across that Gen4 link, so per-image throughput on the offloaded portion is lower than on a faster-bus card running the same path. The path still fits 12 GB — only throughput is affected, and only on the offloaded stages.
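The "about half" figure follows from the per-lane signalling rates: 16 GT/s for PCIe 4.0 and 32 GT/s for PCIe 5.0, both with 128b/130b encoding. The sketch below turns that into a theoretical lower bound on the time to move the 4.89 GB T5 fp8 file across the link once. Real transfers are slower, and the function name is hypothetical.

```python
def pcie_gb_per_s(gt_per_s, lanes=16):
    """Theoretical one-direction bandwidth in GB/s with 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8

GEN4 = pcie_gb_per_s(16)  # PCIe 4.0: 16 GT/s per lane
GEN5 = pcie_gb_per_s(32)  # PCIe 5.0: 32 GT/s per lane
T5_GB = 4.89              # T5 XXL fp8 size quoted in this recipe

for name, bandwidth in (("Gen4 x16", GEN4), ("Gen5 x16", GEN5)):
    millis = T5_GB / bandwidth * 1000
    print(f"{name}: {bandwidth:.1f} GB/s peak, T5 fp8 copy takes at least {millis:.0f} ms")
```

Even at the theoretical peak, the one-off copy is a small fraction of a multi-step generation, which is consistent with the note that only the offloaded stages are affected.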

ℹ️ No Blackwell-specific wheel selection needed. The RTX 4070 is Ada Lovelace (sm_89), not Blackwell (RTX 50-series, sm_120) — the default pip install torch shipped with ComfyUI already includes sm_89 kernels, and FlashAttention-2 has full sm_89 coverage. No cu128-specific wheel pinning or attn_implementation overrides are required, and the FP8 e4m3fn datatype the recommended path uses is hardware-native on Ada's 4th-gen tensor cores.

Results

  • Speed: Omitted. No RTX 4070-named generation-time data point on Chroma1-Base exists. The RTX 4070 has ~30% fewer CUDA cores and ~25% less memory bandwidth than the 16 GB Ada siblings (RTX 4070 Ti SUPER / RTX 4080), so a figure scaled from those cards would overstate both the memory-bound (VAE decode) and compute-bound (diffusion) throughput by far more than the ~10% forward-statement threshold — and the CPU-offloaded T5 on this card's Gen4 bus makes any cross-card extrapolation worse still. Once a community measurement lands via /contribute, the /check/chroma-v48/rtx-4070 endpoint will surface it.
  • VRAM usage: Plan for a ~9.5–10.8 GB resident envelope on the FP8 transformer + CPU-offload-T5 path documented above. The FP8-scaled transformer is 9.19 GB on disk per Clybius/Chroma-fp8-scaled; with the T5 XXL fp8 encoder (4.89 GB per the flux_text_encoders Files tab) offloaded to host RAM, only the transformer, the FLUX VAE (0.33 GB), and per-step activations stay resident — fitting inside the RTX 4070's 12 GB envelope (a desktop 12 GB card with a monitor attached exposes ~10.5–11.3 GB usable). This figure is derived from the cited on-disk sizes, not a measured runtime peak; once an empirical 4070 number lands, /check/chroma-v48/rtx-4070 will replace it. As an independent fit check, a community ComfyUI workflow by gabrielx (published Aug 29, 2025) runs a Chroma V48 Q4_0 GGUF transformer with an FP8 T5 encoder on an "RTX 3070 with 8GB VRAM and 40GB RAM" — confirming the V48 path runs comfortably below 12 GB on a real consumer card (the 8 GB card even adds an upscaler/refiner pipeline; the plain text-to-image path on the 4070's 12 GB has more headroom).
  • Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed; the canonical Quickstart uses 40 steps.

For the full benchmark data, see /check/chroma-v48/rtx-4070.

Troubleshooting

Out-of-memory loading the model on 12 GB

If the FP8 transformer + T5 path OOMs, the T5 encoder is most likely staying resident on the GPU. Start ComfyUI with --lowvram (step 5) so the T5 conditioning is computed on the CPU and evicted before the diffusion transformer loads. If you still hit OOM at 1024×1024, switch to the GGUF alternative and load a smaller tier — Q4_K_M 5.57 GB or Q4_K_S 5.43 GB per the silveroxides Files tab — which leaves enough room to keep the T5 resident and skip offloading entirely.

Need even more headroom? GGUF-quantize the T5 encoder too

city96's t5-v1_1-xxl-encoder-gguf ships the encoder as GGUF (e.g. Q5_K_M, Q4_K_M), loaded with the CLIPLoader (gguf) node from ComfyUI-GGUF. Pairing a GGUF transformer tier with a GGUF T5 lets you run the whole graph resident on 12 GB without any CPU-offload streaming — useful if the Gen4 bus makes the offloaded path feel slow.

Noise artifacts with --fp8_e5m2-unet

The FP8-scaled file in step 2 uses pre-scaled e4m3fn weights, which is the clean path. Avoid ComfyUI's on-the-fly --fp8_e5m2-unet flag — it produces noise artifacts on Chroma1-family models. Stick to the pre-scaled e4m3fn file (or a GGUF tier from the alternative path), or --fp8_e4m3fn-unet if you need an on-the-fly fp8 cast.
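One difference between the two FP8 formats is mantissa width: e4m3 keeps 3 explicit mantissa bits and e5m2 keeps 2, so e5m2 rounds twice as coarsely at the same magnitude. The sketch below illustrates only that rounding granularity. It ignores exponent range, subnormals, saturation, and the per-tensor scales in the pre-scaled file, and the source does not say that mantissa width is the cause of the artifacts, so treat that link as an assumption.

```python
import math

def round_to_mantissa(x, mantissa_bits):
    """Round x to the nearest value with the given number of explicit mantissa bits.

    Ignores exponent range, subnormals, and saturation, so this shows
    rounding granularity only and is not a full FP8 cast.
    """
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)          # x = frac * 2**exp with 0.5 <= |frac| < 1
    steps = 2 ** (mantissa_bits + 1)   # implicit leading bit + explicit bits
    return math.ldexp(round(frac * steps) / steps, exp)

for value in (1.1, 0.3, 2.7):
    e4m3 = round_to_mantissa(value, 3)
    e5m2 = round_to_mantissa(value, 2)
    print(f"{value}: e4m3 -> {e4m3} (err {abs(value - e4m3):.3f}), "
          f"e5m2 -> {e5m2} (err {abs(value - e5m2):.3f})")
```

For 1.1, the 3-bit mantissa lands on 1.125 while the 2-bit mantissa lands on 1.0, a fourfold larger error in this particular case.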

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a separate model (retrained from v.48). Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder). The deprecated lodestones/Chroma repo's chroma-unlocked-v48.safetensors is the original V48 weight file, but the canonical, currently-maintained V48 distribution is Chroma1-Base.

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.