Chroma1-Base (V48) on RTX 5070: Uncensored 8.9B FLUX.1-Schnell De-Distillation via GGUF Q4_K_M in ComfyUI

What You'll Build

A working ComfyUI setup that runs Chroma1-Base on an RTX 5070 12 GB (Blackwell, sm_120). Chroma1-Base is the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock, and the official HF card states explicitly that "Chroma1-Base is Chroma-v.48". The scaled-FP8 transformer path that the 16 GB siblings use (a 9.19 GB FP8 checkpoint with the T5 encoder and FLUX VAE alongside, ~14 GB total) does not fit on a 12 GB card. This recipe therefore leads with the GGUF Q4_K_M path: a 5.57 GB quantized transformer loaded via the ComfyUI-GGUF Unet Loader (GGUF) node, which brings the combined weight footprint down to ~10.8 GB, small enough to fit in 12 GB with display headroom.

Hardware data: RTX 5070 (12 GB GDDR7, ~672 GB/s memory bandwidth, Blackwell sm_120, native FP8 tensor cores) · runs at GGUF Q4_K_M (5.57 GB transformer) with the T5 XXL fp8 encoder (4.89 GB) and FLUX VAE (0.33 GB) · See benchmark data

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships several variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48), Chroma1-Flash (a CFG-baked fast variant), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the HF card. For HD, Flash, or Radiance, follow their own HF cards — install paths differ (Radiance does not use the FLUX VAE at all).

⚠️ The original lodestones/Chroma repo is deprecated. Its README now directs users to "use Chroma1-HD, Chroma1-Base or Chroma1-Flash instead". The deprecated repo still hosts the shared FLUX VAE (ae.safetensors) and the original chroma-unlocked-v48.safetensors weight file, but the canonical V48 distribution is Chroma1-Base.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (the canonical BF16 single-file checkpoint is 17.8 GB on disk per the Chroma1-Base Files tab, and even the 9.19 GB scaled-FP8 path peaks ~14 GB with the encoder + VAE alongside — both overflow 12 GB, so a GGUF Q4/Q5 quant is required)	RTX 5070 (12 GB, Blackwell sm_120)
RAM	16 GB system	—
Storage	~11 GB (GGUF Q4_K_M 5.57 GB + T5 XXL fp8 4.89 GB + FLUX VAE 0.33 GB)	—
Software	ComfyUI (May 2025 or newer) + ComfyUI-GGUF custom node by city96	—

Installation

1. Update ComfyUI and install the cu128 PyTorch wheel

Update to a recent ComfyUI release; Chroma support requires a build from May 1, 2025 or later. The RTX 5070 is Blackwell (sm_120), a newer GPU architecture that requires CUDA 12.8+ kernels. If you manage ComfyUI's Python environment yourself, install the cu128 PyTorch wheel:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

The cu128 wheel ships sm_120 kernels for ComfyUI's default attention backend (scaled_dot_product_attention, SDPA). You do not need FlashAttention-2 — FA2 sm_120 wheel coverage is still in progress at Dao-AILab/flash-attention#2168, and ComfyUI's native diffusion path does not depend on it.
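To confirm that the installed wheel actually carries sm_120 kernels, a minimal check along these lines may help. It is a sketch, not part of ComfyUI: the helper names are hypothetical, and it relies only on `torch.cuda.is_available()`, `torch.cuda.get_arch_list()`, and `torch.version.cuda`.

```python
def supports_sm120(arch_list):
    """Return True if a PyTorch arch list includes sm_120 (consumer Blackwell)."""
    return "sm_120" in arch_list


def describe_torch_build():
    """Summarize the installed PyTorch build, or return None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    # get_arch_list() reports the architectures this build was compiled for.
    arch_list = torch.cuda.get_arch_list() if torch.cuda.is_available() else []
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # expected to start with "12.8" on a cu128 wheel
        "arch_list": arch_list,
        "sm_120": supports_sm120(arch_list),
    }
```

Run `print(describe_torch_build())` inside the same Python environment ComfyUI uses. If `sm_120` comes back `False` on a working RTX 5070, the environment is probably still on an older CUDA wheel.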

2. Install the ComfyUI-GGUF custom node

The GGUF transformer needs the Unet Loader (GGUF) node, which ships in city96's ComfyUI-GGUF custom node pack:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install -r ComfyUI/custom_nodes/ComfyUI-GGUF/requirements.txt

Restart ComfyUI after installing so the Unet Loader (GGUF) and CLIPLoader (gguf) nodes register.
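One way to confirm that the nodes registered after the restart is to query the running server's `/object_info` endpoint, which lists every node class ComfyUI has loaded. The sketch below assumes the default address `http://127.0.0.1:8188` and assumes the loaders register under the class names `UnetLoaderGGUF` and `CLIPLoaderGGUF`; check both against your install.

```python
import json
import urllib.request

# Assumed class names for the ComfyUI-GGUF loaders; verify against your install.
REQUIRED_NODES = ("UnetLoaderGGUF", "CLIPLoaderGGUF")


def missing_nodes(object_info, required=REQUIRED_NODES):
    """Return the required node class names that the server did not report."""
    return [name for name in required if name not in object_info]


def fetch_object_info(base_url="http://127.0.0.1:8188"):
    """Fetch the node registry from a running ComfyUI server."""
    with urllib.request.urlopen(f"{base_url}/object_info", timeout=30) as resp:
        return json.load(resp)

# Usage, with ComfyUI running:
#   print(missing_nodes(fetch_object_info()))
```

An empty list means both loaders are available. Any name that is printed points to a failed install, or to a class name that differs from the assumption above.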

3. Download the Chroma1-Base (V48) GGUF Q4_K_M checkpoint

The 12 GB lead path uses the GGUF Q4_K_M V48 weights from silveroxides/Chroma1-Base-GGUF, an Apache 2.0 redistribution whose card lists base_model: lodestones/Chroma1-Base. The repo ships per-quant-tier files (sizes verbatim from the Files tab): Q4_K_M 5.57 GB, Q4_K_S 5.43 GB, Q5_K_M 6.65 GB, Q6_K 7.65 GB, Q8_0 9.74 GB, plus smaller Q3_K_S 4.29 GB and Q2_K 3.41 GB tiers.

# Chroma1-Base (V48) GGUF Q4_K_M checkpoint (5.57 GB) → diffusion_models folder
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/silveroxides/Chroma1-Base-GGUF/resolve/main/Chroma1-Base-Q4_K_M.gguf

Q4_K_M (5.57 GB) is the recommended starting tier for a 12 GB card: it leaves room for the T5 encoder and FLUX VAE while keeping more quality than the Q3/Q2 tiers. On Blackwell sm_120 the GGUF transformer is consumed by ComfyUI-GGUF's loader and dequantized on the fly.
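A truncated download or an HTML error page saved under a `.gguf` name will fail at load time. A quick sanity check is to read the file header: GGUF files begin with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit format version. The helper below is a minimal sketch with a hypothetical name; it does not validate the tensors themselves.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (is_gguf, version) based on the first 8 bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != GGUF_MAGIC:
        return False, None
    # Bytes 4..8 hold the GGUF format version as an unsigned little-endian int.
    (version,) = struct.unpack("<I", head[4:8])
    return True, version

# Usage:
#   read_gguf_header("ComfyUI/models/diffusion_models/Chroma1-Base-Q4_K_M.gguf")
```

If the first element is `False`, re-download the file before debugging anything inside ComfyUI.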

4. Download the T5 XXL text encoder and FLUX VAE

Chroma1-Base uses the standard FLUX-ecosystem T5 XXL encoder and the FLUX VAE. On a 12 GB card, use the fp8 T5 variant to keep the encoder footprint down:

# T5 XXL (fp8 — keeps the encoder 4.89 GB instead of the 9.79 GB fp16 variant) → clip folder
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

# FLUX VAE (ae.safetensors, 0.33 GB) → vae folder
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

The T5 XXL and FLUX VAE are the assets the Chroma1-Base HF card "ComfyUI" section points to; the VAE still lives in the deprecated lodestones/Chroma repo, which is where the canonical card's link resolves.
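Before opening ComfyUI, it can be useful to confirm that all three files from steps 3 and 4 landed in the expected folders at roughly the sizes quoted in this recipe. The sketch below treats the quoted sizes as decimal gigabytes. The function name and the 0.05 GB tolerance are arbitrary choices for illustration.

```python
from pathlib import Path

# Relative paths and on-disk sizes (GB) as quoted in this recipe.
EXPECTED_FILES = {
    "models/diffusion_models/Chroma1-Base-Q4_K_M.gguf": 5.57,
    "models/clip/t5xxl_fp8_e4m3fn.safetensors": 4.89,
    "models/vae/ae.safetensors": 0.33,
}


def check_layout(comfy_root, expected=EXPECTED_FILES, tolerance_gb=0.05):
    """Map each expected file to 'ok', 'missing', or an unexpected-size note."""
    report = {}
    for rel_path, want_gb in expected.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
            continue
        got_gb = path.stat().st_size / 1e9
        if abs(got_gb - want_gb) <= tolerance_gb:
            report[rel_path] = "ok"
        else:
            report[rel_path] = f"unexpected size {got_gb:.2f} GB"
    return report

# Usage:
#   for name, status in check_layout("ComfyUI").items():
#       print(status, name)
```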

5. Load the Chroma ComfyUI workflow

The Chroma ComfyUI workflow JSON ships in the deprecated lodestones/Chroma repo:

# Simple Chroma text-to-image workflow → drag onto the ComfyUI canvas
wget https://huggingface.co/lodestones/Chroma/resolve/main/simple_workflow.json

⚠️ Workflow filename note. The Chroma1-Base card's ComfyUI section links a workflow named ChromaSimpleWorkflow20250507.json, but that exact filename returns 404. The canonical file present in the repo is simple_workflow.json (verified live). Use the URL above, not the one printed on the card.

Drag the workflow onto the ComfyUI canvas, then:

Swap the workflow's Load Diffusion Model node for the Unet Loader (GGUF) node (from ComfyUI-GGUF), and point it at Chroma1-Base-Q4_K_M.gguf from step 3.
Confirm the text-encoder node points at t5xxl_fp8_e4m3fn.safetensors from step 4.
Confirm the VAE loader points at ae.safetensors.
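The three checks above can also be done outside the UI by saving the edited workflow and listing which model files each node references. The sketch below assumes the UI-format workflow JSON, in which each entry of the top-level `nodes` list carries a `type` and a `widgets_values` field. The filename in the usage comment is hypothetical.

```python
import json

MODEL_SUFFIXES = (".gguf", ".safetensors")


def loader_summary(workflow):
    """Return (node_type, [model filenames]) for nodes that reference model files."""
    rows = []
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values") or []
        if isinstance(values, dict):  # some custom nodes store widget values as a dict
            values = list(values.values())
        files = [
            v for v in values
            if isinstance(v, str) and v.endswith(MODEL_SUFFIXES)
        ]
        if files:
            rows.append((node.get("type"), files))
    return rows

# Usage (hypothetical filename for your saved, edited workflow):
#   with open("chroma_q4_workflow.json") as f:
#       for node_type, files in loader_summary(json.load(f)):
#           print(node_type, files)
```

The output should show the GGUF loader pointing at Chroma1-Base-Q4_K_M.gguf, the text-encoder loader at t5xxl_fp8_e4m3fn.safetensors, and the VAE loader at ae.safetensors.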

Running

Use a 1024×1024 latent for the first run (the resolution Chroma was trained at). The Chroma1-Base diffusers Quickstart on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly.

Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/

The first generation pays a cold-load cost (weights → VRAM, text encoder → VRAM). Subsequent generations with the same model reuse the loaded weights.

Results

Speed: Omitted. No generation-time measurement for Chroma1-Base on an RTX 5070 exists, and a number cannot be borrowed from the 16 GB Blackwell siblings. The RTX 5070 has ~672 GB/s memory bandwidth and 6144 CUDA cores versus the RTX 5070 Ti's ~896 GB/s and 8960 cores, which is about 25% less bandwidth and about 31% fewer cores. Assuming throughput scales linearly with those specs, a 5070 Ti figure would overstate the 5070's memory-bound (VAE decode) throughput by roughly 33% and its compute-bound (diffusion) throughput by roughly 46%, outside the forward-statement threshold. Once a community measurement lands via /contribute, the /check/chroma-v48/rtx-5070 endpoint will surface it.
VRAM usage: Plan for a ~11 GB resident envelope on the GGUF Q4_K_M path documented above. The Q4_K_M transformer is 5.57 GB on disk per the silveroxides Files tab; the T5 XXL fp8 encoder (4.89 GB per the flux_text_encoders Files tab) and the FLUX VAE (0.33 GB) load alongside, for ~10.8 GB before per-step activations. That fits inside the RTX 5070's 12 GB envelope with display headroom. This figure is derived from the cited on-disk sizes, not a measured runtime peak; once an empirical 5070 number lands, /check/chroma-v48/rtx-5070 will replace it. As an independent fit check, a community ComfyUI workflow by gabrielx (published Aug 29, 2025) runs a Chroma V48 Q4_0 GGUF transformer with an FP8 T5 encoder on an "RTX 3070 with 8GB VRAM and 40GB RAM". That indicates the V48 Q4 GGUF path can run on well under 12 GB of VRAM on a real consumer card (the 8 GB card runs an upscaler/refiner pipeline on top; the plain text-to-image path on a 12 GB card has more headroom).
Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed; the canonical Quickstart uses 40 steps. Q4_K_M trades some fidelity versus FP8/BF16; if you have display headroom to spare, Q5_K_M (6.65 GB) is a step up in quality (see Troubleshooting).
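The arithmetic behind the Speed and VRAM entries above can be reproduced in a few lines. This is a sketch using only the figures quoted in this recipe. The linear-scaling reading of the spec ratios is a simplifying assumption, and the sums cover weights only, not runtime activations.

```python
# On-disk sizes in GB, as quoted in this recipe.
SIZES_GB = {
    "transformer_bf16": 17.8,
    "transformer_fp8": 9.19,
    "transformer_q4_k_m": 5.57,
    "t5xxl_fp8": 4.89,
    "flux_vae": 0.33,
}


def resident_gb(*parts):
    """Sum the on-disk sizes of the named components (weights only)."""
    return round(sum(SIZES_GB[p] for p in parts), 2)


def deficit_and_overstatement(slower, faster):
    """Return (how far 'slower' falls short, how much 'faster' overstates it)."""
    return 1 - slower / faster, faster / slower - 1


q4_path = resident_gb("transformer_q4_k_m", "t5xxl_fp8", "flux_vae")   # 10.79
fp8_path = resident_gb("transformer_fp8", "t5xxl_fp8", "flux_vae")     # 14.41

# RTX 5070 vs RTX 5070 Ti: memory bandwidth (GB/s) and CUDA core counts.
bw_deficit, bw_over = deficit_and_overstatement(672, 896)        # 0.25, ~0.33
core_deficit, core_over = deficit_and_overstatement(6144, 8960)  # ~0.31, ~0.46
```

The Q4_K_M sum lands under 12 GB and the scaled-FP8 sum does not, which is the whole case for the GGUF path on this card. The ratio helper also shows why a "25% slower" card makes the faster card's number look about 33% too optimistic.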

For the full benchmark data, see /check/chroma-v48/rtx-5070.

Troubleshooting

Need more VRAM headroom? Quantize the T5 encoder too, or use --lowvram

If the Q4_K_M envelope is too tight on your card (a desktop 12 GB card with a monitor attached exposes only ~10.5–11.3 GB usable), there are two documented ways to free a few more GB:

GGUF-quantize the T5 encoder. city96's t5-v1_1-xxl-encoder-gguf ships the encoder as GGUF — e.g. Q5_K_M at 3.39 GB or Q4_K_M at 2.90 GB versus the 4.89 GB fp8 safetensors. Load it with the CLIPLoader (gguf) node (also from ComfyUI-GGUF). With Q4_K_M transformer (5.57 GB) + GGUF Q5_K_M T5 (3.39 GB) + VAE (0.33 GB) the resident envelope drops to ~9.3 GB.
Run ComfyUI with --lowvram. ComfyUI's native low-VRAM mode streams model components off the GPU when memory is tight, at the cost of some speed. The ComfyUI-GGUF README notes that the Force/Set CLIP Device node is not part of the pack — for explicit device placement use ComfyUI's own flags rather than expecting the GGUF node pack to offload the encoder for you.
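To see how the encoder choice interacts with the transformer tier, the sketch below picks the largest transformer quant whose weights fit a given usable-VRAM figure. The sizes are the ones quoted in this recipe. The function name and the `headroom_gb` reserve for activations are hypothetical placeholders, not measured values.

```python
# Transformer tiers (GB) from the silveroxides Files tab, as quoted in this recipe.
TRANSFORMER_TIERS_GB = {
    "Q2_K": 3.41,
    "Q3_K_S": 4.29,
    "Q4_K_S": 5.43,
    "Q4_K_M": 5.57,
    "Q5_K_M": 6.65,
    "Q6_K": 7.65,
    "Q8_0": 9.74,
}
VAE_GB = 0.33


def pick_transformer_tier(usable_gb, encoder_gb, headroom_gb=0.5):
    """Return the largest tier that fits next to the encoder and VAE, or None."""
    budget = usable_gb - encoder_gb - VAE_GB - headroom_gb
    fitting = {k: v for k, v in TRANSFORMER_TIERS_GB.items() if v <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)

# Usage: 10.5 GB usable with the fp8 T5 (4.89 GB) vs. the GGUF Q5_K_M T5 (3.39 GB)
#   pick_transformer_tier(10.5, 4.89)
#   pick_transformer_tier(10.5, 3.39)
```

Under these assumptions, switching the encoder to a GGUF quant is what keeps Q4_K_M viable at the low end of the usable-VRAM range. Treat the result as a starting point and confirm with an actual run.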

Out-of-memory at high resolution or batch size > 1

The Q4_K_M path leaves modest headroom on a 12 GB card. If 1280×1280 or larger, or batch_size > 1, pushes the 5070 to OOM, drop to a smaller GGUF tier — Q4_K_S 5.43 GB, Q3_K_S 4.29 GB, or Q2_K 3.41 GB per the silveroxides Files tab — to free a few GB, or keep resolution at 1024×1024. The HF diffusers Chroma docs note that Chroma can use the same memory optimizations as Flux (VAE tiling/slicing) on the diffusers path.
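A rough way to see why resolution and batch size eat into headroom is to count the image tokens the transformer processes per step. The sketch below assumes Chroma keeps the FLUX layout: an 8x VAE downsample followed by 2x2 latent patching, so one token covers a 16×16 pixel area. It gives a relative scale only, not a memory estimate in GB.

```python
def image_tokens(width, height, batch_size=1, vae_downsample=8, patch=2):
    """Count transformer image tokens for a batch, assuming FLUX-style patching."""
    pixels_per_token = vae_downsample * patch  # 16 under the stated assumption
    if width % pixels_per_token or height % pixels_per_token:
        raise ValueError(f"width and height must be multiples of {pixels_per_token}")
    per_image = (width // pixels_per_token) * (height // pixels_per_token)
    return per_image * batch_size


base = image_tokens(1024, 1024)    # 4096 tokens
larger = image_tokens(1280, 1280)  # 6400 tokens
growth = larger / base             # 1.5625x the tokens per step
```

Going from 1024×1024 to 1280×1280 raises the token count by about 56%, and batch_size=2 doubles it outright, so activation memory grows at least in proportion while the weights stay the same size.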

BF16 and scaled-FP8 do not fit 12 GB — use GGUF on this card

The canonical full-precision path is the BF16 single-file checkpoint (Chroma1-Base.safetensors, 17.8 GB on disk per the Chroma1-Base Files tab), and even the scaled-FP8 transformer (9.19 GB) peaks around 14 GB with the T5 encoder and FLUX VAE on top — which is why the 16 GB-class sibling recipes (RTX 5070 Ti, RTX 4080) use scaled-FP8 and the 24 GB siblings (RTX 4090, RTX 3090, RTX 5090) use BF16. Neither fits the 5070's 12 GB; stay on the GGUF Q4/Q5 path on this card.

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a separate model (retrained from v.48). Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder). The deprecated lodestones/Chroma repo's chroma-unlocked-v48.safetensors is the original V48 weight file, but the canonical, currently-maintained V48 distribution is Chroma1-Base.

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.