
Chroma1-Base (V48) on RTX 5070 Ti: Uncensored 8.9B FLUX.1-Schnell De-Distillation via Blackwell-Native FP8 in ComfyUI

image · intermediate · 14GB+ VRAM · Jun 3, 2026
prerequisites
  • NVIDIA RTX 5070 Ti (16GB VRAM, Blackwell sm_120) or equivalent 16GB consumer card
  • Python 3.10+
  • ComfyUI installed and updated to a recent release (May 2025 or newer)
  • PyTorch built against the CUDA 12.8 (cu128) wheel for sm_120 kernels
  • ~16 GB free disk (the FP8 checkpoint, T5 XXL fp8 encoder, and FLUX VAE total about 14.4 GB of downloads)

What You'll Build

A working ComfyUI setup that runs Chroma1-Base — the 8.9B-parameter, Apache 2.0, uncensored re-derivation of FLUX.1-Schnell published by Lodestone Rock and explicitly labeled "Chroma1-Base is Chroma-v.48" on the official HF card — on an RTX 5070 Ti 16GB (Blackwell, sm_120). This recipe leads with the scaled-FP8 path: the RTX 5070 Ti's Blackwell sm_120 GPU has native FP8 (E4M3) tensor cores, so a pre-quantized scaled-FP8 checkpoint runs at hardware speed and fits the 16 GB envelope with the T5 encoder and FLUX VAE alongside.

Hardware data: RTX 5070 Ti (16 GB GDDR7, ~896 GB/s memory bandwidth, Blackwell sm_120, native FP8 tensor cores) · runs at scaled-FP8 (9.19 GB checkpoint) with the T5 XXL fp8 encoder and FLUX VAE · See benchmark data

ℹ️ Why Chroma1-Base and not Chroma1-HD or Chroma1-Radiance. The Chroma family ships several current variants from the same author: Chroma1-Base (the literal V48 weights), Chroma1-HD (a successor retrained from V48 as a finetune-ready base), Chroma1-Flash (a CFG-baked fast variant), and Chroma1-Radiance (a different output head — no FLUX VAE, different decoder). This recipe pins Chroma1-Base because that is what V48 specifically is, per the Chroma1-Base HF card (its "P.S" line). For Chroma1-HD, Chroma1-Flash, or Chroma1-Radiance, follow their own respective HF cards — install paths differ (Radiance does not use the FLUX VAE at all).

⚠️ The original lodestones/Chroma repo is deprecated. Its README now opens with "THIS REPO IS DEPRECATED!" and "use Chroma1-HD, Chroma1-Base or Chroma1-Flash instead". The deprecated repo still hosts the shared FLUX VAE (ae.safetensors) and the original chroma-unlocked-v48.safetensors weight file, but the canonical V48 distribution is Chroma1-Base.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (the canonical BF16 single-file checkpoint is 17.8 GB on disk per the Chroma1-Base Files tab, so it overflows 16 GB before the encoder loads; a scaled-FP8 or GGUF quant is required) | RTX 5070 Ti (16 GB, Blackwell sm_120)
RAM | 16 GB system |
Storage | ~16 GB free (downloads total about 14.4 GB: FP8 checkpoint 9.19 GB + T5 XXL fp8 4.89 GB + FLUX VAE 0.33 GB) |
Software | ComfyUI (May 2025 or newer) + ComfyUI-GGUF custom node by city96 for the alternative GGUF path |

Installation

1. Update ComfyUI and install the cu128 PyTorch wheel

Update to a recent ComfyUI release (the scaled-FP8 weights linked below note they require an up-to-date ComfyUI as of May 1, 2025). The RTX 5070 Ti is Blackwell (sm_120) — a newer GPU architecture that requires CUDA 12.8+ kernels. If you manage ComfyUI's Python environment yourself, install the cu128 PyTorch wheel:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

The cu128 wheel ships sm_120 kernels for ComfyUI's default attention backend (scaled_dot_product_attention, SDPA). You do not need FlashAttention-2 — FA2 sm_120 wheel coverage is still in progress at Dao-AILab/flash-attention#2168, and ComfyUI's native diffusion path does not depend on it.
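To confirm that the installed wheel actually carries sm_120 kernels, a short check like the following can help. It is a minimal sketch: `sm120_report` is a hypothetical helper name, and it assumes a compute capability of (12, 0) for the 5070 Ti, matching the sm_120 label used in this recipe. The PyTorch calls (`torch.version.cuda`, `torch.cuda.get_arch_list()`, `torch.cuda.get_device_capability()`) are standard.

```python
def sm120_report(cuda_build, arch_list, capability):
    """Return a list of problems that would keep this PyTorch build off sm_120."""
    problems = []
    if cuda_build is None:
        problems.append("CPU-only PyTorch build (torch.version.cuda is None)")
    else:
        major, minor = (int(part) for part in cuda_build.split(".")[:2])
        if (major, minor) < (12, 8):
            problems.append(f"built against CUDA {cuda_build}; cu128 or newer expected")
    if "sm_120" not in arch_list:
        problems.append("no sm_120 kernels listed by torch.cuda.get_arch_list()")
    # Assumption: an RTX 5070 Ti reports compute capability (12, 0).
    if capability is not None and tuple(capability) != (12, 0):
        problems.append(f"GPU reports compute capability {tuple(capability)}, not (12, 0)")
    return problems


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    available = torch.cuda.is_available()
    capability = torch.cuda.get_device_capability(0) if available else None
    problems = sm120_report(torch.version.cuda, torch.cuda.get_arch_list(), capability)
    print("looks ready for sm_120" if not problems else "\n".join(problems))


if __name__ == "__main__":
    main()
```

Run it with the same Python interpreter that launches ComfyUI; an empty problem list only means the build advertises sm_120 support, not that every custom node does.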

2. Download the Chroma1-Base (V48) scaled-FP8 checkpoint

The Blackwell-native lead path uses the scaled-FP8 V48 weights from Clybius/Chroma-fp8-scaled, an Apache 2.0 redistribution whose card lists base_model: lodestones/Chroma. The card describes it as a "high-precision variant" of Chroma, "utilizing the full dynamic range of FP8 (-448 to 448)" and leveraging that headroom "to maintain higher precision compared to standard FP8 safetensors" — i.e. a pre-quantized, scale-aware FP8 file, not an on-the-fly cast.

# Chroma1-Base (V48) scaled-FP8 e4m3fn checkpoint (9.19 GB) → diffusion_models folder
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/Clybius/Chroma-fp8-scaled/resolve/main/v48/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors

This file is the V48 lineage in scaled FP8 (9.19 GB on disk per the Clybius/Chroma-fp8-scaled Files tab). On Blackwell sm_120 the FP8 weights are consumed by native FP8 tensor cores rather than dequantized in software — that's the hardware-speed advantage the 5070 Ti has over Ampere-class cards.
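The "full dynamic range of FP8 (-448 to 448)" idea quoted above can be illustrated in a few lines of plain Python. This is a simplified sketch under stated assumptions: it uses one scale per tensor taken from the largest magnitude, whereas the Clybius filename suggests learned scales, so this is not a reconstruction of that file's method. `round_to_e4m3`, `direct_cast`, and `scaled_cast` are hypothetical helper names, and the sample weights are made up.

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude


def round_to_e4m3(x):
    """Round x to the nearest e4m3fn-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(mag)   # mag = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)     # -6 is the smallest normal exponent; below it is subnormal
    step = 2.0 ** (exp - 3)    # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def direct_cast(weights):
    """Cast each weight straight to FP8 with no scaling."""
    return [round_to_e4m3(w) for w in weights]


def scaled_cast(weights):
    """Stretch the tensor so its largest magnitude lands on 448, cast, then undo the scale."""
    scale = max(abs(w) for w in weights) / E4M3_MAX
    return [round_to_e4m3(w / scale) * scale for w in weights]


def max_relative_error(original, restored):
    return max(abs(o - r) / abs(o) for o, r in zip(original, restored))


# Made-up small weights, typical of values that sit near the bottom of FP8's range.
weights = [0.02, -0.0153, 0.0011, 0.0004]
print("direct cast error:", max_relative_error(weights, direct_cast(weights)))
print("scaled cast error:", max_relative_error(weights, scaled_cast(weights)))
```

In this toy case the unscaled cast flushes the smallest weight to zero, while the scaled cast keeps every value within FP8's normal rounding error. That is the headroom argument in miniature.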

3. Download the T5 XXL text encoder and FLUX VAE

Chroma1-Base uses the standard FLUX-ecosystem T5 XXL encoder and the FLUX VAE. On a 16 GB card, use the fp8 T5 variant to keep the encoder footprint down:

# T5 XXL (fp8 — keeps the encoder ~4.9 GB instead of the ~9.5 GB fp16 variant) → clip folder
wget -P ComfyUI/models/clip/ \
  https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp8_e4m3fn.safetensors

# FLUX VAE (ae.safetensors, 0.33 GB) → vae folder
wget -P ComfyUI/models/vae/ \
  https://huggingface.co/lodestones/Chroma/resolve/main/ae.safetensors

The T5 XXL and FLUX VAE assets are the ones the Chroma1-Base HF card "ComfyUI" section points to; the VAE still lives in the deprecated lodestones/Chroma repo, which is where the canonical card's link resolves.
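Before opening ComfyUI, it can be worth confirming that all three downloads landed in the right folders and are roughly the expected size. The sketch below is a hypothetical helper (`check_assets`), not part of ComfyUI. The sizes are the on-disk figures cited in this recipe, and the 10% tolerance is an assumption that absorbs GB-versus-GiB rounding on the Files tabs.

```python
from pathlib import Path

# Paths relative to the ComfyUI root, with the on-disk sizes (GB) cited in this recipe.
EXPECTED = {
    "models/diffusion_models/chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors": 9.19,
    "models/clip/t5xxl_fp8_e4m3fn.safetensors": 4.89,
    "models/vae/ae.safetensors": 0.33,
}


def check_assets(comfy_root, expected=EXPECTED, tolerance=0.10):
    """Report each expected file as ok, missing, or off-size (truncated download?)."""
    report = {}
    for rel_path, size_gb in expected.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
            continue
        actual_gb = path.stat().st_size / 1e9
        if abs(actual_gb - size_gb) / size_gb > tolerance:
            report[rel_path] = f"size differs: {actual_gb:.2f} GB, expected about {size_gb} GB"
        else:
            report[rel_path] = f"ok ({actual_gb:.2f} GB)"
    return report


if __name__ == "__main__":
    for name, status in check_assets("ComfyUI").items():
        print(f"{status:<45} {name}")
```

A "size differs" result usually points at an interrupted wget; re-running the download command from the relevant step is the simplest fix.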

4. Load the Chroma ComfyUI workflow

The Chroma ComfyUI workflow JSON ships in the deprecated lodestones/Chroma repo. Two files are present and both download cleanly: the simple text-to-image workflow simple_workflow.json and the larger ComfyUI_Chroma1-HD_T2I-workflow.json.

# Simple Chroma text-to-image workflow → drag onto the ComfyUI canvas
wget https://huggingface.co/lodestones/Chroma/resolve/main/simple_workflow.json

⚠️ Workflow filename note. The Chroma1-Base card's ComfyUI section links a workflow named ChromaSimpleWorkflow20250507.json, but that exact filename now returns 404 in the deprecated repo — the canonical file present in the repo is simple_workflow.json (verified live). Use the URL above, not the one printed on the card.

Drag the workflow onto the ComfyUI canvas, then:

  1. In the Load Diffusion Model node, select the chroma-unlocked-v48_float8_e4m3fn_scaled_learned_nodistill.safetensors file from step 2, and leave weight_dtype set to default (the Clybius card's usage note is "Load the model using Load Diffusion Model in ComfyUI" and "Set weight_dtype to default" — the file is already FP8, so no further cast is wanted).
  2. Confirm the text-encoder node points at t5xxl_fp8_e4m3fn.safetensors from step 3.
  3. Confirm the VAE loader points at ae.safetensors.
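The three checks above can also be done outside the UI by scanning the workflow JSON for model filenames. This is a sketch that assumes the file is a standard ComfyUI UI export, with a top-level "nodes" list whose entries carry "type" and "widgets_values"; `referenced_model_files` is a hypothetical helper name.

```python
import json

MODEL_SUFFIXES = (".safetensors", ".gguf")


def referenced_model_files(workflow):
    """Return (node type, filename) pairs for every model file a workflow references."""
    found = []
    for node in workflow.get("nodes", []):
        values = node.get("widgets_values") or []
        if isinstance(values, dict):  # some nodes store widget values as a mapping
            values = list(values.values())
        for value in values:
            if isinstance(value, str) and value.endswith(MODEL_SUFFIXES):
                found.append((node.get("type"), value))
    return found


# Usage, once simple_workflow.json has been downloaded:
#   with open("simple_workflow.json", encoding="utf-8") as handle:
#       for node_type, filename in referenced_model_files(json.load(handle)):
#           print(node_type, "->", filename)
```

Any filename printed here that does not match a file on disk is a node that needs repointing after the workflow is loaded.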

Running

Use a 1024×1024 latent for the first run (the resolution Chroma was trained at). The Chroma1-Base diffusers Quickstart on the HF card uses num_inference_steps=40 and guidance_scale=3.0 as a reasonable starting point; in ComfyUI, set the sampler's step count similarly (the example workflow defaults are a safe start).

Trigger: Queue Prompt
Output: PNG saved to ComfyUI/output/

The first generation pays a cold-load cost (weights → VRAM, text encoder → VRAM). Subsequent generations with the same model reuse the loaded weights.
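For repeated runs, the Queue Prompt step can be scripted against ComfyUI's HTTP endpoint instead of clicked. The sketch below makes two assumptions: ComfyUI is listening on its default 127.0.0.1:8188, and the workflow has been exported in API format rather than the UI format used for drag-and-drop. `build_queue_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_queue_request(api_workflow, host="127.0.0.1", port=8188):
    """Build the POST request that queues an API-format workflow on a ComfyUI server."""
    payload = json.dumps({"prompt": api_workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


# Usage, with ComfyUI running and an API-format export on disk
# (the filename is a placeholder):
#   with open("chroma_api_workflow.json", encoding="utf-8") as handle:
#       request = build_queue_request(json.load(handle))
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Because the model stays loaded between requests, only the first queued prompt pays the cold-load cost described above.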

Results

  • Speed: Omitted. No first-party generation-time measurement of Chroma1-Base on the RTX 5070 Ti exists. The only first-party speed thread in the Chroma family (Chroma1-HD discussion #25) measures the Chroma1-HD variant on a different card and configuration, so it is not quotable here. Nor can a number be borrowed from the 16 GB sibling recipes on other Blackwell cards: the RTX 5070 Ti has ~896 GB/s memory bandwidth and 8960 CUDA cores, versus the RTX 5080's ~960 GB/s and 10752 cores. The 5080 has 20% more cores (equivalently, the 5070 Ti has roughly 17% fewer), so if throughput scales with core count, a 5080 figure would overstate the 5070 Ti's diffusion (compute-bound) throughput by about 20%, which is outside the forward-statement threshold for compute-bound work. Once a community measurement lands via /contribute, the /check/chroma-v48/rtx-5070-ti endpoint will surface it.
  • VRAM usage: Plan for 14 GB+ peak on the scaled-FP8 path documented above. The FP8 V48 checkpoint is 9.19 GB on disk per the Clybius Files tab; the T5 XXL fp8 encoder (4.89 GB per the flux_text_encoders Files tab) and the FLUX VAE (0.33 GB) load alongside, for about 14.4 GB of weights before per-step activations and latents at 1024×1024. That lands inside the RTX 5070 Ti's 16 GB envelope with thin headroom; if you push resolution or batch size, see Troubleshooting. The canonical BF16 single-file checkpoint is 17.8 GB on disk (per the Chroma1-Base Files tab) and does not fit 16 GB; see "BF16 needs ~22 GB" in Troubleshooting. This envelope is derived from the cited on-disk sizes, not a measured runtime peak; once an empirical 5070 Ti number lands, /check/chroma-v48/rtx-5070-ti will replace it.
  • Quality notes: Chroma1-Base is a FLUX.1-Schnell de-distillation — it restores the multi-step diffusion behavior that Schnell distilled away, so it runs more like a FLUX.1-Dev-class model than a 4-step turbo. Don't expect Schnell-tier speed; the canonical Quickstart uses 40 steps.
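The arithmetic behind the Speed and VRAM bullets is easy to reproduce. The sketch below uses only figures quoted in this recipe; treating on-disk size as resident size is the same simplification the VRAM bullet makes, and the function names are hypothetical.

```python
# On-disk sizes in GB, as cited in this recipe.
WEIGHTS_GB = {"chroma_fp8": 9.19, "t5xxl_fp8": 4.89, "flux_vae": 0.33}
CARD_VRAM_GB = 16.0


def weights_total(sizes=WEIGHTS_GB):
    """Sum of the three files, assuming all stay resident at once."""
    return round(sum(sizes.values()), 2)


def headroom(vram_gb=CARD_VRAM_GB, sizes=WEIGHTS_GB):
    """What is left for activations, latents, and the CUDA context."""
    return round(vram_gb - sum(sizes.values()), 2)


def ratio(larger, smaller):
    return larger / smaller


print("weights:", weights_total(), "GB")
print("headroom:", headroom(), "GB")
print("5080 / 5070 Ti CUDA cores:", ratio(10752, 8960))
print("5080 / 5070 Ti bandwidth:", round(ratio(960, 896), 3))
```

The core ratio comes out at 1.2, which is why a 5080 timing cannot stand in for the 5070 Ti. The bandwidth gap is much smaller, at roughly 7%.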

For the full benchmark data, see /check/chroma-v48/rtx-5070-ti.

Troubleshooting

"v48", "Chroma1-Base", "Chroma1-HD", "Chroma1-Radiance" — which one is V48?

Per the lodestones/Chroma1-Base README, "Chroma1-Base is Chroma-v.48" — that's the literal V48. Chroma1-HD is a separate model in the same author's lineup (retrained from v.48) — adjacent lineage, not the same weights. Chroma1-Radiance is a separate output-head variant (no FLUX VAE, different decoder) — close cousin, distinct architecture. The deprecated lodestones/Chroma repo's chroma-unlocked-v48.safetensors is the original V48 weight file, but the canonical, currently-maintained V48 distribution is Chroma1-Base.

Noise artifacts when casting weights to FP8 on-the-fly (vs. the pre-scaled FP8 file)

There are two different "FP8 paths" and only one of them is the recommended lead path here. Community users on the close-cousin Chroma1-Radiance ComfyUI thread report noise when the model is cast to FP8 at load time, not when a pre-scaled FP8 file is used. User Shiny2480 writes: "I was launching with '--fp8_e5m2-unet'. After removing that argument the noise is gone and the images came." User bk227865 notes: "Did you change the weight_dtype in the model loader ? if i change it from default to fp8 it makes noise." Both describe on-the-fly casting (the --fp8_e5m2-unet launch flag, or flipping the loader weight_dtype to fp8). The Clybius scaled-FP8 file in step 2 is already quantized with scale-awareness, so load it with weight_dtype = default: leave the loader setting alone and do not pass --fp8_e5m2-unet. (These reports are from community users on the Chroma1-Radiance thread, not the Lodestone team, and concern Radiance/HD; treat them as adjacent-family guidance, not a Chroma1-Base measurement.)
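If ComfyUI is started from a script or a service file, a small guard can catch the launch flag before it causes trouble. This is a minimal sketch: `risky_flags` is a hypothetical helper, and the flag list contains only the flag named in the reports above, so extend it if your setup uses other cast-at-load options.

```python
# Only the flag named in the community reports above; extend as needed.
ON_THE_FLY_FP8_FLAGS = ("--fp8_e5m2-unet",)


def risky_flags(argv, flags=ON_THE_FLY_FP8_FLAGS):
    """Return any launch arguments that force an on-the-fly FP8 cast."""
    return [arg.strip() for arg in argv if arg.strip() in flags]


# Example launch command; the port flag is just a placeholder argument.
launch_args = ["python", "main.py", "--fp8_e5m2-unet", "--port", "8188"]
for flag in risky_flags(launch_args):
    print(f"remove {flag}: the scaled-FP8 file should load with weight_dtype = default")
```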

BF16 needs ~22 GB — it does not fit 16 GB

The canonical full-precision path is the BF16 single-file checkpoint (Chroma1-Base.safetensors, 17.8 GB on disk per the Chroma1-Base Files tab). With the T5 XXL fp16 encoder (~9.5 GB) and FLUX VAE on top, the BF16 diffusers path runs at roughly 22–24 GB peak with enable_model_cpu_offload(). That is why the sibling recipes for cards with 24 GB or more (RTX 4090, RTX 3090, RTX 5090) use BF16 and the 16 GB 5070 Ti does not. Stay on the scaled-FP8 (or GGUF) path on this card.

Want maximum in-family quality instead of raw FP8 speed? Use GGUF Q8_0

If you prefer a quality-first quant over the Blackwell-native FP8 speed path, the silveroxides/Chroma1-Base-GGUF repository ships per-quant-tier files (sizes verbatim from the Files tab): Q8_0 9.74 GB, Q6_K 7.65 GB, Q5_K_M 6.65 GB, Q4_K_M 5.57 GB. Q8_0 is generally close to BF16 in the FLUX-family quantization literature and fits 16 GB with the T5 fp8 encoder + VAE, though with slightly less headroom than the scaled-FP8 file (9.74 GB versus 9.19 GB). Install the ComfyUI-GGUF custom node (git clone into ComfyUI/custom_nodes, then pip install -r requirements.txt), restart ComfyUI, drop the .gguf into ComfyUI/models/diffusion_models/, and swap the Load Diffusion Model node for the Unet Loader (GGUF) node pointing at it.

Out-of-memory at high resolution or batch size > 1

The scaled-FP8 path leaves only thin headroom on a 16 GB card. If 1280×1280 or larger, or batch_size > 1, pushes the 5070 Ti to OOM, drop to a smaller GGUF tier (Q6_K 7.65 GB or Q5_K_M 6.65 GB, per the sizes listed in the GGUF section above) to free a few GB, or keep resolution at 1024×1024. The HF diffusers Chroma docs note that Chroma can use all the same memory optimizations as Flux (VAE tiling/slicing) if you run the diffusers path instead of ComfyUI.
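How much each GGUF tier frees relative to the 9.19 GB scaled-FP8 checkpoint follows directly from the file sizes quoted in this recipe. The sketch below assumes on-disk size approximates resident size; `freed_vs_fp8` is a hypothetical helper name.

```python
FP8_CHECKPOINT_GB = 9.19
GGUF_TIERS_GB = {"Q8_0": 9.74, "Q6_K": 7.65, "Q5_K_M": 6.65, "Q4_K_M": 5.57}


def freed_vs_fp8(tiers=GGUF_TIERS_GB, baseline=FP8_CHECKPOINT_GB):
    """GB saved by swapping the FP8 checkpoint for each GGUF tier (negative = larger)."""
    return {tier: round(baseline - size, 2) for tier, size in tiers.items()}


for tier, saved in freed_vs_fp8().items():
    verdict = "frees" if saved > 0 else "costs"
    print(f"{tier:<7} {verdict} {abs(saved):.2f} GB versus scaled FP8")
```

Q8_0 is slightly larger than the FP8 file, so it is a quality option rather than an OOM fix. Q6_K is the first tier that actually frees memory.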

Quality regressions from acceleration LoRAs

On the close-cousin Chroma1-Radiance thread, community user Seeker36087 reports needing acceleration LoRAs "which then impart unwanted styles and compromises to the image and seem to negatively affect prompt adherence" on the Radiance/HD variants. Establish a quality baseline with the unaccelerated FP8 (or GGUF) path before stacking acceleration LoRAs. (Adjacent-family community testimony, not a Chroma1-Base citation.)

If your specific issue isn't covered above, please report it via the submission form so the next reader benefits.