
HiDream-O1-Image on RTX 3060: 2048×2048 Text-to-Image with FP8 in ComfyUI

image · intermediate · 10 GB+ VRAM · Jun 14, 2026

This intermediate recipe sets up HiDream-O1-Image on the RTX 3060, needing about 10 GB of VRAM.

prerequisites
  • NVIDIA RTX 3060 (12 GB VRAM, Ampere, sm_86) or any 12 GB+ consumer GPU
  • Python 3.10+ and an updated ComfyUI install (a build that ships the native HiDream-O1 templates)
  • PyTorch built against CUDA 12.x — avoid PyTorch 2.9.x (upstream-flagged issue)

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent, generating up to 2048×2048 natively. The full BF16 path needs 17–20 GB and won't fit the RTX 3060, so this recipe leads with the FP8 checkpoint, which fits in the ~10 GB range per the drbaph FP8 card. That card names 12 GB cards explicitly, so the 3060 is a tight-but-workable fit once you account for display overhead.

Hardware data: RTX 3060 (12 GB VRAM, Ampere sm_86) · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data

12 GB is the design target — but watch display headroom. The drbaph FP8 card states the FP8 model "fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." and was "Tested on 12 GB cards at 2048 × 2048 resolution." — the RTX 3060 is a 12 GB card in exactly that bracket. On a 12 GB desktop card with a monitor attached, the OS display server already consumes ~0.7–1.5 GB, leaving roughly ~10.5–11.3 GB usable (a headless Linux box gets closer to ~11.6 GB). The FP8 peak (~10–11 GB, see Results) lands inside that envelope but with much less slack than the 16 GB+ cards have — if you hit OOM on aggressive samplers, see Troubleshooting for the resolution and display-offload escape hatches.
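The headroom arithmetic above can be sketched in a few lines. This is a minimal illustration that only reuses the estimates quoted in this recipe (12 GB total, 0.7–1.5 GB display overhead, ~10–11 GB FP8 peak); the function names are hypothetical and nothing here measures a real GPU.

```python
# Figures taken from the text above; treat them as estimates, not measurements.
TOTAL_VRAM_GB = 12.0
DISPLAY_OVERHEAD_GB = (0.7, 1.5)   # light vs. heavy desktop session
FP8_PEAK_GB = (10.0, 11.0)         # ~10 GB (drbaph card) to ~11 GB (Saganaki22 table)


def usable_vram(total_gb: float, display_overhead_gb: float) -> float:
    """VRAM left for ComfyUI after the display server takes its share."""
    return total_gb - display_overhead_gb


def headroom(total_gb: float, display_overhead_gb: float, model_peak_gb: float) -> float:
    """Positive means the model peak fits; negative means an OOM is likely."""
    return usable_vram(total_gb, display_overhead_gb) - model_peak_gb


if __name__ == "__main__":
    for overhead in DISPLAY_OVERHEAD_GB:
        for peak in FP8_PEAK_GB:
            slack = headroom(TOTAL_VRAM_GB, overhead, peak)
            print(f"display {overhead:.1f} GB, peak {peak:.1f} GB -> slack {slack:+.1f} GB")
```

With these inputs, three of the four corners leave positive slack, while the heaviest display session combined with the highest peak estimate comes out slightly negative. That worst corner is the case the display-offload and resolution mitigations in Troubleshooting are meant for.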

Architecture note — FP8 on Ampere is a memory trick, not a speed-up. The RTX 3060 sits on Ampere (sm_86), which lacks the hardware FP8 datapath of Ada Lovelace / Hopper. Per the drbaph FP8 model card: "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated. On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." The 3060 is squarely in the "older GPUs / dequantized on-the-fly" bucket — the FP8 weights load at the smaller footprint (so the VRAM saving is real), but at compute time the runtime dequantizes them to BF16 per-op. You get the fit; you do not get the FP8 acceleration an Ada card enjoys. Use FP8 here because it's the only path that fits 12 GB, not because it's faster.
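To see which bucket a given card falls into, a small check on the CUDA compute capability is enough. The sketch below assumes Ada Lovelace reports sm_89 and Hopper sm_90, so anything at or above (8, 9) is treated as having a hardware FP8 datapath. The helper `fp8_mode` is hypothetical, and it describes the hardware only, not what a particular ComfyUI build actually does at runtime.

```python
def fp8_mode(major: int, minor: int) -> str:
    """Classify how FP8 weights are handled for a CUDA compute capability.

    Assumption: Ada Lovelace is sm_89 and Hopper is sm_90, so anything at or
    above (8, 9) is treated as having a hardware FP8 datapath.
    """
    if (major, minor) >= (8, 9):
        return "hardware-fp8"
    return "dequantize-on-the-fly"


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"sm_{major}{minor}: {fp8_mode(major, minor)}")
    else:
        # Fall back to the card this recipe targets.
        print("RTX 3060 (sm_86):", fp8_mode(8, 6))
```

On an RTX 3060, `torch.cuda.get_device_capability(0)` should report `(8, 6)`, which lands in the dequantize-on-the-fly branch.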

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (FP8); BF16/FP16 needs 17–20 GB per the drbaph FP8 card and ~18–20 GB per the Saganaki22 README | RTX 3060 (12 GB, Ampere, sm_86, 360 GB/s)
RAM | 16 GB system RAM |
Storage | FP8 diffusion checkpoint ~8.07 GB + Gemma 4 E4B FP8-scaled text encoder ~9.06 GB ≈ ~17 GB on disk (verify current sizes on the Comfy-Org HiDream-O1-Image mirror and the Comfy-Org gemma-4 mirror) |
Software | Updated ComfyUI (native HiDream-O1 templates), transformers 4.57.1–5.3 per the Saganaki22 README |

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path uses the Comfy-Org repackaged mirror; no custom node is required for image generation.

1. Update ComfyUI

The RTX 3060 is Ampere (sm_86). The default pip install torch (the standard cu124/CUDA-12.x wheel) already includes sm_86 kernels — Ampere has had full mainline CUDA / cuBLAS coverage since 2021, so no special wheel selection is required, unlike the Blackwell sm_120 cards (RTX 50-series) that need the cu128 build. Use the ComfyUI Manager's update flow, or pull from upstream:

cd ComfyUI
git pull
python -m pip install -r requirements.txt

The HiDream O1 workflow template only appears in Browse Templates once the build includes the native loader — make sure ComfyUI is fully updated before looking for it. Avoid PyTorch 2.9.x (see Troubleshooting).

2. Download the FP8 checkpoint and Gemma 4 text encoder

# Diffusion checkpoint (FP8-scaled — ~8.07 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, FP8-scaled — ~9.06 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. ComfyUI loads the text encoder for the prompt-encode pass and the diffusion checkpoint for sampling — they are not both fully resident at peak, which is why the FP8 runtime peak stays near ~10 GB rather than the ~17 GB on-disk sum. That distinction is what makes 12 GB workable: the on-disk total alone would not fit, but the runtime peak does.
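Before launching ComfyUI, it can help to confirm that both files landed where the download commands above put them. The following is a minimal sketch: the relative paths mirror those commands, the sizes are the on-disk figures quoted in this recipe, and the helper names are hypothetical.

```python
from pathlib import Path

# Paths mirror the two download commands above; sizes (GB) are the on-disk
# figures quoted in this recipe and may drift as the mirrors are updated.
EXPECTED_FILES = {
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors": 8.07,
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors": 9.06,
}


def missing_files(models_dir: str) -> list[str]:
    """Return the expected files that are not present under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


def disk_total_gb() -> float:
    """Sum of the quoted on-disk sizes (not the runtime VRAM peak)."""
    return sum(EXPECTED_FILES.values())


if __name__ == "__main__":
    print(f"expected on disk: ~{disk_total_gb():.2f} GB")
    for rel in missing_files("ComfyUI/models"):
        print("missing:", rel)
```

The ~17 GB total is a storage figure only; as noted above, the two files are not both fully resident at peak, so it does not predict VRAM use.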

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training, the reasoning-prompt-agent UI, or the self-contained single-file FP8 loader, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

This node loads the drbaph/HiDream-O1-Image-FP8 single-file FP8 weights (~8.81 GB) directly. The RTX 3060 is Ampere (sm_86) and the default pip install torch already includes sm_86 kernels; prebuilt FlashAttention-2 wheels also cover sm_86, so attention "just works" — no Blackwell-style cu128 wheel or sdpa fallback is needed. The loader's flash, sdpa, and sage attention backends all run on Ampere.

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load the "HiDream O1 Full: Image generation" template, per the official ComfyUI tutorial. The bundled template wires:

  • CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
  • Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
  • Default sampling: 50 inference steps at up to 2048×2048

The first run loads the FP8 weights into memory, so expect a noticeable cold-start delay before the first sampling step; subsequent runs reuse the loaded model. On Ampere, the on-the-fly dequantize-to-BF16 happens inside each compute op, so per-step time will also be slower than the same FP8 workflow on an Ada card.

Results

  • Speed: Not quoted — no cited source benchmarks HiDream-O1-Image on the RTX 3060 (or any Ampere GPU) at a known throughput, and the backend has no benchmark for this pair yet (/check/hidream-o1-image/rtx-3060 returns verdict: unknown). Quoting a figure from a faster card would overstate it twice over: the 3060 has only 3584 CUDA cores and 360 GB/s memory bandwidth — far below any Ada/Blackwell 12 GB card — and on Ampere the FP8 path carries the dequantize-on-the-fly penalty (see the Architecture note). If you measure throughput on a 3060, please contribute it via the submission form so it lands at /check/hidream-o1-image/rtx-3060.
  • VRAM usage: ~10–11 GB peak with the FP8 model — which fits the RTX 3060's 12 GB, but tightly. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." and that it was "Tested on 12 GB cards at 2048 × 2048 resolution." The Saganaki22 ComfyUI README independently corroborates the per-precision footprints in its loader table: Full FP8 at ~10–11 GB versus Full BF16/FP16 at ~18–20 GB. On the 12 GB RTX 3060 with a display attached (~10.5–11.3 GB usable), the FP8 footprint leaves only a thin margin for sampler buffers and ComfyUI overhead — comfortable on a headless box, snug with a monitor. See /check/hidream-o1-image/rtx-3060 for live data once benchmarked.
  • Quality notes: HiDream-O1 is an 8B model (open-sourced as "HiDream-O1-Image (8B)" per the canonical card) that specializes in long-text rendering (0.979 on LongText-Bench-EN, 0.978 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix. None of this is hardware-dependent; outputs are identical across GPUs at the same seed/workflow and precision.
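If you want to check the VRAM figure on your own card, one option is to sample `nvidia-smi` while a generation is running. The sketch below assumes `nvidia-smi` is on the PATH and uses its CSV query output; the helper names are hypothetical, and a single sample shows current usage, not the peak, so you would need to sample repeatedly during sampling to approximate the peak.

```python
import subprocess


def parse_memory_csv(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory(index: int = 0) -> tuple[int, int]:
    """Return (used_mib, total_mib) for one GPU by calling nvidia-smi."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(out.strip().splitlines()[0])


if __name__ == "__main__":
    try:
        used, total = query_gpu_memory(0)
        print(f"GPU 0: {used} MiB used of {total} MiB")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("could not query nvidia-smi:", exc)
```

Note that `nvidia-smi` reports usage for the whole device, so the number includes the display server and any other process, which is exactly the overhead discussed above.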

For the full benchmark data, see /check/hidream-o1-image/rtx-3060.

Troubleshooting

Out of memory at 2048×2048 (the 12 GB squeeze)

This is the failure mode most likely to bite on the RTX 3060, because the FP8 footprint (~10–11 GB) and the usable VRAM on a display-attached 12 GB card (~10.5–11.3 GB) are close. Mitigations, in order:

  • Run the display off the iGPU or a second card so the full 12 GB is available to ComfyUI — this is the single biggest win and brings the card back to the ~11.6 GB usable a headless box sees.
  • Drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally), which lowers the activation footprint.
  • Switch to the Dev variant — same FP8 memory footprint, 28 inference steps instead of 50, downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror.
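To put rough numbers on the last two mitigations, the ratios can be computed directly. This is a back-of-the-envelope sketch: it assumes the activation footprint scales roughly with pixel count, which is a simplification and not a measured property of this model, and the helper names are hypothetical.

```python
def pixel_ratio(new_side: int, old_side: int = 2048) -> float:
    """Fraction of pixels left when moving from old_side² to new_side²."""
    return (new_side * new_side) / (old_side * old_side)


def step_ratio(new_steps: int, old_steps: int = 50) -> float:
    """Fraction of sampling steps left when switching step counts."""
    return new_steps / old_steps


if __name__ == "__main__":
    print(f"1536x1536 vs 2048x2048: {pixel_ratio(1536):.4f} of the pixels")
    print(f"28 vs 50 steps: {step_ratio(28):.2f} of the steps")
```

Dropping to 1536×1536 leaves about 56% of the pixels, which is where any activation saving would come from. The Dev variant's 28 steps are likewise about 56% of 50, but the text above says its memory footprint is the same, so that ratio speaks to run time per image, not to peak VRAM.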

PyTorch 2.9.x crashes on first inference

The canonical HiDream-ai model card flags this in its May 13, 2026 update note: PyTorch 2.9.x is not recommended due to an upstream issue, linked from the card to QwenLM/Qwen3-VL#1811 — an upstream Qwen3-VL-tagged dependency issue (Gemma 4 E4B is the workflow's text encoder; "Qwen3-VL" appears only as a library-internal compatibility tag, not as a model this workflow loads). The Saganaki22 README confirms it independently and notes the node logs a warning when it detects a 2.9.x build. Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. This is architecture-agnostic and applies equally to the RTX 3060.
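A quick pre-flight check can catch a flagged build before the first inference. The sketch below is a hypothetical helper, not the warning logic the Saganaki22 node uses; it simply treats any version string starting with 2.9 as flagged.

```python
import re


def is_flagged_torch(version: str) -> bool:
    """True for any 2.9.x build string such as '2.9.1+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) == (2, 9)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to read the installed version
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if is_flagged_torch(torch.__version__):
            print(f"PyTorch {torch.__version__} is a 2.9.x build; pin 2.8.x or 2.10+")
        else:
            print(f"PyTorch {torch.__version__} is outside the flagged 2.9.x range")
```

Comparing the parsed (major, minor) pair instead of the raw string keeps 2.10 from being mistaken for a 2.9-adjacent build.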

BF16/FP16 weights won't fit 12 GB

The full-precision BF16 and FP16 checkpoints need 17–20 GB per the drbaph FP8 card VRAM table (~18–20 GB per the Saganaki22 loader table) — the hidream_o1_image_bf16.safetensors weights alone are ~16.4 GB on disk per the Comfy-Org mirror. That path does not fit the 12 GB RTX 3060. Stay on the FP8-scaled checkpoint, which is the documented-fitting path on this card; if you need bit-exact BF16 numerics, you need a ≥20 GB card (e.g. an RTX 3090 — the drbaph BF16 card recommends a GPU with at least 20 GB for the full BF16 path).
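The on-disk sizes line up with simple bytes-per-parameter arithmetic. The sketch below assumes a round 8 billion parameters (the card says "8B"; the exact count is not given here) and ignores non-weight data such as quantization scales, so it is an estimate of the weights only.

```python
PARAMS = 8e9  # "8B" per the model card; the exact parameter count is an assumption

BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp8": 1}


def weight_gb(params: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) at a given precision."""
    return params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in ("bf16", "fp8"):
        print(f"{precision}: ~{weight_gb(PARAMS, precision):.1f} GB of weights")
```

The estimates of ~16 GB for BF16 and ~8 GB for FP8 sit close to the ~16.4 GB and ~8.07 GB files on the Comfy-Org mirror. The BF16 weights alone already exceed 12 GB before any activations are counted, which is why that path is ruled out on this card.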

flash-attn install fails

Not a problem in the ComfyUI-native path — CheckpointLoaderSimple does not require flash-attn. On the RTX 3060 (Ampere sm_86), prebuilt FlashAttention-2 wheels do ship sm_86 kernels, so FA2 installs cleanly if you want it. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround for environments without flash-attn: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False" — otherwise inference will fail to import the kernel.
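If you are unsure whether flash-attn is importable in your environment, you can probe for it without triggering the import. This is a minimal sketch with a hypothetical helper; the backend names "flash" and "sdpa" follow the loader options mentioned earlier, but how a given loader actually selects its backend is not shown here.

```python
import importlib.util


def pick_attention_backend(module: str = "flash_attn") -> str:
    """Return 'flash' if the given module is importable, otherwise 'sdpa'."""
    if importlib.util.find_spec(module) is not None:
        return "flash"
    return "sdpa"


if __name__ == "__main__":
    print("attention backend to request:", pick_attention_backend())
```

`importlib.util.find_spec` only locates the package, so a wheel that is installed but built for the wrong CUDA or PyTorch version could still fail at import time.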

MXFP8 weights look faster but don't load

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors (~8.92 GB). The official ComfyUI tutorial describes this MXFP8 quantized variant as using fp8/mxfp8 matmuls on safe MLP layers for a speedup on supported hardware. MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5070, etc., sm_120) — the RTX 3060 is Ampere (sm_86) and lacks both the native FP8 datapath (Ada/Hopper feature) and the MXFP8 datapath (Blackwell feature). Use the plain fp8_scaled checkpoint on the RTX 3060 instead; it's the path that fits 12 GB.
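The choice between the two files can be expressed as a small lookup on compute capability. The sketch below is hypothetical and deliberately conservative: it treats only sm_120, the consumer Blackwell cards named above, as MXFP8-capable and makes no claim about other Blackwell parts.

```python
def pick_checkpoint(major: int, minor: int) -> str:
    """Pick a checkpoint filename from the Comfy-Org mirror by compute capability.

    Assumption: only sm_120 (consumer Blackwell) is treated as MXFP8-capable,
    matching the cards named above; everything else gets the fp8_scaled file.
    """
    if (major, minor) == (12, 0):
        return "hidream_o1_image_mxfp8.safetensors"
    return "hidream_o1_image_fp8_scaled.safetensors"


if __name__ == "__main__":
    print("RTX 3060 (sm_86):", pick_checkpoint(8, 6))
```

For the RTX 3060 this resolves to the fp8_scaled file, the same checkpoint downloaded in the Installation section.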

common questions
How much VRAM does HiDream-O1-Image need?

About 10 GB with the FP8 checkpoint (roughly 10–11 GB at peak), which is the minimum this recipe targets. The full BF16/FP16 path needs 17–20 GB.

Which GPUs is HiDream-O1-Image tested on?

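This recipe targets the RTX 3060 (12 GB). The drbaph FP8 card reports testing on 12 GB cards at 2048 × 2048; no RTX 3060 benchmark is on record yet.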

How hard is this setup?

Intermediate — follow the steps above.