HiDream-O1-Image on RTX 4060 Ti 16GB: 2048×2048 Text-to-Image with FP8 in ComfyUI

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent, generating up to 2048×2048 natively. The FP8 variant fits in roughly 10 GB of VRAM, leaving ~6 GB of headroom on the 16 GB 4060 Ti for activations, the Gemma 4 text encoder, and ComfyUI overhead.

Hardware data: RTX 4060 Ti 16GB · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data

Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT): pixels, text, and task conditions share a single token space, "without external VAEs or disjoint text encoders" per the canonical card. The ComfyUI runtime targeted by this recipe nonetheless loads Gemma 4 E4B as a separate text-encoder file (gemma4_e4b_it_fp8_scaled.safetensors, per the official Comfy tutorial). The PyTorch 2.9.x warning is unrelated to the text encoder: it stems from a Qwen3-VL-related dependency inside the HiDream stack, which hits a Conv3d latency regression in PyTorch 2.9.x that surfaces through Qwen3-VL inference paths (see the QwenLM/Qwen3-VL#1811 comment thread).

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (FP8) — BF16/FP16 needs ~18–20 GB per the Saganaki22 ComfyUI README	RTX 4060 Ti 16GB (Ada Lovelace, sm_89)
RAM	16 GB system RAM	—
Storage	See file listings on the Comfy-Org HiDream-O1-Image mirror and Comfy-Org gemma-4 text-encoder mirror for current sizes	—
Software	Updated ComfyUI (native HiDream-O1 templates), `transformers` 4.57.1–5.3 per the Saganaki22 README	—
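
The `transformers` range in the table can be checked before launching anything. The snippet below is a minimal sketch, not part of the upstream tooling: the helper names are hypothetical, and reading "4.57.1–5.3" as inclusive of every 5.3.x patch release is an assumption.

```python
import re
from importlib import metadata

# Bounds copied from the Requirements table. Treating the upper bound as
# "any 5.3.x patch release is fine" is an assumption, not a documented rule.
MIN_TRANSFORMERS = (4, 57, 1)
MAX_TRANSFORMERS_MINOR = (5, 3)


def parse_version(text):
    """Return the leading numeric components, e.g. '4.57.1.dev0' -> (4, 57, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def transformers_in_range(version_text):
    """True if the version falls inside the range quoted in the table."""
    version = parse_version(version_text)
    return MIN_TRANSFORMERS <= version and version[:2] <= MAX_TRANSFORMERS_MINOR


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
        print(installed, "in range:", transformers_in_range(installed))
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
```

Run it inside the same Python environment that ComfyUI uses, since a system-wide interpreter may report a different version.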

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path has shipped since the May 2026 ComfyUI release and uses the Comfy-Org repackaged mirror; no custom node is required.

1. Update ComfyUI

Use the ComfyUI Manager's update flow, or pull from the upstream repo:

cd ComfyUI
git pull
python -m pip install -r requirements.txt

The official tutorial notes: "Make sure your ComfyUI is updated" — the HiDream O1 workflow template only appears once the build includes the native loader.

2. Download the FP8 checkpoint and Gemma 4 text encoder

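# Run from the directory that contains ComfyUI/ (run `cd ..` first if you are still inside it after step 1)
# Diffusion checkpoint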
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, FP8-scaled)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it.
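
Because `--local-dir` preserves the repo-relative path, the two downloads above should land under `ComfyUI/models/checkpoints/` and `ComfyUI/models/text_encoders/`. The following sketch checks that; the helper name is hypothetical and the default `ComfyUI/models` location is an assumption about where you ran the commands.

```python
from pathlib import Path

# Repo-relative paths taken from the two download commands in this step.
EXPECTED_FILES = (
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors",
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors",
)


def missing_model_files(models_dir):
    """Return the expected files that are not present under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the current directory is the parent of ComfyUI/.
    missing = missing_model_files("ComfyUI/models")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Both model files are in place.")
```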

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training or the reasoning-prompt-agent UI, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

This is the same node the 5060 Ti sibling recipe uses. Blackwell GPUs (sm_120) sometimes need a cu128 PyTorch build to get matching kernels. The RTX 4060 Ti is Ada Lovelace (sm_89), which the default pip install torch build already supports, so no special wheel selection is required.
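
To confirm that an installed PyTorch build can run on the card, compare the build's architecture list against the device's compute capability. This is a minimal sketch with a hypothetical helper name. It applies CUDA's binary-compatibility rule (a kernel built for sm_X.y runs on a device with the same major version and an equal or higher minor version), ignores PTX JIT fallback, and skips architecture-specific entries with a letter suffix.

```python
import re


def build_supports(arch_list, capability):
    """True if any precompiled sm_XY entry can run on the given device."""
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)", arch)
        if match is None:
            continue  # skips PTX entries (compute_XY) and suffixed variants
        number = int(match.group(1))
        # sm_89 -> major 8, minor 9; sm_120 -> major 12, minor 0
        if number // 10 == major and number % 10 <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        archs = torch.cuda.get_arch_list()
        cap = torch.cuda.get_device_capability(0)
        print(f"device sm_{cap[0]}{cap[1]}, build {archs}:", build_supports(archs, cap))
```

Under this rule a build that lists sm_86 but not sm_89 still counts as usable on the 4060 Ti, while a build without any sm_12x entry does not cover a Blackwell card.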

Running

Launch ComfyUI as usual, from inside the ComfyUI directory:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:

CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py (no ComfyUI):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048
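
If you script the CLI run from Python, building the argument list explicitly avoids shell-quoting problems with prompts that contain apostrophes. The sketch below uses only the flags shown in the command above; the function name is hypothetical, and it prints the command rather than executing it.

```python
import shlex


def build_inference_cmd(model_path, prompt, output_image, height=2048, width=2048):
    """Assemble the argv list for inference.py using the flags from this recipe."""
    return [
        "python", "inference.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--output_image", output_image,
        "--height", str(height),
        "--width", str(width),
    ]


if __name__ == "__main__":
    cmd = build_inference_cmd(
        "/path/to/HiDream-O1-Image",
        "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'",
        "results/t2i.png",
    )
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```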

Results

Speed: Not quoted. No cited source benchmarks HiDream-O1-Image on the RTX 4060 Ti or any other Ada Lovelace GPU. Per the drbaph FP8 card, Ada Lovelace (RTX 40xx, including the 4060 Ti) sits in the FP8 hardware-accelerated tier alongside Hopper: "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated." The "small speed penalty" caveat applies only to pre-Ada cards (Ampere and earlier), where weights are dequantized on the fly. This is plain FP8 (E4M3/E5M2), distinct from block-scaled MXFP8, which is Blackwell-only and is covered in Troubleshooting below. Empirical numbers will land at /check/hidream-o1-image/rtx-4060-ti-16gb once a community benchmark seeds the backend.
VRAM usage: ~10 GB peak with the FP8 model. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." The RTX 4070 named here is the same Ada Lovelace generation as the 4060 Ti 16GB, so the citation transfers cleanly within the architecture. The Saganaki22 ComfyUI README independently corroborates "Full FP8: ~10–11 GB" and "BF16/FP16: ~18–20 GB". See /check/hidream-o1-image/rtx-4060-ti-16gb for live data once benchmarked.
Quality notes: HiDream-O1 specializes in long-text rendering (0.978 on LongText-Bench-EN/ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix.
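
The headroom arithmetic behind the VRAM figures can be written out as a small sketch. The footprint ranges are the ones quoted above from the Saganaki22 README; the function name is hypothetical, and the result covers model weights only, not activations or desktop overhead.

```python
# Footprint ranges in GB as quoted in this section (low, high).
FOOTPRINT_GB = {
    "fp8": (10, 11),
    "bf16/fp16": (18, 20),
}


def headroom_gb(total_vram_gb, variant):
    """Return (worst_case, best_case) VRAM left after loading the variant."""
    low, high = FOOTPRINT_GB[variant]
    return (total_vram_gb - high, total_vram_gb - low)


if __name__ == "__main__":
    for variant in FOOTPRINT_GB:
        worst, best = headroom_gb(16, variant)
        print(f"{variant}: {worst} to {best} GB left on a 16 GB card")
```

A negative result, as with BF16/FP16 on 16 GB, means the weights alone exceed the card's memory.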

For the full benchmark data, see /check/hidream-o1-image/rtx-4060-ti-16gb.

Troubleshooting

PyTorch 2.9.x crashes on first inference

The Saganaki22/HiDream_O1-ComfyUI README and the official HF card both flag this: "PyTorch 2.9.x is not recommended due to the issue", referring to a Qwen3-VL-related incompatibility tracked at QwenLM/Qwen3-VL#1811. Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. The issue is architecture-agnostic, so it applies equally to the 4060 Ti.
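
A quick way to see whether the active environment is on the flagged release line is to inspect the version string. This is a minimal sketch with a hypothetical helper name; it only matches the 2.9.x series named above and says nothing about whether other versions are otherwise suitable.

```python
def torch_is_flagged(version_text):
    """True for any 2.9.x version string, e.g. '2.9.1+cu128'."""
    # Drop the local build tag ('+cu128') before splitting on dots.
    parts = version_text.split("+")[0].split(".")
    return parts[:2] == ["2", "9"]


if __name__ == "__main__":
    try:
        import torch
        print(torch.__version__, "flagged:", torch_is_flagged(torch.__version__))
    except ImportError:
        print("torch is not installed in this environment")
```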

Out of memory at 2048×2048

The 4060 Ti 16GB has ~6 GB of headroom above the cited FP8 footprint, but a desktop session (the display server shares the same VRAM) and ComfyUI's intermediate buffers can eat into it. If a run exceeds 16 GB with a memory-hungry sampler, drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally) or switch to the Dev variant: same memory footprint, 28 inference steps instead of 50, downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror.
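
The two fallbacks trade different things, which a little arithmetic makes concrete. The sketch below uses hypothetical helper names. Treating pixel count as a rough proxy for activation memory is an assumption, not a measured relationship for this model; the step ratio reflects run length only, since the Dev variant is stated to have the same memory footprint.

```python
def pixel_ratio(new_side, old_side):
    """Fraction of pixels kept when a square image shrinks from old_side to new_side."""
    return (new_side * new_side) / (old_side * old_side)


def step_ratio(new_steps, old_steps):
    """Fraction of sampling steps kept when switching step counts."""
    return new_steps / old_steps


if __name__ == "__main__":
    print("1536 vs 2048 pixels kept:", pixel_ratio(1536, 2048))
    print("Dev vs Full steps kept:", step_ratio(28, 50))
```

Both options cut their respective quantity by roughly 44%, but only the resolution drop reduces the number of pixels the model has to hold at once.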

`flash-attn` install fails

Not a problem in the ComfyUI-native path: CheckpointLoaderSimple does not require flash-attn. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change `"use_flash_attn": True` to `"use_flash_attn": False` — otherwise inference will fail to import the kernel."
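
Before deciding whether the pipeline.py edit is needed, you can check if the package is importable at all. This is a small sketch with a hypothetical function name; `flash_attn` is the import name of the flash-attn package, and the check only confirms the package is discoverable, not that its compiled kernels match your GPU.

```python
import importlib.util


def flash_attn_available():
    """True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if flash_attn_available():
        print("flash-attn found; the default setting should import.")
    else:
        print('flash-attn not found; set "use_flash_attn": False as described above.')
```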

MXFP8 weights look faster but don't load

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors. The official ComfyUI tutorial describes this as "MXFP8 quantized variant" with speedup on hardware that supports fp8/mxfp8 matmuls on safe MLP layers. MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5060 Ti, etc., sm_120) — the RTX 4060 Ti is Ada Lovelace (sm_89) and lacks the native mxfp8 datapath. Use the plain fp8_scaled checkpoint on the 4060 Ti instead.
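
The choice between the two checkpoints can be expressed as a lookup on compute capability. This sketch follows the recipe's own scoping, so treating only major version 12 (sm_120) as MXFP8-capable is an assumption drawn from the paragraph above rather than a complete hardware matrix. The function name is hypothetical.

```python
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"
MXFP8 = "hidream_o1_image_mxfp8.safetensors"


def pick_checkpoint(capability):
    """Choose a checkpoint file from a (major, minor) compute capability."""
    major, _minor = capability
    # Assumption from this recipe: only sm_120-class Blackwell gets MXFP8.
    return MXFP8 if major == 12 else FP8_SCALED


if __name__ == "__main__":
    print("RTX 4060 Ti (sm_89):", pick_checkpoint((8, 9)))
    print("RTX 5060 Ti (sm_120):", pick_checkpoint((12, 0)))
```

On a live system the tuple can come from `torch.cuda.get_device_capability()`.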