What You'll Build
A local text-to-image and instruction-edit pipeline for HiDream-O1-Image on the RTX 4090, running the canonical BF16 path at the full 2048×2048 native resolution. With 24 GB of VRAM, the 4090 unlocks the BF16/FP16 precision tier that 16 GB consumer cards have to skip — leaving headroom for the Gemma 4 text encoder, activations, and ComfyUI buffers at peak resolution.
Hardware data: RTX 4090 (24 GB VRAM) · BF16 runs at ~17–20 GB peak per the drbaph BF16 model card — "24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues" · See benchmark data
Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT): pixels, text, and task conditions share a single token space, "without external VAEs or disjoint text encoders" per the canonical card. The ComfyUI runtime targeted by this recipe uses Gemma 4 E4B as the workflow's external text encoder (gemma4_e4b_it_bf16.safetensors, per the official Comfy tutorial). The PyTorch 2.9.x incompatibility flag comes from a different axis: an internal Qwen3-VL-tagged library dependency in the HiDream stack hits a Conv3d latency regression in PT 2.9.x, surfaced via Qwen3-VL inference paths (see the QwenLM/Qwen3-VL#1811 comment thread).
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 20 GB VRAM (BF16/FP16) per the drbaph BF16 README and the Saganaki22 README | RTX 4090 (24 GB, Ada Lovelace, sm_89) |
| RAM | 16 GB system RAM | — |
| Storage | BF16 diffusion checkpoint 16.4 GB (hidream_o1_image_bf16.safetensors on Comfy-Org) + Gemma 4 BF16 text encoder 16.0 GB (gemma4_e4b_it_bf16.safetensors on Comfy-Org) = 32.4 GB on disk | — |
| Software | Updated ComfyUI (native HiDream-O1 templates), transformers 4.57.1–5.3 per the Saganaki22 README | — |
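The storage row can be checked before downloading anything. The following is a minimal sketch, assuming the two file sizes quoted in the table; the helper names and the 5 GB safety margin are illustrative choices, not documented figures.

```python
import shutil

# On-disk sizes quoted in the Requirements table (decimal GB).
MODEL_FILES_GB = {
    "hidream_o1_image_bf16.safetensors": 16.4,
    "gemma4_e4b_it_bf16.safetensors": 16.0,
}


def required_disk_gb(files=MODEL_FILES_GB):
    """Sum the quoted on-disk sizes of the files this recipe downloads."""
    return round(sum(files.values()), 1)


def has_room(path=".", margin_gb=5.0):
    """True if `path` has space for the downloads plus a safety margin.

    margin_gb is an arbitrary illustrative buffer, not a documented figure.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_disk_gb() + margin_gb


print(f"Need about {required_disk_gb()} GB; enough free space here: {has_room()}")
```

Pass the drive that holds ComfyUI/models as `path` if it differs from the current directory.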
Installation
This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path has shipped since the May 2026 ComfyUI release and uses the Comfy-Org repackaged mirror; no custom node is required.
1. Update ComfyUI
Use the ComfyUI Manager's update flow, or pull from the upstream repo:
cd ComfyUI
git pull
python -m pip install -r requirements.txt
The official tutorial notes: "Make sure your ComfyUI is updated" — the HiDream O1 workflow template only appears once the build includes the native loader.
2. Download the BF16 checkpoint and Gemma 4 BF16 text encoder
# Diffusion checkpoint (BF16, full precision — 16.4 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
checkpoints/hidream_o1_image_bf16.safetensors \
--local-dir ComfyUI/models/
# Text encoder (Gemma 4 E4B, BF16 — 16.0 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
text_encoders/gemma4_e4b_it_bf16.safetensors \
--local-dir ComfyUI/models/
The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. The official ComfyUI tutorial lists hidream_o1_image_bf16.safetensors as "Full bf16 precision (largest)" — the highest-quality checkpoint available in the workflow.
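After both downloads finish, it may be worth confirming that the files landed where ComfyUI looks for them. This is a small sketch, assuming the relative paths implied by the two commands above; `missing_models` is a hypothetical helper, not part of ComfyUI.

```python
from pathlib import Path

# Relative locations implied by the two download commands above.
EXPECTED = [
    "checkpoints/hidream_o1_image_bf16.safetensors",
    "text_encoders/gemma4_e4b_it_bf16.safetensors",
]


def missing_models(models_dir="ComfyUI/models"):
    """Return the expected model files that are absent or empty."""
    root = Path(models_dir)
    missing = []
    for rel in EXPECTED:
        candidate = root / rel
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gone = missing_models()
    print("All model files present" if not gone else f"Missing: {gone}")
```

The check only looks at presence and non-zero size; it does not validate checksums.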
3. (Optional) Install the Saganaki22 custom node
The native path is enough for image generation. If you also want LoRA training or the reasoning-prompt-agent UI, install the community node:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt
The RTX 4090 is Ada Lovelace (sm_89) — the default pip install torch already includes sm_89 kernels, so no special wheel selection is required. FlashAttention-2 has full sm_89 kernel coverage; flash, sdpa, and sage attention backends all work in the node's loader.
Running
Launch ComfyUI as usual:
python main.py
In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:
- CheckpointLoaderSimple → hidream_o1_image_bf16.safetensors
- Text encoder → gemma4_e4b_it_bf16.safetensors
- Default sampling: 50 inference steps at up to 2048×2048
For a CLI run via the official inference.py (no ComfyUI):
python inference.py \
--model_path /path/to/HiDream-O1-Image \
--prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
--output_image results/t2i.png \
--height 2048 \
--width 2048
Results
- Speed: Not quoted — no cited source benchmarks HiDream-O1-Image on the RTX 4090 (or any Ada Lovelace GPU) at a known throughput on the BF16 path. The model card and redistributor cards document precision/memory only, not inference latency. Empirical numbers will land at /check/hidream-o1-image/rtx-4090 once a community benchmark seeds the backend.
- VRAM usage: ~17–20 GB peak with the BF16 model at 2048×2048. The drbaph BF16 card states verbatim: "A GPU with at least 20 GB VRAM is recommended for comfortable use at full 2048 × 2048 resolution. 24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues." The Saganaki22 ComfyUI README independently corroborates "Full BF16: ~18–20 GB" and "Full FP16: ~18–20 GB", with the FP8 variant at "~10–11 GB" for users who want a smaller-card fallback. The BF16 file size on disk is 16.4 GB (hidream_o1_image_bf16.safetensors, Files tab on Comfy-Org); the runtime peak adds the Gemma 4 E4B text encoder pass and activation buffers. See /check/hidream-o1-image/rtx-4090 for live data once benchmarked.
- Quality notes: HiDream-O1 specializes in long-text rendering (0.978 on LongText-Bench-EN and 0.979 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation; that's the meaning of the "-O1" suffix. The model debuted at #8 in the Artificial Analysis Text to Image Arena (2026-05-05).
For the full benchmark data, see /check/hidream-o1-image/rtx-4090.
Troubleshooting
PyTorch 2.9.x crashes on first inference
The Saganaki22/HiDream_O1-ComfyUI README and the official HF card both flag this: "PyTorch 2.9.x is not recommended due to the issue". The issue in question is a Qwen3-VL incompatibility tracked at QwenLM/Qwen3-VL#1811. Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. The problem is architecture-agnostic and applies to the RTX 4090 the same way as to any other card.
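A quick way to see whether an environment falls in the flagged range is to parse the version string. This is a minimal sketch; `torch_version_ok` is a hypothetical helper, and it only encodes the "avoid 2.9.x" rule stated here, nothing about which later release carries a fix.

```python
def torch_version_ok(version: str) -> bool:
    """Return False for any 2.9.x release, True otherwise.

    Accepts strings shaped like torch.__version__, e.g. "2.9.1+cu128".
    """
    release = version.split("+", 1)[0]  # drop the local build tag
    parts = release.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) != (2, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        ok = torch_version_ok(torch.__version__)
        print(torch.__version__, "->", "ok" if ok else "2.9.x: not recommended")
```

Running it inside the ComfyUI virtual environment reports on the interpreter that will actually load the model.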
Out of memory at 2048×2048 with BF16
The 4090 has about 4 GB of headroom above the top of the cited BF16 runtime envelope (~18–20 GB peak vs 24 GB of VRAM), but a desktop session driving a display and ComfyUI's intermediate buffers can eat into that. Two cited mitigations:
- Drop the text encoder to FP8. Swap gemma4_e4b_it_bf16.safetensors (16.0 GB on disk) for gemma4_e4b_it_fp8_scaled.safetensors (9.06 GB on disk); the text encoder pass dominates the peak window. The official ComfyUI tutorial lists this as an alternative.
- Drop the diffusion checkpoint to FP8. Swap hidream_o1_image_bf16.safetensors (16.4 GB on disk) for hidream_o1_image_fp8_scaled.safetensors (8.07 GB on disk); the drbaph FP8 card places the FP8 footprint at "~10 GB". Per the same card, "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated". The 4090 is in that hardware-accelerated tier, so this is a low-cost speed/memory trade with minimal quality loss.
If neither fix is enough, drop the resolution to 1536×1536 (the loader snaps to valid internal resolutions) or switch to the Dev variant — hidream_o1_image_dev_bf16.safetensors is 16.4 GB and uses 28 inference steps instead of 50.
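The fallback order above can be written down as a simple lookup against free VRAM. The sketch below assumes the upper ends of the peak ranges quoted in this recipe (20 GB for BF16, 11 GB for FP8); `pick_by_free_vram` is a hypothetical helper, and real peaks depend on resolution and on whatever else is using the GPU.

```python
# Upper ends of the peak-VRAM ranges quoted in this recipe (GB).
PEAK_GB = {
    "hidream_o1_image_bf16.safetensors": 20.0,        # "~18–20 GB"
    "hidream_o1_image_fp8_scaled.safetensors": 11.0,  # "~10–11 GB"
}


def pick_by_free_vram(free_gb):
    """Return the first checkpoint whose quoted peak fits in free_gb, else None."""
    for name, peak in PEAK_GB.items():  # insertion order: BF16 first, then FP8
        if free_gb >= peak:
            return name
    return None


if __name__ == "__main__":
    try:
        import torch
        free_bytes, _total = torch.cuda.mem_get_info()  # needs a CUDA device
        print(pick_by_free_vram(free_bytes / 1024**3))
    except Exception as exc:  # no torch or no GPU: use the nominal 24 GB
        print(f"Could not query the GPU ({exc}); nominal 24 GB ->",
              pick_by_free_vram(24.0))
```

A `None` result means neither quoted envelope fits, which is the point at which lowering the resolution becomes the remaining option.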
flash-attn install fails
Not a problem in the ComfyUI-native path: CheckpointLoaderSimple does not require flash-attn. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False"; otherwise inference will fail to import the kernel.
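Before editing the pipeline file, it can help to confirm whether the package is visible to the interpreter at all. A minimal sketch using only the standard library follows; `flash_attn_available` is a hypothetical helper, and finding the package does not prove that its compiled kernels load on this GPU.

```python
import importlib.util


def flash_attn_available() -> bool:
    """True if the flash_attn package can be located in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if flash_attn_available():
        print("flash-attn found: the CLI default can stay as it is")
    else:
        print('flash-attn missing: set "use_flash_attn": False in models/pipeline.py')
```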
MXFP8 weights look faster but don't load
The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors. The official ComfyUI tutorial describes this as the "MXFP8 quantized variant" with speedup on hardware that supports fp8/mxfp8 matmuls on safe MLP layers. MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5060 Ti, sm_120) — the RTX 4090 is Ada Lovelace (sm_89) and lacks the native mxfp8 datapath. Use hidream_o1_image_bf16.safetensors (this recipe's default) or hidream_o1_image_fp8_scaled.safetensors on the 4090 instead.
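The choice between the MXFP8 and BF16 files can be keyed on the GPU's compute capability. The sketch below encodes only the two capabilities named in this recipe (sm_89 and sm_120); `checkpoint_for` is a hypothetical helper, and treating every other capability as "use BF16" is an assumption made for illustration.

```python
# Compute capabilities named in this recipe.
ADA_SM89 = (8, 9)          # RTX 4090
BLACKWELL_SM120 = (12, 0)  # RTX 5090 / 5080 / 5060 Ti


def checkpoint_for(capability):
    """Map a (major, minor) compute capability to a checkpoint file name."""
    if tuple(capability) == BLACKWELL_SM120:
        return "hidream_o1_image_mxfp8.safetensors"
    # Assumption: anything else falls back to this recipe's BF16 default.
    return "hidream_o1_image_bf16.safetensors"


if __name__ == "__main__":
    print("RTX 4090 ->", checkpoint_for(ADA_SM89))
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.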