HiDream-O1-Image on RTX 4090: 2048×2048 Text-to-Image with BF16 in ComfyUI

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image on the RTX 4090, running the canonical BF16 path at the full 2048×2048 native resolution. With 24 GB of VRAM, the 4090 unlocks the BF16/FP16 precision tier that 16 GB consumer cards have to skip — leaving headroom for the Gemma 4 text encoder, activations, and ComfyUI buffers at peak resolution.

Hardware data: RTX 4090 (24 GB VRAM) · BF16 runs at ~17–20 GB peak per the drbaph BF16 model card — "24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues" · See benchmark data

Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT): pixels, text, and task conditions share a single token space, "without external VAEs or disjoint text encoders" per the canonical card. The ComfyUI runtime targeted by this recipe nevertheless loads Gemma 4 E4B as a separate text-encoder file (gemma4_e4b_it_bf16.safetensors, per the official Comfy tutorial). The PyTorch 2.9.x warning has a different origin: a Qwen3-VL-related dependency in the HiDream stack hits a Conv3d latency regression in PyTorch 2.9.x, reported through Qwen3-VL inference paths (see the QwenLM/Qwen3-VL#1811 comment thread).

Requirements

Component	Minimum	Tested
GPU	20 GB VRAM (BF16/FP16) per the drbaph BF16 README and the Saganaki22 README	RTX 4090 (24 GB, Ada Lovelace, sm_89)
RAM	16 GB system RAM	—
Storage	BF16 diffusion checkpoint 16.4 GB (`hidream_o1_image_bf16.safetensors` on Comfy-Org) + Gemma 4 BF16 text encoder 16.0 GB (`gemma4_e4b_it_bf16.safetensors` on Comfy-Org) = 32.4 GB on disk (budget ~33 GB)	—
Software	Updated ComfyUI (native HiDream-O1 templates), `transformers` 4.57.1–5.3 per the Saganaki22 README	—
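
The storage row is simple arithmetic, and it can be worth checking against free disk space before starting two ~16 GB downloads. A minimal sketch, assuming the file sizes quoted in the table; the helper names `required_gb` and `has_room` and the 10% safety margin are illustrative choices, not part of any tool:

```python
import shutil

# File sizes in GB as quoted in the requirements table above.
MODEL_SIZES_GB = {
    "hidream_o1_image_bf16.safetensors": 16.4,
    "gemma4_e4b_it_bf16.safetensors": 16.0,
}


def required_gb(sizes=MODEL_SIZES_GB, margin=0.10):
    """Total download size plus a safety margin (the margin is an assumption)."""
    return sum(sizes.values()) * (1.0 + margin)


def has_room(path=".", sizes=MODEL_SIZES_GB, margin=0.10):
    """True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(sizes, margin)


if __name__ == "__main__":
    print(f"raw total: {sum(MODEL_SIZES_GB.values()):.1f} GB")
    print(f"with margin: {required_gb():.1f} GB, fits: {has_room('.')}")
```

Pass the directory that will hold ComfyUI/models as `path` so the check runs against the right filesystem.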

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path has shipped since the May 2026 ComfyUI release and uses the Comfy-Org repackaged mirror; no custom node is required.

1. Update ComfyUI

Use the ComfyUI Manager's update flow, or pull from the upstream repo:

cd ComfyUI
git pull
python -m pip install -r requirements.txt

The official tutorial notes: "Make sure your ComfyUI is updated" — the HiDream O1 workflow template only appears once the build includes the native loader.

2. Download the BF16 checkpoint and Gemma 4 BF16 text encoder

# Diffusion checkpoint (BF16, full precision — 16.4 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_bf16.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, BF16 — 16.0 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_bf16.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. The official ComfyUI tutorial lists hidream_o1_image_bf16.safetensors as "Full bf16 precision (largest)" — the highest-quality checkpoint available in the workflow.
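
Before launching ComfyUI, a quick check that both files landed where the loaders look can save a confusing "model not found" round trip. The sketch below assumes the on-disk layout implied by the two download commands above (the in-repo path appended to ComfyUI/models/); the helper name `missing_models` is hypothetical:

```python
from pathlib import Path

# Relative paths implied by the two download commands above
# (--local-dir ComfyUI/models/ plus the in-repo file path).
EXPECTED_FILES = [
    "checkpoints/hidream_o1_image_bf16.safetensors",
    "text_encoders/gemma4_e4b_it_bf16.safetensors",
]


def missing_models(models_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under models_dir."""
    root = Path(models_dir)
    missing = []
    for rel in expected:
        f = root / rel
        if not f.is_file() or f.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_models("ComfyUI/models")
    print("all model files present" if not gaps else f"missing: {gaps}")
```

This only confirms presence and a non-zero size; it does not validate checksums or detect a partially downloaded file.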

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training or the reasoning-prompt-agent UI, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

The RTX 4090 is Ada Lovelace (sm_89) — the default pip install torch already includes sm_89 kernels, so no special wheel selection is required. FlashAttention-2 has full sm_89 kernel coverage; flash, sdpa, and sage attention backends all work in the node's loader.

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:

CheckpointLoaderSimple → hidream_o1_image_bf16.safetensors
Text encoder → gemma4_e4b_it_bf16.safetensors
Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py (no ComfyUI):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048
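
To run several prompts through the same CLI, one option is a small Python wrapper. This is a sketch that uses only the flags shown in the command above; the function names and the output file naming are assumptions. Passing the arguments as a list avoids shell-quoting problems with prompts that contain apostrophes or quotes:

```python
import subprocess
import sys


def build_inference_cmd(model_path, prompt, output_image, size=2048):
    """Build the argv list for inference.py using only the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--model_path", str(model_path),
        "--prompt", prompt,
        "--output_image", str(output_image),
        "--height", str(size),
        "--width", str(size),
    ]


def run_prompts(model_path, prompts, out_dir="results"):
    """Run inference.py once per prompt; call this from the repo root."""
    for i, prompt in enumerate(prompts):
        cmd = build_inference_cmd(model_path, prompt, f"{out_dir}/t2i_{i:03d}.png")
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Each call starts a fresh process, so the model is reloaded per prompt; that is acceptable for a handful of images but wasteful for large batches.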

Results

Speed: Not quoted. No cited source benchmarks HiDream-O1-Image on the RTX 4090 (or any Ada Lovelace GPU) on the BF16 path; the model card and redistributor cards document precision and memory only, not inference latency. Measured numbers will appear at /check/hidream-o1-image/rtx-4090 once a community benchmark is submitted.
VRAM usage: ~17–20 GB peak with the BF16 model at 2048×2048. The drbaph BF16 card states verbatim: "A GPU with at least 20 GB VRAM is recommended for comfortable use at full 2048 × 2048 resolution. 24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues." The Saganaki22 ComfyUI README independently corroborates "Full BF16: ~18–20 GB" and "Full FP16: ~18–20 GB", with the FP8 variant at "~10–11 GB" for users who want a smaller-card fallback. The BF16 file size on disk is 16.4 GB (hidream_o1_image_bf16.safetensors Files tab on Comfy-Org) — the runtime peak adds the Gemma 4 E4B text encoder pass and activation buffers. See /check/hidream-o1-image/rtx-4090 for live data once benchmarked.
Quality notes: HiDream-O1 specializes in long-text rendering (0.978 on LongText-Bench-EN and 0.979 on LongText-Bench-ZH, per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation, which is the meaning of the "-O1" suffix. The model debuted at #8 in the Artificial Analysis Text to Image Arena (2026-05-05).

For the full benchmark data, see /check/hidream-o1-image/rtx-4090.

Troubleshooting

PyTorch 2.9.x crashes on first inference

The Saganaki22/HiDream_O1-ComfyUI README and the official HF card both flag this: "PyTorch 2.9.x is not recommended due to the issue" — a Qwen3-VL backbone incompatibility (tracked at QwenLM/Qwen3-VL#1811). Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. This is architecture-agnostic and applies to the RTX 4090 the same way as to any other card.
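
A quick way to see whether the active environment falls in the flagged series is to parse the installed version string. A minimal sketch; the helper name `is_flagged_torch` is hypothetical, and it checks only the major.minor pair, ignoring local build suffixes such as `+cu128`:

```python
import re


def is_flagged_torch(version):
    """True if `version` is a PyTorch 2.9.x release (the series flagged above)."""
    m = re.match(r"^(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    return (major, minor) == (2, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        state = "flagged (2.9.x)" if is_flagged_torch(torch.__version__) else "ok"
        print(f"torch {torch.__version__}: {state}")
```

Run it with the same interpreter that launches ComfyUI, since a virtual environment may carry a different PyTorch build than the system Python.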

Out of memory at 2048×2048 with BF16

The 4090 has ~4 GB of headroom above the top of the cited BF16 runtime envelope (~18–20 GB peak vs 24 GB hardware), but a desktop session driving a display and ComfyUI's intermediate buffers can eat into that. Two cited mitigations:

Drop the text encoder to FP8. Swap gemma4_e4b_it_bf16.safetensors (16.0 GB on disk) for gemma4_e4b_it_fp8_scaled.safetensors (9.06 GB on disk) — the text encoder pass dominates the peak window. The official ComfyUI tutorial lists this as an alternative.
Drop the diffusion checkpoint to FP8. Swap hidream_o1_image_bf16.safetensors (16.4 GB) for hidream_o1_image_fp8_scaled.safetensors (8.07 GB on disk); the drbaph FP8 card places the FP8 footprint at "~10 GB". Per the same card, "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated". The 4090 is in that hardware-accelerated tier, so the swap trades a small quality loss for lower memory use and accelerated FP8 compute.

If neither fix is enough, drop the resolution to 1536×1536 (the loader snaps to valid internal resolutions). The Dev variant, hidream_o1_image_dev_bf16.safetensors, is also 16.4 GB, so it does not shrink the weight footprint; it uses 28 inference steps instead of 50, which shortens each run rather than lowering peak memory.
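
The fallback order above can be written down as a small decision helper. This is a sketch under stated assumptions: it uses the upper bounds of the peak ranges quoted in this recipe ("~18–20 GB" for BF16, "~10–11 GB" for FP8), and both the 1 GB reserve and the function name `pick_checkpoint` are illustrative, not measured or official values:

```python
# Upper bounds of the peak-VRAM ranges quoted in this recipe (GB),
# ordered from largest to smallest checkpoint.
PEAK_VRAM_GB = [
    ("hidream_o1_image_bf16.safetensors", 20.0),        # "~18–20 GB"
    ("hidream_o1_image_fp8_scaled.safetensors", 11.0),  # "~10–11 GB"
]


def pick_checkpoint(free_vram_gb, reserve_gb=1.0):
    """Return the largest checkpoint whose quoted peak fits, or None.

    reserve_gb is an assumed cushion for the desktop and ComfyUI buffers.
    """
    budget = free_vram_gb - reserve_gb
    for name, peak in PEAK_VRAM_GB:
        if peak <= budget:
            return name
    return None
```

With PyTorch installed, free VRAM can be read from `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device; divide the first value by 1e9 before passing it in.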

`flash-attn` install fails

Not a problem in the ComfyUI-native path: CheckpointLoaderSimple does not require flash-attn. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False". Otherwise, per the card, inference will fail to import the kernel.

MXFP8 weights look faster but don't load

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors. The official ComfyUI tutorial describes this as the "MXFP8 quantized variant" with speedup on hardware that supports fp8/mxfp8 matmuls on safe MLP layers. MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5060 Ti, sm_120) — the RTX 4090 is Ada Lovelace (sm_89) and lacks the native mxfp8 datapath. Use hidream_o1_image_bf16.safetensors (this recipe's default) or hidream_o1_image_fp8_scaled.safetensors on the 4090 instead.
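
The hardware split described above can be expressed as a check on CUDA compute capability. A minimal sketch: the thresholds encode only this recipe's claims (FP8 acceleration on Ada Lovelace sm_89 and Hopper sm_90, MXFP8 on consumer Blackwell sm_120), and treating every newer capability as also supported is an assumption. The function names are hypothetical, and datacenter Blackwell parts are not covered because the text above does not discuss them:

```python
def supports_fp8(compute_capability):
    """FP8 matmul acceleration: sm_89 (Ada) and sm_90 (Hopper) per the recipe.

    Assumption: any capability at or above (8, 9) is treated as supported.
    """
    return tuple(compute_capability) >= (8, 9)


def supports_mxfp8(compute_capability):
    """MXFP8 matmul acceleration: consumer Blackwell (sm_120) per the recipe.

    Assumption: only capabilities at or above (12, 0) are treated as supported.
    """
    return tuple(compute_capability) >= (12, 0)


def recommend_checkpoint(compute_capability):
    """Pick the MXFP8 file only where the recipe says it is accelerated."""
    if supports_mxfp8(compute_capability):
        return "hidream_o1_image_mxfp8.safetensors"
    return "hidream_o1_image_bf16.safetensors"  # this recipe's default
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair to pass in; on an RTX 4090 that is `(8, 9)`, which selects the BF16 default.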