What You'll Build
A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent, generating up to 2048×2048 natively. The FP8 variant fits in the ~10 GB range per the drbaph FP8 card, leaving comfortable headroom on the 16 GB RTX 5080 for activations, the Gemma 4 text encoder, and ComfyUI overhead.
Hardware data: RTX 5080 (16 GB VRAM, Blackwell sm_120) · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data
Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT). Per the canonical model card, it is built "without external VAEs or disjoint text encoders," encoding raw pixels, text, and task conditions in a single shared token space. (The ComfyUI runtime this recipe installs still loads a separate Gemma 4 E4B text encoder for prompt conditioning: gemma4_e4b_it_fp8_scaled.safetensors, per the official Comfy tutorial.) The advice to avoid PyTorch 2.9.x has a separate cause: an upstream Qwen3-VL-tagged dependency that hits a regression on 2.9.x (tracked at QwenLM/Qwen3-VL#1811).
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (FP8); BF16/FP16 needs 17–20 GB per the drbaph FP8 card (the Saganaki22 README lists ~18–20 GB) | RTX 5080 (16 GB, Blackwell, sm_120) |
| RAM | 16 GB system RAM | — |
| Storage | FP8 diffusion checkpoint ~8.07 GB + Gemma 4 E4B FP8-scaled text encoder ~9.06 GB ≈ ~17 GB on disk (verify current sizes on the Comfy-Org HiDream-O1-Image mirror and the Comfy-Org gemma-4 mirror) | — |
| Software | Updated ComfyUI (native HiDream-O1 templates), transformers 4.57.1–5.3 per the Saganaki22 README | — |
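The transformers range in the table can be checked before installing anything else. The following is a minimal sketch, not part of any cited project: the helper names are hypothetical, and treating 5.3.x patch releases as in range is an assumption, since the README only gives the two endpoints.

```python
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '4.57.1' or '2.10.0+cu128' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or '0rc1'
        parts.append(int(piece))
    return tuple(parts)


def transformers_in_range(version: str, low: str = "4.57.1", high: str = "5.3") -> bool:
    """True when low <= version <= high, compared on numeric components.

    Assumption: any 5.3.x patch release counts as inside the upper bound.
    """
    v = parse_version(version)
    lo, hi = parse_version(low), parse_version(high)
    return lo <= v and v[: len(hi)] <= hi


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        verdict = "ok" if transformers_in_range(installed) else "outside 4.57.1 to 5.3"
        print(f"transformers {installed}: {verdict}")
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would sort "4.9" after "4.57".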
Installation
This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path uses the Comfy-Org repackaged mirror; no custom node is required for image generation.
1. Update ComfyUI and use the cu128 PyTorch wheel
The RTX 5080 is Blackwell (sm_120), and PyTorch's cu128 wheels include sm_120 kernels. Make sure your environment uses a cu128 build, and avoid PyTorch 2.9.x (see Troubleshooting). Use the ComfyUI Manager's update flow, or pull from upstream:
cd ComfyUI
git pull
python -m pip install -r requirements.txt
# RTX 5080 is Blackwell (sm_120); the cu128 wheel ships sm_120 kernels.
python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
    "torch>=2.10" torchvision torchaudio
The official tutorial notes: "Make sure your ComfyUI is updated." — the HiDream O1 workflow template only appears once the build includes the native loader.
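After the install, it may be worth confirming that the wheel you ended up with actually contains sm_120 kernels. The sketch below uses `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`; the helper names are hypothetical, and the script degrades to a message if torch or CUDA is absent.

```python
def has_sm120(arch_list: list[str]) -> bool:
    """True when the wheel's compiled architecture list names sm_120."""
    return "sm_120" in arch_list


def describe_gpu_support() -> str:
    """Summarize whether the installed torch build can target the local GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA is not available"
    major, minor = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    status = "present" if has_sm120(archs) else "MISSING"
    return (
        f"torch {torch.__version__}, device capability sm_{major}{minor}, "
        f"sm_120 kernels {status} in {archs}"
    )


if __name__ == "__main__":
    print(describe_gpu_support())
```

If the kernels are reported missing, the environment most likely picked up a wheel from a different CUDA index, and rerunning the cu128 install is the first thing to try.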
2. Download the FP8 checkpoint and Gemma 4 text encoder
# Diffusion checkpoint (FP8-scaled — ~8.07 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
checkpoints/hidream_o1_image_fp8_scaled.safetensors \
--local-dir ComfyUI/models/
# Text encoder (Gemma 4 E4B, FP8-scaled — ~9.06 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
--local-dir ComfyUI/models/
The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. ComfyUI loads the text encoder for the prompt-encode pass and the diffusion checkpoint for sampling — they are not both fully resident at peak, which is why the FP8 runtime peak stays near ~10 GB rather than the ~17 GB on-disk sum.
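A truncated download is an easy failure to miss, so a quick size check against the figures quoted above can help. This is a sketch under assumptions: it expects the files to land under ComfyUI/models/ at the repo-relative paths used in the commands, and the 5% tolerance is an arbitrary choice.

```python
from pathlib import Path

# Expected sizes in decimal GB, as quoted in this recipe.
EXPECTED_GB = {
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors": 8.07,
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors": 9.06,
}


def size_gb(path: Path) -> float:
    """File size in decimal gigabytes."""
    return path.stat().st_size / 1e9


def check_downloads(models_dir: Path, tolerance: float = 0.05) -> dict[str, str]:
    """Map each expected file to 'ok', 'missing', or a size-mismatch note."""
    report = {}
    for rel_path, expected in EXPECTED_GB.items():
        path = models_dir / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
            continue
        actual = size_gb(path)
        if abs(actual - expected) <= tolerance * expected:
            report[rel_path] = "ok"
        else:
            report[rel_path] = f"size {actual:.2f} GB, expected ~{expected} GB"
    return report


if __name__ == "__main__":
    for name, status in check_downloads(Path("ComfyUI/models")).items():
        print(f"{name}: {status}")
    print(f"expected total on disk: ~{sum(EXPECTED_GB.values()):.2f} GB")
```

Run it from the directory that contains ComfyUI/. A size mismatch is only a hint, since the mirrors may have re-uploaded the files at a different size.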
3. (Optional) Install the Saganaki22 custom node
The native path is enough for image generation. If you also want LoRA training, the reasoning-prompt-agent UI, or the self-contained single-file FP8 loader, install the community node:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt
This node loads the drbaph/HiDream-O1-Image-FP8 single-file FP8 weights (~8.81 GB) directly. Because the RTX 5080 is Blackwell (sm_120), keep the cu128 PyTorch wheel from step 1.
Running
Launch ComfyUI as usual:
python main.py
In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:
- CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
- Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
- Default sampling: 50 inference steps at up to 2048×2048
For a CLI run via the official inference.py (no ComfyUI; loads the full-precision canonical weights, not the FP8 mirror):
python inference.py \
--model_path /path/to/HiDream-O1-Image \
--prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
--output_image results/t2i.png \
--height 2048 \
--width 2048
The first run has to load the model weights into memory, so expect a noticeable cold-start delay before the first sampling step; within a running ComfyUI session, subsequent generations reuse the loaded model.
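To run the CLI over several prompts, the invocation above can be wrapped in a small script. This is a sketch that uses only the flags shown in the command above; the function names and the output naming scheme are hypothetical, and it assumes you run it from the directory that contains inference.py.

```python
import subprocess
import sys


def build_inference_command(model_path, prompt, output_image, height=2048, width=2048):
    """Assemble the argv list for inference.py using only the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--model_path", str(model_path),
        "--prompt", prompt,
        "--output_image", str(output_image),
        "--height", str(height),
        "--width", str(width),
    ]


def run_batch(model_path, prompts, out_dir="results"):
    """Run one generation per prompt; raises on the first failing run."""
    for index, prompt in enumerate(prompts):
        output = f"{out_dir}/t2i_{index:03d}.png"
        subprocess.run(build_inference_command(model_path, prompt, output), check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain apostrophes. Each call starts a fresh process, so every prompt pays the model-loading delay again.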
Results
- Speed: Not quoted. No cited source benchmarks HiDream-O1-Image on the RTX 5080 at a known throughput, and the backend has no benchmark for this pair yet. The RTX 5080's memory bandwidth (~960 GB/s) is more than double that of the 16 GB RTX 5060 Ti tier (~448 GB/s), so memory-bound stages (text encoding, VAE-free decode) should be materially faster than on the smaller Blackwell card, and the 5080's larger compute budget should help the compute-bound diffusion steps as well. Extrapolating a precise figure from the bandwidth gap alone would be a fabrication. Empirical numbers will land at /check/hidream-o1-image/rtx-5080 once a community benchmark seeds the backend; please contribute one if you run this.
- VRAM usage: ~10 GB peak with the FP8 model. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." The Saganaki22 ComfyUI README independently corroborates "Full FP8 | ~10–11 GB" (and BF16/FP16 at "~18–20 GB"). On the 16 GB RTX 5080 that leaves ~5–6 GB of headroom above the FP8 footprint for sampler buffers and ComfyUI overhead. See /check/hidream-o1-image/rtx-5080 for live data once benchmarked.
- Quality notes: HiDream-O1 specializes in long-text rendering (0.978 on LongText-Bench-EN, 0.979 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix.
For the full benchmark data, see /check/hidream-o1-image/rtx-5080.
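The headroom figures in the VRAM bullet are simple subtraction, and writing them out makes it easy to redo the estimate for a different card. The footprints below are the quoted ones; the function name is hypothetical, and real peaks will vary with sampler and resolution.

```python
CARD_VRAM_GB = 16.0  # RTX 5080

# Peak-VRAM figures as quoted in the Results list above.
FOOTPRINTS_GB = {
    "FP8, drbaph card": 10.0,
    "FP8, Saganaki22 upper bound": 11.0,
    "BF16/FP16, Saganaki22 lower bound": 18.0,
}


def headroom_gb(footprint_gb: float, card_gb: float = CARD_VRAM_GB) -> float:
    """Positive means spare VRAM; negative means the model does not fit."""
    return card_gb - footprint_gb


if __name__ == "__main__":
    for label, footprint in FOOTPRINTS_GB.items():
        print(f"{label}: {headroom_gb(footprint):+.1f} GB")
```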
Troubleshooting
PyTorch 2.9.x crashes on first inference
The canonical HiDream-ai model card flags this in its May 13, 2026 release notes: PyTorch 2.9.x is not recommended due to an upstream issue, linked from the card to QwenLM/Qwen3-VL#1811 — a Qwen3-VL-backbone regression. The Saganaki22 README confirms it independently and notes the node logs a warning when it detects 2.9.x. Pin PyTorch to 2.8.x or 2.10+ (the cu128 torch>=2.10 install in step 1 satisfies this on Blackwell). This is architecture-agnostic and applies equally to the 5080.
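A startup guard can catch a 2.9.x install before the first inference does. The following is a minimal sketch with a hypothetical helper name; it does not reproduce the warning logic of the Saganaki22 node.

```python
import re


def torch_version_allowed(version: str) -> bool:
    """False for any 2.9.x build (including local tags like +cu128), True otherwise."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) != (2, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        if torch_version_allowed(torch.__version__):
            print(f"torch {torch.__version__}: ok")
        else:
            print(f"torch {torch.__version__}: 2.9.x detected, switch to 2.8.x or 2.10+")
```

The check parses the major and minor numbers as integers on purpose: compared as strings, "2.10" sorts before "2.9", which would misclassify the recommended release.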
Out of memory at 2048×2048
The 16 GB RTX 5080 has ~5–6 GB of headroom above the cited FP8 footprint, but desktop display-server overhead and ComfyUI's intermediate buffers can eat into that envelope. If the FP8 model spikes past 16 GB on aggressive samplers, drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally). The Dev variant, downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror, has the same FP8 memory footprint, so it is unlikely to lower peak VRAM on its own; its benefit is 28 inference steps instead of 50, which shortens each retry.
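To see how much the two fallbacks change, compare pixel counts and step counts directly. The arithmetic is exact, but reading the pixel ratio as a proxy for activation memory is an assumption: it presumes per-image buffers scale roughly with pixel count, which no cited source measures for this model.

```python
def pixel_ratio(new_side: int, old_side: int = 2048) -> float:
    """Fraction of pixels left when a square image shrinks from old_side to new_side."""
    return (new_side * new_side) / (old_side * old_side)


def step_ratio(new_steps: int = 28, old_steps: int = 50) -> float:
    """Fraction of sampling steps the Dev variant runs compared with the full model."""
    return new_steps / old_steps


if __name__ == "__main__":
    print(f"1536x1536 has {pixel_ratio(1536):.2%} of the pixels of 2048x2048")
    print(f"28 steps is {step_ratio():.0%} of 50 steps")
```

Going from 2048 to 1536 per side keeps 0.5625 of the pixels, so the resolution change targets memory, while the step count affects run time only.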
flash-attn install fails
Not a problem in the ComfyUI-native path — CheckpointLoaderSimple does not require flash-attn, and the Blackwell-supported sdpa (PyTorch scaled_dot_product_attention) is the default. FlashAttention-2's prebuilt wheels still lack sm_120 kernels on Blackwell as of this writing (open at Dao-AILab/flash-attention#2168), so do not expect FA2 to load on the 5080 — sdpa is the recommended backend. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False" — otherwise inference will fail to import the kernel.
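Before editing models/pipeline.py for the CLI path, you can check whether flash-attn is importable at all. This sketch only tests for the presence of the `flash_attn` package; the helper name is hypothetical, and a positive result does not prove the kernels will load on sm_120.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True when a top-level 'flash_attn' package can be found on this interpreter."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if flash_attn_installed():
        print("flash_attn found; keep use_flash_attn True only if it also loads on your GPU")
    else:
        print("flash_attn not found; set use_flash_attn to False as the model card describes")
```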
The full BF16 checkpoint won't fit 16 GB
BF16/FP16 needs 17–20 GB per the drbaph FP8 card VRAM table (the hidream_o1_image_bf16.safetensors weights alone are ~16.4 GB on disk per the Comfy-Org mirror), so that path does not fit the 16 GB RTX 5080. Stay on the FP8-scaled checkpoint, which fits the 16 GB envelope with room for activations, the text-encode pass, and ComfyUI overhead (there is no separate VAE to load); if you need bit-exact BF16 numerics, you need a ≥20 GB card.
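The gap between the two precisions follows from parameter count times bytes per parameter. The sketch below treats "8B" as exactly 8 billion parameters, which is an assumption; the quoted on-disk sizes (~16.4 GB and ~8.07 GB) are slightly larger, consistent with a rounded parameter count plus extra tensors and metadata.

```python
def weight_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight size in decimal GB, ignoring file metadata and scale tensors."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    params = 8e9  # assumption: "8B" taken as exactly 8 billion parameters
    print(f"BF16/FP16 (2 bytes per parameter): ~{weight_gb(params, 2):.1f} GB")
    print(f"FP8 (1 byte per parameter): ~{weight_gb(params, 1):.1f} GB")
```

At 2 bytes per parameter the weights alone already reach the card's 16 GB, before any activations are allocated.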
MXFP8 weights look faster but you're not sure they apply
The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors (~8.92 GB). The official ComfyUI tutorial describes the FP8/MXFP8 variant as "FP8, uses fp8/mxfp8 matmuls on safe MLP layers for speedup on supported hardware". The RTX 5080 is Blackwell (sm_120) and has the native mxfp8 datapath, so MXFP8 is a valid speed-oriented alternative on this card — unlike Ada Lovelace (RTX 40xx) and older, which lack the mxfp8 matmul path. Both fp8_scaled and mxfp8 fit the 16 GB envelope; this recipe leads with fp8_scaled for the broadest compatibility, but you can swap to mxfp8 on the 5080 to exercise the Blackwell-only matmul acceleration.
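If you script checkpoint selection, the choice can key off the GPU's compute capability. This is a sketch under an assumption: it treats a capability major version of 10 or higher as Blackwell (sm_100 data-center parts, sm_120 consumer parts) and everything below as lacking the mxfp8 path. ComfyUI's own gating may differ, and the function name is hypothetical.

```python
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"
MXFP8 = "hidream_o1_image_mxfp8.safetensors"


def pick_checkpoint(capability: tuple[int, int]) -> str:
    """Choose a checkpoint filename from a (major, minor) compute capability.

    Assumption: major >= 10 means a Blackwell-class GPU with native mxfp8 matmuls.
    """
    major, _minor = capability
    return MXFP8 if major >= 10 else FP8_SCALED


if __name__ == "__main__":
    try:
        import torch
        capability = torch.cuda.get_device_capability(0) if torch.cuda.is_available() else None
    except ImportError:
        capability = None
    if capability is None:
        print(f"no CUDA device visible; defaulting to {FP8_SCALED}")
    else:
        print(f"capability {capability}: {pick_checkpoint(capability)}")
```

On an RTX 5080 the capability is (12, 0), so this sketch selects the mxfp8 file; on an RTX 40-series card, (8, 9), it falls back to fp8_scaled.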