HiDream-O1-Image on RTX 5090: 2048×2048 Text-to-Image with MXFP8 Blackwell-Native Acceleration in ComfyUI

image · intermediate · 17 GB+ VRAM · May 24, 2026
prerequisites
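  • NVIDIA RTX 5090 (32 GB VRAM, Blackwell, sm_120). Other consumer GPUs with 20 GB+ VRAM and PyTorch 2.10+ can run the model, but the MXFP8 hardware acceleration this recipe leads with is Blackwell-only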
  • Python 3.10+ and an updated ComfyUI install (build that ships the native HiDream-O1 templates and the MXFP8 op support)
  • PyTorch ≥ 2.10 with a CUDA 12.8+ wheel (cu128) — required for MXFP8 dispatch on Blackwell, and to avoid the PyTorch 2.9.x Qwen3-VL regression

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B pixel-space transformer with a reasoning-driven prompt agent, generating up to 2048×2048 natively. This recipe leads with the MXFP8 checkpoint because the RTX 5090's Blackwell (sm_120) silicon has hardware acceleration for MXFP8 matmuls that Ada Lovelace and older architectures lack, which makes MXFP8 a real speed win on this card. The 32 GB envelope also has room for the full BF16 Gemma 4 E4B text encoder, so prompt encoding stays at full precision while the diffusion transformer benefits from MXFP8.

Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · MXFP8 diffusion checkpoint 8.92 GB on disk + Gemma 4 E4B BF16 text encoder 16.0 GB on disk (≈ 25 GB of downloads) · runtime peak ≈ 17 GB, lower than the on-disk total because the text encoder is unloaded after prompt encoding (derived from per-tier file sizes + Saganaki22's FP8 envelope) · See benchmark data

Why MXFP8 here. ComfyUI PR #12907, "feat: Support mxfp8" (merged 2026-03-14 by kijai, COLLABORATOR), states verbatim in its description: "As this is already supported in comfy-kitchen. Blackwell only in practice." When asked whether the path works on Ada, kijai replied: "Right, no it doesn't work… comfy-kitchen does handle it so outputs work with error spam, so better to disable for non-blackwell, adding that." MXFP8 matmul acceleration on consumer hardware is therefore an RTX 50-series-only feature today: Blackwell is the only consumer NVIDIA card class with the datapath. The PR's own 5090 measurement on the LTX-2.3-22B-distilled model shows BF16 at "8/8 [02:09<00:00, 16.14s/it]" dropping to MXFP8 at "8/8 [01:17<00:00, 9.68s/it]", a ~1.67× speedup on that model (16.14 / 9.68). No equivalent measurement exists for HiDream-O1-Image yet, so this recipe does not quote a HiDream-specific MXFP8 speedup; the claim it rests on is the architectural mechanism, not a measured number.
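The ~1.67× figure can be reproduced from the two progress lines quoted above. The sketch below is only arithmetic on those strings; the helper name `seconds_per_iteration` is illustrative and not part of ComfyUI or the PR.

```python
import re


def seconds_per_iteration(tqdm_line: str) -> float:
    """Extract the 's/it' figure from a tqdm-style progress line."""
    match = re.search(r"([\d.]+)s/it", tqdm_line)
    if match is None:
        raise ValueError(f"no s/it figure in {tqdm_line!r}")
    return float(match.group(1))


# Progress lines quoted from PR #12907 (LTX-2.3-22B-distilled on an RTX 5090).
bf16_line = "8/8 [02:09<00:00, 16.14s/it]"
mxfp8_line = "8/8 [01:17<00:00, 9.68s/it]"

bf16 = seconds_per_iteration(bf16_line)
mxfp8 = seconds_per_iteration(mxfp8_line)
speedup = bf16 / mxfp8

print(f"BF16 {bf16} s/it, MXFP8 {mxfp8} s/it, speedup {speedup:.2f}x")
```

The ratio applies to that LTX model only; it says nothing measured about HiDream-O1-Image.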

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | 20 GB VRAM (BF16/FP16) per the drbaph BF16 README and the Saganaki22 README; MXFP8 path requires Blackwell (sm_120) for hardware acceleration per ComfyUI PR #12907 | RTX 5090 (32 GB, Blackwell, sm_120) |
| RAM | 16 GB system RAM | |
| Storage | MXFP8 diffusion checkpoint 8.92 GB (hidream_o1_image_mxfp8.safetensors on Comfy-Org) + Gemma 4 E4B BF16 text encoder 16.0 GB (gemma4_e4b_it_bf16.safetensors on Comfy-Org) ≈ 25 GB on disk | |
| Software | Updated ComfyUI with native HiDream-O1 templates and MXFP8 support (PR #12907 merged 2026-03-14); PyTorch ≥ 2.10 with cu128 wheel; transformers 4.57.1–5.3 per the Saganaki22 README | |
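Before downloading, it can help to confirm the target drive has room for both files. The following is a minimal sketch using only the standard library; the 2 GB safety margin and the helper names are assumptions for illustration, and sizes are treated as decimal GB.

```python
import shutil

# On-disk sizes quoted in the Requirements table, in (decimal) GB.
MXFP8_CHECKPOINT_GB = 8.92
GEMMA4_BF16_ENCODER_GB = 16.0
NEEDED_GB = MXFP8_CHECKPOINT_GB + GEMMA4_BF16_ENCODER_GB  # 24.92, i.e. about 25 GB


def free_gb(path: str) -> float:
    """Free space on the filesystem that holds `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, needed_gb: float = NEEDED_GB, margin_gb: float = 2.0) -> bool:
    """True if `path` has space for the downloads plus a hypothetical margin."""
    return free_gb(path) >= needed_gb + margin_gb
```

For example, `has_room("ComfyUI/models")` returns `False` if the drive holding that directory cannot take both files plus the margin.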

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. No custom node is required; the native path has shipped since the May 2026 ComfyUI release, and MXFP8 support landed in March 2026 via PR #12907.

1. Update ComfyUI and pin PyTorch ≥ 2.10 with cu128

Use the ComfyUI Manager update flow, or pull from upstream:

cd ComfyUI
git pull
python -m pip install -r requirements.txt
# RTX 5090 is Blackwell (sm_120) — the cu128 wheel ships sm_120 kernels;
# pinning ≥ 2.10 is required for MXFP8 dispatch (see PR #12907 comment thread).
python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
    "torch>=2.10"

The ≥ 2.10 constraint comes directly from PR #12907's resolution comment by kijai: "Okay, it's the pytorch version, confirmed same error in 2.9.1 while in 2.10.0 it works." The community reporter on the thread was running an "NVIDIA GeForce RTX 5090 : native" card with 32 GB VRAM and 196 GB RAM (xueqing0622, community) — same hardware class this recipe targets.

The PyTorch 2.9.x avoidance is doubly load-bearing: the official HiDream-ai card flags it independently in its May 13 release notes — "PyTorch 2.9.x is not recommended due to the [issue]" — linking QwenLM/Qwen3-VL#1811, which in turn traces back to a Conv3d latency regression on 2.9.x affecting the Qwen3-VL backbone HiDream-O1 inherits. Going straight to 2.10+ on Blackwell satisfies both constraints at once.
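The two constraints above (PyTorch ≥ 2.10 and a Blackwell GPU) can be checked before launching ComfyUI. This is a hedged sketch, not ComfyUI's own guard: the function names are hypothetical, and treating compute-capability major versions 10 and 12 as Blackwell is an assumption (this recipe only covers sm_120).

```python
import re


def torch_at_least(version: str, minimum=(2, 10)) -> bool:
    """Compare the major.minor prefix of a version string such as '2.10.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def is_blackwell(capability) -> bool:
    # Assumption: major 12 is consumer Blackwell (RTX 50 series, sm_120) and
    # major 10 is datacenter Blackwell. Only sm_120 is covered by this recipe.
    return capability[0] in (10, 12)


def mxfp8_ready(version: str, capability) -> bool:
    """True when both conditions named in this step are met."""
    return torch_at_least(version) and is_blackwell(capability)


def describe_local_setup() -> str:
    """Report on the local install; imports torch lazily so the helpers above work without it."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: no CUDA device visible"
    cap = torch.cuda.get_device_capability(0)
    ok = mxfp8_ready(torch.__version__, cap)
    return f"torch {torch.__version__}, sm_{cap[0]}{cap[1]}: MXFP8 prerequisites met = {ok}"
```

Calling `print(describe_local_setup())` inside the ComfyUI environment gives a one-line summary; a `False` on an RTX 5090 most likely means the PyTorch version is still below 2.10.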

2. Download the MXFP8 checkpoint and Gemma 4 BF16 text encoder

# Run these from the directory that contains ComfyUI/ (step 1 left you inside it, so cd .. first).
# Diffusion checkpoint (MXFP8, 8.92 GB on disk, Blackwell-native matmul path)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_mxfp8.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, BF16 — 16.0 GB on disk; 32 GB envelope can carry it)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_bf16.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. The official ComfyUI tutorial describes the FP8 scaled variant as "FP8, uses fp8/mxfp8 matmuls on safe MLP layers for speedup on supported hardware"; the 5090's sm_120 datapath is the "supported hardware" that description refers to.
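After the downloads finish, a quick check that both files landed where ComfyUI looks for them can save a confusing "model not found" later. The sketch below assumes the files keep their repo-relative paths under `ComfyUI/models/` (as the commands above produce); the 5% size tolerance and the function name are illustrative choices.

```python
from pathlib import Path

# Repo-relative paths from the download commands, with the on-disk sizes (GB)
# quoted in this recipe.
EXPECTED_FILES = {
    "checkpoints/hidream_o1_image_mxfp8.safetensors": 8.92,
    "text_encoders/gemma4_e4b_it_bf16.safetensors": 16.0,
}


def check_downloads(models_dir, tolerance: float = 0.05) -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    root = Path(models_dir)
    for rel_path, expected_gb in EXPECTED_FILES.items():
        path = root / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        size_gb = path.stat().st_size / 1e9
        # A large mismatch usually points at an interrupted download.
        if abs(size_gb - expected_gb) > tolerance * expected_gb:
            problems.append(
                f"unexpected size for {rel_path}: {size_gb:.2f} GB "
                f"(expected about {expected_gb} GB)"
            )
    return problems
```

Usage: `check_downloads("ComfyUI/models")` should return an empty list once both files are complete.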

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training or the reasoning-prompt-agent UI, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

The RTX 5090 is Blackwell (sm_120), so make sure the PyTorch wheel installed in step 1 is the cu128 build. FlashAttention-2's prebuilt wheels still lack sm_120 kernels as of this writing (open at Dao-AILab/flash-attention#2168), so the recommended attention backend on Blackwell is sdpa (PyTorch's scaled_dot_product_attention), which is the default. The ComfyUI native path uses sdpa automatically; the Saganaki22 node also exposes sage as an alternative backend if you have an sm_120-compatible build of SageAttention installed.

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:

  • CheckpointLoaderSimple → swap the default hidream_o1_image_bf16.safetensors for hidream_o1_image_mxfp8.safetensors
  • Text encoder → gemma4_e4b_it_bf16.safetensors
  • Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py (no ComfyUI):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

The upstream CLI loads the canonical full-precision weights, not the Comfy-Org MXFP8 mirror — for the MXFP8 speedup you need the ComfyUI graph.
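To render several prompts through the upstream CLI, the command above can be wrapped in a small driver. This is a sketch that only reuses the flags shown in the command; the helper names and output naming scheme are hypothetical, and each call starts a fresh process, so the model is reloaded per prompt.

```python
import subprocess
import sys


def build_inference_cmd(model_path: str, prompt: str, output_image: str, size: int = 2048) -> list:
    """Assemble the inference.py command line using the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--output_image", output_image,
        "--height", str(size),
        "--width", str(size),
    ]


def run_batch(model_path: str, prompts: list, out_dir: str = "results") -> None:
    """Run the CLI once per prompt; raises CalledProcessError if a run fails."""
    for index, prompt in enumerate(prompts):
        output_image = f"{out_dir}/t2i_{index:03d}.png"
        subprocess.run(build_inference_cmd(model_path, prompt, output_image), check=True)
```

Run it from the directory that contains `inference.py`, for example `run_batch("/path/to/HiDream-O1-Image", ["first prompt", "second prompt"])`.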

Results

  • Speed: Not quoted for HiDream-O1-Image — no cited source benchmarks this specific model on the RTX 5090 at a known throughput. The MXFP8 speedup demonstrated by PR #12907 on the LTX-2.3-22B-distilled model (a different architecture) was "BF16: 16.14 s/it; mxfp8: 9.68 s/it" on the RTX 5090 — a ~1.67× speedup on that model. The number does not transfer directly to HiDream-O1 (different transformer shape, different MLP/attention ratio); empirical HiDream-O1 numbers will land at /check/hidream-o1-image/rtx-5090 once a community benchmark seeds the backend.
  • VRAM usage: ~17 GB derived envelope with the MXFP8 checkpoint + BF16 Gemma 4 E4B text encoder at 2048×2048. The component math: MXFP8 diffusion checkpoint is 8.92 GB on disk (Comfy-Org Files tab), and the FP8-class diffusion path uses "~10–11 GB" per the Saganaki22 README and "~10 GB of VRAM" per the drbaph FP8 card. The 16.0 GB Gemma 4 E4B BF16 text encoder is loaded then unloaded per the standard ComfyUI flow, but peak memory during the text-encode pass is text-encoder-dominated — call it ~15–17 GB. The 32 GB card leaves at least 15 GB of headroom for sampler buffers, multi-stream batching, or colocating with a second model. See /check/hidream-o1-image/rtx-5090 for live data once benchmarked.
  • Quality notes: HiDream-O1 specializes in long-text rendering (0.978 on LongText-Bench-EN / 0.979 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix. The model debuted at #8 in the Artificial Analysis Text to Image Arena (2026-05-05).
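The VRAM estimate in the list above can be written out as a short calculation. It is a rough sketch, not a measurement: it models the text-encode and diffusion stages as sequential (the encoder is unloaded before sampling), uses the encoder's file size as a proxy for its resident size, and adds a hypothetical 1 GB allowance for buffers.

```python
GPU_VRAM_GB = 32.0                 # RTX 5090

# Stage footprints taken from the figures quoted in this recipe.
TEXT_ENCODE_STAGE_GB = 16.0        # Gemma 4 E4B BF16 file size, used as a proxy
DIFFUSION_STAGE_GB = (10.0, 11.0)  # FP8-class range per the cited READMEs

# Assumption: a flat allowance for activations and sampler buffers.
OVERHEAD_GB = 1.0


def peak_estimate_gb() -> float:
    """Peak is the larger of the two stages, not their sum, plus the allowance."""
    return max(TEXT_ENCODE_STAGE_GB, max(DIFFUSION_STAGE_GB)) + OVERHEAD_GB


peak = peak_estimate_gb()
headroom = GPU_VRAM_GB - peak
print(f"estimated peak {peak:.0f} GB, headroom {headroom:.0f} GB")
```

With these inputs the estimate lands on the ~17 GB peak and ~15 GB headroom stated above; if ComfyUI keeps both models resident instead, the sum of the two stages is the relevant number.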

For the full benchmark data, see /check/hidream-o1-image/rtx-5090.

Troubleshooting

BufferError: float8 types are not supported by dlpack (PyTorch 2.9.x)

This is the exact failure path the PR #12907 comment thread walks through. Community reporter xueqing0622 hit it on PyTorch 2.9.1 with the RTX 5090; kijai (COLLABORATOR) diagnosed: "Okay, it's the pytorch version, confirmed same error in 2.9.1 while in 2.10.0 it works." Upgrade to PyTorch ≥ 2.10 with the cu128 wheel as shown in Installation step 1.

FlashAttention-2 fails to load on Blackwell

Dao-AILab/flash-attention#2168 tracks the missing sm_120 kernels on Blackwell — the issue reports verbatim: "no kernel image is available for execution on the device" when trying to load FA2 on an RTX 5090 (compute capability 12.0). The issue remains open. The ComfyUI native HiDream-O1 path uses PyTorch's built-in scaled_dot_product_attention (sdpa) by default, which has full Blackwell support — no FA2 needed. If you've imported FA2 in a custom node or workflow that auto-detects backends, force sdpa explicitly, or remove flash-attn from your environment. (For the upstream CLI inference.py, the HiDream-ai model card documents the override: edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False.)
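To see whether an optional attention package is present in the environment, and to confirm that PyTorch's built-in SDPA runs at all, something like the following can be used. It is a diagnostic sketch: the import names `flash_attn` and `sageattention` are the usual module names for those packages, and the helper names are illustrative.

```python
from importlib.util import find_spec


def attention_packages() -> dict:
    """Report which optional attention packages are importable."""
    return {
        "flash_attn": find_spec("flash_attn") is not None,
        "sageattention": find_spec("sageattention") is not None,
    }


def sdpa_smoke_test(device: str = "cpu"):
    """Run a tiny scaled_dot_product_attention call; returns the output shape,
    or None when PyTorch is not installed."""
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return None
    # (batch, heads, sequence, head_dim)
    q = torch.randn(1, 2, 4, 8, device=device)
    k = torch.randn(1, 2, 4, 8, device=device)
    v = torch.randn(1, 2, 4, 8, device=device)
    return tuple(F.scaled_dot_product_attention(q, k, v).shape)
```

On the 5090, `sdpa_smoke_test("cuda")` completing without a "no kernel image" error is a reasonable sign that the sdpa path is usable; if `attention_packages()["flash_attn"]` is `True` and a node keeps selecting it, uninstalling that package is the blunt fix.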

MXFP8 outputs look glitched or error-spam during MLP forward

The ComfyUI MXFP8 implementation is guarded to Blackwell hardware specifically because, per kijai in PR #12907: "comfy-kitchen does handle it so outputs work with error spam, so better to disable for non-blackwell." If you've forced MXFP8 on a non-Blackwell GPU (Ada / Hopper / Ampere) via a custom workflow or by bypassing the guard, switch to hidream_o1_image_fp8_scaled.safetensors (8.07 GB on disk) instead. The FP8-scaled variant uses standard float8_e4m3fn and runs on Ada Lovelace and Hopper with hardware acceleration per the drbaph FP8 card: "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated." Blackwell accelerates both FP8 and MXFP8 matmuls, so FP8-scaled also works on the RTX 5090; no cited source compares the two on HiDream-O1, and this recipe defaults to MXFP8 when the model file is available.
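The choice described above can be summarized as a lookup on CUDA compute capability. This is a sketch of the recipe's guidance, not ComfyUI's guard logic: the function name is hypothetical, Ada is (8, 9) and Hopper is (9, 0), and treating major versions 10 and 12 as Blackwell is an assumption.

```python
MXFP8 = "hidream_o1_image_mxfp8.safetensors"
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"


def pick_checkpoint(capability):
    """Return (checkpoint filename, whether its matmuls are hardware-accelerated).

    `capability` is a (major, minor) tuple such as the one returned by
    torch.cuda.get_device_capability().
    """
    major, minor = capability
    # Assumption: major 12 (RTX 50 series) and major 10 (datacenter) are Blackwell.
    if major in (10, 12):
        return MXFP8, True
    # FP8 compute is hardware-accelerated on Ada Lovelace (8.9) and Hopper (9.0).
    fp8_accelerated = (major, minor) in ((8, 9), (9, 0))
    # Older architectures such as Ampere fall back to FP8-scaled without acceleration.
    return FP8_SCALED, fp8_accelerated
```

For example, `pick_checkpoint((12, 0))` selects the MXFP8 file for an RTX 5090, while `pick_checkpoint((8, 9))` selects the FP8-scaled file for an RTX 4090.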

Want to use the BF16 (full-precision) checkpoint instead

The 32 GB envelope on the 5090 fits the full BF16 checkpoint comfortably (the drbaph BF16 card states verbatim: "24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues." — the 5090's 32 GB sits above that tier with 8 GB extra headroom). Swap hidream_o1_image_mxfp8.safetensors for hidream_o1_image_bf16.safetensors (16.4 GB on disk per Comfy-Org Files tab). Runtime peak rises to ~18–20 GB per Saganaki22's published envelope. You lose the MXFP8 Blackwell-only speedup but get bit-exact reference numerics.