HiDream-O1-Image on RTX 5070: 2048×2048 Text-to-Image with FP8 in ComfyUI

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent that generates up to 2048×2048 natively. Per the drbaph FP8 card, the FP8 variant fits in roughly 10 GB of VRAM and targets 12 GB GPUs, which is the RTX 5070's class. The fit is workable but tight once you account for display overhead.

Hardware data: RTX 5070 (12 GB VRAM, Blackwell sm_120) · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data

12 GB is the design target, but watch display headroom. The drbaph FP8 card states the FP8 model is "accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.)" and was "Tested on 12 GB cards at 2048 × 2048 resolution". On a 12 GB desktop card with a monitor attached, the OS display server already consumes ~0.7–1.5 GB, leaving ~10.5–11.3 GB usable. The FP8 peak (~10–11 GB, see Results) overlaps the top of that range: the low end of the peak fits with room to spare, while the high end can exceed what a busy desktop leaves free. That is much less slack than the 16 GB siblings have. If you hit OOM on aggressive samplers, see Troubleshooting for the resolution and display-offload escape hatches.
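
The headroom arithmetic above can be worked through directly. This is a minimal sketch that uses only the figures quoted in this section (12 GB total, ~0.7–1.5 GB display overhead, ~10–11 GB FP8 peak); the function and variable names are illustrative, and the result is an estimate, not a measurement.

```python
# Worked headroom arithmetic using the figures quoted above (all in GB).
TOTAL_VRAM = 12.0                 # RTX 5070
DISPLAY_OVERHEAD = (0.7, 1.5)     # desktop display server, low and high estimate
FP8_PEAK = (10.0, 11.0)           # reported FP8 peak, low and high estimate


def margin(total: float, display: float, peak: float) -> float:
    """VRAM left over after the display server and the model peak."""
    return round(total - display - peak, 1)


best_case = margin(TOTAL_VRAM, DISPLAY_OVERHEAD[0], FP8_PEAK[0])
worst_case = margin(TOTAL_VRAM, DISPLAY_OVERHEAD[1], FP8_PEAK[1])
no_display = margin(TOTAL_VRAM, 0.0, FP8_PEAK[1])

print(f"best case:  {best_case:+.1f} GB")
print(f"worst case: {worst_case:+.1f} GB")
print(f"no display overhead, high peak: {no_display:+.1f} GB")
```

A negative worst-case margin means the high-end peak does not fit next to a heavy desktop session, which is the situation the Troubleshooting section addresses.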

Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT). Per the canonical model card, it is built "without external VAEs or disjoint text encoders," encoding raw pixels, text, and task conditions in a single shared token space. (The ComfyUI runtime this recipe installs still loads a separate Gemma 4 E4B text encoder for prompt conditioning, gemma4_e4b_it_fp8_scaled.safetensors, per the official Comfy tutorial.) The advice to avoid PyTorch 2.9.x is unrelated to this architecture: it stems from an upstream issue that the card links for 2.9.x (tracked at QwenLM/Qwen3-VL#1811).

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (FP8) — BF16/FP16 needs 17–20 GB per the drbaph FP8 card and Saganaki22 README	RTX 5070 (12 GB, Blackwell, sm_120)
RAM	16 GB system RAM	—
Storage	FP8 diffusion checkpoint ~8.07 GB + Gemma 4 E4B FP8-scaled text encoder ~9.06 GB ≈ ~17 GB on disk (verify current sizes on the Comfy-Org HiDream-O1-Image mirror and the Comfy-Org gemma-4 mirror)	—
Software	Updated ComfyUI (native HiDream-O1 templates), `transformers` 4.57.1–5.3 per the Saganaki22 README	—

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path uses the Comfy-Org repackaged mirror; no custom node is required for image generation.

1. Update ComfyUI and use the cu128 PyTorch wheel

The RTX 5070 is Blackwell (sm_120), and the sm_120 kernels ship in the cu128 PyTorch wheels. Make sure your environment uses a cu128 build, and avoid PyTorch 2.9.x (see Troubleshooting). Use the ComfyUI Manager's update flow, or pull from upstream:

cd ComfyUI
git pull
python -m pip install -r requirements.txt
# RTX 5070 is Blackwell (sm_120); the cu128 wheels ship sm_120 kernels.
# Upgrade torchvision and torchaudio in the same command so they stay matched to torch.
python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
    "torch>=2.10" torchvision torchaudio

The HiDream O1 workflow template only appears in Browse Templates once the build includes the native loader — make sure ComfyUI is fully updated before looking for it.
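
The PyTorch constraints from this step (a cu128 build, sm_120 kernels, and no 2.9.x) can be checked before launching ComfyUI. The sketch below is an assumption-laden helper, not part of ComfyUI: `torch_version_ok` is a hypothetical name, and it encodes the "2.8.x or 2.10+" rule from the Troubleshooting section. The torch calls only run if torch is installed.

```python
import re


def torch_version_ok(version: str) -> bool:
    """True for 2.8.x or 2.10+, False for 2.9.x and anything older."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor == (2, 8) or major_minor >= (2, 10)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch", torch.__version__, "ok:", torch_version_ok(torch.__version__))
        print("CUDA build:", torch.version.cuda)
        if torch.cuda.is_available():
            # (12, 0) is what an sm_120 card such as the RTX 5070 should report.
            print("capability:", torch.cuda.get_device_capability(0))
            print("sm_120 in wheel:", "sm_120" in torch.cuda.get_arch_list())
```

Run it with the same Python interpreter that launches ComfyUI, since a different virtual environment can carry a different torch build.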

2. Download the FP8 checkpoint and Gemma 4 text encoder

# Run these from the directory that contains ComfyUI/ (step 1 left you inside it).
# Diffusion checkpoint (FP8-scaled, ~8.07 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, FP8-scaled — ~9.06 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. ComfyUI loads the text encoder for the prompt-encode pass and the diffusion checkpoint for sampling. The two are not both fully resident at peak, which is why the FP8 runtime peak stays around 10–11 GB rather than the ~17 GB on-disk sum. That distinction is what makes 12 GB workable: the on-disk total alone would not fit, but the runtime peak does.
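
Before launching, it can help to confirm that both files landed where ComfyUI looks for them. The following is a small sketch under stated assumptions: the relative paths come from the download commands above, while the `ComfyUI/models` root and the `report_models` helper name are illustrative and should be adapted to your install.

```python
from pathlib import Path

# Relative paths taken from the download commands above.
EXPECTED_FILES = (
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors",
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors",
)


def report_models(models_dir: str) -> dict:
    """Map each expected file to its size in GB, or None if it is missing."""
    root = Path(models_dir)
    report = {}
    for rel in EXPECTED_FILES:
        path = root / rel
        report[rel] = round(path.stat().st_size / 1e9, 2) if path.is_file() else None
    return report


if __name__ == "__main__":
    # "ComfyUI/models" is an assumed location; adjust it to your install.
    for rel, size_gb in report_models("ComfyUI/models").items():
        status = f"{size_gb} GB" if size_gb is not None else "MISSING"
        print(f"{rel}: {status}")
```

Sizes far below the ~8.07 GB and ~9.06 GB quoted above usually indicate an interrupted download.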

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training, the reasoning-prompt-agent UI, or the self-contained single-file FP8 loader, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

This node loads the drbaph/HiDream-O1-Image-FP8 single-file FP8 weights (~8.81 GB) directly. Because the RTX 5070 is Blackwell (sm_120), keep the cu128 PyTorch wheel from step 1.

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load the "HiDream O1 Full: Image generation" template, per the official ComfyUI tutorial. The bundled template wires:

CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
Default sampling: 50 inference steps at up to 2048×2048
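
Once the template runs in the UI, the same graph can also be queued without the browser through ComfyUI's HTTP `/prompt` endpoint. This is a hedged sketch, not part of the official tutorial: it assumes ComfyUI is listening on its default local address, and `hidream_o1_api.json` is a hypothetical filename for the template exported in API format.

```python
import json
import urllib.request

# Assumption: ComfyUI is listening on its default local address.
COMFY_URL = "http://127.0.0.1:8188"


def build_queue_request(workflow: dict, base_url: str = COMFY_URL) -> urllib.request.Request:
    """Build the POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical filename: export the loaded template in API format first.
    try:
        with open("hidream_o1_api.json", encoding="utf-8") as fh:
            workflow = json.load(fh)
        with urllib.request.urlopen(build_queue_request(workflow)) as resp:
            print(json.load(resp))  # server reply for the queued job
    except OSError as exc:
        print(f"could not queue workflow: {exc}")
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the GPU.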

For a CLI run via the official inference.py (no ComfyUI; loads the full-precision canonical weights, not the FP8 mirror — so it needs a ≥20 GB card, not the 12 GB RTX 5070):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

In ComfyUI, the first run loads the FP8 weights into memory, so expect a noticeable cold-start delay before the first sampling step; subsequent runs reuse the loaded model.

Results

Speed: Not quoted. No cited source benchmarks HiDream-O1-Image on the RTX 5070 at a known throughput, and the backend has no benchmark for this pair yet. Quoting a number measured on a different card (including the same-family RTX 5070 Ti or 5080) would be a fabrication: the RTX 5070's ~672 GB/s memory bandwidth and 6144 CUDA cores both sit well below those siblings (the 5070 Ti runs ~896 GB/s and 8960 cores), so neither a memory-bound stage (such as text encoding) nor a compute-bound diffusion step would match a 5070 Ti or 5080 figure. Empirical numbers will land at /check/hidream-o1-image/rtx-5070 once a community benchmark seeds the backend. Please contribute one if you run this.
VRAM usage: ~10–11 GB peak with the FP8 model — which fits the RTX 5070's 12 GB, but tightly. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." and that it was "Tested on 12 GB cards at 2048 × 2048 resolution." The Saganaki22 ComfyUI README independently corroborates "~10–11 GB" for the Full FP8 model (and BF16/FP16 at "~18–20 GB"). On the 12 GB RTX 5070 with a display attached (~10.5–11.3 GB usable), the FP8 footprint leaves only a thin margin for sampler buffers and ComfyUI overhead — comfortable on a headless box, snug with a monitor. See /check/hidream-o1-image/rtx-5070 for live data once benchmarked.
Quality notes: HiDream-O1 specializes in long-text rendering (0.979 on LongText-Bench-EN, 0.978 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix.
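
The spec gap cited under Speed can be put into ratios. The sketch below uses only the bandwidth and core counts quoted there; the names are illustrative, and the ratios describe paper specifications, not predicted throughput.

```python
# Published specs quoted in the Speed note above.
SPECS = {
    "RTX 5070":    {"bandwidth_gbs": 672, "cuda_cores": 6144},
    "RTX 5070 Ti": {"bandwidth_gbs": 896, "cuda_cores": 8960},
}


def ratio(metric: str, card: str = "RTX 5070", reference: str = "RTX 5070 Ti") -> float:
    """Return card/reference for one spec, rounded to two decimals."""
    return round(SPECS[card][metric] / SPECS[reference][metric], 2)


print("memory bandwidth ratio:", ratio("bandwidth_gbs"))
print("CUDA core ratio:", ratio("cuda_cores"))
```

On paper the RTX 5070 has about three quarters of the 5070 Ti's bandwidth and roughly two thirds of its cores, which is why a 5070 Ti timing cannot simply be reused here.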

For the full benchmark data, see /check/hidream-o1-image/rtx-5070.

Troubleshooting

Out of memory at 2048×2048 (the 12 GB squeeze)

This is the failure mode most likely to bite on the RTX 5070, because the FP8 footprint (~10–11 GB) and the usable VRAM on a display-attached 12 GB card (~10.5–11.3 GB) are close. Mitigations, in order:

Run the display off the iGPU or a second card so the full 12 GB is available to ComfyUI. This is the single biggest win and brings the card back to the ~11.6 GB usable that a headless box sees.
Drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally), which lowers the activation footprint.
If generation fits only marginally, consider the Dev variant to shorten each run. It has the same FP8 memory footprint, so it will not cure an OOM on its own, but it uses 28 inference steps instead of 50. It is downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror.
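
Two of the mitigations above are easy to quantify before launching. This sketch reads current VRAM use from `nvidia-smi` (so you can see what the display already holds) and computes how much the pixel count shrinks at 1536×1536. The helper names are hypothetical, and how closely the activation footprint follows the pixel count is an assumption rather than a measured property of this model.

```python
import subprocess


def parse_memory_csv(line: str) -> tuple:
    """Parse one 'used, total' line (MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def pixel_ratio(new_side: int, old_side: int) -> float:
    """Fraction of pixels left when a square image shrinks from old_side to new_side."""
    return (new_side * new_side) / (old_side * old_side)


if __name__ == "__main__":
    print("1536 vs 2048 pixel count:", pixel_ratio(1536, 2048))
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        used, total = parse_memory_csv(out.splitlines()[0])
        print(f"in use before launch: {used} MiB of {total} MiB")
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError) as exc:
        print(f"nvidia-smi query failed: {exc}")
```

Running it with ComfyUI closed shows the baseline the desktop consumes, which is the number that moving the display to another GPU would free.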

PyTorch 2.9.x crashes on first inference

The canonical HiDream-ai model card flags this in its May 13, 2026 release notes: PyTorch 2.9.x is not recommended because of an upstream issue, which the card links as QwenLM/Qwen3-VL#1811. The Saganaki22 README confirms it independently and notes that the node logs a warning when it detects 2.9.x. Pin PyTorch to 2.8.x or 2.10+ (the cu128 torch>=2.10 install in step 1 satisfies this on Blackwell). The issue is architecture-agnostic, so it applies equally to the 5070.

`flash-attn` install fails

Not a problem in the ComfyUI-native path: CheckpointLoaderSimple does not require flash-attn, and the Blackwell-supported sdpa (PyTorch scaled_dot_product_attention) is the default. FlashAttention-2's prebuilt wheels still lack sm_120 kernels on Blackwell as of this writing (open at Dao-AILab/flash-attention#2168), so do not expect FA2 to load on the 5070; sdpa is the recommended backend. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False. Otherwise inference will fail when it tries to import the kernel.
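
For the upstream CLI path, a quick check of whether `flash_attn` is importable at all can tell you which value the "use_flash_attn" flag should take. This is a sketch with hypothetical helper names; it only tests importability, which is an assumption-level proxy, since a wheel without sm_120 kernels can still fail at run time on the RTX 5070.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True if a flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def attention_settings() -> dict:
    """Suggest a value for the "use_flash_attn" flag named in the model card.

    Assumption: being importable is only a necessary condition. A wheel built
    without sm_120 kernels can still fail at run time on the RTX 5070.
    """
    return {"use_flash_attn": flash_attn_installed()}


print(attention_settings())
```

If this prints False, setting the flag to False as described above is the expected configuration on this card.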

The full BF16 checkpoint won't fit 12 GB

BF16/FP16 needs 17–20 GB per the drbaph FP8 card VRAM table (the hidream_o1_image_bf16.safetensors weights alone are ~16.4 GB on disk per the Comfy-Org mirror). That path does not fit the 12 GB RTX 5070, and it does not fit the 16 GB siblings either. Stay on the FP8-scaled checkpoint; if you need bit-exact BF16 numerics, you need a ≥20 GB card.
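
The gap between the two precisions follows from bytes per parameter. Here is a back-of-the-envelope sketch, assuming the rounded "8B" parameter count from the introduction; it covers weights only and ignores activations, scale tensors, and any layers kept in higher precision.

```python
PARAMS = 8e9  # "8B" parameter count from the model description (rounded)

BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp8": 1}


def weight_gb(dtype: str, params: float = PARAMS) -> float:
    """Approximate weight size in GB (10**9 bytes) for a given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "fp8"):
    print(f"{dtype}: ~{weight_gb(dtype):.1f} GB of weights")
```

The estimates of about 16 GB and 8 GB line up with the ~16.4 GB and ~8.07 GB on-disk sizes quoted in this recipe, and they show that BF16 weights alone already exceed 12 GB.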

MXFP8 weights look faster but you're not sure they apply

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors (~8.92 GB). The official ComfyUI tutorial describes the FP8/MXFP8 variant as using fp8/mxfp8 matmuls on safe MLP layers for a speedup on supported hardware. The RTX 5070 is Blackwell (sm_120) and has the native mxfp8 datapath, so MXFP8 is a valid speed-oriented alternative on this card. Ada Lovelace (RTX 40xx) and older GPUs lack the mxfp8 matmul path. Both fp8_scaled and mxfp8 carry the same ~10–11 GB envelope on the 5070; this recipe leads with fp8_scaled for the broadest compatibility, but you can swap to mxfp8 to exercise the Blackwell-only matmul acceleration. (Note that Blackwell's native FP8 tensor cores already accelerate the standard fp8_scaled path too, so the drbaph card's "older GPUs dequantize on-the-fly" caveat does not apply to the sm_120 RTX 5070.)
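
The choice between the two checkpoints can be expressed as a rule on CUDA compute capability. The sketch below is illustrative only: `pick_checkpoint` is a hypothetical helper, and the threshold (capability 10.0 and above treated as mxfp8-capable) is an assumption derived from the Blackwell versus Ada distinction described above.

```python
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"
MXFP8 = "hidream_o1_image_mxfp8.safetensors"


def pick_checkpoint(capability: tuple, prefer_speed: bool = True) -> str:
    """Choose a checkpoint filename from a CUDA compute capability.

    Assumption: treat compute capability 10.0 and above (Blackwell) as
    mxfp8-capable, and everything older as fp8_scaled only.
    """
    if prefer_speed and capability >= (10, 0):
        return MXFP8
    return FP8_SCALED


print(pick_checkpoint((12, 0)))   # RTX 5070 (sm_120)
print(pick_checkpoint((8, 9)))    # Ada Lovelace (sm_89)
```

In a real environment the capability tuple would come from `torch.cuda.get_device_capability()`; passing `prefer_speed=False` keeps the recipe's default fp8_scaled file even on Blackwell.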