HiDream-O1-Image on RTX 5070 Ti: 2048×2048 Text-to-Image with FP8 in ComfyUI

image · intermediate · 10 GB+ VRAM · Jun 3, 2026
Prerequisites
  • NVIDIA RTX 5070 Ti (16 GB VRAM, Blackwell, sm_120) or any 12 GB+ consumer GPU
  • Python 3.10+ and an updated ComfyUI install (a build that ships the native HiDream-O1 templates)
  • PyTorch from a CUDA 12.8 (cu128) wheel, which includes sm_120 kernels; avoid PyTorch 2.9.x, which the model card flags for an upstream issue (see Troubleshooting)

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent, generating up to 2048×2048 natively. The FP8 variant fits in the ~10 GB range per the drbaph FP8 card, leaving comfortable headroom on the 16 GB RTX 5070 Ti for activations, the Gemma 4 text encoder, and ComfyUI overhead.

Hardware data: RTX 5070 Ti (16 GB VRAM, Blackwell sm_120) · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data

Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT). Per the canonical model card, it is built "without external VAEs or disjoint text encoders," encoding raw pixels, text, and task conditions in a single shared token space. (The ComfyUI runtime this recipe installs still loads a separate Gemma 4 E4B text encoder for prompt conditioning: gemma4_e4b_it_fp8_scaled.safetensors, per the official Comfy tutorial.) The advice to avoid PyTorch 2.9.x is unrelated to this architecture: it stems from an upstream issue, tracked at QwenLM/Qwen3-VL#1811, that the model card flags for 2.9.x.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (FP8); BF16/FP16 needs 17–20 GB per the drbaph FP8 card and Saganaki22 README | RTX 5070 Ti (16 GB, Blackwell, sm_120)
RAM | 16 GB system RAM |
Storage | FP8 diffusion checkpoint ~8.07 GB + Gemma 4 E4B FP8-scaled text encoder ~9.06 GB ≈ ~17 GB on disk (verify current sizes on the Comfy-Org HiDream-O1-Image mirror and the Comfy-Org gemma-4 mirror) |
Software | Updated ComfyUI (native HiDream-O1 templates), transformers 4.57.1–5.3 per the Saganaki22 README |

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path uses the Comfy-Org repackaged mirror; no custom node is required for image generation.

1. Update ComfyUI and use the cu128 PyTorch wheel

The RTX 5070 Ti is Blackwell (sm_120), and the sm_120 kernels ship in the cu128 PyTorch wheel. Make sure your environment uses that wheel, and avoid PyTorch 2.9.x (see Troubleshooting). Update ComfyUI through ComfyUI Manager's update flow, or pull from upstream:

cd ComfyUI
git pull
python -m pip install -r requirements.txt
# RTX 5070 Ti is Blackwell (sm_120); the cu128 wheel ships sm_120 kernels.
python -m pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
    "torch>=2.10"

The official tutorial notes: "Make sure your ComfyUI is updated." — the HiDream O1 workflow template only appears once the build includes the native loader.
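Before launching ComfyUI, it can help to confirm that the environment matches the constraints above. The following is a minimal sketch, not part of the official tutorial: `is_flagged_torch` and `describe_environment` are hypothetical helper names, and the sm_120 test assumes the installed wheel reports its compiled architectures through `torch.cuda.get_arch_list()`.

```python
import re


def is_flagged_torch(version: str) -> bool:
    """Return True for any PyTorch 2.9.x version string (e.g. '2.9.1+cu128')."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) == (2, 9)


def describe_environment() -> str:
    """Summarise the local torch install; torch is imported lazily."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    lines = [f"torch {torch.__version__} (CUDA build: {torch.version.cuda})"]
    if is_flagged_torch(torch.__version__):
        lines.append("warning: PyTorch 2.9.x is flagged by the model card")
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        lines.append(f"GPU compute capability: sm_{major}{minor}")
        # Assumption: the wheel lists its compiled targets as 'sm_XY' strings.
        lines.append(f"sm_120 in wheel: {'sm_120' in torch.cuda.get_arch_list()}")
    else:
        lines.append("CUDA is not available to this torch build")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_environment())
```

Run it with the same interpreter that launches ComfyUI, since a different virtual environment may carry a different torch build.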

2. Download the FP8 checkpoint and Gemma 4 text encoder

# Diffusion checkpoint (FP8-scaled — ~8.07 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, FP8-scaled — ~9.06 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. ComfyUI loads the text encoder for the prompt-encode pass and the diffusion checkpoint for sampling — they are not both fully resident at peak, which is why the FP8 runtime peak stays near ~10 GB rather than the ~17 GB on-disk sum.
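A quick sanity check after the downloads can catch a file that landed in the wrong folder. The sketch below assumes that `--local-dir` preserved the repository's subfolders, so the files sit under `checkpoints/` and `text_encoders/` inside `ComfyUI/models/`. The helper names `find_missing_models` and `total_size_gb` are hypothetical.

```python
from pathlib import Path

# Relative paths under ComfyUI/models/, taken from the download commands above.
EXPECTED_MODELS = (
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors",
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors",
)


def find_missing_models(models_dir: str) -> list[str]:
    """Return the expected model files that are absent or empty."""
    root = Path(models_dir)
    missing = []
    for relative in EXPECTED_MODELS:
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


def total_size_gb(models_dir: str) -> float:
    """Sum the on-disk size of the expected files that exist, in decimal GB."""
    root = Path(models_dir)
    sizes = [(root / r).stat().st_size for r in EXPECTED_MODELS if (root / r).is_file()]
    return sum(sizes) / 1e9


if __name__ == "__main__":
    print("missing:", find_missing_models("ComfyUI/models"))
    print(f"on disk: {total_size_gb('ComfyUI/models'):.2f} GB")
```

If both files are present, the reported total should be close to the ~17 GB on-disk sum; a much smaller figure suggests an interrupted download.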

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training, the reasoning-prompt-agent UI, or the self-contained single-file FP8 loader, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

This node loads the drbaph/HiDream-O1-Image-FP8 single-file FP8 weights (~8.81 GB) directly. Because the RTX 5070 Ti is Blackwell (sm_120), keep the cu128 PyTorch wheel from step 1.

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:

  • CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
  • Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
  • Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py (no ComfyUI; loads the full-precision canonical weights, not the FP8 mirror):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048
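
To render several prompts with the same CLI, the command can be assembled programmatically. This is a sketch under stated assumptions: it uses only the flags shown in the command above, `build_inference_cmd` is a hypothetical helper, and the second prompt and the numbered output names are illustrative.

```python
import shlex


def build_inference_cmd(model_path, prompt, output_image, height=2048, width=2048):
    """Build the argv list for the upstream inference.py call shown above."""
    return [
        "python", "inference.py",
        "--model_path", str(model_path),
        "--prompt", prompt,
        "--output_image", str(output_image),
        "--height", str(height),
        "--width", str(width),
    ]


if __name__ == "__main__":
    prompts = [
        "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'",
        "A chalkboard menu listing three soups in neat block capitals",
    ]
    for index, prompt in enumerate(prompts):
        cmd = build_inference_cmd(
            "/path/to/HiDream-O1-Image", prompt, f"results/t2i_{index:02d}.png"
        )
        # Print a copy-pasteable command; shlex.join quotes the prompt safely.
        print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it instead of printing. Each invocation is a separate process, so the model is reloaded for every prompt.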

In ComfyUI, the first run loads the FP8 weights into memory, so expect a noticeable cold-start delay before the first sampling step; subsequent runs reuse the loaded model.

Results

  • Speed: Not quoted — no cited source benchmarks HiDream-O1-Image on the RTX 5070 Ti at a known throughput, and the backend has no benchmark for this pair yet. Quoting a number measured on a different card (including the same-family RTX 5080) would be a fabrication: the 5070 Ti's ~896 GB/s memory bandwidth and 8960 CUDA cores both sit below the 5080's, so neither a memory-bound stage (text-encode, decode) nor a compute-bound diffusion step would match a 5080 figure. Empirical numbers will land at /check/hidream-o1-image/rtx-5070-ti once a community benchmark seeds the backend — please contribute one if you run this.
  • VRAM usage: ~10 GB peak with the FP8 model. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." The Saganaki22 ComfyUI README independently corroborates "Full FP8 | ~10–11 GB" (and BF16/FP16 at "~18–20 GB"). On the 16 GB RTX 5070 Ti that leaves ~5–6 GB of headroom above the FP8 footprint for sampler buffers and ComfyUI overhead. See /check/hidream-o1-image/rtx-5070-ti for live data once benchmarked.
  • Quality notes: HiDream-O1 specializes in long-text rendering (0.979 on LongText-Bench-EN, 0.978 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix.
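
The headroom figure in the VRAM bullet is plain subtraction from the cited footprints. The sketch below reproduces it; the constants are the figures quoted from the Saganaki22 README, and `headroom_gb` is a hypothetical helper that ignores display-server and driver overhead.

```python
def headroom_gb(total_vram_gb: float, footprint_gb: float) -> float:
    """VRAM left after the model footprint, before display/OS overhead."""
    return total_vram_gb - footprint_gb


CARD_VRAM_GB = 16.0               # RTX 5070 Ti
FP8_FOOTPRINT_GB = (10.0, 11.0)   # "~10–11 GB" per the Saganaki22 README
BF16_FOOTPRINT_GB = (18.0, 20.0)  # "~18–20 GB" per the same README

fp8_headroom = [headroom_gb(CARD_VRAM_GB, f) for f in FP8_FOOTPRINT_GB]
bf16_headroom = [headroom_gb(CARD_VRAM_GB, f) for f in BF16_FOOTPRINT_GB]

print(fp8_headroom)   # [6.0, 5.0]
print(bf16_headroom)  # [-2.0, -4.0]
```

A negative value means the cited footprint alone exceeds the card's VRAM, which is why the BF16/FP16 path is ruled out on 16 GB.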

For the full benchmark data, see /check/hidream-o1-image/rtx-5070-ti.

Troubleshooting

PyTorch 2.9.x crashes on first inference

The canonical HiDream-ai model card flags this in its May 13, 2026 release notes: PyTorch 2.9.x is not recommended because of an upstream issue, which the card links to QwenLM/Qwen3-VL#1811. The Saganaki22 README confirms it independently and notes that the node logs a warning when it detects 2.9.x. Pin PyTorch to 2.8.x or 2.10+ (the cu128 torch>=2.10 install in step 1 satisfies this on Blackwell). The issue is architecture-agnostic, so it applies equally to the 5070 Ti.

Out of memory at 2048×2048

The 16 GB RTX 5070 Ti has ~5–6 GB of headroom above the cited FP8 footprint, but desktop display-server overhead and ComfyUI's intermediate buffers can eat into that envelope. If the FP8 model spikes past 16 GB on aggressive samplers, drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally) or switch to the Dev variant — same FP8 memory footprint, 28 inference steps instead of 50, downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror.
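
To gauge how much a resolution drop might help, compare pixel counts. This is a rough sketch: it assumes activation memory scales roughly with the number of pixels, which is not a measured property of this model, and `pixel_ratio` is a hypothetical helper.

```python
def pixel_ratio(new_side: int, old_side: int = 2048) -> float:
    """Fraction of the original pixel count kept by a square image of new_side."""
    return (new_side * new_side) / (old_side * old_side)


print(pixel_ratio(1536))  # 0.5625
print(pixel_ratio(1024))  # 0.25
```

Dropping from 2048×2048 to 1536×1536 therefore processes about 56% of the pixels. The weights themselves do not shrink, so the saving applies only to per-image buffers.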

flash-attn install fails

Not a problem in the ComfyUI-native path — CheckpointLoaderSimple does not require flash-attn, and the Blackwell-supported sdpa (PyTorch scaled_dot_product_attention) is the default. FlashAttention-2's prebuilt wheels still lack sm_120 kernels on Blackwell as of this writing (open at Dao-AILab/flash-attention#2168), so do not expect FA2 to load on the 5070 Ti — sdpa is the recommended backend. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False" — otherwise inference will fail to import the kernel.
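
The edit the model card describes can also be applied with a small script. The sketch below assumes that models/pipeline.py contains the literal text quoted above; `disable_flash_attn` and `patch_pipeline` are hypothetical helpers, and the match is on the string rather than on line 341 so that it still works if the line number shifts.

```python
from pathlib import Path

OLD = '"use_flash_attn": True'
NEW = '"use_flash_attn": False'


def disable_flash_attn(source: str) -> str:
    """Return source with the flash-attn flag switched off.

    Raises ValueError if the expected literal is not present, so a changed
    upstream file is never silently left unpatched.
    """
    if OLD not in source:
        raise ValueError(f"{OLD!r} not found; check the file by hand")
    return source.replace(OLD, NEW)


def patch_pipeline(path: str = "models/pipeline.py") -> None:
    """Rewrite the pipeline file in place, keeping a .bak copy beside it."""
    target = Path(path)
    original = target.read_text(encoding="utf-8")
    target.with_name(target.name + ".bak").write_text(original, encoding="utf-8")
    target.write_text(disable_flash_attn(original), encoding="utf-8")


if __name__ == "__main__":
    patch_pipeline()
```

Run it from the root of the upstream repository checkout. A second run raises ValueError because the flag is already False, which doubles as a check that the patch was applied.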

The full BF16 checkpoint won't fit 16 GB

BF16/FP16 needs 17–20 GB per the drbaph FP8 card VRAM table (the hidream_o1_image_bf16.safetensors weights alone are ~16.4 GB on disk per the Comfy-Org mirror), so that path does not fit the 16 GB RTX 5070 Ti. Stay on the FP8-scaled checkpoint, which fits the 16 GB envelope with room for the text encoder alongside; HiDream-O1 works in pixel space, so there is no separate VAE to budget for. If you need bit-exact BF16 numerics, you need a ≥20 GB card.

MXFP8 weights look faster but you're not sure they apply

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors (~8.92 GB). The official ComfyUI tutorial describes the FP8/MXFP8 variant as "FP8, uses fp8/mxfp8 matmuls on safe MLP layers for speedup on supported hardware". The RTX 5070 Ti is Blackwell (sm_120) and has the native mxfp8 datapath, so MXFP8 is a valid speed-oriented alternative on this card — unlike Ada Lovelace (RTX 40xx) and older, which lack the mxfp8 matmul path. Both fp8_scaled and mxfp8 fit the 16 GB envelope; this recipe leads with fp8_scaled for the broadest compatibility, but you can swap to mxfp8 on the 5070 Ti to exercise the Blackwell-only matmul acceleration.
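
The choice between the two files can be expressed as a small rule keyed on compute capability. This is a sketch, not ComfyUI behavior: `pick_checkpoint` is a hypothetical helper, and treating any major version of 10 or higher as having an mxfp8 matmul path is an assumption that should be checked against the hardware at hand.

```python
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"
MXFP8 = "hidream_o1_image_mxfp8.safetensors"


def pick_checkpoint(capability: tuple[int, int], prefer_speed: bool = True) -> str:
    """Choose a checkpoint file name from a (major, minor) compute capability.

    Assumption: major >= 10 is treated as Blackwell-class hardware with an
    mxfp8 matmul path; anything older falls back to fp8_scaled.
    """
    major, _minor = capability
    if prefer_speed and major >= 10:
        return MXFP8
    return FP8_SCALED


print(pick_checkpoint((12, 0)))  # RTX 5070 Ti (sm_120)
print(pick_checkpoint((8, 9)))   # Ada Lovelace (sm_89)
```

On a machine with PyTorch installed, the tuple can come from `torch.cuda.get_device_capability(0)`. Passing `prefer_speed=False` keeps the fp8_scaled default that this recipe leads with.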