HiDream-O1-Image on RTX 4070 Ti SUPER: 2048×2048 Text-to-Image with FP8 in ComfyUI

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent, generating up to 2048×2048 natively. The FP8 variant fits in roughly 10 GB of VRAM, leaving ~6 GB of headroom on the 16 GB RTX 4070 Ti SUPER for activations, the Gemma 4 text encoder, and ComfyUI overhead.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data

Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT): pixels, text, and task conditions share a single token space, "without external VAEs or disjoint text encoders" per the canonical card. The ComfyUI runtime targeted by this recipe nevertheless uses Gemma 4 E4B as the workflow's external text encoder (gemma4_e4b_it_fp8_scaled.safetensors, per the official Comfy tutorial). The PyTorch 2.9.x incompatibility flag has a separate cause: a Qwen3-VL-tagged dependency in the HiDream stack, tracked at QwenLM/Qwen3-VL#1811; see Troubleshooting.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (FP8) — BF16/FP16 needs ~18–20 GB per the Saganaki22 ComfyUI README	RTX 4070 Ti SUPER (16GB, Ada Lovelace sm_89)
RAM	16 GB system RAM	—
Storage	~8 GB FP8 checkpoint + ~9 GB FP8 text encoder — see live file sizes on the Comfy-Org HiDream-O1-Image mirror and Comfy-Org gemma-4 text-encoder mirror	—
Software	Updated ComfyUI (native HiDream-O1 templates), `transformers` 4.57.1–5.3 per the Saganaki22 README	—
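
A quick way to confirm that the installed `transformers` falls inside the range quoted above is sketched below. The helper names are hypothetical, and reading the upper bound as "any 5.3.x release" is an assumption; the README only states the range as 4.57.1–5.3.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def in_supported_range(text, low=(4, 57, 1), high_exclusive=(5, 4, 0)):
    """True when low <= version < high_exclusive (assumed reading of the range)."""
    return low <= parse_version(text) < high_exclusive


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        print(installed, "in range:", in_supported_range(installed))
```

Pre-release suffixes such as `rc1` are ignored by this parser, so treat its answer as a first check rather than a strict compatibility guarantee.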

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path has shipped since the May 2026 ComfyUI release and uses the Comfy-Org repackaged mirror; no custom node is required.

1. Update ComfyUI

Use the ComfyUI Manager's update flow, or pull from the upstream repo:

cd ComfyUI
git pull
python -m pip install -r requirements.txt

The official tutorial notes: "Make sure your ComfyUI is updated" — the HiDream O1 workflow template only appears once the build includes the native loader.

2. Download the FP8 checkpoint and Gemma 4 text encoder

# Diffusion checkpoint (~8 GB FP8-scaled)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, FP8-scaled, ~9 GB)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it.
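
After the downloads finish, it can help to confirm that both files landed where ComfyUI looks for them. The sketch below assumes the two commands above were run from the directory that contains `ComfyUI/`, so the files sit under `ComfyUI/models/checkpoints/` and `ComfyUI/models/text_encoders/`; the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# Relative paths taken from the download commands in this recipe.
EXPECTED_FILES = (
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors",
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors",
)


def missing_model_files(models_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under models_dir."""
    root = Path(models_dir)
    missing = []
    for relative in expected:
        candidate = root / relative
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(relative)
    return missing


if __name__ == "__main__":
    absent = missing_model_files("ComfyUI/models")
    if absent:
        print("Missing or empty:", ", ".join(absent))
    else:
        print("Both model files are in place.")
```

The check only looks at presence and non-zero size; it does not verify checksums, so an interrupted download can still pass it.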

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training or the reasoning-prompt-agent UI, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

The RTX 4070 Ti SUPER is Ada Lovelace (sm_89) and the default pip install torch already includes sm_89 kernels — no special wheel selection (e.g. a cu128 build for Blackwell sm_120 cards) is required, and prebuilt FlashAttention wheels cover sm_89, so attention "just works."

Running

Launch ComfyUI as usual:

python main.py

In ComfyUI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:

CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py (no ComfyUI):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048
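
When scripting several CLI runs, building the argument list in Python avoids shell-quoting problems with prompts that contain apostrophes or quotes. This is a minimal sketch that uses only the flags shown in the command above; `build_inference_argv` and `run_inference` are hypothetical helpers, and the script and model paths are placeholders.

```python
import subprocess
import sys


def build_inference_argv(model_path, prompt, output_image, height=2048, width=2048,
                         script="inference.py"):
    """Assemble the argv list for the CLI call shown above."""
    return [
        sys.executable, script,
        "--model_path", str(model_path),
        "--prompt", prompt,
        "--output_image", str(output_image),
        "--height", str(height),
        "--width", str(width),
    ]


def run_inference(**kwargs):
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_inference_argv(**kwargs), check=True)
```

A call such as `run_inference(model_path="/path/to/HiDream-O1-Image", prompt="...", output_image="results/t2i.png")` passes the prompt as a single argument, so no escaping is needed.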

Results

Speed: Not quoted — no cited source benchmarks HiDream-O1-Image on the RTX 4070 Ti SUPER at a known throughput, and the backend has no benchmark for this pair yet (/check/hidream-o1-image/rtx-4070-ti-super returns verdict: unknown). The 4070 Ti SUPER has ~13% fewer CUDA cores and ~6% less memory bandwidth than the RTX 4080, so the 4080 sibling recipe's framing would overstate it; no measurement on a directly comparable card exists either. What does transfer is the FP8 acceleration tier: per the drbaph FP8 card, Ada Lovelace (RTX 40xx, which includes the 4070 Ti SUPER) is hardware-accelerated for FP8 compute — "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated. On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." The "small speed penalty" caveat applies only to pre-Ada cards (Ampere and earlier); the 4070 Ti SUPER gets the full FP8 acceleration. If you measure throughput on a 4070 Ti SUPER, please contribute it via the submission form so it lands at /check/hidream-o1-image/rtx-4070-ti-super.
VRAM usage: ~10 GB peak with the FP8 model. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." That parenthetical names the RTX 4070 and 4080, which sit just below and just above the 4070 Ti SUPER in the lineup, so the 16 GB 4070 Ti SUPER comfortably accommodates the ~10 GB footprint. The Saganaki22 ComfyUI README independently corroborates the per-precision footprints in its loader table: Full FP8 at ~10–11 GB versus Full BF16/FP16 at ~18–20 GB. The ~6 GB of headroom on the 16 GB 4070 Ti SUPER covers activations, the Gemma 4 encoder, and ComfyUI buffers. See /check/hidream-o1-image/rtx-4070-ti-super for live data once benchmarked.
Quality notes: HiDream-O1 specializes in long-text rendering (0.979 on LongText-Bench-EN, 0.978 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that is the meaning of the "-O1" suffix.

For the full benchmark data, see /check/hidream-o1-image/rtx-4070-ti-super.

Troubleshooting

PyTorch 2.9.x crashes on first inference

The canonical HiDream-ai/HiDream-O1-Image card flags this in its May 13, 2026 update note: PyTorch 2.9.x is not recommended because of a Qwen3-VL backbone issue, tracked at QwenLM/Qwen3-VL#1811. The Saganaki22/HiDream_O1-ComfyUI node echoes the same upstream guidance and logs a warning when it detects a 2.9.x build. Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. This is architecture-agnostic and applies equally to the 4070 Ti SUPER.
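
A small guard can catch the flagged release line before the first inference. The sketch below only inspects the version string, so it runs without a GPU; `is_flagged_torch` is a hypothetical helper, not part of ComfyUI or the Saganaki22 node.

```python
import re


def is_flagged_torch(version):
    """True for any PyTorch 2.9.x version string, e.g. '2.9.1+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) == (2, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if is_flagged_torch(torch.__version__):
            print(f"PyTorch {torch.__version__} is on the flagged 2.9.x line")
        else:
            print(f"PyTorch {torch.__version__} is outside the flagged 2.9.x line")
```

Comparing the parsed major and minor numbers, rather than matching on the string prefix "2.9", keeps a version such as 2.90 from being misread as 2.9.x.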

Out of memory at 2048×2048

The 16 GB 4070 Ti SUPER has ~6 GB of headroom above the cited FP8 footprint, but a desktop session's display server and ComfyUI's intermediate buffers can eat into that envelope. If usage with the FP8 model spikes past 16 GB on aggressive samplers, drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally) or switch to the Dev variant, which has the same memory footprint but runs 28 inference steps instead of 50 and is downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror.
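
To see how much the resolution drop buys, compare pixel counts. The arithmetic below is exact; the idea that activation memory shrinks roughly in proportion to pixel count is an assumption, not a measured figure for this model.

```python
def pixel_ratio(new_side, old_side):
    """Ratio of pixel counts between two square resolutions."""
    return (new_side * new_side) / (old_side * old_side)


ratio = pixel_ratio(1536, 2048)
print(f"1536x1536 has {ratio:.4f} of the pixels of 2048x2048")  # 0.5625
print(f"That is {1 - ratio:.2%} fewer pixels per image")        # 43.75%
```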

BF16/FP16 weights won't fit

The unquantized BF16 and FP16 checkpoints (16-bit, labeled "Full BF16/FP16" in the Saganaki22 loader table) need ~18–20 GB, which overflows the 16 GB 4070 Ti SUPER. Stay on the FP8-scaled checkpoint, which is the path documented to fit on this card. The 24 GB+ BF16 path is covered by the RTX 3090 / 3090 Ti sibling recipes, not this one.
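
The fit-or-overflow reasoning can be checked against the footprint ranges quoted from the Saganaki22 loader table. This is a minimal sketch; `headroom_gb` is a hypothetical helper, and the figures are the approximate ranges cited in this recipe, not measurements.

```python
CARD_VRAM_GB = 16  # RTX 4070 Ti SUPER

# Approximate (low, high) footprints in GB, as quoted in this recipe.
FOOTPRINTS_GB = {
    "fp8": (10, 11),
    "bf16/fp16": (18, 20),
}


def headroom_gb(precision, vram_gb=CARD_VRAM_GB):
    """Return (worst_case, best_case) headroom; negative means overflow."""
    low, high = FOOTPRINTS_GB[precision]
    return vram_gb - high, vram_gb - low


for name in FOOTPRINTS_GB:
    worst, best = headroom_gb(name)
    verdict = "fits" if worst >= 0 else "overflows"
    print(f"{name}: {worst:+d} to {best:+d} GB headroom -> {verdict}")
```

With these inputs the FP8 row leaves 5 to 6 GB free, while the BF16/FP16 row is 2 to 4 GB over the card's capacity even in the best case.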

`flash-attn` install fails

Not a problem in the ComfyUI-native path — CheckpointLoaderSimple does not require flash-attn. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False" — otherwise inference will fail to import the kernel.
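
Before editing the flag, you can check whether flash-attn is visible in the active environment. This is a minimal sketch using only the standard library; `flash_attn` is the import name of the flash-attn package, and the helper name `flash_attn_available` is hypothetical.

```python
import importlib.util


def flash_attn_available():
    """True when the flash_attn module can be located without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    # Candidate value for "use_flash_attn" in the workaround quoted above.
    print('"use_flash_attn":', flash_attn_available())
```

Locating the module does not prove that its compiled kernels match the installed PyTorch and CUDA build, so a `True` here is necessary but not sufficient for the import to succeed at inference time.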

MXFP8 weights look faster but don't load

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors. The official ComfyUI tutorial describes this MXFP8 quantized variant as using "fp8/mxfp8 matmuls on safe MLP layers for speedup on supported hardware". MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5060 Ti, etc., sm_120) — the RTX 4070 Ti SUPER is Ada Lovelace (sm_89) and lacks the native mxfp8 datapath. Use the plain fp8_scaled checkpoint on the 4070 Ti SUPER instead.
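
The choice between the two quantized files can be keyed on the GPU's compute capability. The sketch below assumes, following the text above, that only sm_120 parts should take the MXFP8 file; `pick_checkpoint` is a hypothetical helper, and `torch.cuda.get_device_capability()` is the PyTorch call that returns the `(major, minor)` pair.

```python
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"
MXFP8 = "hidream_o1_image_mxfp8.safetensors"


def pick_checkpoint(capability):
    """Map a (major, minor) compute capability to a checkpoint filename."""
    # Assumption from this recipe: MXFP8 matmuls need Blackwell sm_120.
    if tuple(capability) == (12, 0):
        return MXFP8
    return FP8_SCALED


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
        print(capability, "->", pick_checkpoint(capability))
    else:
        print("No CUDA device visible; sm_89 would use", pick_checkpoint((8, 9)))
```

On the RTX 4070 Ti SUPER the capability is `(8, 9)`, so the helper returns the plain fp8_scaled filename.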