
HiDream-O1-Image on RTX 4070: 2048×2048 Text-to-Image with FP8 in ComfyUI

image · intermediate · 10GB+ VRAM · Jun 6, 2026
prerequisites
  • NVIDIA RTX 4070 (12 GB VRAM, Ada Lovelace, sm_89) or any 12 GB+ consumer GPU
  • Python 3.10+ and an updated ComfyUI install (a build that ships the native HiDream-O1 templates)
  • PyTorch built against CUDA 12.x; avoid PyTorch 2.9.x, which the canonical model card flags for an upstream issue (see Troubleshooting)

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B unified pixel-space transformer with a reasoning-driven prompt agent that generates natively at up to 2048×2048. The FP8 variant fits in roughly 10 GB per the drbaph FP8 card, which names the RTX 4070 among its 12 GB target GPUs, so this is a tight but workable fit once you account for display overhead.

Hardware data: RTX 4070 (12 GB VRAM, Ada Lovelace sm_89) · FP8 fits in ~10 GB per the drbaph FP8 model card · See benchmark data

12 GB is the design target, but watch display headroom. The drbaph FP8 card states the FP8 model "fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." and was "Tested on 12 GB cards at 2048 × 2048 resolution." The RTX 4070 is one of the cards named in that parenthetical. On a 12 GB desktop card with a monitor attached, the OS display server already consumes ~0.7–1.5 GB, leaving roughly 10.5–11.3 GB usable. The FP8 peak (~10–11 GB, see Results) fits inside that envelope only narrowly, and the top of the peak range can exceed the bottom of the usable range, so there is far less slack than on a 16 GB card. If you hit OOM on aggressive samplers, see Troubleshooting for the resolution and display-offload escape hatches.
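
The headroom arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation, not a measurement: the constants are the approximate figures quoted in this recipe, and the function names are hypothetical helpers introduced here.

```python
# Sketch: VRAM headroom on a display-attached 12 GB card.
# All constants are the approximate figures quoted in this recipe.

TOTAL_VRAM_GB = 12.0
DISPLAY_OVERHEAD_GB = (0.7, 1.5)  # low and high estimate for the display server
FP8_PEAK_GB = (10.0, 11.0)        # low and high FP8 peak per the cited cards


def usable_vram_gb(total_gb: float, overhead_gb: float) -> float:
    """VRAM left for ComfyUI after the display server takes its share."""
    return round(total_gb - overhead_gb, 2)


def headroom_gb(total_gb: float, overhead_gb: float, peak_gb: float) -> float:
    """Positive means the peak fits; negative means an OOM is likely."""
    return round(usable_vram_gb(total_gb, overhead_gb) - peak_gb, 2)


if __name__ == "__main__":
    # Print every pairing of overhead estimate and peak estimate.
    for overhead in DISPLAY_OVERHEAD_GB:
        for peak in FP8_PEAK_GB:
            spare = headroom_gb(TOTAL_VRAM_GB, overhead, peak)
            print(f"overhead {overhead} GB, peak {peak} GB: headroom {spare:+.2f} GB")
```

With these figures the best pairing leaves about 1.3 GB spare and the worst comes out about 0.5 GB short, which is why a display-attached 4070 is workable but not guaranteed. For a real reading, compare against the memory that nvidia-smi reports as used before ComfyUI starts.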

Architecture note: HiDream-O1 is a Pixel-level Unified Transformer (UiT). Per the canonical model card, it natively encodes raw pixels, text, and task-specific conditions in a single shared token space "without external VAEs or disjoint text encoders". (The ComfyUI runtime this recipe installs still loads a separate Gemma 4 E4B text encoder for prompt conditioning: gemma4_e4b_it_fp8_scaled.safetensors, per the official Comfy tutorial.) The advice to avoid PyTorch 2.9.x has a separate cause: an upstream issue in a Qwen3-VL-tagged dependency, which the card flags for 2.9.x and which is tracked at QwenLM/Qwen3-VL#1811.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (FP8); BF16/FP16 needs 17–20 GB per the drbaph FP8 card and ~18–20 GB per the Saganaki22 README | RTX 4070 (12 GB, Ada Lovelace, sm_89) |
| RAM | 16 GB system RAM | |
| Storage | FP8 diffusion checkpoint ~8.07 GB + Gemma 4 E4B FP8-scaled text encoder ~9.06 GB ≈ ~17 GB on disk (verify current sizes on the Comfy-Org HiDream-O1-Image mirror and the Comfy-Org gemma-4 mirror) | |
| Software | Updated ComfyUI (native HiDream-O1 templates), transformers 4.57.1–5.3 per the Saganaki22 README | |

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path uses the Comfy-Org repackaged mirror; no custom node is required for image generation.

1. Update ComfyUI

The RTX 4070 is Ada Lovelace (sm_89), and the default pip install torch (the standard cu124/CUDA-12.x wheel) already includes sm_89 kernels. No special wheel selection is required, unlike the Blackwell sm_120 cards (RTX 50-series), which need the cu128 build. To update ComfyUI itself, use the ComfyUI Manager's update flow or pull from upstream:

cd ComfyUI
git pull
python -m pip install -r requirements.txt

The HiDream O1 workflow template only appears in Browse Templates once the build includes the native loader — make sure ComfyUI is fully updated before looking for it. Avoid PyTorch 2.9.x (see Troubleshooting).

2. Download the FP8 checkpoint and Gemma 4 text encoder

# Diffusion checkpoint (FP8-scaled — ~8.07 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, FP8-scaled — ~9.06 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. ComfyUI loads the text encoder for the prompt-encode pass and the diffusion checkpoint for sampling — they are not both fully resident at peak, which is why the FP8 runtime peak stays near ~10 GB rather than the ~17 GB on-disk sum. That distinction is what makes 12 GB workable: the on-disk total alone would not fit, but the runtime peak does.
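
Before launching, it can help to confirm that both files landed where ComfyUI expects them. The sketch below assumes the two huggingface-cli commands above were run from the directory that contains ComfyUI, and that the quoted sizes are decimal gigabytes. check_downloads and its tolerance parameter are hypothetical names introduced here, not part of ComfyUI or huggingface_hub.

```python
from pathlib import Path

# Expected on-disk sizes in GB, as quoted in this recipe (approximate).
EXPECTED_GB = {
    "checkpoints/hidream_o1_image_fp8_scaled.safetensors": 8.07,
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors": 9.06,
}


def check_downloads(models_dir, expected_gb=EXPECTED_GB, tolerance=0.05):
    """Return {relative_path: status} for each expected file.

    status is "missing", "ok", or "size mismatch (<actual> GB)".
    tolerance is a relative fraction: 0.05 accepts sizes within 5 percent.
    """
    results = {}
    for rel_path, want_gb in expected_gb.items():
        path = Path(models_dir) / rel_path
        if not path.is_file():
            results[rel_path] = "missing"
            continue
        got_gb = path.stat().st_size / 1e9  # assumption: decimal GB
        if abs(got_gb - want_gb) <= tolerance * want_gb:
            results[rel_path] = "ok"
        else:
            results[rel_path] = f"size mismatch ({got_gb:.2f} GB)"
    return results


if __name__ == "__main__":
    for name, status in check_downloads("ComfyUI/models").items():
        print(f"{name}: {status}")
```

A "size mismatch" usually points to an interrupted download, though the mirror sizes may also have changed since this recipe was written.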

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training, the reasoning-prompt-agent UI, or the self-contained single-file FP8 loader, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

This node loads the drbaph/HiDream-O1-Image-FP8 single-file FP8 weights (~8.81 GB) directly. The RTX 4070 is Ada Lovelace (sm_89) and the default pip install torch already includes sm_89 kernels; prebuilt FlashAttention-2 wheels also cover sm_89, so attention "just works" — no Blackwell-style cu128 wheel or sdpa fallback is needed.

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load the "HiDream O1 Full: Image generation" template, per the official ComfyUI tutorial. The bundled template wires:

  • CheckpointLoaderSimple → hidream_o1_image_fp8_scaled.safetensors
  • Text encoder → gemma4_e4b_it_fp8_scaled.safetensors
  • Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py, without ComfyUI, use the command below. This path loads the full-precision canonical weights, not the FP8 mirror, so it needs a ≥20 GB card rather than the 12 GB RTX 4070:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

First run warms the FP8 weights into memory — expect a noticeable cold-start delay before the first sampling step; subsequent runs reuse the loaded model.

Results

  • Speed: Not quoted — no cited source benchmarks HiDream-O1-Image on the RTX 4070 at a known throughput, and the backend has no benchmark for this pair yet (/check/hidream-o1-image/rtx-4070 returns verdict: unknown). Quoting a figure from a larger Ada card would overstate it: the RTX 4070 has only 5888 CUDA cores and ~504 GB/s memory bandwidth — roughly 30% fewer cores and 25% less bandwidth than the 4070 Ti SUPER — so neither a compute-bound diffusion step nor a memory-bound encode/decode stage would match a bigger-card number. What does transfer is the FP8 acceleration tier: per the drbaph FP8 card, Ada Lovelace gets hardware-accelerated FP8 compute — "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated. On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." The "small speed penalty" caveat applies only to pre-Ada cards (Ampere and earlier); the RTX 4070, being Ada Lovelace (RTX 40xx), gets the full FP8 acceleration. If you measure throughput on a 4070, please contribute it via the submission form so it lands at /check/hidream-o1-image/rtx-4070.
  • VRAM usage: ~10–11 GB peak with the FP8 model — which fits the RTX 4070's 12 GB, but tightly. The drbaph/HiDream-O1-Image-FP8 card states verbatim: "By quantizing to 8-bit floats, the model fits comfortably within ~10 GB of VRAM — making it accessible on 12 GB GPUs (RTX 3080/4070/4080, etc.) with minimal quality trade-off." and that it was "Tested on 12 GB cards at 2048 × 2048 resolution." The Saganaki22 ComfyUI README independently corroborates the per-precision footprints in its loader table: Full FP8 at ~10–11 GB versus Full BF16/FP16 at ~18–20 GB. On the 12 GB RTX 4070 with a display attached (~10.5–11.3 GB usable), the FP8 footprint leaves only a thin margin for sampler buffers and ComfyUI overhead — comfortable on a headless box, snug with a monitor. See /check/hidream-o1-image/rtx-4070 for live data once benchmarked.
  • Quality notes: HiDream-O1 specializes in long-text rendering (0.979 on LongText-Bench-EN, 0.978 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation — that's the meaning of the "-O1" suffix.
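
The core and bandwidth gap cited in the Speed bullet can be checked with a few lines of arithmetic. The RTX 4070 figures come from this recipe; the RTX 4070 Ti SUPER figures (8448 CUDA cores, 672 GB/s) are an assumption taken from NVIDIA's public spec sheet, since the recipe does not state them. percent_less is a hypothetical helper name.

```python
# Sketch: how much smaller the RTX 4070 is than the RTX 4070 Ti SUPER.

RTX_4070 = {"cuda_cores": 5888, "bandwidth_gbs": 504}           # from this recipe
RTX_4070_TI_SUPER = {"cuda_cores": 8448, "bandwidth_gbs": 672}  # assumption: public spec


def percent_less(smaller: float, larger: float) -> float:
    """Percentage by which `smaller` falls short of `larger`, to one decimal."""
    return round((1 - smaller / larger) * 100, 1)


if __name__ == "__main__":
    for key in ("cuda_cores", "bandwidth_gbs"):
        gap = percent_less(RTX_4070[key], RTX_4070_TI_SUPER[key])
        print(f"{key}: {gap}% less on the RTX 4070")
```

Under those assumed figures the gaps come out at 30.3% for cores and 25.0% for bandwidth, matching the rounded numbers in the bullet. This says nothing about actual throughput, which still needs a measured benchmark.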

For the full benchmark data, see /check/hidream-o1-image/rtx-4070.

Troubleshooting

Out of memory at 2048×2048 (the 12 GB squeeze)

This is the failure mode most likely to bite on the RTX 4070, because the FP8 footprint (~10–11 GB) and the usable VRAM on a display-attached 12 GB card (~10.5–11.3 GB) are close. Mitigations, in order:

  • Run the display off the iGPU or a second card so the full 12 GB is available to ComfyUI — this is the single biggest win and brings the card back to the ~11.6 GB usable a headless box sees.
  • Drop the resolution to 1536×1536 (the loader snaps to valid resolutions internally), which lowers the activation footprint.
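  • Switch to the Dev variant, downloadable as hidream_o1_image_dev_fp8_scaled.safetensors from the same Comfy-Org mirror. It has the same FP8 memory footprint, so it does not cure an OOM on its own, but it runs 28 inference steps instead of 50, which makes retries at a lower resolution quicker.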
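
To gauge what the resolution drop in the list above buys, pixel count is a convenient rough proxy. The sketch below assumes activation memory scales roughly with the number of pixels, which is a simplification: the model weights stay the same size, so the total VRAM saving is smaller than the ratio suggests. pixel_ratio is a hypothetical helper name.

```python
# Sketch: relative pixel count when dropping from 2048x2048 to a smaller square.


def pixel_ratio(new_side: int, old_side: int = 2048) -> float:
    """Fraction of the original pixel count that remains at the new size."""
    return (new_side * new_side) / (old_side * old_side)


if __name__ == "__main__":
    ratio = pixel_ratio(1536)
    print(f"1536x1536 has {ratio:.4f} of the pixels of 2048x2048")
```

At 1536×1536 the image has 0.5625 of the pixels of 2048×2048, so under this assumption the per-image activation cost drops by a little under half.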

PyTorch 2.9.x crashes on first inference

The canonical HiDream-ai model card flags this in its May 13, 2026 update note: PyTorch 2.9.x is not recommended due to an upstream issue, which the card links to QwenLM/Qwen3-VL#1811. The Saganaki22 README confirms it independently and notes that the node logs a warning when it detects a 2.9.x build. Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. The issue is architecture-agnostic, so it applies equally to the RTX 4070.
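
A quick pre-flight check for the flagged version range might look like the sketch below. It only inspects the version string and assumes the usual major.minor prefix that PyTorch exposes in torch.__version__ (for example 2.9.1+cu128). is_flagged_torch is a hypothetical helper, not something the model card or the custom node provides.

```python
import re


def is_flagged_torch(version: str) -> bool:
    """True if the version string belongs to the PyTorch 2.9.x series."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (2, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        version = str(torch.__version__)
        if is_flagged_torch(version):
            print(f"PyTorch {version} is in the flagged 2.9.x range; pin 2.8.x or 2.10+.")
        else:
            print(f"PyTorch {version} is outside the flagged 2.9.x range.")
```

Run it with the same Python interpreter that launches ComfyUI, since a separate virtual environment may carry a different PyTorch build.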

BF16/FP16 weights won't fit 12 GB

The full-precision BF16 and FP16 checkpoints need 17–20 GB per the drbaph FP8 card VRAM table (~18–20 GB per the Saganaki22 loader table) — the hidream_o1_image_bf16.safetensors weights alone are ~16.4 GB on disk per the Comfy-Org mirror. That path does not fit the 12 GB RTX 4070. Stay on the FP8-scaled checkpoint, which is the documented-fitting path on this card; if you need bit-exact BF16 numerics, you need a ≥20 GB card.

flash-attn install fails

Not a problem in the ComfyUI-native path — CheckpointLoaderSimple does not require flash-attn. On the RTX 4070 (Ada sm_89), prebuilt FlashAttention-2 wheels do ship sm_89 kernels, so FA2 also installs cleanly if you want it. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround for environments without flash-attn: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change "use_flash_attn": True to "use_flash_attn": False" — otherwise inference will fail to import the kernel.
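
The manual edit quoted above can also be scripted. The sketch below is a hedged illustration only: it assumes the flag appears in models/pipeline.py exactly as the model card quotes it, and it works on a string so you can inspect the result before writing anything back. flash_attn_available and disable_flash_attn are hypothetical helper names.

```python
import importlib.util

OLD_FLAG = '"use_flash_attn": True'
NEW_FLAG = '"use_flash_attn": False'


def flash_attn_available() -> bool:
    """True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def disable_flash_attn(source: str) -> str:
    """Return `source` with every occurrence of the flag flipped to False.

    Raises ValueError if the flag is absent, so a changed upstream file
    is reported instead of being silently left untouched.
    """
    if OLD_FLAG not in source:
        raise ValueError("flag not found; edit models/pipeline.py by hand")
    return source.replace(OLD_FLAG, NEW_FLAG)


if __name__ == "__main__":
    if flash_attn_available():
        print("flash-attn is installed; no edit needed.")
    else:
        print("flash-attn is missing; apply disable_flash_attn to models/pipeline.py.")
```

To apply it, read the file with pathlib.Path("models/pipeline.py").read_text(), pass the text through disable_flash_attn, and write the result back after keeping a backup copy.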

MXFP8 weights look faster but don't load

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors (~8.92 GB). The official ComfyUI tutorial describes this MXFP8 quantized variant as using fp8/mxfp8 matmuls on safe MLP layers for a speedup on supported hardware. MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5070, etc., sm_120) — the RTX 4070 is Ada Lovelace (sm_89) and lacks the native mxfp8 datapath. Use the plain fp8_scaled checkpoint on the RTX 4070 instead; Ada's FP8 tensor cores already accelerate that path.
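
The choice between the two checkpoints can be expressed as a lookup on CUDA compute capability. The sketch below encodes this recipe's guidance only: it treats sm_120 and later as MXFP8-capable and everything else as a plain FP8-scaled target, which is an assumption rather than an official support matrix. The (major, minor) tuple is what torch.cuda.get_device_capability() returns; the two function names are hypothetical.

```python
FP8_SCALED = "hidream_o1_image_fp8_scaled.safetensors"
MXFP8 = "hidream_o1_image_mxfp8.safetensors"


def has_fp8_hardware(capability) -> bool:
    """Ada Lovelace (8, 9), Hopper (9, 0) and newer have FP8 tensor cores."""
    return tuple(capability) >= (8, 9)


def candidate_checkpoint(capability) -> str:
    """Pick a checkpoint file name from a (major, minor) compute capability.

    Assumption from this recipe: only sm_120 and later are treated as
    MXFP8-capable; every other card gets the plain FP8-scaled file.
    """
    if tuple(capability) >= (12, 0):
        return MXFP8
    return FP8_SCALED


if __name__ == "__main__":
    rtx_4070 = (8, 9)  # Ada Lovelace, sm_89
    print(candidate_checkpoint(rtx_4070))
    print("hardware FP8:", has_fp8_hardware(rtx_4070))
```

For the RTX 4070 this selects the FP8-scaled file and reports hardware FP8 support, in line with the advice above. An Ampere card such as (8, 6) would still get the FP8-scaled file but without hardware FP8 compute.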