HiDream-O1-Image Full BF16 on RTX 3090: 2048×2048 Text-to-Image in ComfyUI

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image (Full variant) on the RTX 3090, running the canonical BF16 path at the full 2048×2048 native resolution. The 3090's 24 GB of VRAM puts the card in the same memory envelope as the 4090 for this workflow. BF16 is a 16-bit format, and the BF16 checkpoint is the only full-precision path that fits without arch-conditional FP8 tricks.

Hardware data: RTX 3090 (24 GB VRAM, Ampere sm_86) · BF16 runs at ~17–20 GB peak per the drbaph BF16 model card — "24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues" (3090 named verbatim) · See benchmark data

Architecture note — Ampere vs Ada FP8 split. HiDream-O1 is a Pixel-level Unified Transformer (UiT) — "natively unified … without external VAEs or disjoint text encoders … natively encodes raw pixels, text, and task-specific conditions in a single shared token space" per the canonical card. The ComfyUI runtime uses Gemma 4 E4B as the workflow's external text encoder (gemma4_e4b_it_bf16.safetensors on Comfy-Org/gemma-4 Files tab) — not Qwen3-VL; "Qwen3-VL" only appears here as an internal library-dependency tag in the upstream PyTorch-version-compatibility note. The 3090 sits on Ampere (sm_86), which lacks the hardware FP8 datapath of Ada Lovelace / Hopper. Per the drbaph FP8 model card: "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated. On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." The 3090 is in the "older GPUs / dequantized on-the-fly" bucket. This recipe targets BF16 — the only path with both a 3090-named recommendation source and full hardware-native arithmetic on Ampere.
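The Ampere / Ada / Blackwell split described above can be sketched as a small lookup on CUDA compute capability. This is an illustrative sketch only: the function names and returned labels are hypothetical and not part of ComfyUI or HiDream, and the thresholds assume the public capability numbers (sm_86 for the RTX 3090, sm_89 for Ada Lovelace, sm_90 for Hopper, sm_100 and above for Blackwell).

```python
# Sketch: map a CUDA compute capability to the FP8 handling described above.
# Names and labels are illustrative assumptions, not a ComfyUI or HiDream API.

def fp8_support(major: int, minor: int) -> str:
    """Return how FP8-family weights are expected to be handled."""
    cc = (major, minor)
    if cc >= (10, 0):
        # Blackwell (e.g. sm_100, sm_120): FP8 and MXFP8 matmuls in hardware.
        return "hardware FP8 + MXFP8"
    if cc >= (8, 9):
        # Ada Lovelace (sm_89) and Hopper (sm_90): hardware FP8.
        return "hardware FP8"
    # Ampere (sm_80 / sm_86) and older: FP8 weights are dequantized on the fly.
    return "dequantized on-the-fly"


def detect_local_gpu() -> str:
    """Classify the first CUDA device; needs PyTorch and a visible GPU."""
    import torch  # lazy import so fp8_support() has no dependencies

    major, minor = torch.cuda.get_device_capability(0)
    return fp8_support(major, minor)


print("RTX 3090 (sm_86):", fp8_support(8, 6))
print("RTX 4090 (sm_89):", fp8_support(8, 9))
print("RTX 5090 (sm_120):", fp8_support(12, 0))
```

On a machine with PyTorch and a GPU, `detect_local_gpu()` applies the same classification to the installed card.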

Requirements

Component	Minimum	Tested
GPU	20 GB VRAM (BF16/FP16) per the drbaph BF16 README and the Saganaki22 README	RTX 3090 (24 GB, Ampere, sm_86)
RAM	16 GB system RAM	—
Storage	BF16 diffusion checkpoint 16.4 GB (`hidream_o1_image_bf16.safetensors` on Comfy-Org) + Gemma 4 BF16 text encoder 16.0 GB (`gemma4_e4b_it_bf16.safetensors` on Comfy-Org) = 32.4 GB on disk	—
Software	Updated ComfyUI (native HiDream-O1 templates), `transformers` 4.57.1–5.3 per the Saganaki22 README	—
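As a quick check of the storage row, the two file sizes can be added and compared against free disk space. This is a minimal sketch: the helper names and the 2 GB safety margin are assumptions for illustration, and the sizes are the ones quoted in the table.

```python
import shutil

# File sizes in GB, as quoted in the requirements table.
CHECKPOINT_GB = 16.4    # hidream_o1_image_bf16.safetensors
TEXT_ENCODER_GB = 16.0  # gemma4_e4b_it_bf16.safetensors


def required_disk_gb() -> float:
    """Total download size for the BF16 checkpoint plus the BF16 text encoder."""
    return round(CHECKPOINT_GB + TEXT_ENCODER_GB, 1)


def has_room(path: str = ".", margin_gb: float = 2.0) -> bool:
    """True if the filesystem holding `path` can fit both files plus a margin."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_disk_gb() + margin_gb


print(f"Download size: about {required_disk_gb()} GB")
print("Enough free space here:", has_room("."))
```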

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1. The native path has shipped since the May 2026 ComfyUI release and uses the Comfy-Org repackaged mirror; no custom node is required for the basic image-generation flow.

1. Update ComfyUI

Use the ComfyUI Manager's update flow, or pull from the upstream repo:

cd ComfyUI
git pull
python -m pip install -r requirements.txt

The official tutorial notes: "Make sure your ComfyUI is updated" — the HiDream O1 workflow template only appears once the build includes the native loader.

2. Download the BF16 checkpoint and Gemma 4 BF16 text encoder

# Diffusion checkpoint (BF16, full precision — 16.4 GB on disk)
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_bf16.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B, BF16 — 16.0 GB on disk)
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_bf16.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo links back to the canonical HiDream-ai/HiDream-O1-Image source, and the Comfy-Org/gemma-4 repo repackages Google's gemma-4-E4B-it. The official ComfyUI tutorial lists hidream_o1_image_bf16.safetensors as "Full bf16 precision (largest)" — the highest-quality checkpoint available in the workflow.
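If you prefer to script the download from Python, the same two files can be fetched with `hf_hub_download` from the `huggingface_hub` package. The sketch below assumes the repo IDs and file paths from the commands above and a `ComfyUI/models` directory relative to the working directory; the helper names are hypothetical.

```python
from pathlib import Path

# (repo_id, filename) pairs copied from the huggingface-cli commands above.
DOWNLOADS = [
    ("Comfy-Org/HiDream-O1-Image", "checkpoints/hidream_o1_image_bf16.safetensors"),
    ("Comfy-Org/gemma-4", "text_encoders/gemma4_e4b_it_bf16.safetensors"),
]


def target_paths(models_dir: str = "ComfyUI/models") -> list:
    """Where each file should land; local_dir keeps the repo's subfolders."""
    return [Path(models_dir) / filename for _, filename in DOWNLOADS]


def missing_files(models_dir: str = "ComfyUI/models") -> list:
    """Files from DOWNLOADS that are not on disk yet."""
    return [p for p in target_paths(models_dir) if not p.is_file()]


def download_all(models_dir: str = "ComfyUI/models") -> list:
    """Fetch every file (about 32 GB in total); requires huggingface_hub."""
    from huggingface_hub import hf_hub_download  # lazy import

    return [
        hf_hub_download(repo_id=repo, filename=name, local_dir=models_dir)
        for repo, name in DOWNLOADS
    ]


for path in missing_files():
    print("missing:", path)
```

Calling `download_all()` starts the actual transfer; `missing_files()` is a cheap way to confirm afterwards that both files ended up where the ComfyUI loaders look for them.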

3. (Optional) Install the Saganaki22 custom node

The native path is enough for image generation. If you also want LoRA training or the reasoning-prompt-agent UI, install the community node:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/HiDream_O1-ComfyUI.git
cd HiDream_O1-ComfyUI
python -m pip install -r requirements.txt

The RTX 3090 is Ampere (sm_86). The default pip install torch already includes sm_86 kernels, so no special wheel selection is required. FlashAttention-2 has full sm_86 kernel coverage — stock pip install flash-attn works on the 3090, and the loader's flash, sdpa, and sage attention backends all run. (The FA3 path is Hopper-only; FA2 is the correct backend on Ampere and is what the workflow uses.)

Running

Launch ComfyUI as usual:

python main.py

In the ComfyUI UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation", per the official ComfyUI tutorial. The bundled template wires:

CheckpointLoaderSimple → hidream_o1_image_bf16.safetensors
Text encoder → gemma4_e4b_it_bf16.safetensors
Default sampling: 50 inference steps at up to 2048×2048

For a CLI run via the official inference.py (no ComfyUI):

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'" \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048
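For batch runs, the CLI call above can be wrapped in a small Python helper. This is a sketch that only assembles the argument list using the flags shown in the command; the function name is hypothetical, and it assumes you run it from the directory that contains the upstream inference.py.

```python
import shlex


def build_command(model_path: str, prompt: str, output_image: str,
                  height: int = 2048, width: int = 2048) -> list:
    """Assemble the inference.py invocation shown above as an argv list."""
    return [
        "python", "inference.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--output_image", output_image,
        "--height", str(height),
        "--width", str(width),
    ]


cmd = build_command(
    "/path/to/HiDream-O1-Image",
    "A bookstore window at dusk with a hand-lettered sign reading 'OPEN LATE'",
    "results/t2i.png",
)
print(shlex.join(cmd))  # shell-safe rendering of the command

# To execute it:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain quotes or apostrophes.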

Results

Speed: Not quoted — no cited source benchmarks HiDream-O1-Image on the RTX 3090 (or any Ampere GPU) at a known throughput. The drbaph BF16 card, Saganaki22 README, official HiDream card, and ComfyUI tutorial all document precision/memory only, not inference latency. Empirical numbers will land at /check/hidream-o1-image/rtx-3090 once a community benchmark seeds the backend.
VRAM usage: ~17–20 GB peak with the BF16 model at 2048×2048. The drbaph BF16 card states verbatim: "A GPU with at least 20 GB VRAM is recommended for comfortable use at full 2048 × 2048 resolution. 24 GB cards (RTX 3090/4090, A5000, etc.) will have no issues." — the RTX 3090 is named explicitly in the comfortable-headroom tier. The Saganaki22 ComfyUI README independently corroborates "Full BF16: ~18–20 GB" and "Full FP16: ~18–20 GB", with the FP8 variant at "~10–11 GB" for users who want a smaller-card fallback. The BF16 file size on disk is 16.4 GB (hidream_o1_image_bf16.safetensors Files tab on Comfy-Org) — the runtime peak adds the Gemma 4 E4B text encoder pass and activation buffers. See /check/hidream-o1-image/rtx-3090 for live data once benchmarked.
Quality notes: HiDream-O1 specializes in long-text rendering (0.979 on LongText-Bench-EN, 0.978 on LongText-Bench-ZH per the official card) and prompt adherence (DPG-Bench 89.83, GenEval 0.90, HPSv3 10.37). The built-in Reasoning-Driven Prompt Agent rewrites raw user input through layout, subject, and physics reasoning before generation; that's the meaning of the "-O1" suffix. The model debuted at #8 in the Artificial Analysis Text to Image Arena. None of this is hardware-dependent: at the same seed and workflow, outputs should match closely across BF16-capable GPUs, although bit-exact equality across GPU architectures is not guaranteed.

For the full benchmark data, see /check/hidream-o1-image/rtx-3090.
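The VRAM figures quoted in this section reduce to simple arithmetic. The sketch below uses the Saganaki22 "~18–20 GB" peak range and the 16.4 GB checkpoint size; treating the on-disk file size as the in-memory weight footprint is an assumption, and the helper names are illustrative.

```python
CARD_VRAM_GB = 24.0           # RTX 3090
PEAK_RANGE_GB = (18.0, 20.0)  # "Full BF16: ~18-20 GB" per the Saganaki22 README
WEIGHTS_GB = 16.4             # checkpoint size on disk (assumed ~ resident weights)


def headroom(card_gb: float, peak_range: tuple) -> tuple:
    """Free VRAM left at the high and low ends of the cited peak range."""
    low, high = peak_range
    return (card_gb - high, card_gb - low)


def non_weight_overhead(weights_gb: float, peak_range: tuple) -> tuple:
    """Peak usage not explained by the checkpoint weights alone."""
    low, high = peak_range
    return (low - weights_gb, high - weights_gb)


h_min, h_max = headroom(CARD_VRAM_GB, PEAK_RANGE_GB)
o_min, o_max = non_weight_overhead(WEIGHTS_GB, PEAK_RANGE_GB)
print(f"Headroom on a 24 GB card: {h_min:.1f} to {h_max:.1f} GB")
print(f"Text encoder pass + activations: roughly {o_min:.1f} to {o_max:.1f} GB")
```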

Troubleshooting

PyTorch 2.9.x crashes on first inference

The Saganaki22/HiDream_O1-ComfyUI README and the official HF card both flag this verbatim: "PyTorch 2.9.x is not recommended due to the issue" — upstream-tagged as a Qwen3-VL-library-dependency incompatibility. Pin PyTorch to 2.8.x or 2.10+ until upstream patches land. This is architecture-agnostic and applies to the RTX 3090 the same way as to any other card.
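A quick way to confirm whether an environment falls in the flagged 2.9.x series is to parse the version string. This is a minimal sketch with a hypothetical helper name; it only inspects the major and minor numbers and ignores local build suffixes such as `+cu128`.

```python
import re


def is_affected_torch(version: str) -> bool:
    """True when the version string belongs to the PyTorch 2.9.x series."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) == (2, 9)


for candidate in ("2.8.0", "2.9.1+cu128", "2.10.0"):
    status = "avoid" if is_affected_torch(candidate) else "ok"
    print(f"{candidate}: {status}")

# In a live environment:
# import torch
# print(is_affected_torch(torch.__version__))
```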

ℹ️ Disambiguation. Gemma 4 E4B is the workflow's text encoder, configured in the ComfyUI template — that's what loads gemma4_e4b_it_bf16.safetensors. "Qwen3-VL" appears in the upstream PT 2.9.x note as the name of a library-internal compatibility tag, NOT as a model loaded by this workflow.

Out of memory at 2048×2048 with BF16

The 3090 has about 4 GB of headroom above the cited BF16 runtime envelope (~18–20 GB peak vs 24 GB of VRAM), the same envelope as the 4090. A desktop session driving a display from the same card and ComfyUI's intermediate buffers can eat into that. Two cited mitigations:

Drop the text encoder to FP8. Swap gemma4_e4b_it_bf16.safetensors (16.0 GB on disk) for gemma4_e4b_it_fp8_scaled.safetensors (9.06 GB on disk); the text encoder pass dominates the peak window. The official ComfyUI tutorial lists this as an alternative. Ampere note: on the 3090 the FP8 weights are dequantized on-the-fly to BF16 at compute time, because sm_86 has no hardware FP8 matmul, but the VRAM saving is real (the weights stay stored in FP8 between passes).
Drop the diffusion checkpoint to FP8. Swap hidream_o1_image_bf16.safetensors (16.4 GB) for hidream_o1_image_fp8_scaled.safetensors (8.07 GB on disk) — the drbaph FP8 card places the FP8 runtime peak at "~10 GB". Read the architecture caveat first: per drbaph, "On CUDA-capable GPUs with Hopper or Ada Lovelace architecture (RTX 40xx, H100), FP8 compute is hardware-accelerated. On older GPUs, weights are dequantized on-the-fly — still saving VRAM, with a small speed penalty." The 3090 (Ampere, sm_86) is in the "older GPUs / dequantized on-the-fly" bucket — you get the VRAM saving with the cited small speed penalty. Use the BF16 path (this recipe's default) if you have the headroom; FP8 is a memory escape hatch, not a speed win, on the 3090.

If neither fix is enough, drop the resolution to 1536×1536 (the loader snaps to valid internal resolutions) or switch to the Dev variant: hidream_o1_image_dev_bf16.safetensors uses 28 inference steps instead of 50. Note that the Dev checkpoint is also 16.4 GB on disk, so it shortens each run rather than shrinking the weights held in memory.
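The fallback order in this section can be expressed as a small selection helper driven by free VRAM. This is a sketch under stated assumptions: the thresholds are the upper ends of the peak figures cited in this recipe (about 20 GB for BF16, about 11 GB for FP8), the function names are hypothetical, and the measurement helper needs PyTorch with a visible CUDA device.

```python
from typing import Optional

# (checkpoint file, assumed peak VRAM in GB), largest first.
VARIANTS = [
    ("hidream_o1_image_bf16.safetensors", 20.0),
    ("hidream_o1_image_fp8_scaled.safetensors", 11.0),
]


def pick_checkpoint(free_vram_gb: float) -> Optional[str]:
    """Return the largest variant whose assumed peak fits, or None."""
    for name, peak_gb in VARIANTS:
        if free_vram_gb >= peak_gb:
            return name
    return None


def free_vram_gb() -> float:
    """Currently free VRAM on the active CUDA device, in GB."""
    import torch  # lazy import so pick_checkpoint() has no dependencies

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


for free in (22.5, 14.0, 8.0):
    print(f"{free:>5.1f} GB free -> {pick_checkpoint(free)}")
```

A `None` result means neither checkpoint fits at the assumed peaks, which is the point where lowering the resolution becomes the remaining option.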

`flash-attn` install fails

Not a problem in the ComfyUI-native path: CheckpointLoaderSimple does not require flash-attn. If you switch to the upstream CLI (python inference.py), the official model card documents the workaround: "If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 341 and change `"use_flash_attn": True` to `"use_flash_attn": False` — otherwise inference will fail to import the kernel." On a 3090 the install usually succeeds, since FA2 has full sm_86 kernel coverage from the stock wheels.
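Before editing the flag by hand, you can check whether the `flash_attn` package is importable at all. The sketch below uses the standard library only; the helper name is hypothetical, and it reports that the package is installed, not that its CUDA kernels load correctly on your GPU.

```python
import importlib.util


def flash_attn_available() -> bool:
    """True if a top-level `flash_attn` package can be found by the importer."""
    return importlib.util.find_spec("flash_attn") is not None


# Value you would want for the "use_flash_attn" setting mentioned above.
use_flash_attn = flash_attn_available()
print({"use_flash_attn": use_flash_attn})
```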

MXFP8 weights look faster but don't load

The Comfy-Org mirror also ships hidream_o1_image_mxfp8.safetensors. The official ComfyUI tutorial describes this as the "MXFP8 quantized variant" with speedup on hardware that supports fp8/mxfp8 matmuls on safe MLP layers. MXFP8 hardware matmul support is currently limited to NVIDIA Blackwell (RTX 5090 / 5080 / 5060 Ti, sm_120). The RTX 3090 is Ampere (sm_86) — it lacks BOTH the native FP8 datapath (Ada/Hopper feature) AND the MXFP8 datapath (Blackwell feature). Use hidream_o1_image_bf16.safetensors (this recipe's default) on the 3090; do not waste time on the MXFP8 variant.