What You'll Build
A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B natively-unified image foundation model, running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — generating up to 2,048 × 2,048 images from text prompts. HiDream-O1 is "a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders" per the canonical model card. On a 24 GB card you run the full BF16 diffusion weights — there is no need to quantize down, and (unlike the NVIDIA path) no FP8 escape hatch to reach for.
Hardware data: RX 7900 XTX (24GB VRAM) · BF16 · ComfyUI on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only (AMD GPUOpen, "How to accelerate AI applications on RDNA 3 using WMMA"), so an FP8 checkpoint would just upcast to BF16 with no memory saving and no compute acceleration, and at 24 GB you don't need it. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers.
ℹ️ No discrete-RDNA3 measurement yet. HiDream-O1-Image is ComfyUI-native and runs through the same ComfyUI-on-ROCm stack that AMD officially documents and tests on the RX 7900 XTX (their ComfyUI-on-Radeon tutorial states "This tutorial was tested on an AMD Radeon RX 7900 XTX GPU"). But that tutorial covers a different model — no public source confirms HiDream-O1 specifically on a discrete gfx1100 card yet. Treat the VRAM figures below as a derived envelope to verify on your first run, and please contribute your measurement.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 20 GB VRAM (BF16 path) — ROCm-supported AMD card | RX 7900 XTX (24 GB, RDNA3 / gfx1100) |
| RAM | 32 GB system (for ComfyUI model offload between encode and sample) | — |
| Storage | ~25 GB (BF16 diffusion + FP8 encoder) or ~32 GB (BF16 diffusion + BF16 encoder) | per the Comfy-Org HiDream-O1-Image mirror + Comfy-Org gemma-4 mirror |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | ComfyUI (native HiDream-O1 templates) + PyTorch (ROCm 7.2 build), Python 3.10+ | — |
HiDream-O1-Image is 8B parameters and released under the MIT License (confirmed verbatim in the canonical card: license: mit); the weights are not gated on Hugging Face — no access request is required. Architecturally it is a Qwen3-VL-class unified transformer ("architectures": ["Qwen3VLForConditionalGeneration"]), which is why the PyTorch-2.9.x caveat below matters. This recipe uses the ComfyUI-native path, which pairs the diffusion checkpoint with a separate Gemma 4 E4B text encoder per the official Comfy tutorial.
Installation
This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1, rewritten for the ROCm stack. No custom node is required.
1. Install (or update) ComfyUI
Per the ComfyUI README:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
If you already have ComfyUI, update it instead — the HiDream O1 workflow template only appears once the build ships the native loader:
cd ComfyUI
git pull
2. Install PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README (or the selector at pytorch.org/get-started/locally) before running. Avoid PyTorch 2.9.x for this model: the canonical card's May 13, 2026 update notes "PyTorch 2.9.x is not recommended due to the issue." (a Qwen3-VL backbone bug). Pin a 2.8.x or 2.10+ ROCm wheel.
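The pinning advice above can be turned into a quick pre-flight check. The sketch below is a minimal illustration, not part of ComfyUI or PyTorch: `parse_torch_version` and `check_torch_build` are hypothetical helper names, and the version strings in the loop are made-up examples in the usual `X.Y.Z+tag` wheel format. It only inspects the string, so it runs without PyTorch installed.

```python
import re


def parse_torch_version(version: str) -> tuple[int, int, str]:
    """Split a string such as '2.8.0+rocm7.2' into (major, minor, local tag)."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    local_tag = version.split("+", 1)[1] if "+" in version else ""
    return int(match.group(1)), int(match.group(2)), local_tag


def check_torch_build(version: str) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    major, minor, local_tag = parse_torch_version(version)
    warnings = []
    if (major, minor) == (2, 9):
        # The model card advises against the whole 2.9.x series.
        warnings.append("PyTorch 2.9.x is not recommended for this model")
    if not local_tag.startswith("rocm"):
        # A '+cuXYZ' or '+cpu' tag means the wrong wheel index was used.
        warnings.append(f"not a ROCm wheel (local tag: {local_tag!r})")
    return warnings


# Illustrative version strings only; substitute torch.__version__ on a real install.
for candidate in ("2.8.0+rocm7.2", "2.9.1+rocm7.2", "2.10.0+cu128"):
    print(candidate, "->", check_torch_build(candidate) or "ok")
```

On a real install, pass `torch.__version__` to `check_torch_build` right after step 2 and before downloading any weights.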
3. Install ComfyUI dependencies
pip install -r requirements.txt
4. Download the BF16 checkpoint and Gemma 4 text encoder
On a 24 GB card, download the BF16 diffusion checkpoint (the official Comfy tutorial lists hidream_o1_image_bf16.safetensors and notes the bf16 option is the "largest"). File sizes are verified from the Comfy-Org HiDream-O1-Image tree (16,365,209,824 bytes ≈ 16.37 GB) and the Comfy-Org gemma-4 tree (BF16 encoder 16,024,746,334 bytes ≈ 16.02 GB; FP8-scaled encoder 9,057,782,194 bytes ≈ 9.06 GB):
# Run from the directory that contains ComfyUI/ (cd .. if you are still inside the repo after step 3).
# Diffusion checkpoint: BF16, ~16.37 GB → models/checkpoints/
huggingface-cli download Comfy-Org/HiDream-O1-Image \
  checkpoints/hidream_o1_image_bf16.safetensors \
  --local-dir ComfyUI/models/
# Text encoder (Gemma 4 E4B): FP8-scaled, ~9.06 GB → models/text_encoders/
# (FP8 here only shrinks the file; on RDNA3 it upcasts to BF16 at load, with no accel,
# but the smaller encoder leaves more BF16-diffusion headroom on the card.)
huggingface-cli download Comfy-Org/gemma-4 \
  text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
  --local-dir ComfyUI/models/
The Comfy-Org/HiDream-O1-Image repo repackages the canonical HiDream-ai/HiDream-O1-Image weights, and Comfy-Org/gemma-4 repackages Google's gemma-4-E4B-it. If you prefer an all-BF16 stack and want to spend the extra disk, swap in text_encoders/gemma4_e4b_it_bf16.safetensors (~16.02 GB) from the same gemma-4 mirror — but see the VRAM note in Results before doing so.
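The storage figures in the Requirements table follow directly from the byte counts quoted above. The sketch below reproduces that arithmetic and adds a simple size check for catching a truncated download. The helper names (`gb`, `disk_footprint_gb`, `size_matches`) are hypothetical, and the byte counts are the ones quoted in this recipe; re-check them against the live trees if the mirrors are updated.

```python
import os

# Byte counts as quoted in the text from the Comfy-Org trees.
EXPECTED_BYTES = {
    "hidream_o1_image_bf16.safetensors": 16_365_209_824,
    "gemma4_e4b_it_fp8_scaled.safetensors": 9_057_782_194,
    "gemma4_e4b_it_bf16.safetensors": 16_024_746_334,
}


def gb(num_bytes: int) -> float:
    """Decimal gigabytes (1 GB = 1e9 bytes), matching the figures in the text."""
    return num_bytes / 1e9


def disk_footprint_gb(*filenames: str) -> float:
    """Total download size of the named files, rounded to two decimals."""
    return round(gb(sum(EXPECTED_BYTES[name] for name in filenames)), 2)


def size_matches(path: str) -> bool:
    """True only if the file exists and has exactly the expected byte count."""
    expected = EXPECTED_BYTES[os.path.basename(path)]
    return os.path.isfile(path) and os.path.getsize(path) == expected


fp8_stack = disk_footprint_gb(
    "hidream_o1_image_bf16.safetensors", "gemma4_e4b_it_fp8_scaled.safetensors"
)
bf16_stack = disk_footprint_gb(
    "hidream_o1_image_bf16.safetensors", "gemma4_e4b_it_bf16.safetensors"
)
print(f"BF16 diffusion + FP8 encoder:  {fp8_stack} GB")   # 25.42
print(f"BF16 diffusion + BF16 encoder: {bf16_stack} GB")  # 32.39
```

A size match is only a cheap sanity check. It does not replace a hash comparison against the values shown on the Hugging Face file page.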
Running
Launch ComfyUI from the repo root:
python main.py --use-pytorch-cross-attention
ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). Passing --use-pytorch-cross-attention forces it explicitly — per ComfyUI's cli_args.py the flag is documented as "Use the new pytorch 2.0 cross attention function." On RDNA3 this is the correct attention path; do not install xformers or a FlashAttention wheel.
This starts the server (default http://127.0.0.1:8188). In the UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation" per the official ComfyUI tutorial. The bundled template wires a checkpoint loader (hidream_o1_image_bf16.safetensors) and a text encoder (gemma4_e4b_it_fp8_scaled.safetensors), with default sampling of 50 inference steps at up to 2048×2048. Enter a prompt and queue; generated PNGs land in ComfyUI/output/ with the full workflow embedded. For a lighter run, the "HiDream O1 Dev" template uses 28 inference steps instead of 50.
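Once the template works in the browser, the same workflow can be queued from a script. The sketch below assumes the workflow has been exported in ComfyUI's API format to a JSON file, and it posts that JSON to the server's `/prompt` endpoint in the same way as ComfyUI's bundled API example script. The function names and the file path in the usage note are hypothetical, and the response is returned as-is without assuming its fields.

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default address mentioned above


def build_queue_request(workflow: dict, base_url: str = COMFY_URL) -> urllib.request.Request:
    """Wrap an API-format workflow in the JSON body that POST /prompt expects."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow_file(path: str, base_url: str = COMFY_URL) -> dict:
    """Load an API-format workflow from disk and queue it on a running server."""
    with open(path, "r", encoding="utf-8") as handle:
        workflow = json.load(handle)
    request = build_queue_request(workflow, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running, something like `queue_workflow_file("hidream_o1_api.json")` (a placeholder filename) would submit the job; the images still land in ComfyUI/output/.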
Results
- Speed: Not quoted. The backend has no benchmark for this pair yet (/check/hidream-o1-image/rx-7900-xtx returns verdict: unknown), and no public source benchmarks HiDream-O1-Image on a discrete RX 7900 XTX (gfx1100) at a known throughput. Quoting a number measured on different hardware would mislead. If you measure it/s or seconds-per-image on a 7900 XTX, please contribute it so it lands at /check/hidream-o1-image/rx-7900-xtx.
- VRAM usage: Plan on ~17–20 GB in the BF16 path. This is a derived envelope, not a measured peak. The arithmetic: the BF16 diffusion checkpoint is ~16.37 GB resident (Comfy-Org tree). The Gemma 4 E4B encoder (~9.06 GB FP8 on disk, upcast to BF16 at load) runs first, and ComfyUI then offloads it from VRAM before the diffusion sampling pass, so the two heavy components are not both resident at the sampling peak. With activations and ComfyUI buffers on top of the ~16.37 GB of diffusion weights, the run sits in the high teens, comfortably within the 24 GB of the 7900 XTX. ⚠️ Do not load the BF16 encoder (~16.02 GB) and BF16 diffusion (~16.37 GB) both fully resident: that totals ~32 GB and overflows 24 GB. If you choose the all-BF16 encoder, ensure ComfyUI's smart-memory offload is active (the default) so the encoder is freed before sampling. See /check/hidream-o1-image/rx-7900-xtx for community-submitted measurements once they land.
- Quality notes: HiDream-O1 specializes in dense prompt alignment (DPG-Bench Overall 89.83) and compositional generation (GenEval Overall 0.90), per the official card's evaluation tables. Its native resolution ceiling is 2,048 × 2,048; the built-in Reasoning-Driven Prompt Agent (the "-O1" in the name) rewrites raw prompts through layout and text-rendering reasoning before generation. There is no quantization tradeoff to weigh on this card — run the native BF16 weights.
For the full benchmark data and other-GPU comparisons, see /check/hidream-o1-image/rx-7900-xtx.
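The VRAM envelope in the Results list is plain arithmetic on the checkpoint sizes, and it can be written out explicitly. The sketch below uses only numbers quoted in this recipe. The 17–20 GB range is the planning envelope from the text, not a measurement, and the card is treated as 24e9 bytes to match the text's decimal units. All variable and function names are illustrative.

```python
CARD_GB = 24.0                                # card capacity as used in the text
DIFFUSION_BF16_GB = 16_365_209_824 / 1e9      # BF16 diffusion checkpoint
ENCODER_BF16_GB = 16_024_746_334 / 1e9        # optional BF16 Gemma encoder
ENVELOPE_GB = (17.0, 20.0)                    # derived planning range, NOT measured


def fits(resident_gb: float, card_gb: float = CARD_GB) -> bool:
    """True if the given resident footprint is within the card's VRAM."""
    return resident_gb <= card_gb


# Activation and buffer overhead implied by the envelope once the weights are subtracted.
implied_overhead = tuple(round(e - DIFFUSION_BF16_GB, 2) for e in ENVELOPE_GB)

# Headroom left under 24 GB at each end of the envelope.
headroom = tuple(round(CARD_GB - e, 2) for e in ENVELOPE_GB)

# The case the warning is about: encoder and diffusion weights resident together.
both_resident = round(DIFFUSION_BF16_GB + ENCODER_BF16_GB, 2)

print("implied overhead (GB):", implied_overhead)          # (0.63, 3.63)
print("headroom (GB):", headroom)                          # (7.0, 4.0)
print("both BF16 resident (GB):", both_resident, "fits:", fits(both_resident))
```

If a measured peak on your card falls outside the implied overhead range, that is exactly the kind of data point worth contributing.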
Troubleshooting
"Torch not compiled with CUDA enabled"
A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, reinstall against the ROCm wheel index:
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
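A slightly fuller version of that one-liner distinguishes a CUDA wheel from a ROCm wheel with no visible GPU. The sketch below relies on `torch.version.hip`, which is a version string on ROCm builds and `None` otherwise. `describe_backend` is a hypothetical helper, and the import is guarded so the script still runs where PyTorch is missing.

```python
def describe_backend(version: str, hip_version, device_available: bool) -> str:
    """Summarise which backend a PyTorch build targets and whether a GPU is visible."""
    if hip_version is None:
        return f"{version}: not a ROCm build (torch.version.hip is None)"
    if not device_available:
        return f"{version}: ROCm build (HIP {hip_version}) but no GPU visible"
    return f"{version}: ROCm build (HIP {hip_version}), GPU detected"


try:
    import torch
except ImportError:
    print("PyTorch is not installed in this environment")
else:
    available = torch.cuda.is_available()  # ROCm devices appear under the cuda namespace
    print(describe_backend(torch.__version__, torch.version.hip, available))
    if available:
        print("device 0:", torch.cuda.get_device_name(0))
```

A "not a ROCm build" line means the reinstall above did not take effect in the active environment. A "no GPU visible" line points at the driver or device permissions rather than at the wheel.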
PyTorch 2.9.x crashes on first inference
The canonical HiDream-ai/HiDream-O1-Image card flags this in its May 13, 2026 update: "PyTorch 2.9.x is not recommended due to the issue." — a bug in the Qwen3-VL backbone this model is built on. Pin a 2.8.x or 2.10+ ROCm wheel until upstream patches land. This is architecture-agnostic and applies on ROCm exactly as on CUDA.
VRAM creeps toward the limit at 2048×2048
If the run climbs toward 24 GB on aggressive samplers, force ComfyUI to offload the text encoder (and other idle models) to system RAM rather than keeping them in VRAM. Per ComfyUI's cli_args.py, --disable-smart-memory is documented as "Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can.":
python main.py --use-pytorch-cross-attention --disable-smart-memory
You can also drop the resolution to 1536×1536, use the FP8-scaled Gemma encoder (the default above) instead of the BF16 one, or switch to the Dev template (28 steps instead of 50). If repeated runs become unstable under the ROCm memory manager, --disable-pinned-memory ("Disable pinned memory use." per cli_args.py) is a flag worth trying.
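If you switch between these launch variants often, assembling the command line in one place avoids typos in the flags. The sketch below is a small convenience wrapper and an assumption of this rewrite, not a ComfyUI feature. `comfy_launch_args` is a hypothetical name, and it only uses the three flags discussed in this recipe.

```python
import shlex


def comfy_launch_args(offload_aggressively: bool = False,
                      pinned_memory: bool = True) -> list[str]:
    """Build the argv for launching ComfyUI with the flags discussed above."""
    args = ["python", "main.py", "--use-pytorch-cross-attention"]
    if offload_aggressively:
        # Push idle models to system RAM instead of keeping them in VRAM.
        args.append("--disable-smart-memory")
    if not pinned_memory:
        args.append("--disable-pinned-memory")
    return args


# Print the variants so they can be copied into a shell.
print(shlex.join(comfy_launch_args()))
print(shlex.join(comfy_launch_args(offload_aggressively=True)))
print(shlex.join(comfy_launch_args(offload_aggressively=True, pinned_memory=False)))
```

The returned list can also be handed to `subprocess.run(..., cwd=<ComfyUI repo root>)` to start the server from a script.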
Do not install xformers or FlashAttention
Hugging Face and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention. The Comfy-Org mirror also ships an MXFP8 checkpoint variant (hidream_o1_image_mxfp8.safetensors), which targets MXFP8 matmuls on NVIDIA Blackwell. It has no RDNA3 datapath, so use the plain BF16 checkpoint on this card.
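The choice of checkpoint variant comes down to whether its precision is one the RDNA3 WMMA units accept. The sketch below encodes the dtype list quoted earlier from AMD GPUOpen and reads the precision tag from the filenames used in this recipe. The filename parsing is an assumption about the naming pattern, not a ComfyUI or Hugging Face convention, and both function names are illustrative.

```python
# Input types the text lists for RDNA3 WMMA (per the AMD GPUOpen article it cites).
RDNA3_WMMA_DTYPES = {"fp16", "bf16", "int8", "int4"}

# Precision tags that appear in the filenames discussed in this recipe.
KNOWN_TAGS = ("mxfp8", "fp8", "fp4", "bf16", "fp16", "int8", "int4")


def precision_tag(filename: str):
    """Return the first known precision tag in an underscore-separated filename, or None."""
    stem = filename.rsplit(".", 1)[0].lower()
    for token in stem.split("_"):
        if token in KNOWN_TAGS:
            return token
    return None


def has_native_wmma_path(filename: str) -> bool:
    """True if the file's precision is one RDNA3 WMMA accepts without upcasting."""
    return precision_tag(filename) in RDNA3_WMMA_DTYPES


for name in (
    "hidream_o1_image_bf16.safetensors",
    "hidream_o1_image_mxfp8.safetensors",
    "gemma4_e4b_it_fp8_scaled.safetensors",
):
    print(f"{name}: tag={precision_tag(name)}, native WMMA path={has_native_wmma_path(name)}")
```

A `False` result does not mean the file cannot load. It means the weights would be upcast before compute, as described for the FP8-scaled encoder.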