How much VRAM does HiDream-O1-Image need?

About 14 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

HiDream-O1-Image on RX 7800 XT: 2048×2048 Text-to-Image in ComfyUI on ROCm (16 GB, Encoder-Offload)

What You'll Build

A local text-to-image and instruction-edit pipeline for HiDream-O1-Image, an 8B natively-unified image foundation model, running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack — generating up to 2,048 × 2,048 images from text prompts. HiDream-O1 is "a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders" per the canonical model card. On a 16 GB card the BF16 diffusion checkpoint (~16.37 GB) does not fit alongside a resident text encoder, so this recipe leads with the Gemma 4 text encoder offloaded to system RAM and lets ComfyUI stream the diffusion weights — there is no FP8 shortcut on AMD (see the note below).

Hardware data: RX 7800 XT (16GB VRAM) · BF16 DiT + CPU-offloaded encoder · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA — and FP8 does NOT save VRAM here. The RX 7800 XT runs on AMD's ROCm/HIP stack: there is no cu124/cu128 wheel, no xformers install. Crucially, RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only (AMD GPUOpen, "How to accelerate AI applications on RDNA 3 using WMMA"), so an FP8-scaled checkpoint upcasts to BF16 at load — it shrinks the download, but the resident weights are the same ~16.37 GB and there is no compute acceleration. That is why the NVIDIA 16 GB recipes for this model lean on a real FP8 datapath to fit, and this one cannot: the only way the 8B BF16 DiT runs on 16 GB AMD is by offloading the text encoder and letting ComfyUI stream diffusion blocks to system RAM. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers.

ℹ️ No discrete-RDNA3 measurement yet, and no GGUF for HiDream-O1. HiDream-O1-Image is ComfyUI-native and runs through the same ComfyUI-on-ROCm stack that AMD officially documents — but their ComfyUI-on-Radeon tutorial states "This tutorial was tested on an AMD Radeon RX 7900 XTX GPU" (a different, 24 GB card) and covers a different model — no public source confirms HiDream-O1 specifically on a discrete gfx1101 / RX 7800 XT card. There is also no published GGUF quant for HiDream-O1-Image (the only city96 HiDream GGUFs are for the older, unrelated HiDream-I1 model — do not substitute them), so the GGUF squeeze that other 16 GB recipes use is unavailable here. Treat the VRAM figures below as a derived envelope to verify on your first run, and please contribute your measurement.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (with encoder offload + ComfyUI weight streaming) — ROCm-supported AMD card	RX 7800 XT (16 GB, RDNA3 / gfx1101)
RAM	32 GB system (the Gemma encoder offloads here; ComfyUI streams diffusion blocks here too)	—
Storage	~25 GB (BF16 diffusion ~16.37 GB + FP8-scaled encoder ~9.06 GB)	per the Comfy-Org HiDream-O1-Image mirror + Comfy-Org gemma-4 mirror
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI (native HiDream-O1 templates) + PyTorch (ROCm 7.2 build), Python 3.10+	—

HiDream-O1-Image is 8B parameters and released under the MIT License (confirmed verbatim in the canonical card: license: mit); the weights are not gated on Hugging Face — no access request is required for the diffusion checkpoint. Architecturally it is a Qwen3-VL-class unified transformer ("architectures": ["Qwen3VLForConditionalGeneration"]), which is why the PyTorch-2.9.x caveat below matters. This recipe uses the ComfyUI-native path, which pairs the diffusion checkpoint with a separate Gemma 4 E4B text encoder per the official Comfy tutorial.

Installation

This recipe targets the official ComfyUI-native path documented at docs.comfy.org/tutorials/image/hidream/hidream-o1, rewritten for the ROCm stack with the 16 GB offload configuration. No custom node is required.

1. Install (or update) ComfyUI

Per the ComfyUI README:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

If you already have ComfyUI, update it instead — the HiDream O1 workflow template only appears once the build ships the native loader (it also gives you the dynamic-VRAM weight streaming this 16 GB recipe relies on):

cd ComfyUI
git pull

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README (or the selector at pytorch.org/get-started/locally) before running. Avoid PyTorch 2.9.x for this model: the canonical card's May 13, 2026 update notes "PyTorch 2.9.x is not recommended due to the issue." — a Qwen3-VL backbone bug. Pin a 2.8.x or 2.10+ ROCm wheel.
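As a quick guard after installing, the 2.9.x caveat can be checked programmatically. The following is a minimal sketch, not part of ComfyUI or the model card: the helper names `torch_minor` and `is_discouraged` are hypothetical, and the only rule encoded is the one quoted above (avoid the 2.9.x series).

```python
import re


def torch_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.9.1+rocm7.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_discouraged(version: str) -> bool:
    """True for any 2.9.x build, the series the model card advises against."""
    return torch_minor(version) == (2, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        status = "avoid (2.9.x)" if is_discouraged(torch.__version__) else "ok"
        print(f"torch {torch.__version__}: {status}")
```

Run it inside the same virtual environment you installed the wheel into; the check only reads the version string and does not touch the GPU.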

3. Install ComfyUI dependencies

pip install -r requirements.txt

4. Download the BF16 checkpoint and FP8-scaled Gemma 4 text encoder

Download the BF16 diffusion checkpoint (the official Comfy tutorial lists hidream_o1_image_bf16.safetensors and notes the bf16 option is the "largest") and the FP8-scaled Gemma 4 E4B encoder. File sizes are verified from the Comfy-Org HiDream-O1-Image tree (16,365,209,824 bytes ≈ 16.37 GB) and the Comfy-Org gemma-4 tree (FP8-scaled encoder 9,057,782,194 bytes ≈ 9.06 GB; BF16 encoder 16,024,746,334 bytes ≈ 16.02 GB):

# Run these from the directory that CONTAINS ComfyUI/ (cd .. if you are still inside the repo)
# Diffusion checkpoint: BF16, ~16.37 GB → models/checkpoints/
huggingface-cli download Comfy-Org/HiDream-O1-Image \
    checkpoints/hidream_o1_image_bf16.safetensors \
    --local-dir ComfyUI/models/

# Text encoder (Gemma 4 E4B) — FP8-scaled, ~9.06 GB → models/text_encoders/
# The FP8 here only shrinks the DOWNLOAD; on RDNA3 it upcasts to BF16 at load
# (no accel, no resident-VRAM win). It is downloaded only to save disk — the
# encoder runs on CPU on this 16 GB card regardless of its dtype (see Running).
huggingface-cli download Comfy-Org/gemma-4 \
    text_encoders/gemma4_e4b_it_fp8_scaled.safetensors \
    --local-dir ComfyUI/models/

The Comfy-Org/HiDream-O1-Image repo repackages the canonical HiDream-ai/HiDream-O1-Image weights, and Comfy-Org/gemma-4 repackages Google's gemma-4-E4B-it. Do not download the hidream_o1_image_fp8_scaled.safetensors (8.07 GB) or hidream_o1_image_mxfp8.safetensors checkpoint to try to fit 16 GB: on RDNA3 the FP8-scaled DiT upcasts to the same ~16.37 GB BF16 resident footprint, and the MXFP8 variant targets NVIDIA Blackwell mxfp8 matmuls with no RDNA3 datapath. The memory you save on this card comes from offloading the encoder and streaming the DiT, not from the checkpoint's on-disk dtype.
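Because both files are multi-gigabyte downloads, it can be worth confirming that they arrived complete before launching. The sketch below compares on-disk sizes against the byte counts quoted above; the function names and the `ComfyUI/models` location are assumptions that mirror the `--local-dir` used in the download commands, so adjust the path if your layout differs.

```python
from pathlib import Path

# Byte counts quoted above from the Comfy-Org file listings.
EXPECTED_BYTES = {
    "checkpoints/hidream_o1_image_bf16.safetensors": 16_365_209_824,
    "text_encoders/gemma4_e4b_it_fp8_scaled.safetensors": 9_057_782_194,
}


def check_download(models_dir: Path, relative_path: str, expected: int) -> str:
    """Compare a file's on-disk size with the expected byte count."""
    target = models_dir / relative_path
    if not target.is_file():
        return "missing"
    actual = target.stat().st_size
    return "ok" if actual == expected else f"size mismatch ({actual} bytes)"


def total_gb(sizes: dict[str, int]) -> float:
    """Total download size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return sum(sizes.values()) / 1e9


if __name__ == "__main__":
    models = Path("ComfyUI/models")  # assumed location, matching --local-dir above
    for rel, size in EXPECTED_BYTES.items():
        print(f"{rel}: {check_download(models, rel, size)}")
    print(f"total: {total_gb(EXPECTED_BYTES):.2f} GB")
```

A size match is only a completeness check, not an integrity check; it will not catch corrupted bytes in a file of the right length.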

Running

Launch ComfyUI from the repo root with the offload flags below. Keeping the Gemma encoder out of VRAM and streaming the DiT from system RAM is what lets the 16.37 GB BF16 checkpoint run on a 16 GB card:

python main.py --use-pytorch-cross-attention --lowvram

On a build with dynamic VRAM enabled (the default on recent ComfyUI), the loader automatically partial-offloads diffusion weights to system RAM when the model exceeds available VRAM, so the 16.37 GB DiT streams block-by-block rather than requiring 16.37 GB resident at once. --lowvram covers builds where dynamic VRAM is not in use. Per ComfyUI's cli_args.py the flag is documented as "Doesn't do anything if dynamic vram is enabled. If dynamic vram isn't being used this option makes the text encoders run on the CPU." So with dynamic VRAM active the flag is a no-op and weight placement is left to the dynamic loader; without it, the flag moves the Gemma 4 encoder to the CPU. The Gemma 4 encoder runs its pass first, then the diffusion sampling pass runs with the encoder out of VRAM.

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA); passing --use-pytorch-cross-attention forces it explicitly. Per cli_args.py the flag is "Use the new pytorch 2.0 cross attention function." On RDNA3 this is the correct attention path; do not install xformers or a FlashAttention wheel.

This starts the server (default http://127.0.0.1:8188). In the UI, open Workflow → Browse Templates → Image and load "HiDream O1 Full: Image generation" per the official ComfyUI tutorial. The bundled template wires a checkpoint loader (hidream_o1_image_bf16.safetensors) and a text encoder (gemma4_e4b_it_fp8_scaled.safetensors), with default sampling of 50 inference steps at up to 2048×2048. Enter a prompt and queue; generated PNGs land in ComfyUI/output/ with the full workflow embedded. On 16 GB, start at 1024×1024 to confirm the run is stable before pushing toward 2048×2048 — the streaming/offload path is slower the more it has to swap. For a lighter run, the "HiDream O1 Dev" template uses 28 inference steps instead of 50.
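To get a feel for how much lighter the 1024×1024 and 28-step options are, a rough proxy is pixels multiplied by steps, relative to the 2048×2048, 50-step default. This is a deliberately crude sketch under that assumption: the function `relative_work` is hypothetical, attention cost typically grows faster than linearly with pixel count, and the offload/streaming overhead is not modeled, so treat the ratios as orientation rather than predictions.

```python
def relative_work(width: int, height: int, steps: int,
                  ref_width: int = 2048, ref_height: int = 2048,
                  ref_steps: int = 50) -> float:
    """Pixels x steps relative to the reference run (a crude proxy only)."""
    return (width * height * steps) / (ref_width * ref_height * ref_steps)


if __name__ == "__main__":
    for label, args in [
        ("1024x1024, 50 steps", (1024, 1024, 50)),
        ("2048x2048, 28 steps", (2048, 2048, 28)),
        ("1024x1024, 28 steps", (1024, 1024, 28)),
    ]:
        print(f"{label}: {relative_work(*args):.2f}x the default workload")
```

The first stability run at 1024×1024 processes a quarter of the pixels per step, which is why it is a sensible starting point before attempting the full resolution.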

Results

Speed: Not quoted. The backend has no benchmark for this pair yet — /check/hidream-o1-image/rx-7800-xt returns verdict: unknown — and no public source benchmarks HiDream-O1-Image on a discrete RX 7800 XT (gfx1101) at a known throughput. The 7800 XT also has materially less memory bandwidth than the 7900 XTX (624 vs 960 GB/s), and the encoder-offload + DiT-streaming path on a 16 GB card adds CPU↔GPU transfer overhead, so any 7900 XTX number would over-state this card. Quoting a figure measured on different hardware would mislead. If you measure it/s or seconds-per-image on a 7800 XT, please contribute it so it lands at /check/hidream-o1-image/rx-7800-xt.
VRAM usage: Plan on ~14 GB resident as a derived envelope, not a measured peak. The reasoning: the BF16 diffusion checkpoint is ~16.37 GB on disk (Comfy-Org tree), but with the Gemma encoder kept out of VRAM (by dynamic VRAM, or by --lowvram on builds without it) and the diffusion blocks streamed from system RAM, the resident footprint stays under the 16 GB card's budget, at the cost of speed. There is no FP8 or GGUF path to lower this on AMD (FP8 upcasts to BF16; no HiDream-O1 GGUF exists), so the offload/stream config is the fit mechanism. ⚠️ Avoid the BF16 encoder (~16.02 GB) on this card: resident in VRAM it would OOM the 16 GB GPU outright, and offloaded to the CPU it still occupies system RAM that the streamed DiT blocks also need. See /check/hidream-o1-image/rx-7800-xt for community-submitted measurements once they land.
Quality notes: HiDream-O1 specializes in dense prompt alignment (DPG-Bench Overall 89.83) and compositional generation (GenEval Overall 0.90), per the official card's evaluation tables. Its native resolution ceiling is 2,048 × 2,048; the built-in Reasoning-Driven Prompt Agent (the "-O1" in the name) rewrites raw prompts through layout and text-rendering reasoning before generation. Image quality is unchanged by the 16 GB offload config — you run the same BF16 weights as the 24 GB card; only throughput differs.

For the full benchmark data and other-GPU comparisons, see /check/hidream-o1-image/rx-7800-xt.
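The VRAM envelope above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: it assumes the card's 16 GB is 16 GiB of physical VRAM, takes the ~14 GB planning envelope from this recipe, and uses a hypothetical 2 GB reserve for activations and buffers, which is not a measured value.

```python
GIB = 2**30

DIT_BYTES = 16_365_209_824     # BF16 diffusion checkpoint, from the file listing
CARD_BYTES = 16 * GIB          # assumption: the card's 16 GB is 16 GiB of VRAM
RESIDENT_TARGET_BYTES = 14e9   # the ~14 GB planning envelope from this recipe


def streamed_bytes(weights: float, resident_budget: float, reserve: float) -> float:
    """Bytes of weights that cannot stay resident once `reserve` is set aside."""
    room_for_weights = max(resident_budget - reserve, 0.0)
    return max(weights - room_for_weights, 0.0)


if __name__ == "__main__":
    print(f"DiT weights: {DIT_BYTES / 1e9:.2f} GB = {DIT_BYTES / GIB:.2f} GiB")
    print(f"headroom if fully resident: {(CARD_BYTES - DIT_BYTES) / GIB:.2f} GiB")
    reserve = 2e9  # hypothetical allowance for activations and buffers
    spill = streamed_bytes(DIT_BYTES, RESIDENT_TARGET_BYTES, reserve)
    print(f"streamed from system RAM: {spill / 1e9:.2f} GB")
```

Under these assumptions the weights alone would leave well under 1 GiB of headroom if kept fully resident, which is why the recipe relies on streaming rather than trying to hold the whole DiT in VRAM. Replace the reserve with your own measured peak once you have one.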

Troubleshooting

Out of memory loading the 16.37 GB BF16 checkpoint on 16 GB

The 16.37 GB BF16 DiT cannot sit fully resident on a 16 GB card with room left for activations, so it must stream. Confirm dynamic VRAM is active (default on recent ComfyUI; otherwise add --enable-dynamic-vram). On a build without dynamic VRAM, confirm you launched with --lowvram so the text encoder runs on the CPU. If it still OOMs, escalate to --novram (per ComfyUI's cli_args.py it is documented as "When lowvram isn't enough.") and drop the resolution to 1024×1024:

python main.py --use-pytorch-cross-attention --novram

Do not reach for an FP8 or MXFP8 checkpoint to "save VRAM" — on RDNA3 they upcast to BF16 and save nothing resident (see What You'll Build).
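The escalation described in this section can be written down as a small ladder so the next command to try is unambiguous. This is a sketch that only assembles command lines from the flags already named in this recipe; `launch_command` and the ladder ordering are illustrative helpers, not a ComfyUI feature.

```python
BASE = ["python", "main.py", "--use-pytorch-cross-attention"]

# Ordered from least to most aggressive, as described in this section.
LADDER = ["--lowvram", "--novram"]


def launch_command(step: int, disable_pinned_memory: bool = False) -> list[str]:
    """Build the launch command for rung `step` of the ladder (0-based)."""
    if not 0 <= step < len(LADDER):
        raise ValueError(f"step must be 0..{len(LADDER) - 1}")
    command = BASE + [LADDER[step]]
    if disable_pinned_memory:
        command.append("--disable-pinned-memory")
    return command


if __name__ == "__main__":
    for step in range(len(LADDER)):
        print(" ".join(launch_command(step)))
```

Move one rung at a time and lower the resolution alongside it, so that when a run succeeds you know which change made the difference.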

"Torch not compiled with CUDA enabled"

A CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

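Afterwards, python -c "import torch; print(torch.__version__)" should print a version with a +rocm suffix (for example +rocm7.2), and torch.cuda.is_available() should return True. ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP, so the "cuda" device name is expected here.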
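For a slightly more explicit check than the one-liner, the build type can be read from `torch.version.hip`, which is a version string on ROCm builds and None otherwise. The sketch below wraps that in a hypothetical helper, `describe_build`, so the classification logic is separate from the import; the exact wording of the messages is illustrative.

```python
def describe_build(version: str, hip_version, gpu_visible: bool) -> str:
    """Classify a PyTorch install from three values read off the torch module."""
    if hip_version is None:
        return f"{version}: not a ROCm build (torch.version.hip is None)"
    if not gpu_visible:
        return f"{version}: ROCm build (HIP {hip_version}), but no GPU is visible"
    return f"{version}: ROCm build (HIP {hip_version}), GPU visible"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(describe_build(torch.__version__, torch.version.hip,
                             torch.cuda.is_available()))
        if torch.cuda.is_available():
            print("device 0:", torch.cuda.get_device_name(0))
```

If the output reports a ROCm build with no visible GPU, the wheel is correct and the problem is more likely in the driver or device permissions than in the pip install.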

PyTorch 2.9.x crashes on first inference

The canonical HiDream-ai/HiDream-O1-Image card flags this in its May 13, 2026 update: "PyTorch 2.9.x is not recommended due to the issue." The issue is a bug in the Qwen3-VL backbone this model is built on. Pin a 2.8.x or 2.10+ ROCm wheel until upstream patches land. The bug is independent of the GPU backend and applies on ROCm exactly as on CUDA.

Runs but stalls or destabilizes on the ROCm memory manager

Large model loads can stall on the RDNA3 ROCm memory manager. Per ComfyUI's cli_args.py, --disable-pinned-memory ("Disable pinned memory use.") is the usual flag to try on ROCm when this happens:

python main.py --use-pytorch-cross-attention --lowvram --disable-pinned-memory

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.