How much VRAM does ERNIE-Image-Turbo need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

ERNIE-Image-Turbo on RX 7900 XTX: 8-step text-to-image via ComfyUI on ROCm

What You'll Build

A local ERNIE-Image-Turbo text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack — producing 1024×1024 images from text prompts in 8 inference steps. Baidu's ERNIE-Image-Turbo is the distilled 8B Diffusion Transformer release of ERNIE-Image, optimized (per the model card) with DMD and RL for 8-step generation. With 24 GB of VRAM you have two clean paths on this card: lead with the GGUF Q8_0 quant (ernie-image-turbo-Q8_0.gguf, 8.69 GB on disk) loaded through city96's ComfyUI-GGUF custom node, or run the full BF16 weights (16.07 GB diffusion model) — both fit, with no need for any FP8 path.

Hardware data: RX 7900 XTX (24 GB VRAM) · 8 inference steps · GGUF Q8_0 / BF16 · ComfyUI on ROCm 7.2

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only (AMD GPUOpen, "How to accelerate AI applications on RDNA 3 using WMMA"), so an FP8 checkpoint would just upcast to BF16 with no memory saving — and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB)
RAM	16 GB system	—
Storage	~17 GB: Q8_0 GGUF (8.69 GB) + Ministral-3B text encoder (7.72 GB) + Flux2 VAE (0.34 GB) — or ~24 GB for the full BF16 path	per HF tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), ComfyUI-GGUF, Python 3.10+	—
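The storage figures in the table can be checked with a quick sum. This is a minimal sketch that only adds up the file sizes quoted above; the dictionary keys and the helper name storage_gb are illustrative, and the optional prompt enhancer is not counted.

```python
# File sizes in GB, copied from the Requirements table.
SIZES_GB = {
    "q8_0_gguf": 8.69,        # Path A diffusion weights
    "bf16_diffusion": 16.07,  # Path B diffusion weights
    "text_encoder": 7.72,     # Ministral-3B
    "vae": 0.34,              # Flux2 VAE
}


def storage_gb(diffusion_key: str) -> float:
    """Download size for one path: diffusion weights + text encoder + VAE."""
    total = SIZES_GB[diffusion_key] + SIZES_GB["text_encoder"] + SIZES_GB["vae"]
    return round(total, 2)


print("Path A (Q8_0):", storage_gb("q8_0_gguf"), "GB")
print("Path B (BF16):", storage_gb("bf16_diffusion"), "GB")
```

The two totals come out at roughly 17 GB and 24 GB, matching the Storage row.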

The model is released under the Apache-2.0 license and the weights are not gated on Hugging Face — no access request or login is required to download them. Baidu's card states ERNIE-Image-Turbo "can run on consumer GPUs with 24G VRAM" (HF card, "Practical deployment" highlight) — the RX 7900 XTX sits exactly at that 24 GB floor, so unlike a 16 GB card this recipe never has to squeeze: you run the GGUF Q8_0 (near-BF16 quality at half the disk size) or the full BF16 weights directly. ERNIE-Image-Turbo uses a Ministral-3B text encoder and a Flux2 VAE, confirmed by the Comfy-Org/ERNIE-Image repackager file layout (text_encoders/ministral-3-3b.safetensors, vae/flux2-vae.safetensors).

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) is also listed. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) that the README lists for broader RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.

3. Install ComfyUI dependencies and the ComfyUI-GGUF node

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

For the GGUF Q8_0 path, install city96's ComfyUI-GGUF loader node into custom_nodes and the gguf package (the README's install steps). Step 1 left you inside the ComfyUI checkout, so the clone target is relative to that directory:

git clone https://github.com/city96/ComfyUI-GGUF custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

Restart ComfyUI after install — the GGUF Unet loader node appears under the bootleg category.

Note on the GGUF source and how it runs on ROCm. city96 publishes the ComfyUI-GGUF loader node, not the ERNIE quant weights — there is no city96/ERNIE-Image-Turbo-gguf repo. The quant weights come from the unsloth/ERNIE-Image-Turbo-GGUF repo in step 4. The GGUF format was popularized by llama.cpp (per the ComfyUI-GGUF README), but the loader node dequantizes the weights and runs them through ComfyUI's own PyTorch runtime — so on this card the compute goes through PyTorch's ROCm/HIP backend, exactly like the BF16 path. There is no separate llama.cpp process to build; if PyTorch-for-ROCm works (step 2), the GGUF path works.

4. Download the weights

Path A — GGUF Q8_0 (recommended, comfortable headroom). Pick the Q8_0 quant from the unsloth/ERNIE-Image-Turbo-GGUF repo — ernie-image-turbo-Q8_0.gguf, 8.69 GB on disk. The unsloth card is a GGUF quant of the canonical baidu/ERNIE-Image-Turbo upstream (linked via its base_model) and credits city96's ComfyUI-GGUF as the loader tooling. GGUF diffusion-model files live in ComfyUI/models/unet:

# from your ComfyUI root (the directory that contains main.py)
huggingface-cli download unsloth/ERNIE-Image-Turbo-GGUF \
  ernie-image-turbo-Q8_0.gguf \
  --local-dir models/unet

Path B — full BF16 single-file (also fits 24 GB). The Comfy-Org/ERNIE-Image repackager (the ComfyUI core team's repackaging into ComfyUI's expected layout) ships the BF16 diffusion model as a single safetensors (16.07 GB). On a 24 GB card this loads with the standard Load Diffusion Model node — no GGUF loader needed:

# from your ComfyUI root (the directory that contains main.py): BF16 diffusion model (16.07 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  diffusion_models/ernie-image-turbo.safetensors \
  --local-dir models/

5. Download the text encoder and VAE

Both paths need the same auxiliary files. Pull them from the Comfy-Org/ERNIE-Image repackager:

# from your ComfyUI root (the directory that contains main.py): text encoder (Ministral-3-3B, 7.72 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ministral-3-3b.safetensors \
  --local-dir models/

# optional prompt enhancer (6.88 GB), only if you enable use_pe
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ernie-image-prompt-enhancer.safetensors \
  --local-dir models/

# VAE (Flux2 VAE, 0.34 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  vae/flux2-vae.safetensors \
  --local-dir models/

The official ComfyUI ERNIE-Image tutorial lists the same three auxiliary files — ministral-3-3b.safetensors (text encoder), ernie-image-prompt-enhancer.safetensors (prompt enhancer text encoder), and flux2-vae.safetensors (VAE) — under the expected text_encoders/ and vae/ layout.
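Before opening the workflow, it can help to confirm that the downloads from steps 4 and 5 landed where ComfyUI expects them. The following is a minimal sketch: the helper name missing_files and the "gguf" / "bf16" path labels are hypothetical, while the relative file paths are the ones named in this recipe. Pass it the path to your ComfyUI models directory.

```python
from pathlib import Path

# Relative paths under the ComfyUI models directory, as named in steps 4 and 5.
COMMON = [
    "text_encoders/ministral-3-3b.safetensors",
    "vae/flux2-vae.safetensors",
]
DIFFUSION = {
    "gguf": "unet/ernie-image-turbo-Q8_0.gguf",
    "bf16": "diffusion_models/ernie-image-turbo.safetensors",
}


def missing_files(models_dir, path="gguf"):
    """Return the expected files that are not present under models_dir."""
    root = Path(models_dir)
    wanted = [DIFFUSION[path]] + COMMON
    return [rel for rel in wanted if not (root / rel).is_file()]
```

An empty list means every required file for the chosen path is in place; anything returned is a file to re-download or move.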

6. Load the Turbo workflow template

The official ComfyUI tutorial documents loading the ERNIE-Image flow: update ComfyUI to the latest version (or use Comfy Cloud), open Template and search for ERNIE-Image, select the ERNIE-Image workflow, then download any missing models, update the prompt, and click Run. For Turbo specifically, the same tutorial page provides a separate "Download the ERNIE-Image-Turbo text-to-image workflow JSON file" link and describes the variant verbatim as "ERNIE-Image-Turbo is a faster variant optimized with DMD and RL, generating images in just 8 steps compared to the ~50 steps required by the standard model." Download that Turbo JSON and load it in ComfyUI.

For Path A (GGUF), swap the template's Load Diffusion Model node for the GGUF Unet loader from ComfyUI-GGUF (the bootleg category), pointing it at the ernie-image-turbo-Q8_0.gguf you downloaded in step 4. For Path B (BF16), leave the default Load Diffusion Model node and point it at ernie-image-turbo.safetensors. The text encoder, VAE, and sampler graph stay as the template ships them.

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser, load the Turbo workflow from step 6, select your diffusion-model loader, and set the Baidu-recommended parameters: resolution 1024×1024 (or one of 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, 1200×896), 8 inference steps, and guidance scale (CFG) 1.0. Turbo is step-distilled (DMD + RL per the model card) and tuned for 8-step generation — higher CFG or more steps degrade output. Hit Queue Prompt; generated PNGs land in ComfyUI/output/.
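The recommended parameters above can be captured as a small sanity check to run against whatever values you type into the sampler and latent-size nodes. This is an illustrative sketch only; check_settings is a hypothetical helper, not part of ComfyUI, and it encodes nothing beyond the resolutions, step count, and CFG listed in this section.

```python
# Baidu-recommended values as listed in this section.
RECOMMENDED_RESOLUTIONS = {
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
}
RECOMMENDED_STEPS = 8
RECOMMENDED_CFG = 1.0


def check_settings(width: int, height: int, steps: int, cfg: float) -> list:
    """Return a list of warnings; an empty list means the settings match."""
    warnings = []
    if (width, height) not in RECOMMENDED_RESOLUTIONS:
        warnings.append(f"{width}x{height} is not a recommended resolution")
    if steps != RECOMMENDED_STEPS:
        warnings.append(f"steps={steps}, recommended {RECOMMENDED_STEPS}")
    if cfg != RECOMMENDED_CFG:
        warnings.append(f"cfg={cfg}, recommended {RECOMMENDED_CFG}")
    return warnings


print(check_settings(1024, 1024, 8, 1.0))
print(check_settings(512, 512, 20, 4.0))
```

The first call prints an empty list; the second prints one warning per mismatched parameter.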

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). On RDNA3 the SDPA path is the one to use — do not install xformers or build a flash-attn wheel. If the auto-selected attention path misbehaves (a known footgun on ROCm is a crash at the VAE-decode stage), force the PyTorch cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

If repeated generations grow unstable, add the memory-management flags --disable-smart-memory ("Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can.") and --disable-pinned-memory ("Disable pinned memory use."); both descriptions are quoted verbatim from cli_args.py. These settle large-model loads on RDNA3, and the AMD ROCm troubleshooting guidance points to disabling pinned memory as the go-to fix for large-load instability on ROCm. Do not reach for --use-split-cross-attention on this card; SDPA via --use-pytorch-cross-attention is the path that works.

Results

Speed: Not quoted. The /check/ernie-image-turbo/rx-7900-xtx page is currently verdict: unknown and no community benchmark naming the RX 7900 XTX (or any same-config ERNIE-Image-Turbo ROCm run) was found in the sources reviewed. Image-generation throughput depends heavily on the ROCm version, attention path, and quant tier, so transferring an iterations-per-second figure from a different GPU or a CUDA run would mislead. The /check page populates once a benchmark lands — if you've measured ERNIE-Image-Turbo on a 7900 XTX, please contribute it so it lands at /check/ernie-image-turbo/rx-7900-xtx.
VRAM usage: The recommended GGUF Q8_0 diffusion weights are 8.69 GB on disk (unsloth GGUF tree); the Ministral-3B text encoder (7.72 GB), Flux2 VAE (0.34 GB), and 1024×1024 activations add to that, but not all are resident simultaneously — the text encoder runs once per generation, then ComfyUI offloads it before the diffusion stage dominates. The full BF16 path's diffusion model is 16.07 GB (Comfy-Org tree); on a 24 GB card it fits but is tighter, which is why this recipe leads with Q8_0. min_vram_gb: 16 is a conservative floor covering the recommended Q8_0 path's multi-component peak. See /check/ernie-image-turbo/rx-7900-xtx for any community-submitted measurement.
Quality notes: 8-step distilled output (DMD + RL). Stay at one of the Baidu-recommended resolutions listed under Running (1024×1024, 848×1264, and the other five) and CFG 1.0 for the cleanest fidelity. Q8_0 is near-lossless versus BF16 and is the practical sweet spot on this card; there is no FP8 tradeoff to consider on RDNA3 (no FP8 hardware), so the choice is simply Q8_0 (more headroom) vs BF16 (maximal fidelity, tighter fit).

For the full benchmark data once it lands, see /check/ernie-image-turbo/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled" / wrong backend

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (under HIP, ROCm devices are exposed through the torch.cuda namespace).
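The same check can be scripted. The sketch below assumes only that torch.__version__ carries a +rocm local tag on ROCm wheels and that torch.version.hip is set on HIP builds and None otherwise; the helper names and the version strings in the usage note are made up for illustration.

```python
from typing import Optional


def is_rocm_build(version: str, hip_version: Optional[str]) -> bool:
    """True if the version has a +rocm local tag or a HIP version is reported."""
    local_tag = version.partition("+")[2]
    return local_tag.startswith("rocm") or hip_version is not None


def check_installed_torch() -> bool:
    """Inspect the torch installed in the current environment."""
    import torch  # imported lazily so is_rocm_build works without torch

    hip = getattr(torch.version, "hip", None)
    gpu_visible = torch.cuda.is_available()
    print("torch", torch.__version__, "| hip:", hip, "| gpu visible:", gpu_visible)
    return is_rocm_build(torch.__version__, hip) and gpu_visible
```

Run check_installed_torch() inside the environment ComfyUI uses. A string such as "2.9.0+cu128" with no HIP version indicates a CUDA wheel and calls for the reinstall above.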

Crash or instability at the VAE-decode stage on ROCm

The default attention path can be unstable on RDNA3, and repeated runs can fragment VRAM. Launch with the PyTorch SDPA attention flag and, if needed, the ROCm memory-management flags:

python main.py --use-pytorch-cross-attention --disable-smart-memory --disable-pinned-memory

These are the canonical ROCm stabilization flags for large image models on the 7900 XTX (the AMD ROCm guidance points to the pinned-memory and smart-memory settings as the go-to fix). Do not use --use-split-cross-attention as a substitute; on this card it does not resolve the issue, and SDPA via --use-pytorch-cross-attention is the working path.

First-run generation feels slow — enable TunableOp

Per the ComfyUI README "AMD ROCm Tips", setting the env variable PYTORCH_TUNABLEOP_ENABLED=1 "might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches them for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py
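If you switch between the plain launch, the stabilization flags, and TunableOp often, a small launcher keeps the combinations in one place. This is a sketch under stated assumptions: build_launch and launch are hypothetical helpers, and they use only the flags and the environment variable quoted in this recipe.

```python
import os
import subprocess
import sys


def build_launch(stabilize=False, tunableop=False, base_env=None):
    """Return (argv, env) for starting ComfyUI with the chosen options."""
    argv = [sys.executable, "main.py"]
    if stabilize:
        argv += [
            "--use-pytorch-cross-attention",
            "--disable-smart-memory",
            "--disable-pinned-memory",
        ]
    env = dict(os.environ if base_env is None else base_env)
    if tunableop:
        # Slow first run while GEMM kernels are tuned, per the README tip.
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    return argv, env


def launch(comfy_root, **options):
    """Start ComfyUI from its checkout directory and wait for it to exit."""
    argv, env = build_launch(**options)
    return subprocess.run(argv, cwd=comfy_root, env=env, check=False)
```

For example, launch("/path/to/ComfyUI", stabilize=True, tunableop=True) corresponds to setting the environment variable and passing all three flags on one command line; the path is a placeholder.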

The raw diffusers/SGLang path OOMs during inference even at 24 GB

A community user (animebing) reports in Issue #4 ("How to run the turbo version on a 24G graphics card?") that the diffusers and SGLang paths load successfully on a 24 GB RTX 4090 but hit an out-of-memory error during inference; a community contributor (HsiaWinter) in the same thread suggests pipe.enable_model_cpu_offload() as a diffusers workaround. This is one reason this recipe leads with the ComfyUI path (GGUF Q8_0 or BF16), whose loader manages VRAM more conservatively than the raw ErnieImagePipeline.to("cuda") snippet on the Baidu model card. If you do run the diffusers path on the RX 7900 XTX and hit an OOM, call pipe.enable_model_cpu_offload() in place of .to("cuda") before generating.

The GGUF Unet loader node isn't visible after install

Per the ComfyUI-GGUF README, the node lives under the bootleg category. If it's missing:

Confirm the clone landed in ComfyUI/custom_nodes/ComfyUI-GGUF/ (not nested one level deeper).
Verify pip install --upgrade gguf ran in the same Python environment ComfyUI uses.
Restart ComfyUI fully (not just refresh the browser).
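The first two checks in the list above can be automated. The sketch below is illustrative: diagnose_gguf_node and gguf_package_installed are hypothetical helpers, and the "nested" case assumes the accidental layout is a ComfyUI-GGUF folder inside another ComfyUI-GGUF folder.

```python
import importlib.util
from pathlib import Path


def diagnose_gguf_node(comfy_root) -> str:
    """Return 'ok', 'missing', or 'nested' for the ComfyUI-GGUF clone."""
    node_dir = Path(comfy_root) / "custom_nodes" / "ComfyUI-GGUF"
    if not node_dir.is_dir():
        return "missing"
    if (node_dir / "ComfyUI-GGUF").is_dir():
        # Assumed failure mode: the repo was cloned inside its own folder.
        return "nested"
    return "ok"


def gguf_package_installed() -> bool:
    """True if the gguf package is importable in the current environment."""
    return importlib.util.find_spec("gguf") is not None
```

Run both with the same Python interpreter that starts ComfyUI; a result of "ok" together with True leaves the full restart as the remaining step.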