How much VRAM does ERNIE-Image-Turbo need?

About 9 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

ERNIE-Image-Turbo on RX 7800 XT: 8-step text-to-image via ComfyUI on ROCm (GGUF Q8_0)

What You'll Build

A local ERNIE-Image-Turbo text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack, producing 1024×1024 images from text prompts in 8 inference steps. Baidu's ERNIE-Image-Turbo is the distilled 8B Diffusion Transformer release of ERNIE-Image, optimized (per the model card) with DMD and RL for 8-step generation. On a 16 GB card the recommended weights are the GGUF Q8_0 quant (ernie-image-turbo-Q8_0.gguf, 8.69 GB on disk), loaded through city96's ComfyUI-GGUF custom node. At near-BF16 quality it leaves headroom for diffusion-stage activations once the text encoder has been offloaded, which the full BF16 weights (16.07 GB for the diffusion model alone) cannot do at this VRAM tier.

Hardware data: RX 7800 XT (16 GB VRAM) · 8 inference steps · GGUF Q8_0 · ComfyUI on ROCm 7.2

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3's WMMA units accept FP16, BF16, INT8, and INT4 only (AMD GPUOpen, "How to accelerate AI applications on RDNA 3 using WMMA"), so an FP8 checkpoint would just upcast to BF16 with no memory saving — which is exactly why FP8 is not the 16 GB escape hatch it is on NVIDIA cards. The real memory route here is GGUF. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers, build a flash-attn wheel, or pick a cu12x wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB)
RAM	16 GB system	—
Storage	~17 GB: Q8_0 GGUF (8.69 GB) + Ministral-3B text encoder (7.72 GB) + Flux2 VAE (0.34 GB)	per HF tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), ComfyUI-GGUF, Python 3.10+	—

The model is released under the Apache-2.0 license and the weights are not gated on Hugging Face — no access request or login is required to download them. Baidu's card states ERNIE-Image-Turbo "can run on consumer GPUs with 24G VRAM" (HF card, "Practical deployment" highlight) — that is the full BF16 envelope. The RX 7800 XT has 16 GB, below that floor, so this recipe leads with the GGUF Q8_0 quant (8.69 GB) instead of the 16.07 GB BF16 diffusion model: at 16 GB the BF16 weights leave no room for the 7.72 GB text encoder and the 1024×1024 activations to coexist, whereas Q8_0 is near-lossless and comfortably resident. ERNIE-Image-Turbo uses a Ministral-3B text encoder and a Flux2 VAE, confirmed by the Comfy-Org/ERNIE-Image repackager file layout (text_encoders/ministral-3-3b.safetensors, vae/flux2-vae.safetensors).
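The memory argument above is simple arithmetic, and a short sketch makes it concrete. The sizes are the on-disk figures quoted in this recipe. Treating the card as exactly 16.0 in the same units, and ignoring activations and allocator overhead, are simplifying assumptions, so read the output as a rough budget rather than a measurement.

```python
# Rough VRAM budget for a 16 GB card, using the file sizes quoted in this recipe.
# Assumption: on-disk size approximates resident size; activations are not modeled.
CARD_VRAM_GB = 16.0
SIZES_GB = {
    "diffusion_q8_0": 8.69,
    "diffusion_bf16": 16.07,
    "text_encoder": 7.72,
    "vae": 0.34,
}


def headroom_gb(diffusion_key: str, encoder_resident: bool) -> float:
    """VRAM left after the diffusion model, the VAE, and optionally the text encoder."""
    used = SIZES_GB[diffusion_key] + SIZES_GB["vae"]
    if encoder_resident:
        used += SIZES_GB["text_encoder"]
    return round(CARD_VRAM_GB - used, 2)


for key in ("diffusion_q8_0", "diffusion_bf16"):
    for resident in (False, True):
        label = "encoder resident" if resident else "encoder offloaded"
        print(f"{key:15s} {label:18s} headroom: {headroom_gb(key, resident):6.2f} GB")
```

Under these assumptions only the Q8_0 model with the text encoder offloaded leaves positive headroom (roughly 7 GB) for activations. Every BF16 combination comes out negative.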

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) is also listed. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) that the README lists for broader RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.

ℹ️ gfx1101 and the HSA override. The RX 7800 XT's ROCm arch target is gfx1101 (Navi 32) — not gfx1100 (the 7900 XTX) and not gfx1102 (the RX 7600). Modern ROCm 7.x ships gfx1101 kernels, so you should not need HSA_OVERRIDE_GFX_VERSION. It is a legacy fallback only: if a specific library ships kernels for gfx1100 but not gfx1101, exporting HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerades the card as gfx1100 so the gfx1100 kernels load. Try the unmodified install first; reach for the override only if you hit a "no kernel image" error.
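The override decision described above can be sketched in Python. The helper name needs_gfx_override and its shipped_archs argument are hypothetical, introduced only for illustration; the set of architectures a given library actually ships is something you would have to look up yourself. The gcnArchName property is read defensively with getattr because it is only expected on ROCm builds of PyTorch.

```python
def needs_gfx_override(arch_name: str, shipped_archs: set[str]) -> bool:
    """Return True when the reported arch has no kernels in `shipped_archs`.

    `arch_name` may carry feature suffixes after a colon, so only the part
    before the first colon is compared.
    """
    base = arch_name.split(":", 1)[0]
    return base not in shipped_archs


def report_arch() -> None:
    """Print the arch name PyTorch reports for device 0, if one is visible."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("No GPU visible to PyTorch")
        return
    props = torch.cuda.get_device_properties(0)
    # Assumption: ROCm builds expose gcnArchName; fall back if it is absent.
    print(getattr(props, "gcnArchName", "gcnArchName not available"))


# Hypothetical library that ships gfx1100 kernels only: the override would be needed.
print(needs_gfx_override("gfx1101", {"gfx1100"}))
# Library that ships gfx1101 kernels: leave HSA_OVERRIDE_GFX_VERSION unset.
print(needs_gfx_override("gfx1101", {"gfx1100", "gfx1101"}))
```

Calling report_arch() inside the ComfyUI environment should show what the card reports before you decide whether any override is worth trying.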

3. Install ComfyUI dependencies and the ComfyUI-GGUF node

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

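For the GGUF Q8_0 path, install city96's ComfyUI-GGUF loader node into custom_nodes and install the gguf package (per the ComfyUI-GGUF README's install steps). You are already inside the ComfyUI directory after step 1, so the clone target is relative to it:

# from your ComfyUI root
git clone https://github.com/city96/ComfyUI-GGUF custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf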

Restart ComfyUI after install — the GGUF Unet loader node appears under the bootleg category.

Note on the GGUF source and how it runs on ROCm. city96 publishes the ComfyUI-GGUF loader node, not the ERNIE quant weights — there is no city96/ERNIE-Image-Turbo-gguf repo. The quant weights come from the unsloth/ERNIE-Image-Turbo-GGUF repo in step 4. The GGUF format was popularized by llama.cpp (per the ComfyUI-GGUF README), but the loader node dequantizes the weights and runs them through ComfyUI's own PyTorch runtime — so on this card the compute goes through PyTorch's ROCm/HIP backend, exactly like a BF16 model would. There is no separate llama.cpp process to build; if PyTorch-for-ROCm works (step 2), the GGUF path works.

4. Download the weights (GGUF Q8_0)

Pick the Q8_0 quant from the unsloth/ERNIE-Image-Turbo-GGUF repo — ernie-image-turbo-Q8_0.gguf, 8.69 GB on disk. The unsloth card is a GGUF quant of the canonical baidu/ERNIE-Image-Turbo upstream (linked via its base_model) and credits city96's ComfyUI-GGUF as the loader tooling. GGUF diffusion-model files live in ComfyUI/models/unet:

# from your ComfyUI root
huggingface-cli download unsloth/ERNIE-Image-Turbo-GGUF \
  ernie-image-turbo-Q8_0.gguf \
  --local-dir models/unet

ℹ️ Tighter on VRAM? Drop a Q-tier. Q8_0 (8.69 GB) is the recommended lead on 16 GB and is near-lossless versus BF16. If you stack other VRAM consumers (a second model, a large batch) and need more headroom, the same unsloth repo also publishes Q6_K (6.79 GB), Q5_K_M (5.93 GB), and Q4_K_M (5.02 GB) — quality degrades gradually as the bit-width drops. There is no BF16-fits-comfortably option on this 16 GB card; the full ernie-image-turbo-BF16.gguf is 16.07 GB (the same size as the safetensors), which leaves no room for the text encoder and activations.
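The tier choice above amounts to "take the largest quant that fits the VRAM you can spare for the diffusion model". A minimal sketch of that rule, using the file sizes quoted in this recipe; the pick_quant helper is hypothetical, and using file size as a stand-in for resident size is an assumption.

```python
from typing import Optional

# File sizes in GB as listed for the unsloth GGUF repo in this recipe.
QUANT_SIZES_GB = {"Q8_0": 8.69, "Q6_K": 6.79, "Q5_K_M": 5.93, "Q4_K_M": 5.02}


def pick_quant(vram_budget_gb: float) -> Optional[str]:
    """Return the largest quant whose file size fits the budget, or None."""
    fitting = {name: size for name, size in QUANT_SIZES_GB.items()
               if size <= vram_budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for budget in (9.0, 7.0, 6.0, 5.5, 4.0):
    print(f"budget {budget:4.1f} GB -> {pick_quant(budget)}")
```

The budget here is whatever you are willing to give the diffusion weights after reserving room for activations and any other VRAM consumers.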

5. Download the text encoder and VAE

Pull the auxiliary files from the Comfy-Org/ERNIE-Image repackager (the ComfyUI core team's repackaging into ComfyUI's expected layout):

# from your ComfyUI root: text encoder (Ministral-3-3B, 7.72 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ministral-3-3b.safetensors \
  --local-dir models/

# optional prompt enhancer (6.88 GB), only if you enable use_pe
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ernie-image-prompt-enhancer.safetensors \
  --local-dir models/

# VAE (Flux2 VAE, 0.34 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  vae/flux2-vae.safetensors \
  --local-dir models/

The official ComfyUI ERNIE-Image tutorial lists the same three auxiliary files — ministral-3-3b.safetensors (text encoder), ernie-image-prompt-enhancer.safetensors (prompt enhancer text encoder), and flux2-vae.safetensors (VAE) — under the expected text_encoders/ and vae/ layout.
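Before launching, it can help to confirm that the files from steps 4 and 5 landed where the loaders look for them. The sketch below assumes the layout described in this recipe (the GGUF under unet/, the encoder under text_encoders/, the VAE under vae/). The missing_model_files helper is hypothetical, and the optional prompt enhancer is deliberately left out of the required list.

```python
from pathlib import Path

# Relative paths under the ComfyUI models directory, per steps 4 and 5.
EXPECTED = [
    "unet/ernie-image-turbo-Q8_0.gguf",
    "text_encoders/ministral-3-3b.safetensors",
    "vae/flux2-vae.safetensors",
]


def missing_model_files(models_dir, expected=EXPECTED) -> list[str]:
    """Return the expected relative paths that are not present as files."""
    root = Path(models_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Usage, from the ComfyUI root:
#   print(missing_model_files("models") or "all model files present")
```

An empty list means all three required files are in place. Any path that is returned points at the download step to repeat.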

6. Load the Turbo workflow template

The official ComfyUI tutorial documents loading the ERNIE-Image flow: update ComfyUI to the latest version (or use Comfy Cloud), open Template and search for ERNIE-Image, select the ERNIE-Image workflow, then download any missing models, update the prompt, and click Run. For Turbo specifically, the same tutorial page provides a separate "Download the ERNIE-Image-Turbo text-to-image workflow JSON file" link and describes the variant verbatim as "ERNIE-Image-Turbo is a faster variant optimized with DMD and RL, generating images in just 8 steps compared to the ~50 steps required by the standard model." Download that Turbo JSON and load it in ComfyUI.

On this card, swap the template's Load Diffusion Model node for the GGUF Unet loader from ComfyUI-GGUF (the bootleg category), pointing it at the ernie-image-turbo-Q8_0.gguf you downloaded in step 4. The text encoder, VAE, and sampler graph stay as the template ships them.

Running

Launch ComfyUI from the repo root. Per the ComfyUI README "Running" section:

python main.py

This starts the server (default http://127.0.0.1:8188). Open it in a browser, load the Turbo workflow from step 6, point the GGUF Unet loader at ernie-image-turbo-Q8_0.gguf, and set the Baidu-recommended parameters: resolution 1024×1024 (or one of 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, 1200×896), 8 inference steps, and guidance scale (CFG) 1.0. Turbo is step-distilled (DMD + RL per the model card) and tuned for 8-step generation, so higher CFG or more steps degrade output. Click Run (labeled Queue Prompt in older ComfyUI front ends); generated PNGs land in ComfyUI/output/.
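The recommended parameters above are easy to mistype in a sampler node, so a small sanity check can be useful when scripting workflows. This is only a sketch: settings_warnings is a hypothetical helper, not part of ComfyUI, and it encodes nothing beyond the values listed in this recipe.

```python
# Baidu-recommended settings as listed in this recipe.
RECOMMENDED_RESOLUTIONS = {
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
}
RECOMMENDED_STEPS = 8
RECOMMENDED_CFG = 1.0


def settings_warnings(width: int, height: int, steps: int, cfg: float) -> list[str]:
    """Return one warning per setting that departs from the recommended values."""
    warnings = []
    if (width, height) not in RECOMMENDED_RESOLUTIONS:
        warnings.append(f"resolution {width}x{height} is not in the recommended list")
    if steps != RECOMMENDED_STEPS:
        warnings.append(f"steps={steps}; Turbo is tuned for {RECOMMENDED_STEPS}")
    if cfg != RECOMMENDED_CFG:
        warnings.append(f"cfg={cfg}; recommended guidance scale is {RECOMMENDED_CFG}")
    return warnings


print(settings_warnings(1024, 1024, 8, 1.0))   # no warnings expected
print(settings_warnings(512, 512, 20, 4.0))    # one warning per setting
```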

ComfyUI's default attention backend on this stack is PyTorch's scaled-dot-product attention (SDPA). On RDNA3 the SDPA path is the one to use — do not install xformers or build a flash-attn wheel. If the auto-selected attention path misbehaves (a known footgun on ROCm is a crash at the VAE-decode stage), force the PyTorch cross-attention function explicitly — per ComfyUI's cli_args.py, the flag is documented as "Use the new pytorch 2.0 cross attention function.":

python main.py --use-pytorch-cross-attention

If repeated generations grow unstable, two memory-management flags help settle large-model loads on RDNA3: --disable-smart-memory ("Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can.") and --disable-pinned-memory ("Disable pinned memory use."), both quoted verbatim from cli_args.py. The AMD ROCm troubleshooting guidance names disabling pinned memory as the go-to fix for ROCm large-load instability. These flags matter more on this 16 GB card than on a 24 GB one: VRAM headroom is tighter, so aggressive offload is a useful safety valve. Do not reach for --use-split-cross-attention on this card; SDPA via --use-pytorch-cross-attention is the path that works.

Results

Speed: Not quoted. The /check/ernie-image-turbo/rx-7800-xt page is currently verdict: unknown and no community benchmark naming the RX 7800 XT (or any same-config ERNIE-Image-Turbo ROCm run) was found in the sources reviewed. Image-generation throughput depends heavily on the ROCm version, attention path, and quant tier, so transferring an iterations-per-second figure from a different GPU (including the 24 GB RX 7900 XTX, which has ~50% more memory bandwidth) or a CUDA run would mislead. The /check page populates once a benchmark lands — if you've measured ERNIE-Image-Turbo on a 7800 XT, please contribute it so it lands at /check/ernie-image-turbo/rx-7800-xt.
VRAM usage: The recommended GGUF Q8_0 diffusion weights are 8.69 GB on disk (unsloth GGUF tree); the Ministral-3B text encoder (7.72 GB), Flux2 VAE (0.34 GB), and 1024×1024 activations add to that, but not all are resident simultaneously — the text encoder runs once per generation, then ComfyUI offloads it before the diffusion stage dominates. The full BF16 path's diffusion model is 16.07 GB (Comfy-Org tree) — on its own that nearly fills this card's 16 GB and leaves no room for the encoder or activations, which is why this recipe leads with Q8_0 rather than BF16. min_vram_gb: 9 is a conservative floor covering the Q8_0 path's multi-component peak (the ~8.69 GB resident diffusion model plus diffusion-stage activations, with the text encoder offloaded). See /check/ernie-image-turbo/rx-7800-xt for any community-submitted measurement.
Quality notes: 8-step distilled output (DMD + RL). Stay at the Baidu-recommended 1024×1024 or 848×1264 resolutions and CFG 1.0 for cleanest fidelity. Q8_0 is near-lossless versus BF16 and is the practical sweet spot on this 16 GB card; there is no FP8 tradeoff to consider on RDNA3 (no FP8 hardware), so the choice is simply which GGUF Q-tier buys enough headroom — Q8_0 unless you need to free VRAM for other consumers.

For the full benchmark data once it lands, see /check/ernie-image-turbo/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled" / wrong backend

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP).
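The same check can be written as a short script. The version strings in the example calls are illustrative rather than taken from a real install. The sketch relies on torch.version.hip being a version string on ROCm builds and None on CUDA builds, and reads it with getattr in case the attribute is absent.

```python
def looks_like_rocm_build(version: str) -> bool:
    """True when the local-version suffix (after '+') starts with 'rocm'."""
    _, _, local = version.partition("+")
    return local.startswith("rocm")


def describe_backend() -> str:
    """Summarize which PyTorch build is installed and whether it sees a GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    hip = getattr(torch.version, "hip", None)  # expected to be None on CUDA builds
    return (f"torch {torch.__version__}, hip={hip}, "
            f"gpu_visible={torch.cuda.is_available()}")


print(looks_like_rocm_build("2.4.1+rocm6.1"))  # illustrative ROCm-style string
print(looks_like_rocm_build("2.4.1+cu124"))    # illustrative CUDA-style string
```

Running print(describe_backend()) inside the environment ComfyUI uses gives the three facts this section asks you to confirm in one line.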

Out of memory loading the model

This is the most likely failure on a 16 GB card if you picked the wrong weights. Confirm you downloaded the Q8_0 GGUF (8.69 GB), not the BF16 safetensors (16.07 GB) — the full BF16 diffusion model nearly fills the card on its own and OOMs once the text encoder and activations are added. If Q8_0 still OOMs because you're running other VRAM consumers, drop to a lower Q-tier (Q6_K 6.79 GB or Q5_K_M 5.93 GB) from the unsloth repo, and launch with --disable-smart-memory to force aggressive offload of the text encoder to system RAM before the diffusion stage.

Crash or instability at the VAE-decode stage on ROCm

The default attention path can be unstable on RDNA3, and repeated runs can fragment VRAM. Launch with the PyTorch SDPA attention flag and, if needed, the ROCm memory-management flags:

python main.py --use-pytorch-cross-attention --disable-smart-memory --disable-pinned-memory

These are the canonical ROCm stabilization flags for large image models on RDNA3 (the AMD ROCm guidance names disabling pinned and smart memory as the go-to fix). Do not use --use-split-cross-attention as a substitute: on this card it does not resolve the issue, and SDPA via --use-pytorch-cross-attention is the working path.

First-run generation feels slow — enable TunableOp

Per the ComfyUI README "AMD ROCm Tips", setting the env variable PYTORCH_TUNABLEOP_ENABLED=1 "might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches them for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py
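If you switch between the plain launch, the stabilization flags, and TunableOp often, a tiny wrapper keeps the combinations in one place. This is a sketch: build_launch is a hypothetical helper, and it only assembles the flags and the environment variable already quoted in this recipe.

```python
import os
import sys


def build_launch(stabilize: bool = False, tunable_op: bool = False):
    """Return (argv, env) for launching ComfyUI from its repo root."""
    argv = [sys.executable, "main.py"]
    if stabilize:
        # ROCm stabilization flags from the troubleshooting section above.
        argv += [
            "--use-pytorch-cross-attention",
            "--disable-smart-memory",
            "--disable-pinned-memory",
        ]
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    if tunable_op:
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    return argv, env


# Usage, from the ComfyUI root:
#   import subprocess
#   argv, env = build_launch(stabilize=True, tunable_op=True)
#   subprocess.run(argv, env=env, check=True)
```

Using sys.executable keeps the launch in the same Python environment that has the ROCm build of PyTorch installed.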

The raw diffusers/SGLang path OOMs during inference

A community user (animebing) reports in Issue #4 ("How to run the turbo version on a 24G graphics card?") that the diffusers and SGLang paths load successfully on a 24 GB RTX 4090 but hit an out-of-memory error during inference; a community contributor (HsiaWinter) in the same thread suggests pipe.enable_model_cpu_offload() as a diffusers workaround. If a 24 GB card OOMs on the raw path, a 16 GB card certainly will — which is the main reason this recipe leads with the ComfyUI GGUF Q8_0 path, whose loader manages VRAM far more conservatively than the raw ErnieImagePipeline.to("cuda") snippet on the Baidu card. If you do try the diffusers path here, add pipe.enable_model_cpu_offload() before generating.

The GGUF Unet loader node isn't visible after install

Per the ComfyUI-GGUF README, the node lives under the bootleg category. If it's missing:

Confirm the clone landed in ComfyUI/custom_nodes/ComfyUI-GGUF/ (not nested one level deeper).
Verify pip install --upgrade gguf ran in the same Python environment ComfyUI uses.
Restart ComfyUI fully (not just refresh the browser).
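The first two checks above can be automated. In this sketch, node_dir_problems and package_importable are hypothetical helper names. Looking for __init__.py at the top of the node folder is an assumption, based on ComfyUI loading custom nodes as Python packages.

```python
import importlib.util
from pathlib import Path


def node_dir_problems(comfy_root) -> list[str]:
    """Check that ComfyUI-GGUF sits directly under custom_nodes/."""
    node_dir = Path(comfy_root) / "custom_nodes" / "ComfyUI-GGUF"
    if not node_dir.is_dir():
        return [f"missing directory: {node_dir}"]
    if not (node_dir / "__init__.py").is_file():
        if (node_dir / "ComfyUI-GGUF").is_dir():
            return ["clone is nested one level too deep"]
        return ["no __init__.py found in the node directory"]
    return []


def package_importable(name: str) -> bool:
    """True when `name` can be found by the current Python environment."""
    return importlib.util.find_spec(name) is not None


# Usage, from the ComfyUI root, with the same interpreter ComfyUI uses:
#   print(node_dir_problems("."))
#   print(package_importable("gguf"))
```

Run it with the interpreter that launches ComfyUI; a different environment can report gguf as present while ComfyUI's own environment lacks it.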