ERNIE-Image-Turbo on RTX 4080 SUPER: 8-step text-to-image via GGUF in ComfyUI

What You'll Build

A working ComfyUI text-to-image pipeline that runs Baidu's 8B ERNIE-Image-Turbo on a 16GB RTX 4080 SUPER using the unsloth Q8_0 GGUF quant (ernie-image-turbo-Q8_0.gguf, 8.69 GB on disk) loaded through city96's ComfyUI-GGUF custom node. The pipeline uses 8 inference steps per image at the full 1024×1024 native resolution, and the Q8_0 UNet weights need no CPU offload.

Hardware data: RTX 4080 SUPER (16GB VRAM) · 8 inference steps · GGUF Q8_0 · See benchmark data

ℹ️ Why GGUF and not the full BF16 release. Baidu's card states ERNIE-Image-Turbo "can run on consumer GPUs with 24G VRAM" (HF card), yet a community user reports in Issue #4 that inference runs out of memory even on a 24 GB RTX 4090, on both the diffusers and SGLang paths. The RTX 4080 SUPER carries 16 GB, below that documented BF16 floor, so this recipe runs the Q8_0 GGUF quant (8.69 GB of weights) through ComfyUI-GGUF rather than the full 16.07 GB BF16 single-file release.

Requirements

Component	Minimum	Tested
GPU	12GB VRAM NVIDIA (per Civitai workflow notes)	RTX 4080 SUPER (16GB)
RAM	16GB system RAM	—
Storage	~17 GB for Q8_0 UNet (8.69 GB) + text encoder (7.72 GB) + VAE (0.34 GB)	—
Software	ComfyUI (latest), ComfyUI-Manager, Python 3.10+, PyTorch with stable CUDA wheels	—

The unquantized Baidu release "can run on consumer GPUs with 24G VRAM" per the official ERNIE-Image-Turbo card. The Q8_0 GGUF brings the UNet weights down to 8.69 GB, which leaves a 16GB GPU headroom for the Ministral-3-3B text encoder, the Flux2 VAE, and activation memory. The SarcasticTOFU Civitai workflow (a Base-or-Turbo ERNIE-Image flow) documents a 12 GB minimum for its FP8 path; this recipe adopts the same figure as the floor for the Q8_0 GGUF path until a measured benchmark is available.
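The Storage row in the table above is simple addition, and it is worth redoing if you plan to add the optional prompt enhancer. A minimal sketch using only the file sizes quoted in this recipe; the dictionary keys are descriptive labels chosen for illustration, not file names.

```python
# Download sizes in GB, as quoted in this recipe.
REQUIRED_GB = {
    "Q8_0 GGUF UNet": 8.69,
    "Ministral-3-3B text encoder": 7.72,
    "Flux2 VAE": 0.34,
}
OPTIONAL_GB = {"prompt enhancer": 6.88}


def total_gb(*groups):
    """Sum the sizes across one or more {label: size_gb} dictionaries."""
    return round(sum(size for group in groups for size in group.values()), 2)


required = total_gb(REQUIRED_GB)
with_enhancer = total_gb(REQUIRED_GB, OPTIONAL_GB)
print(f"required downloads: {required} GB")
print(f"with prompt enhancer: {with_enhancer} GB")
```

The required files come to 16.75 GB, which the table rounds to ~17 GB. Adding the enhancer brings the download to 23.63 GB, so leave extra disk space if you intend to enable it.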

Installation

1. Install PyTorch (RTX 4080 SUPER is Ada sm_89 — stock wheels work)

The RTX 4080 SUPER is Ada Lovelace AD103, compute capability 8.9 (sm_89). Unlike Blackwell (sm_120) cards, sm_89 is covered by the stable PyTorch CUDA wheels, so no nightly build is required, and on Linux the default pip wheel is already a CUDA build that needs no special --index-url. The standard ComfyUI install already pulls a working build:

pip install torch torchvision torchaudio

Verify the runtime sees the device:

python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"

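You want a CUDA 12.x version and (8, 9) printed. If the command fails with a CUDA-not-available error instead, the installed wheel is probably a CPU-only build; reinstall using the CUDA wheel command from the PyTorch install selector.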
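If you want more detail than the one-liner gives, the same check can be written as a small script that also reports the device name and total VRAM. This is a sketch, not part of ComfyUI; the helper names is_ada_sm89 and describe_gpu are made up for this example, and it degrades to a message rather than a traceback when PyTorch or CUDA is missing.

```python
def is_ada_sm89(capability):
    """True when a (major, minor) compute-capability tuple is 8.9 (sm_89)."""
    return tuple(capability) == (8, 9)


def describe_gpu():
    """Return a one-line summary of GPU 0, or the reason it is unavailable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"CUDA not available (torch.version.cuda={torch.version.cuda})"
    props = torch.cuda.get_device_properties(0)
    cap = torch.cuda.get_device_capability(0)
    vram_gib = props.total_memory / 1024**3
    tag = "sm_89, as expected" if is_ada_sm89(cap) else f"sm_{cap[0]}{cap[1]}"
    return f"{props.name}: {vram_gib:.1f} GiB, CUDA {torch.version.cuda}, {tag}"


if __name__ == "__main__":
    print(describe_gpu())
```

Run it with the same interpreter ComfyUI uses (the embedded one on Windows portable), otherwise it reports on a different PyTorch install.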

2. Install the ComfyUI-GGUF custom node

From the city96/ComfyUI-GGUF README, clone into ComfyUI's custom_nodes directory:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

On Windows portable ComfyUI, use the embedded interpreter instead:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Restart ComfyUI after install — the Unet Loader (GGUF) node appears under the bootleg category.

3. Download the Q8_0 GGUF UNet

Pick the Q8_0 quant from the unsloth/ERNIE-Image-Turbo-GGUF repo: ernie-image-turbo-Q8_0.gguf, 8.69 GB on disk. The repo lists a full quant ladder from Q2_K (3.18 GB) through BF16 (16.07 GB); Q8_0 is the highest-precision quant on that ladder that still leaves a 16GB card room for the text encoder, VAE, and activations. The card credits city96's ComfyUI-GGUF as the loader tooling and links back to the canonical baidu/ERNIE-Image-Turbo upstream.

# run from the directory that contains ComfyUI/ (same place as the git clone in step 2)
huggingface-cli download unsloth/ERNIE-Image-Turbo-GGUF \
  ernie-image-turbo-Q8_0.gguf \
  --local-dir ComfyUI/models/unet

Per the ComfyUI-GGUF README, GGUF UNet files live in ComfyUI/models/unet.

4. Download the text encoder and VAE

The GGUF UNet still needs the auxiliary files the workflow expects. Pull them from the Comfy-Org/ERNIE-Image repackager (the ComfyUI core team's repackaging into ComfyUI's expected layout):

# run from the directory that contains ComfyUI/ (text encoder: Ministral-3-3B, 7.72 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ministral-3-3b.safetensors \
  --local-dir ComfyUI/models/

# optional prompt enhancer (6.88 GB) — only if you enable use_pe
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ernie-image-prompt-enhancer.safetensors \
  --local-dir ComfyUI/models/

# VAE (Flux2 VAE, 0.34 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  vae/flux2-vae.safetensors \
  --local-dir ComfyUI/models/

The official ComfyUI ERNIE-Image tutorial lists the same three auxiliary files — ministral-3-3b.safetensors (text encoder), ernie-image-prompt-enhancer.safetensors (prompt enhancer text encoder), and flux2-vae.safetensors (VAE) — under the expected layout:

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 diffusion_models/
│   │   └── ernie-image-turbo.safetensors
│   ├── 📂 text_encoders/
│   │   ├── ministral-3-3b.safetensors
│   │   └── ernie-image-prompt-enhancer.safetensors
│   └── 📂 vae/
│       └── flux2-vae.safetensors

(You replace the diffusion_models/ernie-image-turbo.safetensors slot with the Q8_0 GGUF in models/unet loaded via the GGUF node — see step 5.)
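Before opening ComfyUI, you can confirm the downloads from steps 3 and 4 landed where this recipe expects them. The sketch below assumes the layout shown above plus the Q8_0 file in models/unet; the function name missing_files is made up for this example, and the relative path "ComfyUI/models" assumes you run it from the directory that contains ComfyUI/.

```python
from pathlib import Path

# Relative paths under ComfyUI/models/ that this recipe expects.
EXPECTED = [
    "unet/ernie-image-turbo-Q8_0.gguf",
    "text_encoders/ministral-3-3b.safetensors",
    "vae/flux2-vae.safetensors",
]
# Only needed when the prompt enhancer is enabled.
OPTIONAL = ["text_encoders/ernie-image-prompt-enhancer.safetensors"]


def missing_files(models_dir, include_optional=False):
    """Return the expected relative paths that do not exist under models_dir."""
    root = Path(models_dir)
    wanted = EXPECTED + (OPTIONAL if include_optional else [])
    return [rel for rel in wanted if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files("ComfyUI/models"):
        print("missing:", rel)
```

An empty result means all three required files are present. A file reported missing despite a successful download usually means it was saved one directory level off.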

5. Load the Turbo workflow template

Per the official ComfyUI ERNIE-Image tutorial, the get-started flow is: update ComfyUI to the latest version (or use Comfy Cloud), open the Template menu, search for ERNIE-Image, select the ERNIE-Image workflow, download any missing models, update the prompt, and click Run.

The same page documents the Turbo variant separately. It provides its own "Download the ERNIE-Image-Turbo text-to-image workflow JSON file" link and describes ERNIE-Image-Turbo as a faster variant, optimized with DMD and RL, that generates images in 8 steps versus the roughly 50 steps the standard ERNIE-Image model needs. Baidu's HF card confirms this characterization: the Turbo checkpoint is "optimized by DMD and RL" and produces output "in only 8 inference steps". Download that Turbo workflow JSON and load it in ComfyUI.

In the loaded Turbo template, swap the default Load Diffusion Model node for the Unet Loader (GGUF) node from ComfyUI-GGUF, pointing it at the Q8_0 file you downloaded in step 3. The text encoder, VAE, and sampler graph stay as the template ships them.
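If you drive ComfyUI from scripts, the same loader swap can be applied to a workflow exported in API format instead of clicking through the graph. The sketch below rests on several assumptions you should check against your own export: that the stock loader's class name is UNETLoader, that the GGUF loader's is UnetLoaderGGUF, and that the GGUF loader takes a single unet_name input. The two-node fragment is a hypothetical stand-in, not the real Turbo template.

```python
import copy
import json

# Assumed node class names; confirm them against your own exported workflow.
STOCK_LOADER = "UNETLoader"      # shown in the UI as "Load Diffusion Model"
GGUF_LOADER = "UnetLoaderGGUF"   # shown in the UI as "Unet Loader (GGUF)"


def swap_to_gguf(workflow, gguf_name):
    """Return a copy of an API-format workflow with stock loaders replaced."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if isinstance(node, dict) and node.get("class_type") == STOCK_LOADER:
            node["class_type"] = GGUF_LOADER
            # Assumption: the GGUF loader only needs the file name.
            node["inputs"] = {"unet_name": gguf_name}
    return patched


# Hypothetical two-node fragment, not the real template.
example = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "ernie-image-turbo.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "VAELoader",
          "inputs": {"vae_name": "flux2-vae.safetensors"}},
}
print(json.dumps(swap_to_gguf(example, "ernie-image-turbo-Q8_0.gguf"), indent=2))
```

Node IDs are left unchanged, so links from downstream nodes to the loader's model output keep working. The function returns a copy, which keeps the original export available for comparison.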

Running

With the workflow loaded and the GGUF loader wired in:

Set resolution to one of the Baidu-recommended sizes: 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, or 1200×896.
Set sampler steps to 8 and guidance scale (CFG) to 1.0 — Turbo is step-distilled (DMD + RL per the Baidu HF card) and tuned for 8-step generation. Higher CFG degrades output.
Optionally enable the prompt enhancer (use_pe=True in diffusers terminology; in ComfyUI this is the toggle on the ERNIE prompt-enhancer node in the official template). It loads a second 6.88 GB text encoder, which raises memory pressure while the prompt is encoded, but improves complex-prompt fidelity.
Hit Queue Prompt.

First run is slow due to weight load; subsequent runs reuse the cached UNet.
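The recommended resolutions look quite different, but their pixel counts are close, which matters when you estimate memory or compare timings across aspect ratios. A short sketch that computes the counts from the list in this recipe:

```python
# Resolutions recommended in this recipe, as (width, height).
RECOMMENDED = [
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
]


def pixel_count(size):
    """Total pixels for a (width, height) pair."""
    width, height = size
    return width * height


smallest = min(RECOMMENDED, key=pixel_count)
largest = max(RECOMMENDED, key=pixel_count)
for size in sorted(RECOMMENDED, key=pixel_count):
    print(f"{size[0]}x{size[1]}: {pixel_count(size) / 1e6:.3f} MP")
spread = pixel_count(largest) / pixel_count(smallest)
print(f"largest / smallest = {spread:.3f}")
```

Every size falls between roughly 1.05 and 1.08 megapixels, so the aspect ratio you pick should have only a small effect on activation memory. That is an inference from the pixel counts, not a measured result.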

Results

Speed: no figure is quoted. The /check/ernie-image-turbo/rtx-4080-super page currently shows verdict: unknown, and none of the sources reviewed cites a community benchmark for the RTX 4080 SUPER, the closely matched RTX 4080, or any other ERNIE-Image-Turbo run in this configuration. With no 4080-class figure to use even as a lower bound, this recipe gives no speed estimate. The /check page populates once a benchmark lands; to contribute one, see the submission form.
VRAM usage: The lower bound is the Q8_0 weight file at 8.69 GB (unsloth GGUF tree). The Ministral-3-3B text encoder (7.72 GB), Flux2 VAE (0.34 GB), optional prompt enhancer (6.88 GB), and activation memory add to that, though not all at once: the text encoders run once per generation and are then offloaded. The 12 GB minimum comes from the SarcasticTOFU Civitai workflow notes, which document it for the FP8 path; it is used here as a conservative floor until a measured Q8_0 benchmark lands at /check/.
Quality notes: 8-step distilled output (DMD + RL). For the cleanest fidelity, stay at the recommended 1024×1024 or 848×1264 resolutions. The unquantized BF16 file (16.07 GB) won't fit a 16 GB card alongside the text encoders without offload, so Q8_0 is the practical ceiling on this tier.

For the full benchmark data once it lands, see /check/ernie-image-turbo/rtx-4080-super.
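Until a measured figure is published, you can take your own VRAM reading while a generation is running. The sketch below shells out to nvidia-smi with its standard query options and parses the result; the function names are made up for this example. It returns a single snapshot of memory used by all processes on the GPU, not a per-process peak, so sample it several times during a run.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse 'used, total' MiB rows (one per GPU) into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi once; return [] if it is missing, fails, or is unparsable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        return parse_memory(out.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return []


if __name__ == "__main__":
    for index, (used, total) in enumerate(read_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB used")
```

Values are reported in MiB, while the file sizes in this recipe are in GB, so convert before comparing the two.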

Troubleshooting

Out of memory after the first generation

The Q8_0 GGUF weights are 8.69 GB on disk, but the text encoder, VAE, and activations push the runtime peak meaningfully higher. If you run out of memory at 1264×848 or larger:

Drop one quant tier: the unsloth repo also ships ernie-image-turbo-Q6_K.gguf (6.79 GB), ernie-image-turbo-Q5_K_M.gguf (5.93 GB), ernie-image-turbo-Q4_K_M.gguf (5.02 GB), and ernie-image-turbo-Q4_0.gguf (4.76 GB). Each is a drop-in replacement at the GGUF loader.
Disable the prompt enhancer (use_pe=False) so its 6.88 GB text encoder is never loaded.
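Lower output resolution to 1024×1024, the smallest recommended size by pixel count. Expect only a small saving, since all recommended sizes are within about 3% of each other in pixel count.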
Restart ComfyUI between runs to reset accumulated VRAM if your driver is leaking allocations.
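Choosing which tier to drop to can be reduced to a lookup over the file sizes listed in this recipe. A minimal sketch: the helper name largest_quant_under and the example budgets are hypothetical, and file size is only a proxy for the VRAM the weights occupy once loaded.

```python
# File sizes in GB for the quants named in this recipe.
QUANT_GB = {
    "Q8_0": 8.69, "Q6_K": 6.79, "Q5_K_M": 5.93,
    "Q4_K_M": 5.02, "Q4_0": 4.76, "Q2_K": 3.18,
}


def largest_quant_under(budget_gb, quants=QUANT_GB):
    """Return the biggest quant whose file fits the budget, or None."""
    fitting = {name: gb for name, gb in quants.items() if gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None


# Hypothetical budgets: how much VRAM you want to give the UNet weights.
for budget in (10.0, 8.0, 6.0, 3.0):
    print(budget, "->", largest_quant_under(budget))
```

A budget of 8 GB selects Q6_K, for example, and a budget below 3.18 GB returns None because no listed quant fits.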

The full BF16 release OOMs even on a 24 GB card

This is expected, and it's why this recipe uses GGUF. A community user reports in Issue #4 that both the diffusers and SGLang versions load successfully on a 24 GB RTX 4090 but hit an out-of-memory error during inference; a contributor in the same thread suggests pipe.enable_model_cpu_offload() as a diffusers workaround. On a 16 GB RTX 4080 SUPER the BF16 path is not viable without aggressive offload. Stay on the Q8_0 GGUF UNet, whose 8.69 GB of weights fit in 16 GB of VRAM without offload.

The `Unet Loader (GGUF)` node isn't visible after install

Per the ComfyUI-GGUF README, the node lives under the bootleg category. If it's missing from the node menu entirely:

Confirm the clone landed in ComfyUI/custom_nodes/ComfyUI-GGUF/ (not nested one level deeper).
Verify pip install --upgrade gguf ran in the same Python environment ComfyUI uses (use the embedded interpreter on Windows portable).
Restart ComfyUI fully (not just refresh the browser).
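The first check in the list above can be automated. The sketch below looks for the custom node's __init__.py, which ComfyUI needs in order to import the package, at the expected depth and one level deeper. The function name locate_gguf_node is made up for this example, and "ComfyUI" as the root path assumes you run it from the directory that contains your ComfyUI checkout.

```python
from pathlib import Path


def locate_gguf_node(comfy_root):
    """Classify where the ComfyUI-GGUF checkout sits under custom_nodes/."""
    node_dir = Path(comfy_root) / "custom_nodes" / "ComfyUI-GGUF"
    if (node_dir / "__init__.py").is_file():
        return "ok"
    if (node_dir / "ComfyUI-GGUF" / "__init__.py").is_file():
        return "nested one level too deep"
    return "not found"


if __name__ == "__main__":
    print(locate_gguf_node("ComfyUI"))
```

If it prints "nested one level too deep", move the inner ComfyUI-GGUF folder up so its contents sit directly in custom_nodes/ComfyUI-GGUF/.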

The `Load Diffusion Model` node throws "unsupported format" on a `.gguf` file

You're using the default loader, not the GGUF one. The stock ComfyUI Load Diffusion Model node does not parse GGUF files. Replace it with Unet Loader (GGUF) from the bootleg category, which is the reason for installing the custom node in step 2.
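If you are unsure which format a model file really is (a renamed or partially downloaded file, for instance), its first bytes tell you. GGUF files begin with the four ASCII bytes GGUF, and safetensors files begin with an 8-byte little-endian header length followed by a JSON object. A minimal sketch; the function name sniff_model_format is made up for this example, and the safetensors test is a heuristic rather than a full validation.

```python
def sniff_model_format(path):
    """Guess 'gguf', 'safetensors', or 'unknown' from the first bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(9)
    if head[:4] == b"GGUF":
        return "gguf"
    # safetensors: 8-byte little-endian header length, then a JSON object.
    if len(head) == 9 and head[8:9] == b"{":
        return "safetensors"
    return "unknown"
```

A file that reports "gguf" belongs in models/unet behind the Unet Loader (GGUF) node. A result of "unknown" on a file you expected to be a model suggests a truncated or corrupted download.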