ERNIE-Image-Turbo on RTX 4070: 8-step text-to-image via GGUF in ComfyUI

What You'll Build

A working ComfyUI text-to-image pipeline that runs Baidu's 8B ERNIE-Image-Turbo on a 12GB RTX 4070 using a reduced-precision GGUF quant from the unsloth/ERNIE-Image-Turbo-GGUF repo, loaded through city96's ComfyUI-GGUF custom node. Each image takes eight inference steps at the model's native 1024×1024 resolution.

Hardware data: RTX 4070 (12GB VRAM) · 8 inference steps · GGUF Q6_K / Q5_K_M · See benchmark data

ℹ️ Why a Q6_K/Q5_K_M GGUF and not Q8_0 or the full BF16 release. Baidu's card states ERNIE-Image-Turbo "can run on consumer GPUs with 24G VRAM" (HF card, "Practical deployment" highlight), and a user reports OOM during inference even on a 24 GB card on the SGLang/Diffusers paths (see Troubleshooting). On a 12 GB card the usable budget with a display attached is closer to 11 GB, so this recipe leads with the Q6_K (6.79 GB) or Q5_K_M (5.93 GB) GGUF rather than the Q8_0 (8.69 GB) that the 16 GB siblings use: the smaller diffusion-model weights leave headroom for the Ministral-3-3B text encoder and activations. Q8_0 sits right at the 12 GB budget once a display is attached, so it is kept as a headless-only / 16 GB note below.

Requirements

Component	Minimum	Tested
GPU	12GB VRAM NVIDIA (per Civitai workflow notes)	RTX 4070 (12GB)
RAM	16GB system RAM	—
Storage	~15 GB for Q6_K UNet (6.79 GB) + text encoder (7.72 GB) + VAE (0.34 GB)	—
Software	ComfyUI (latest), ComfyUI-Manager, Python 3.10+, PyTorch with stable CUDA wheels	—

The unquantized Baidu release "can run on consumer GPUs with 24G VRAM" per the official ERNIE-Image-Turbo card. The GGUF quant brings that down to where a 12GB card has room for the diffusion-model weights, the Ministral-3-3B text encoder, the Flux2 VAE, and activation memory. The sarcastictofu Civitai workflow (a Base-or-Turbo ERNIE-Image flow that documents both GGUF and FP8 paths) states a 12 GB minimum for its FP8 path; the smaller GGUF tiers used here should keep peak usage within the RTX 4070's 12 GB, though no measured figure is available yet.
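The ~15 GB storage figure in the table is plain addition over the file sizes quoted in this recipe. A minimal sketch of that arithmetic follows; the function and constant names are illustrative and not part of any tool, and the sizes are on-disk download sizes only.

```python
# File sizes in GB, copied from the figures quoted in this recipe.
DIFFUSION_GB = {"Q6_K": 6.79, "Q5_K_M": 5.93}
TEXT_ENCODER_GB = 7.72     # ministral-3-3b.safetensors
VAE_GB = 0.34              # flux2-vae.safetensors
PROMPT_ENHANCER_GB = 6.88  # optional ernie-image-prompt-enhancer.safetensors


def disk_footprint_gb(quant: str, with_enhancer: bool = False) -> float:
    """Sum the download sizes for one quant tier, rounded to 2 decimals."""
    total = DIFFUSION_GB[quant] + TEXT_ENCODER_GB + VAE_GB
    if with_enhancer:
        total += PROMPT_ENHANCER_GB
    return round(total, 2)


if __name__ == "__main__":
    for quant in DIFFUSION_GB:
        base = disk_footprint_gb(quant)
        full = disk_footprint_gb(quant, with_enhancer=True)
        print(f"{quant}: {base} GB, or {full} GB with the prompt enhancer")
```

Adding the optional prompt enhancer pushes the Q6_K download set from about 15 GB to about 22 GB, which is worth knowing before filling a small SSD.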

Installation

1. Install PyTorch (RTX 4070 is Ada sm_89 — stock wheels work)

The RTX 4070 is Ada Lovelace (AD104), compute capability 8.9 (sm_89). sm_89 kernels ship in the default stable PyTorch CUDA wheels, so no nightly, no --pre, and no special --index-url is required. (This is where a Blackwell RTX 50-series recipe differs: it needs a cu128 nightly wheel, whereas on Ada the stock CUDA 12.x build already covers your card.) The standard ComfyUI install already pulls a working build:

pip install torch torchvision torchaudio

Verify the runtime sees the device:

python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"

You want a CUDA 12.x version and (8, 9) printed.
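If you prefer a script over the one-liner, the same check can be sketched as below. The helper name check_ada_runtime is hypothetical; the only PyTorch calls used are torch.version.cuda, torch.cuda.is_available(), and torch.cuda.get_device_capability().

```python
def check_ada_runtime(cuda_version, capability):
    """Return a list of problems; an empty list means the runtime looks fine.

    cuda_version: the string from torch.version.cuda (None on CPU-only builds).
    capability:   the tuple from torch.cuda.get_device_capability().
    """
    problems = []
    if cuda_version is None:
        problems.append("PyTorch build has no CUDA support (CPU-only wheel?)")
    elif not cuda_version.startswith("12."):
        problems.append(f"expected a CUDA 12.x build, got {cuda_version}")
    if tuple(capability) != (8, 9):
        problems.append(f"expected compute capability (8, 9), got {tuple(capability)}")
    return problems


def main():
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch")
        return
    problems = check_ada_runtime(torch.version.cuda, torch.cuda.get_device_capability())
    print("OK" if not problems else "\n".join(problems))


if __name__ == "__main__":
    main()
```

Run it with the same interpreter ComfyUI uses, otherwise it reports on a different PyTorch install.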

2. Install the ComfyUI-GGUF custom node

Per the city96/ComfyUI-GGUF README, clone into ComfyUI's custom_nodes directory and install the gguf Python package:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

On Windows portable ComfyUI, use the embedded interpreter instead:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Restart ComfyUI after install — the GGUF Unet loader node appears under the bootleg category.

3. Download the GGUF diffusion-model weights

Pick a Q6_K or Q5_K_M quant from the unsloth/ERNIE-Image-Turbo-GGUF repo. The unsloth card is a GGUF quant of the canonical baidu/ERNIE-Image-Turbo upstream (linked via its base_model) and credits city96's ComfyUI-GGUF as the loader tooling. On a 12 GB RTX 4070, lead with one of:

ernie-image-turbo-Q6_K.gguf — 6.79 GB on disk (best quality that still leaves comfortable display headroom)
ernie-image-turbo-Q5_K_M.gguf — 5.93 GB on disk (extra headroom if you also run the prompt enhancer)

# from your ComfyUI root — Q6_K is the recommended 12 GB tier
huggingface-cli download unsloth/ERNIE-Image-Turbo-GGUF \
  ernie-image-turbo-Q6_K.gguf \
  --local-dir ComfyUI/models/unet

Per the ComfyUI-GGUF README, GGUF diffusion-model files live in ComfyUI/models/unet.
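The same download can also be scripted from Python. This is a minimal sketch assuming the huggingface_hub package is installed; hf_hub_download is that library's download function, while DOWNLOADS, plan_downloads, and fetch are illustrative names introduced here.

```python
from pathlib import Path

# (repo_id, filename in the repo, target directory), matching the CLI command above.
DOWNLOADS = [
    ("unsloth/ERNIE-Image-Turbo-GGUF", "ernie-image-turbo-Q6_K.gguf", "ComfyUI/models/unet"),
]


def plan_downloads(base_dir=".", downloads=DOWNLOADS):
    """Resolve each entry to (repo_id, filename, local_dir as a Path)."""
    base = Path(base_dir)
    return [(repo, name, base / subdir) for repo, name, subdir in downloads]


def fetch(base_dir="."):
    """Download every planned file; needs network access and huggingface_hub."""
    from huggingface_hub import hf_hub_download  # lazy import, only needed here

    for repo, name, local_dir in plan_downloads(base_dir):
        local_dir.mkdir(parents=True, exist_ok=True)
        path = hf_hub_download(repo_id=repo, filename=name, local_dir=str(local_dir))
        print("downloaded", path)
```

Calling fetch() from the same directory as the CLI command should place the file in the same location; swap the filename for ernie-image-turbo-Q5_K_M.gguf to pull the smaller tier.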

Q8_0 is a 16 GB / headless tier, not a 12 GB tier. The same repo ships ernie-image-turbo-Q8_0.gguf (8.69 GB on disk). On a 12 GB card with a display attached (~11 GB usable), the Q8_0 weights plus the text encoder and activations push the runtime peak right up to the budget. Q8_0 is the right choice on a 16 GB card or a headless 12 GB Linux box, not a 12 GB desktop. Stay at Q6_K / Q5_K_M on the RTX 4070.

4. Download the text encoder and VAE

The GGUF diffusion model still needs the auxiliary files the workflow expects. Pull them from the Comfy-Org/ERNIE-Image repackager (the ComfyUI core team's repackaging into ComfyUI's expected layout):

# from your ComfyUI root — text encoder (Ministral-3-3B, 7.72 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ministral-3-3b.safetensors \
  --local-dir ComfyUI/models/

# VAE (Flux2 VAE, 0.34 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  vae/flux2-vae.safetensors \
  --local-dir ComfyUI/models/

# optional prompt enhancer (6.88 GB) — skip on 12 GB unless you disable it per-run (see Running)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ernie-image-prompt-enhancer.safetensors \
  --local-dir ComfyUI/models/

The official ComfyUI ERNIE-Image tutorial lists the same Turbo auxiliary files — ministral-3-3b.safetensors (text encoder), ernie-image-prompt-enhancer.safetensors (prompt enhancer text encoder), and flux2-vae.safetensors (VAE) — under this layout:

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 unet/
│   │   └── ernie-image-turbo-Q6_K.gguf      ← the GGUF diffusion model from step 3
│   ├── 📂 text_encoders/
│   │   ├── ministral-3-3b.safetensors
│   │   └── ernie-image-prompt-enhancer.safetensors
│   └── 📂 vae/
│       └── flux2-vae.safetensors

(The tutorial's default layout puts a full ernie-image-turbo.safetensors in diffusion_models/; this recipe replaces that slot with the GGUF in models/unet loaded via the GGUF node — see step 5.)
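Before opening ComfyUI, it can help to confirm the files landed where the layout above expects them. The following is a small sketch using only the standard library; REQUIRED and missing_files are illustrative names, and the optional prompt enhancer is deliberately left out of the required list.

```python
from pathlib import Path

# Paths relative to the ComfyUI directory, taken from the layout above.
REQUIRED = [
    "models/unet/ernie-image-turbo-Q6_K.gguf",
    "models/text_encoders/ministral-3-3b.safetensors",
    "models/vae/flux2-vae.safetensors",
]


def missing_files(comfy_dir, required=REQUIRED):
    """Return the relative paths that do not exist as files under comfy_dir."""
    root = Path(comfy_dir)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("ComfyUI")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  ", rel)
    else:
        print("All required model files are in place.")
```

If you downloaded Q5_K_M instead, pass a modified list as the required argument.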

5. Load the Turbo workflow template

The official ComfyUI tutorial documents the base ERNIE-Image get-started flow as four steps: update ComfyUI to the latest version (or use Comfy Cloud), open the Template menu and search for ERNIE-Image, select the ERNIE-Image workflow, then download any missing models, update the prompt, and click Run. For the Turbo variant the same tutorial page describes it as a faster variant optimized with DMD and RL that generates images in 8 steps versus the roughly 50 steps the standard model needs, and it offers a separate "Download the ERNIE-Image-Turbo text-to-image workflow JSON file" link. (Baidu's own card confirms this characterization: the Turbo checkpoint is "optimized by DMD and RL" and produces output "in only 8 inference steps" — see the HF card.) Download that Turbo JSON and load it in ComfyUI.

In the loaded Turbo template, swap the default Load Diffusion Model node for the GGUF Unet loader node (the bootleg category from ComfyUI-GGUF), pointing it at the Q6_K file you downloaded in step 3. The text encoder, VAE, and sampler graph stay as the template ships them.
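For scripted runs, the same node swap can be sketched against an API-format export of the workflow (a JSON object keyed by node id, each with class_type and inputs). Everything specific here is an assumption to verify against your install: that the stock loader's class_type is UNETLoader, that the ComfyUI-GGUF loader's class_type is UnetLoaderGGUF, and that it takes a single unet_name input. The function name is hypothetical.

```python
import copy


def swap_to_gguf_loader(workflow, gguf_name):
    """Return (patched_copy, count) with stock diffusion-model loaders swapped.

    Assumes an API-format workflow dict and the class_type names noted above.
    Node ids are kept, so downstream links such as ["1", 0] stay valid.
    """
    patched = copy.deepcopy(workflow)
    swapped = 0
    for node in patched.values():
        if isinstance(node, dict) and node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"  # assumed ComfyUI-GGUF class name
            node["inputs"] = {"unet_name": gguf_name}  # assumed single input
            swapped += 1
    return patched, swapped


if __name__ == "__main__":
    example = {
        "1": {"class_type": "UNETLoader",
              "inputs": {"unet_name": "ernie-image-turbo.safetensors",
                         "weight_dtype": "default"}},
        "2": {"class_type": "KSampler", "inputs": {"model": ["1", 0], "steps": 8}},
    }
    patched, count = swap_to_gguf_loader(example, "ernie-image-turbo-Q6_K.gguf")
    print(count, patched["1"])
```

The Turbo JSON from the tutorial is a UI-format file, so this sketch applies only after you re-export it in API format from ComfyUI; doing the swap by hand in the graph editor remains the simpler route.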

Running

With the workflow loaded and the GGUF loader wired in:

Set resolution to one of the Baidu-recommended sizes: 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, or 1200×896.
Set sampler steps to 8 and guidance scale (CFG) to 1.0 — Turbo is step-distilled (DMD + RL per the Baidu HF card) and tuned for 8-step generation. Higher CFG degrades output.
On a 12 GB card, leave the prompt enhancer disabled (use_pe=False in diffusers terms; in ComfyUI this is the toggle on the ERNIE prompt-enhancer node). It loads a second ~6.88 GB text encoder and is the most common way to blow the 12 GB budget. Enable it only if you drop to Q5_K_M and have closed other VRAM consumers.
Hit Queue Prompt.

First run is slow due to weight load; subsequent runs reuse the cached diffusion model.
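The settings above are easy to get wrong when adapting a workflow from another model. A minimal sanity check is sketched below; check_settings is a hypothetical helper, and the size list, step count, and CFG value are the ones given in this section.

```python
# Baidu-recommended sizes as (width, height), copied from the list above.
RECOMMENDED_SIZES = [
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
]


def check_settings(width, height, steps, cfg):
    """Return a list of deviations from the recommended Turbo settings."""
    problems = []
    if (width, height) not in RECOMMENDED_SIZES:
        problems.append(f"{width}x{height} is not a recommended size")
    if steps != 8:
        problems.append(f"steps is {steps}, Turbo is tuned for 8")
    if abs(cfg - 1.0) > 1e-9:
        problems.append(f"CFG is {cfg}, expected 1.0")
    return problems


def smallest_size():
    """Return the recommended size with the fewest pixels."""
    return min(RECOMMENDED_SIZES, key=lambda wh: wh[0] * wh[1])


if __name__ == "__main__":
    print(check_settings(1024, 1024, 8, 1.0))
    print(check_settings(512, 512, 20, 7.0))
    print(smallest_size())
```

All seven recommended sizes sit close to one megapixel, and 1024×1024 is the smallest of them by pixel count.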

Results

Speed: Not quoted. No community benchmark on the RTX 4070 for ERNIE-Image-Turbo is currently cited, and /check/ernie-image-turbo/rtx-4070 reports no benchmark for this pair yet (verdict: unknown). The RTX 4070's ~504 GB/s memory bandwidth and 5888 CUDA cores differ enough from every available sibling card — including the 16 GB Ada RTX 4070 Ti SUPER (~672 GB/s, 8448 cores) — that none of their figures would transfer honestly. The /check page populates once a benchmark lands — to contribute one, see the submission form.
VRAM usage: The diffusion-model weights are 6.79 GB at Q6_K (or 5.93 GB at Q5_K_M) per the unsloth GGUF tree. The Ministral-3-3B text encoder (7.72 GB) and Flux2 VAE (0.34 GB) add to that, but ComfyUI runs the text encoder once per generation and then frees it before the diffusion sampling pass, so the sampling-time peak is dominated by the GGUF weights + VAE + activations. The 12 GB minimum in this recipe is borrowed from the FP8-path minimum documented in the sarcastictofu Civitai workflow notes; it serves as a conservative requirement until a measured Q6_K benchmark lands at /check/.
Quality notes: 8-step distilled output (DMD + RL). For the cleanest fidelity stay at the recommended 1024×1024 or 848×1264 resolutions. Q6_K is the highest GGUF tier that fits a 12 GB display card with headroom; Q8_0 (8.69 GB) is a 16 GB / headless tier, and the BF16 single file (16.07 GB) is larger than the card's entire VRAM.
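The headroom argument can be made concrete with rough arithmetic. This sketch assumes the on-disk file size approximates the VRAM the weights occupy (it is not a measurement) and uses the ~11 GB usable figure quoted earlier for a 12 GB card with a display attached; all names are illustrative.

```python
USABLE_GB = 11.0  # ~11 GB usable on a 12 GB card with a display, per this recipe
WEIGHTS_GB = {"Q8_0": 8.69, "Q6_K": 6.79, "Q5_K_M": 5.93}  # on-disk sizes
VAE_GB = 0.34


def sampling_headroom_gb(quant, usable_gb=USABLE_GB):
    """Estimate GB left for activations during sampling (file-size approximation)."""
    return round(usable_gb - WEIGHTS_GB[quant] - VAE_GB, 2)


if __name__ == "__main__":
    for quant in WEIGHTS_GB:
        print(f"{quant}: about {sampling_headroom_gb(quant)} GB left for activations")
```

Under these assumptions Q6_K leaves roughly 3.9 GB for activations and Q8_0 under 2 GB, which matches the recommendation to stay at Q6_K or below on a desktop card.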

For the full benchmark data once it lands, see /check/ernie-image-turbo/rtx-4070.

Troubleshooting

Out of memory during inference

ERNIE-Image-Turbo's unquantized paths are heavy: a user reports that on a 24 GB RTX 4090 the model loads but hits an out-of-memory error during inference on both the SGLang and Diffusers paths (baidu/ERNIE-Image Issue #4, reporter animebing); a contributor in that thread suggests pipe.enable_model_cpu_offload() for the diffusers path. On a 12 GB RTX 4070 the GGUF route sidesteps that, but if you still OOM:

Disable the prompt enhancer (use_pe=False) to free the ~6.88 GB second text encoder.
Drop one quant tier: the unsloth repo ships ernie-image-turbo-Q5_K_M.gguf (5.93 GB), ernie-image-turbo-Q4_K_M.gguf (5.02 GB), and ernie-image-turbo-Q4_0.gguf (4.76 GB) — drop-in replacements at the GGUF Unet loader.
Drop the output resolution to 1024×1024, which has the fewest pixels of the recommended sizes.
Restart ComfyUI between runs to reset accumulated VRAM if your driver is leaking allocations.
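The "drop one quant tier" step can be written down as a simple ladder. The sketch below orders the tiers by the file sizes quoted in this recipe; next_smaller_quant is a hypothetical helper, not part of ComfyUI or ComfyUI-GGUF.

```python
# Tiers from the unsloth repo as quoted in this recipe, largest first (GB on disk).
QUANT_LADDER = [
    ("Q8_0", 8.69),
    ("Q6_K", 6.79),
    ("Q5_K_M", 5.93),
    ("Q4_K_M", 5.02),
    ("Q4_0", 4.76),
]


def next_smaller_quant(current):
    """Return (name, size_gb) of the next tier down, or None at the bottom.

    Raises ValueError if the tier name is not in the ladder.
    """
    names = [name for name, _ in QUANT_LADDER]
    index = names.index(current)
    if index + 1 < len(QUANT_LADDER):
        return QUANT_LADDER[index + 1]
    return None


if __name__ == "__main__":
    name, size = next_smaller_quant("Q6_K")
    print(f"After an OOM at Q6_K, try ernie-image-turbo-{name}.gguf ({size} GB)")
```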

Q8_0 OOMs on this card

The same unsloth repo ships ernie-image-turbo-Q8_0.gguf (8.69 GB), which the 16 GB siblings use as their default. On a 12 GB RTX 4070 with a display attached (~11 GB usable), the Q8_0 weights plus the text encoder pass and activations push the peak right up to the budget and OOM is likely. Stay at Q6_K (6.79 GB) or Q5_K_M (5.93 GB); reserve Q8_0 for a 16 GB card or a headless 12 GB Linux box with no display claiming VRAM.

The GGUF Unet loader node isn't visible after install

Per the ComfyUI-GGUF README, the node lives under the bootleg category. If it's missing entirely:

Confirm the clone landed in ComfyUI/custom_nodes/ComfyUI-GGUF/ (not nested one level deeper).
Verify pip install --upgrade gguf ran in the same Python environment ComfyUI uses (use the embedded interpreter on Windows portable).
Restart ComfyUI fully (not just a browser refresh).
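The first two checks can be scripted. This is a sketch using only the standard library; diagnose is a hypothetical helper, and it must be run with the interpreter ComfyUI itself uses (the embedded one on Windows portable) for the gguf result to mean anything.

```python
import importlib.util
import sys
from pathlib import Path


def diagnose(comfy_dir="ComfyUI"):
    """Collect facts about the custom node install for the current interpreter."""
    node_dir = Path(comfy_dir) / "custom_nodes" / "ComfyUI-GGUF"
    return {
        "interpreter": sys.executable,
        # True if `import gguf` would succeed in this environment.
        "gguf_importable": importlib.util.find_spec("gguf") is not None,
        "node_dir_exists": node_dir.is_dir(),
        # True if the repo was cloned one level too deep.
        "nested_clone": (node_dir / "ComfyUI-GGUF").is_dir(),
    }


if __name__ == "__main__":
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

A healthy install shows gguf_importable and node_dir_exists as True and nested_clone as False.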

The Load Diffusion Model node throws "unsupported format" on a .gguf file

You're using the default loader, not the GGUF one. The stock ComfyUI Load Diffusion Model node only reads safetensors. Replace it with the GGUF Unet loader from the bootleg category — that's the whole point of installing the custom node in step 2.