ERNIE-Image-Turbo on RTX 5070: 8-step text-to-image via GGUF in ComfyUI

What You'll Build

A working ComfyUI text-to-image pipeline that runs Baidu's 8B ERNIE-Image-Turbo on a 12GB RTX 5070 using a step-down GGUF quant from the unsloth/ERNIE-Image-Turbo-GGUF repo, loaded through city96's ComfyUI-GGUF custom node. Eight inference steps per image at full 1024×1024 native resolution.

Hardware data: RTX 5070 (12GB VRAM) · 8 inference steps · GGUF Q6_K / Q5_K_M · See benchmark data

ℹ️ Why a Q6_K/Q5_K_M GGUF and not Q8_0 or the full BF16 release. Baidu's card states ERNIE-Image-Turbo can run on consumer GPUs with 24G VRAM (HF card, "Practical deployment" highlight), and a user reports OOM during inference even on a 24 GB card on both the SGLang and Diffusers paths (see Troubleshooting). On a 12 GB card the usable budget with a display attached is closer to 11 GB, so this recipe leads with the Q6_K (6.79 GB) or Q5_K_M (5.93 GB) GGUF rather than the Q8_0 (8.69 GB) that the 16 GB siblings use: the smaller diffusion-model weights leave headroom for the Ministral-3B text encoder and activations. Q8_0 is kept as a headless-only / 16 GB note below.

Requirements

Component	Minimum	Tested
GPU	12GB VRAM NVIDIA (per Civitai workflow notes)	RTX 5070 (12GB)
RAM	16GB system RAM	—
Storage	~15 GB for Q6_K UNet (6.79 GB) + text encoder (7.72 GB) + VAE (0.34 GB)	—
Software	ComfyUI (latest), ComfyUI-Manager, Python 3.10+, PyTorch with CUDA 12.8 (cu128) wheels for Blackwell sm_120	—

The unquantized Baidu release runs on consumer GPUs with 24G VRAM per the official ERNIE-Image-Turbo card — the GGUF quant brings that down to where a 12GB card has room for the diffusion-model weights, the Ministral-3B text encoder, the Flux2 VAE, and activation memory. The sarcastictofu Civitai workflow (a Base-or-Turbo ERNIE-Image flow that ships GGUF as its primary path with FP8 optional) documents a 12 GB minimum for its FP8 path; the smaller GGUF tiers used here keep peak below that floor on the RTX 5070's 12 GB.
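The storage figure in the table is just the sum of the three required downloads. A minimal sketch that redoes the arithmetic and compares it with free disk space follows. The sizes are the ones quoted in this guide and are assumed to be decimal GB; the function names are hypothetical helpers, not part of ComfyUI.

```python
import shutil

# Download sizes in GB, as quoted in this guide (assumed to be decimal GB).
REQUIRED_GB = {
    "ernie-image-turbo-Q6_K.gguf": 6.79,
    "ministral-3-3b.safetensors": 7.72,
    "flux2-vae.safetensors": 0.34,
}
OPTIONAL_GB = {"ernie-image-prompt-enhancer.safetensors": 6.88}


def total_download_gb(include_optional=False):
    """Sum the download sizes, optionally including the prompt enhancer."""
    total = sum(REQUIRED_GB.values())
    if include_optional:
        total += sum(OPTIONAL_GB.values())
    return round(total, 2)


def has_room(path=".", include_optional=False):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb(include_optional)


if __name__ == "__main__":
    print("required downloads:", total_download_gb(), "GB")
    print("with prompt enhancer:", total_download_gb(True), "GB")
    print("enough free space here:", has_room("."))
```

Run it from the drive that holds your ComfyUI folder. The required set comes to 14.85 GB, which is where the ~15 GB in the table comes from.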

Installation

1. Use the cu128 PyTorch wheels for Blackwell sm_120

The RTX 5070 is Blackwell sm_120 (GB205 die). Kernels for this architecture first ship in the CUDA 12.8 (cu128) PyTorch wheels; the older cu126 default lacks them. The ComfyUI portable Windows build ships cu128 by default. For a manual install, pull the wheels from the cu128 index (the command below uses the nightly channel):

pip install --upgrade --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

Verify the runtime sees the device:

python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"

Expect 12.8 (12, 0) in the output: CUDA 12.8 and compute capability 12.0, which is sm_120.
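If you would rather get a pass/fail answer than read the two values by eye, the same check can be sketched as a small script. The helper name blackwell_ready is hypothetical, and treating any CUDA build at or above 12.8 as acceptable is an assumption on my part; the guide itself only names 12.8.

```python
def blackwell_ready(cuda_version, capability):
    """Return True when the CUDA build is >= 12.8 and the GPU reports sm_120."""
    if cuda_version is None:  # CPU-only PyTorch build
        return False
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= (12, 8) and tuple(capability) == (12, 0)


def report():
    """Describe the local PyTorch install without crashing if it is absent."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "CUDA is not available to this PyTorch build"
    ok = blackwell_ready(torch.version.cuda, torch.cuda.get_device_capability())
    return "cu128 + sm_120 detected" if ok else "mismatch: reinstall per step 1"


if __name__ == "__main__":
    print(report())
```

Run it with the same interpreter ComfyUI uses (the embedded one on Windows portable), otherwise it reports on a different PyTorch install.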

2. Install the ComfyUI-GGUF custom node

Per the city96/ComfyUI-GGUF README, clone into ComfyUI's custom_nodes directory and install the gguf Python package:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
pip install --upgrade gguf

On Windows portable ComfyUI, use the embedded interpreter instead:

git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt

Restart ComfyUI after install — the GGUF Unet loader node appears under the bootleg category.

3. Download the GGUF diffusion-model weights

Pick a Q6_K or Q5_K_M quant from the unsloth/ERNIE-Image-Turbo-GGUF repo. The unsloth card is a GGUF quant of the canonical baidu/ERNIE-Image-Turbo upstream (linked via its base_model) and credits city96's ComfyUI-GGUF as the loader tooling. On a 12 GB RTX 5070, lead with one of:

ernie-image-turbo-Q6_K.gguf — 6.79 GB on disk (best quality that still leaves comfortable display headroom)
ernie-image-turbo-Q5_K_M.gguf — 5.93 GB on disk (extra headroom if you also run the prompt enhancer)

# from your ComfyUI root — Q6_K is the recommended 12 GB tier
huggingface-cli download unsloth/ERNIE-Image-Turbo-GGUF \
  ernie-image-turbo-Q6_K.gguf \
  --local-dir ComfyUI/models/unet

Per the ComfyUI-GGUF README, GGUF diffusion-model files live in ComfyUI/models/unet.
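An interrupted download, or an HTML error page saved under a .gguf name, will fail at load time with a confusing message. A quick sanity check is to read the file header: GGUF files begin with the four bytes GGUF followed by a 32-bit format version. The sketch below assumes a little-endian file, which is the common case; read_gguf_version is a hypothetical helper name.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file is not GGUF."""
    with Path(path).open("rb") as fh:
        header = fh.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Bytes 4..8 hold the format version as a little-endian uint32.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    target = Path("ComfyUI/models/unet/ernie-image-turbo-Q6_K.gguf")
    if target.is_file():
        print("GGUF version:", read_gguf_version(target))
    else:
        print("file not found:", target)
```

A None result means the file is not a GGUF container at all, so re-download it. This only checks the header; it does not prove the rest of the file is complete.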

Q8_0 is a 16 GB / headless tier, not a 12 GB tier. The same repo ships ernie-image-turbo-Q8_0.gguf (8.69 GB on disk). On a 12 GB card with a display attached (~11 GB usable), Q8_0 weights plus the text encoder and activations push peak VRAM use over the budget, so Q8_0 is the right choice on a 16 GB card or a headless 12 GB Linux box, not a 12 GB desktop. Stay at Q6_K / Q5_K_M on the RTX 5070.

4. Download the text encoder and VAE

The GGUF diffusion model still needs the auxiliary files the workflow expects. Pull them from the Comfy-Org/ERNIE-Image repackager (the ComfyUI core team's repackaging into ComfyUI's expected layout):

# from your ComfyUI root — text encoder (Ministral-3-3B, 7.72 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ministral-3-3b.safetensors \
  --local-dir ComfyUI/models/

# VAE (Flux2 VAE, 0.34 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  vae/flux2-vae.safetensors \
  --local-dir ComfyUI/models/

# optional prompt enhancer (6.88 GB) — skip on 12 GB unless you disable it per-run (see Running)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ernie-image-prompt-enhancer.safetensors \
  --local-dir ComfyUI/models/

The official ComfyUI ERNIE-Image tutorial lists the same Turbo auxiliary files — ministral-3-3b.safetensors (text encoder), ernie-image-prompt-enhancer.safetensors (prompt enhancer text encoder), and flux2-vae.safetensors (VAE) — under this layout:

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 unet/
│   │   └── ernie-image-turbo-Q6_K.gguf      ← the GGUF diffusion model from step 3
│   ├── 📂 text_encoders/
│   │   ├── ministral-3-3b.safetensors
│   │   └── ernie-image-prompt-enhancer.safetensors
│   └── 📂 vae/
│       └── flux2-vae.safetensors

(The tutorial's default layout puts a full ernie-image-turbo.safetensors in diffusion_models/; this recipe replaces that slot with the GGUF in models/unet loaded via the GGUF node — see step 5.)
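Before opening ComfyUI it can save a restart to confirm every file sits where the tree above expects it. A minimal sketch, assuming the Q6_K file name from step 3; missing_files is a hypothetical helper and the paths are relative to your ComfyUI root.

```python
from pathlib import Path

# Relative paths under the ComfyUI root, mirroring the tree above.
REQUIRED = [
    "models/unet/ernie-image-turbo-Q6_K.gguf",
    "models/text_encoders/ministral-3-3b.safetensors",
    "models/vae/flux2-vae.safetensors",
]
OPTIONAL = ["models/text_encoders/ernie-image-prompt-enhancer.safetensors"]


def missing_files(comfy_root, include_optional=False):
    """List the expected model files that are absent under comfy_root."""
    root = Path(comfy_root)
    wanted = REQUIRED + (OPTIONAL if include_optional else [])
    return [rel for rel in wanted if not (root / rel).is_file()]


if __name__ == "__main__":
    gaps = missing_files("ComfyUI")
    print("all required files present" if not gaps else f"missing: {gaps}")
```

If you chose Q5_K_M instead, change the first entry in REQUIRED to match.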

5. Load the Turbo workflow template

The official ComfyUI tutorial documents the base ERNIE-Image get-started flow as four steps: update ComfyUI to the latest version (or use Comfy Cloud); go to Template and search for ERNIE-Image; select the ERNIE-Image workflow; then download any missing models, update the prompt, and click Run. For Turbo, the same tutorial page describes the model as a faster variant optimized with DMD and RL that generates images in just 8 steps, compared with the ~50 steps the standard model needs, and it offers a separate "Download the ERNIE-Image-Turbo text-to-image workflow JSON file" link. Download that Turbo JSON and load it in ComfyUI.

In the loaded Turbo template, swap the default Load Diffusion Model node for the GGUF Unet loader node (the bootleg category from ComfyUI-GGUF), pointing it at the Q6_K file you downloaded in step 3. The text encoder, VAE, and sampler graph stay as the template ships them.
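The swap is easiest in the graph editor, but if you drive ComfyUI from scripts the same change can be sketched on an API-format workflow (the JSON you get from ComfyUI's API export, not the UI-format template file). Everything specific here is an assumption to verify against your own export: that the stock loader appears as class_type UNETLoader, that the GGUF loader registers as UnetLoaderGGUF, and that it takes a single unet_name input. The node ids and the stock file name in the demo are made up.

```python
import copy


def swap_to_gguf_loader(workflow, gguf_name):
    """Return a copy of an API-format workflow with the stock loader replaced."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if isinstance(node, dict) and node.get("class_type") == "UNETLoader":
            node["class_type"] = "UnetLoaderGGUF"  # assumed GGUF loader class name
            # Assumed: the GGUF loader only needs the file name, no weight_dtype.
            node["inputs"] = {"unet_name": gguf_name}
    return patched


if __name__ == "__main__":
    # Hypothetical two-node excerpt; real exports contain many more nodes.
    demo = {
        "1": {"class_type": "UNETLoader",
              "inputs": {"unet_name": "ernie-image-turbo.safetensors",
                         "weight_dtype": "default"}},
        "2": {"class_type": "KSampler", "inputs": {"model": ["1", 0]}},
    }
    print(swap_to_gguf_loader(demo, "ernie-image-turbo-Q6_K.gguf")["1"])
```

Downstream nodes refer to the loader by node id and output index, so their links stay valid after the class is swapped. The input dict is copied, not modified in place.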

Running

With the workflow loaded and the GGUF loader wired in:

Set resolution to one of the Baidu-recommended sizes from the model card: 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, or 1200×896.
Set sampler steps to 8 and guidance scale (CFG) to 1.0 — Turbo is step-distilled (DMD + RL per the Baidu HF card) and tuned for 8-step generation at CFG 1.0. Higher CFG degrades output.
On a 12 GB card, leave the prompt enhancer disabled (use_pe=False in diffusers terms; in ComfyUI this is the toggle on the ERNIE prompt-enhancer node). It loads a second ~6.88 GB text encoder and is the most common way to blow the 12 GB budget. Enable it only if you drop to Q5_K_M and have closed other VRAM consumers.
Click Run (labeled Queue Prompt in older ComfyUI interfaces).
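The recommended sizes look varied, but they all sit close to one megapixel, which is why switching between them changes memory use very little. A short sketch of that arithmetic; the list is copied from the step above and the helper names are hypothetical.

```python
# Baidu-recommended sizes as listed in this guide: (width, height).
RECOMMENDED = [
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
]


def megapixels(width, height):
    """Pixel count in millions."""
    return width * height / 1_000_000


def is_recommended(width, height):
    """True if the size is one of the listed resolutions."""
    return (width, height) in RECOMMENDED


if __name__ == "__main__":
    for w, h in RECOMMENDED:
        print(f"{w}x{h}: {megapixels(w, h):.3f} MP")
```

1024×1024 is the smallest at 1,048,576 pixels and 896×1200 (or its landscape twin) the largest at 1,075,200, a spread of roughly 2.5%.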

First run is slow due to weight load; subsequent runs reuse the cached diffusion model.

Results

Speed: Not quoted. No community benchmark for ERNIE-Image-Turbo on the RTX 5070 is currently cited, and /check/ reports no benchmark for this pair yet. The RTX 5070 differs from the 16 GB Blackwell siblings in both memory bandwidth (~672 GB/s) and core count, so a sibling card's timing would not be a fair proxy. /check/ernie-image-turbo/rtx-5070 will populate once a benchmark lands. To contribute one, see the submission form.
VRAM usage: The diffusion-model weights are 6.79 GB at Q6_K (or 5.93 GB at Q5_K_M) per the unsloth GGUF tree. The Ministral-3B text encoder (7.72 GB) and Flux2 VAE (0.34 GB) add to that, but ComfyUI runs the text encoder once per generation then frees it before the diffusion sampling pass, so the sampling-time peak is dominated by the GGUF weights + VAE + activations. The 12 GB recipe minimum tracks the FP8-path floor documented in the sarcastictofu Civitai workflow notes, used here as a conservative ceiling until a measured Q6_K benchmark lands at /check/.
Quality notes: 8-step distilled output (DMD + RL). For the cleanest fidelity stay at the recommended 1024×1024 or 848×1264 resolutions. Q6_K is the highest GGUF tier that fits a 12 GB display card with headroom; Q8_0 and the BF16 single-file (16.07 GB) are 16 GB / headless tiers.
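The VRAM note above can be turned into rough numbers. This is a back-of-envelope sketch, not a measurement: it assumes on-disk size approximates in-VRAM size, takes the ~11 GB usable figure from the intro, assumes the text encoder has been freed before sampling as described, and ignores framework overhead. Whatever is left has to cover activations.

```python
USABLE_GB = 11.0  # ~12 GB card with a display attached, per this guide
VAE_GB = 0.34     # Flux2 VAE

# GGUF file sizes in GB from the unsloth repo, as quoted in this guide.
QUANT_GB = {
    "Q8_0": 8.69,
    "Q6_K": 6.79,
    "Q5_K_M": 5.93,
    "Q4_K_M": 5.02,
    "Q4_0": 4.76,
}


def sampling_headroom_gb(quant, usable_gb=USABLE_GB):
    """GB left for activations once the GGUF weights and VAE are resident."""
    return round(usable_gb - QUANT_GB[quant] - VAE_GB, 2)


if __name__ == "__main__":
    for name in QUANT_GB:
        print(f"{name}: {sampling_headroom_gb(name)} GB left for activations")
```

Under these assumptions Q6_K leaves about 3.87 GB and Q8_0 about 1.97 GB, which lines up with treating Q8_0 as a 16 GB / headless tier. Replace the estimate with a measured peak once a benchmark exists.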

For the full benchmark data once it lands, see /check/ernie-image-turbo/rtx-5070.

Troubleshooting

Out of memory during inference

ERNIE-Image-Turbo's unquantized paths are heavy: a community user reports that on a 24 GB RTX 4090 the model loads but hits an out-of-memory error during inference on both the SGLang and Diffusers paths (baidu/ERNIE-Image Issue #4, reporter animebing); a contributor in that thread suggests pipe.enable_model_cpu_offload() for the diffusers path. On a 12 GB RTX 5070 the GGUF route sidesteps that, but if you still OOM:

Disable the prompt enhancer (use_pe=False) to free the ~6.88 GB second text encoder.
Drop one quant tier: the unsloth repo ships ernie-image-turbo-Q5_K_M.gguf (5.93 GB), ernie-image-turbo-Q4_K_M.gguf (5.02 GB), and ernie-image-turbo-Q4_0.gguf (4.76 GB) — drop-in replacements at the GGUF Unet loader.
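Use 1024×1024, which has the lowest pixel count of the recommended sizes. The others are only about 1 to 3% larger, so expect a small saving.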
Restart ComfyUI between runs to reset accumulated VRAM if your driver is leaking allocations.
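The "drop one quant tier" step above follows a fixed ladder, which can be sketched as a lookup. Sizes and file names are the ones listed in this guide; next_smaller is a hypothetical helper.

```python
# Quant tiers from largest to smallest, with sizes in GB from this guide.
LADDER = [
    ("Q6_K", 6.79),
    ("Q5_K_M", 5.93),
    ("Q4_K_M", 5.02),
    ("Q4_0", 4.76),
]


def next_smaller(current):
    """Return the file name one tier below `current`, or None at the bottom."""
    names = [name for name, _ in LADDER]
    index = names.index(current)  # raises ValueError for an unknown tier
    if index + 1 >= len(LADDER):
        return None
    return f"ernie-image-turbo-{names[index + 1]}.gguf"


if __name__ == "__main__":
    print(next_smaller("Q6_K"))
```

Download the returned file into ComfyUI/models/unet and select it in the GGUF Unet loader. Nothing else in the workflow needs to change.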

Blackwell sm_120 kernel missing / "no kernel image is available for execution on the device"

The RTX 5070 is Blackwell sm_120 — kernels for this architecture first ship in CUDA 12.8 (cu128) PyTorch wheels. If your install uses the older cu126 default you'll see kernel-missing errors at the first inference step; reinstall PyTorch per step 1. The same gap affects FlashAttention 2 — FA2 wheels for sm_120 are still incomplete as of mid-2026 (see Dao-AILab/flash-attention#2168). ComfyUI's default attention path (PyTorch SDPA) covers sm_120, so the stock GGUF workflow is unaffected; only manual flash_attn_func calls hit the gap.

The GGUF Unet loader node isn't visible after install

Per the ComfyUI-GGUF README, the node lives under the bootleg category. If it's missing entirely:

Confirm the clone landed in ComfyUI/custom_nodes/ComfyUI-GGUF/ (not nested one level deeper).
Verify pip install --upgrade gguf ran in the same Python environment ComfyUI uses (use the embedded interpreter on Windows portable).
Restart ComfyUI fully (not just a browser refresh).
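The first two checks above can be scripted. This is a minimal sketch: diagnose_gguf_node is a hypothetical helper, and it must be run with the same interpreter ComfyUI uses, or the gguf import check will report on the wrong environment.

```python
import importlib.util
from pathlib import Path


def diagnose_gguf_node(comfy_root):
    """Return a list of likely install problems (empty if none were found)."""
    problems = []
    node_dir = Path(comfy_root) / "custom_nodes" / "ComfyUI-GGUF"
    if not node_dir.is_dir():
        problems.append(f"missing directory: {node_dir}")
    elif (node_dir / "ComfyUI-GGUF").is_dir():
        problems.append("clone is nested one level too deep")
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("gguf") is None:
        problems.append("gguf package is not importable from this interpreter")
    return problems


if __name__ == "__main__":
    issues = diagnose_gguf_node("ComfyUI")
    print("no problems found" if not issues else "\n".join(issues))
```

On Windows portable, run it as .\python_embeded\python.exe followed by the script name from the portable root.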

The Load Diffusion Model node throws "unsupported format" on a .gguf file

You're using the default loader, not the GGUF one. The stock ComfyUI Load Diffusion Model node only reads safetensors. Replace it with the GGUF Unet loader from the bootleg category — that's the whole point of installing the custom node in step 2.