
ERNIE-Image-Turbo on RTX 5090: 8-Step Text-to-Image at BF16 in ComfyUI

image · intermediate · 24GB+ VRAM · May 25, 2026
prerequisites
  • NVIDIA RTX 5090 (32GB VRAM) or any 24GB+ NVIDIA GPU with comfortable headroom
  • Python 3.10+
  • ComfyUI (latest, with cu128 PyTorch wheels for Blackwell sm_120)

What You'll Build

A working ComfyUI text-to-image pipeline that runs Baidu's 8B ERNIE-Image-Turbo at full BF16 precision on an RTX 5090, using the official Comfy-Org repackager (a single 16.07 GB safetensors file). It runs 8 inference steps per image at the native 1024×1024 resolution with no quantization required: with 32 GB of VRAM, the 5090 is the first consumer card that clears Baidu's documented 24 GB floor with real margin for the Ministral-3-3B text encoder, Flux2 VAE, and activation memory.

Hardware data: RTX 5090 (32GB VRAM) · 8 inference steps · BF16 single-file · See benchmark data

ℹ️ 24 GB is the documented floor, not a comfortable fit. Baidu's model card states ERNIE-Image-Turbo "can run on consumer GPUs with 24G VRAM" (HF card), but in Issue #4 a user reports OOM during inference on an RTX 4090 24 GB with both the SGLang and Diffusers paths. The 5090, with 32 GB, is the first consumer card that clears the 24 GB floor with margin to spare. That makes BF16 the default here with no quantization needed, whereas 24 GB cards still need GGUF (see the RTX 5060 Ti recipe for the GGUF path).

Requirements

Component | Minimum | Tested
GPU | 24 GB VRAM NVIDIA (per Baidu HF card) | RTX 5090 (32GB)
RAM | 32 GB system RAM |
Storage | ~31 GB (UNet 16.07 GB + text encoders 14.6 GB + VAE 0.34 GB) |
Software | ComfyUI (latest), ComfyUI-Manager, Python 3.10+, PyTorch with CUDA 12.8 (cu128) wheels for Blackwell sm_120 |

The Comfy-Org repackager ships the model as a single 16.07 GB safetensors file (files manifest) — auxiliary text encoders (Ministral-3-3B at 7.72 GB + ERNIE prompt enhancer at 6.88 GB) and the Flux2 VAE (0.34 GB) load separately on demand.
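The storage figure can be checked with a few lines of arithmetic. The sketch below sums the per-file sizes quoted in this recipe and compares the total against free disk space; the 5 GB slack is an arbitrary assumption, not a documented requirement.

```python
import shutil

# File sizes in GB, copied from the figures quoted in this recipe.
FILES_GB = {
    "diffusion_models/ernie-image-turbo.safetensors": 16.07,
    "text_encoders/ministral-3-3b.safetensors": 7.72,
    "text_encoders/ernie-image-prompt-enhancer.safetensors": 6.88,
    "vae/flux2-vae.safetensors": 0.34,
}


def total_download_gb(files=FILES_GB):
    """Sum the per-file sizes, rounded to two decimals."""
    return round(sum(files.values()), 2)


def has_room(path=".", slack_gb=5.0):
    """True if `path` has room for all files plus slack (slack is an assumption)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb() + slack_gb


print(f"Total download: {total_download_gb()} GB")
print("Enough free space in the current directory:", has_room("."))
```

Point `has_room` at the drive that holds your ComfyUI install rather than the current directory if they differ.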

Installation

1. Update ComfyUI to a build that ships ERNIE-Image native nodes

Per the official ComfyUI tutorial, the get-started flow for the ERNIE-Image family is: update ComfyUI to the latest version (or use Comfy Cloud), open Template, search for ERNIE-Image, and select the workflow you want. The tutorial page documents the base ERNIE-Image workflow inline and links "the ERNIE-Image-Turbo text-to-image workflow JSON file" separately; the Turbo template also appears in the same Template browser, so you can pick it from the search results. Then download any missing models, update the prompt, and click Run.

Make sure your ComfyUI install uses the cu128 PyTorch wheels (CUDA 12.8). Blackwell sm_120 kernels first ship in CUDA 12.8, and the default cu126 wheels lack them. The ComfyUI portable Windows build ships cu128 by default; manual installs should pull PyTorch from the cu128 index (the command below uses the nightly channel and does not pin a specific version):

pip install --upgrade --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

2. Download the Comfy-Org repackager weights

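The Comfy-Org/ERNIE-Image repackager is the ComfyUI core team's own repackaging of Baidu's release into ComfyUI's expected layout. Run huggingface-cli from the directory that contains your ComfyUI folder, so that the ComfyUI/models/ path passed to --local-dir resolves correctly: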

# UNet (16.07 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  diffusion_models/ernie-image-turbo.safetensors \
  --local-dir ComfyUI/models/

# Text encoders (Ministral-3-3B + prompt enhancer, ~14.6 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  text_encoders/ministral-3-3b.safetensors \
  text_encoders/ernie-image-prompt-enhancer.safetensors \
  --local-dir ComfyUI/models/

# VAE (Flux2 VAE, 0.34 GB)
huggingface-cli download Comfy-Org/ERNIE-Image \
  vae/flux2-vae.safetensors \
  --local-dir ComfyUI/models/

The Comfy-Org README documents the expected layout verbatim:

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 diffusion_models/
│   │   └── ernie-image-turbo.safetensors
│   ├── 📂 text_encoders/
│   │   ├── ministral-3-3b.safetensors
│   │   └── ernie-image-prompt-enhancer.safetensors
│   └── 📂 vae/
│       └── flux2-vae.safetensors
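
Before opening ComfyUI, it can help to confirm that the downloads landed where the layout above expects them. The following is a minimal sketch that assumes it is run from the directory containing the ComfyUI folder; the helper name `missing_files` is made up for this example.

```python
from pathlib import Path

# Relative paths taken from the layout shown above.
EXPECTED = [
    "diffusion_models/ernie-image-turbo.safetensors",
    "text_encoders/ministral-3-3b.safetensors",
    "text_encoders/ernie-image-prompt-enhancer.safetensors",
    "vae/flux2-vae.safetensors",
]


def missing_files(models_dir):
    """Return the expected relative paths that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


missing = missing_files("ComfyUI/models")
if missing:
    print("Missing files:")
    for rel in missing:
        print("  ", rel)
else:
    print("All model files are in place")
```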

3. Load the ERNIE-Image-Turbo workflow template

In ComfyUI: Workflow → Browse Templates → search "ERNIE-Image" → select "ERNIE-Image-Turbo". If any files are still missing, ComfyUI's missing-model dialog will offer direct download links pointed at the same Comfy-Org repo from step 2.

Running

With the workflow loaded:

  1. Set resolution to one of the Baidu-recommended sizes: 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768, or 1200×896.
  2. Set sampler steps to 8 and guidance scale (CFG) to 1.0 — Turbo is step-distilled (DMD + RL per the Baidu HF card) and explicitly tuned for these settings. Higher CFG degrades output.
  3. Optionally enable the prompt enhancer (use_pe=True in diffusers; in the official template the toggle is on the ERNIE prompt-enhancer node). The enhancer adds ~6.88 GB of resident VRAM but improves complex-prompt fidelity, as reflected in the OneIG and LongTextBench scores documented in the Baidu card.
  4. Hit Queue Prompt.
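
If your target image has an arbitrary aspect ratio, one way to stay on the recommended list is to snap to the entry with the nearest ratio. The sketch below assumes the sizes in step 1 are written as width×height; `closest_recommended` is a hypothetical helper, not part of ComfyUI.

```python
# Sizes from step 1, assumed to be (width, height).
RECOMMENDED = [
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
]

# Sampler settings from step 2.
SETTINGS = {"steps": 8, "cfg": 1.0}


def closest_recommended(width, height):
    """Pick the recommended size whose aspect ratio is nearest to width/height."""
    target = width / height
    return min(RECOMMENDED, key=lambda wh: abs(wh[0] / wh[1] - target))


print(closest_recommended(1920, 1080))  # 16:9 landscape -> (1376, 768)
print(closest_recommended(1080, 1350))  # 4:5 portrait   -> (896, 1200)
```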

First run loads the 16 GB UNet from disk; subsequent runs reuse the cached weights.
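
For scripted runs, the same workflow can be queued over ComfyUI's HTTP API instead of the UI button. This is a sketch under several assumptions: the server is listening on the default 127.0.0.1:8188, and the workflow has been exported in API format to a file named here, hypothetically, `ernie_turbo_api.json`. Node IDs inside that file depend on your export, so the code does not touch them.

```python
import json
import urllib.request


def build_prompt_request(workflow, host="127.0.0.1", port=8188):
    """Wrap an API-format workflow dict in a POST request to the /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, **kwargs):
    """Load a workflow JSON file, send it to the server, and return the JSON reply."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_prompt_request(workflow, **kwargs)) as resp:
        return json.loads(resp.read())


# Example call (requires a running ComfyUI server and your own exported file):
# print(queue_workflow("ernie_turbo_api.json"))
```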

Results

  • Speed: Not quoted. None of the sources reviewed cites a community benchmark on the RTX 5090 with a matching configuration (resolution, step count, sampler, prompt-enhancer toggle). The /check/ernie-image-turbo/rtx-5090 page will populate once a benchmark report lands. To contribute one, see the submission form.
  • VRAM usage: Baidu's HF card states the model "can run on consumer GPUs with 24G VRAM" — the 5090's 32 GB envelope clears this floor with ~8 GB of margin for the prompt enhancer, activations, and CUDA workspace. Per-file disk sizes (Comfy-Org tree): UNet 16.07 GB + text encoders 14.59 GB + VAE 0.34 GB; not all of those are resident simultaneously (text encoders run once per generation then offload).
  • Quality notes: ERNIE-Image-Turbo is competitive with Z-Image, FLUX.2-Klein-9B, and Qwen-Image on the GenEval and OneIG benchmarks (Baidu card) — particularly strong on LongTextBench (English/Chinese text rendering) where it scores 0.9655 with prompt enhancer versus FLUX.2-Klein-9B's 0.5413. Best output stays at the recommended resolutions; off-aspect-ratio crops degrade.

For the full benchmark data once it lands, see /check/ernie-image-turbo/rtx-5090.
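
If you want to produce a benchmark report yourself, a small timing harness keeps the numbers comparable. This is only a sketch: `generate` stands for whatever blocking call produces one image in your setup, and the warm-up run is there to exclude the first-run weight load described above. The `fake_generate` stand-in exists only to make the example runnable.

```python
import statistics
import time


def benchmark(generate, runs=5, warmup=1):
    """Time `generate` over several runs and report seconds per call."""
    for _ in range(warmup):
        generate()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def fake_generate():
    """Stand-in workload; replace with a call that blocks until the image is done."""
    time.sleep(0.01)


print(benchmark(fake_generate, runs=3))
```

Record the resolution, step count, sampler, and prompt-enhancer setting alongside the timings so the result matches the configuration fields listed above.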

Troubleshooting

OOM during inference even though the model loads

Per Issue #4 on baidu/ERNIE-Image, a user reports OOM during inference on an RTX 4090 24 GB with both the SGLang and Diffusers paths — "the model can be successfully loaded, an out-of-memory (OOM) error occurs during the inference process." No maintainer response is published in that thread yet.

With the 5090's 32 GB this is unlikely to bite at the recommended resolutions, but larger resolutions or batch sizes can push memory use past the limit and trigger the same failure. Workarounds:

  1. Stay at the recommended resolution list — 1024×1024 is the safest first run.
  2. Disable the prompt enhancer (use_pe=False) to free ~6.88 GB. Quality drops on long prompts but throughput recovers.
  3. Drop to a GGUF quant from the unsloth/ERNIE-Image-Turbo-GGUF mirror instead of the Comfy-Org BF16 file. Q8_0 (8.69 GB on disk) loads through city96's ComfyUI-GGUF custom node and frees ~7 GB of weight memory relative to the 16.07 GB BF16 file. The RTX 5060 Ti recipe walks the GGUF path end-to-end.
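
To see how close a run gets to the limit before and after applying a workaround, you can poll nvidia-smi while generating. The sketch below uses nvidia-smi's CSV query output; the function names are made up for this example, and it returns an empty list when nvidia-smi is unavailable.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (MiB) into (used, total, free) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def gpu_memory():
    """Query nvidia-smi; return [] if the tool is missing or the call fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(QUERY, capture_output=True, text=True)
    return parse_memory(proc.stdout) if proc.returncode == 0 else []


for index, (used, total, free) in enumerate(gpu_memory()):
    print(f"GPU {index}: {used} MiB used of {total} MiB, {free} MiB free")
```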

Blackwell sm_120 kernel missing / "no kernel image is available for execution on the device"

The RTX 5090 is Blackwell sm_120 — kernels for this architecture first ship in CUDA 12.8 (cu128) PyTorch wheels. If your install uses the older cu126 default you'll see kernel-missing errors at the first inference step. Verify with:

python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"

You want 12.8 and (12, 0) printed. If not, reinstall PyTorch per step 1 above. The same caveat applies to FlashAttention 2: FA2 wheels for sm_120 are still incomplete as of mid-2026 (see Dao-AILab/flash-attention#2168). ComfyUI's default attention path (PyTorch SDPA) covers sm_120; only manual flash_attn_func calls hit the gap.
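
The one-liner above can be expanded into a check that explains what is wrong. This sketch treats any CUDA build of 12.8 or newer as acceptable, which is an assumption beyond the exact 12.8 value named above; `check_blackwell` is a hypothetical helper.

```python
def check_blackwell(cuda_version, capability):
    """Return a list of problems for an RTX 5090 setup; an empty list means OK."""
    problems = []
    if cuda_version is None:
        problems.append("PyTorch build has no CUDA support")
    else:
        major, minor = (int(part) for part in cuda_version.split(".")[:2])
        # Assumption: builds newer than 12.8 also carry sm_120 kernels.
        if (major, minor) < (12, 8):
            problems.append(f"CUDA {cuda_version} wheels predate sm_120 support")
    if capability is not None and tuple(capability) != (12, 0):
        problems.append(f"device reports capability {tuple(capability)}, not (12, 0)")
    return problems


try:
    import torch

    cap = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
    print(check_blackwell(torch.version.cuda, cap) or "OK")
except ImportError:
    print("PyTorch is not installed in this environment")
```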

Community FP8 / NVFP4 mirrors exist — should I use them on Blackwell?

A community-published mirror Abiray/ERNIE-Image-Turbo-FP8-NVFP4 ships both an FP8 weight file (8.2 GB) and an NVFP4 weight file (4.8 GB). NVFP4 is genuinely Blackwell-native (sm_120 has FP4 microscaling tensor cores; Ada and Ampere do not), so on paper this is a real speed win on the 5090 versus the BF16 path. Caveats:

  • Neither file has an official Baidu or Comfy-Org link-back; the mirror is a single-author community upload (no maintainer endorsement per the discussions tab).
  • The mirror documents only a diffusers loader (DiffusionPipeline.from_pretrained), not a ComfyUI workflow — ComfyUI's NVFP4 loader path for ERNIE-Image is not yet documented in the official tutorial.
  • No 5090-named VRAM or speed measurement is published for either quant.

This recipe stays on the Comfy-Org BF16 path because the 32 GB envelope makes the quantization-for-fit motivation moot. If you want to experiment with NVFP4 specifically for the Blackwell FP4 acceleration, treat it as research-mode, not as the recommended install — and please report findings to /contribute.
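
The file sizes quoted for the three precisions line up with a back-of-envelope estimate from parameter count and bits per weight. The sketch below assumes a round 8 billion parameters (the recipe only says "8B") and, for NVFP4, 4-bit values plus one 8-bit scale per 16-value block. Real files come out somewhat larger, plausibly because some layers stay in higher precision.

```python
PARAMS = 8e9  # "8B" as stated in this recipe; the exact count is an assumption

BITS_PER_PARAM = {
    "BF16": 16.0,
    "FP8": 8.0,
    # Assumed NVFP4 layout: 4-bit values plus one 8-bit scale per 16 values.
    "NVFP4": 4.0 + 8.0 / 16,
}


def weight_gb(fmt, params=PARAMS):
    """Estimated weight size in GB (10^9 bytes) for a given format."""
    return params * BITS_PER_PARAM[fmt] / 8 / 1e9


for fmt in BITS_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(fmt):.1f} GB")
```

The estimates (16.0, 8.0, and 4.5 GB) sit just under the 16.07 GB, 8.2 GB, and 4.8 GB files mentioned in this recipe.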