Z-Image Turbo on RTX 4090: 8-Step 1024x1024 Text-to-Image at BF16 in ~2.3s with Diffusers or ComfyUI

image · beginner · 16GB+ VRAM · May 20, 2026
prerequisites
  • NVIDIA RTX 4090 (24GB VRAM) or any consumer GPU with at least 16GB VRAM
  • Python 3.10+
  • PyTorch with CUDA support and bfloat16 capability
  • ComfyUI (latest nightly) OR HuggingFace diffusers from main

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 4090. The 24 GB Ada Lovelace card is comfortably over-provisioned for the model's 16 GB floor, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no CPU offload, no text-encoder workarounds. Launch-day write-ups put a single 1024×1024 generation at approximately 2.3 seconds with a steady 13 GB VRAM footprint on this card.

Hardware data: RTX 4090 (24 GB VRAM) · ~2.3 s per 1024×1024 BF16 image at 8 NFEs · ~13 GB peak VRAM · See benchmark data

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder (visible in the Comfy-Org workflow's text_encoders/qwen_3_4b.safetensors), which alone needs roughly 8 GB at BF16. On a 24 GB card the upstream-recommended BF16 build "fits comfortably" per the Tongyi-MAI model card, with measured peak around 13 GB per the Alibaba launch write-up via AIbase. That leaves headroom for batch_size > 1, higher resolutions, or other GPU-resident apps.
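
The 8 GB figure follows from parameter count multiplied by bytes per parameter. A minimal sketch of that arithmetic, assuming nominal round parameter counts (4B for the text encoder, 6B for the DiT); the helper name is illustrative, and real checkpoints differ slightly from these round numbers:

```python
# Back-of-envelope weight footprint: parameters x bytes per parameter.
# Parameter counts are nominal round figures, not exact checkpoint sizes.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Approximate weight size in GB (10^9 bytes) for a given dtype."""
    return params_billion * BYTES_PER_PARAM[dtype]


text_encoder_gb = weight_gb(4.0)  # Qwen3-4B text encoder
dit_gb = weight_gb(6.0)           # 6B DiT
print(f"text encoder ~{text_encoder_gb:.0f} GB, DiT ~{dit_gb:.0f} GB at BF16")
```

Taken together the nominal weights come to about 20 GB, which is more than the ~13 GB peak cited above. One possible reading is that the reported peak was taken with the components not all resident at once (for example, with the text encoder released after prompt encoding). That is an inference from the arithmetic, not something the cited sources state.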

Note on variants: The Tongyi-MAI Z-Image family ships three weight sets: Z-Image (Base), Z-Image-Turbo, and Z-Image-Edit. This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes like Juggernaut-Z (RunDiffusion) are a separate model with its own slug.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 16GB VRAM consumer card (per Tongyi-MAI model card) | RTX 4090 (24GB, Ada Lovelace, sm_89) |
| RAM | 16GB system RAM | |
| Storage | ~21GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.3 GB, per the Comfy-Org split-file mirror) | |
| Software | Python 3.10+, PyTorch with CUDA + bf16 support | ComfyUI nightly / diffusers @ main |

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card. The RTX 4090 sits well above that floor at 24 GB, so the canonical BF16 path is unconstrained. No special CUDA wheel selection is required for Ada Lovelace cards: the default pip install torch already ships sm_89 kernels.
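
The requirements above can be checked before downloading any weights. The following is a sketch only: the function names and the 5% tolerance are assumptions made for illustration, while the PyTorch calls (`torch.cuda.is_available`, `get_device_properties`, `get_device_capability`, `is_bf16_supported`) are standard.

```python
# Pre-flight check sketch. The 16 GB threshold comes from the requirements
# above; function names and the tolerance are illustrative assumptions.
MIN_VRAM_GB = 16  # model-card floor


def gpu_verdict(total_vram_bytes: int, bf16_ok: bool) -> str:
    """Classify a GPU against the 16 GB / bfloat16 requirements."""
    vram_gib = total_vram_bytes / 1024**3
    if not bf16_ok:
        return "no bfloat16 support"
    # Cards usually report slightly less than their nominal capacity,
    # so allow a small tolerance below the floor.
    if vram_gib < MIN_VRAM_GB * 0.95:
        return f"below the 16 GB floor ({vram_gib:.1f} GiB)"
    return f"ok ({vram_gib:.1f} GiB)"


def probe_local_gpu() -> str:
    """Query the first CUDA device via PyTorch (imported lazily)."""
    import torch

    if not torch.cuda.is_available():
        return "no CUDA device visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    major, minor = torch.cuda.get_device_capability(0)
    verdict = gpu_verdict(props.total_memory, torch.cuda.is_bf16_supported())
    return f"{props.name} (sm_{major}{minor}): {verdict}"


# Usage, once PyTorch is installed:
#   print(probe_local_gpu())
```

On an RTX 4090 the compute capability should come back as sm_89, matching the table above.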

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support is in diffusers main; install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors
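
Because Z-Image support depends on which diffusers build is installed, it can help to confirm that the pipeline class is visible before running the full script. A small sketch, assuming only that the class is exported as `diffusers.ZImagePipeline` (the name used in the snippet later in this recipe); the helper name is hypothetical:

```python
import importlib
import importlib.util


def has_zimage_pipeline() -> bool:
    """True if the installed diffusers build exposes ZImagePipeline."""
    if importlib.util.find_spec("diffusers") is None:
        return False  # diffusers is not installed at all
    try:
        diffusers = importlib.import_module("diffusers")
        return hasattr(diffusers, "ZImagePipeline")
    except Exception:
        # A broken or partial install is treated the same as "not available".
        return False


print("ZImagePipeline available:", has_zimage_pipeline())
```

If this prints False after installing from source, the environment is probably still picking up an older diffusers release.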

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These files are the Comfy-Org split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. To load the workflow, drag the workflow JSON from Comfy-Org/workflow_templates into ComfyUI.
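
A missing or misplaced file is a common reason a workflow fails to load, so a quick layout check can save a restart. This sketch assumes the three relative paths used in the download commands above; the function name is illustrative:

```python
from pathlib import Path

# Relative paths taken from the download commands above.
EXPECTED_FILES = [
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root: str) -> list[str]:
    """Return the expected model files that are absent under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Usage, from the ComfyUI root directory:
#   missing = missing_model_files(".")
#   print("all files in place" if not missing else f"missing: {missing}")
```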

Running

Path A — diffusers snippet

The inference snippet below follows the Tongyi-MAI HF model card. Z-Image-Turbo uses 8 NFEs: the snippet sets num_inference_steps=9 and guidance_scale=0.0 per the card, and 9 steps result in 8 DiT forwards.

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

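On a 24 GB RTX 4090 there is no need to call pipe.enable_model_cpu_offload(): the full BF16 pipeline (DiT + Qwen3-4B text encoder + VAE, about 20.6 GB of weights going by the file sizes listed under Requirements) fits resident in VRAM, with the remainder available for activations.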

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

  • Speed: approximately 2.3 seconds per 1024×1024 image at 8 NFEs, BF16, on RTX 4090, as reported in the Alibaba Tongyi Lab launch write-up via AIbase (Nov 27, 2025 release-day reporting) and independently reproduced in the ComfyUI Wiki release announcement. Both sources name the 4090 explicitly and report the same number, which is consistent with the Tongyi-MAI card's "sub-second on H800" framing scaled to a slower consumer card.
  • VRAM usage: peak around 13 GB at 1024×1024 BF16 on RTX 4090, per the AIbase launch write-up ("the VRAM pointer stably at 13GB"). This sits well below the 16 GB headline floor on the model card and leaves ~11 GB of the 4090's 24 GB free for higher resolutions, batching, or coexisting workloads. Live measurements: /check/z-image-turbo/rtx-4090.
  • Quality notes: The architecture is a "Scalable Single-Stream DiT (S3-DiT)" in which text, visual semantic, and VAE tokens are concatenated into a unified sequence; per the model card, the design is optimized for 8-NFE generation that rivals full-step competitors. Launch coverage notes Z-Image-Turbo for strong bilingual (English + Chinese) text rendering and photorealistic portrait quality.

For the full benchmark data, see /check/z-image-turbo/rtx-4090. If you run this on your own 4090, please submit your numbers — community benchmarks let us replace the launch-day write-up citation with an empirical measurement.
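
For anyone collecting their own numbers, a small timing harness keeps measurements comparable. This is a sketch under stated assumptions: the helper names are illustrative, and the commented usage assumes the `pipe` and `prompt` objects from the diffusers snippet above.

```python
import statistics
import time


def time_runs(generate, runs=5, warmup=1, sync=None):
    """Return per-run wall-clock seconds for generate().

    sync is an optional callable (e.g. torch.cuda.synchronize) invoked
    around each clock read so queued GPU work is included in the timing.
    """
    for _ in range(warmup):  # warm-up runs are executed but not recorded
        generate()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Condense a list of timings into median, min, and max seconds."""
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


# Usage with the Path A pipeline (assumes `pipe` and `prompt` already exist):
#   import torch
#   torch.cuda.reset_peak_memory_stats()
#   t = time_runs(
#       lambda: pipe(prompt=prompt, height=1024, width=1024,
#                    num_inference_steps=9, guidance_scale=0.0),
#       sync=torch.cuda.synchronize,
#   )
#   peak_gib = torch.cuda.max_memory_allocated() / 1024**3
#   print(summarize(t), f"peak {peak_gib:.1f} GiB")
```

Note that `torch.cuda.max_memory_allocated()` counts only memory handed out by PyTorch's allocator, so it typically reads lower than the figure shown by nvidia-smi. State which one you measured when submitting numbers.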

Troubleshooting

Higher resolution exhausts VRAM beyond 13 GB

The 2.3 s / 13 GB measurement is for 1024×1024. Doubling one or both axes (e.g. 2048×1024 or 2048×2048) increases both latency and peak VRAM well beyond those numbers, which is why the Comfy-Org workflow template defaults to 1024×1024. On a 24 GB 4090 you have meaningful headroom for higher resolutions, but if you push past ~1.5K on either axis and hit an OOM error, fall back to the official model card snippet with pipe.enable_model_cpu_offload() in place of pipe.to("cuda"). System RAM then acts as a backstop, at a latency cost.
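
The fallback amounts to choosing one of two placement calls on the pipeline. A minimal sketch, assuming a diffusers pipeline object like the `pipe` built in the Running section; the wrapper function and its flag are hypothetical names, while `enable_model_cpu_offload()` and `to("cuda")` are the calls named in this recipe:

```python
def place_pipeline(pipe, use_cpu_offload: bool = False):
    """Either keep the whole pipeline on the GPU or enable CPU offload.

    The two calls are alternatives: with offload enabled, components are
    moved to the GPU only while they are needed, so pipe.to("cuda") is
    skipped in that branch.
    """
    if use_cpu_offload:
        pipe.enable_model_cpu_offload()  # lower peak VRAM, slower per image
    else:
        pipe.to("cuda")  # everything resident, fastest path
    return pipe


# Usage after ZImagePipeline.from_pretrained(...), instead of pipe.to("cuda"):
#   pipe = place_pipeline(pipe, use_cpu_offload=True)
```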

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI" → restart. This update path is documented in the official ComfyUI Z-Image tutorial.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.

Want the sibling 16 GB recipe?

If you're on a 16 GB card such as the RTX 4060 Ti 16GB or RTX 5060 Ti 16GB, the install steps and code path are identical; see the published recipe at /recipes/z-image-turbo-on-rtx-4060-ti-16gb-8-step-text-to-image-at-bf16-with-diffusers-or-comfyui. The 24 GB 4090 buys you headroom (batching, higher resolution) and a meaningful speed lift, but does not change the install steps.