Z-Image Turbo on RTX 4080 SUPER: 8-Step 1024x1024 Text-to-Image at BF16 with Diffusers or ComfyUI

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 4080 SUPER. The 16 GB Ada Lovelace card sits exactly in the model's headline VRAM tier, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no text-encoder workarounds.

Hardware data: RTX 4080 SUPER (16GB VRAM) · 8 NFEs at 1024×1024 BF16 · See benchmark data

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder (visible in the Comfy-Org workflow's text_encoders/qwen_3_4b.safetensors at ~8 GB on disk), which alone needs roughly 8 GB at BF16 — too tight for an 8 GB card to hold alongside the DiT + VAE. On a 16 GB card the upstream-recommended BF16 build "fits comfortably within 16G VRAM consumer devices" per the Tongyi-MAI model card, so this recipe stays on the canonical BF16 path rather than reaching for GGUF redistributors.
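The "roughly 8 GB at BF16" figure follows from simple arithmetic: BF16 stores each parameter in 2 bytes. The sketch below reproduces that estimate for the two nominal parameter counts quoted above. The helper name weight_gb and the dtype table are illustrative, and the result counts weights only: it ignores activations and the VAE, and says nothing about which components are resident on the GPU at the same time.

```python
# Weights-only footprint: parameter count x bytes per parameter.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4, "fp8": 1}


def weight_gb(params: float, dtype: str = "bf16") -> float:
    """Return the weights-only size in decimal GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal (rounded) parameter counts taken from the text above.
components = {"DiT (6B)": 6e9, "Qwen3-4B text encoder": 4e9}

for name, params in components.items():
    print(f"{name}: ~{weight_gb(params):.1f} GB at BF16")
```

The actual files can differ slightly from these round numbers because the published parameter counts are rounded.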

Note on variants: The Tongyi-MAI Z-Image family currently lists four variants: Z-Image-Turbo, Z-Image (the foundation model), Z-Image-Omni-Base, and Z-Image-Edit (the latter two marked "To be released" in the card's Model Zoo). This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes such as Juggernaut-Z (RunDiffusion) are separate models with their own recipes.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM consumer card	RTX 4080 SUPER (16GB GDDR6X, Ada Lovelace AD103, sm_89)
RAM	16GB system RAM	—
Storage	~21GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.3 GB, per the Comfy-Org split-file mirror)	—
Software	Python 3.10+, PyTorch with CUDA + bf16 support	ComfyUI nightly / `diffusers` @ main

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card, and the RTX 4080 SUPER (16 GB GDDR6X, 256-bit bus) matches that target tier exactly. Ada Lovelace cards need no special CUDA wheel selection: the default pip install torch already ships sm_89 kernels, so unlike Blackwell sm_120 cards the 4080 SUPER needs no cu128-specific index URL. (On Windows, a plain pip install torch from PyPI typically installs a CPU-only build, so select a CUDA wheel from the PyTorch index there.) The BF16 weights load natively. Ada also has hardware FP8 (E4M3/E5M2) support, but the BF16 build is the documented 16 GB path and is what this recipe installs.
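Before installing anything model-specific, it can help to confirm that PyTorch sees the card the way the paragraph above describes. The following is a minimal sketch: summarize_gpu and probe_cuda are hypothetical helper names, while the torch.cuda calls are standard PyTorch. On the card described here the reported architecture should be sm_89.

```python
def summarize_gpu(name, capability, total_bytes, bf16_ok):
    """Format raw device facts into a small report dict."""
    major, minor = capability
    return {
        "name": name,
        "arch": f"sm_{major}{minor}",
        "vram_gib": round(total_bytes / 2**30, 1),
        "bf16": bool(bf16_ok),
    }


def probe_cuda(device=0):
    """Query PyTorch for the device facts; returns None without a CUDA build or GPU."""
    import torch  # imported lazily so summarize_gpu works without PyTorch

    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device)
    return summarize_gpu(
        props.name,
        torch.cuda.get_device_capability(device),
        props.total_memory,
        torch.cuda.is_bf16_supported(),
    )
```

Calling print(probe_cuda()) prints the report. A result of None usually means a CPU-only PyTorch build or a driver problem rather than a model issue.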

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support landed in diffusers via two merged PRs (#12703 "Add Support for Z-Image Series" and #12715); install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors
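Because Z-Image support comes from a source install, a quick check that the installed diffusers actually exposes ZImagePipeline can save a confusing ImportError later. This is a small sketch using only the standard library. The function name zimage_support is illustrative, and the check only tests that the class name is present, not that the pipeline runs.

```python
from importlib import import_module, metadata


def zimage_support():
    """Return (installed diffusers version or None, whether ZImagePipeline is exposed)."""
    try:
        version = metadata.version("diffusers")
    except metadata.PackageNotFoundError:
        return None, False
    try:
        module = import_module("diffusers")
        return version, hasattr(module, "ZImagePipeline")
    except Exception:
        # Import problems (for example a broken dependency) count as "not available".
        return version, False


version, ok = zimage_support()
print(f"diffusers {version}: ZImagePipeline {'found' if ok else 'missing'}")
```

If the class is reported missing, the installed diffusers most likely predates the Z-Image PRs, and reinstalling from the GitHub source as shown above is the first thing to try.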

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These files come from the Comfy-Org split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. To load the workflow, drag the workflow JSON from Comfy-Org/workflow_templates into the ComfyUI window.
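A misplaced file is a common reason for a loader node showing an empty dropdown, so it may be worth checking the three paths before starting ComfyUI. The sketch below assumes the standard models/ layout used by the commands above. The function name missing_model_files is hypothetical, and it checks only that the files exist, not that the downloads are complete.

```python
from pathlib import Path

# Sub-directory of models/ -> file name, as used by the wget commands above.
EXPECTED = {
    "diffusion_models": "z_image_turbo_bf16.safetensors",
    "text_encoders": "qwen_3_4b.safetensors",
    "vae": "ae.safetensors",
}


def missing_model_files(comfy_root):
    """Return the expected files (relative to models/) that are not present."""
    models = Path(comfy_root) / "models"
    return [
        str(Path(sub) / name)
        for sub, name in EXPECTED.items()
        if not (models / sub / name).is_file()
    ]
```

Run from the ComfyUI root, print(missing_model_files(".")) prints an empty list when all three files are in place.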

Running

Path A — diffusers snippet

The inference snippet below is from the Tongyi-MAI HF model card. Z-Image-Turbo uses 8 NFEs, yet the card's snippet sets num_inference_steps=9: its inline comment explains that "This actually results in 8 DiT forwards". The snippet also sets guidance_scale=0.0, with the comment "Guidance should be 0 for the Turbo models":

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

Speed: No community benchmark on the RTX 4080 SUPER has been published yet, and the backend has no ingested run for this pair (/check/z-image-turbo/rtx-4080-super returns verdict: unknown). The official Tongyi-MAI card cites "sub-second inference latency" on enterprise-grade H800 GPUs, which is not comparable to a consumer Ada card. There is no first-party RTX 4080 or 4080 SUPER timing to cite either, so we deliberately omit a speed figure rather than extrapolate one. If you run it, please submit your numbers so a measured figure can land here.
VRAM usage: The model "fits comfortably within 16G VRAM consumer devices" per the official model card — i.e. the BF16 build is the headline configuration for a 16GB card like the RTX 4080 SUPER. Live measurements: /check/z-image-turbo/rtx-4080-super.
Quality notes: The architecture is a "Scalable Single-Stream DiT" (S3-DiT) in which text, visual semantic, and VAE tokens are concatenated into a unified input stream. Per the model card, the design is optimized for 8-NFE generation while matching or exceeding leading competitors.

For the full benchmark data, see /check/z-image-turbo/rtx-4080-super.
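Since no measured figure exists yet, here is one way a reader could collect comparable numbers on their own card. It is a sketch, not an official benchmark protocol: time_runs and measure_pipeline are hypothetical helpers, the warmup and run counts are arbitrary choices, and the peak-memory value covers only memory allocated through PyTorch, so it will read lower than what nvidia-smi shows.

```python
import statistics
import time


def time_runs(fn, warmup=1, runs=3, sync=None):
    """Call fn() warmup + runs times; return wall-clock seconds for the timed runs only."""
    def call():
        fn()
        if sync is not None:
            sync()  # wait for queued GPU work so the timing is not cut short

    for _ in range(warmup):
        call()  # the first call often includes one-time setup cost
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return timings


def measure_pipeline(pipe, **call_kwargs):
    """Time a loaded diffusers pipeline and report peak PyTorch-allocated memory."""
    import torch  # imported lazily so time_runs works without PyTorch

    torch.cuda.reset_peak_memory_stats()
    timings = time_runs(lambda: pipe(**call_kwargs), sync=torch.cuda.synchronize)
    return {
        "median_s": statistics.median(timings),
        "peak_alloc_gib": torch.cuda.max_memory_allocated() / 2**30,
    }
```

With the pipeline from Path A loaded, measure_pipeline(pipe, prompt=prompt, height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0) returns a median time per image and the peak allocation, which are the two figures worth reporting alongside the driver and PyTorch versions.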

Troubleshooting

Out of memory at first generation (diffusers path)

If the BF16 pipeline doesn't fit alongside other GPU-resident apps (browser GPU acceleration, second monitor, idle Docker compute), enable CPU offload — the Tongyi-MAI model card documents pipe.enable_model_cpu_offload() for memory-constrained devices, which moves idle parts to system RAM:

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.enable_model_cpu_offload()  # keeps idle components in system RAM
# do NOT call pipe.to("cuda") when using offload

On the RTX 4080 SUPER's PCIe Gen4 x16 link, offload is rarely needed at 1024×1024 — the 16 GB envelope is the model's headline tier — but it is the documented escape hatch if you co-host other GPU workloads.
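When other workloads share the GPU, the choice between pipe.to("cuda") and CPU offload can be made at startup from the free VRAM that PyTorch reports. The sketch below is an assumption-laden illustration: should_offload and free_vram_bytes are hypothetical helpers, and needed_bytes has no default because the source gives no measured peak for this card, so the caller has to supply a figure from their own runs.

```python
def should_offload(free_bytes, needed_bytes, headroom_bytes=0):
    """Return True when free VRAM is below the caller-supplied requirement plus headroom."""
    return free_bytes < needed_bytes + headroom_bytes


def free_vram_bytes(device=0):
    """Free VRAM on the device in bytes, as reported by the CUDA driver via PyTorch."""
    import torch  # imported lazily so should_offload works without PyTorch

    free, _total = torch.cuda.mem_get_info(device)
    return free
```

A typical use is to call pipe.enable_model_cpu_offload() when should_offload(free_vram_bytes(), needed_bytes) is true and pipe.to("cuda") otherwise, never both on the same pipeline.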

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI", then restart. The official ComfyUI Z-Image tutorial documents this update path.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.