
Z-Image Turbo on RTX 4080: 8-Step 1024x1024 Text-to-Image at BF16 with Diffusers or ComfyUI

image · intermediate · 16GB+ VRAM · May 29, 2026
Prerequisites
  • NVIDIA RTX 4080 (16GB VRAM) or any consumer GPU with 16GB VRAM
  • Python 3.10+
  • PyTorch with CUDA support and bfloat16 capability
  • ComfyUI (latest nightly) OR HuggingFace diffusers from main

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 4080. The 16 GB Ada Lovelace card sits exactly in the model's headline VRAM tier, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no text-encoder workarounds.

Hardware data: RTX 4080 (16GB VRAM) · 8 NFEs at 1024×1024 BF16

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder (visible in the Comfy-Org workflow's text_encoders/qwen_3_4b.safetensors at ~8 GB on disk), which alone needs roughly 8 GB at BF16 — too tight for an 8 GB card to hold alongside the DiT + VAE. On a 16 GB card the upstream-recommended BF16 build "fits comfortably within 16G VRAM consumer devices" per the Tongyi-MAI model card, so this recipe stays on the canonical BF16 path rather than reaching for GGUF redistributors.
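The "roughly 8 GB at BF16" figure above follows from simple arithmetic: BF16 stores each parameter in 2 bytes. The sketch below works that out for both models. The parameter counts are the nominal sizes (4B and 6B) rather than exact tensor counts, so treat the results as estimates; the function name is illustrative.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def bf16_weight_gb(num_params: float) -> float:
    """Approximate weight size in decimal GB for a model stored in BF16."""
    return num_params * BYTES_PER_PARAM_BF16 / 1e9


# Nominal parameter counts (assumption: rounded sizes, not exact counts)
text_encoder_gb = bf16_weight_gb(4e9)  # Qwen3-4B text encoder
dit_gb = bf16_weight_gb(6e9)           # Z-Image-Turbo DiT

print(f"text encoder ~{text_encoder_gb:.1f} GB, DiT ~{dit_gb:.1f} GB")
```

These estimates line up with the on-disk file sizes quoted in the Requirements table (8.0 GB and 12.3 GB). Note that the two figures together exceed 16 GB, so this arithmetic describes per-component weight size only; it says nothing about what a given runtime keeps resident on the GPU at the same moment.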

Note on variants: The Tongyi-MAI Z-Image family currently ships four variants: Z-Image-Turbo, Z-Image (the foundation model), Z-Image-Omni-Base, and Z-Image-Edit. This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes such as Juggernaut-Z (RunDiffusion) are separate models with their own recipes.

Requirements

Component | Minimum | Tested
GPU | 16GB VRAM consumer card | RTX 4080 (16GB, Ada Lovelace, sm_89)
RAM | 16GB system RAM | -
Storage | ~21GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.3 GB, per the Comfy-Org split-file mirror) | -
Software | Python 3.10+, PyTorch with CUDA + bf16 support | ComfyUI nightly / diffusers @ main

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card — the RTX 4080 matches that target tier exactly. No special CUDA wheel selection is required for Ada Lovelace cards: the default pip install torch already ships sm_89 kernels (unlike Blackwell sm_120 cards, the 4080 needs no cu128-specific index URL).
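Before installing anything else, it can help to confirm that the GPU meets the requirements listed above. The following is a minimal sketch, not part of the official instructions. The 15 GiB threshold is an assumption, chosen because a "16 GB" card typically reports slightly less than 16 GiB of total memory. The function names are illustrative.

```python
MIN_VRAM_GIB = 15.0  # assumption: a "16 GB" card reports slightly under 16 GiB


def preflight(total_vram_bytes: int, bf16_supported: bool) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    vram_gib = total_vram_bytes / 2**30
    if vram_gib < MIN_VRAM_GIB:
        problems.append(f"only {vram_gib:.1f} GiB VRAM; this recipe assumes a 16 GB card")
    if not bf16_supported:
        problems.append("PyTorch does not report bfloat16 support on this GPU")
    return problems


def preflight_current_gpu() -> list[str]:
    """Run the check against CUDA device 0 (requires PyTorch with CUDA)."""
    import torch  # imported lazily so preflight() itself works without torch

    if not torch.cuda.is_available():
        return ["CUDA is not available to PyTorch"]
    props = torch.cuda.get_device_properties(0)
    return preflight(props.total_memory, torch.cuda.is_bf16_supported())
```

Calling `preflight_current_gpu()` in the environment you plan to use should return an empty list on an RTX 4080 with a working CUDA build of PyTorch.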

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support landed in diffusers via two merged PRs (#12703 and #12715); install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These are the Comfy-Org-packaged split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. Load the workflow JSON from Comfy-Org/workflow_templates by dragging it into ComfyUI.
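For systems without wget (for example a stock Windows install), the three downloads above can be sketched in Python using only the standard library. The URLs and target folders are the ones from the commands above; the helper names and the skip-if-present behaviour are assumptions of this sketch, and it does not verify file size or checksums.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files"

# (model subdirectory, filename) pairs taken from the wget commands above
FILES = [
    ("diffusion_models", "z_image_turbo_bf16.safetensors"),
    ("text_encoders", "qwen_3_4b.safetensors"),
    ("vae", "ae.safetensors"),
]


def plan_downloads(comfy_root):
    """Return (url, destination path) pairs without touching the network."""
    root = Path(comfy_root)
    return [
        (f"{BASE_URL}/{sub}/{name}", root / "models" / sub / name)
        for sub, name in FILES
    ]


def download_all(comfy_root):
    """Download each file into its ComfyUI model folder, skipping existing files."""
    for url, dest in plan_downloads(comfy_root):
        if dest.exists():
            continue  # note: a partially downloaded file would also be skipped
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
```

Run `download_all("path/to/ComfyUI")` from any directory; expect roughly 21 GB of traffic in total.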

Running

Path A — diffusers snippet

The inference snippet below is from the Tongyi-MAI HF model card. Z-Image-Turbo runs 8 NFEs (function evaluations, i.e. DiT forward passes). The card's snippet sets num_inference_steps=9, which it notes results in 8 DiT forwards, and guidance_scale=0.0, because guidance should be 0 for the Turbo models:

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

  • Speed: No community benchmark on the RTX 4080 has been published yet, and the backend has no ingested run for this pair (/check/z-image-turbo/rtx-4080 returns verdict: unknown). The official Tongyi-MAI card cites "sub-second inference latency on enterprise-grade H800 GPUs", which is not comparable to a consumer Ada card. We deliberately do not borrow a figure from a sibling card: the RTX 4080's 716.8 GB/s memory bandwidth sits well below both the RTX 4090 and the RTX 3090 Ti (about 1008 GB/s each), so neither is close enough to relabel honestly. If you run it, please submit your numbers so a measured figure can land here.
  • VRAM usage: The model "fits comfortably within 16G VRAM consumer devices" per the official model card — i.e. the BF16 build is the headline configuration for a 16GB card like the RTX 4080. Live measurements: /check/z-image-turbo/rtx-4080.
  • Quality notes: The architecture is a "Scalable Single-Stream DiT (S3-DiT)" in which text, visual semantic, and VAE tokens are concatenated into a unified input stream. Per the model card, the design is optimized for 8-NFE generation while matching or exceeding leading competitors.

For the full benchmark data, see /check/z-image-turbo/rtx-4080.
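If you want to produce a measured figure of your own, a small timing harness along these lines is one way to do it. This is a sketch rather than the site's official benchmark procedure. The function names, the warmup count, and the choice of median as the summary statistic are all assumptions.

```python
import statistics
import time


def time_runs(fn, warmup=1, runs=5, sync=None):
    """Call fn() repeatedly and return the wall-clock seconds of each timed run.

    `sync` is an optional callable that blocks until queued GPU work is done,
    so that asynchronous kernels are not left out of the measurement.
    """
    for _ in range(warmup):
        fn()  # warmup runs absorb one-time costs and are not timed
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to a few headline numbers."""
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "runs": len(timings),
    }
```

With the diffusers pipeline loaded, `fn` would be a zero-argument wrapper around the `pipe(...)` call from the Running section and `sync` would be `torch.cuda.synchronize`. Peak VRAM can be read afterwards with `torch.cuda.max_memory_allocated()`.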

Troubleshooting

Out of memory at first generation (diffusers path)

If the BF16 pipeline doesn't fit alongside other GPU-resident apps (browser GPU acceleration, second monitor, idle Docker compute), enable CPU offload — the Tongyi-MAI model card documents pipe.enable_model_cpu_offload() for memory-constrained devices, which moves idle parts to system RAM:

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.enable_model_cpu_offload()
# do NOT call pipe.to("cuda") when using offload

On the RTX 4080's PCIe Gen4 x16 link, offload is rarely needed at 1024×1024 — the 16 GB envelope is the model's headline tier — but it is the documented escape hatch if you co-host other GPU workloads.
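The choice between `pipe.to("cuda")` and `pipe.enable_model_cpu_offload()` can also be made at runtime from the amount of free VRAM. The sketch below illustrates that idea. The 14 GiB threshold is a placeholder assumption, not a measured requirement, and should be tuned against your own runs. The helper names are hypothetical.

```python
def should_offload(free_vram_bytes: int, needed_gib: float = 14.0) -> bool:
    """True when free VRAM is below the (assumed) amount the pipeline needs."""
    return free_vram_bytes < needed_gib * 2**30


def place_pipeline(pipe, needed_gib: float = 14.0):
    """Move the pipeline to the GPU, or enable CPU offload if VRAM is short."""
    import torch  # imported lazily so should_offload() works without torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if should_offload(free_bytes, needed_gib):
        # Offload manages device placement itself, so do not call pipe.to("cuda")
        pipe.enable_model_cpu_offload()
    else:
        pipe.to("cuda")
    return pipe
```

Call `place_pipeline(pipe)` right after `from_pretrained(...)` in place of the explicit `pipe.to("cuda")` line.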

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI" → restart. This update path is documented in the official ComfyUI Z-Image tutorial.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.