Z-Image Turbo on RTX 5070 Ti: 8-Step 1024x1024 Text-to-Image at BF16 with Diffusers or ComfyUI

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 5070 Ti (16GB). The 5070 Ti sits exactly in the model's headline VRAM tier, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no text-encoder workarounds. The one thing the 5070 Ti needs that older cards don't is a Blackwell-aware PyTorch build (cu128).

Hardware data: RTX 5070 Ti (16GB VRAM) · 8 NFEs at 1024×1024 BF16 · Benchmark data: /check/z-image-turbo/rtx-5070-ti

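ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder (qwen_3_4b.safetensors, 8.04 GB at BF16 in the Comfy-Org split-file mirror). The encoder runs once per prompt to produce embeddings, and only then does the 12.31 GB BF16 DiT denoise. The two do not have to be resident at the same time, which is how peak VRAM can stay inside the 16 GB envelope even though the files together exceed it. ComfyUI loads and unloads models as the graph runs; on the diffusers path, pipe.enable_model_cpu_offload() (see Troubleshooting) provides the same staging. On a 16 GB card the upstream-recommended BF16 build "fits comfortably" per the Tongyi-MAI model card, so this recipe stays on the canonical BF16 path rather than reaching for community quants.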

Note on variants: The Tongyi-MAI Z-Image family ships four variants — Z-Image-Turbo, Z-Image (the foundation model), Z-Image-Omni-Base, and Z-Image-Edit. This recipe targets Z-Image-Turbo, the 8-NFE distilled variant. Fine-tunes like Juggernaut-Z (RunDiffusion) are a separate model with its own recipe.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM consumer card	RTX 5070 Ti (Blackwell, sm_120, 16GB GDDR7)
RAM	16GB system RAM	—
Storage	~21GB on disk (BF16 DiT 12.31 GB + Qwen3-4B text encoder 8.04 GB + VAE 0.34 GB, per the Comfy-Org split-file mirror)	—
Software	Python 3.10+, PyTorch cu128 (CUDA 12.8) with BF16 support	ComfyUI nightly / diffusers @ main

Z-Image-Turbo is designed to fit comfortably within 16 GB VRAM consumer devices per the official Tongyi-MAI model card — the RTX 5070 Ti's 16GB matches that target tier exactly.
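The storage and VRAM figures above can be checked with a few lines of arithmetic. The sketch below only sums the three Comfy-Org file sizes quoted in the table and compares them with a 16 GB budget. The names are illustrative, the grouping into an "encode" and a "denoise" stage is an assumption made for the example, and file size is only a rough proxy for runtime memory (activations and the CUDA context add overhead that is not modelled here).

```python
# File sizes in GB as quoted in the Requirements table (Comfy-Org split-file mirror).
WEIGHTS_GB = {
    "dit_bf16": 12.31,
    "qwen3_4b_text_encoder": 8.04,
    "vae": 0.34,
}
VRAM_BUDGET_GB = 16.0  # RTX 5070 Ti


def total_gb(names):
    """Sum the on-disk size of the named components, rounded to 2 decimals."""
    return round(sum(WEIGHTS_GB[n] for n in names), 2)


# Assumed staging for illustration: the encoder alone, then DiT + VAE.
on_disk = total_gb(WEIGHTS_GB)
encode_stage = total_gb(["qwen3_4b_text_encoder"])
denoise_stage = total_gb(["dit_bf16", "vae"])

print(f"all components: {on_disk} GB")
print(f"encode stage:   {encode_stage} GB")
print(f"denoise stage:  {denoise_stage} GB")
print("all-at-once fits budget:", on_disk <= VRAM_BUDGET_GB)
print("staged fits budget:", max(encode_stage, denoise_stage) <= VRAM_BUDGET_GB)
```

The point of the comparison is that the weights fit the budget stage by stage but not all at once, so whether a run stays under 16 GB depends on how the components are placed on the GPU.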

⚠️ Blackwell (sm_120) needs the cu128 PyTorch wheel. The RTX 5070 Ti is an sm_120 Blackwell card (the GB203 die, same family as the RTX 5080). Unlike Ada Lovelace cards (RTX 4060/4090), where the default pip install torch already includes the right kernels, the 5070 Ti needs a PyTorch build compiled against CUDA 12.8. Install it explicitly with the cu128 index URL (see Installation below). A PyTorch build without sm_120 kernels will fail at the first CUDA call.

Installation

Path A — HuggingFace diffusers (Python script)

First install a Blackwell-capable PyTorch (cu128), then install Z-Image support, which lives in diffusers main per the official model card and the Tongyi-MAI/Z-Image GitHub README:

# Blackwell sm_120 requires the cu128 PyTorch build
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Z-Image support is merged into diffusers main; install from source
pip install git+https://github.com/huggingface/diffusers
pip install transformers accelerate safetensors
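After installing, it can be worth confirming that the wheel you ended up with actually carries Blackwell kernels before downloading any weights. The following is a minimal sketch, not part of the model card: has_blackwell_kernels and describe_torch_build are hypothetical helper names, and the check assumes the compiled architectures reported by torch.cuda.get_arch_list() include the string "sm_120" on a working cu128 build.

```python
import importlib.util


def has_blackwell_kernels(arch_list):
    """Return True if a PyTorch arch list includes sm_120 kernels."""
    return "sm_120" in arch_list


def describe_torch_build():
    """Summarise the installed PyTorch build, or return None if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # CUDA version the wheel was built against
        "arch_list": torch.cuda.get_arch_list(),  # GPU architectures compiled into the wheel
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    info = describe_torch_build()
    if info is None:
        print("PyTorch is not installed in this environment")
    else:
        print(info)
        print("sm_120 kernels present:", has_blackwell_kernels(info["arch_list"]))
```

If the last line prints False on a 5070 Ti, reinstalling from the cu128 index as shown above is the first thing to try.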

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These are the Comfy-Org-packaged split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. Load the workflow JSON from Comfy-Org/workflow_templates by dragging it into ComfyUI.
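A quick way to confirm the three downloads landed where the loader expects them is a small path check. This is an illustrative sketch: missing_model_files is a hypothetical helper, the relative paths are copied from the commands above, and an empty file is treated as missing on the assumption that it indicates an interrupted download.

```python
from pathlib import Path

# Relative locations taken from the download commands above.
EXPECTED_FILES = [
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root):
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # "." assumes the script is run from the ComfyUI root directory.
    for rel in missing_model_files("."):
        print("missing or empty:", rel)
```

No output means all three files are present and non-empty; it does not verify checksums.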

Running

Path A — diffusers snippet

The inference snippet below is from the Tongyi-MAI HF model card. Z-Image-Turbo uses 8 NFEs: the card's snippet sets num_inference_steps=9, which results in 8 DiT forwards, and guidance_scale=0.0, because guidance should be 0 for the Turbo model:

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # results in 8 DiT forwards
    guidance_scale=0.0,     # guidance should be 0 for the Turbo model
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

The diffusers pipeline uses SDPA attention by default per the model card, so no FlashAttention override is needed on Blackwell — the card exposes Flash-Attention only as an optional, commented-out backend.

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

Speed: No first-party RTX 5070 Ti benchmark has been published for Z-Image-Turbo yet, and our benchmark data has no measurement for this model and GPU pair. The official Tongyi-MAI card cites sub-second inference latency on enterprise-grade H800 GPUs, a datacenter card that is not comparable to a consumer 5070 Ti, so we do not quote it as a 5070 Ti figure. Figures from other consumer cards do not transfer cleanly either: throughput depends on the RTX 5070 Ti's own ~896 GB/s memory bandwidth and 8960 CUDA cores, which differ from both higher and lower tiers. Once a community-submitted run lands it will appear on /check/z-image-turbo/rtx-5070-ti. If you run it, please submit your numbers.
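VRAM usage: Designed to fit comfortably within 16 GB VRAM consumer devices per the official model card; the BF16 build is the headline configuration for a 16GB card like the RTX 5070 Ti. On-disk footprint is ~21 GB (BF16 DiT 12.31 GB + Qwen3-4B encoder 8.04 GB + VAE 0.34 GB per the Comfy-Org mirror). Because that total exceeds 16 GB, runtime residency stays under 16 GB only when the encoder and DiT occupy the GPU one after the other rather than together: ComfyUI loads and unloads models as the graph runs, and on the diffusers path pipe.enable_model_cpu_offload() does the staging (see Troubleshooting). Live measurements: /check/z-image-turbo/rtx-5070-ti.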
Quality notes: Architecture is a "Scalable Single-Stream DiT" (S3-DiT) with text, visual semantic, and image VAE tokens concatenated into a unified input stream — designed for 8-NFE generation rivaling full-step competitors per the model card.

For the full benchmark data, see /check/z-image-turbo/rtx-5070-ti.
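If you want to contribute a measurement, a consistent timing method makes the numbers easier to compare. The helper below is a minimal sketch rather than an official benchmark protocol: time_runs and summarise are hypothetical names, the warm-up and run counts are arbitrary defaults, and the optional sync callable is there because GPU work is queued asynchronously and should be flushed before reading the clock.

```python
import statistics
import time


def time_runs(generate, runs=5, warmup=1, sync=None):
    """Time generate() over several runs and return the per-run seconds.

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked before each clock read so that queued
    GPU work is included in the measurement.
    """
    def now():
        if sync is not None:
            sync()
        return time.perf_counter()

    for _ in range(warmup):  # warm-up runs are executed but not recorded
        generate()

    durations = []
    for _ in range(runs):
        start = now()
        generate()
        durations.append(now() - start)
    return durations


def summarise(durations):
    """Return median, min, and max seconds for a list of run durations."""
    return {
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

With the diffusers snippet from the Running section, generate could be a lambda that calls pipe(...) with the same 1024×1024, num_inference_steps=9, guidance_scale=0.0 arguments, and sync could be torch.cuda.synchronize. Reporting the median alongside min and max keeps a single slow first run from skewing the result.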

Troubleshooting

First CUDA call crashes / "no kernel image is available for execution on the device"

This is the Blackwell sm_120 wheel mismatch — a PyTorch build without sm_120 kernels can't run on the RTX 5070 Ti. Reinstall PyTorch from the CUDA 12.8 index: pip install torch --index-url https://download.pytorch.org/whl/cu128. Verify with python -c "import torch; print(torch.cuda.get_device_capability())" — it should report (12, 0) for the 5070 Ti.

Out of memory at first generation (diffusers path)

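pipe.to("cuda") moves the text encoder, DiT, and VAE onto the GPU together. By the file sizes listed under Requirements that is about 20.7 GB of BF16 weights, so a 16 GB card can run out of memory at this step, and more readily when other GPU-resident apps are running (browser GPU acceleration, a second monitor, idle Docker compute). Enable CPU offload instead: the Tongyi-MAI model card documents pipe.enable_model_cpu_offload() for memory-constrained devices. Apply it after from_pretrained and do not also call pipe.to("cuda"):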

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.enable_model_cpu_offload()
# do NOT call pipe.to("cuda") when using offload
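The choice between the two placements can also be made at runtime from the free VRAM the driver reports. The sketch below is one possible heuristic, not something the model card prescribes: should_offload and place_pipeline are hypothetical helpers, the 20.69 GB default is the sum of the Comfy-Org file sizes quoted earlier (assumed to approximate the diffusers checkpoint), and the 1.5 GB headroom is an arbitrary illustrative margin.

```python
def should_offload(free_vram_bytes, weights_gb, headroom_gb=1.5):
    """Return True if the weights plus headroom exceed the free VRAM.

    weights_gb is in decimal gigabytes, as file listings report it.
    headroom_gb is an illustrative margin for activations and the CUDA context.
    """
    needed_bytes = (weights_gb + headroom_gb) * 1e9
    return needed_bytes > free_vram_bytes


def place_pipeline(pipe, weights_gb=20.69):
    """Move a diffusers pipeline to the GPU, or enable offload if it will not fit."""
    import torch  # imported lazily so should_offload() works without torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if should_offload(free_bytes, weights_gb):
        pipe.enable_model_cpu_offload()  # components are moved to the GPU on demand
    else:
        pipe.to("cuda")                  # everything resident at once
    return pipe
```

On a 16 GB card this heuristic will pick offload for the BF16 checkpoint; the branch mainly matters if the same script is reused on a larger GPU.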

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI", then restart. This update path is documented in the official ComfyUI Z-Image tutorial.

Want a smaller on-disk footprint?

The Comfy-Org mirror also publishes an NVFP4 DiT (z_image_turbo_nvfp4.safetensors, 4.51 GB) and FP4/FP8 text-encoder variants, quant formats that the RTX 5070 Ti's Blackwell tensor cores support natively. These cut the on-disk footprint substantially, but neither the official ComfyUI tutorial nor the ComfyUI examples page yet documents a loader graph for the NVFP4 single-file checkpoint, and no RTX 5070 Ti measurement of the NVFP4 path has been published. This recipe stays on the officially documented BF16 path, which fits the 16 GB envelope without quantization. If you get the NVFP4 path working on a 5070 Ti, please submit your setup.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.