Z-Image Turbo on RTX 4070 Ti SUPER: 8-Step 1024x1024 Text-to-Image at BF16 with Diffusers or ComfyUI

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 4070 Ti SUPER. The 16 GB Ada Lovelace card sits exactly in the model's headline VRAM tier, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no text-encoder workarounds.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM) · 8 NFEs at 1024×1024 BF16 · See benchmark data

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs its DiT with a Qwen3-4B text encoder (text_encoders/qwen_3_4b.safetensors in the Comfy-Org workflow, ~8 GB on disk). At BF16 the encoder alone needs roughly 8 GB, which is too tight for an 8 GB card to hold alongside the DiT and VAE. The Tongyi-MAI model card describes Z-Image-Turbo as fitting comfortably within "16G VRAM consumer devices", so on a 16 GB card this recipe stays on the upstream-recommended BF16 build rather than reaching for GGUF redistributors.
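The memory reasoning above is plain arithmetic: a BF16 weight occupies 2 bytes, so parameter count times two gives the weight footprint. A minimal sketch, assuming decimal gigabytes (10^9 bytes) and treating the "4B" in Qwen3-4B as a round 4 billion parameters; the helper name is illustrative, and activations and framework overhead are not counted:

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits


def bf16_weight_gb(num_params: float) -> float:
    """Weight-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM_BF16 / 1e9


# Assumption: "4B" is taken as a round 4e9 parameters for illustration.
text_encoder_gb = bf16_weight_gb(4e9)
print(f"Qwen3-4B text encoder at BF16: ~{text_encoder_gb:.1f} GB")
```

The same helper can be applied to any component whose parameter count you know; it estimates weights only, so real usage during generation will be higher.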

Note on variants: The model card documents four variants in the Tongyi-MAI Z-Image family: Z-Image-Turbo, Z-Image (the foundation model), Z-Image-Omni-Base, and Z-Image-Edit. Turbo and the Z-Image foundation model currently have public weights; Omni-Base and Edit are marked "To be released". This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes like Juggernaut-Z (RunDiffusion) are separate models with their own recipes.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM consumer card	RTX 4070 Ti SUPER (16GB, Ada Lovelace, sm_89)
RAM	16GB system RAM	—
Storage	~21GB on disk (DiT 12.31 GB + Qwen3-4B text encoder 8.04 GB + VAE 0.34 GB, per the Comfy-Org split-file mirror)	—
Software	Python 3.10+, PyTorch with CUDA + bf16 support	ComfyUI nightly / `diffusers` @ main

The RTX 4070 Ti SUPER's 16 GB matches the "16G VRAM consumer devices" tier named on the official Tongyi-MAI model card. No special CUDA wheel selection is required for Ada Lovelace cards: the default pip install torch already ships sm_89 kernels, so unlike Blackwell sm_120 cards, the 4070 Ti SUPER needs no cu128-specific index URL.
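Before downloading roughly 21 GB of weights, it can help to confirm that PyTorch sees the card, reports bf16 support, and finds about 16 GB of VRAM. The sketch below is one way to do that; the helper name, the 16 GB constant, and the 10% slack are illustrative assumptions, not values from the model card:

```python
MIN_VRAM_GB = 16  # the tier this recipe targets (assumption for this check)


def check_gpu(bf16_supported: bool, total_vram_bytes: int) -> list[str]:
    """Return human-readable problems; an empty list means the GPU looks fine."""
    problems = []
    if not bf16_supported:
        problems.append("PyTorch does not report bf16 support on this GPU")
    # Cards sold as 16 GB report slightly under 16 GiB, so allow 10% slack.
    if total_vram_bytes < 0.9 * MIN_VRAM_GB * 1024**3:
        gib = total_vram_bytes / 1024**3
        problems.append(f"only {gib:.1f} GiB of VRAM; this recipe targets {MIN_VRAM_GB} GB")
    return problems


try:
    import torch  # optional here so the helper stays usable without PyTorch
except ImportError:
    torch = None

if torch is None:
    print("PyTorch is not installed yet")
elif not torch.cuda.is_available():
    print("CUDA is not available to PyTorch")
else:
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: sm_{props.major}{props.minor}, "
          f"{props.total_memory / 1024**3:.1f} GiB")
    for problem in check_gpu(torch.cuda.is_bf16_supported(), props.total_memory):
        print("WARNING:", problem)
```

On an RTX 4070 Ti SUPER the device line should show sm_89; any warning suggests the wrong GPU index or an unexpected PyTorch build.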

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support landed in diffusers via two merged PRs (#12703 and #12715); install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors
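Because Z-Image support comes from a source install, a quick check that the installed diffusers build actually exposes ZImagePipeline can save a confusing ImportError later. A minimal sketch; the helper name is illustrative, and the check only confirms that the attribute exists, not that the pipeline runs:

```python
import importlib


def has_pipeline(module_name: str = "diffusers", attr: str = "ZImagePipeline") -> bool:
    """True if `module_name` imports and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if has_pipeline():
    print("diffusers exposes ZImagePipeline")
else:
    print("ZImagePipeline not found: reinstall diffusers from source")
```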

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These files are the Comfy-Org split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. To load the workflow, drag the workflow JSON from Comfy-Org/workflow_templates into ComfyUI.
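A misplaced or partially downloaded file usually surfaces only as a loader error inside ComfyUI. The sketch below checks that the three files from the wget commands above exist and are non-empty; the function name is illustrative, and it assumes the script is run from the ComfyUI root:

```python
from pathlib import Path

# Relative paths under the ComfyUI root, matching the wget commands above.
EXPECTED_FILES = [
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root: str) -> list[str]:
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


# Assumption: the current working directory is the ComfyUI root.
for rel in missing_model_files("."):
    print("missing or empty:", rel)
```

This does not verify checksums; it only catches wrong directories and zero-byte downloads.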

Running

Path A — diffusers snippet

The inference snippet below is from the Tongyi-MAI HF model card. Z-Image-Turbo uses 8 NFEs; the card's snippet sets num_inference_steps=9 and guidance_scale=0.0, with the card's own inline comments noting that 9 results in 8 DiT forwards and that guidance should be 0 for the Turbo models:

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

Speed: No community benchmark on the RTX 4070 Ti SUPER has been published yet, and the backend has no ingested run for this pair (/check/z-image-turbo/rtx-4070-ti-super returns verdict: unknown). The official Tongyi-MAI card cites "sub-second inference latency" on enterprise-grade H800 GPUs, which is not comparable to a consumer Ada card. We deliberately do not borrow a figure from a sibling card: the RTX 4070 Ti SUPER's ~672 GB/s memory bandwidth and 8448 CUDA cores sit below the RTX 4080 (~716.8 GB/s, 9728 cores) by enough margin that an RTX 4080 number would be an upper bound, not a transferable measurement. If you run it, please submit your numbers so a measured figure can land here.
VRAM usage: The model is described as fitting comfortably within "16G VRAM consumer devices" per the official model card — i.e. the BF16 build is the headline configuration for a 16GB card like the RTX 4070 Ti SUPER. Live measurements: /check/z-image-turbo/rtx-4070-ti-super.
Quality notes: The model card describes the architecture as a "Scalable Single-Stream DiT" (S3-DiT): text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level into a unified input stream. The card also claims that this 8-NFE design matches or exceeds leading competitors.

For the full benchmark data, see /check/z-image-turbo/rtx-4070-ti-super.
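If you want to contribute a measured figure, a small timing harness keeps the numbers comparable. The sketch below is one possible approach, not an official benchmark procedure: the function name, the single warm-up run, and the three timed runs are assumptions. GPU work is asynchronous, so the harness takes a sync callable that is invoked before each clock read.

```python
import statistics
import time


def time_runs(fn, runs=3, warmup=1, sync=lambda: None):
    """Call fn() warmup + runs times; return wall-clock seconds of the timed runs.

    `sync` is called before each clock read so pending GPU work has finished.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        sync()
        start = time.perf_counter()
        fn()
        sync()
        timings.append(time.perf_counter() - start)
    return timings


# CPU-only demonstration so the block runs anywhere.
demo = time_runs(lambda: sum(range(100_000)))
print(f"demo median: {statistics.median(demo) * 1000:.2f} ms")

# With the diffusers pipeline from the Running section (assumes `pipe`,
# `prompt`, and `torch` already exist in your session):
#   torch.cuda.reset_peak_memory_stats()
#   secs = time_runs(
#       lambda: pipe(prompt=prompt, height=1024, width=1024,
#                    num_inference_steps=9, guidance_scale=0.0),
#       sync=torch.cuda.synchronize,
#   )
#   peak_gib = torch.cuda.max_memory_allocated() / 1024**3
#   print(f"median {statistics.median(secs):.2f} s, peak {peak_gib:.1f} GiB")
```

Note that torch.cuda.max_memory_allocated() reports memory allocated through PyTorch only, so the value will be lower than what nvidia-smi shows for the whole process.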

Troubleshooting

Out of memory at first generation (diffusers path)

If the BF16 pipeline doesn't fit alongside other GPU-resident apps (browser GPU acceleration, second monitor, idle Docker compute), enable CPU offload — the Tongyi-MAI model card documents pipe.enable_model_cpu_offload() for memory-constrained devices, which moves idle parts to system RAM:

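import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
# Keeps idle components in system RAM and moves each one to the GPU when it runs
pipe.enable_model_cpu_offload()
# do NOT call pipe.to("cuda") when using offload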

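Offload trades some speed for headroom, because idle components are copied over the card's PCIe Gen4 x16 link each time they are needed. The 16 GB envelope is the model's headline tier, but the component sizes listed under Requirements add up to about 20.7 GB on disk (12.31 + 8.04 + 0.34), so peak usage depends on which components are resident at the same time. If pipe.to("cuda") runs out of memory, or if you co-host other GPU workloads, offload is the documented escape hatch.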
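One way to avoid choosing between the two placements by hand is to try full GPU residency first and fall back to offload only on an out-of-memory error. This is a sketch under stated assumptions: place_pipeline is a hypothetical helper, and it assumes enable_model_cpu_offload() can take over after a failed pipe.to("cuda"). If that does not hold in your diffusers version, reload the pipeline and call offload directly.

```python
def place_pipeline(pipe, device="cuda"):
    """Try full GPU residency; fall back to model CPU offload on CUDA OOM.

    Returns "gpu" or "offload" so callers can log which path was taken.
    """
    try:
        pipe.to(device)
        return "gpu"
    except RuntimeError as err:
        # torch.cuda.OutOfMemoryError is a RuntimeError subclass; re-raise
        # anything that is not an out-of-memory condition.
        if "out of memory" not in str(err).lower():
            raise
    pipe.enable_model_cpu_offload()
    return "offload"


# Usage, assuming `pipe` was created with ZImagePipeline.from_pretrained(...):
#   mode = place_pipeline(pipe)
#   print("placement:", mode)
```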

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI", then restart. This path is documented in the official ComfyUI Z-Image tutorial.

Don't reach for the NVFP4 weights on this card

The Comfy-Org mirror also ships an nvfp4 DiT and FP4-mixed text encoder. NVFP4 acceleration is a Blackwell (sm_120) feature; on the Ada Lovelace RTX 4070 Ti SUPER (sm_89) there is no NVFP4 tensor-core path, so those files give you no speed benefit and the BF16 build already fits 16 GB. Ada does have native FP8 (E4M3/E5M2) tensor cores, but BF16 is the documented, no-surprises path here — stick with the BF16 split files above.
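The precision argument above reduces to a lookup on CUDA compute capability. The sketch below encodes that lookup as a hypothetical helper; the thresholds reflect the text (FP8 on Ada sm_89, NVFP4 on Blackwell) plus the assumptions that bf16 starts at sm_80 and that data-center Blackwell (sm_100) also counts for NVFP4. It describes hardware capability only, not whether a given framework ships kernels for it.

```python
def tensor_core_formats(capability: tuple[int, int]) -> dict[str, bool]:
    """Map a CUDA compute capability to low-precision tensor-core formats.

    Assumed thresholds: bf16 from sm_80, FP8 from sm_89, NVFP4 from sm_100.
    """
    return {
        "bf16": capability >= (8, 0),
        "fp8": capability >= (8, 9),
        "nvfp4": capability >= (10, 0),
    }


# RTX 4070 Ti SUPER (Ada Lovelace, sm_89)
print(tensor_core_formats((8, 9)))
# A Blackwell consumer card (sm_120)
print(tensor_core_formats((12, 0)))
```

For sm_89 the helper reports bf16 and fp8 as available and nvfp4 as unavailable, which matches the advice to stay on the BF16 split files.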

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.