Z-Image Turbo on RTX 4060 Ti 16GB: 8-Step Text-to-Image at BF16 with Diffusers or ComfyUI

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 4060 Ti 16GB. The 16 GB Ada Lovelace card sits exactly in the model's headline VRAM tier, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no text-encoder workarounds.

Hardware data: RTX 4060 Ti 16GB · 8 NFEs at 1024×1024 BF16 · See benchmark data

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder (text_encoders/qwen_3_4b.safetensors in the Comfy-Org workflow, 8 GB on disk). At BF16 the text encoder alone needs roughly 8 GB, which leaves an 8 GB RTX 4060 no room for the DiT and VAE. On a 16 GB card the upstream-recommended BF16 build "fits comfortably" per the Tongyi-MAI model card, so this recipe stays on the canonical BF16 path rather than reaching for GGUF redistributors.
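
The "roughly 8 GB at BF16" figure is parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming the rounded parameter counts quoted in this recipe (4B and 6B) and 2 bytes per BF16 value; the helper name weight_gb is illustrative only:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts are the rounded figures quoted in this recipe, not exact counts.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


text_encoder_gb = weight_gb(4.0)  # Qwen3-4B text encoder
dit_gb = weight_gb(6.0)           # Z-Image-Turbo DiT

print(f"text encoder ~{text_encoder_gb:.1f} GB, DiT ~{dit_gb:.1f} GB")
```

These estimates line up with the on-disk sizes listed under Requirements (8.0 GB and 12.3 GB). They cover weights only and ignore activations and framework overhead.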

Note on variants: The Tongyi-MAI Z-Image family ships three weight sets — Z-Image (Base), Z-Image-Turbo, and Z-Image (Distilled). This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes like Juggernaut-Z (RunDiffusion) are a separate model with its own recipe.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM consumer card	RTX 4060 Ti 16GB (Ada Lovelace, sm_89)
RAM	16GB system RAM	—
Storage	~21GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.3 GB, per the Comfy-Org split-file mirror)	—
Software	Python 3.10+, PyTorch with CUDA + bf16 support	ComfyUI nightly / `diffusers` @ main
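
Before downloading, it can help to confirm the target drive has room for the three files. The sketch below sums the sizes from the table and compares them with free disk space; the 2 GB headroom is an arbitrary assumption, and both function names are illustrative:

```python
import shutil

# On-disk sizes in GB, as listed in the Requirements table (Comfy-Org split files).
COMPONENT_GB = {"dit": 12.3, "text_encoder": 8.0, "vae": 0.3}


def required_gb(headroom_gb: float = 2.0) -> float:
    """Total download size plus a safety margin (the margin is an assumption)."""
    return sum(COMPONENT_GB.values()) + headroom_gb


def has_enough_disk(path: str = ".", headroom_gb: float = 2.0) -> bool:
    """True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(headroom_gb)


print(f"need about {required_gb():.1f} GB, enough free space: {has_enough_disk('.')}")
```

Pass the directory you plan to download into (for example the ComfyUI models folder or the Hugging Face cache location) instead of "." to check the right drive.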

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card, and the RTX 4060 Ti 16GB matches that target tier exactly. No special CUDA wheel selection is required for Ada Lovelace cards: the standard CUDA builds of PyTorch already ship sm_89 kernels. On Linux, a plain pip install torch pulls one of those builds.
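
A quick way to confirm the software row of the table is to ask PyTorch what it sees. This is a minimal sketch that uses only standard torch.cuda queries; the 15 GiB threshold is an assumption chosen because 16 GB cards report slightly less than 16 GiB, and check_gpu is an illustrative name:

```python
def check_gpu(min_vram_gib: float = 15.0) -> list:
    """Return a list of problems found; an empty list means the basics look fine."""
    try:
        import torch
    except (ImportError, OSError):
        return ["PyTorch is not installed or failed to load"]
    if not torch.cuda.is_available():
        return ["CUDA is not available (CPU-only build or driver issue)"]

    problems = []
    props = torch.cuda.get_device_properties(0)
    vram_gib = props.total_memory / 1024**3
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{props.name}: {vram_gib:.1f} GiB, sm_{major}{minor}")

    if vram_gib < min_vram_gib:
        problems.append(f"only {vram_gib:.1f} GiB of VRAM reported")
    if not torch.cuda.is_bf16_supported():
        problems.append("GPU does not report BF16 support")
    return problems


for problem in check_gpu():
    print("WARNING:", problem)
```

An RTX 4060 Ti should report compute capability 8.9 (sm_89), matching the Tested column above.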

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support is in diffusers main; install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors
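
Since Z-Image support lives on diffusers main, an older release picked up from another environment is a plausible failure mode. The sketch below checks whether the installed diffusers exposes ZImagePipeline, the class used in the Running section; module_has_attr is an illustrative helper, not part of any library:

```python
import importlib


def module_has_attr(module_name: str, attr: str) -> bool:
    """True if the module imports and exposes the named attribute."""
    try:
        module = importlib.import_module(module_name)
        getattr(module, attr)
    except Exception:
        # Covers a missing module, a missing attribute, and lazy-import failures.
        return False
    return True


print("ZImagePipeline available:", module_has_attr("diffusers", "ZImagePipeline"))
```

If this prints False after installing from source, check that the script runs in the same Python environment pip installed into.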

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These three files are the Comfy-Org split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. Load the workflow JSON from Comfy-Org/workflow_templates by dragging it into ComfyUI.
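
A misplaced file is an easy mistake with three separate directories. The following sketch lists any of the three files that are not where the commands above put them; the relative paths mirror those commands, and missing_model_files is an illustrative name:

```python
from pathlib import Path

# Relative paths inside the ComfyUI root, matching the wget commands in this recipe.
EXPECTED_FILES = [
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root) -> list:
    """Return the expected files that do not exist under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Run from the ComfyUI root, or pass its path instead of ".".
for rel in missing_model_files("."):
    print("missing:", rel)
```

The check only confirms that each file exists. It does not verify size or integrity, so an interrupted download would still pass.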

Running

Path A — diffusers snippet

The inference snippet below follows the Tongyi-MAI HF model card. Z-Image-Turbo uses 8 NFEs: per the card, the snippet sets num_inference_steps=9, which results in 8 DiT forwards, and guidance_scale=0.0.

import torch
from diffusers import ZImagePipeline

# Load the canonical BF16 weights from the Tongyi-MAI repo
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # results in 8 DiT forwards, per the model card
    guidance_scale=0.0,     # value the model card uses for Turbo
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for repeatable output
).images[0]
image.save("z_image_turbo_out.png")

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

Speed: No community benchmark on RTX 4060 Ti 16GB has been published yet. The official Tongyi-MAI card cites "sub-second inference latency on enterprise-grade H800 GPUs", which is not directly comparable to a consumer Ada card. The 4060 Ti has lower memory bandwidth than its 5060 Ti sibling, so the published 5060 Ti recipe's qualitative impressions don't transfer cleanly either. Once a community-submitted run lands, it will appear on /check/z-image-turbo/rtx-4060-ti-16gb. If you run it, please submit your numbers.
VRAM usage: Designed to "fit comfortably within 16G VRAM consumer devices" per the official model card — i.e. the BF16 build is the headline configuration for a 16GB card like the RTX 4060 Ti 16GB. Live measurements: /check/z-image-turbo/rtx-4060-ti-16gb.
Quality notes: The architecture is a "Scalable Single-Stream DiT (S3-DiT)" that concatenates text, visual semantic, and VAE tokens into a unified sequence. Per the model card, the design is optimized for 8-NFE generation that rivals full-step competitors.

For the full benchmark data, see /check/z-image-turbo/rtx-4060-ti-16gb.
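
If you want to submit numbers, wall-clock time and peak allocated VRAM are the two figures to capture. The sketch below wraps a pipeline call with standard torch.cuda counters; benchmark and summarize_run are illustrative names. Note that max_memory_allocated covers only PyTorch's own allocations, so it reads lower than the per-process figure in nvidia-smi:

```python
import time


def summarize_run(seconds: float, peak_bytes: int) -> dict:
    """Format raw measurements into rounded, human-readable values."""
    return {
        "seconds": round(seconds, 2),
        "peak_vram_gib": round(peak_bytes / 1024**3, 2),
    }


def benchmark(pipe, **call_kwargs) -> dict:
    """Time one pipeline call and record PyTorch's peak allocated VRAM."""
    import torch  # imported lazily so summarize_run works without a GPU

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(**call_kwargs)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return summarize_run(elapsed, torch.cuda.max_memory_allocated())


# Example call, reusing `pipe` and `prompt` from the diffusers snippet:
# print(benchmark(pipe, prompt=prompt, height=1024, width=1024,
#                 num_inference_steps=9, guidance_scale=0.0))
```

The first call after loading usually includes one-time warm-up cost, so run it at least twice and report the later result.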

Troubleshooting

Out of memory at first generation (diffusers path)

If the BF16 pipeline doesn't fit alongside other GPU-resident apps (browser GPU acceleration, second monitor, idle Docker compute), enable CPU offload. The Tongyi-MAI/Z-Image GitHub repo recommends calling pipe.enable_model_cpu_offload() after from_pretrained, which moves idle components to system RAM:

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.enable_model_cpu_offload()
# do NOT call pipe.to("cuda") when using offload
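
Model CPU offload keeps one component on the GPU at a time, so peak weight memory drops from the sum of all components to roughly the largest one. Here is a rough sketch of choosing between the two placements from free VRAM. It assumes resident weights are about the size of the files listed under Requirements, and the 2 GB margin for activations is a guess; choose_placement and free_vram_gb are illustrative names:

```python
# On-disk component sizes in GB from this recipe's Requirements table.
# Assumption: resident BF16 weights are roughly the same size as the files.
COMPONENT_GB = {"dit": 12.3, "text_encoder": 8.0, "vae": 0.3}


def choose_placement(free_vram_gb: float, margin_gb: float = 2.0) -> str:
    """Suggest a placement strategy from free VRAM (margin_gb is a guess)."""
    all_resident = sum(COMPONENT_GB.values())  # everything on the GPU at once
    largest = max(COMPONENT_GB.values())       # one component at a time
    if free_vram_gb >= all_resident + margin_gb:
        return "cuda"
    if free_vram_gb >= largest + margin_gb:
        return "model_cpu_offload"
    return "needs_more_aggressive_offload"


def free_vram_gb():
    """Free VRAM on the current device in GB, or None without PyTorch/CUDA."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


free = free_vram_gb()
if free is not None:
    print(f"{free:.1f} GB free -> {choose_placement(free)}")
```

"cuda" corresponds to pipe.to("cuda") and "model_cpu_offload" to pipe.enable_model_cpu_offload(). Because the three files sum to about 20.6 GB, this estimate leans toward offload on a 16 GB card; measured usage on your own system is the better guide.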

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI" → restart. This path is documented in the official ComfyUI Z-Image tutorial.

Have an 8 GB card instead?

The 8 GB Ada path (e.g. RTX 4060) is currently blocked: the Qwen3-4B text encoder needs roughly 8 GB at BF16 on its own, and the community GGUF text-encoder loader (city96/ComfyUI-GGUF's CLIPLoader (gguf)) does not yet support Qwen3-4B. Until that loader gap closes or an official 8 GB walkthrough is published, the 16 GB tier is the smallest practical card for this model. If you have measurements on an 8 GB card, please submit them.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.