Z-Image Turbo on RTX 3090 Ti: 8-Step 1024x1024 Text-to-Image at BF16 in ~6.7s with Diffusers or ComfyUI

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 3090 Ti. The 24 GB Ampere card sits well above the model's 16 GB headline floor, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no CPU offload, no text-encoder workarounds.

Hardware data: RTX 3090 Ti (24 GB VRAM) · 6.74 s per 1024×1024 BF16 image at 7 steps, denoise 0.82 (measured directly on an RTX 3090 Ti by the miroleon Z-Image-Turbo benchmark project) · ~13 GB peak VRAM (reported on an RTX 4090 in the AIbase launch write-up) · See benchmark data

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder (visible in the Comfy-Org workflow's text_encoders/qwen_3_4b.safetensors at 8 GB on disk), which alone needs roughly 8 GB at BF16. On a 24 GB card the upstream-recommended BF16 build "fits comfortably" per the Tongyi-MAI model card, with peak around 13 GB per the Alibaba launch write-up via AIbase. That leaves ~11 GB of the 3090 Ti's 24 GB free for batch_size > 1, higher resolutions, or other GPU-resident apps.

Note on variants: The Tongyi-MAI Z-Image family ships four weight sets: Z-Image (Base), Z-Image-Turbo, Z-Image-Omni-Base, and Z-Image-Edit. This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes like Juggernaut-Z (RunDiffusion) are separate models with their own slugs.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM consumer card (per Tongyi-MAI model card)	RTX 3090 Ti (24 GB, Ampere, sm_86)
RAM	16 GB system RAM	64 GB (per miroleon RTX 3090 Ti config)
Storage	~21 GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.3 GB, per the Comfy-Org split-file mirror)	n/a
Software	Python 3.10+, PyTorch with CUDA + bf16 support	ComfyUI nightly / diffusers @ main

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card. The RTX 3090 Ti sits well above that floor at 24 GB, so the canonical BF16 path is unconstrained. The 3090 Ti's Ampere sm_86 compute capability is fully supported by the default pip install torch wheels — no special CUDA wheel selection required.
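Before downloading, it can help to confirm that the target drive has room for the weights. The sketch below sums the per-file sizes from the Requirements table and compares the total against free disk space. The helper names and the 2 GB safety margin are illustrative assumptions, not part of any official tooling.

```python
import shutil

# Component sizes in GB, as listed in the Requirements table.
COMPONENT_GB = {"DiT": 12.3, "Qwen3-4B text encoder": 8.0, "VAE": 0.3}


def required_gb(components=COMPONENT_GB):
    """Sum the per-file sizes to get the total download footprint in GB."""
    return round(sum(components.values()), 1)


def has_room(path=".", margin_gb=2.0):
    """Return True if `path` has space for the weights plus a safety margin.

    margin_gb is an arbitrary illustrative buffer, not a documented requirement.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb() + margin_gb


print(f"Weights need about {required_gb()} GB; enough room here: {has_room()}")
```

Run it from the directory where the models will live (for example the ComfyUI root) so that shutil.disk_usage inspects the right filesystem.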

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support is in diffusers main; install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors
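After installing, a quick environment check can confirm that PyTorch sees the GPU, that BF16 is supported, and that the installed diffusers build exposes ZImagePipeline. This is a minimal sketch: describe_environment is a hypothetical helper name, and it only uses standard torch.cuda queries.

```python
def describe_environment():
    """Report whether this Python environment looks ready for the BF16 pipeline."""
    report = {
        "torch": None, "cuda": False, "device": None,
        "capability": None, "bf16": False, "z_image_pipeline": False,
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed yet

    report["torch"] = torch.__version__
    report["cuda"] = torch.cuda.is_available()
    if report["cuda"]:
        report["device"] = torch.cuda.get_device_name(0)
        # (major, minor) tuple; an sm_86 card such as the RTX 3090 Ti reports (8, 6)
        report["capability"] = torch.cuda.get_device_capability(0)
        report["bf16"] = torch.cuda.is_bf16_supported()

    try:
        from diffusers import ZImagePipeline  # noqa: F401
        report["z_image_pipeline"] = True
    except Exception:
        pass  # diffusers is missing or too old to include Z-Image support
    return report


print(describe_environment())
```

If z_image_pipeline comes back False while torch and cuda look fine, the likely cause is a diffusers release that predates Z-Image support; reinstalling from the GitHub source as shown above should address that.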

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These files are the Comfy-Org split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. To load the workflow, drag the workflow JSON from Comfy-Org/workflow_templates into ComfyUI.
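An interrupted wget can leave a file missing or empty, which tends to surface later as a confusing loader error. The sketch below checks that the three files from the commands above exist under a ComfyUI root. The function name is hypothetical and the check only looks at presence and non-zero size, not checksums.

```python
from pathlib import Path

# Relative paths taken from the wget commands above.
EXPECTED_FILES = [
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root):
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


# Run from the ComfyUI root; an empty list means all three files are in place.
print(missing_model_files("."))
```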

Running

Path A — diffusers snippet

The inference snippet below follows the Tongyi-MAI HF model card. Z-Image-Turbo is distilled for 8 NFEs (number of function evaluations, i.e. DiT forward passes). Per the card, the snippet sets num_inference_steps=9, which results in 8 DiT forwards, and guidance_scale=0.0:

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # 9 scheduler steps result in 8 DiT forwards
    guidance_scale=0.0,     # value used by the model card snippet
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

On a 24 GB RTX 3090 Ti there is no need to call pipe.enable_model_cpu_offload() — the full BF16 pipeline (DiT + Qwen3-4B text encoder + VAE) fits resident in VRAM with margin.
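To check the "fits resident" claim and the timing figures on your own card, a small harness along these lines may help. It is a sketch under assumptions: benchmark is a hypothetical helper, it reports the best of a few runs after a warm-up, and the VRAM figure comes from torch.cuda.max_memory_allocated(), which counts PyTorch tensor allocations only and so reads lower than nvidia-smi.

```python
import time


def benchmark(run, warmup=1, repeats=3):
    """Time run() after warm-up and report peak VRAM if CUDA is available.

    Returns (best_seconds, peak_gb); peak_gb is None without a CUDA device.
    """
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    for _ in range(warmup):
        run()  # the first call pays one-off costs such as kernel selection
    if cuda:
        torch.cuda.reset_peak_memory_stats()

    timings = []
    for _ in range(repeats):
        if cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        run()
        if cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        timings.append(time.perf_counter() - start)

    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if cuda else None
    return min(timings), peak_gb
```

With the pipeline from the snippet above already loaded, a call such as benchmark(lambda: pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0)) returns the steady-state seconds per image and the peak allocation in GB.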

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.

Results

Speed: The miroleon Z-Image-Turbo benchmark project measured the canonical BF16 workflow (BF16 Qwen3 + VAE) directly on an RTX 3090 Ti at 6.74 s end-to-end (steady-state Run 2) for 1024×1024, 7 steps, denoise 0.82, with a fixed prompt. That corresponds to 0.89 s/it in the raw benchmarks.json source data. Scaling linearly to the model card's canonical 8 NFEs at the same denoise puts the steady-state end-to-end runtime at about 7.7 s on this card. The same miroleon dataset reports FP8 Qwen3 + VAE at 7.28 s and BF16 GGUF + Qwen Q5 at 6.89 s on the same RTX 3090 Ti, so the BF16-native path is the fastest of the three on this card. For reference, release-day reporting from AIbase and the ComfyUI Wiki release announcement measured ~2.3 s on an RTX 4090, consistent with the Ada-vs-Ampere FP16 compute gap (~82 vs ~40 TFLOPS dense). If you run this on your own 3090 Ti, please submit your numbers so the live /check/ data set grows beyond the single miroleon citation.
VRAM usage: Peak usage is around 13 GB at 1024×1024 BF16. The AIbase launch write-up measured this on an RTX 4090 ("the VRAM pointer stably at 13GB"), not on a 3090 Ti. The figure should carry over because the BF16 weight footprint does not depend on GPU architecture: the same model card path with the same dtype produces the same on-card footprint on the 24 GB Ampere RTX 3090 Ti. Compute speed differs by architecture generation (see the Speed entry above), but the BF16 weight envelope does not. This sits well below the 16 GB headline floor on the model card and leaves ~11 GB of the 3090 Ti's 24 GB free for higher resolutions, batching, or coexisting workloads. Live measurements: /check/z-image-turbo/rtx-3090-ti.
Quality notes: The architecture is a "Scalable Single-Stream DiT (S3-DiT)" in which text, visual semantic, and VAE tokens are concatenated into a unified sequence. Per the model card, the design is optimized for 8-NFE generation that rivals full-step competitors. Z-Image-Turbo is noted in launch coverage for strong bilingual (English + Chinese) text rendering and photorealistic portrait quality.
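The step-count scaling and the RTX 4090 comparison in the Speed entry are simple arithmetic on the quoted figures. The sketch below reproduces them under two explicit assumptions: either the whole run scales with step count, or only the denoising loop grows while fixed overhead stays constant. The function names are illustrative.

```python
# Figures quoted in the Results section.
RUN_SECONDS_7_STEPS = 6.74   # RTX 3090 Ti, BF16, end-to-end
SECONDS_PER_STEP = 0.89      # RTX 3090 Ti, BF16
RTX_4090_SECONDS = 2.3       # release-day reporting


def scale_proportionally(total_seconds, from_steps, to_steps):
    """Assume the whole run, fixed overhead included, scales with step count."""
    return total_seconds * to_steps / from_steps


def add_steps(total_seconds, seconds_per_step, extra_steps):
    """Assume only the denoising loop grows; fixed overhead stays constant."""
    return total_seconds + seconds_per_step * extra_steps


upper = scale_proportionally(RUN_SECONDS_7_STEPS, 7, 8)
lower = add_steps(RUN_SECONDS_7_STEPS, SECONDS_PER_STEP, 1)
print(f"8-step estimate: {lower:.2f} to {upper:.2f} s")
print(f"4090 speed-up over the 7-step run: {RUN_SECONDS_7_STEPS / RTX_4090_SECONDS:.1f}x")
```

Both estimates are extrapolations from a 7-step measurement rather than measured 8-step results, so treat them as a range to compare your own numbers against.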

For the full benchmark data, see /check/z-image-turbo/rtx-3090-ti.

Troubleshooting

FP8 weights load but bring no speed win on this card

Z-Image-Turbo is available from redistributors in FP8 (*_fp8_e4m3fn.safetensors) and GGUF variants alongside BF16. They load cleanly on the RTX 3090 Ti, but the card is Ampere sm_86, which has no FP8 tensor cores; those first shipped on Hopper (sm_90) and consumer Ada (sm_89). At inference time PyTorch dequantizes FP8 weights to BF16 on the fly, so you keep a small VRAM saving but pay a small compute penalty. The miroleon dataset shows this directly: on the RTX 3090 Ti the BF16 Qwen3 + VAE path runs at 0.89 s/it (6.74 s total) while FP8 Qwen3 + VAE runs at 0.98 s/it (7.28 s total). Stick with BF16 on this card unless you specifically need the FP8 footprint to coexist with other GPU-resident workloads. (The same pattern is called out verbatim for "older GPUs" on the drbaph HiDream-O1 FP8 model card.)
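The rule of thumb above can be written as a small check on the compute capability tuple, together with the relative slowdown implied by the quoted timings. This is a sketch: the sm_89 threshold comes from the paragraph above, and the helper names are hypothetical.

```python
# FP8 tensor cores arrived with sm_89 (consumer Ada) and sm_90 (Hopper),
# per the paragraph above.
FP8_MIN_CAPABILITY = (8, 9)


def has_fp8_tensor_cores(capability):
    """capability is a (major, minor) tuple, e.g. (8, 6) for the RTX 3090 Ti."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


def slowdown_percent(baseline_seconds, candidate_seconds):
    """How much slower the candidate is than the baseline, in percent."""
    return (candidate_seconds / baseline_seconds - 1) * 100


print("3090 Ti has FP8 tensor cores:", has_fp8_tensor_cores((8, 6)))
print(f"FP8 vs BF16 end-to-end: {slowdown_percent(6.74, 7.28):.1f}% slower")
print(f"FP8 vs BF16 per iteration: {slowdown_percent(0.89, 0.98):.1f}% slower")
```

On a live system the tuple can be read with torch.cuda.get_device_capability() instead of being hard-coded.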

Higher resolution exhausts VRAM beyond 13 GB

The 13 GB VRAM envelope applies to 1024×1024. Doubling one or both axes (e.g. 2048×1024 or 2048×2048) increases both latency and peak VRAM well beyond that, which is why the Comfy-Org workflow template defaults to 1024×1024. On a 24 GB 3090 Ti you have meaningful headroom for higher resolutions, but if you push past ~1.5K on either axis and hit an out-of-memory (OOM) error, fall back to the official model card snippet with pipe.enable_model_cpu_offload(), which uses system RAM as a backstop at a latency cost:

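import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
# Keep submodules in system RAM and move each to the GPU only while it runs
pipe.enable_model_cpu_offload()
# do NOT call pipe.to("cuda") when using offload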
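As a rough guide to how quickly the workload grows with resolution, the sketch below compares pixel counts against the 1024×1024 baseline. Pixel count is only a proxy: actual latency and peak VRAM may grow faster than this ratio, and the helper name is illustrative.

```python
BASELINE = (1024, 1024)  # resolution the ~13 GB figure applies to


def pixel_ratio(width, height, baseline=BASELINE):
    """Pixels at the target resolution relative to the 1024x1024 baseline."""
    return (width * height) / (baseline[0] * baseline[1])


for size in [(1536, 1536), (2048, 1024), (2048, 2048)]:
    print(size, f"{pixel_ratio(*size):.2f}x pixels")
```

Doubling one axis doubles the pixel count and doubling both quadruples it, which is why 2048×2048 is far more demanding than 2048×1024.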

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI", then restart. This update path is documented in the official ComfyUI Z-Image tutorial.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.

Want the sibling RTX 3090 or RTX 4090 recipes?

If you're on a non-Ti RTX 3090, the install steps and code path are identical; see the published recipe at /recipes/z-image-turbo-on-rtx-3090-8-step-1024x1024-text-to-image-at-bf16-with-diffusers-or-comfyui. For the 24 GB Ada sibling (RTX 4090), the install matches, but the RTX 4090 runs the same workload roughly 3× faster per the measurements above (2.3 s vs ~6.7 s end-to-end); see /recipes/z-image-turbo-on-rtx-4090-8-step-1024x1024-text-to-image-at-bf16-in-2-3s-with-diffusers-or-comfyui.