Z-Image Turbo on RTX 3090: 8-Step 1024x1024 Text-to-Image at BF16 with Diffusers or ComfyUI

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image generation model — running text-to-image at 1024×1024 in 8 inference steps on an RTX 3090. The 24 GB Ampere card sits well above the model's 16 GB headline floor, so the canonical BF16 weights run directly via diffusers or via the official ComfyUI workflow — no GGUF quantization, no CPU offload, no text-encoder workarounds.

Hardware data: RTX 3090 (24 GB VRAM) · ~7–9 s per 1024×1024 BF16 image at 7–8 steps (estimated from a close-sibling RTX 3090 Ti measurement, see Results) · ~13 GB peak VRAM · See benchmark data

ℹ️ Why this is the comfortable path: Z-Image-Turbo pairs a 6B DiT with a Qwen3-4B text encoder. The encoder appears in the Comfy-Org workflow as text_encoders/qwen_3_4b.safetensors, 8 GB on disk, so it alone needs roughly 8 GB at BF16. The Tongyi-MAI model card says the upstream-recommended BF16 build "fits comfortably" within 16 GB, and the Alibaba launch write-up via AIbase reports a measured peak of around 13 GB (on an RTX 4090, see Results). At that peak, ~11 GB of the 3090's 24 GB stays free for batch_size > 1, higher resolutions, or other GPU-resident apps.

Note on variants: The Tongyi-MAI Z-Image family ships three weight sets: Z-Image-Turbo, Z-Image-Base, and Z-Image-Edit. This recipe targets Z-Image-Turbo, the consumer-friendly distilled variant. Fine-tunes like Juggernaut-Z (RunDiffusion) are a separate model with its own slug.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM consumer card (per Tongyi-MAI model card)	RTX 3090 (24 GB, Ampere, sm_86)
RAM	16 GB system RAM	—
Storage	~21 GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.3 GB, per the Comfy-Org split-file mirror)	—
Software	Python 3.10+, PyTorch with CUDA + bf16 support	ComfyUI nightly / `diffusers` @ main

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card. The RTX 3090 sits well above that floor at 24 GB, so the canonical BF16 path is unconstrained. The 3090's Ampere sm_86 compute capability is fully supported by the default pip install torch wheels — no special CUDA wheel selection required.
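The storage and headroom figures above are simple sums, and a short sketch makes them easy to recheck if the file sizes change. The sizes are the ones quoted in the Requirements table; the helper names are illustrative, not part of any library.

```python
# File sizes in GB as quoted in the Requirements table (Comfy-Org split files).
WEIGHT_FILES_GB = {
    "z_image_turbo_bf16.safetensors": 12.3,  # DiT
    "qwen_3_4b.safetensors": 8.0,            # Qwen3-4B text encoder
    "ae.safetensors": 0.3,                   # VAE
}


def disk_footprint_gb(files):
    """Total download size in GB, rounded to one decimal."""
    return round(sum(files.values()), 1)


def vram_headroom_gb(card_gb, peak_gb):
    """VRAM left over on the card at the reported peak usage."""
    return card_gb - peak_gb


if __name__ == "__main__":
    print("disk footprint (GB):", disk_footprint_gb(WEIGHT_FILES_GB))
    print("headroom at 13 GB peak (GB):", vram_headroom_gb(24, 13))
```

The total comes to about 20.6 GB, which is where the ~21 GB storage figure comes from.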

Installation

Path A — HuggingFace diffusers (Python script)

Z-Image support is in diffusers main; install from source per the official model card and the Tongyi-MAI/Z-Image GitHub README:

pip install git+https://github.com/huggingface/diffusers
pip install torch transformers accelerate safetensors
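After installing, a quick sanity check can confirm that PyTorch sees the GPU, that BF16 is supported, and that the installed diffusers build exposes ZImagePipeline. This is a minimal sketch; the function name and report keys are illustrative, and it assumes the 3090 is device 0.

```python
def check_environment():
    """Return a dict describing whether this machine can run the BF16 path."""
    report = {
        "torch": False, "cuda": False, "bf16": False,
        "device": None, "capability": None, "vram_gb": None,
        "zimage_pipeline": False,
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch"] = True

    try:
        import diffusers
        # True only when the installed diffusers build includes Z-Image support.
        report["zimage_pipeline"] = hasattr(diffusers, "ZImagePipeline")
    except Exception:
        pass

    if not torch.cuda.is_available():
        return report
    report["cuda"] = True
    props = torch.cuda.get_device_properties(0)  # assumes the 3090 is device 0
    report["device"] = props.name
    report["capability"] = (props.major, props.minor)  # an RTX 3090 is (8, 6)
    report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    report["bf16"] = torch.cuda.is_bf16_supported()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If zimage_pipeline comes back False, the installed diffusers release most likely predates Z-Image support and the install-from-source step needs repeating.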

Path B — ComfyUI (official workflow)

Per the official ComfyUI tutorial, update ComfyUI to the latest nightly via ComfyUI Manager, then place three files into the standard model directories:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors

These files are the Comfy-Org split-file mirror of the Tongyi-MAI weights, repackaged for the ComfyUI loader graph. Load the workflow JSON from Comfy-Org/workflow_templates by dragging it into ComfyUI.
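A missing or misplaced file is a likely cause of loader errors, so it can help to confirm the three downloads landed where the wget commands put them. The sketch below only checks that the files exist; the relative paths come from the commands above, and the function name is illustrative.

```python
from pathlib import Path

# Relative paths taken from the wget commands above.
EXPECTED_FILES = [
    "models/diffusion_models/z_image_turbo_bf16.safetensors",
    "models/text_encoders/qwen_3_4b.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root):
    """Return the expected files that are absent under the ComfyUI root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the ComfyUI root, or pass a different path here.
    missing = missing_model_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All three Z-Image-Turbo files are in place.")
```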

Running

Path A — diffusers snippet

The inference snippet below follows the one on the Tongyi-MAI HF model card, with a shorter prompt. Z-Image-Turbo uses 8 NFEs: the snippet sets num_inference_steps=9 and guidance_scale=0.0 per the card, and 9 steps result in 8 DiT forwards.

import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

prompt = "A photo of a city at night, neon signs reflecting on wet pavement"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("z_image_turbo_out.png")

On a 24 GB RTX 3090 there is no need to call pipe.enable_model_cpu_offload() — the full BF16 pipeline (DiT + Qwen3-4B text encoder + VAE) fits resident in VRAM with margin.
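The 13 GB peak quoted in this recipe is a third-party measurement on an RTX 4090, and the weight files listed under Requirements total about 20.6 GB, so it is worth measuring time and VRAM on your own card. The helper below is a minimal sketch, not an official benchmark harness: the function names are illustrative, and torch.cuda.max_memory_allocated() counts only PyTorch tensor allocations, so it will typically read lower than nvidia-smi.

```python
import time


def summarize(timings, peak_gb):
    """Condense raw timings (seconds) and peak VRAM (GB) into a small report."""
    return {
        "best_s": round(min(timings), 2),
        "mean_s": round(sum(timings) / len(timings), 2),
        "peak_vram_gb": round(peak_gb, 1),
    }


def benchmark(pipe_call, runs=3, warmup=1):
    """Time a zero-argument callable and report peak allocated VRAM.

    pipe_call is any callable that runs one generation, for example a
    lambda wrapping pipe(...). Warm-up runs are excluded from the timings.
    """
    import torch  # imported lazily so the helpers can be defined without a GPU

    for _ in range(warmup):
        pipe_call()
    torch.cuda.reset_peak_memory_stats()
    timings = []
    for _ in range(runs):
        torch.cuda.synchronize()  # make sure earlier GPU work has finished
        start = time.perf_counter()
        pipe_call()
        torch.cuda.synchronize()  # wait for this run's GPU work to finish
        timings.append(time.perf_counter() - start)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return summarize(timings, peak_gb)
```

With the pipeline from the snippet above already loaded, a call such as benchmark(lambda: pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0)) returns the best and mean wall-clock time plus the peak allocation.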

Path B — ComfyUI

After dropping the workflow JSON into ComfyUI, edit the prompt node and hit Queue Prompt. The preconfigured workflow runs with the 8-NFE schedule out of the box.
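For scripted runs, the same workflow can be queued without the browser by posting it to ComfyUI's /prompt HTTP endpoint. The sketch below assumes a local server on the default 127.0.0.1:8188 and a workflow exported in ComfyUI's API format, which differs from the drag-and-drop JSON. The file name in the usage note is hypothetical.

```python
import json
import urllib.request


def build_prompt_request(workflow, host="127.0.0.1", port=8188):
    """Build the POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, host="127.0.0.1", port=8188):
    """Load an API-format workflow file and queue it on a running ComfyUI."""
    with open(path, "r", encoding="utf-8") as fh:
        workflow = json.load(fh)
    request = build_prompt_request(workflow, host=host, port=port)
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # the server's JSON reply for the queued job
```

A call such as queue_workflow("z_image_turbo_api.json") queues one generation. Changing the prompt text means editing the matching node's inputs in the loaded dict before sending it.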

Results

Speed: No first-party RTX 3090 (non-Ti) measurement has been published. The closest cited number comes from the community miroleon Z-Image-Turbo benchmark project (source data). On an RTX 3090 Ti it reports a steady-state Run 2 of 6.74 s end-to-end for the BF16 workflow (BF16 Qwen3 + VAE) at 1024×1024, 7 steps, denoise 0.82, at 0.89 s/it. The RTX 3090 Ti is a close Ampere sibling: same 24 GB of GDDR6X, same sm_86 architecture, about 8 % more memory bandwidth (1008 vs 936 GB/s) and ~10–12 % more FP16 compute than the plain RTX 3090. Expect slightly slower numbers on a non-Ti RTX 3090: roughly 7–8 s end-to-end at the same 7-step BF16 setting and ~9 s at the model-card-canonical 8-NFE schedule. For comparison, release-day reporting from AIbase and the ComfyUI Wiki release announcement measured ~2.3 s on an RTX 4090, broadly consistent with the Ada-vs-Ampere FP16 compute gap (~82 vs ~36 TFLOPS dense). If you run this on your own 3090, please submit your numbers so we can replace the sibling-card citation with an empirical 3090 measurement.
VRAM usage: peak around 13 GB at 1024×1024 BF16. The AIbase launch write-up measured this on an RTX 4090 ("the VRAM pointer stably at 13GB"). The resident size of BF16 weights does not depend on GPU architecture, so the same model-card code path at the same dtype should produce about the same footprint on the 24 GB Ampere RTX 3090; compute speed differs, VRAM does not (see the Speed bullet above). This sits well below the 16 GB headline floor on the model card and leaves ~11 GB of the 3090's 24 GB free for higher resolutions, batching, or coexisting workloads. Live measurements: /check/z-image-turbo/rtx-3090.
Quality notes: The architecture is a "Scalable Single-Stream DiT (S3-DiT)" in which text, visual semantic, and VAE tokens are concatenated into a unified sequence. Per the model card, the design is optimized for 8-NFE generation that rivals full-step competitors. Launch coverage notes Z-Image-Turbo's strong bilingual (English + Chinese) text rendering and photorealistic portrait quality.
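The non-Ti estimate in the Speed bullet can be reproduced with back-of-envelope arithmetic. The sketch below splits the 3090 Ti result into fixed overhead plus per-step cost, then scales by an assumed slowdown factor. Scaling linearly with the hardware ratio is an assumption, not a measurement, and the function name is illustrative.

```python
# RTX 3090 Ti figures quoted from the community benchmark in the Speed bullet.
TI_TOTAL_S = 6.74    # end-to-end seconds at 7 steps
TI_S_PER_IT = 0.89   # seconds per denoising step
TI_STEPS = 7


def estimate_seconds(steps, slowdown):
    """Estimate end-to-end time on a slower card.

    slowdown is the assumed ratio of 3090 time to 3090 Ti time, e.g. 1.08
    if the workload tracks memory bandwidth (1008 / 936 is about 1.08) or
    1.12 if it tracks FP16 compute.
    """
    # Everything that is not a denoising step: text encode, VAE decode, etc.
    overhead = TI_TOTAL_S - TI_STEPS * TI_S_PER_IT
    return round((overhead + steps * TI_S_PER_IT) * slowdown, 2)


if __name__ == "__main__":
    for steps in (7, 8):
        low = estimate_seconds(steps, 1.08)
        high = estimate_seconds(steps, 1.12)
        print(f"{steps} steps: {low}-{high} s")
```

Under these assumptions the 7-step estimate lands between 7 and 8 s and the 8-step estimate between 8 and 9 s, in line with the ranges quoted above.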

For the full benchmark data, see /check/z-image-turbo/rtx-3090.

Troubleshooting

Higher resolution exhausts VRAM beyond 13 GB

The 13 GB VRAM envelope is for 1024×1024. Doubling one or both axes (e.g. 2048×1024 or 2048×2048) increases both latency and peak VRAM well beyond that, which is why the Comfy-Org workflow template defaults to 1024×1024. On a 24 GB 3090 you have meaningful headroom for higher resolutions, but if you push past ~1.5K on either axis and hit OOM, fall back to the official model card snippet with pipe.enable_model_cpu_offload() (system RAM acts as a backstop at a latency cost):

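import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
# Keep idle components in system RAM and move each to the GPU only while it runs.
pipe.enable_model_cpu_offload()
# do NOT call pipe.to("cuda") when using offload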
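If you would rather keep the fast fully resident path and switch to offload only when a run actually fails, the fallback can be automated. This is a minimal sketch of that pattern, not part of diffusers: load_pipe and run are callables you supply, and oom_error is the exception type to catch, which would be torch.cuda.OutOfMemoryError with a recent PyTorch.

```python
import gc


def run_with_offload_fallback(load_pipe, run, oom_error):
    """Try the resident pipeline first, then retry once with CPU offload.

    load_pipe(offload) builds a pipeline; with offload=True it should call
    enable_model_cpu_offload() instead of .to("cuda").
    run(pipe) performs one generation and returns its result.
    """
    pipe = load_pipe(False)
    try:
        return run(pipe)
    except oom_error:
        pass  # fall through and retry outside the except block
    # Retrying here, after the handler has exited, lets Python release the
    # failed attempt's traceback and the tensors it referenced.
    del pipe
    gc.collect()
    pipe = load_pipe(True)
    return run(pipe)
```

In practice load_pipe would wrap ZImagePipeline.from_pretrained(...) as in the snippet above, and it is reasonable to call torch.cuda.empty_cache() before loading the offloaded pipeline. The second load adds noticeable startup time, so this suits occasional high-resolution runs rather than every call.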

ComfyUI doesn't recognize Z-Image nodes

The Z-Image loader nodes ship in ComfyUI's nightly builds, not the stable release. Update via ComfyUI Manager → "Update ComfyUI", then restart. This update path is documented in the official ComfyUI Z-Image tutorial.

Confusion with Juggernaut-Z

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base, distributed under RunDiffusion/Juggernaut-Z-Image — a different model with its own slug. If you want the original Tongyi-MAI base or turbo weights, stick to the Tongyi-MAI/Z-Image-* repos linked above.

Want the sibling 16 GB or 24 GB Ada recipes?

If you're on a 16 GB card such as the RTX 4060 Ti 16GB (Ada) or RTX 5060 Ti 16GB (Blackwell), the install and code path are identical; see the published recipe at /recipes/z-image-turbo-on-rtx-4060-ti-16gb-8-step-text-to-image-at-bf16-with-diffusers-or-comfyui. For the 24 GB Ada sibling (RTX 4090), the install matches but the card runs the same workload roughly 3× faster per the measurements above (2.3 s vs ~7 s end-to-end); see /recipes/z-image-turbo-on-rtx-4090-8-step-1024x1024-text-to-image-at-bf16-in-2-3s-with-diffusers-or-comfyui.