How much VRAM does Z-Image Turbo need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

Z-Image Turbo on RX 7800 XT: BF16 8-Step Text-to-Image via ComfyUI on ROCm

What You'll Build

A local install of Z-Image-Turbo (Alibaba Tongyi-MAI's 6B-parameter distilled image-generation model) producing 1024×1024 text-to-image output in 8 steps on a Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack and ComfyUI. The model's headline tier is "16G VRAM consumer devices," and the RX 7800 XT is exactly a 16 GB card, so the canonical BF16 weights run directly: no GGUF quant is required, just a careful eye on headroom (see Requirements). AMD ships an official ComfyUI + Z-Image-Turbo playbook for Radeon hardware, so this is a first-party-supported path, not a community port.

Hardware data: RX 7800 XT (16GB VRAM) · BF16 · 8 NFEs at 1024×1024 · ROCm 7.x · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers, and no FP8/FP4 path here. RDNA3's WMMA units have no FP8 or FP4 hardware (AMD GPUOpen, "WMMA on RDNA3"), so an FP8 or NVFP4 checkpoint would just upcast to BF16 at load — no memory saving, no speedup. The attention path is PyTorch SDPA, selected with ComfyUI's --use-pytorch-cross-attention flag exactly as AMD's own playbook instructs. The full BF16 build is the lead path: the 6B DiT is ~12.3 GB and fits the 16 GB card.

ℹ️ Note on variants. The Tongyi-MAI Z-Image family currently ships four variants — Z-Image-Turbo (the distilled 8-step consumer model this recipe targets), Z-Image (the 50-step foundation model), Z-Image-Omni-Base, and Z-Image-Edit. All four share a 6B "Scalable Single-Stream DiT" (S3-DiT) backbone. Fine-tunes such as Juggernaut-Z (RunDiffusion) are a separate model with their own recipe — don't conflate them.

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB, RDNA3, gfx1101)
RAM	16 GB system RAM (32 GB recommended — see headroom note)	—
Storage	~21 GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.34 GB, per the Comfy-Org split-file mirror)	—
Driver / runtime	AMD ROCm 7.x on Linux	ROCm 7.2 PyTorch wheels
Software	ComfyUI (native ROCm support)	ComfyUI with `--use-pytorch-cross-attention`

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card, and the RX 7800 XT is a 16 GB card, so it sits right at the model's stated tier rather than above it (unlike the 24 GB RX 7900 XTX). The binding fact is the 6B DiT at BF16 = 12.31 GB (verified against the Comfy-Org HF tree); with the VAE (0.34 GB) plus activations resident during sampling, the DiT stage stays within the 16 GB budget.

The Qwen3-4B text encoder (8.04 GB) is the other heavy component, but ComfyUI runs the text-encode pass and then loads the DiT, so the encoder and DiT are not both fully resident at the sampling peak. Keep --disable-smart-memory handy (it offloads models back to system RAM between stages; see Running) and have 32 GB of system RAM so the offloaded encoder has somewhere to land. The 13 GB minimum quoted for this recipe (min_vram_gb: 13) reflects the installed BF16 DiT path's realistic peak (12.3 GB DiT + VAE + activations) on this 16 GB card.

The model is released under Apache-2.0 (per the model card frontmatter), so commercial use is permitted, and the weights are not gated, so no Hugging Face login is required.
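The headroom argument above is plain arithmetic, and a minimal sketch makes it easy to redo for a different card. The sizes are the ones quoted in this section; the function names are illustrative only, and activation memory is deliberately left out because no measured figure exists for this card.

```python
# Component sizes in GB, as quoted above from the Comfy-Org split-file tree.
DIT_GB = 12.31
TEXT_ENCODER_GB = 8.04
VAE_GB = 0.34
CARD_VRAM_GB = 16.0


def sampling_headroom_gb(card_gb: float = CARD_VRAM_GB) -> float:
    """VRAM left for activations once the DiT and VAE are resident."""
    return round(card_gb - (DIT_GB + VAE_GB), 2)


def everything_resident_gb() -> float:
    """Hypothetical worst case: encoder, DiT and VAE all in VRAM at once."""
    return round(DIT_GB + TEXT_ENCODER_GB + VAE_GB, 2)


if __name__ == "__main__":
    print(f"headroom during sampling: {sampling_headroom_gb()} GB")
    print(f"all three resident:       {everything_resident_gb()} GB")
```

The second number exceeds 16 GB, which is why the encoder has to leave VRAM before sampling starts.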

If you find the BF16 path too tight on a busy desktop session, the community-favoured fallback on this card is the GGUF Q8_0 build via ComfyUI-GGUF (a 7900 XTX user in Issue #11551 reported GGUF Q8 ran "a lot better than BF16" for them). That's an optional headroom move on the 7800 XT, not a requirement — BF16 is the documented lead.

Installation

1. Install PyTorch with ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux. Install the ROCm PyTorch wheels — per the ComfyUI README "AMD GPUs (ROCm)" section, the current stable channel is rocm7.2:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the live "AMD GPUs (ROCm)" block in the ComfyUI README at install time and use whatever tag it shows. AMD also publishes Radeon-specific wheels (repo.radeon.com) and a step-by-step ROCm-on-Radeon ComfyUI install guide pinned to ROCm 7.2.1 — either wheel source works on gfx1101. The RX 7800 XT (gfx1101) is natively supported; you should not need HSA_OVERRIDE_GFX_VERSION on a current ROCm/PyTorch build, but if a library ships only gfx1100 kernels you can export HSA_OVERRIDE_GFX_VERSION=11.0.0 as a legacy fallback to masquerade as gfx1100.
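Before moving on, it can help to confirm that the wheel you installed is actually a ROCm build. The following is a small sketch, not part of AMD's or ComfyUI's instructions; the function name and the report keys are made up for illustration. It relies on torch.version.hip being set only on ROCm builds, and on ROCm devices being exposed through the torch.cuda namespace.

```python
import importlib.util
import os


def rocm_report() -> dict:
    """Collect basic facts about the installed PyTorch build, if there is one."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "hip_version": None,
        "device_name": None,
        # Should normally be unset on gfx1101; see the fallback note above.
        "hsa_override": os.environ.get("HSA_OVERRIDE_GFX_VERSION"),
    }
    if report["torch_installed"]:
        import torch

        # A version string on ROCm builds, None on CPU or CUDA builds.
        report["hip_version"] = getattr(torch.version, "hip", None)
        # ROCm GPUs are reached through the torch.cuda API.
        if torch.cuda.is_available():
            report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in rocm_report().items():
        print(f"{key}: {value}")
```

If hip_version prints None, the installed wheel is not a ROCm build and the index URL is the first thing to recheck.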

2. Install ComfyUI

Per the AMD ROCm-on-Radeon ComfyUI guide:

git clone https://github.com/comfyanonymous/ComfyUI.git && cd ComfyUI
pip install -r requirements.txt

3. Download the Z-Image-Turbo BF16 weights

Place the three Comfy-Org-packaged BF16 files into the standard ComfyUI model folders. These are the exact filenames and URLs from the official ComfyUI Z-Image-Turbo tutorial (and the same files referenced by AMD's playbook), with on-disk sizes verified against the Comfy-Org HF tree:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors  # 12.3 GB

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors            # 8.04 GB

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors                            # 0.34 GB

Z-Image-Turbo pairs the 6B DiT with a Qwen3-4B text encoder (the qwen_3_4b.safetensors above, ~8 GB at BF16). On an 8 GB card neither that encoder nor the 12.3 GB DiT fits in VRAM at BF16; on the 16 GB RX 7800 XT the DiT (12.3 GB) + VAE fit during sampling, and the encoder offloads to system RAM between the text-encode and sampling stages (this is why --disable-smart-memory and 32 GB of system RAM help). The Comfy-Org repo also publishes nvfp4 and fp8_mixed files, but ignore them on this card: RDNA3 has no FP4/FP8 hardware, so they would upcast to BF16 with no benefit. The plain _bf16 files are the correct choice.
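A truncated download is an easy way to lose an evening, so a quick check of the three files is worth a few lines. This is a sketch under two assumptions that are not from the source: the quoted sizes are decimal gigabytes (1e9 bytes), and a 10% tolerance is loose enough to absorb rounding. The helper name check_weights is hypothetical.

```python
from pathlib import Path

# Relative paths and approximate sizes (GB) from the download step above.
EXPECTED = {
    "models/diffusion_models/z_image_turbo_bf16.safetensors": 12.3,
    "models/text_encoders/qwen_3_4b.safetensors": 8.04,
    "models/vae/ae.safetensors": 0.34,
}


def check_weights(comfy_root: str, tolerance: float = 0.10) -> dict:
    """Map each expected file to 'ok', 'missing', or 'size mismatch'."""
    status = {}
    for rel_path, expected_gb in EXPECTED.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            status[rel_path] = "missing"
            continue
        # Assumption: sizes are quoted in decimal GB.
        actual_gb = path.stat().st_size / 1e9
        within = abs(actual_gb - expected_gb) <= tolerance * expected_gb
        status[rel_path] = "ok" if within else "size mismatch"
    return status


if __name__ == "__main__":
    # Run from the ComfyUI root.
    for name, state in check_weights(".").items():
        print(f"{state:>13}  {name}")
```

A "size mismatch" usually means an interrupted wget; re-running the download for that one file is the fix.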

4. Load the official workflow

Update ComfyUI to the latest version (the Z-Image nodes ship in current builds), then load the bundled Z-Image-Turbo workflow template — in ComfyUI, Workflow → Browse Templates → Image → Z-Image-Turbo, or drag the template JSON in. The ComfyUI tutorial covers this drag-and-drop step; if ComfyUI prompts that models are missing, the in-app download buttons fetch the same three files from step 3.

Running

Launch ComfyUI with the PyTorch SDPA attention backend — this is the attention path AMD's own playbook specifies for ROCm (it replaces the CUDA-only FlashAttention/xformers paths, which don't apply on RDNA3). This flag is also the community-confirmed fix for the Z-Image-Turbo VAE-decode crash on RDNA3 + ROCm: with the default attention backend the DiT samples fine and then dies at VAE Decode, and --use-pytorch-cross-attention resolves it (see Troubleshooting). On the 16 GB RX 7800 XT it is also worth adding --disable-smart-memory from the start so the heavy Qwen3-4B encoder offloads back to system RAM between stages, freeing VRAM for the DiT; if loads stall on repeated runs, add --disable-pinned-memory:

python3 main.py --use-pytorch-cross-attention --disable-smart-memory

Open http://127.0.0.1:8188, edit the prompt node in the loaded Z-Image-Turbo template, and press Queue Prompt. Per AMD's playbook and the model card, the Turbo workflow is distilled for 8 NFEs (the playbook recommends a 4–10 step range), very low guidance (CFG ≈ 1.0–2.0; the model card lists CFG as off for Turbo), the euler or res_multistep sampler, and resolution kept at or below 1024 px on the longest side. The preconfigured template already encodes these — you usually just change the prompt.
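If you do edit the template, the ranges above are easy to encode as a sanity check. This is only an illustrative sketch: check_turbo_settings is a hypothetical helper, not a ComfyUI function, and the limits are the ones quoted from AMD's playbook in the paragraph above.

```python
def check_turbo_settings(steps: int, cfg: float, width: int, height: int) -> list:
    """Return warnings for settings outside the ranges quoted above."""
    warnings = []
    if not 4 <= steps <= 10:
        warnings.append(f"steps={steps} is outside the recommended 4-10 range")
    if not 1.0 <= cfg <= 2.0:
        warnings.append(f"cfg={cfg} is outside the recommended 1.0-2.0 range")
    if max(width, height) > 1024:
        warnings.append(f"{width}x{height} exceeds 1024 px on the longest side")
    return warnings


if __name__ == "__main__":
    # Template defaults: no warnings expected.
    print(check_turbo_settings(steps=8, cfg=1.0, width=1024, height=1024))
    # Settings carried over from a non-distilled model: three warnings.
    for line in check_turbo_settings(steps=30, cfg=7.0, width=2048, height=1024):
        print("warning:", line)
```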

The first run loads the BF16 weights into VRAM; the generated image lands in ComfyUI/output/.

Results

Speed: Omitted, because no RX-7800-XT-named Z-Image-Turbo benchmark exists yet. The backend has no ingested run for this pair (/check/z-image-turbo/rx-7800-xt returns verdict: unknown). The figures that do exist are not transferable: the model card's "sub-second on enterprise-grade H800 GPUs" claim was measured on a datacenter card, and community 7900 XTX timings (e.g. ~11 s steady-state for GGUF Q8 in Issue #11551) come from a card with ~54% more memory bandwidth (960 vs 624 GB/s) and a different quant, so they are not a 7800 XT BF16 number. We deliberately do not relabel a different card's or model's figure. If you run Z-Image-Turbo on a 7800 XT, please submit your timing so a measured figure can land here.
VRAM usage: The model "fits comfortably within 16G VRAM consumer devices" per the official model card; the 6B DiT is 12.31 GB at BF16 (Comfy-Org tree), which the 16 GB RX 7800 XT holds during sampling with the encoder offloaded between stages. This is tighter than on a 24 GB card — keep --disable-smart-memory on and 32 GB system RAM available. Live measurements: /check/z-image-turbo/rx-7800-xt.
Quality notes: Architecture is "Scalable Single-Stream DiT (S3-DiT)" with text, visual-semantic, and VAE tokens concatenated into one unified input stream, optimized for 8-NFE generation while matching or exceeding leading competitors per the model card. Turbo is distilled for low step counts — pushing steps far above ~10 or raising CFG well past 2.0 degrades rather than improves output.
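The "~54% more memory bandwidth" figure in the Speed note comes straight from the two quoted bandwidths. The sketch below reproduces it; the helper name is illustrative. It shows only how far apart the two cards are and is not a way to predict a 7800 XT timing.

```python
XTX_BANDWIDTH_GBPS = 960  # RX 7900 XTX, as quoted above
XT_BANDWIDTH_GBPS = 624   # RX 7800 XT, as quoted above


def percent_more(larger: float, smaller: float) -> float:
    """How much bigger `larger` is than `smaller`, in percent."""
    return (larger / smaller - 1.0) * 100.0


if __name__ == "__main__":
    gap = percent_more(XTX_BANDWIDTH_GBPS, XT_BANDWIDTH_GBPS)
    print(f"7900 XTX has about {gap:.0f}% more memory bandwidth")
```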

For the full benchmark data, see /check/z-image-turbo/rx-7800-xt.

Troubleshooting

VAE Decode crashes on ROCm (the DiT generates fine, then it dies at decode)

This is the most-reported Z-Image-Turbo failure on RDNA3 Radeon cards. In ComfyUI Issue #11551 (a 7900 XTX, BF16, ROCm 7.1) the KSampler stage completes successfully and VRAM drops, then VAE Decode hangs ~5 s and crashes with a ROCm device error (Memobj map does not have ptr: ...). Plenty of free VRAM remained — this is a ROCm memory-mapping bug, not an out-of-memory condition. The 7800 XT (gfx1101) shares the same RDNA3 ROCm VAE path, so expect the same failure mode and the same fix.

Mitigations, in the order multiple RDNA3 users confirmed in that thread:

1. Use --use-pytorch-cross-attention (already in the launch command above). This is the confirmed fix: in the thread's own 6-test matrix the default startup and --use-split-cross-attention both crash, while --use-pytorch-cross-attention runs cleanly with the best performance. The crash is triggered by the default sub-quadratic attention backend, not by VAE precision. (These reports are from community users, not the ComfyUI maintainers, but the test matrix is reproducible and self-consistent.)
2. Add --disable-smart-memory (and, if loads stall on repeated runs, --disable-pinned-memory). This is the standard ROCm large-model memory-management workaround, also confirmed working in the thread. On a 16 GB card this one is doubly useful because offloading the 8 GB Qwen3-4B encoder between stages is what keeps the DiT within budget.
3. Keep resolution ≤ 1024 px on the longest side, as AMD's playbook recommends. If you still see a corrupted (not crashing) decode, you can move the VAE to the CPU with --cpu-vae (slower, but it sidesteps the ROCm VAE kernel; the VAE is only 0.34 GB so the penalty is modest). Note that --bf16-vae is not a confirmed fix for this crash and is contested on RDNA3 (ROCm Issue #4729 reports it can inflate decode VRAM). Reach for it only if a precision/black-image artifact appears, never as the crash fix.
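The mitigations above amount to adding flags to the launch command one at a time. A minimal sketch of that escalation follows. The flags are the ones named in this section, but the numeric "level" and the launch_command helper are hypothetical simplifications: in practice you add --disable-pinned-memory only if loads stall and --cpu-vae only if the decode is corrupted.

```python
# The attention flag is always on for this recipe.
BASE_FLAGS = ["--use-pytorch-cross-attention"]

# Added one at a time as problems persist (simplified ordering, see lead-in).
ESCALATION = ["--disable-smart-memory", "--disable-pinned-memory", "--cpu-vae"]


def launch_command(level: int = 1) -> list:
    """Build the launch argv; `level` is how many escalation flags to include."""
    level = max(0, min(level, len(ESCALATION)))
    return ["python3", "main.py", *BASE_FLAGS, *ESCALATION[:level]]


if __name__ == "__main__":
    for level in range(len(ESCALATION) + 1):
        print(" ".join(launch_command(level)))
```

Level 1 reproduces the launch command from the Running section; --bf16-vae is left out on purpose, for the reason given in the last mitigation.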

Out of memory at 16 GB

Unlike a 24 GB card, the 7800 XT has little spare VRAM once the 12.3 GB DiT is resident. If you hit a ROCm OOM (distinct from the VAE-decode mapping crash above): (a) make sure you launched with --disable-smart-memory so the Qwen3-4B encoder offloads after the text-encode pass; (b) keep the output at or below 1024 px; (c) close other GPU consumers (desktop compositor effects, a browser with hardware acceleration). If it still won't fit, switch to the GGUF Q8_0 build via the ComfyUI-GGUF custom nodes — a 7900 XTX user found GGUF Q8 ran better than BF16 and it carries a smaller resident footprint than the 12.3 GB BF16 DiT.

"Do I need a cu128 wheel / FlashAttention / xformers?"

No — those are NVIDIA-only. On the RX 7800 XT you install the ROCm PyTorch wheel (--index-url .../whl/rocm7.2), and attention runs through PyTorch SDPA via --use-pytorch-cross-attention. If a guide tells you to pip install flash-attn, install xformers, pick a cu12x index, or download an FP8/NVFP4 checkpoint "to save VRAM," it was written for the wrong vendor — ignore it on this card.

ComfyUI doesn't recognize the Z-Image nodes

The Z-Image loader nodes ship in current ComfyUI builds. Update via ComfyUI Manager → "Update ComfyUI" → restart, then re-load the template. This update path is documented in the official ComfyUI Z-Image tutorial.