Z-Image Turbo on RX 7900 XTX: BF16 8-Step Text-to-Image via ComfyUI on ROCm

image · intermediate · 16 GB+ VRAM · Jun 17, 2026

This intermediate recipe sets up Z-Image Turbo on the RX 7900 XTX. The model needs about 16 GB of VRAM.

Prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or an equivalent ROCm-supported AMD card
  • Linux (Ubuntu 24.04 / 22.04) with the AMD ROCm 7.x stack installed — ROCm is the AMD equivalent of CUDA, not an add-on to it
  • Python 3.10+
  • ~21 GB free disk for the BF16 weight set (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.34 GB)
  • ComfyUI (latest, with native ROCm support)

What You'll Build

A local install of Z-Image-Turbo — Alibaba Tongyi-MAI's 6B-parameter distilled image-generation model — producing 1024×1024 text-to-image in 8 steps on a Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack and ComfyUI. With 24 GB of VRAM this card sits above the model's headline 16 GB tier, so the canonical BF16 weights run directly — no quantization, no GGUF, no text-encoder workarounds. AMD ships an official ComfyUI + Z-Image-Turbo playbook for Radeon hardware, so this is a first-party-supported path, not a community port.

Hardware data: RX 7900 XTX (24GB VRAM) · BF16 · 8 NFEs at 1024×1024 · ROCm 7.x · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers, and no FP8/FP4 path here. RDNA3's WMMA units have no FP8 or FP4 hardware (AMD GPUOpen, "WMMA on RDNA3"), so an FP8 or NVFP4 checkpoint would just upcast to BF16 at load — no memory saving, no speedup. The attention path is PyTorch SDPA, selected with ComfyUI's --use-pytorch-cross-attention flag exactly as AMD's own playbook instructs. The full BF16 build is the lead path because the 24 GB card has room for it.

ℹ️ Note on variants. The Tongyi-MAI Z-Image family currently ships four variants: Z-Image-Turbo (the distilled 8-step consumer model this recipe targets), Z-Image (the 50-step foundation model), Z-Image-Omni-Base, and Z-Image-Edit. All four share a 6B "Scalable Single-Stream DiT" (S3-DiT) backbone. Fine-tunes such as Juggernaut-Z (RunDiffusion) are separate models with their own recipes; don't conflate them.

Requirements

Component | Minimum | Tested
GPU | 16 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB, RDNA3, gfx1100)
RAM | 16 GB system RAM |
Storage | ~21 GB on disk (DiT 12.3 GB + Qwen3-4B text encoder 8.0 GB + VAE 0.34 GB, per the Comfy-Org split-file mirror) |
Driver / runtime | AMD ROCm 7.x on Linux | ROCm 7.2 PyTorch wheels
Software | ComfyUI (native ROCm support) | ComfyUI with --use-pytorch-cross-attention

Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" per the official Tongyi-MAI model card; the 24 GB RX 7900 XTX clears that bar with ~8 GB of headroom for activations and a second resident model. The model is released under Apache-2.0 (per the model card frontmatter) — commercial use permitted — and the weights are not gated, so no Hugging Face login is required.

We keep min_vram_gb: 16 to match the documented BF16 path's headline tier (the same envelope AMD's playbook and the model card both quote). The 24 GB card sits comfortably above that tier, so VRAM capacity is not the binding constraint here.
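
The figures quoted in this section come from two small sums, sketched below. This is only a worked-arithmetic example: the sizes are the on-disk numbers listed above, the function names are ours, and on-disk size is not the same thing as peak VRAM during a run.

```python
# On-disk sizes in GB, as quoted in the Requirements table above.
WEIGHTS_GB = {
    "DiT (z_image_turbo_bf16)": 12.3,
    "Qwen3-4B text encoder": 8.0,
    "VAE": 0.34,
}


def total_weights_gb() -> float:
    """Sum of the three BF16 files on disk."""
    return sum(WEIGHTS_GB.values())


def headroom_gb(card_vram_gb: float, tier_gb: float = 16.0) -> float:
    """VRAM left over above the model card's headline tier."""
    return card_vram_gb - tier_gb


if __name__ == "__main__":
    print(f"weights on disk: ~{total_weights_gb():.1f} GB")
    print(f"headroom above the 16 GB tier on a 24 GB card: {headroom_gb(24.0):.0f} GB")
```

Rounding the first sum up gives the ~21 GB free-disk figure in the prerequisites.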

Installation

1. Install PyTorch with ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU on Linux. Install the ROCm PyTorch wheels — per the ComfyUI README "AMD GPUs (ROCm)" section, the current stable channel is rocm7.2:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

The rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the live "AMD GPUs (ROCm)" block in the ComfyUI README at install time and use whatever tag it shows. AMD also publishes Radeon-specific wheels (repo.radeon.com) and a step-by-step ROCm-on-Radeon ComfyUI install guide pinned to ROCm 7.2.1 — either wheel source works on gfx1100.
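
Before moving on, it can help to confirm that the wheel you installed is a ROCm build and that it sees the card. The following is a minimal sketch, assuming the standard torch.version.hip attribute (a version string on ROCm builds, None otherwise) and that ROCm devices are exposed through the torch.cuda API. The function name describe_torch_backend is ours, not part of any tool.

```python
import importlib.util


def describe_torch_backend() -> dict:
    """Report whether the installed PyTorch build targets ROCm and sees a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the check still runs without torch

    hip_version = getattr(torch.version, "hip", None)
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "hip_version": hip_version,
        "is_rocm_build": hip_version is not None,
        # ROCm builds reuse the torch.cuda namespace for AMD devices.
        "gpu_visible": torch.cuda.is_available(),
    }
    if info["gpu_visible"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_backend().items():
        print(f"{key}: {value}")
```

If is_rocm_build comes back False, the most likely cause is that a default or CUDA wheel was installed instead of one from the ROCm index.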

2. Install ComfyUI

Per the AMD ROCm-on-Radeon ComfyUI guide:

git clone https://github.com/comfyanonymous/ComfyUI.git && cd ComfyUI
pip install -r requirements.txt

3. Download the Z-Image-Turbo BF16 weights

Place the three Comfy-Org-packaged BF16 files into the standard ComfyUI model folders. These are the exact filenames and URLs from the official ComfyUI Z-Image-Turbo tutorial (and the same files referenced by AMD's playbook), with on-disk sizes verified against the Comfy-Org HF tree:

# from your ComfyUI root
cd models/diffusion_models
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors  # 12.3 GB

cd ../text_encoders
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/text_encoders/qwen_3_4b.safetensors            # 8.04 GB

cd ../vae
wget https://huggingface.co/Comfy-Org/z_image_turbo/resolve/main/split_files/vae/ae.safetensors                            # 0.34 GB

Z-Image-Turbo pairs the 6B DiT with a Qwen3-4B text encoder (the qwen_3_4b.safetensors above, ~8 GB at BF16). On an 8 GB card this encoder is what makes the BF16 path infeasible; on the 24 GB RX 7900 XTX the whole stack — DiT + encoder + VAE ≈ 20.6 GB on disk — loads in full BF16 with room to spare. The Comfy-Org repo also publishes nvfp4 and fp8_mixed files, but ignore them on this card: RDNA3 has no FP4/FP8 hardware, so they would upcast to BF16 with no benefit — the plain _bf16 files are the correct choice.
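
A truncated download is easy to miss with files this large, so a quick presence-and-size check can save a confusing load error later. The sketch below assumes the folder layout and filenames from step 3 and the approximate sizes quoted there. The 90% threshold and the check_weights name are our own choices, not anything ComfyUI defines.

```python
from pathlib import Path

# Relative path under the ComfyUI root -> approximate size in GB (from step 3).
EXPECTED_GB = {
    "models/diffusion_models/z_image_turbo_bf16.safetensors": 12.3,
    "models/text_encoders/qwen_3_4b.safetensors": 8.04,
    "models/vae/ae.safetensors": 0.34,
}


def check_weights(comfy_root, min_fraction: float = 0.9) -> list[str]:
    """Return a list of problems; an empty list means all three files look plausible."""
    problems = []
    for rel_path, expected_gb in EXPECTED_GB.items():
        path = Path(comfy_root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        size_gb = path.stat().st_size / 1e9
        # Assumption: anything well below the quoted size is an incomplete download.
        if size_gb < expected_gb * min_fraction:
            problems.append(
                f"too small ({size_gb:.2f} GB, expected ~{expected_gb} GB): {rel_path}"
            )
    return problems


if __name__ == "__main__":
    issues = check_weights(".")  # run from the ComfyUI root
    print("\n".join(issues) if issues else "all three weight files present")
```

This only checks size, not integrity. If a file passes here but still fails to load, download it again.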

4. Load the official workflow

Update ComfyUI to the latest version (the Z-Image nodes ship in current builds), then load the bundled Z-Image-Turbo workflow template — in ComfyUI, Workflow → Browse Templates → Image → Z-Image-Turbo, or drag the template JSON in. The ComfyUI tutorial covers this drag-and-drop step; if ComfyUI prompts that models are missing, the in-app download buttons fetch the same three files from step 3.

Running

Launch ComfyUI with the PyTorch SDPA attention backend — this is the attention path AMD's own playbook specifies for ROCm (it replaces the CUDA-only FlashAttention/xformers paths, which don't apply on RDNA3). This flag is also the confirmed fix for the Z-Image-Turbo VAE-decode crash on this card: with the default attention backend the DiT samples fine and then dies at VAE Decode on gfx1100, and --use-pytorch-cross-attention resolves it (see Troubleshooting). If you hit instability across repeated runs, add --disable-smart-memory (and, if loads stall, --disable-pinned-memory):

python3 main.py --use-pytorch-cross-attention

Open http://127.0.0.1:8188, edit the prompt node in the loaded Z-Image-Turbo template, and press Queue Prompt. Per AMD's playbook and the model card, the Turbo workflow is distilled for 8 NFEs (the playbook recommends a 4–10 step range), very low guidance (CFG ≈ 1.0–2.0; the model card lists CFG as off for Turbo), the euler or res_multistep sampler, and resolution kept at or below 1024 px on the longest side. The preconfigured template already encodes these — you usually just change the prompt.
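
If you do change the template's settings, the recommended envelope above can be written down as a small check. This is a sketch only: TurboSettings and check_settings are hypothetical names, the fields are not ComfyUI node inputs, and the ranges are the ones quoted from AMD's playbook and the model card.

```python
from dataclasses import dataclass

ALLOWED_SAMPLERS = {"euler", "res_multistep"}


@dataclass
class TurboSettings:
    steps: int = 8
    cfg: float = 1.0
    sampler: str = "euler"
    width: int = 1024
    height: int = 1024


def check_settings(s: TurboSettings) -> list[str]:
    """Return warnings for values outside the recommended Turbo envelope."""
    warnings = []
    if not 4 <= s.steps <= 10:
        warnings.append(f"steps={s.steps}: recommended range is 4-10 (distilled for 8)")
    if not 1.0 <= s.cfg <= 2.0:
        warnings.append(f"cfg={s.cfg}: keep guidance around 1.0-2.0")
    if s.sampler not in ALLOWED_SAMPLERS:
        warnings.append(f"sampler={s.sampler!r}: use euler or res_multistep")
    if max(s.width, s.height) > 1024:
        warnings.append(f"{s.width}x{s.height}: keep the longest side at or below 1024 px")
    return warnings


if __name__ == "__main__":
    for line in check_settings(TurboSettings(steps=30, cfg=7.0)) or ["settings look fine"]:
        print(line)
```

The defaults mirror the preconfigured template, so an unmodified TurboSettings() produces no warnings.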

The first run loads the BF16 weights into VRAM; the generated image lands in ComfyUI/output/.

Results

  • Speed: Omitted — no RX-7900-XTX-named Z-Image-Turbo benchmark exists yet. The backend has no ingested run for this pair (/check/z-image-turbo/rx-7900-xtx returns verdict: unknown). The figures that do exist are not transferable: the model card's "sub-second on enterprise-grade H800 GPUs" is a datacenter card; AMD's release-day SDXL/FLUX ROCm speedup numbers (2.6× SDXL, 5.2× FLUX.1-schnell vs ROCm 6.4) were measured on Ryzen AI Max+ / Radeon AI PRO parts, not the 7900 XTX, and are relative speedups, not absolute timings. We deliberately do not relabel a different card's or model's number. If you run Z-Image-Turbo on a 7900 XTX, please submit your timing so a measured figure can land here.
  • VRAM usage: The model "fits comfortably within 16G VRAM consumer devices" per the official model card; the full BF16 weight set is ~20.6 GB on disk and the 24 GB RX 7900 XTX holds it resident with headroom. In a real on-card run (Issue #11551) the DiT sampling stage stayed well within budget — VRAM was not the failure mode (see Troubleshooting). Live measurements: /check/z-image-turbo/rx-7900-xtx.
  • Quality notes: Architecture is "Scalable Single-Stream DiT (S3-DiT)" with text, visual-semantic, and VAE tokens concatenated into one unified input stream, optimized for 8-NFE generation while matching or exceeding leading competitors per the model card. Turbo is distilled for low step counts — pushing steps far above ~10 or raising CFG well past 2.0 degrades rather than improves output.
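
If you plan to submit a timing for the Speed entry, it helps to report it in a consistent way. The sketch below shows one possible summary, assuming you have recorded wall-clock seconds per image yourself. It drops the first run, which includes loading the weights, and reports the median of the rest. The function name and the output fields are ours, not a required submission format.

```python
from statistics import median


def summarize_timings(seconds_per_image, steps: int = 8, warmup: int = 1) -> dict:
    """Summarize wall-clock timings, skipping the warm-up runs at the start."""
    measured = list(seconds_per_image)[warmup:]
    if not measured:
        raise ValueError("need at least one run after the warm-up runs")
    med = median(measured)
    return {
        "runs": len(measured),
        "median_s_per_image": med,
        # Wall-clock based, so this also includes text encoding and VAE decode.
        "steps_per_s": steps / med,
    }
```

Reporting the resolution, step count, ROCm version, and launch flags alongside the numbers makes the result comparable with other runs.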

For the full benchmark data, see /check/z-image-turbo/rx-7900-xtx.

Troubleshooting

VAE Decode crashes on ROCm (the DiT generates fine, then it dies at decode)

This is the most-reported Z-Image-Turbo failure on the RX 7900 XTX. In ComfyUI Issue #11551 (gfx1100, 24 GB, BF16, ROCm 7.1) the KSampler stage completes successfully and VRAM drops to ~1.3 GB, then VAE Decode hangs ~5 s and crashes with a ROCm device error (Memobj map does not have ptr: ...). Plenty of free VRAM remained — this is a ROCm memory-mapping bug, not an out-of-memory condition (the report is open with no upstream fix at time of writing).
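
To tell this crash apart from a genuine out-of-memory failure, you can look for the error string in the ComfyUI console output. The following is a minimal sketch: the first signature is the message quoted from the issue, while matching "out of memory" for the OOM case is our assumption about how such errors are typically worded. classify_failure is a hypothetical helper.

```python
# Error text quoted in the issue report for the VAE Decode crash.
MEMMAP_SIGNATURE = "Memobj map does not have ptr"


def classify_failure(log_text: str) -> str:
    """Roughly classify a crash log as the ROCm mapping bug, an OOM, or unknown."""
    if MEMMAP_SIGNATURE in log_text:
        return "rocm-memory-mapping"
    # Assumption: real out-of-memory errors mention "out of memory" somewhere.
    if "out of memory" in log_text.lower():
        return "out-of-memory"
    return "unknown"
```

A "rocm-memory-mapping" result points at the attention-backend mitigations in this section. An "out-of-memory" result means something else is going on, such as another process holding VRAM.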

Mitigations, in the order multiple 7900 XTX users confirmed in that thread:

  1. Use --use-pytorch-cross-attention (already in the launch command above). This is the confirmed fix: in the issue's own test matrix the default startup and --use-split-cross-attention both crash, while --use-pytorch-cross-attention runs cleanly with the best performance. The crash is triggered by the default sub-quadratic attention backend, not by VAE precision.
  2. Add --disable-smart-memory (and, if loads stall on repeated runs, --disable-pinned-memory) — the standard ROCm large-model memory-management workaround, also confirmed working in the thread.
  3. Keep resolution ≤ 1024 px on the longest side, as AMD's playbook recommends. If you still see a corrupted (not crashing) decode, you can move the VAE to the CPU with --cpu-vae (slower, but it sidesteps the ROCm VAE kernel; the VAE is only 0.34 GB so the penalty is modest). Note that --bf16-vae is not a confirmed fix for this crash and is contested on RDNA3 (ROCm Issue #4729 reports it can inflate decode VRAM) — reach for it only if a precision/black-image artifact appears, never as the crash fix.
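
The mitigations above amount to a short ladder of launch commands to try in order. The sketch below just prints them. It uses only the flags named in this recipe. The way the flags are combined at each rung is our reading of the list, and --bf16-vae is deliberately left out because it is not a confirmed fix.

```python
import shlex

# Every rung keeps the PyTorch SDPA attention flag from the Running section.
BASE = ["python3", "main.py", "--use-pytorch-cross-attention"]

# (when to try it, extra flags), in the order the mitigations are listed.
LADDER = [
    ("baseline", []),
    ("instability across repeated runs", ["--disable-smart-memory"]),
    ("loads stall on repeated runs", ["--disable-smart-memory", "--disable-pinned-memory"]),
    ("decode is corrupted but not crashing", ["--cpu-vae"]),
]


def launch_commands() -> list[tuple[str, str]]:
    """Return (reason, shell command) pairs, mildest workaround first."""
    return [(reason, shlex.join(BASE + extra)) for reason, extra in LADDER]


if __name__ == "__main__":
    for reason, command in launch_commands():
        print(f"{reason}:\n  {command}")
```

Move to the next rung only after the previous one has failed, so you know which flag made the difference.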

Grey / corrupted output instead of an image

A sibling report on a different AMD card (Issue #11190, RDNA4 9070 XT, native Windows ROCm) produced "grey with colored lines" output from the VAE — and the reporter confirmed the same workflow runs cleanly under WSL2. If you're on native-Windows ROCm and hit corruption, run under WSL2 (or native Linux, the configuration this recipe targets) where the AMD PyTorch/ROCm VAE path is more mature. This is a Windows-ROCm-stack issue, not a ComfyUI or model issue.

"Do I need a cu128 wheel / FlashAttention / xformers?"

No — those are NVIDIA-only. On the RX 7900 XTX you install the ROCm PyTorch wheel (--index-url .../whl/rocm7.2), and attention runs through PyTorch SDPA via --use-pytorch-cross-attention. If a guide tells you to pip install flash-attn, install xformers, pick a cu12x index, or download an FP8/NVFP4 checkpoint "to save VRAM," it was written for the wrong vendor — ignore it on this card.
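
As a quick sanity check on install commands copied from other guides, the rule above can be expressed as a few pattern matches. This is an illustrative sketch, not an exhaustive detector: the patterns and the nvidia_only_hints name are our own, and they cover only the three signs mentioned here.

```python
import re

# Signs that an install command was written for NVIDIA hardware.
NVIDIA_ONLY_PATTERNS = {
    "CUDA wheel index (cu12x)": re.compile(r"/whl/cu\d+"),
    "flash-attn package": re.compile(r"\bflash[-_]attn\b"),
    "xformers package": re.compile(r"\bxformers\b"),
}


def nvidia_only_hints(command: str) -> list[str]:
    """Return the NVIDIA-only markers found in a shell command string."""
    return [
        label
        for label, pattern in NVIDIA_ONLY_PATTERNS.items()
        if pattern.search(command)
    ]


if __name__ == "__main__":
    cmd = "pip install torch --index-url https://download.pytorch.org/whl/rocm7.2"
    print(nvidia_only_hints(cmd) or "no NVIDIA-only markers found")
```

An empty result does not prove a command is right for ROCm. It only means none of these three markers appear.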

ComfyUI doesn't recognize the Z-Image nodes

The Z-Image loader nodes ship in current ComfyUI builds. Update via ComfyUI Manager → "Update ComfyUI" → restart, then reload the template. This path is documented in the official ComfyUI Z-Image tutorial.

Common Questions

How much VRAM does Z-Image Turbo need?

About 16 GB — the minimum this recipe targets.

Which GPUs is Z-Image Turbo tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Intermediate — follow the steps above.