Juggernaut Z on RX 7900 XTX: Cinematic Photoreal Z-Image Base Fine-Tune at BF16 via ComfyUI on ROCm

image · intermediate · 13GB+ VRAM · Jun 17, 2026

This intermediate recipe sets up Juggernaut Z on the RX 7900 XTX, needing about 13 GB of VRAM.

prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.10+
  • ~13 GB free disk for the BF16 checkpoint (12.31 GB) — or ~20.5 GB for the full Diffusers component layout (DiT 12.31 GB + Qwen3-4B text encoder 8.05 GB + VAE 0.17 GB)
  • ComfyUI installed (git clone) with PyTorch built for ROCm
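
Before downloading, it can help to confirm the target filesystem has room for the weights. The following is a minimal sketch using only the standard library; the two thresholds are the disk figures from the list above, and the helper names (free_gb, has_room) are hypothetical, not part of ComfyUI.

```python
import shutil

# Disk figures quoted in the prerequisites, in decimal gigabytes (1 GB = 10**9 bytes).
CHECKPOINT_ONLY_GB = 13.0  # single-file BF16 checkpoint, rounded up
FULL_LAYOUT_GB = 20.5      # DiT + Qwen3-4B text encoder + VAE


def free_gb(path: str = ".") -> float:
    """Return free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(required_gb: float, path: str = ".") -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"free: {free_gb():.1f} GB")
    print("checkpoint fits:", has_room(CHECKPOINT_ONLY_GB))
    print("full layout fits:", has_room(FULL_LAYOUT_GB))
```

Run it from the directory where ComfyUI will live, since the check applies to whichever filesystem holds the given path.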

What You'll Build

A local Juggernaut Z V1 text-to-image setup running in ComfyUI on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. Juggernaut Z is a photoreal fine-tune of Tongyi-MAI's Z-Image Base. Per the HF model card, it is "a fine-tune of Z-Image Base by Team Juggernaut, trained by KandooAI, and released through RunDiffusion", tuned for stronger lighting, sharper focus, refined skin texture, and a more cinematic atmosphere. With 24 GB of VRAM the BF16 DiT checkpoint (12.31 GB of weights) is never VRAM-constrained: you run the native BF16 weights with room left for the text encoder, the VAE, and a larger batch size, and no quantization is needed.

Hardware data: RX 7900 XTX (24GB VRAM) · BF16 · ComfyUI on ROCm 7.2

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so the FP8 e4m3fn checkpoint RunDiffusion also ships would just upcast to BF16 with no memory saving — and at 24 GB you don't need it anyway. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ License: CC BY-NC 4.0 (non-commercial). The HF model card sets the frontmatter license: cc-by-nc-4.0 and states "non-commercial use only. You may not use the model — or its outputs in a workflow — for commercial purposes without a license." Commercial licensing is via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error — the HF canonical card is the source of truth.

Not Z-Image Turbo. Juggernaut Z is built on Z-Image Base (not the distilled Turbo), so it uses a different step/CFG profile. The HF card lists the default as 35 steps at CFG 6 (steps range 25–45, CFG range 6–9), not the ~8-NFE / low-CFG pattern of the distilled Turbo. Use the settings below.

Requirements

Component | Minimum | Tested
GPU | 13 GB VRAM (ROCm-supported AMD card) for the BF16 DiT | RX 7900 XTX (24 GB)
RAM | 16 GB system | -
Storage | 12.31 GB (single-file BF16 DiT) or ~20.5 GB (full Diffusers layout incl. text encoder) | per HF Files tree
Driver | AMD ROCm 7.2.x on Linux | -
Software | ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+ | -

The BF16 weights are the anchor for this card: the Juggernaut-Z-Image repo file listing ships Juggernaut_Z_V1_by_RunDiffusion.safetensors at 12.31 GB on disk (12,309,866,400 bytes). The full Diffusers component layout adds the Qwen3-4B text encoder (8.05 GB across three shards) and the Z-Image VAE (0.17 GB) for ~20.5 GB resident in full BF16 — which the 24 GB 7900 XTX holds with headroom. The same repo also ships an FP8 e4m3fn variant (6.155 GB) and GGUF quantizations (Q4_K_S 4.83 GB through Q8_0 7.34 GB), but on this card the FP8 file gives no memory win (it upcasts to BF16 on RDNA3) — the BF16 path is the correct default, and GGUF is only useful if you want to free VRAM for a second model.
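
The ~20.5 GB figure is simply the sum of the three component sizes. A small worked sketch of that arithmetic follows; the sizes are the ones quoted above, the 24 GB value is the card's nominal capacity, and activations are deliberately left out, so the headroom it prints is an upper bound rather than a measurement.

```python
# Component sizes in GB, as quoted from the repo file listing in this recipe.
COMPONENTS_GB = {
    "dit_bf16": 12.31,
    "qwen3_4b_text_encoder": 8.05,
    "vae": 0.17,
}
VRAM_GB = 24.0  # nominal RX 7900 XTX capacity


def total_weights_gb(components: dict[str, float]) -> float:
    """Sum the component weights, rounded to two decimals."""
    return round(sum(components.values()), 2)


def headroom_gb(components: dict[str, float], vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading every component (weights only, no activations)."""
    return round(vram_gb - total_weights_gb(components), 2)


if __name__ == "__main__":
    print("weights:", total_weights_gb(COMPONENTS_GB), "GB")
    print("headroom before activations:", headroom_gb(COMPONENTS_GB), "GB")
```

Dropping the text encoder entry from the dictionary shows how much more room is available when it is not resident alongside the DiT.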

The model is a fine-tune of Z-Image Base, which the HF card pins as base_model: Tongyi-MAI/Z-Image. The BF16 weights are not gated on Hugging Face — no access request or login is required to download them.

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. There is also a separate experimental RDNA-3-specific wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) that the README lists for Windows+Linux RDNA3 support — on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the Juggernaut Z checkpoint

Place the single-file BF16 checkpoint in ComfyUI/models/checkpoints/. The file size is verified from the Hugging Face Files tree (12,309,866,400 bytes ≈ 12.31 GB):

# BF16 DiT checkpoint (the correct choice for the 24 GB 7900 XTX) — 12.31 GB
wget -P models/checkpoints/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion.safetensors
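
A truncated download is easy to miss on a file this large. As a minimal sketch, the byte count quoted above can be compared against the file on disk; this checks size only, not a checksum, and the helper name size_matches is hypothetical.

```python
import os

EXPECTED_BYTES = 12_309_866_400  # size quoted from the Hugging Face Files tree
DEFAULT_PATH = "models/checkpoints/Juggernaut_Z_V1_by_RunDiffusion.safetensors"


def size_matches(path: str, expected_bytes: int = EXPECTED_BYTES) -> bool:
    """True if the file exists and its size equals expected_bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


if __name__ == "__main__":
    if size_matches(DEFAULT_PATH):
        print("download complete")
    else:
        print("missing or truncated; re-run wget (add -c to resume)")
```

Run it from the ComfyUI repo root so the relative path resolves.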

ℹ️ Don't download the FP8 file on this card. The repo also ships Juggernaut_Z_V1_FP8_e4m3fn.safetensors (6.155 GB). On NVIDIA Ada/Blackwell that runs natively on FP8 tensor cores, but RDNA3 has no FP8 hardware, so on the 7900 XTX it upcasts to BF16 at load: same memory as the BF16 file, no speed gain. If you specifically want a smaller footprint to free VRAM for a second model, use a GGUF quant instead (Juggernaut_Z_V1_by_RunDiffusion_q8_0.gguf, 7.34 GB) with a GGUF-aware loader node that runs on the ROCm PyTorch build.

Running

Launch ComfyUI from the repo root with the PyTorch SDPA attention backend:

python main.py --use-pytorch-cross-attention

--use-pytorch-cross-attention forces ComfyUI's PyTorch-2.0 cross-attention path — per ComfyUI's cli_args.py the flag is documented as "Use the new pytorch 2.0 cross attention function." — which is the correct attention route on RDNA3 (it replaces the CUDA-only FlashAttention/xformers paths, which don't apply here). It is also the confirmed fix for the Z-Image-family VAE-decode crash on this card: with the default attention backend, Juggernaut Z (which shares Z-Image's VAE) samples fine and then dies at VAE Decode on gfx1100; switching to this flag resolves it. In ComfyUI Issue #11551 multiple RX 7900 XTX users confirm --use-pytorch-cross-attention fixes the crash while the default and --use-split-cross-attention startups both fail (see Troubleshooting). If you hit instability across repeated runs, add --disable-smart-memory (and, if loads stall, --disable-pinned-memory).

This starts the server (default http://127.0.0.1:8188). Open it in a browser, load a Z-Image / Juggernaut workflow (Load Checkpoint → CLIP Text Encode → KSampler → VAE Decode → Save Image), select Juggernaut_Z_V1_by_RunDiffusion.safetensors, and set the sampler to the model's defaults: 35 steps at CFG 6 per the HF card. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.
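
The same node chain can also be queued without the browser through the server's /prompt endpoint. The sketch below builds an API-format graph with the card's 35 steps / CFG 6 defaults. It assumes the stock ComfyUI node class names (CheckpointLoaderSimple, CLIPTextEncode, EmptyLatentImage, KSampler, VAEDecode, SaveImage); the sampler and scheduler names are placeholders the recipe does not specify, and a dedicated Z-Image workflow may use a different latent node.

```python
import json
import urllib.request

CHECKPOINT = "Juggernaut_Z_V1_by_RunDiffusion.safetensors"


def build_workflow(prompt: str, negative: str = "", seed: int = 0,
                   width: int = 1024, height: int = 1024,
                   steps: int = 35, cfg: float = 6.0) -> dict:
    """Build an API-format graph: checkpoint -> text encode -> KSampler -> VAE decode -> save."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": CHECKPOINT}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         # Assumed values: the recipe names no sampler or scheduler.
                         "sampler_name": "euler", "scheduler": "simple",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "juggernaut_z"}},
    }


def queue_prompt(workflow: dict, server: str = "http://127.0.0.1:8188") -> dict:
    """POST the graph to a running ComfyUI server and return its JSON reply."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{server}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the server running, queue_prompt(build_workflow("portrait, soft window light")) submits one job; the prompt text here is only an illustration.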

At 24 GB you should not need the memory-saving --use-split-cross-attention fallback (documented as "Use the split cross attention optimization. Ignored when xformers is used.") — that is for VRAM-constrained cards. Likewise, do not pass --lowvram on a 7900 XTX; per the README it forces the text encoders onto the CPU, which only slows you down when you have memory to spare.

Results

  • Speed: We found no verifiable Juggernaut Z benchmark (iterations per second or seconds per image) for the RX 7900 XTX, and our backend has no ingested measurement for this pair, so /check/juggernaut-z/rx-7900-xtx currently reports verdict: unknown. Generation time on a Z-Image-class DiT at 35 steps is dominated by memory bandwidth (the 7900 XTX has 960 GB/s), but with no first-party 7900 XTX figure to quote, we omit a number rather than extrapolate one. If you've measured Juggernaut Z it/s or seconds-per-image on a 7900 XTX, please contribute it so it lands on /check/juggernaut-z/rx-7900-xtx. You can also try PYTORCH_TUNABLEOP_ENABLED=1 (see Troubleshooting), which the ComfyUI README says "might speed things up at the cost of a very slow initial run."
  • VRAM usage: The BF16 DiT checkpoint is 12.31 GB on disk (HF Files tree). Adding the Qwen3-4B text encoder (8.05 GB) and the Z-Image VAE (0.17 GB) gives about 20.5 GB of BF16 weights; activations at 1024×1024 come on top of that figure. The weights fit within the 24 GB 7900 XTX, leaving a few GB of headroom for activations and batch size. See /check/juggernaut-z/rx-7900-xtx for any community-submitted measurement.
  • Quality notes: Per the HF card, Juggernaut Z is tuned for stronger lighting, sharper focus, and cleaner portraits relative to Z-Image Base; the card's recommended sampler is 35 steps / CFG 6. There is no quantization tradeoff to consider on this card — run the native BF16 weights. License is CC BY-NC 4.0 (non-commercial; commercial licensing via juggernaut@rundiffusion.com).

For the full benchmark data and other-GPU comparisons, see /check/juggernaut-z/rx-7900-xtx.

Troubleshooting

VAE Decode crashes on ROCm (the DiT generates fine, then it dies at decode)

This is the most-reported failure for Z-Image-family models on the RX 7900 XTX, and Juggernaut Z inherits the same VAE. In ComfyUI Issue #11551 — titled "Z-Image Turbo - VAE crash when using bf16 models on 24GB RX 7900 XTX" (gfx1100, 24 GB, BF16, ROCm) — the KSampler stage completes successfully and VRAM drops to near-idle, then VAE Decode hangs and crashes with a ROCm device error (Memobj map does not have ptr). Plenty of free VRAM remains — this is a ROCm memory-mapping bug, not an out-of-memory condition (the report is open with no upstream fix at time of writing). Workarounds, in the order multiple 7900 XTX users confirmed in that thread:

  1. Use --use-pytorch-cross-attention (already in the launch command above). This is the confirmed fix: in the issue's own test matrix the default startup and --use-split-cross-attention both crash, while --use-pytorch-cross-attention runs cleanly with the best performance. The crash is triggered by the default sub-quadratic attention backend, not by VAE precision.
  2. Add --disable-smart-memory (and, if loads stall on repeated runs, --disable-pinned-memory) — the standard ROCm large-model memory-management workaround, also confirmed working in the thread.
  3. Keep resolution ≤ 1024 px on the longest side. If you still see a corrupted (not crashing) decode, you can move the VAE to the CPU with --cpu-vae — slower, but it sidesteps the ROCm VAE kernel entirely (the VAE is only 0.17 GB, so the CPU penalty is modest). Note that --bf16-vae is not a confirmed fix for this crash and is contested on RDNA3 (ROCm Issue #4729 reports it can inflate decode VRAM) — reach for it only if a precision/black-image artifact appears, never as the crash fix.
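
The workarounds above can be treated as an escalation ladder of launch flags. The following is a hypothetical helper, not part of ComfyUI, that only assembles the command line; every flag in it is one named in this section, and the level numbering is an assumption made for illustration.

```python
import sys

# Baseline launch command from the Running section.
BASE = [sys.executable, "main.py", "--use-pytorch-cross-attention"]

# Illustrative escalation levels: each adds flags on top of the previous one.
EXTRA_FLAGS = {
    1: [],
    2: ["--disable-smart-memory"],
    3: ["--disable-smart-memory", "--disable-pinned-memory"],
    4: ["--disable-smart-memory", "--disable-pinned-memory", "--cpu-vae"],
}


def launch_command(level: int = 1) -> list[str]:
    """Return the argv list for the given escalation level (1 to 4)."""
    if level not in EXTRA_FLAGS:
        raise ValueError(f"level must be one of {sorted(EXTRA_FLAGS)}")
    return BASE + EXTRA_FLAGS[level]


if __name__ == "__main__":
    for level in sorted(EXTRA_FLAGS):
        print(level, " ".join(launch_command(level)[1:]))
```

Passing the returned list to subprocess.run with cwd set to the ComfyUI directory starts the server with those flags; move up one level only when the previous one still fails.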

If you're on native-Windows ROCm and get corrupted ("grey with colored lines") output from the VAE instead of a crash, a sibling report (ComfyUI Issue #11190, RDNA4 9070 XT) confirmed the same workflow runs cleanly under WSL2. Run under WSL2 or native Linux — the configuration this recipe targets — where the AMD PyTorch/ROCm VAE path is more mature.

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
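
The same check can be scripted. This is a minimal sketch that relies on torch.version.hip, which is None on CUDA builds and a version string on ROCm builds; the helper name looks_like_rocm is hypothetical, and the "+rocm" substring test is a heuristic based on the suffix described above.

```python
def looks_like_rocm(version: str, hip_version) -> bool:
    """Heuristic: ROCm wheels carry a '+rocm' version tag and a non-None HIP version."""
    return "+rocm" in version and hip_version is not None


if __name__ == "__main__":
    import torch  # imported here so the helper stays importable without PyTorch

    print("torch:", torch.__version__)
    print("hip:", torch.version.hip)
    print("device visible:", torch.cuda.is_available())
    print("ROCm build:", looks_like_rocm(torch.__version__, torch.version.hip))
```

If the last line prints False, the CUDA or CPU wheel is still installed and the reinstall step needs repeating.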

Generation feels slow on the first run — enable TunableOp

Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches the tuned kernels for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py --use-pytorch-cross-attention
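
When ComfyUI is started from a script rather than a shell, the variable can be set on the child process only. A minimal sketch, assuming it is run from the ComfyUI repo root; the function names are hypothetical.

```python
import os
import subprocess
import sys


def tunableop_env(base=None) -> dict:
    """Return a copy of the environment with TunableOp switched on."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    return env


def launch(comfy_dir: str = ".") -> subprocess.Popen:
    """Start ComfyUI with the SDPA attention flag and TunableOp enabled."""
    cmd = [sys.executable, "main.py", "--use-pytorch-cross-attention"]
    return subprocess.Popen(cmd, cwd=comfy_dir, env=tunableop_env())
```

Copying the environment leaves the parent shell untouched, so a later plain launch runs without tuning.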

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI already routes attention through PyTorch SDPA on this stack. Stick with the default, or force it explicitly with --use-pytorch-cross-attention.

1024×1024 outputs look noisy or grainy

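The Juggernaut Z author flags this in the Civitai release notes for some prompts: try a portrait/landscape aspect (e.g. 960×1440 / 1440×960) instead of square, and stay within the HF card's recommended 35 steps / CFG 6 profile rather than pushing steps far below the 25–45 range. These sizes exceed the 1024 px longest-side limit suggested for the VAE Decode crash above, so drop back to 1024 px if decode starts failing.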

common questions
How much VRAM does Juggernaut Z need?

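About 13 GB for the BF16 DiT alone (12.31 GB of weights), which is the minimum this recipe targets. The full BF16 stack with the text encoder and VAE comes to about 20.5 GB of weights.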

Which GPUs is Juggernaut Z tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Intermediate — follow the steps above.