How much VRAM does Juggernaut Z need?

About 13 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the installation steps below.

Juggernaut Z on RX 7800 XT: Cinematic Photoreal Z-Image Base Fine-Tune at BF16 via ComfyUI on ROCm

What You'll Build

A local Juggernaut Z V1 text-to-image setup running in ComfyUI on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. Juggernaut Z is a photoreal fine-tune of Tongyi-MAI's Z-Image Base; per the HF model card, it is "a fine-tune of Z-Image Base by Team Juggernaut, trained by KandooAI, and released through RunDiffusion", tuned for stronger lighting, sharper focus, refined skin texture, and a more cinematic atmosphere. On a 16 GB card the BF16 DiT checkpoint (12.31 GB of weights) is the right default, but the fit is tight: with the VAE and activations on top, and the text encoder streamed off to system RAM, you are running near the 16 GB ceiling. This recipe leads with the single-file BF16 ComfyUI path (the one that fits) and documents the in-repo GGUF quants as the escape valve when you need headroom.

Hardware data: RX 7800 XT (16GB VRAM) · BF16 (tight) · ComfyUI on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no xformers install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so the FP8 e4m3fn checkpoint RunDiffusion also ships would just upcast to BF16 at load — same memory as the BF16 file, no memory win and no speed gain on this card. The attention path is PyTorch SDPA (ComfyUI's default; the explicit flag is --use-pytorch-cross-attention), not FlashAttention-2 and not xformers. If a guide tells you to pip install xformers or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ 16 GB is a tight fit — mind the headroom. The 12.31 GB single-file BF16 DiT plus the Z-Image VAE plus 1024×1024 activations lands close to the 16 GB ceiling. In ComfyUI's single-file workflow the Qwen3-4B text encoder is loaded for CLIP encode and then streamed back to system RAM before the sampler runs, so the resident peak is dominated by the DiT, not the full ~20.5 GB Diffusers stack — but you have little slack for a large batch size or oversized resolution. If you hit an out-of-memory error, drop a GGUF quant in instead (see "Download the Juggernaut Z checkpoint" and Troubleshooting). The full DiffusionPipeline.from_pretrained Diffusers layout — which keeps the DiT and the 8.05 GB text encoder resident at once (~20.5 GB) — does not fit 16 GB; that is the 24 GB path, not this one.

⚠️ License: CC BY-NC 4.0 (non-commercial). The HF model card sets the frontmatter license: cc-by-nc-4.0 and states "non-commercial use only. You may not use the model — or its outputs in a workflow — for commercial purposes without a license." Commercial licensing is via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error — the HF canonical card is the source of truth.

Not Z-Image Turbo. Juggernaut Z is built on Z-Image Base (not the distilled Turbo), so it uses a different step/CFG profile. The HF card lists the default as 35 steps at CFG 6 (steps range 25–45, CFG range 6–9), not the ~8-NFE / low-CFG pattern of the distilled Turbo. Use the settings below.

Requirements

Component	Minimum	Tested
GPU	13 GB VRAM (ROCm-supported AMD card) for the single-file BF16 DiT	RX 7800 XT (16 GB)
RAM	16 GB system (32 GB recommended — the text encoder streams to RAM)	—
Storage	12.31 GB (single-file BF16 DiT) or ~20.5 GB (full Diffusers layout incl. text encoder)	per HF Files tree
Driver	AMD ROCm 7.2.x on Linux	—
Software	ComfyUI + PyTorch (ROCm 7.2 build), Python 3.10+	—

The BF16 weights are the anchor for this card: the Juggernaut-Z-Image repo file listing ships Juggernaut_Z_V1_by_RunDiffusion.safetensors at 12.31 GB on disk (12,309,866,400 bytes). The full Diffusers component layout adds the Qwen3-4B text encoder (8.05 GB across three shards) and the Z-Image VAE (0.17 GB) for ~20.5 GB resident in full BF16 — which exceeds the 16 GB 7800 XT, so this recipe uses the single-file ComfyUI path where the encoder is loaded and freed around CLIP encode rather than held resident alongside the DiT. The same repo also ships an FP8 e4m3fn variant (6.155 GB) and GGUF quantizations (Q4_K_S 4.83 GB, Q4_K_M 5.15 GB, Q5_K_S 5.34 GB, Q5_K_M 5.68 GB, Q6_K 6.05 GB, Q8_0 7.34 GB) per the card's Files table. On this card the FP8 file gives no memory win (it upcasts to BF16 on RDNA3), so the choice is BF16 (default, tight) vs GGUF (the headroom path, loaded via the llama.cpp-HIP GGUF route that works on ROCm).
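The fit argument above is simple addition, and a minimal sketch makes it easy to redo with your own numbers. The sizes are the ones quoted in this recipe; treating the card as 16.0 decimal GB and ignoring activations are simplifying assumptions, and the function name is illustrative only.

```python
# Sizes in decimal GB, copied from the file listing quoted in this recipe.
DIT_BF16_GB = 12_309_866_400 / 1e9   # single-file BF16 DiT
TEXT_ENCODER_GB = 8.05               # Qwen3-4B, three shards
VAE_GB = 0.17                        # Z-Image VAE
CARD_GB = 16.0                       # RX 7800 XT, nominal (assumption: decimal GB)


def resident_weights_gb(keep_encoder_resident: bool) -> float:
    """Weights held in VRAM during sampling, ignoring activations."""
    total = DIT_BF16_GB + VAE_GB
    if keep_encoder_resident:
        total += TEXT_ENCODER_GB
    return total


single_file = resident_weights_gb(keep_encoder_resident=False)
full_stack = resident_weights_gb(keep_encoder_resident=True)

print(f"single-file path: {single_file:.2f} GB, slack {CARD_GB - single_file:.2f} GB")
print(f"full Diffusers:   {full_stack:.2f} GB, slack {CARD_GB - full_stack:.2f} GB")
```

Under these assumptions the single-file path leaves about 3.5 GB for activations and VAE decode, while the full stack is over budget before any activations are counted.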

The model is a fine-tune of Z-Image Base, which the HF card pins as base_model: Tongyi-MAI/Z-Image. The BF16 weights are not gated on Hugging Face — no access request or login is required to download them.

Installation

1. Install ComfyUI

Per the ComfyUI README, clone the repo:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

2. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the ComfyUI README "AMD GPUs (Linux)" section, the stable install command is:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the ComfyUI README pins rocm7.2 as the stable wheel — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live ComfyUI README before running. A nightly variant (https://download.pytorch.org/whl/nightly/rocm7.2) "might have some performance improvements" per the README. The README also lists a separate experimental RDNA-3 wheel index (https://rocm.nightlies.amd.com/v2/gfx110X-all/, covering "RDNA 3, 3.5 and 4") for Windows+Linux RDNA3 support — the 7800 XT is RDNA3, but on officially-supported Linux you do not need it; the stable whl/rocm7.2 wheel above is the canonical path.

ℹ️ gfx1101, not gfx1100, and you should not need HSA_OVERRIDE. The 7800 XT's ROCm arch target is gfx1101 (Navi 32), distinct from the 7900 XTX's gfx1100 (Navi 31) and from the RX 7600's gfx1102 (Navi 33). gfx1101 is an officially supported ROCm target on Linux, so the stable wheel above ships kernels for it directly. HSA_OVERRIDE_GFX_VERSION=11.0.0 (masquerade as gfx1100) is a legacy fallback for libraries that ship only gfx1100 kernels; the ComfyUI README lists it under "For AMD cards not officially supported by ROCm", in the entry "For AMD 7600 and maybe other RDNA3 cards". You should not need it on a current ROCm 7.2 install for the 7800 XT; reach for it only if a specific library fails to find gfx1101 kernels.
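To check which arch target your system actually reports, one option is to scan diagnostic text for gfx names. The sketch below is a hypothetical helper, not part of ComfyUI or ROCm; it assumes you feed it text captured from a tool such as rocminfo, and the card labels are just the three targets named in this recipe.

```python
import re

# Arch targets named in this recipe; anything else is outside its scope.
KNOWN_RDNA3 = {
    "gfx1100": "Navi 31 (e.g. RX 7900 XTX)",
    "gfx1101": "Navi 32 (e.g. RX 7800 XT)",
    "gfx1102": "Navi 33 (e.g. RX 7600)",
}


def find_gfx_targets(text: str) -> list[str]:
    """Return the distinct gfx targets mentioned in text, in first-seen order."""
    seen: list[str] = []
    for match in re.findall(r"\bgfx[0-9a-f]{3,4}\b", text):
        if match not in seen:
            seen.append(match)
    return seen


def describe(target: str) -> str:
    """Label a target using the table above, or flag it as not covered here."""
    return KNOWN_RDNA3.get(target, "not covered by this recipe")
```

If the only target found is gfx1101, the stable wheel should apply without any override; a different target means this recipe's assumptions need rechecking.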

3. Install ComfyUI dependencies

Per the ComfyUI README "Dependencies" section:

pip install -r requirements.txt

4. Download the Juggernaut Z checkpoint

Place the single-file BF16 checkpoint in ComfyUI/models/checkpoints/. The file size is verified from the Hugging Face Files tree (12,309,866,400 bytes ≈ 12.31 GB):

# BF16 DiT checkpoint (the default for the 16 GB 7800 XT — tight but fits) — 12.31 GB
wget -P models/checkpoints/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion.safetensors
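A 12 GB download can be silently truncated, so it may be worth comparing the file size against the byte count quoted above before launching ComfyUI. This is a minimal sketch; the helper name is hypothetical, and a size match does not prove the content is intact (a checksum would).

```python
import os

EXPECTED_BYTES = 12_309_866_400  # BF16 checkpoint size quoted from the HF Files tree


def size_matches(path: str, expected_bytes: int = EXPECTED_BYTES) -> bool:
    """True when the file exists and its size equals expected_bytes exactly."""
    try:
        return os.path.getsize(path) == expected_bytes
    except OSError:
        # Missing or unreadable file counts as a mismatch.
        return False
```

For example, `size_matches("models/checkpoints/Juggernaut_Z_V1_by_RunDiffusion.safetensors")` run from the ComfyUI repo root should return True once the download has completed.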

ℹ️ Tight on 16 GB? Use a GGUF quant instead of FP8. If the BF16 path OOMs on your setup (a large batch, a resolution above 1024 px, or a heavy workflow), do not reach for the FP8 file — on RDNA3 it upcasts to BF16 at load (no memory saving, same ~12.3 GB). Instead use one of the in-repo GGUF quants with a GGUF-aware loader node, which loads via the llama.cpp-HIP path that works on ROCm. The card's Files table lists Juggernaut_Z_V1_by_RunDiffusion_q8_0.gguf (7.34 GB, highest-quality quant) down to Juggernaut_Z_V1_by_RunDiffusion_q4_k_s-001.gguf (4.83 GB, smallest footprint). The smaller on-disk weight frees several GB of headroom for activations and VAE decode:

# Optional headroom path: Q8_0 GGUF (7.34 GB) for a smaller resident footprint than BF16
# Assumption: you load it with the ComfyUI-GGUF custom node, which looks for .gguf files in models/unet/
wget -P models/unet/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion_q8_0.gguf

Running

Launch ComfyUI from the repo root with the PyTorch SDPA attention backend:

python main.py --use-pytorch-cross-attention

--use-pytorch-cross-attention forces ComfyUI's PyTorch 2.0 cross-attention path; ComfyUI's cli_args.py documents the flag as "Use the new pytorch 2.0 cross attention function." This is the correct attention route on RDNA3, replacing the CUDA-only FlashAttention/xformers paths, which don't apply here. It is also the workaround for the Z-Image-family VAE-decode crash on RDNA3/ROCm: with the default sub-quadratic attention backend, Z-Image-derived models (Juggernaut Z shares Z-Image's VAE) can sample fine and then die at VAE Decode. In ComfyUI Issue #11551, filed against a 24 GB RX 7900 XTX (gfx1100), a ComfyUI contributor running a 7900 XT suggests "Try adding --use-pytorch-cross-attention --disable-smart-memory" as the fix (see Troubleshooting). The 7800 XT uses the same RDNA3 ROCm attention path, so the same flag applies, though nothing in the issue thread was measured on a 7800 XT. If you hit instability across repeated runs, add --disable-smart-memory (and, if loads stall, --disable-pinned-memory).

This starts the server (default http://127.0.0.1:8188). Open it in a browser, load a Z-Image / Juggernaut workflow (Load Checkpoint → CLIP Text Encode → KSampler → VAE Decode → Save Image), select Juggernaut_Z_V1_by_RunDiffusion.safetensors, and set the sampler to the model's defaults: 35 steps at CFG 6 per the HF card. Generated PNGs land in ComfyUI/output/ with the full workflow embedded.
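Because Juggernaut Z uses a Base-style profile rather than Turbo's, it can help to sanity-check sampler values before queueing a long run. The sketch below encodes the default and ranges quoted from the HF card in this recipe; the class and method names are illustrative and are not part of ComfyUI.

```python
from dataclasses import dataclass

# Ranges as quoted from the HF card in this recipe.
STEPS_RANGE = (25, 45)
CFG_RANGE = (6.0, 9.0)


@dataclass(frozen=True)
class SamplerSettings:
    steps: int = 35   # card default
    cfg: float = 6.0  # card default

    def problems(self) -> list[str]:
        """List reasons the settings fall outside the card's quoted ranges."""
        issues = []
        if not STEPS_RANGE[0] <= self.steps <= STEPS_RANGE[1]:
            issues.append(f"steps={self.steps} outside {STEPS_RANGE[0]}-{STEPS_RANGE[1]}")
        if not CFG_RANGE[0] <= self.cfg <= CFG_RANGE[1]:
            issues.append(f"cfg={self.cfg} outside {CFG_RANGE[0]}-{CFG_RANGE[1]}")
        return issues
```

A Turbo-style configuration such as `SamplerSettings(steps=8, cfg=1.0)` reports two problems, which is the mistake the "Not Z-Image Turbo" note warns about.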

If BF16 is over budget and you have not switched to a GGUF quant, ComfyUI also offers the memory-saving --use-split-cross-attention fallback (documented as "Use the split cross attention optimization. Ignored when xformers is used."). On RDNA3, though, --use-pytorch-cross-attention is both the more reliable attention path and the VAE-crash fix, so prefer the GGUF-quant route for memory relief and keep PyTorch cross-attention. Passing --lowvram makes the text encoders run on the CPU per ComfyUI's cli_args.py; the single-file workflow already frees the encoder after CLIP encode, so reach for --lowvram only if you still OOM after trying a GGUF quant.
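The flag advice in this section can be captured in a small launcher helper so the escalation order stays consistent between runs. This is a sketch under the assumption that you start ComfyUI from the repo root; the function is hypothetical and only assembles flags that this recipe names.

```python
import sys


def comfyui_argv(
    disable_smart_memory: bool = False,
    disable_pinned_memory: bool = False,
    lowvram: bool = False,
) -> list[str]:
    """Build the argv for main.py; PyTorch cross-attention is always included."""
    argv = [sys.executable, "main.py", "--use-pytorch-cross-attention"]
    if disable_smart_memory:
        argv.append("--disable-smart-memory")   # instability across repeated runs
    if disable_pinned_memory:
        argv.append("--disable-pinned-memory")  # model loads stall
    if lowvram:
        argv.append("--lowvram")                # last resort, after trying a GGUF quant
    return argv
```

The resulting list can be passed to `subprocess.run(argv, cwd=...)` with `cwd` pointing at your ComfyUI checkout.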

Results

Speed: We found no Juggernaut Z iterations-per-second benchmark for the RX 7800 XT that could be verified on a source page, and our backend has no ingested measurement for this pair; /check/juggernaut-z/rx-7800-xt currently reports verdict: unknown. Generation time on a Z-Image-class DiT at 35 steps is dominated by memory bandwidth (the 7800 XT has 624 GB/s, 65% of the 7900 XTX's 960 GB/s), so do not assume any 7900 XTX figure transfers: it would overstate the 7800 XT. We therefore omit a speed figure rather than extrapolate one. If you've measured Juggernaut Z it/s or seconds-per-image on a 7800 XT, please contribute it so it lands on /check/juggernaut-z/rx-7800-xt. As a general note, you can try PYTORCH_TUNABLEOP_ENABLED=1 (see Troubleshooting), which the ComfyUI README says "might speed things up at the cost of a very slow initial run."
VRAM usage: The BF16 DiT checkpoint is 12.31 GB on disk (HF Files tree). In the single-file ComfyUI workflow the resident peak is dominated by that DiT plus the Z-Image VAE (0.17 GB) and 1024×1024 activations, with the Qwen3-4B text encoder (8.05 GB) loaded and freed around CLIP encode rather than held resident. That is why this path fits 16 GB while the full Diffusers stack (~20.5 GB, DiT + resident encoder) does not. The fit is tight: the 13 GB minimum (min_vram_gb: 13) reflects the single-file DiT path's resident envelope, with little slack for large batches or oversized resolution. Drop to a GGUF quant (from Q8_0 at 7.34 GB down to Q4_K_S at 4.83 GB) for more headroom. See /check/juggernaut-z/rx-7800-xt for any community-submitted measurement.
Quality notes: Per the HF card, Juggernaut Z is tuned for stronger lighting, sharper focus, and cleaner portraits relative to Z-Image Base; the card's recommended sampler is 35 steps / CFG 6. BF16 is the highest-quality local path; a GGUF quant trades a little quality for VRAM headroom on this 16 GB card. License is CC BY-NC 4.0 (non-commercial; commercial licensing via juggernaut@rundiffusion.com).

For the full benchmark data and other-GPU comparisons, see /check/juggernaut-z/rx-7800-xt.

Troubleshooting

Out of memory on 16 GB — switch to a GGUF quant (not FP8)

The single-file BF16 path is close to the 16 GB ceiling, so a large batch size, a resolution above 1024 px, or a memory-hungry custom node can tip it into an out-of-memory error. The fix on RDNA3 is not the FP8 file — that upcasts to BF16 at load on this card (no saving). Use an in-repo GGUF quant with a GGUF-aware loader node instead: the card's Files table runs from Juggernaut_Z_V1_by_RunDiffusion_q8_0.gguf (7.34 GB, near-BF16 quality) down to Juggernaut_Z_V1_by_RunDiffusion_q4_k_s-001.gguf (4.83 GB). GGUF loads via the llama.cpp-HIP path that works on ROCm, and the smaller resident weight frees several GB for activations and VAE decode. Keep resolution ≤ 1024 px on the longest side regardless of quant.

VAE Decode crashes on ROCm (the DiT generates fine, then it dies at decode)

This is the most-reported failure for Z-Image-family models on RDNA3 Radeon cards, and Juggernaut Z inherits the same VAE. In ComfyUI Issue #11551, titled "Z-Image Turbo - VAE crash when using bf16 models on 24GB RX 7900 XTX", the KSampler stage completes successfully, then VAE Decode hangs for a few seconds and crashes even with plenty of free VRAM remaining. That points to a ROCm-side bug rather than an out-of-memory condition (the report is open with no upstream fix at time of writing). The thread is on a 7900 XTX (gfx1100), but the 7800 XT runs the same RDNA3 ROCm attention/VAE path, so the same workarounds apply:

Use --use-pytorch-cross-attention (already in the launch command above). A ComfyUI contributor in that thread, running a 7900 XT, advises adding "--use-pytorch-cross-attention --disable-smart-memory" and reports no Z-Image-Turbo problems with it. The crash is triggered by the default sub-quadratic attention backend, not by VAE precision.
Add --disable-smart-memory (and, if loads stall on repeated runs, --disable-pinned-memory) — the standard ROCm large-model memory-management workaround, also recommended in that thread.
Keep resolution ≤ 1024 px on the longest side. If you still see a corrupted (not crashing) decode, you can move the VAE to the CPU with --cpu-vae — slower, but it sidesteps the ROCm VAE kernel entirely (the VAE is only 0.17 GB, so the CPU penalty is modest). Note that --bf16-vae is not a confirmed fix for this crash and is contested on RDNA3 (ROCm Issue #4729 reports it can inflate decode VRAM) — reach for it only if a precision/black-image artifact appears, never as the crash fix.

If you're on native-Windows ROCm and get corrupted ("grey with colored lines") output from the VAE instead of a crash, a sibling report (ComfyUI Issue #11190, RDNA4 9070 XT) confirmed the same workflow runs cleanly under WSL2. Run under WSL2 or native Linux — the configuration this recipe targets — where the AMD PyTorch/ROCm VAE path is more mature.

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Per the ComfyUI README troubleshooting note, uninstall and reinstall against the ROCm wheel index:

pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a version with a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (ROCm builds expose the GPU through the cuda device namespace via HIP).
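The same check can be wrapped in a small script that also reports the HIP version. This is a sketch: the helper names are hypothetical, the version strings in the usage note are made-up examples, and it assumes `torch.version.hip` is None on non-ROCm builds.

```python
def looks_like_rocm_build(version: str) -> bool:
    """True when a torch version string carries a +rocm local tag."""
    _, _, local = version.partition("+")
    return local.startswith("rocm")


def describe_torch() -> str:
    """Summarise the installed torch build, or say that torch is missing."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    hip = getattr(torch.version, "hip", None)  # expected to be None on non-ROCm builds
    return (
        f"torch {torch.__version__}"
        f" | rocm tag: {looks_like_rocm_build(torch.__version__)}"
        f" | hip: {hip}"
        f" | device visible: {torch.cuda.is_available()}"
    )
```

Calling `print(describe_torch())` inside the ComfyUI environment shows all three signals at once; a string like "2.9.0+cu128" would fail the tag check, while "2.9.0+rocm7.2" would pass.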

Generation feels slow: try TunableOp (expect a very slow first run)

Per the ComfyUI README "AMD ROCm Tips": "You can try setting this env variable PYTORCH_TUNABLEOP_ENABLED=1 which might speed things up at the cost of a very slow initial run." TunableOp auto-tunes GEMM kernels for your card on the first pass (slow), then caches the tuned kernels for faster subsequent generations:

PYTORCH_TUNABLEOP_ENABLED=1 python main.py --use-pytorch-cross-attention

Do not install xformers or FlashAttention

HF and ComfyUI guides written for NVIDIA frequently suggest pip install xformers or a FlashAttention wheel. On RDNA3 these are the wrong path: the ROCm xformers fork is limited, and ComfyUI can route attention through PyTorch SDPA on this stack. Keep --use-pytorch-cross-attention in the launch command so that path is selected explicitly.

1024×1024 outputs look noisy or grainy

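The Juggernaut Z author flags this on the Civitai release notes for some prompts: try a portrait/landscape aspect (e.g. 960×1440 / 1440×960) instead of square, and stay within the HF card's 25–45 step range (default 35 steps / CFG 6) rather than pushing steps far below it. Note that 1440 px on the long side exceeds the ≤ 1024 px guidance given earlier for the BF16 path, so on 16 GB pair these sizes with a GGUF quant if memory gets tight.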