Juggernaut Z on RTX 5070 Ti: Cinematic Photoreal Fine-Tune of Z-Image Base at BF16

What You'll Build

A local install of Juggernaut Z V1 — Team Juggernaut's photoreal fine-tune of Tongyi-MAI's 6B Z-Image Base, trained by KandooAI and released through RunDiffusion. The recipe covers two paths: a Python script via HuggingFace diffusers, and a ComfyUI workflow using the official RunDiffusion node graph. Per the HF model card, Juggernaut Z is tuned for "stronger lighting, sharper focus, more refined skin texture, and more cinematic atmosphere" relative to the upstream Base.

Hardware data: RTX 5070 Ti (16GB VRAM) · BF16 / FP8 / GGUF variants available · See benchmark data

⚠️ License: CC BY-NC 4.0 (non-commercial). Per the HF model card, Juggernaut Z is licensed for non-commercial use only: you may not use the model or its outputs in a commercial workflow without a separate license. Commercial licensing is available via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error; the canonical HF card is the source of truth.

Not Z-Image Turbo. Juggernaut Z is built on Z-Image Base, not the distilled Turbo, so it needs a different step/CFG profile: the default is 35 steps at guidance scale 6 per the HF model card, not the 8-NFE pattern used in the "Z-Image-Turbo on RTX 5070 Ti" recipe. Use the settings below.

Requirements

Component	Minimum	Tested
GPU	16GB VRAM consumer card (bf16/fp16); ~8GB with FP8 or GGUF Q4–Q5	RTX 5070 Ti (16GB)
RAM	16GB system RAM	—
Storage	~12.3GB for bf16 / fp16 weights; ~6.15GB for fp8; ~4.83GB for Q4_K_S GGUF	—
Software	Python 3.10+, PyTorch with cu128 (CUDA 12.8) + bf16 support, `diffusers` ≥ 0.37.1	ComfyUI with RES4LFY node / `diffusers` ≥ 0.37.1

The headline 16 GB tier is anchored on the BF16 weights themselves: the Juggernaut-Z-Image repo file listing ships the bf16 checkpoint at 12.31 GB on disk, leaving ~3 GB of headroom on a 16 GB card for activations, the VAE, and latents. The same repo also ships an FP8 e4m3fn safetensors variant (6.15 GB) and a full set of GGUF quantizations (Q4_K_S at 4.83 GB through Q8_0 at 7.34 GB) for tighter VRAM budgets. As context, Tongyi-MAI describes the distilled sibling on its Z-Image-Turbo card as fitting comfortably within "16G VRAM consumer devices". Juggernaut Z is a fine-tune of Z-Image Base, not Turbo, but per the Z-Image Base card it shares the same Single-Stream Diffusion Transformer architecture, so the 16 GB framing carries over to the bf16 build here.
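The headroom arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: it assumes the on-disk size equals the resident weight size and that the full 16 GB is available, and the helper name `headroom_gb` is hypothetical. Real headroom will be smaller once CUDA context and allocator overhead are counted.

```python
# File sizes (GB on disk) as quoted from the Juggernaut-Z-Image repo listing.
VARIANT_SIZES_GB = {
    "bf16": 12.31,
    "fp8_e4m3fn": 6.15,
    "gguf_q4_k_s": 4.83,
    "gguf_q8_0": 7.34,
}


def headroom_gb(variant: str, vram_gb: float = 16.0) -> float:
    """Return VRAM left after the weights (simplification: disk size == resident size)."""
    return round(vram_gb - VARIANT_SIZES_GB[variant], 2)


for name in VARIANT_SIZES_GB:
    print(f"{name}: {headroom_gb(name):.2f} GB left for activations, VAE and latents")
```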

The RTX 5070 Ti is a Blackwell GB203 sm_120 card. Install a PyTorch build compiled against CUDA 12.8 (cu128): earlier cu121/cu126 wheels do not ship sm_120 kernels and will fail to launch or fall back to slow paths on this GPU. The nightly index is shown here; stable releases from PyTorch 2.7 onward also publish cu128 wheels:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
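After installing, it can help to confirm that the wheel really contains sm_120 kernels. The sketch below uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`; the helper names are hypothetical, and the exact-match test is a simplification (a build could in principle also run through PTX forward compatibility).

```python
def build_supports_gpu(capability, arch_list):
    """True if the compiled arch list names the GPU's (major, minor) compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch():
    """Query the local PyTorch install; returns None if torch or a CUDA GPU is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return build_supports_gpu(
        torch.cuda.get_device_capability(0), torch.cuda.get_arch_list()
    )


print("kernels for this GPU present:", check_installed_torch())
```

On an RTX 5070 Ti with a cu128 build, the capability should be reported as (12, 0) and the arch list should include `sm_120`.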

Installation

Path A — HuggingFace diffusers (Python script)

Per the Juggernaut-Z-Image model card, Juggernaut Z loads through the standard DiffusionPipeline once diffusers is recent enough to know about ZImagePipeline:

pip install -U "diffusers>=0.37.1" transformers accelerate safetensors
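To check which diffusers version an environment actually resolved to, a small standard-library sketch like the following may be enough. The helper names are hypothetical, and the parser only reads the leading numeric part, so a pre-release such as `0.37.1.dev0` is treated the same as `0.37.1`.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric 'X.Y[.Z]' part of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed: str, minimum: str = "0.37.1") -> bool:
    """Compare two version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("diffusers")
    print(f"diffusers {installed}: new enough = {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("diffusers is not installed in this environment")
```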

Path B — ComfyUI (RunDiffusion workflow)

The official RunDiffusion ComfyUI guide ships an IMG-JuggernautZ-Txt2Img.json workflow that expects the RES4LFY custom node. Install order:

# 1. Open ComfyUI Manager → Custom Nodes Manager → install "RES4LFY", then restart ComfyUI.

# 2. Download a Juggernaut Z checkpoint to ComfyUI/models/checkpoints/
#    Pick ONE based on your VRAM budget. URLs from the official RunDiffusion repo:
#    https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/tree/main

# bf16 (12.31 GB on disk — fits 16GB VRAM with room to spare):
wget -P ComfyUI/models/checkpoints/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion.safetensors

# fp8 e4m3fn (6.15 GB on disk — for ≤12 GB cards):
wget -P ComfyUI/models/checkpoints/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_FP8_e4m3fn.safetensors

Load the IMG-JuggernautZ-Txt2Img.json workflow into ComfyUI by dragging the file onto the canvas (download from the RunDiffusion guide linked above).
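If wget is not available, the same checkpoint files can be fetched from Python with `huggingface_hub.hf_hub_download`. The sketch below is illustrative: `pick_checkpoint` and `download_checkpoint` are hypothetical helpers, and the "bf16 at 16 GB or more, fp8 otherwise" rule is an assumption that extends the ≤12 GB guidance above to 13–15 GB cards.

```python
REPO_ID = "RunDiffusion/Juggernaut-Z-Image"

# Filenames as they appear in the wget commands above.
CHECKPOINTS = {
    "bf16": "Juggernaut_Z_V1_by_RunDiffusion.safetensors",
    "fp8": "Juggernaut_Z_V1_FP8_e4m3fn.safetensors",
}


def pick_checkpoint(vram_gb: float) -> str:
    """Assumed rule of thumb: bf16 on 16 GB or more, fp8 below that."""
    return CHECKPOINTS["bf16"] if vram_gb >= 16 else CHECKPOINTS["fp8"]


def download_checkpoint(vram_gb: float, target_dir: str = "ComfyUI/models/checkpoints") -> str:
    """Download the chosen file and return its local path."""
    # Imported lazily; requires `pip install huggingface_hub`.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID, filename=pick_checkpoint(vram_gb), local_dir=target_dir
    )
```

Calling `download_checkpoint(16)` from the directory that contains ComfyUI would place the bf16 file in the checkpoints folder.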

Running

Path A — diffusers snippet

The inference snippet below is verbatim from the Juggernaut-Z-Image HF model card:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-Z-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a cinematic portrait, dramatic lighting",
    guidance_scale=6.0,
    num_inference_steps=35,
).images[0]
image.save("output.png")

The HF model card's Recommended Settings table lists the defaults as CFG 6 (range 6–9) and Steps 35 (range 25–45). from_pretrained only downloads the files declared in model_index.json, so it will not pull the standalone .safetensors / .gguf variants at the repo root.
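When experimenting with other values, a small guard can flag settings that drift outside the card's recommended ranges. This is a hypothetical helper (`out_of_range` is not part of diffusers); it only encodes the two ranges quoted above.

```python
# Ranges and defaults quoted from the model card's Recommended Settings table.
RECOMMENDED = {
    "guidance_scale": {"default": 6.0, "low": 6.0, "high": 9.0},
    "num_inference_steps": {"default": 35, "low": 25, "high": 45},
}


def out_of_range(**settings) -> list:
    """Return one warning string per setting that falls outside its recommended range."""
    warnings = []
    for name, value in settings.items():
        bounds = RECOMMENDED[name]
        if not bounds["low"] <= value <= bounds["high"]:
            warnings.append(f"{name}={value} is outside {bounds['low']}-{bounds['high']}")
    return warnings


print(out_of_range(guidance_scale=6.0, num_inference_steps=35))  # card defaults: no warnings
print(out_of_range(guidance_scale=3.0, num_inference_steps=8))   # both values out of range
```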

Path B — ComfyUI

After loading the official workflow JSON, edit the prompt node and hit Queue Prompt. The official guide recommends starting at a moderate resolution — 1024×1024, 832×1216, or 1216×832 — before scaling up. The Civitai release page for Juggernaut Z v1.0 additionally documents a two-pass setup the model author tunes for sharpness:

First pass: sampler Res_2s, scheduler Beta, 22 steps, denoise 1.00
Second pass: sampler Res_2s, scheduler Normal, 3 steps, denoise 0.15
Recommended resolution: 960×1440 (or a similar pixel area); the author notes that low resolutions like 1024×1024 can sometimes look grainy or noisy with this fine-tune
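For aspect ratios other than 2:3, "a similar pixel area" can be worked out as in the sketch below. The helper name is hypothetical, and snapping each side to a multiple of 16 is an assumption made for convenience, not a documented requirement of the model.

```python
import math

TARGET_AREA = 960 * 1440  # pixel area of the author's recommended resolution


def similar_area_resolution(aspect_w: int, aspect_h: int, multiple: int = 16) -> tuple:
    """Return (width, height) close to TARGET_AREA for the given aspect ratio."""

    def snap(value: float) -> int:
        # Round to the nearest multiple; the multiple-of-16 choice is an assumption.
        return max(multiple, round(value / multiple) * multiple)

    height = math.sqrt(TARGET_AREA * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h
    return snap(width), snap(height)


print(similar_area_resolution(2, 3))   # portrait, matches the recommended 960x1440
print(similar_area_resolution(16, 9))  # widescreen at a comparable pixel count
```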

Results

Speed: No benchmark of Juggernaut Z on the RTX 5070 Ti has been published yet, and the benchmark backend has no measurement for this pair, so no speed figure is quoted. The RTX 5070 Ti uses the same Blackwell GB203 sm_120 die as the RTX 5080 and the same 16 GB GDDR7 tier, but its ~896 GB/s of memory bandwidth and 8960 CUDA cores sit modestly below the 5080's. Quoting a 5080's (or any other card's) per-step time as if it were measured here would therefore be a guess, not a measurement. When a community benchmark lands it will appear on /check/juggernaut-z/rtx-5070-ti. If you run it, please submit your numbers.
VRAM usage: The bf16 Juggernaut Z checkpoint is 12.31 GB on disk per the HF repo listing; the 5070 Ti's 16 GB absorbs the weights plus activations / VAE / latents with ~3 GB of headroom. Live measurements: /check/juggernaut-z/rtx-5070-ti.
Quality notes: Per the HF card, Juggernaut Z is licensed CC BY-NC 4.0 (non-commercial; commercial licensing via juggernaut@rundiffusion.com). It is tuned for "stronger lighting, sharper focus, more refined skin texture, and more cinematic atmosphere" relative to Z-Image Base, and the card flags composition as an area with "further work planned for v2".

For the full benchmark data, see /check/juggernaut-z/rtx-5070-ti.
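If you want to collect numbers to submit, one possible approach is sketched below using `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`. The `benchmark` helper and its `run_once` argument are hypothetical. Two caveats: the timing covers the whole call (text encoding and VAE decode included), so the per-step figure is an upper bound, and the memory figure reflects PyTorch's allocator rather than the total that nvidia-smi reports.

```python
import time


def bytes_to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes, the unit used for file sizes in this recipe."""
    return num_bytes / 1e9


def benchmark(run_once, steps: int):
    """Time one call of `run_once`; return (seconds per step, peak allocated VRAM in GB).

    `run_once` is a hypothetical zero-argument callable wrapping your pipeline call.
    """
    import torch  # imported lazily so bytes_to_gb stays usable without PyTorch

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    run_once()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / steps, bytes_to_gb(torch.cuda.max_memory_allocated())
```

With the Path A pipeline loaded as `pipe`, a call might look like `benchmark(lambda: pipe("a cinematic portrait", guidance_scale=6.0, num_inference_steps=35), steps=35)`. Running it once as a warm-up before the measured run is usually sensible.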

Troubleshooting

ComfyUI errors out with a missing custom node

The official Juggernaut Z workflow requires the RES4LFY node; install it from ComfyUI Manager → Custom Nodes, then restart ComfyUI. Documented in the RunDiffusion ComfyUI guide.

`DiffusionPipeline` raises "Cannot find pipeline class ZImagePipeline"

ZImagePipeline ships in diffusers 0.37.1 and later (the card states it was verified against diffusers 0.37.1 and 0.38.0). Upgrade with pip install -U "diffusers>=0.37.1" per the HF model card. If your environment is pinned to an older release, install from main: pip install git+https://github.com/huggingface/diffusers.

Torch fails to launch or runs slowly on the RTX 5070 Ti

The RTX 5070 Ti is Blackwell GB203 sm_120. Install a PyTorch build compiled against CUDA 12.8 (--index-url https://download.pytorch.org/whl/nightly/cu128) — cu121/cu126 wheels lack sm_120 kernels. If a custom node or sample snippet hardcodes attn_implementation="flash_attention_2", switch it to "sdpa" or "eager": FlashAttention-2 wheels do not yet ship sm_120 kernels (Dao-AILab#2168).

1024×1024 outputs look noisy or grainy

The Juggernaut Z author flags this on the Civitai release notes: use 960×1440 (or a similar pixel area) instead, or apply the documented two-pass schedule (22 steps Res_2s/Beta at denoise 1.00, then 3 steps Res_2s/Normal at denoise 0.15).

Tight on VRAM (≤ 12 GB card)

Download the FP8 e4m3fn safetensors (6.15 GB) or one of the GGUF Q4–Q5 quantizations (4.83–5.68 GB) from the HF repo instead of the bf16 build. Blackwell sm_120 tensor cores support FP8 natively, so the 5070 Ti can run the FP8 checkpoint without software emulation, provided the loader actually dispatches FP8 matmuls. GGUF requires a GGUF-aware loader node in ComfyUI.