Juggernaut Z on RTX 5070: Cinematic Photoreal Fine-Tune of Z-Image Base via FP8 in ComfyUI

image · intermediate · 12GB+ VRAM · Jun 5, 2026
prerequisites
  • NVIDIA RTX 5070 (12GB VRAM) or any consumer GPU with 12GB VRAM
  • Python 3.10+
  • PyTorch built against CUDA 12.8 (cu128) for Blackwell sm_120 kernel coverage
  • ComfyUI (with the RES4LFY custom node) on the latest build

What You'll Build

A local install of Juggernaut Z V1 — Team Juggernaut's photoreal fine-tune of Tongyi-MAI's 6B Z-Image Base, trained by KandooAI and released through RunDiffusion — running on a 12 GB RTX 5070 via the FP8 e4m3fn single-file in ComfyUI. Per the HF model card, Juggernaut Z is tuned for stronger lighting, sharper focus, more refined skin texture, and more cinematic atmosphere relative to the upstream Base.

Hardware data: RTX 5070 (12GB VRAM) · FP8 e4m3fn transformer 6.15 GB on disk · Benchmark data: /check/juggernaut-z/rtx-5070

⚠️ License: CC BY-NC 4.0 (non-commercial). Per the HF model card, Juggernaut Z is licensed for non-commercial use only — you may not use the model or its outputs in a commercial workflow without a license. Commercial licensing is via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error — the HF canonical card is the source of truth.

Not Z-Image Turbo. Juggernaut Z is built on Z-Image Base (not the distilled Turbo). That means a full-model step/CFG profile — Juggernaut Z's default is 35 steps at guidance scale 6 per the HF model card, not the 8-step / CFG 1.0 pattern of Z-Image-Turbo workflows. Use the settings below.

Why FP8 (or GGUF), not BF16, on this card. The original BF16 build is 12.31 GB on disk per the repo file listing — that already exceeds the ~11.3 GB of usable VRAM a 12 GB card has free with a display attached, before any activations or VAE. This recipe leads with the FP8 e4m3fn variant (6.15 GB) instead. Blackwell sm_120 has native FP8 tensor cores, so FP8 runs at hardware speed on the RTX 5070 — it is the correct path here, not a fallback.
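The size comparison above is simple subtraction, and a short sketch makes the margin explicit. The sizes are the figures quoted in this recipe; the function name and the rounding are illustrative choices, and the result covers weights only, not activations, the text encoder, or the VAE.

```python
# Sizes in GB, copied from the figures quoted in this recipe.
USABLE_VRAM_GB = 11.3  # approximate free VRAM on a 12 GB card with a display attached

VARIANT_SIZES_GB = {
    "BF16": 12.31,
    "FP8 e4m3fn": 6.15,
    "Q5_K_M GGUF": 5.68,
    "Q4_K_S GGUF": 4.83,
}


def weight_headroom_gb(size_gb, usable_gb=USABLE_VRAM_GB):
    """Return VRAM left after loading the weights; negative means they do not fit."""
    return round(usable_gb - size_gb, 2)


if __name__ == "__main__":
    for name, size in VARIANT_SIZES_GB.items():
        headroom = weight_headroom_gb(size)
        verdict = "fits" if headroom > 0 else "does not fit"
        print(f"{name:12s} {size:6.2f} GB  headroom {headroom:+.2f} GB  ({verdict})")
```

Only the BF16 row comes out negative, which is why this recipe starts from the FP8 file.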

Requirements

Component | Minimum | Tested
GPU | 12GB VRAM consumer card | RTX 5070 (12GB)
RAM | 16GB system RAM | -
Storage | ~6.15GB for FP8 e4m3fn; ~4.83GB for Q4_K_S GGUF; +~4GB for the Qwen3-4B text encoder + VAE | -
Software | Python 3.10+, PyTorch with cu128 (CUDA 12.8), latest ComfyUI build with RES4LFY node | ComfyUI (latest)

The RTX 5070 is a Blackwell GB205 sm_120 card with 12 GB GDDR7. Install a PyTorch build compiled against CUDA 12.8 (cu128) — earlier cu121/cu126 wheels do not ship sm_120 kernels and will fall back to slow paths or fail to launch on this GPU:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
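After installing, it may be worth confirming that the wheel actually carries sm_120 kernels. The sketch below relies on `torch.cuda.get_arch_list()`, `torch.cuda.get_device_capability()`, and `torch.version.cuda`; the helper name `has_blackwell_kernels` is made up for this example.

```python
def has_blackwell_kernels(arch_list):
    """True if a PyTorch arch list includes sm_120 (consumer Blackwell)."""
    return "sm_120" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        if torch.cuda.is_available():
            print("device:", torch.cuda.get_device_name(0))
            print("compute capability:", torch.cuda.get_device_capability(0))
            print("sm_120 kernels present:",
                  has_blackwell_kernels(torch.cuda.get_arch_list()))
        else:
            print("CUDA is not available to this build")
```

On an RTX 5070 the compute capability should read (12, 0). If the sm_120 line prints False, the installed wheel is probably not a cu128 build.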

Installation

1. Install ComfyUI and the RES4LFY node

Use the latest ComfyUI build (the Z-Image / lumina2 loader support landed in the Nov 2025 release). The official RunDiffusion ComfyUI guide ships an IMG-JuggernautZ-Txt2Img.json workflow that expects the RES4LFY custom node:

# Open ComfyUI Manager → Custom Nodes Manager → install "RES4LFY", then restart ComfyUI.

2. Download the FP8 checkpoint

Pick the FP8 e4m3fn single-file for the 12 GB card. URLs are from the official RunDiffusion repo:

# FP8 e4m3fn transformer (6.15 GB on disk, the recommended 12 GB path).
# This is a transformer-only file, so it goes where the Load Diffusion Model node looks:
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_FP8_e4m3fn.safetensors
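A truncated download is a common cause of load failures, so a quick size check against the quoted 6.15 GB can help. This sketch assumes the listed size is in decimal gigabytes (10^9 bytes); the function names and the 0.05 GB tolerance are arbitrary choices for illustration, and the check is not a substitute for a checksum.

```python
import os
import sys


def size_gb(path):
    """File size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return os.path.getsize(path) / 1e9


def looks_complete(path, expected_gb, tolerance_gb=0.05):
    """True if the file size is within tolerance_gb of the expected size."""
    return abs(size_gb(path) - expected_gb) <= tolerance_gb


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the downloaded .safetensors file as the first argument.
    path = sys.argv[1]
    print(f"{path}: {size_gb(path):.2f} GB, "
          f"matches 6.15 GB: {looks_complete(path, 6.15)}")
```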

If you prefer an even smaller on-disk footprint, the repo also ships a full GGUF ladder — download one and use a GGUF-aware loader node (the Z-Image text encoder is loaded via CLIPLoader (GGUF) with type lumina2):

# Q5_K_M GGUF (5.68 GB) or Q4_K_S GGUF (4.83 GB) — load into ComfyUI/models/unet/
wget -P ComfyUI/models/unet/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion_q5_k_m-003.gguf

3. Load the workflow

Drag the IMG-JuggernautZ-Txt2Img.json workflow (download from the RunDiffusion guide) onto the ComfyUI canvas. The Z-Image graph loads three components: the diffusion model (the FP8 / GGUF file above), the Qwen3-4B text encoder (qwen_3_4b.safetensors, CLIP type lumina2), and the Flux VAE (ae.safetensors).

Running

After loading the official workflow JSON, edit the prompt node and hit Queue Prompt. Per the HF model card Recommended Settings table, use the Base-model profile: CFG 6 (range 6–9) and Steps 35 (range 25–45). Start at a moderate resolution — 1024×1024, 832×1216, or 1216×832 — before scaling up.
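If you script or template your workflows, the recommended ranges above can be encoded as a small sanity check. This is a minimal sketch: `SamplerSettings` and `out_of_range` are hypothetical names and not part of ComfyUI. Only the numbers come from the settings quoted above.

```python
from dataclasses import dataclass

STEP_RANGE = (25, 45)   # recommended steps range
CFG_RANGE = (6.0, 9.0)  # recommended CFG range


@dataclass(frozen=True)
class SamplerSettings:
    steps: int = 35   # recommended default
    cfg: float = 6.0  # recommended default


def out_of_range(settings):
    """Return human-readable warnings for values outside the recommended ranges."""
    problems = []
    if not STEP_RANGE[0] <= settings.steps <= STEP_RANGE[1]:
        problems.append(f"steps={settings.steps} outside {STEP_RANGE}")
    if not CFG_RANGE[0] <= settings.cfg <= CFG_RANGE[1]:
        problems.append(f"cfg={settings.cfg} outside {CFG_RANGE}")
    return problems


if __name__ == "__main__":
    print(out_of_range(SamplerSettings()))                  # defaults
    print(out_of_range(SamplerSettings(steps=8, cfg=1.0)))  # Turbo-style values
```

The second call shows how Turbo-style settings get flagged when they are carried over to this Base fine-tune by mistake.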

The Civitai release page for Juggernaut Z v1.0 additionally documents a two-pass setup the model author tunes for sharpness:

  • First pass: sampler Res_2s, scheduler Beta, 22 steps, denoise 1.00
  • Second pass: sampler Res_2s, scheduler Normal, 3 steps, denoise 0.15
  • Recommended resolution: 960×1440 (or a similar pixel area); the author notes that low resolutions like 1024×1024 can sometimes look grainy or noisy with this fine-tune
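To find "a similar pixel area" at another aspect ratio, you can solve for width and height from the target pixel count. The sketch below does that and snaps both sides to a multiple of 16. That multiple is an assumption made for this example, not a documented requirement of the model.

```python
import math


def dims_for_area(target_pixels, aspect_w, aspect_h, multiple=16):
    """Width and height near target_pixels at aspect_w:aspect_h, snapped to `multiple`."""
    height = math.sqrt(target_pixels * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    area = 960 * 1440  # pixel area of the recommended resolution
    for ratio in [(2, 3), (3, 2), (1, 1), (16, 9)]:
        print(ratio, dims_for_area(area, *ratio))
```

With a 2:3 ratio this reproduces 960×1440. The other ratios give sizes with roughly the same pixel count.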

Results

  • Speed: No benchmark of Juggernaut Z on the RTX 5070 has been published yet, and this site has no measurement for this model and GPU pair. The RTX 5070 (GB205, ~672 GB/s memory bandwidth, 6144 CUDA cores) sits below the higher-tier Blackwell cards on both bandwidth and core count, so another card's per-step time would be a guess here, not a measurement. No speed figure is quoted for that reason. If you run it, please submit your numbers.
  • VRAM usage (derived): The FP8 e4m3fn transformer is 6.15 GB on disk and the Q4–Q5 GGUF tiers are 4.83–5.68 GB, both cited from the HF repo file listing. With ComfyUI's native sequential offload, the Qwen3-4B text encoder (~4 GB at FP8) computes conditioning and is freed before the FP8 transformer dominates the sampling peak, so the FP8 path lands comfortably inside the RTX 5070's 12 GB. This is a derived envelope from the cited on-disk sizes, not a measured peak — a measured number will appear on /check/juggernaut-z/rtx-5070 once a community benchmark lands.
  • Quality notes: Juggernaut Z is tuned for cinematic lighting, sharper focus, and cleaner portraits versus the upstream Z-Image Base, per the HF model card.
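The derived VRAM envelope in the list above rests on one idea: with sequential offload the stages are not resident at the same time, so the peak weight footprint is the largest stage and not the sum. The sketch below illustrates that arithmetic with the sizes quoted in this recipe. The function name and the `overhead_gb` parameter are hypothetical, and activation memory is deliberately left at zero because no measured figure exists.

```python
def peak_weights_gb(stage_sizes_gb, sequential=True, overhead_gb=0.0):
    """Rough peak weight footprint in GB.

    sequential=True  -> stages are loaded one at a time, so the peak is the largest.
    sequential=False -> all stages stay resident, so the peak is the sum.
    """
    sizes = list(stage_sizes_gb.values())
    base = max(sizes) if sequential else sum(sizes)
    return base + overhead_gb


if __name__ == "__main__":
    stages = {
        "Qwen3-4B text encoder (FP8, approx.)": 4.0,
        "Juggernaut Z transformer (FP8 e4m3fn)": 6.15,
    }
    print("sequential offload:", peak_weights_gb(stages), "GB")
    print("everything resident:", peak_weights_gb(stages, sequential=False), "GB")
```

These are weights-only estimates. Actual peak usage also depends on resolution and activations, which only a measured benchmark can settle.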

For the full benchmark data, see /check/juggernaut-z/rtx-5070.

Troubleshooting

BF16 build runs out of memory on the 12 GB card

The original Juggernaut_Z_V1_by_RunDiffusion.safetensors is 12.31 GB on disk — it does not fit a 12 GB card with a display attached. Use the FP8 e4m3fn (6.15 GB) or a GGUF Q4–Q5 (4.83–5.68 GB) variant from the HF repo instead. Blackwell sm_120 has native FP8 tensor cores, so the FP8 path runs at hardware speed on the RTX 5070.

Torch fails to launch or runs slowly on the RTX 5070

The RTX 5070 is Blackwell GB205 sm_120. Install a PyTorch build compiled against CUDA 12.8 (--index-url https://download.pytorch.org/whl/nightly/cu128) — cu121/cu126 wheels lack sm_120 kernels. If a custom node or sample snippet hardcodes attn_implementation="flash_attention_2", switch it to "sdpa" or "eager": FlashAttention-2 wheels do not yet ship sm_120 kernels (Dao-AILab#2168).

ComfyUI errors out with a missing custom node

The official Juggernaut Z workflow requires the RES4LFY node; install it from ComfyUI Manager → Custom Nodes, then restart ComfyUI. Documented in the RunDiffusion ComfyUI guide.

The text encoder outputs garbage / wrong CLIP type

Z-Image uses the Qwen3-4B text encoder, not a standard CLIP. In ComfyUI set the CLIP type to lumina2 and point it at qwen_3_4b.safetensors; standard CLIP nodes produce unusable conditioning.

1024×1024 outputs look noisy or grainy

The Juggernaut Z author flags this on the Civitai release notes: use 960×1440 (or a similar pixel area) instead, or apply the documented two-pass schedule (22 steps Res_2s/Beta at denoise 1.00, then 3 steps Res_2s/Normal at denoise 0.15).