
Juggernaut Z on RTX 4070: Cinematic Photoreal Fine-Tune of Z-Image Base via FP8 in ComfyUI

image · intermediate · 12GB+ VRAM · Jun 9, 2026
prerequisites
  • NVIDIA RTX 4070 (12GB VRAM) or any consumer GPU with 12GB VRAM
  • Python 3.10+
  • PyTorch with CUDA support (the default cu124/cu126 wheel — Ada sm_89 needs no special index)
  • ComfyUI on a recent build, with the RES4LYF custom node installed

What You'll Build

A local install of Juggernaut Z V1 — Team Juggernaut's photoreal fine-tune of Tongyi-MAI's Z-Image Base, trained by KandooAI and released through RunDiffusion — running on a 12 GB RTX 4070 via the FP8 e4m3fn single-file in ComfyUI. Per the HF model card, Juggernaut Z is tuned for "stronger lighting, sharper focus, more refined skin texture, and more cinematic atmosphere" relative to the upstream Base.

Hardware data: RTX 4070 (12GB VRAM) · FP8 e4m3fn transformer 6.15 GB on disk · See benchmark data

⚠️ License: CC BY-NC 4.0 (non-commercial). Per the HF model card, Juggernaut Z is "non-commercial use only" — you may not use the model or its outputs in a commercial workflow without a license. Commercial licensing is via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error — the HF canonical card is the source of truth.

Not Z-Image Turbo. Juggernaut Z is built on Z-Image Base (not the distilled Turbo). That means a full-model step/CFG profile — Juggernaut Z's default is 35 steps at guidance scale 6 per the HF model card Recommended Settings table, not the low-step / low-CFG pattern of Z-Image-Turbo workflows. Use the settings below.

Why FP8 (or GGUF), not BF16, on this card. The original BF16 build is 12.31 GB on disk per the repo file listing — that already exceeds the ~11.3 GB of usable VRAM a 12 GB card has free with a display attached, before any activations or VAE. This recipe leads with the FP8 e4m3fn variant (6.15 GB) instead. The RTX 4070 is Ada Lovelace (sm_89), whose 4th-generation tensor cores have native E4M3 / E5M2 FP8 support, so FP8 runs at hardware speed on this GPU — it is the correct path here, not a fallback.
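The fit argument above is plain subtraction, and a minimal sketch makes it easy to rerun for another card. The sizes are the ones quoted in this recipe; the 11.3 GB usable figure is the rough estimate from the paragraph above, not a measurement, and `headroom_gb` is a hypothetical helper name.

```python
# Sizes in GB as quoted in the text. USABLE_VRAM_GB is the recipe's rough
# figure for a 12 GB card with a display attached, not a measured value.
USABLE_VRAM_GB = 11.3

WEIGHT_SIZES_GB = {
    "BF16": 12.31,
    "FP8 e4m3fn": 6.15,
    "GGUF Q5_K_M": 5.68,
    "GGUF Q4_K_S": 4.83,
}


def headroom_gb(weights_gb: float, usable_gb: float = USABLE_VRAM_GB) -> float:
    """VRAM left for activations, VAE and encoder once the weights are loaded."""
    return round(usable_gb - weights_gb, 2)


for name, size in WEIGHT_SIZES_GB.items():
    left = headroom_gb(size)
    verdict = "fits" if left > 0 else "does not fit"
    print(f"{name:12s} {size:5.2f} GB -> {verdict} ({left:+.2f} GB headroom)")
```

Only the weights are counted here, so a positive headroom is a necessary condition rather than a guarantee.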

Requirements

Component | Minimum | Tested
GPU | 12GB VRAM consumer card | RTX 4070 (12GB)
RAM | 16GB system RAM |
Storage | ~6.15GB for FP8 e4m3fn; ~4.83GB for Q4_K_S GGUF; +~4GB for the Qwen3-4B text encoder + VAE |
Software | Python 3.10+, PyTorch with CUDA (default cu124/cu126 wheel), recent ComfyUI build with RES4LYF node | ComfyUI (recent build)

The RTX 4070 is an Ada Lovelace AD104 sm_89 card with 12 GB GDDR6X on a PCIe Gen4 x16 link. Unlike Blackwell-class cards (sm_120), it needs no special wheel selection — the default pip install torch already ships full sm_89 kernel coverage:

pip install torch

FlashAttention-2 prebuilt wheels include sm_89 kernels, so if a custom node or snippet uses flash_attention_2 it works as-is on the RTX 4070 — there is no sm_120 kernel gap to work around (that override applies only to Blackwell GPUs). No cu128-specific index URL is required.
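To confirm that PyTorch actually sees an sm_89 device, a short check along these lines may help. This is a sketch: `describe_capability` is a hypothetical helper, and the live check runs only if PyTorch and a CUDA device are present.

```python
def describe_capability(cap: tuple[int, int]) -> str:
    """Map a CUDA compute capability to the cases this recipe distinguishes."""
    if cap == (8, 9):
        return "Ada Lovelace (sm_89): default wheel, native FP8"
    if cap >= (12, 0):
        return "Blackwell-class (sm_120+): check wheel and kernel support"
    return f"sm_{cap[0]}{cap[1]}: not covered by this recipe"


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print(torch.cuda.get_device_name(0), "->", describe_capability(cap))
        else:
            print("CUDA is not available to PyTorch")
```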

Installation

1. Install ComfyUI and the RES4LYF node

Use a recent ComfyUI build (the Z-Image / lumina2 loader support landed in the Nov 2025 release). The official RunDiffusion ComfyUI guide ships an IMG-JuggernautZ-Txt2Img.json workflow that expects the RES4LYF custom node:

# Open ComfyUI Manager → Custom Nodes Manager → install "RES4LYF", then restart ComfyUI.

2. Download the FP8 checkpoint

Pick the FP8 e4m3fn single-file for the 12 GB card. URLs are from the official RunDiffusion repo:

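# FP8 e4m3fn transformer (6.15 GB on disk, the recommended 12 GB path).
# This is a transformer-only file, so it goes where ComfyUI's diffusion-model
# loader looks. If your copy of the workflow uses a checkpoint loader instead,
# save it to ComfyUI/models/checkpoints/.
wget -P ComfyUI/models/diffusion_models/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_FP8_e4m3fn.safetensors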
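If wget is unavailable, or a resumable download is preferred, the same file can be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed; `resolve_url` and `download` are hypothetical helper names, and the repo ID and file name are the ones from the command above.

```python
REPO_ID = "RunDiffusion/Juggernaut-Z-Image"
FP8_FILE = "Juggernaut_Z_V1_FP8_e4m3fn.safetensors"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the same /resolve/ URL that the wget command uses."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(filename: str, target_dir: str) -> str:
    """Download with huggingface_hub and return the local file path."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=target_dir)


print(resolve_url(REPO_ID, FP8_FILE))
```

Calling `download(FP8_FILE, target_dir)` with the same folder used in the wget command places the file where the workflow expects it.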

If you prefer a smaller on-disk footprint, the repo also ships a full GGUF ladder. Download one tier and load it with a GGUF-aware loader node; the Z-Image text encoder is loaded separately via CLIPLoader (GGUF) with type lumina2:

# Q5_K_M GGUF (5.68 GB), saved to ComfyUI/models/unet/. A Q4_K_S tier (4.83 GB) is also in the repo.
wget -P ComfyUI/models/unet/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion_q5_k_m-003.gguf

3. Load the workflow

Drag the IMG-JuggernautZ-Txt2Img.json workflow (download from the RunDiffusion guide) onto the ComfyUI canvas. The Z-Image graph loads three components: the diffusion model (the FP8 / GGUF file above), the Qwen3-4B text encoder (qwen_3_4b.safetensors, CLIP type lumina2), and the Flux VAE (ae.safetensors).
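Before queueing a prompt, it can save time to confirm that all three files are somewhere under the ComfyUI models folder. The following is a minimal sketch: `locate_models` is a hypothetical helper, the file names are the ones quoted in this recipe, and it searches recursively rather than assuming a particular subfolder layout.

```python
from pathlib import Path

# The FP8 name comes from the download step; the other two are the names
# quoted in the text above.
REQUIRED_FILES = [
    "Juggernaut_Z_V1_FP8_e4m3fn.safetensors",
    "qwen_3_4b.safetensors",
    "ae.safetensors",
]


def locate_models(comfy_root: str, names=REQUIRED_FILES) -> dict:
    """Return {file name: first matching path or None} under <root>/models."""
    models_dir = Path(comfy_root) / "models"
    found = {}
    for name in names:
        matches = sorted(models_dir.rglob(name)) if models_dir.is_dir() else []
        found[name] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for name, path in locate_models("ComfyUI").items():
        print(f"{name}: {path or 'MISSING'}")
```

For the GGUF route, swap the first entry for the .gguf file name you downloaded.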

Running

After loading the official workflow JSON, edit the prompt node and hit Queue Prompt. Per the HF model card Recommended Settings table, use the Base-model profile: CFG 6 (range 6–9) and Steps 35 (range 25–45). Start at a moderate resolution — 1024×1024, 832×1216, or 1216×832 — before scaling up.
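The recommended profile can be written down as a small validator, which is handy when adapting a Turbo-style workflow by hand. This is a sketch: the ranges are the ones quoted above, and `SamplerSettings` is a hypothetical name, not part of ComfyUI.

```python
from dataclasses import dataclass

# Ranges and starting sizes as quoted in the paragraph above.
CFG_RANGE = (6.0, 9.0)
STEPS_RANGE = (25, 45)
START_RESOLUTIONS = {(1024, 1024), (832, 1216), (1216, 832)}


@dataclass(frozen=True)
class SamplerSettings:
    cfg: float = 6.0
    steps: int = 35
    width: int = 1024
    height: int = 1024

    def problems(self) -> list[str]:
        """List every setting that falls outside the recommended profile."""
        issues = []
        if not CFG_RANGE[0] <= self.cfg <= CFG_RANGE[1]:
            issues.append(f"cfg {self.cfg} outside {CFG_RANGE}")
        if not STEPS_RANGE[0] <= self.steps <= STEPS_RANGE[1]:
            issues.append(f"steps {self.steps} outside {STEPS_RANGE}")
        if (self.width, self.height) not in START_RESOLUTIONS:
            issues.append(f"{self.width}x{self.height} is not a suggested starting size")
        return issues


print(SamplerSettings().problems())                   # recipe defaults
print(SamplerSettings(cfg=1.0, steps=8).problems())   # a Turbo-style profile
```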

The Civitai release page for Juggernaut Z v1.0 additionally documents a two-pass setup the model author tunes for sharpness:

  • First pass: sampler Res_2s, scheduler Beta, 22 steps, denoise 1.00
  • Second pass: sampler Res_2s, scheduler Normal, 3 steps, denoise 0.15
  • Recommended resolution: 960×1440 (or a similar pixel area); the author notes that low resolutions like 1024×1024 can sometimes look grainy or noisy with this fine-tune
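The two-pass schedule and the "similar pixel area" advice can be sketched as data plus a small helper. `resolution_for` is a hypothetical function, and snapping dimensions to a multiple of 16 is an assumption of this sketch, not something the release notes state.

```python
import math

# The two-pass schedule as listed above (values from the release page).
TWO_PASS = [
    {"sampler": "Res_2s", "scheduler": "Beta", "steps": 22, "denoise": 1.00},
    {"sampler": "Res_2s", "scheduler": "Normal", "steps": 3, "denoise": 0.15},
]

TARGET_AREA = 960 * 1440  # pixel area of the recommended resolution


def resolution_for(aspect_w: int, aspect_h: int,
                   area: int = TARGET_AREA, multiple: int = 16) -> tuple[int, int]:
    """Width and height near `area` pixels for the given aspect ratio.

    Snapping to `multiple` is an assumption made for this sketch.
    """
    height = math.sqrt(area * aspect_h / aspect_w)
    width = area / height

    def snap(value: float) -> int:
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for ratio in [(2, 3), (3, 2), (1, 1), (16, 9)]:
    w, h = resolution_for(*ratio)
    print(f"{ratio[0]}:{ratio[1]} -> {w}x{h} ({w * h / TARGET_AREA:.0%} of target area)")
```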

Results

  • Speed: No RTX 4070-named benchmark for Juggernaut Z is published yet, and the backend has no measurement for this pair (/check/juggernaut-z/rtx-4070 currently reports verdict: unknown). The RTX 4070 (AD104, ~504 GB/s memory bandwidth, 5888 CUDA cores) is a lower-bandwidth, lower-core-count Ada card than its 4070 Ti SUPER / 4080 siblings, so quoting another card's per-step time as if it were measured here would be a guess, not a measurement — no speed figure is quoted. If you run it, please submit your numbers.
  • VRAM usage (derived): The FP8 e4m3fn transformer is 6.15 GB on disk and the Q4–Q5 GGUF tiers are 4.83–5.68 GB, both cited from the HF repo file listing. With ComfyUI's native sequential offload, the Qwen3-4B text encoder (~4 GB at FP8) computes conditioning and is freed before the FP8 transformer dominates the sampling peak, so the FP8 path lands comfortably inside the RTX 4070's 12 GB. This is a derived envelope from the cited on-disk sizes, not a measured peak — a measured number will appear on /check/juggernaut-z/rtx-4070 once a community benchmark lands.
  • Quality notes: Juggernaut Z is tuned for cinematic lighting, sharper focus, and cleaner portraits versus the upstream Z-Image Base, per the HF model card. It shares the upstream Z-Image "Single-Stream Diffusion Transformer" architecture.
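The derived VRAM envelope in the list above rests on one idea: with sequential offload the stages do not share VRAM, so the weight footprint is the largest stage rather than the sum. A minimal sketch of that arithmetic, using only the figures quoted in this recipe (`peak_weights_gb` is a hypothetical helper):

```python
def peak_weights_gb(stages_gb: list[float], sequential: bool = True) -> float:
    """Weight footprint when stages run one at a time vs. all resident at once."""
    return max(stages_gb) if sequential else sum(stages_gb)


TEXT_ENCODER_GB = 4.0   # "~4 GB at FP8" per the text
TRANSFORMER_GB = 6.15   # FP8 e4m3fn file size per the repo listing
CARD_GB = 12.0

stages = [TEXT_ENCODER_GB, TRANSFORMER_GB]
print(f"sequential offload: {peak_weights_gb(stages):.2f} GB of {CARD_GB:.0f} GB")
print(f"both resident:      {peak_weights_gb(stages, sequential=False):.2f} GB of {CARD_GB:.0f} GB")
```

Activations, the VAE, and the CUDA context are not included, so this is a lower bound on the real peak, consistent with the "derived, not measured" caveat.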

For the full benchmark data, see /check/juggernaut-z/rtx-4070.

Troubleshooting

BF16 build runs out of memory on the 12 GB card

The original Juggernaut_Z_V1_by_RunDiffusion.safetensors is 12.31 GB on disk per the repo listing — it does not fit a 12 GB card with a display attached. Use the FP8 e4m3fn (6.15 GB) or a GGUF Q4–Q5 (4.83–5.68 GB) variant from the HF repo instead. The RTX 4070's Ada sm_89 tensor cores have native FP8 support, so the FP8 path runs at hardware speed.

ComfyUI errors out with a missing custom node

The official Juggernaut Z workflow requires the RES4LYF node; install it from ComfyUI Manager → Custom Nodes, then restart ComfyUI. This requirement is documented in the RunDiffusion ComfyUI guide.

The text encoder outputs garbage / wrong CLIP type

Z-Image uses the Qwen3-4B text encoder, not a standard CLIP. In ComfyUI set the CLIP type to lumina2 and point it at qwen_3_4b.safetensors; standard CLIP nodes produce unusable conditioning.
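If the workflow has been exported in ComfyUI's API format (a JSON object mapping node IDs to `class_type` and `inputs`), the encoder setting can be checked mechanically. This is a sketch under that assumption: `check_clip_loaders` is a hypothetical helper, the example fragment is invented for illustration, and GGUF loader nodes are not covered.

```python
import json

EXPECTED = {"clip_name": "qwen_3_4b.safetensors", "type": "lumina2"}


def check_clip_loaders(workflow: dict) -> list[str]:
    """Report CLIPLoader nodes whose inputs differ from EXPECTED."""
    problems = []
    for node_id, node in workflow.items():
        if not isinstance(node, dict) or node.get("class_type") != "CLIPLoader":
            continue
        inputs = node.get("inputs", {})
        for key, want in EXPECTED.items():
            if inputs.get(key) != want:
                problems.append(
                    f"node {node_id}: {key}={inputs.get(key)!r}, expected {want!r}"
                )
    return problems


# Hypothetical two-node fragment in API format, for illustration only.
example = json.loads("""
{
  "1": {"class_type": "CLIPLoader",
        "inputs": {"clip_name": "qwen_3_4b.safetensors", "type": "stable_diffusion"}},
  "2": {"class_type": "VAELoader", "inputs": {"vae_name": "ae.safetensors"}}
}
""")
print(check_clip_loaders(example))
```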

The text-encoder offload feels slow

ComfyUI streams the Qwen3-4B text encoder across the PCIe link during sequential offload. The RTX 4070 is on a PCIe Gen4 x16 link (roughly half the bandwidth of a Gen5 card), so the offloaded-encoder stage is a little slower than on a Gen5 host — the FP8 transformer sampling itself runs on-GPU and is unaffected. If conditioning latency bothers you, keep the encoder resident by stepping down to a smaller GGUF tier to free VRAM.
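A back-of-the-envelope estimate shows how small this cost is. The sketch below uses theoretical per-direction x16 bandwidth (about 31.5 GB/s for Gen4 and 63 GB/s for Gen5), so real transfers will be slower; the ~4 GB encoder size is the figure quoted in this recipe, and `transfer_seconds` is a hypothetical helper.

```python
# Theoretical per-direction x16 bandwidth in GB/s; real transfers are slower.
PCIE_X16_GBPS = {"Gen4": 31.5, "Gen5": 63.0}


def transfer_seconds(size_gb: float, gen: str) -> float:
    """Best-case time to move `size_gb` across an x16 link of generation `gen`."""
    return size_gb / PCIE_X16_GBPS[gen]


ENCODER_GB = 4.0  # "~4 GB at FP8" per the text
for gen in PCIE_X16_GBPS:
    ms = transfer_seconds(ENCODER_GB, gen) * 1000
    print(f"{gen}: at least {ms:.0f} ms per encoder load")
```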

1024×1024 outputs look noisy or grainy

The Juggernaut Z author flags this on the Civitai release notes: use 960×1440 (or a similar pixel area) instead, or apply the documented two-pass schedule (22 steps Res_2s/Beta at denoise 1.00, then 3 steps Res_2s/Normal at denoise 0.15).

Prefer the diffusers path instead of ComfyUI

The repo also ships the 🤗 Diffusers component layout, loadable with DiffusionPipeline.from_pretrained("RunDiffusion/Juggernaut-Z-Image", torch_dtype=torch.bfloat16) — but that path loads the BF16 weights, which do not fit the 12 GB RTX 4070 with a display attached. Per the HF model card, ZImagePipeline support requires diffusers ≥ 0.37.1. On this card stay on the FP8 ComfyUI path above; the diffusers BF16 route is for 16 GB+ cards.
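For readers on a 16 GB+ card who do want the diffusers route, a guarded loader might look like the sketch below. The `from_pretrained` call is the one quoted above; `version_tuple`, `diffusers_ok`, and `load_pipeline` are hypothetical helpers, and generation arguments are left out because they are not documented in this recipe.

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse the leading numeric part of a version such as '0.37.1' or '0.38.0.dev0'."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def diffusers_ok(installed: str, minimum: str = "0.37.1") -> bool:
    """True if the installed diffusers version meets the stated minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def load_pipeline():
    """BF16 load as quoted in the text; intended for 16 GB+ cards."""
    import diffusers
    import torch

    if not diffusers_ok(diffusers.__version__):
        raise RuntimeError(f"diffusers {diffusers.__version__} is older than 0.37.1")
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        "RunDiffusion/Juggernaut-Z-Image", torch_dtype=torch.bfloat16
    )
    return pipe.to("cuda")
```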