What You'll Build
A local install of Juggernaut Z V1 — Team Juggernaut's photoreal fine-tune of Tongyi-MAI's Z-Image Base, trained by KandooAI and released through RunDiffusion — running on a 12 GB RTX 3060 via the FP8 e4m3fn single-file in ComfyUI. Per the HF model card, Juggernaut Z is "tuned for stronger lighting, sharper focus, more refined skin texture, and more cinematic atmosphere — out of the box" relative to the upstream Base.
Hardware data: RTX 3060 (12GB VRAM, 360 GB/s memory bandwidth, Ampere sm_86) · FP8 e4m3fn transformer 6.15 GB on disk · See benchmark data
⚠️ License: CC BY-NC 4.0 (non-commercial). Per the HF model card, Juggernaut Z is "non-commercial use only": you may not use the model or its outputs in a workflow for commercial purposes without a license. Commercial licensing is via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error; the HF canonical card is the source of truth.
Not Z-Image Turbo. Juggernaut Z is built on Z-Image Base (base_model: Tongyi-MAI/Z-Image, the un-distilled Base, not the distilled Turbo). That means a full-model step/CFG profile: Juggernaut Z's default is 35 steps at guidance scale 6 per the HF model card Recommended Settings table, not the low-step / low-CFG pattern of Z-Image-Turbo workflows. Use the settings below.
Why FP8 weights (or GGUF), not BF16, on this card, and why FP8 here is a memory escape hatch, not a speed win. The original BF16 build is 12.31 GB on disk per the repo file listing. That already exceeds the ~10.5–11.3 GB of usable VRAM a 12 GB card has free with a display attached, before any activations or VAE, so this recipe leads with the FP8 e4m3fn variant (6.15 GB) instead.
The RTX 3060 is Ampere sm_86, which has BF16 / FP16 / INT8 / TF32 tensor cores but no native FP8 tensor cores (FP8 e4m3fn / e5m2 first shipped on Hopper sm_90 and Ada sm_89). The FP8 weights still load on the 3060, keeping the on-disk and resident-weights footprint small enough to fit 12 GB, but PyTorch dequantizes each weight to BF16 / FP16 at compute time. You get the VRAM savings without the matching speed boost an RTX 40-series / 50-series owner sees: on this card FP8 is the path that makes the model fit, not the path that makes it faster.
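To make "dequantizes each weight at compute time" concrete, here is a minimal sketch of what a single FP8 e4m3fn byte encodes. It is a pure-Python illustration of the number format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits), not the code path PyTorch or ComfyUI actually runs; `decode_e4m3fn` is a hypothetical helper name.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 e4m3fn byte into a Python float.

    Layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    The "fn" variant has no infinities; only S.1111.111 encodes NaN.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte (0..255)")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed scale of 2**-6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1 plus a 3-bit fraction.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


if __name__ == "__main__":
    for raw in (0x38, 0xB8, 0x7E, 0x01):
        print(f"0x{raw:02X} -> {decode_e4m3fn(raw)}")
```

Each weight costs one byte at rest, which is where the roughly halved footprint relative to BF16 comes from. On a card without FP8 tensor cores the widening to a 16-bit type still has to happen before the matrix multiply, which is why the saving is in memory rather than time.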
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12GB VRAM consumer card | RTX 3060 (12GB) |
| RAM | 16GB system RAM | — |
| Storage | ~6.15GB for FP8 e4m3fn; ~4.83GB for Q4_K_S GGUF; +~4GB for the Qwen3-4B text encoder + VAE | — |
| Software | Python 3.10+, PyTorch with CUDA (default cu124/cu121 wheel), recent ComfyUI build with RES4LFY node | ComfyUI (recent build) |
The RTX 3060 is an Ampere GA106 sm_86 card with 12 GB GDDR6 (192-bit, 15 Gbps, 360 GB/s) on a PCIe Gen4 x16 link, 3584 CUDA cores and 112 third-generation Tensor cores at 170 W (the 12 GB GA106-300 variant, not the later 8 GB cut-down — specs per TechPowerUp). Like Ada-class cards, and unlike Blackwell (sm_120), it needs no special wheel selection — the default pip install torch already ships full sm_86 kernel coverage:
pip install torch
FlashAttention-2 prebuilt wheels include sm_86 kernels, so if a custom node or snippet uses flash_attention_2 it works as-is on the RTX 3060 — there is no sm_120 kernel gap to work around (that override applies only to Blackwell GPUs). No cu128-specific index URL is required.
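If you want to confirm this on your own machine, the sketch below reads the GPU's compute capability and the architectures the installed wheel was compiled for. The helper names (`has_native_fp8`, `wheel_covers`, `report`) are made up for this example; the `torch.cuda` calls are standard PyTorch. The FP8 cutoff of sm_89 follows the Ada / Hopper note earlier in this guide.

```python
def has_native_fp8(capability: tuple[int, int]) -> bool:
    """True for sm_89 (Ada) and newer, the first parts with FP8 tensor cores."""
    return capability >= (8, 9)


def wheel_covers(arch_list: list[str], capability: tuple[int, int]) -> bool:
    """True if the wheel's compiled arch list names this exact sm_XY target."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report() -> str:
    """Describe the local GPU; degrades gracefully without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible to torch"
    cap = torch.cuda.get_device_capability(0)
    return (
        f"{torch.cuda.get_device_name(0)}: sm_{cap[0]}{cap[1]}, "
        f"wheel ships matching kernels: {wheel_covers(torch.cuda.get_arch_list(), cap)}, "
        f"native FP8 tensor cores: {has_native_fp8(cap)}"
    )


if __name__ == "__main__":
    print(report())
```

On an RTX 3060 the capability should come back as `(8, 6)`, so `has_native_fp8` returns `False` while the wheel check passes.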
Installation
1. Install ComfyUI and the RES4LFY node
Use a recent ComfyUI build (the Z-Image / lumina2 loader support landed in the Nov 2025 release). The official RunDiffusion ComfyUI guide ships an IMG-JuggernautZ-Txt2Img.json workflow that expects the RES4LFY custom node:
# Open ComfyUI Manager → Custom Nodes Manager → install "RES4LFY", then restart ComfyUI.
2. Download the FP8 checkpoint
Pick the FP8 e4m3fn single-file for the 12 GB card. URLs are from the official RunDiffusion repo:
# FP8 e4m3fn transformer (6.15 GB on disk — the recommended 12 GB path):
wget -P ComfyUI/models/checkpoints/ \
https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_FP8_e4m3fn.safetensors
If you prefer an even smaller on-disk footprint, the repo also ships a full GGUF ladder — download one and use a GGUF-aware loader node (the Z-Image text encoder is loaded via CLIPLoader (GGUF) with type lumina2):
# Q5_K_M GGUF (5.68 GB) or Q4_K_S GGUF (4.83 GB) — load into ComfyUI/models/unet/
wget -P ComfyUI/models/unet/ \
https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion_q5_k_m-003.gguf
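If `wget` is not available (for example on a stock Windows install), the same downloads can be sketched with the Python standard library. The repo ID, filenames, and target folders are the ones from the commands above; the helper names and the `comfy_root` parameter are assumptions for illustration. Note that `urlretrieve` does not resume interrupted transfers, which matters for multi-gigabyte files.

```python
from pathlib import Path

REPO_ID = "RunDiffusion/Juggernaut-Z-Image"

# filename -> ComfyUI models subfolder, mirroring the wget commands above
VARIANTS = {
    "Juggernaut_Z_V1_FP8_e4m3fn.safetensors": "checkpoints",
    "Juggernaut_Z_V1_by_RunDiffusion_q5_k_m-003.gguf": "unet",
}


def resolve_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hugging Face repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def destination(filename: str, comfy_root: str = "ComfyUI") -> Path:
    """Where the file should land inside the ComfyUI tree."""
    return Path(comfy_root) / "models" / VARIANTS[filename] / filename


def download(filename: str, comfy_root: str = "ComfyUI") -> Path:
    """Fetch one variant; creates the target folder if it is missing."""
    from urllib.request import urlretrieve

    target = destination(filename, comfy_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(resolve_url(filename), str(target))
    return target
```

Calling `download("Juggernaut_Z_V1_FP8_e4m3fn.safetensors")` from the directory that contains `ComfyUI/` is the equivalent of the first `wget` command.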
3. Load the workflow
Drag the IMG-JuggernautZ-Txt2Img.json workflow (download from the RunDiffusion guide) onto the ComfyUI canvas. The Z-Image graph loads three components: the diffusion model (the FP8 / GGUF file above), the Qwen3-4B text encoder (qwen_3_4b.safetensors, CLIP type lumina2), and the Flux VAE (ae.safetensors).
Running
After loading the official workflow JSON, edit the prompt node and hit Queue Prompt. Per the HF model card Recommended Settings table, use the Base-model profile: CFG 6 (range 6–9) and Steps 35 (range 25–45). Start at a moderate resolution — 1024×1024, 832×1216, or 1216×832 — before scaling up.
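A common mistake is carrying Turbo-style values (very low steps and CFG) into this workflow. The sketch below encodes the ranges quoted above as a small sanity check you could run against whatever values you type into the sampler node. `BaseProfile` is a hypothetical helper, not part of ComfyUI or the official workflow.

```python
from dataclasses import dataclass

# Ranges and starting resolutions quoted in the paragraph above.
CFG_RANGE = (6.0, 9.0)
STEPS_RANGE = (25, 45)
STARTING_SIZES = {(1024, 1024), (832, 1216), (1216, 832)}


@dataclass(frozen=True)
class BaseProfile:
    """Hypothetical settings holder; defaults are 35 steps at CFG 6."""

    cfg: float = 6.0
    steps: int = 35
    width: int = 1024
    height: int = 1024

    def warnings(self) -> list[str]:
        """Return human-readable notes for values outside the quoted ranges."""
        notes = []
        if not CFG_RANGE[0] <= self.cfg <= CFG_RANGE[1]:
            notes.append(f"cfg={self.cfg} is outside {CFG_RANGE[0]}-{CFG_RANGE[1]}")
        if not STEPS_RANGE[0] <= self.steps <= STEPS_RANGE[1]:
            notes.append(f"steps={self.steps} is outside {STEPS_RANGE[0]}-{STEPS_RANGE[1]}")
        if (self.width, self.height) not in STARTING_SIZES:
            notes.append(f"{self.width}x{self.height} is not a suggested starting size")
        return notes


if __name__ == "__main__":
    print(BaseProfile().warnings())
    print(BaseProfile(cfg=1.0, steps=8).warnings())
```

The defaults produce no warnings; a Turbo-style `cfg=1.0, steps=8` trips both the CFG and the steps check.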
The Civitai release page for Juggernaut Z v1.0 additionally documents a two-pass setup the model author tunes for sharpness:
- First pass: sampler Res_2s, scheduler Beta, 22 steps, denoise 1.00
- Second pass: sampler Res_2s, scheduler Normal, 3 steps, denoise 0.15
- Recommended resolution: 960×1440 (or a similar pixel area); the author notes that low resolutions like 1024×1024 can sometimes look grainy or noisy with this fine-tune
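"A similar pixel area" is easy to compute for other aspect ratios. The sketch below keeps roughly the 960×1440 pixel count and snaps each side to a multiple of 16. The multiple of 16 is an assumption chosen for illustration, not a documented requirement of the model, and `similar_area_size` is a hypothetical helper.

```python
import math

TARGET_AREA = 960 * 1440  # pixel count of the recommended resolution


def similar_area_size(
    aspect_w: int,
    aspect_h: int,
    target_area: int = TARGET_AREA,
    multiple: int = 16,  # assumed snapping granularity, not from the model card
) -> tuple[int, int]:
    """Return (width, height) near target_area at the given aspect ratio."""
    ratio = aspect_w / aspect_h
    height = math.sqrt(target_area / ratio)
    width = height * ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    for aspect in ((2, 3), (3, 2), (1, 1)):
        print(aspect, similar_area_size(*aspect))
```

For reference, 960×1440 is 1,382,400 pixels, about 32% more than 1024×1024 (1,048,576), so expect proportionally more work per step when you move up to it.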
Results
- Speed: No RTX 3060-named benchmark for Juggernaut Z is published yet, and the backend has no measurement for this pair (/check/juggernaut-z/rtx-3060 currently reports verdict: unknown). The RTX 3060 (GA106, 360 GB/s memory bandwidth, 3584 CUDA cores) is the lowest-bandwidth card in the 12 GB tier and, being Ampere sm_86, runs the FP8 path at dequantized BF16 throughput rather than on FP8 tensor cores (see the admonition above). Quoting an RTX 40-/50-series per-step time as if it were measured here would be a guess, not a measurement, so no speed figure is quoted. If you run it, please submit your numbers.
- VRAM usage (derived): The FP8 e4m3fn transformer is 6.15 GB on disk and the Q4–Q5 GGUF tiers are 4.83–5.68 GB, both cited from the HF repo file listing. The FP8 weights stay FP8 in VRAM on Ampere (the dequantization happens per-op at compute time, not at load), so the resident-weights footprint is preserved at ~6 GB. With ComfyUI's native sequential offload, the Qwen3-4B text encoder (~4 GB at FP8) computes conditioning and is freed before the FP8 transformer dominates the sampling peak, so the FP8 path lands inside the RTX 3060's 12 GB. This is a derived envelope from the cited on-disk sizes, not a measured peak; a measured number will appear on /check/juggernaut-z/rtx-3060 once a community benchmark lands.
- Quality notes: Juggernaut Z is tuned for cinematic lighting, sharper focus, and cleaner portraits versus the upstream Z-Image Base, per the HF model card. It shares the upstream Z-Image "Single-Stream Diffusion Transformer" architecture.
For the full benchmark data, see /check/juggernaut-z/rtx-3060.
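The derived VRAM envelope above is plain arithmetic, sketched here so you can plug in your own numbers. The weight sizes, the ~4 GB encoder figure, and the 10.5 GB lower bound on usable VRAM come from this guide. The `overhead_gb` argument (activations plus VAE) is a placeholder you must supply yourself: the 2.0 used in the demo is an arbitrary illustrative value, not a measurement.

```python
USABLE_VRAM_GB = 10.5  # conservative end of the 10.5-11.3 GB range quoted earlier
TEXT_ENCODER_GB = 4.0  # Qwen3-4B at FP8, per the VRAM note above

# On-disk sizes from the HF repo file listing, as cited in this guide.
WEIGHTS_GB = {
    "BF16": 12.31,
    "FP8 e4m3fn": 6.15,
    "Q5_K_M GGUF": 5.68,
    "Q4_K_S GGUF": 4.83,
}


def sequential_peak_gb(weights_gb: float, overhead_gb: float) -> float:
    """Peak under sequential offload: encoder and transformer never coexist."""
    return max(TEXT_ENCODER_GB, weights_gb + overhead_gb)


def headroom_gb(weights_gb: float, overhead_gb: float) -> float:
    """Usable VRAM left at the peak; negative means it does not fit."""
    return USABLE_VRAM_GB - sequential_peak_gb(weights_gb, overhead_gb)


if __name__ == "__main__":
    assumed_overhead = 2.0  # ASSUMPTION for illustration only, not measured
    for name, size in WEIGHTS_GB.items():
        print(f"{name}: headroom {headroom_gb(size, assumed_overhead):+.2f} GB")
```

Even with `overhead_gb=0`, the BF16 build comes out negative, which matches the text: the weights alone exceed what the card has free.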
Troubleshooting
FP8 weights loaded but inference is no faster than BF16
Expected on Ampere. The RTX 3060's sm_86 tensor cores cover BF16 / FP16 / INT8 / TF32 but not FP8 (e4m3fn / e5m2 first shipped on Hopper sm_90 and Ada sm_89). PyTorch will load the 6.15 GB Juggernaut_Z_V1_FP8_e4m3fn.safetensors variant from the HF repo file listing and dequantize each weight to BF16 / FP16 at compute time — so you keep the ~6 GB on-disk and ~6 GB resident-weights footprint but pay roughly BF16 throughput, not "FP8-on-tensor-cores" throughput. On the 3060 FP8 is the memory escape hatch that lets the model fit 12 GB, not a speed setting; an RTX 40-/50-series owner with FP8 tensor cores would also see the speedup, but you only get the VRAM saving.
BF16 build runs out of memory on the 12 GB card
The original Juggernaut_Z_V1_by_RunDiffusion.safetensors is 12.31 GB on disk per the repo listing — it does not fit a 12 GB card with a display attached. Use the FP8 e4m3fn (6.15 GB) or a GGUF Q4–Q5 (4.83–5.68 GB) variant from the HF repo instead. On Ampere sm_86 both fit the 12 GB envelope; neither buys an FP8 speed boost (see above), but both make the model runnable on this card.
ComfyUI errors out with a missing custom node
The official Juggernaut Z workflow requires the RES4LFY node; install it from ComfyUI Manager → Custom Nodes, then restart ComfyUI. Documented in the RunDiffusion ComfyUI guide.
The text encoder outputs garbage / wrong CLIP type
Z-Image uses the Qwen3-4B text encoder, not a standard CLIP. In ComfyUI set the CLIP type to lumina2 and point it at qwen_3_4b.safetensors; standard CLIP nodes produce unusable conditioning.
The text-encoder offload feels slow
ComfyUI streams the Qwen3-4B text encoder across the PCIe link during sequential offload. The RTX 3060 is on a PCIe Gen4 x16 link, so the offloaded-encoder stage runs at the same Gen4 throughput as on other 12 GB Gen4 cards; the FP8 transformer sampling itself runs on-GPU and is unaffected. If conditioning latency bothers you, step down to a smaller GGUF tier to free enough VRAM to keep the encoder resident.
1024×1024 outputs look noisy or grainy
The Juggernaut Z author flags this on the Civitai release notes: use 960×1440 (or a similar pixel area) instead, or apply the documented two-pass schedule (22 steps Res_2s/Beta at denoise 1.00, then 3 steps Res_2s/Normal at denoise 0.15).
Prefer the diffusers path instead of ComfyUI
The repo also ships the 🤗 Diffusers component layout, loadable with DiffusionPipeline.from_pretrained("RunDiffusion/Juggernaut-Z-Image", torch_dtype=torch.bfloat16) — but that path loads the BF16 weights, which do not fit the 12 GB RTX 3060 with a display attached. Per the HF model card, ZImagePipeline support requires diffusers ≥ 0.37.1 (verified against 0.37.1 and 0.38.0). On this card stay on the FP8 ComfyUI path above; the diffusers BF16 route is for 16 GB+ cards.