Juggernaut Z on RTX 3090: Cinematic Photoreal Fine-Tune of Z-Image Base at BF16 via Diffusers or ComfyUI

What You'll Build

A local install of Juggernaut Z V1 - Team Juggernaut / KandooAI's cinematic, photoreal fine-tune of Tongyi-MAI's Z-Image Base, released through RunDiffusion. This recipe covers two paths on an RTX 3090 (Ampere, sm_86, 24 GB VRAM): a Python script via HuggingFace diffusers, and a ComfyUI workflow using the official RunDiffusion node graph. With 24 GB of VRAM the BF16 weights fit with substantial headroom - enabling the author-recommended 960x1440 / 1440x960 portrait/landscape presets and num_images_per_prompt >= 2 batches that 16 GB-tier cards cannot absorb. Juggernaut Z is tuned for stronger lighting, sharper focus, refined skin texture, and a more cinematic atmosphere than the upstream Base.

Hardware data: RTX 3090 (24 GB VRAM, 936 GB/s memory bandwidth, Ampere sm_86). Runs at BF16 (12.3 GB single-file checkpoint per the HF repo file listing) with ample headroom for higher-resolution presets and small batches. Benchmark data: /check/juggernaut-z/rtx-3090.

Warning - License: CC BY-NC 4.0 (non-commercial). Per the HF model card, Juggernaut Z may be used for non-commercial purposes only; commercial licensing is available via juggernaut@rundiffusion.com. The Civitai release page lists Apache 2.0 in error - the canonical HF card is the source of truth.

Note - Not Z-Image Turbo, not Juggernaut X / Reborn / vanilla Z-Image Base. Juggernaut Z is built on Z-Image Base (not the distilled Turbo) and is a distinct RunDiffusion fine-tune from prior Juggernaut releases (Juggernaut X / Reborn target SDXL; vanilla Z-Image Base is the unfinetuned Tongyi-MAI upstream). The defaults differ accordingly: the HF model card recommends 35 steps at guidance scale 6 (valid ranges: 25-45 steps, 6-9 CFG), not the 8-NFE / CFG 0.0 pattern used by Z-Image Turbo.

Note - Stay on BF16, not FP8, on Ampere. The RTX 3090 is Ampere sm_86, which has BF16 / FP16 / INT8 / TF32 tensor cores but no native FP8 tensor cores (FP8 e4m3fn / e5m2 first shipped on Hopper sm_90 and Ada sm_89). The 6.15 GB Juggernaut_Z_V1_FP8_e4m3fn.safetensors variant in the HF repo will still load on a 3090 but its weights are dequantized to BF16 / FP16 at compute time rather than running on dedicated FP8 hardware, so the VRAM savings come without a matched speed win. With 24 GB to spend you have no reason to drop below BF16 - and on a 3090 doing so will not buy you the speedup a 4090 / 5090 owner would see.
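
The FP8 rule above reduces to a compute-capability comparison. The sketch below is illustrative only: has_native_fp8 and describe_gpu are hypothetical helper names, and the check assumes every architecture at or above sm_89 has FP8 tensor cores.

```python
def has_native_fp8(major: int, minor: int) -> bool:
    """True for Ada (sm_89), Hopper (sm_90), and newer compute capabilities."""
    return (major, minor) >= (8, 9)


def describe_gpu() -> str:
    """Report the local GPU's capability; degrades gracefully without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "native FP8" if has_native_fp8(major, minor) else "no native FP8, prefer BF16"
    return f"{name}: sm_{major}{minor}, {verdict}"


print(describe_gpu())
```

On an RTX 3090 the capability is (8, 6), so the helper lands on the BF16 recommendation.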

Requirements

Component	Minimum	Tested
GPU	16 GB VRAM consumer card (bf16/fp16); ~8 GB with FP8 or GGUF Q4-Q5	RTX 3090 (24 GB)
RAM	16 GB system RAM (32 GB recommended for batched generation)	-
Storage	~13 GB for BF16 / FP16 weights; ~6 GB for FP8; ~5 GB for Q4_K_S GGUF	-
Software	Python 3.10+, PyTorch with CUDA + bf16 support, `diffusers` >= 0.37.1	ComfyUI with RES4LFY node / `diffusers` >= 0.37.1

The 16 GB minimum is anchored on the BF16 weights themselves: the Juggernaut-Z-Image repo file listing ships the canonical BF16 single-file checkpoint at 12.3 GB on disk (the FP16 variant is also 12.3 GB; the Diffusers component layout in transformer/, text_encoder/, vae/ resolves the same weight tensors). With 24 GB on an RTX 3090, the BF16 weights leave ~11.7 GB of headroom for the text encoder, VAE, latents, and activations - enough to comfortably run the author-recommended 960x1440 / 1440x960 presets and num_images_per_prompt=2 batches that 16 GB cards cannot. The repo also ships an FP8 e4m3fn safetensors variant (6.15 GB on disk) and a full GGUF ladder (Q4_K_S 4.83 GB through Q8_0 7.34 GB) for memory-constrained setups, but on a 3090 there is no reason to drop precision below BF16 - and FP8 specifically buys no speed advantage on Ampere sm_86 (see admonition above).

This is a derived runtime envelope based on the cited on-disk weight sizes, not a measured peak - no community benchmark for Juggernaut Z on an RTX 3090 has been published as of writing. Live measurements (when they land via /contribute) will appear at /check/juggernaut-z/rtx-3090.
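
The headroom figure is plain subtraction, shown here as a small sketch. It assumes the on-disk checkpoint size approximates the resident weight footprint and ignores the GB / GiB distinction; headroom_gb is a hypothetical helper name.

```python
VRAM_GB = 24.0           # RTX 3090
BF16_WEIGHTS_GB = 12.3   # on-disk BF16 checkpoint size cited from the HF repo listing


def headroom_gb(total_gb: float, weights_gb: float) -> float:
    """VRAM left for text encoder, VAE, latents, and activations after the weights."""
    return round(total_gb - weights_gb, 1)


print(headroom_gb(VRAM_GB, BF16_WEIGHTS_GB))  # 11.7 on a 24 GB card
print(headroom_gb(16.0, BF16_WEIGHTS_GB))     # 3.7 on a 16 GB card
```

The 16 GB row shows why that tier has little room for larger presets or batches.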

The RTX 3090 is Ampere sm_86, and the default pip install torch wheels already include sm_86 kernels, including the FlashAttention-2 backend of PyTorch's scaled-dot-product attention. xformers is a separate, optional package and is not part of the torch wheel. No cu128-specific wheel selection is required, and nothing needs to be built from source on Ampere.
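
To confirm what an installed wheel covers, a quick check along these lines may help. It is a sketch: wheel_targets_arch and check_torch are hypothetical helper names, while torch.cuda.get_arch_list and torch.cuda.is_bf16_supported are standard PyTorch calls.

```python
def wheel_targets_arch(arch_list, major: int, minor: int) -> bool:
    """True when the wheel's compiled arch list names sm_<major><minor> exactly.

    This is conservative: kernels built for sm_80 also run on sm_86.
    """
    return f"sm_{major}{minor}" in arch_list


def check_torch() -> str:
    """Summarize arch coverage and BF16 support for the first CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    covered = wheel_targets_arch(torch.cuda.get_arch_list(), major, minor)
    bf16 = torch.cuda.is_bf16_supported()
    return f"sm_{major}{minor} in wheel: {covered}, bf16 supported: {bf16}"


print(check_torch())
```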

Installation

Path A - HuggingFace diffusers (Python script)

Per the Juggernaut-Z-Image model card, Juggernaut Z loads through the standard DiffusionPipeline once diffusers is recent enough to know about ZImagePipeline:

pip install -U "diffusers>=0.37.1" transformers accelerate safetensors

Path B - ComfyUI (RunDiffusion workflow)

The official RunDiffusion ComfyUI guide ships an IMG-JuggernautZ-Txt2Img.json workflow that expects the RES4LFY custom node. Install in this order:

# 1. Open ComfyUI Manager -> Custom Nodes Manager -> install "RES4LFY", then restart ComfyUI.

# 2. Download the Juggernaut Z BF16 checkpoint to ComfyUI/models/checkpoints/
#    On a 24 GB RTX 3090 there is no reason to drop below BF16. URLs are from the
#    official RunDiffusion repo: https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/tree/main

wget -P ComfyUI/models/checkpoints/ \
  https://huggingface.co/RunDiffusion/Juggernaut-Z-Image/resolve/main/Juggernaut_Z_V1_by_RunDiffusion.safetensors
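
If wget is unavailable, the same file can be fetched from Python. This is a sketch that assumes the huggingface_hub package is installed (it comes in as a diffusers dependency); resolve_url and download_checkpoint are hypothetical helper names, and the repo and filename are the ones in the wget command above.

```python
REPO_ID = "RunDiffusion/Juggernaut-Z-Image"
FILENAME = "Juggernaut_Z_V1_by_RunDiffusion.safetensors"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL that the wget command uses."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_checkpoint(target_dir: str = "ComfyUI/models/checkpoints") -> str:
    """Download the BF16 checkpoint into target_dir and return the local path."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs network access

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=target_dir)


print(resolve_url(REPO_ID, FILENAME))
```

Calling download_checkpoint() performs the actual 12.3 GB transfer; it is not invoked automatically here.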

Load the IMG-JuggernautZ-Txt2Img.json workflow into ComfyUI by dragging the file onto the canvas (download from the RunDiffusion guide linked above).

Running

Path A - diffusers snippet

The inference snippet below is verbatim from the Juggernaut-Z-Image HF model card, with two 3090-specific knobs surfaced as comments:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-Z-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a cinematic portrait, dramatic lighting",
    guidance_scale=6.0,         # default per HF card; valid range 6-9
    num_inference_steps=35,     # default per HF card; valid range 25-45
    # On a 3090 24 GB you can also pass height=1440, width=960 for the
    # author-recommended portrait preset, or num_images_per_prompt=2 for batched runs.
).images[0]
image.save("output.png")

The HF model card lists the default sampler settings as guidance_scale=6 (valid range 6-9) and num_inference_steps=35 (valid range 25-45).
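
The 3090-specific knobs mentioned in the snippet's comments can be combined as in the sketch below. Only height, width, and num_images_per_prompt come from the recipe; the seeded generator argument is an added assumption, and PRESETS, build_call_kwargs, and generate are hypothetical names. Treating the card's recommended ranges as hard errors is a choice made here, not something the model card mandates.

```python
PRESETS = {"portrait": (1440, 960), "landscape": (960, 1440)}  # (height, width)


def build_call_kwargs(prompt, preset="portrait", steps=35, cfg=6.0, batch=1):
    """Assemble pipeline arguments, rejecting values outside the card's ranges."""
    if not 25 <= steps <= 45:
        raise ValueError("steps should be within 25-45 per the HF card")
    if not 6.0 <= cfg <= 9.0:
        raise ValueError("guidance scale should be within 6-9 per the HF card")
    height, width = PRESETS[preset]
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": cfg,
        "num_images_per_prompt": batch,
    }


def generate(seed=0, **call_kwargs):
    """Load the pipeline (on every call, for brevity) and return a list of images."""
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "RunDiffusion/Juggernaut-Z-Image", torch_dtype=torch.bfloat16
    ).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)  # assumption: seeded runs
    return pipe(generator=generator, **call_kwargs).images
```

A batched portrait run would then be generate(seed=42, **build_call_kwargs("a cinematic portrait, dramatic lighting", batch=2)).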

Path B - ComfyUI

After loading the official workflow JSON, edit the prompt node and hit Queue Prompt. The Civitai release page for Juggernaut Z v1.0 documents an alternative two-pass setup that the model author tunes for sharpness:

First pass: sampler Res_2s, scheduler Beta, 22 steps, denoise 1.00
Second pass: sampler Res_2s, scheduler Normal, 3 steps, denoise 0.15
Recommended resolutions: 960x1440 (portrait) or 1440x960 (landscape) - the Civitai notes call out that 1024x1024 "will sometimes look too grainy/noisy" with this fine-tune

The 3090's 24 GB makes both presets straightforward at BF16; you can also bump to a small batch (num_images_per_prompt=2 in diffusers, or duplicate the sampler output node in the ComfyUI graph) without running into the activation and VAE-decode memory pressure that pushes a 16 GB sibling card to its limit.

Results

Speed: No community benchmark for Juggernaut Z on RTX 3090 has been published as of writing. The closely-related Z-Image Turbo has been measured on an RTX 4090 (Ada sm_89), but that figure is not transferable to either Juggernaut Z (un-distilled Base path at 35 steps vs Turbo's 8) or to the RTX 3090 (Ampere sm_86 dense BF16 throughput is materially lower than Ada's). We deliberately omit a specific seconds-per-image number here rather than fabricate one. When a measured 3090 benchmark lands via /contribute, it will appear on /check/juggernaut-z/rtx-3090.
VRAM usage: Derived envelope of ~13-15 GB at BF16, 1024x1024, batch size 1 - based on the cited 12.3 GB on-disk BF16 weights per the HF repo file listing plus typical Z-Image-class text encoder + VAE + latent overhead. The RTX 3090's 24 GB absorbs this comfortably with ~9-11 GB of headroom for higher resolutions (960x1440, 1440x960, or up to 2048x2048 within the Z-Image Base card spec) or small batches. This is a derived envelope - not a measured peak - and will be replaced by live data once a community measurement is submitted at /check/juggernaut-z/rtx-3090.
Quality notes: Per the HF card, Juggernaut Z is licensed CC BY-NC 4.0 (non-commercial; commercial licensing via juggernaut@rundiffusion.com). Tuned for "stronger lighting, sharper focus, more refined skin texture, and more cinematic atmosphere" relative to Z-Image Base.

For the full benchmark data, see /check/juggernaut-z/rtx-3090.

Troubleshooting

ComfyUI errors out with a missing custom node

The official Juggernaut Z workflow requires the RES4LFY node; install it from ComfyUI Manager -> Custom Nodes, then restart ComfyUI. Documented in the RunDiffusion ComfyUI guide.

`DiffusionPipeline` raises "Cannot find pipeline class ZImagePipeline"

This error means the installed diffusers predates ZImagePipeline. The HF model card requires diffusers 0.37.1 or later; upgrade with pip install -U "diffusers>=0.37.1". If the class is still missing after upgrading, install from main: pip install git+https://github.com/huggingface/diffusers.
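
A quick way to see which diffusers version the active environment resolves is sketched below. parse_version, meets_minimum, and check_diffusers are hypothetical helper names, and the parser deliberately handles only simple dotted versions.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.37.1' or '0.38.0.dev0' into a tuple of the leading integer parts."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.37.1") -> bool:
    """Compare the installed version against the model card's minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_diffusers() -> str:
    """Report the installed diffusers version and whether it is new enough."""
    try:
        installed = metadata.version("diffusers")
    except metadata.PackageNotFoundError:
        return "diffusers is not installed"
    status = "OK" if meets_minimum(installed) else "too old, upgrade"
    return f"diffusers {installed}: {status}"


print(check_diffusers())
```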

1024x1024 outputs look noisy or grainy

The Juggernaut Z author flags this in the Civitai release notes: use 960x1440 / 1440x960 instead, or apply the documented two-pass schedule (22 steps Res_2s/Beta at denoise 1.00, then 3 steps Res_2s/Normal at denoise 0.15). The non-square presets carry about 32% more pixels than 1024x1024, which the 3090's 24 GB absorbs without offload.

FP8 weights loaded but inference is no faster than BF16

Expected on Ampere. The RTX 3090's sm_86 tensor cores cover BF16 / FP16 / INT8 / TF32 but not FP8 (e4m3fn / e5m2 first shipped on Hopper sm_90 and Ada sm_89). The 6.15 GB Juggernaut_Z_V1_FP8_e4m3fn.safetensors variant listed in the HF repo still loads, but its weights are upcast to BF16 / FP16 at compute time, so you keep the ~6 GB on-disk and ~6 GB resident-weight footprint but get roughly BF16 throughput rather than FP8 tensor-core throughput. On a 24 GB 3090 the right default is BF16; FP8 only makes sense if you are sharing the card with another workload that needs the freed VRAM.

Want to push higher resolution or batch size

The 3090's 24 GB unlocks two axes the 16 GB sibling cards cannot run:

Higher resolutions - the Z-Image Base card documents the supported range as 512x512 to 2048x2048 (total pixel area, any aspect ratio). On a 3090 you can request 1536x1536 or 1440x1920 directly through pipe(..., height=H, width=W) without offload, though wall-clock time scales with pixel count and the 3090's per-step throughput is lower than Ada-class cards on this workload.
Small batches - pass num_images_per_prompt=2 in the diffusers call, or replicate the sampler output in ComfyUI, to render two variants per Queue Prompt. Activation memory scales roughly linearly with batch size while the weight footprint stays fixed; batch=2 at 1024x1024 BF16 stays well within the 24 GB envelope.
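
Because cost tracks pixel count, it can help to compare a target resolution against the 1024x1024 baseline before queuing it. The sketch below uses the pixel-area range quoted above; pixel_ratio and within_documented_area are hypothetical helper names, and pixel ratio is only a rough proxy for time and activation memory.

```python
BASELINE = (1024, 1024)
MIN_AREA = 512 * 512      # lower bound of the documented pixel-area range
MAX_AREA = 2048 * 2048    # upper bound of the documented pixel-area range


def pixel_ratio(width: int, height: int, baseline=BASELINE) -> float:
    """Pixel count relative to the baseline resolution."""
    return (width * height) / (baseline[0] * baseline[1])


def within_documented_area(width: int, height: int) -> bool:
    """True when the total pixel area falls inside the documented range."""
    return MIN_AREA <= width * height <= MAX_AREA


for w, h in [(960, 1440), (1536, 1536), (1440, 1920)]:
    print(f"{w}x{h}: {pixel_ratio(w, h):.2f}x pixels, in range: {within_documented_area(w, h)}")
```

For the three sizes listed this reports roughly 1.32x, 2.25x, and 2.64x the baseline pixel count, all inside the documented range.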

If you do measure peak VRAM at these settings, please submit your numbers so /check/juggernaut-z/rtx-3090 can pick them up.
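
One way to capture such a number is PyTorch's peak-allocation counter, sketched below. measure_peak_vram and bytes_to_gib are hypothetical helper names; note that torch.cuda.max_memory_allocated reports tensor allocations made through PyTorch, so it will read lower than the total shown by nvidia-smi, which also includes the CUDA context and cached blocks.

```python
def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB."""
    return n / 1024**3


def measure_peak_vram(run):
    """Call run() and return (result, peak GiB allocated by PyTorch during the call)."""
    import torch  # imported lazily so the helper can be defined without a GPU

    torch.cuda.reset_peak_memory_stats()
    result = run()
    torch.cuda.synchronize()  # make sure all queued GPU work has finished
    return result, bytes_to_gib(torch.cuda.max_memory_allocated())
```

With a loaded pipeline, usage would look like images, peak = measure_peak_vram(lambda: pipe("a cinematic portrait, dramatic lighting", guidance_scale=6.0, num_inference_steps=35).images).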