How much VRAM does Z-Image Turbo need?

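About 17 GB of working set for the 8-bit build this recipe targets. On a Mac that comes out of GPU-addressable unified memory, not dedicated VRAM; the pre-quantized 4-bit build needs about 6 GB.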

How hard is this setup?

Beginner. Follow the steps below.

Z-Image Turbo on Apple M2 Max: 8-step 1024x1024 text-to-image in unified memory with mflux

What You'll Build

A fully local install of Z-Image-Turbo, Alibaba Tongyi-MAI's 6B-parameter distilled text-to-image model, generating 1024×1024 images in 8 NFEs (function evaluations, i.e. DiT forward passes) on an Apple M2 Max. It runs on mflux, a native MLX implementation for Apple Silicon, with no NVIDIA GPU, no CUDA, and no FlashAttention. Z-Image-Turbo pairs a Scalable Single-Stream DiT (S3-DiT) with a large text encoder and is tuned for fast few-step sampling; mflux ships a first-class Z-Image path with on-the-fly 8-bit quantization, so a single command turns a prompt into a PNG entirely in the M2 Max's unified memory.

Hardware data: Apple M2 Max (64 GB unified memory) · mflux 8-bit Z-Image-Turbo, ~17 GB-class working set · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — not 64 GB of dedicated VRAM. macOS lets the GPU address only ~75% of it by default (~48 GB via Metal's recommendedMaxWorkingSetSize). Z-Image-Turbo's full-precision weights are ~33 GB on disk, which would sit tightly against that ~48 GB ceiling at runtime — so on Apple Silicon the recommended path is mflux's 8-bit build (-q 8), whose working set is roughly half that and clears the default addressable pool with room to spare. No wired-limit tuning is needed at 8-bit on a 64 GB Mac.
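The addressable-memory arithmetic in the note above can be sketched in a few lines of shell. This is only an estimate: the 75% share is the approximation quoted in this recipe, not a value read from the system, and `addressable_gb` is a hypothetical helper name. The authoritative figure on a given Mac is whatever Metal reports as recommendedMaxWorkingSetSize.

```shell
# Rough estimate of GPU-addressable unified memory, in whole GB (rounded down).
# Assumption: the default share is ~75%, as quoted in this recipe.
addressable_gb() {
  total_gb=$1
  share_pct=${2:-75}
  echo $(( total_gb * share_pct / 100 ))
}

addressable_gb 64   # prints 48
```

The ~10.5 GB quoted for a 16 GB Mac implies a smaller share (roughly two thirds), so pass the share explicitly as the second argument for that case rather than relying on the 75% default.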

Note on variants: The Tongyi-MAI Z-Image family ships multiple weight sets — Z-Image (Base), Z-Image-Turbo, and a further-compressed distilled build. This recipe targets Z-Image-Turbo, the consumer-friendly few-step variant (the mflux model table lists Z-Image as 6B · Distilled & Base · "Fast, small, very good quality and realism", per the mflux README). Fine-tunes like Juggernaut-Z are a separate model with their own slug.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable, use `-q 4`)	Apple M2 Max (64 GB unified memory, ~48 GB addressable)
RAM	Same pool — unified	64 GB unified
Storage	~6 GB (mflux 4-bit mirror) / ~33 GB (full-precision Tongyi-MAI repo)	~6 GB (4-bit mirror) / ~33 GB (full repo, quantized to 8-bit on the fly; the ~17 GB figure is the in-memory working set)
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. Z-Image-Turbo's full-precision weights total ~33 GB on disk: transformer 24.6 GB + text encoder 8.0 GB + VAE 0.17 GB (HF tree, Tongyi-MAI/Z-Image-Turbo). Against the ~48 GB the M2 Max's GPU can address by default, the full-precision build runs but leaves a thin margin once activations and the OS are accounted for. mflux's pre-quantized 4-bit mirror, filipstrand/Z-Image-Turbo-mflux-4bit, is only ~5.9 GB on disk (transformer ~3.46 GB + text encoder ~2.26 GB + VAE 0.17 GB), and the on-the-fly 8-bit path (-q 8) lands roughly midway between the two. The HF model card states that, on NVIDIA hardware, Z-Image-Turbo "fits comfortably within 16G VRAM consumer devices" (Tongyi-MAI model card). On Apple, the 8-bit mflux build fits comfortably inside the M2 Max's ~48 GB addressable pool, and the 4-bit build runs even on a 16 GB Mac (~10.5 GB addressable).
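As a quick check on the totals quoted above, the per-component sizes can be added up with a small helper. This is a sketch only: `sum_gb` is a hypothetical name, the inputs are the on-disk figures from this recipe, and on-disk size is a lower bound on the runtime footprint rather than a measurement of it.

```shell
# Sum component sizes (GB) passed as arguments; prints two decimals.
sum_gb() {
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# transformer + text encoder + VAE, per the figures in this recipe
full=$(sum_gb 24.6 8.0 0.17)
four_bit=$(sum_gb 3.46 2.26 0.17)
echo "full precision: ${full} GB, 4-bit mirror: ${four_bit} GB"
```

The full-precision total comes to 32.77 GB, which leaves roughly 15 GB under the ~48 GB ceiling before activations and the OS take their share.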

Installation

1. Install mflux (the Apple-native MLX image path)

uv tool install --upgrade mflux

mflux is a from-scratch MLX implementation of the FLUX / Qwen-Image / Z-Image families (filipstrand/mflux). There is nothing CUDA-shaped to install — no torch CUDA wheel, no cu12x index, no FlashAttention, no bitsandbytes. If you prefer pip, pip install -U mflux works too; the project recommends the uv tool install above. This pulls the mflux-generate-z-image-turbo entry point onto your PATH.
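Before generating anything, it can help to confirm that the install actually put the entry point on your PATH. The sketch below uses only standard shell built-ins; `check_tool` is a hypothetical helper name, not part of mflux.

```shell
# Report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

check_tool uv
check_tool mflux-generate-z-image-turbo
uname -m   # a native Apple Silicon shell reports arm64
```

If the entry point shows as missing right after `uv tool install`, the likely cause is that uv's tool directory is not on PATH in the current shell; opening a new terminal is usually the first thing to try.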

2. Generate an image (weights download on first use)

mflux-generate-z-image-turbo \
  --prompt "A puffin standing on a cliff" \
  --width 1280 \
  --height 500 \
  --seed 42 \
  --steps 9 \
  -q 8

This command is verbatim from the mflux README's Z-Image example. On first run, mflux pulls the Z-Image-Turbo weights from Hugging Face and caches them under ~/.cache/huggingface; -q 8 quantizes the weights to 8-bit as they load (mflux's --quantize accepts 3, 4, 5, 6, or 8 per the mflux quantization docs). The --steps 9 value matches Z-Image-Turbo's recommended schedule — the model card notes the canonical setting is 8 NFEs, and num_inference_steps=9 "actually results in 8 DiT forwards" (Tongyi-MAI model card). The --width 1280 --height 500 above is the README's stock example — swap in --width 1024 --height 1024 for a square image. The PNG lands in the working directory.
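To see whether the first-run download has already happened, and how much disk it took, you can look for the repo's folder in the Hugging Face cache. This sketch assumes mflux uses the default hub cache layout (`hub/models--<org>--<name>` under `HF_HOME`, which defaults to ~/.cache/huggingface); `cache_dir_for` is a hypothetical helper name.

```shell
# Map a Hugging Face repo id to its expected hub cache folder.
# Assumption: default hub layout, <HF_HOME>/hub/models--<org>--<name>.
cache_dir_for() {
  printf '%s/hub/models--%s\n' "${HF_HOME:-$HOME/.cache/huggingface}" \
    "$(printf '%s' "$1" | sed 's|/|--|g')"
}

dir=$(cache_dir_for Tongyi-MAI/Z-Image-Turbo)
if [ -d "$dir" ]; then
  du -sh "$dir"          # on-disk size of the cached weights
else
  echo "not cached yet: $dir"
fi
```

Because -q 8 quantizes at load time, the cached folder should be about the size of the full-precision repo, not the 8-bit working set.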

3. (Optional) Pull the pre-quantized 4-bit weights for small Macs

On a 16 GB Mac, run the pre-quantized 4-bit mirror so nothing has to be quantized on the fly:

mflux-generate-z-image-turbo \
  --model filipstrand/Z-Image-Turbo-mflux-4bit \
  --prompt "A puffin standing on a cliff" \
  --width 1024 \
  --height 1024 \
  --seed 42 \
  --steps 9

The --model filipstrand/Z-Image-Turbo-mflux-4bit form is shown in the mflux Z-Image model README. At ~5.9 GB on disk, the weights fit within the ~10.5 GB that a 16 GB Mac's GPU can address.
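The choice between the 8-bit path and the 4-bit mirror can be written down as a small rule of thumb. This is a sketch under stated assumptions: the ~17 GB and ~6 GB working-set figures come from this recipe, `MARGIN_GB` is an assumed safety margin and not a measured value, and `pick_build` is a hypothetical helper that takes your GPU-addressable budget in whole GB.

```shell
# Suggest mflux arguments from the GPU-addressable budget (whole GB).
# Working-set figures are the ones quoted in this recipe.
MARGIN_GB=2   # assumed headroom for activations; adjust to taste

pick_build() {
  budget_gb=$1
  if [ "$budget_gb" -ge $(( 17 + MARGIN_GB )) ]; then
    printf '%s\n' "-q 8"
  elif [ "$budget_gb" -ge $(( 6 + MARGIN_GB )) ]; then
    printf '%s\n' "--model filipstrand/Z-Image-Turbo-mflux-4bit"
  else
    printf '%s\n' "insufficient"
  fi
}

pick_build 48   # prints -q 8
pick_build 10   # prints --model filipstrand/Z-Image-Turbo-mflux-4bit
```

The thresholds are deliberately simple. Treat the output as a starting point and confirm with Activity Monitor's memory-pressure gauge on a real run.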

Running

After installation the same mflux-generate-z-image-turbo command is your day-to-day entry point — change --prompt, --width/--height, and --seed to taste. Z-Image-Turbo "excels at accurately rendering complex Chinese and English text" and at photorealistic, bilingual (English & Chinese) generation per the model card, so prompts with embedded text or non-Latin scripts are a genuine strength. Keep --steps 9 (the 8-NFE schedule the Turbo variant is distilled for); pushing far higher rarely helps a step-distilled model.

mflux-generate-z-image-turbo \
  --prompt "storefront sign reading '欢迎光临 · OPEN', neon, rain-slick pavement at dusk" \
  --width 1024 \
  --height 1024 \
  --seed 7 \
  --steps 9 \
  -q 8

If you are tight on memory while a large generation runs, mflux exposes a --low-ram flag that trades some speed for a smaller resident footprint (mflux quantization docs).
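For day-to-day use, a small wrapper makes it easy to try several seeds for one prompt. The sketch below uses only the flags shown in this recipe. `run` and `sweep` are hypothetical helper names, and the script defaults to a dry run that prints each command instead of executing it. How successive runs name their PNGs is not covered here, so check `mflux-generate-z-image-turbo --help` before relying on a sweep to keep every image.

```shell
# Run a command, or just print it when DRY_RUN is 1 (the default here).
# The printed form is for inspection only; it is not re-quoted for pasting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One 1024x1024 image per seed, using the recipe's 9-step, 8-bit settings.
sweep() {
  prompt_text=$1
  shift
  for seed in "$@"; do
    run mflux-generate-z-image-turbo \
      --prompt "$prompt_text" \
      --width 1024 --height 1024 \
      --seed "$seed" --steps 9 -q 8
  done
}

sweep "A puffin standing on a cliff" 7 42 1234
```

Set `DRY_RUN=0` in the environment to execute the commands for real. Each invocation reloads and re-quantizes the weights, so a sweep costs roughly one full run per seed.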

Alternative: Draw Things (Metal-native GUI)

If you would rather not touch the terminal, Draw Things is a native Apple-Silicon Metal app (iOS / iPadOS / macOS) with its own Metal attention engine — a point-and-click alternative to mflux for Apple-Silicon image generation. It is the GUI counterpart to the mflux CLI path on the same hardware; consult its in-app model browser for the current Z-Image availability, since model support there evolves independently of mflux.

Results

Speed: No first-party Apple M2 Max benchmark for this pair has been recorded yet; /check/z-image-turbo/m2-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a seconds-per-image figure: image-generation throughput on Apple Silicon is bound by the M2 Max's ~400 GB/s unified-memory bandwidth and its 38-core GPU, and no chip-named first-party number exists for Z-Image-Turbo on this Mac. The latency numbers published in the NVIDIA recipes (e.g. ~2.3 s on an RTX 4090) come from CUDA hardware and do not carry over to Apple Silicon. If you run this, please contribute your timing so we can seed a real M2 Max datapoint.
Memory usage: ~17 GB-class working set for the 8-bit (-q 8) build, ~6 GB for the pre-quantized 4-bit mirror — both well inside the M2 Max's ~48 GB default-addressable pool. The full-precision ~33 GB build also runs but with a thinner margin; 8-bit is the recommended default. Live measurements (once contributed): /check/z-image-turbo/m2-max.
Quality notes: Z-Image-Turbo is a step-distilled 6B model on a Scalable Single-Stream DiT (S3-DiT) architecture, designed so its 8-NFE output rivals full-step competitors (model card). 8-bit (-q 8) preserves quality very close to full precision while halving the footprint; drop to -q 4 (or the 4-bit mirror) only when memory forces it, accepting a small fidelity trade. Strengths per launch coverage: photorealistic portraits and strong bilingual (English + Chinese) text rendering.

For the full benchmark data (and to be the first to populate it), see /check/z-image-turbo/m2-max.

Troubleshooting

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — mflux runs entirely on MLX with Metal, and quantizes via its own -q path rather than --load-in-4bit, GPTQ, AWQ, FP8, or NVFP4. If a generic Z-Image or diffusers tutorial tells you to pip install flash-attn, select a cu128 wheel index, or load an FP8/GPTQ build, skip those steps entirely; the mflux-generate-z-image-turbo commands above are the complete Apple path.

`bf16`-style precision errors or unexpectedly slow generation

mflux handles dtype internally on MLX, so you should not hit the bf16-breaks-on-MPS pitfall that affects hand-rolled PyTorch-MPS pipelines. If generation is slower than expected, confirm you passed -q 8 (or used the 4-bit mirror) — running the full-precision ~33 GB weights leaves little headroom on a 64 GB Mac and can push macOS into memory pressure. Watch Activity Monitor's Memory-Pressure gauge; if it goes yellow/red, switch to -q 8 or -q 4.

Do I need to raise the unified-memory wired limit?

Not for the 8-bit or 4-bit builds on a 64 GB M2 Max: both fit the ~48 GB default-addressable share comfortably. Raising the limit with sudo sysctl iogpu.wired_limit_mb=<MB> (macOS Sonoma 14 / Sequoia 15+; older macOS uses debug.iogpu.wired_limit in bytes) matters only if you insist on running the full-precision ~33 GB weights and macOS reports memory pressure. In that case raise the limit but leave 8–16 GB of headroom for the OS; the setting is temporary and resets on reboot. For the recommended -q 8 path, leave the default alone.
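If you do decide to raise the limit for the full-precision weights, the value to pass can be worked out from total memory and the headroom you want to keep. This sketch assumes 1 GB = 1024 MB for the setting; `wired_limit_mb` is a hypothetical helper name. It only prints the sysctl command so you can review it before running anything with sudo.

```shell
# Compute a wired-limit value in MB that leaves headroom for the OS.
# Assumption: 1 GB = 1024 MB for this setting.
wired_limit_mb() {
  total_gb=$1
  headroom_gb=$2
  echo $(( (total_gb - headroom_gb) * 1024 ))
}

# Print (do not run) the command for a 64 GB Mac with 8 GB kept back.
echo "sudo sysctl iogpu.wired_limit_mb=$(wired_limit_mb 64 8)"
```

On a 64 GB machine, keeping 16 GB back gives 49152 MB, which is about the same as the ~48 GB default share. Only the lower end of the 8–16 GB headroom range is a meaningful raise on this Mac.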

Confusion with Juggernaut-Z or the wrong "Z-Image"

Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base — a different model with its own slug. For the original Tongyi-MAI turbo weights, stick to Tongyi-MAI/Z-Image-Turbo (or the filipstrand/Z-Image-Turbo-mflux-4bit mirror) linked above, and the mflux-generate-z-image-turbo entry point, which targets exactly that model family.

No other widely-reported issues. Report problems via the submission form.