What You'll Build
A fully local install of Z-Image-Turbo, Alibaba Tongyi-MAI's 6B-parameter distilled text-to-image model, generating 1024×1024 images in 8 NFEs on an Apple M2 Pro with 16 GB of unified memory. It runs on the Apple-native mflux (an MLX implementation), with no NVIDIA GPU, no CUDA, and no FlashAttention. Z-Image-Turbo pairs a Scalable Single-Stream DiT (S3-DiT) with a large text encoder and is tuned for fast few-step sampling; mflux ships a first-class Z-Image path and a pre-quantized 4-bit mirror sized for memory-constrained Macs, so a single command turns a prompt into a PNG entirely within that 16 GB.
Hardware data: Apple M2 Pro (16 GB unified memory) · mflux pre-quantized 4-bit Z-Image-Turbo, ~5.9 GB on disk · See benchmark data
ℹ️ Unified memory is not VRAM, and 16 GB is tight. The M2 Pro has 16 GB of unified memory shared by the CPU and GPU, not 16 GB of dedicated VRAM. By default macOS lets the GPU address only about two-thirds of it (~10.5 GB, as reported by Metal's recommendedMaxWorkingSetSize on a sub-64 GB Mac). That is the binding constraint here: the full-precision Z-Image-Turbo weights are ~33 GB and the on-the-fly 8-bit build is a ~17 GB-class working set; both exceed 16 GB of physical memory and will not load on this chip. The path that fits is mflux's pre-quantized 4-bit mirror, which is only ~5.9 GB on disk and clears the ~10.5 GB addressable pool comfortably. Start there.
Note on variants: The Tongyi-MAI Z-Image family ships multiple weight sets: Z-Image (Base), Z-Image-Turbo, and a further-compressed distilled build. This recipe targets Z-Image-Turbo, the consumer-friendly few-step variant. The mflux README's model table lists Z-Image as 6B · Distilled & Base · "Fast, small, very good quality and realism". Fine-tunes like Juggernaut-Z are separate models with their own slugs.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU / memory | 16 GB unified memory (~10.5 GB GPU-addressable; use the pre-quantized 4-bit mirror) | Apple M2 Pro (16 GB unified memory, ~10.5 GB addressable) |
| RAM | Same pool — unified | 16 GB unified |
| Storage | ~6 GB (mflux pre-quantized 4-bit mirror) | ~6 GB |
| Software | Python 3.10+, macOS Sonoma 14 / Sequoia 15+ | macOS Sequoia 15 |
The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. On a 16 GB Mac the GPU can address only ~10.5 GB by default. The pre-quantized 4-bit mirror, filipstrand/Z-Image-Turbo-mflux-4bit, is ~5.9 GB on disk (transformer ~3.46 GB + text encoder ~2.27 GB + VAE 0.16 GB, per its HF file tree) and is the path this recipe uses. By contrast, the full-precision Tongyi-MAI/Z-Image-Turbo repo is ~33 GB (transformer ~24.6 GB + text encoder ~8.0 GB + VAE 0.17 GB) and the on-the-fly 8-bit build is a ~17 GB-class working set. Neither fits in 16 GB; those paths require a Mac with ≥ 48 GB of unified memory. The HF model card states that Z-Image-Turbo "fits comfortably within […] 16G VRAM consumer devices" (Tongyi-MAI model card), but that refers to NVIDIA hardware; on a 16 GB Apple Silicon Mac, the equivalent is the pre-quantized 4-bit mflux mirror.
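The size arithmetic above can be re-checked with a few lines of shell. This is a minimal sketch: the component sizes and the ~10.5 GB pool are copied from this guide, and the helper names `sum_gb` and `fits_pool` are hypothetical. It compares on-disk sizes only, so treat it as a lower bound on the real runtime footprint, not a measurement.

```shell
# Default GPU-addressable pool on a 16 GB Mac, as quoted in this guide.
POOL_GB=10.5

# sum_gb A B C ... -> prints the total of its arguments with two decimals.
sum_gb() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# fits_pool TOTAL_GB -> exit status 0 if TOTAL_GB <= POOL_GB, 1 otherwise.
fits_pool() {
    awk -v t="$1" -v p="$POOL_GB" 'BEGIN { exit !((t + 0) <= (p + 0)) }'
}

# transformer + text encoder + VAE, in GB (figures from the text above)
four_bit=$(sum_gb 3.46 2.27 0.16)
full=$(sum_gb 24.6 8.0 0.17)

for build in "4-bit mirror:$four_bit" "full precision:$full"; do
    name=${build%%:*}
    size=${build##*:}
    if fits_pool "$size"; then verdict="fits"; else verdict="does not fit"; fi
    printf '%-15s %6s GB  %s in %s GB\n' "$name" "$size" "$verdict" "$POOL_GB"
done
```

Activations and intermediate buffers add to the weight sizes at run time, so a build that only just clears the pool on disk may still not load.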
Installation
1. Install mflux (the Apple-native MLX image path)
uv tool install --upgrade mflux
mflux is a from-scratch MLX implementation of the FLUX / Qwen-Image / Z-Image families (filipstrand/mflux). There is nothing CUDA-shaped to install — no torch CUDA wheel, no cu12x index, no FlashAttention, no bitsandbytes. If you prefer pip, pip install -U mflux works too; the project recommends the uv tool install above. This pulls the mflux-generate-z-image-turbo entry point onto your PATH.
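Before moving on, it can help to confirm that the entry point really landed on your PATH. The sketch below uses only the POSIX `command -v` builtin; the helper name `have_tool` is hypothetical.

```shell
# have_tool NAME -> exit status 0 if NAME resolves on PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool mflux-generate-z-image-turbo; then
    echo "found: $(command -v mflux-generate-z-image-turbo)"
else
    echo "mflux-generate-z-image-turbo is not on PATH; re-run the install step or open a new shell" >&2
fi
```

If the tool is installed but not found, the uv tool directory is probably missing from PATH; `uv tool update-shell` is the uv command intended to fix that, after which a new terminal session is needed.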
2. Generate an image with the pre-quantized 4-bit mirror (weights download on first use)
mflux-generate-z-image-turbo \
--model filipstrand/Z-Image-Turbo-mflux-4bit \
--prompt "A puffin standing on a cliff" \
--width 1024 \
--height 1024 \
--seed 42 \
--steps 9
The --model filipstrand/Z-Image-Turbo-mflux-4bit form is shown in the mflux Z-Image model README, which recommends using the pre-quantized 4-bit model on memory-constrained machines rather than quantizing during generation. On first run, mflux pulls the ~5.9 GB 4-bit weights from Hugging Face and caches them under ~/.cache/huggingface. The --steps 9 value matches Z-Image-Turbo's recommended schedule — the model card notes the canonical setting is 8 NFEs, and num_inference_steps=9 "actually results in 8 DiT forwards" (Tongyi-MAI model card). The PNG lands in the working directory.
Because the mirror is already 4-bit, you do not pass a -q/--quantize flag here — that flag (which accepts 3, 4, 5, 6, or 8 per the mflux quantization docs) is for on-the-fly quantization of the full-precision repo, which is the path you are avoiding on a 16 GB Mac.
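To confirm the download completed and is roughly the size quoted above, you can inspect the Hugging Face cache directly. This sketch assumes the standard hub layout (`<HF_HOME>/hub/models--<org>--<name>`) and that mflux uses the default cache location; the helper name `hf_cache_dir` is hypothetical.

```shell
# hf_cache_dir REPO_ID -> prints the expected hub cache directory for a repo.
# Assumes the standard layout: <HF_HOME>/hub/models--<org>--<name>.
hf_cache_dir() {
    hf_home=${HF_HOME:-$HOME/.cache/huggingface}
    printf '%s/hub/models--%s\n' "$hf_home" "$(printf '%s' "$1" | sed 's|/|--|g')"
}

dir=$(hf_cache_dir filipstrand/Z-Image-Turbo-mflux-4bit)
if [ -d "$dir" ]; then
    du -sh "$dir"    # should be in the neighbourhood of the ~5.9 GB quoted above
else
    echo "not cached yet: $dir"
fi
```

A directory much smaller than expected usually means an interrupted download; re-running the generate command should resume it.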
Running
After installation the same mflux-generate-z-image-turbo --model filipstrand/Z-Image-Turbo-mflux-4bit command is your day-to-day entry point — change --prompt, --width/--height, and --seed to taste. Z-Image-Turbo "excels at accurately rendering complex Chinese and English text" and at photorealistic, bilingual (English & Chinese) generation per the model card, so prompts with embedded text or non-Latin scripts are a genuine strength. Keep --steps 9 (the 8-NFE schedule the Turbo variant is distilled for); pushing far higher rarely helps a step-distilled model.
mflux-generate-z-image-turbo \
--model filipstrand/Z-Image-Turbo-mflux-4bit \
--prompt "storefront sign reading '欢迎光临 · OPEN', neon, rain-slick pavement at dusk" \
--width 1024 \
--height 1024 \
--seed 7 \
--steps 9
If you are tight on memory while a generation runs — likely on a 16 GB Mac if you have other apps open — mflux exposes a --low-ram flag that reduces memory usage at the cost of some performance (mflux quantization docs). Add it to the command above when macOS reports memory pressure.
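For day-to-day use, a small wrapper makes it easy to sweep seeds and toggle `--low-ram` without retyping the command. This is a sketch under stated assumptions: the function name `generate` and the `DRY_RUN` / `LOW_RAM` variables are hypothetical, and the `--output` flag is assumed to be accepted by the Z-Image entry point as it is by mflux's other generate commands (drop that line if your version rejects it). `DRY_RUN` defaults to 1, so the script only prints the commands until you opt in.

```shell
MODEL="filipstrand/Z-Image-Turbo-mflux-4bit"

# generate PROMPT SEED -> runs (or, with DRY_RUN=1, prints) one generation.
# LOW_RAM=1 appends --low-ram. --output is an assumption, see the note above.
generate() {
    prompt=$1
    seed=$2
    set -- mflux-generate-z-image-turbo \
        --model "$MODEL" \
        --prompt "$prompt" \
        --width 1024 --height 1024 \
        --seed "$seed" \
        --steps 9 \
        --output "z-image-seed-$seed.png"
    if [ "${LOW_RAM:-0}" = "1" ]; then
        set -- "$@" --low-ram
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"    # preview only; quoting is flattened for display
    else
        "$@"
    fi
}

# Preview three seeds for the same prompt; set DRY_RUN=0 to actually generate.
for seed in 7 42 1234; do
    generate "A puffin standing on a cliff" "$seed"
done
```

Running generations one after another, rather than in parallel, keeps only one copy of the model resident at a time, which matters on a 16 GB machine.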
Alternative: Draw Things (Metal-native GUI)
If you would rather not touch the terminal, Draw Things is a native Apple-Silicon Metal app (iOS / iPadOS / macOS) with its own Metal attention engine — a point-and-click alternative to mflux for Apple-Silicon image generation. It is the GUI counterpart to the mflux CLI path on the same hardware; consult its in-app model browser for the current Z-Image availability, since model support there evolves independently of mflux.
Results
- Speed: No first-party Apple M2 Pro benchmark for this pair has been recorded yet; /check/z-image-turbo/m2-pro currently returns verdict: unknown with no measurements. We are deliberately not quoting a seconds-per-image figure: image-generation throughput on Apple Silicon is bound by unified-memory bandwidth (the M2 Pro's ~200 GB/s is well below that of the Max and Ultra chips), and no chip-named first-party number exists for Z-Image-Turbo on this Mac. Latency figures published for NVIDIA hardware (CUDA) do not carry over to Apple Silicon, and neither do numbers from larger Apple chips. If you run this, please contribute your timing so we can seed a real M2 Pro datapoint.
- Memory usage: ~5.9 GB on disk for the pre-quantized 4-bit mirror, comfortably inside the M2 Pro's ~10.5 GB default-addressable pool. The on-the-fly 8-bit (-q 8) and full-precision (~33 GB) builds do not fit in 16 GB and require a Mac with ≥ 48 GB of unified memory. Live measurements (once contributed): /check/z-image-turbo/m2-pro.
- Quality notes: Z-Image-Turbo is a step-distilled 6B model on a Scalable Single-Stream DiT (S3-DiT) architecture, designed so its 8-NFE output rivals full-step competitors (model card). The 4-bit build trades a small amount of fidelity for the memory headroom that makes Z-Image-Turbo runnable on a 16 GB Mac at all; on a Mac with more unified memory you would step up to the 8-bit or full-precision build for closer-to-reference quality. Strengths per launch coverage: photorealistic portraits and strong bilingual (English + Chinese) text rendering.
For the full benchmark data (and to be the first to populate it), see /check/z-image-turbo/m2-pro.
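If you want to contribute a timing, a simple wall-clock wrapper is enough. The sketch below is generic shell, not part of mflux; the helper name `time_cmd` is hypothetical, and it reports whole seconds only.

```shell
# time_cmd CMD [ARGS...] -> runs the command, prints elapsed wall-clock seconds
# and the command's exit status, then returns that same status.
time_cmd() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed_s=$((end - start)) exit_status=$status"
    return "$status"
}

# Stand-in command for illustration; swap in the mflux command from this guide.
time_cmd sleep 1
```

Time a second run rather than the first, since the first run includes the weight download, and note whether `--low-ram` was set and what else was open, so the number is comparable with other submissions.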
Troubleshooting
Out of memory / the generation hangs or swaps on a 16 GB Mac
Confirm you are using the pre-quantized 4-bit mirror (--model filipstrand/Z-Image-Turbo-mflux-4bit), not the default full-precision weights or an on-the-fly -q 8 build — those are ~33 GB and ~17 GB-class respectively and exceed the 16 GB physical memory. Close other memory-hungry apps, and add --low-ram to the command to trim the resident footprint. Watch Activity Monitor's Memory-Pressure gauge; if it goes yellow/red, the model is too large for the current free memory — the 4-bit mirror with --low-ram is the smallest-footprint Apple path. Raising the unified-memory wired limit (sudo sysctl iogpu.wired_limit_mb=<MB>) does not help here: it cannot grow past the 16 GB physical ceiling, and the heavier builds genuinely need a larger Mac.
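To see what the machine actually reports, the sketch below reads physical memory and the current wired-limit override via `sysctl`. It assumes macOS exposes `hw.memsize` and `iogpu.wired_limit_mb` as on recent releases, and that a value of 0 for the latter means the system default is in effect; the helper name `bytes_to_gib` is hypothetical.

```shell
# bytes_to_gib BYTES -> prints the value in GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

if [ "$(uname -s)" = "Darwin" ]; then
    echo "physical memory: $(bytes_to_gib "$(sysctl -n hw.memsize)") GiB"
    # Wired-limit override in MB; 0 is assumed to mean "system default".
    echo "iogpu.wired_limit_mb: $(sysctl -n iogpu.wired_limit_mb 2>/dev/null || echo unavailable)"
else
    echo "not macOS; skipping sysctl checks"
fi
```

If the first line reports 16.0 GiB, no setting will make the ~17 GB or ~33 GB builds fit; the 4-bit mirror is the only option on that machine.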
Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed
None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — mflux runs entirely on MLX with Metal, and quantizes via its own --quantize path (or ships pre-quantized weights) rather than --load-in-4bit, GPTQ, AWQ, FP8, or NVFP4. If a generic Z-Image or diffusers tutorial tells you to pip install flash-attn, select a cu128 wheel index, or load an FP8/GPTQ build, skip those steps entirely; the mflux-generate-z-image-turbo commands above are the complete Apple path.
bf16-style precision errors or unexpectedly slow generation
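mflux handles dtype internally on MLX, so you should not hit the bf16-breaks-on-MPS pitfall that affects hand-rolled PyTorch-MPS pipelines. If generation is slower than expected, first confirm you are on the 4-bit mirror and close other memory-hungry apps; on a 16 GB Mac the most common cause of slowness is memory pressure forcing macOS to swap. In that situation adding --low-ram can still be a net win, even though the flag itself costs some performance, because it avoids the far larger penalty of swapping.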
Confusion with Juggernaut-Z or the wrong "Z-Image"
Juggernaut-Z is a RunDiffusion fine-tune of Z-Image Base — a different model with its own slug. For the original Tongyi-MAI turbo weights on a 16 GB Mac, stick to the filipstrand/Z-Image-Turbo-mflux-4bit mirror linked above, and the mflux-generate-z-image-turbo entry point, which targets exactly that model family.
No other widely-reported issues. Report problems via the submission form.