
Qwen-Image on Apple M4 Max: 20B text-to-image in unified memory with mflux

image · intermediate · 23 GB+ VRAM · Jun 23, 2026

This intermediate recipe sets up Qwen-Image on the Apple M4 Max using the 6-bit mflux build, whose weights need about 23 GB of GPU-addressable unified memory.

prerequisites
  • Apple M4 Max with 48 GB unified memory (64 GB Macs have more headroom; 16 GB Macs need the on-the-fly 4-bit path and are tight — see Requirements)
  • macOS Sonoma 14 or Sequoia 15+
  • Python 3.10+ (for the `uv`/`pip` mflux install)
  • ~23 GB free disk for the pre-quantized 6-bit mflux weights (~58 GB if you pull the full-precision Qwen/Qwen-Image repo)

What You'll Build

A fully local install of Qwen-Image, Alibaba Tongyi Lab's "20B MMDiT image foundation model" (Qwen-Image GitHub README), generating images from text prompts on an Apple M4 Max. It runs on mflux, an Apple-native MLX implementation, with no NVIDIA GPU, no CUDA, and no FlashAttention. Qwen-Image is a multimodal diffusion transformer released under Apache 2.0 whose headline strength is high-fidelity text rendering in both English and Chinese. mflux ships a first-class Qwen-Image path with on-the-fly quantization, so a single command turns a prompt into a PNG entirely in the M4 Max's unified memory.

Hardware data: Apple M4 Max (48 GB unified memory) · mflux 6-bit Qwen-Image, ~23 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by CPU and GPU — not 48 GB of dedicated VRAM. macOS lets the GPU address only a fraction of it by default — on a sub-64 GB Mac that is roughly two-thirds via Metal's recommendedMaxWorkingSetSize, so plan against ~32 GB safe (and up to ~36 GB optimistic), not the full 48 GB. Qwen-Image's full-precision weights total ~58 GB on disk — transformer 40.86 GB + text encoder 16.58 GB + VAE 0.25 GB (HF tree, Qwen/Qwen-Image) — which sits far over that ceiling, so full precision cannot run on a 48 GB Mac (more on that below). The recommended build is mflux's pre-quantized 6-bit mirror (~23 GB), which fits inside the ~32 GB safe pool — with room for activations, but tighter than it would be on a 64 GB Mac.

Note on the 20B/text-encoder footprint: Qwen-Image is genuinely large for an image model on Apple Silicon, much heavier than FLUX.1 or the 6B Z-Image. Per the model's model_index.json it pairs a QwenImageTransformer2DModel (the 20B MMDiT, 40.86 GB at bf16) with a Qwen2_5_VLForConditionalGeneration text encoder (a ~7B Qwen2.5-VL stack, 16.58 GB at bf16). The big text encoder is why even the quantized footprint stays in the ~23 GB range, and why on a 48 GB Mac it sits in the upper half of the safe addressable pool. This is comfortably a 48-GB-Mac recipe on the 6-bit build; treat 16 GB as the experimental floor.
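The addressable-memory ceilings quoted in the unified-memory note can be reproduced with simple arithmetic. The sketch below is a planning aid only and does not query Metal: the two-thirds fraction comes from the note, the three-quarters "optimistic" fraction is an assumption inferred from the ~36 GB figure, and the function name is hypothetical.

```python
def addressable_pool_gb(total_gb: float, fraction: float = 2 / 3) -> float:
    """Estimate GPU-addressable unified memory as a fraction of the total."""
    return total_gb * fraction


# Two-thirds of the pool, per the unified-memory note.
safe = addressable_pool_gb(48)
# Assumed three-quarters fraction, chosen to match the ~36 GB optimistic figure.
optimistic = addressable_pool_gb(48, 0.75)

print(f"safe ~{safe:.0f} GB, optimistic ~{optimistic:.0f} GB")
```

Plan against the smaller number; the larger one leaves little for macOS and other apps.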

Requirements

Component | Minimum | Tested
GPU / memory | 16 GB unified memory (~10.5 GB GPU-addressable, -q 4, tight) | Apple M4 Max (48 GB unified memory, ~32 GB addressable safe)
RAM | Same pool (unified) | 48 GB unified
Storage | ~23 GB (mflux 6-bit) / ~58 GB (full-precision Qwen/Qwen-Image repo) | ~23 GB (6-bit)
Software | Python 3.10+, macOS Sonoma 14 / Sequoia 15+ | macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. Qwen-Image's full-precision weights total ~58 GB on disk — transformer 40.86 GB + text encoder 16.58 GB + VAE 0.25 GB (HF tree, Qwen/Qwen-Image). On a 48 GB M4 Max the GPU can address only ~32 GB safely (~36 GB optimistic), so the full-precision build is far over the wall — it cannot run unquantized on this Mac (see Troubleshooting). mflux's pre-quantized 6-bit mirror, filipstrand/Qwen-Image-mflux-6bit, is ~23 GB on disk (transformer ~16.6 GB + text encoder ~6.2 GB + VAE 0.25 GB, summed via the HF tree API) and fits the M4 Max's ~32 GB safe pool with room for activations — tighter than the same build would run on a 64 GB Mac, but workable. The lighter 4-bit mirror (~15 GB) is the option to reach for if you want more headroom, and is also the floor that lets a 16 GB Mac (~10.5 GB addressable) attempt this 20B model via the on-the-fly -q 4 path, where it is tight — treat 16 GB as experimental.
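To make the fit check concrete, the sketch below sums the component sizes quoted above and compares each build against the ~32 GB safe pool. It treats on-disk weight size as a stand-in for resident weight size and ignores activations, so real headroom is smaller than what it prints; the names are hypothetical.

```python
from typing import Dict

# Component sizes in GB, as quoted above from the Hugging Face file trees.
FULL_PRECISION = {"transformer": 40.86, "text_encoder": 16.58, "vae": 0.25}
SIX_BIT = {"transformer": 16.6, "text_encoder": 6.2, "vae": 0.25}

SAFE_POOL_GB = 32.0  # roughly two-thirds of 48 GB, per the text above


def headroom_gb(components: Dict[str, float], pool_gb: float = SAFE_POOL_GB) -> float:
    """Pool minus total weights; a negative value means the weights alone do not fit."""
    return pool_gb - sum(components.values())


for name, build in (("full precision", FULL_PRECISION), ("6-bit", SIX_BIT)):
    total = sum(build.values())
    print(f"{name}: {total:.2f} GB weights, headroom {headroom_gb(build):+.2f} GB")
```

The full-precision build comes out well below zero, while the 6-bit build leaves several gigabytes for activations.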

Installation

1. Install mflux (the Apple-native MLX image path)

uv tool install --upgrade mflux

mflux is a from-scratch MLX implementation of the FLUX / Qwen-Image / Z-Image families; its model table lists Qwen Image as a 20B Base model (Aug 2025+) with the note "Large model (slower); strong prompt understanding and world knowledge" (mflux README). There is nothing CUDA-shaped to install — no torch CUDA wheel, no cu12x index, no FlashAttention, no bitsandbytes. If you prefer pip, pip install -U mflux works too; the project recommends the uv tool install above. This pulls the mflux-generate-qwen entry point onto your PATH.
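A quick way to confirm the install worked is to check that the entry point resolves on PATH. This is a minimal sketch using only the Python standard library; the helper name is hypothetical, and it checks only that the command exists, not that it runs correctly.

```python
import shutil
from typing import Optional


def find_entry_point(name: str = "mflux-generate-qwen") -> Optional[str]:
    """Return the full path of the command if it is on PATH, else None."""
    return shutil.which(name)


path = find_entry_point()
if path:
    print(f"found: {path}")
else:
    print("mflux-generate-qwen not found on PATH; re-check the install step")
```

If the command is missing after a uv tool install, the uv tool directory is probably not on your PATH in the current shell.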

2. Generate an image with the pre-quantized 6-bit weights (recommended)

On a 48 GB M4 Max, point mflux at the pre-quantized 6-bit mirror so nothing has to be quantized on the fly:

mflux-generate-qwen \
  --model filipstrand/Qwen-Image-mflux-6bit \
  --prompt "Close-up portrait of a majestic tiger in its natural habitat, detailed fur texture, piercing eyes, natural forest background, soft natural lighting, wildlife photography, photorealistic, high detail, professional wildlife shot" \
  --width 1920 \
  --height 816 \
  --steps 30 \
  --seed 42

This --model filipstrand/Qwen-Image-mflux-6bit invocation is verbatim from the 6-bit mirror's model card, maintained by the mflux author. On first run mflux pulls the ~23 GB of 6-bit weights from Hugging Face and caches them under ~/.cache/huggingface; the PNG lands in the working directory. The mirror "inherits the license of the original Qwen model" — i.e. Apache 2.0 — so no license-acceptance step is required to download it.
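Before the first run it can be worth checking that the volume holding the Hugging Face cache has room for the ~23 GB download. The sketch below assumes the default cache root named above, or the HF_HOME environment variable if you have set it; the helper name and the decimal-gigabyte convention are illustrative choices.

```python
import os
import shutil
from pathlib import Path


def has_room(free_bytes: int, needed_gb: float) -> bool:
    """True if free_bytes covers needed_gb (decimal gigabytes)."""
    return free_bytes >= needed_gb * 1e9


# HF_HOME overrides the default cache root; otherwise use the path named above.
cache_root = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))

# Walk up to the nearest existing directory so disk_usage has a valid target.
probe = cache_root
while not probe.exists():
    probe = probe.parent

free = shutil.disk_usage(probe).free
print(f"{free / 1e9:.0f} GB free near {cache_root}; enough for 6-bit: {has_room(free, 23)}")
```

Pass 58 instead of 23 to check for the full-precision repo.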

3. (Alternative) Lighter 4-bit, or on-the-fly quantization from the full repo

If you want more headroom on the 48 GB Mac, the pre-quantized 4-bit mirror is ~15 GB and leaves the safe pool less crowded. Or, if you would rather quantize the full-precision Qwen/Qwen-Image weights yourself, mflux quantizes as the weights load. Pass -q with one of the accepted integer values — "Most models support --quantize (3, 4, 5, 6, 8)" (mflux CLI docs):

mflux-generate-qwen \
  --prompt "A puffin standing on a cliff, oil painting" \
  --width 1024 \
  --height 1024 \
  --steps 30 \
  --seed 42 \
  -q 4

The -q 8 form gives the closest-to-full-precision quantized output, but its ~30 GB-class working set crowds the M4 Max's ~32 GB safe ceiling once activations are added, so it is not the path here. -q 6 (or the pre-quantized 6-bit mirror, ~23 GB) is the recommended balance, and -q 4 produces a ~15 GB-class working set: the lightest path, and the one a 16 GB Mac must use (tight). The base mflux-generate-qwen --steps 30 -q 8 example is verbatim from the Qwen-Image mflux README.
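The size classes above follow roughly from scaling the bf16 footprint by bits per weight. The sketch below is a naive linear estimate, not how mflux computes anything: it ignores quantization scales and any layers kept at higher precision, which is presumably why the quoted figures come out slightly higher than what it prints.

```python
FULL_PRECISION_GB = 57.69  # 40.86 + 16.58 + 0.25, stored at bf16 (16 bits per weight)


def approx_quantized_gb(bits: int, full_gb: float = FULL_PRECISION_GB) -> float:
    """Naive linear estimate: footprint scales with bits per weight."""
    return full_gb * bits / 16


for bits in (8, 6, 4):
    print(f"-q {bits}: ~{approx_quantized_gb(bits):.1f} GB")
```

Treat these as lower bounds when deciding which build fits your pool.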

Running

After installation the same mflux-generate-qwen command is your day-to-day entry point — change --prompt, --width/--height, and --seed to taste, and keep --steps 30 (the value the mflux Qwen-Image examples use). Qwen-Image's standout capability is text rendering — the Qwen-Image card highlights "high-fidelity text rendering" across both "alphabetic languages like English" and "logographic scripts like Chinese" (Qwen/Qwen-Image card) — so prompts with embedded signage, captions, or non-Latin scripts are a genuine strength:

mflux-generate-qwen \
  --model filipstrand/Qwen-Image-mflux-6bit \
  --prompt "storefront sign reading '欢迎光临 · OPEN', neon, rain-slick pavement at dusk, cinematic" \
  --width 1024 \
  --height 1024 \
  --steps 30 \
  --seed 7

If you are tight on memory while a large generation runs, mflux exposes a --low-ram flag — "Use --low-ram to reduce memory usage (at the cost of performance)" (mflux CLI docs).
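If you generate from scripts rather than the shell, a thin wrapper that assembles the argument list keeps the flags in one place. This sketch uses only the flags shown in this recipe (--model, --prompt, --width, --height, --steps, --seed, -q, --low-ram); the function name and its defaults are hypothetical, and it prints the command instead of running it.

```python
import shlex
from typing import List, Optional


def build_qwen_command(prompt: str, width: int = 1024, height: int = 1024,
                       steps: int = 30, seed: int = 42,
                       model: Optional[str] = "filipstrand/Qwen-Image-mflux-6bit",
                       quantize: Optional[int] = None,
                       low_ram: bool = False) -> List[str]:
    """Assemble an argument list using only the flags shown in this recipe."""
    cmd = ["mflux-generate-qwen"]
    if model:
        cmd += ["--model", model]
    cmd += ["--prompt", prompt, "--width", str(width), "--height", str(height),
            "--steps", str(steps), "--seed", str(seed)]
    if quantize is not None:
        cmd += ["-q", str(quantize)]
    if low_ram:
        cmd.append("--low-ram")
    return cmd


cmd = build_qwen_command("A puffin standing on a cliff, oil painting", low_ram=True)
# Shell-quoted form for copy and paste; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Passing the list to subprocess.run avoids shell-quoting problems with prompts that contain quotes or non-Latin text.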

Alternative: Draw Things (Metal-native GUI)

If you would rather not touch the terminal, Draw Things is a native Apple-Silicon Metal app (iOS / iPadOS / macOS) with its own Metal attention engine, and it lists Qwen-Image among its supported models — a point-and-click alternative to the mflux CLI on the same hardware. It is the GUI counterpart to the mflux path; consult its in-app model browser for the current Qwen-Image build, since model support there evolves independently of mflux.

Results

  • Speed: No first-party Apple M4 Max benchmark for this pair has been recorded yet; /check/qwen-image/m4-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a seconds-per-image figure: image generation throughput on Apple Silicon is bound by the M4 Max's ~546 GB/s unified-memory bandwidth and its 40-core GPU, and no chip-named first-party number exists for Qwen-Image on this Mac. The mflux table itself flags Qwen-Image as a "Large model (slower)" (mflux README), so expect a heavier per-image cost than FLUX.1 or Z-Image. The latency numbers in the NVIDIA/AMD Qwen-Image recipes come from CUDA/ROCm hardware and do not carry over to Apple Silicon; nor do M2/M3 Max figures, whose ~400 GB/s bandwidth differs from the M4 Max's 546 GB/s. If you run this, please contribute your timing so we can seed a real M4 Max datapoint.
  • Memory usage: ~23 GB on disk for the recommended 6-bit mirror, with a runtime working set that fits the M4 Max's ~32 GB safe-addressable pool with room for activations — tighter than on a 64 GB Mac, but workable. The lighter 4-bit mirror lands ~15 GB-class; on-the-fly -q 8 is ~30 GB-class and pushes the ~32 GB safe ceiling, so prefer the 6-bit or 4-bit build here. The full-precision ~58 GB build is far over the addressable wall — it cannot run unquantized on a 48 GB Mac. Live measurements (once contributed): /check/qwen-image/m4-max.
  • Quality notes: Qwen-Image is a 20B MMDiT paired with a Qwen2.5-VL text encoder (architecture per the model_index.json), which is what powers its strong prompt understanding and bilingual text rendering. 6-bit (-q 6, or the pre-quantized mirror) preserves quality close to full precision while cutting the footprint to roughly 40% of the original (~58 GB down to ~23 GB); drop to -q 4 for more headroom (or when a 16 GB Mac forces it), accepting a fidelity trade. Strengths per the model card: complex text rendering and precise prompt adherence in English and Chinese.

For the full benchmark data (and to be the first to populate it), see /check/qwen-image/m4-max.
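If you want to contribute a timing, a wall-clock measurement of one full CLI run is a reasonable starting point. The sketch below is one possible harness, not an official benchmark procedure: the function names are hypothetical, and the figure it reports includes model load (and the download, on a first run), so time a second run for a steadier number.

```python
import subprocess
import time
from typing import Callable, List


def time_call(fn: Callable[[], object]) -> float:
    """Return wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def time_generation(cmd: List[str]) -> float:
    """Time one full CLI run, including model load; raises if the command fails."""
    return time_call(lambda: subprocess.run(cmd, check=True))


# Stand-in workload so the sketch runs anywhere; on a real Mac, call
# time_generation([...]) with the mflux-generate-qwen arguments from this recipe.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"{elapsed:.4f} s")
```

Record the resolution, step count, and build (6-bit mirror, -q 4, and so on) alongside the number, since all three change the result.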

Troubleshooting

Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — mflux runs entirely on MLX with Metal, and quantizes via its own -q path rather than --load-in-4bit, GPTQ, AWQ, FP8, or NVFP4. If a generic Qwen-Image or diffusers tutorial tells you to pip install flash-attn, select a cu128 wheel index, or load an FP8/GGUF-CUDA build, skip those steps entirely; the mflux-generate-qwen commands above are the complete Apple path.

Out of memory or heavy swapping on a 48 GB or 16 GB Mac

Qwen-Image is a 20B model with a 16.58 GB text encoder — it is the heaviest image model in this Apple line-up. On a 48 GB M4 Max (~32 GB addressable safe) the 6-bit mirror fits but with less slack than on a 64 GB Mac, and the on-the-fly -q 8 path (~30 GB-class) pushes the safe ceiling — stick to the 6-bit mirror or drop to the lighter -q 4 (~15 GB) if you hit pressure. On a 16 GB Mac the 6-bit weights won't fit at all; you must use -q 4 and even then it is tight — watch Activity Monitor's Memory-Pressure gauge and add --low-ram. If you are running the 6-bit build and macOS still reports memory pressure, you can nudge the GPU's wired limit up with sudo sysctl iogpu.wired_limit_mb=<MB> (macOS Sonoma 14 / Sequoia 15+; older macOS uses debug.iogpu.wired_limit in bytes), leaving 8–16 GB of headroom for the OS — the setting is temporary and resets on reboot. Note this does not unlock the full-precision ~58 GB build: even a raised wired limit on a 48 GB Mac cannot address 58 GB of weights, so a quantized build is mandatory here.
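The value to pass to iogpu.wired_limit_mb is just total memory minus the headroom you leave for macOS, expressed in megabytes. The sketch below assumes 1 GB = 1024 MB for the conversion and uses a hypothetical helper name; it only prints the command and does not change any system setting.

```python
def wired_limit_mb(total_gb: int, os_headroom_gb: int) -> int:
    """MB value for iogpu.wired_limit_mb, leaving os_headroom_gb for macOS."""
    if not 0 < os_headroom_gb < total_gb:
        raise ValueError("headroom must be positive and smaller than total memory")
    # Assumes 1 GB = 1024 MB for the conversion.
    return (total_gb - os_headroom_gb) * 1024


# The 8 to 16 GB headroom range suggested above, on a 48 GB Mac.
for headroom in (8, 16):
    mb = wired_limit_mb(48, headroom)
    print(f"sudo sysctl iogpu.wired_limit_mb={mb}  # {headroom} GB left for the OS")
```

Start with the larger headroom and raise the limit only if memory pressure persists.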

bf16-style precision errors or unexpectedly slow generation

mflux handles dtype internally on MLX, so you should not hit the bf16-breaks-on-MPS pitfall that affects hand-rolled PyTorch-MPS pipelines. If generation is slower than expected, confirm you used the pre-quantized 6-bit mirror or passed -q 6/-q 4: the full-precision ~58 GB weights do not fit a 48 GB Mac's addressable pool, and attempting to load them pushes macOS into swapping. Qwen-Image is also inherently slower than FLUX.1/Z-Image at the same resolution because of its size; lower --width/--height or --steps if you need faster iteration.
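To get a feel for how much a smaller canvas or fewer steps might save, you can compare pixel count times step count against a baseline. This is a crude proxy under the assumption that cost grows at least linearly with both; it is not a measured speedup, and attention cost typically grows faster than linearly with resolution, so the real saving from a smaller image may be larger.

```python
from typing import Tuple


def relative_cost(width: int, height: int, steps: int,
                  base: Tuple[int, int, int] = (1920, 816, 30)) -> float:
    """Pixels times steps, relative to a baseline (default: the 1920x816, 30-step example)."""
    base_w, base_h, base_steps = base
    return (width * height * steps) / (base_w * base_h * base_steps)


# Square 1024 output versus the wide 1920x816 example, same step count.
print(f"{relative_cost(1024, 1024, 30):.2f}")
# Same canvas with fewer steps.
print(f"{relative_cost(1024, 1024, 20):.2f}")
```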

Confirming Qwen-Image and not a different "Qwen" model

The mflux entry point is mflux-generate-qwen and the canonical weights are Qwen/Qwen-Image (the 6-bit mirror is filipstrand/Qwen-Image-mflux-6bit). This is the text-to-image diffusion model — distinct from the Qwen language models (Qwen3, Qwen2.5-VL) and from Qwen-Image-Edit (the image-editing sibling). For pure text-to-image on this Mac, use the commands above.

No other widely-reported issues. Report problems via the submission form.

common questions
How much VRAM does Qwen-Image need?

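About 23 GB of GPU-addressable unified memory for the recommended 6-bit build (Apple Silicon has no dedicated VRAM). The lighter 4-bit path is ~15 GB-class.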

Which GPUs is Qwen-Image tested on?

Apple M4 Max (48 GB).

How hard is this setup?

Intermediate — follow the steps above.