What You'll Build
A fully-local Qwen3-32B chat endpoint running on an Apple M2 Max with 64 GB unified memory, using Apple's native MLX runtime and 4-bit weights (~18 GB) — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. Qwen3 ships a switchable thinking/non-thinking mode, so the same model can do step-by-step reasoning or fast direct answers.
Hardware data: Apple M2 Max (64 GB unified memory) · MLX 4-bit weights ~18.4 GB on disk · See benchmark data
ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — it is not 64 GB of dedicated VRAM. By default macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's
recommendedMaxWorkingSetSize). At ~18.4 GB, the 4-bit Qwen3-32B model fits that ~48 GB pool with plenty of room to spare — unlike a 70B-class model, this 32B leaves comfortable headroom for the KV-cache and macOS, even at long context.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU / memory | 48 GB unified memory (~32 GB GPU-addressable by default) | Apple M2 Max (64 GB unified memory) |
| RAM | Same pool — unified | 64 GB unified |
| Storage | ~19 GB (MLX 4-bit) / ~20 GB (GGUF Q4_K_M) | ~19 GB |
| Software | Python 3.10+, macOS Sonoma 14 / Sequoia 15+ | macOS Sequoia 15 |
The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. The MLX 4-bit weights are 18.45 GB on disk (HF tree API for mlx-community/Qwen3-32B-4bit — 4 safetensors shards). Against the ~48 GB the GPU can address by default on a 64 GB Mac, that leaves roughly 30 GB for the KV-cache plus macOS — a wide margin. This is why the model also runs comfortably on a 48 GB Apple Silicon Mac, where the ~32 GB default-addressable pool still clears the ~18 GB weights with room for context. Qwen3-32B is a dense 32B model (64 layers, hidden size 5120, config.json), so all weights are resident at once — there are no sparse-MoE expert tricks to worry about here.
Installation
1. Install MLX-LM (the Apple-native path)
pip install mlx-lm
MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)
2. Run the model (weights download on first use)
mlx_lm.generate --model mlx-community/Qwen3-32B-4bit --prompt "Explain unified memory in one paragraph."
On first run, mlx-lm pulls the 4-bit weights (~18.4 GB, 4 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These weights were converted directly from Qwen/Qwen3-32B and are not gated, so no license-acceptance step is needed to download them. Both the source model and the MLX conversion are released under the Apache-2.0 license.
Running
For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):
mlx_lm.server --model mlx-community/Qwen3-32B-4bit
This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)
Thinking vs non-thinking mode
Qwen3 exposes a switchable reasoning mode. The canonical Qwen/Qwen3-32B model card documents an enable_thinking flag on apply_chat_template (default True), plus per-turn soft switches: append /think or /no_think to a message to flip the behaviour mid-conversation. Use thinking mode for math, code, and multi-step reasoning; turn it off for fast, direct chat. Because mlx-lm applies the model's own chat template, the soft-switch tokens work through the MLX path as well.
Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)
If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~19.8 GB on disk):
# Ollama (simplest) — pulls the qwen3:32b build
ollama run qwen3:32b
Ollama lists qwen3:32b as a 20 GB build with a 40K context window (ollama.com/library/qwen3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at the official Qwen/Qwen3-32B-GGUF Qwen3-32B-Q4_K_M.gguf (19.76 GB) or the bartowski/Qwen_Qwen3-32B-GGUF mirror. LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.
On Apple Silicon, MLX and the llama.cpp Metal backend are both first-class; MLX is generally the faster native runtime for token generation, while GGUF gives you the broadest tooling. Both the ~18.4 GB MLX 4-bit and the ~19.8 GB GGUF Q4_K_M sit well inside the ~48 GB default-addressable pool on this 64 GB Mac, so no wired-limit raise is needed for either at normal context lengths.
Results
- Speed: No first-party Apple M2 Max benchmark for this pair has been recorded yet — /check/qwen3-32b/m2-max currently returns
verdict: unknownwith no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M2 Max runs ~400 GB/s unified memory), and no chip-named first-party Qwen3-32B-on-M2-Max throughput number exists to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint. - Memory usage: ~18.4 GB resident for the MLX 4-bit weights, plus a KV-cache that grows with context. Fits the ~48 GB default-addressable pool of a 64 GB Mac with wide headroom; also clears the ~32 GB pool of a 48 GB Mac.
- Quality notes: The 4-bit quantization trades a small amount of quality for a much smaller footprint. If you have a 96 GB or 128 GB Apple Silicon machine, the 8-bit
mlx-communitybuild is a higher-fidelity option. Qwen3-32B's native context is 32,768 tokens, extensible to 131,072 with YaRN per the model card.
For the full benchmark data (and to be the first to populate it), see /check/qwen3-32b/m2-max.
Troubleshooting
Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed
None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Qwen3 tutorial tells you to pip install flash-attn, pass --load-in-4bit, or load a GPTQ/AWQ build, skip those steps entirely; the commands above are the complete Apple path.
Heavy swapping only at very long context
At ~18.4 GB the weights leave a wide margin under the ~48 GB default-addressable pool, so for ordinary chat you never need to touch the memory limit. Only if you push the context window very long (tens of thousands of tokens, where the KV-cache grows large) on top of other apps might you approach the ceiling. If you do, raise the GPU's wired-memory limit (macOS Sonoma 14 / Sequoia 15+):
sudo sysctl iogpu.wired_limit_mb=57344 # 56 GB; leaves ~8 GB for macOS
This lets the GPU address up to 56 GB. Always leave 8–16 GB of headroom for macOS — pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. For this 32B model at normal context this is rarely necessary — it matters far more for 70B-class models.
Reasoning mode produces overly long answers
If responses ramble with visible chain-of-thought when you wanted a quick reply, you are in thinking mode (the default). Append /no_think to your message, or set enable_thinking=False in apply_chat_template, to get direct answers — see the model card for the exact toggle.
No other widely-reported issues. Report problems via the submission form.