How much VRAM does Qwen3 32B need?

About 19 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Qwen3-32B on Apple M4 Max: 32B local chat with MLX 4-bit in 48 GB unified memory

What You'll Build

A fully-local Qwen3-32B chat endpoint running on an Apple M4 Max with 48 GB unified memory, using Apple's native MLX runtime and 4-bit weights (~18 GB) — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. Qwen3 ships a switchable thinking/non-thinking mode, so the same model can do step-by-step reasoning or fast direct answers.

Hardware data: Apple M4 Max (48 GB unified memory) · MLX 4-bit weights ~18.4 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by CPU and GPU — it is not 48 GB of dedicated VRAM. By default macOS only lets the GPU address roughly two-thirds of it (~32 GB safe / ~36 GB optimistic via Metal's recommendedMaxWorkingSetSize). At ~18.4 GB, the 4-bit Qwen3-32B model fits that ~32 GB safe pool with room to spare — unlike a 70B-class model, this 32B leaves comfortable headroom for the KV-cache and macOS at ordinary context lengths, and needs no wired-limit raise.

Requirements

Component	Minimum	Tested
GPU / memory	48 GB unified memory (~32 GB GPU-addressable by default)	Apple M4 Max (48 GB unified memory)
RAM	Same pool — unified	48 GB unified
Storage	~19 GB (MLX 4-bit) / ~20 GB (GGUF Q4_K_M)	~19 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. The MLX 4-bit weights are 18.45 GB on disk (HF tree API for mlx-community/Qwen3-32B-4bit — 4 safetensors shards). On a 48 GB Mac the GPU addresses roughly 32 GB by default (about 66–75% of the unified pool), so the ~18 GB of weights leave roughly 13 GB for the KV-cache plus macOS — a comfortable margin at normal context. This is why the 32B fits the M4 Max without any tuning, where a 70B-class model would need a wired-limit raise; a 64 GB Mac (~48 GB addressable) simply has even more headroom. Qwen3-32B is a dense 32B model (64 layers, hidden size 5120, config.json), so all weights are resident at once — there are no sparse-MoE expert tricks to worry about here.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Qwen3-32B-4bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 4-bit weights (~18.4 GB, 4 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These weights were converted directly from Qwen/Qwen3-32B and are not gated, so no license-acceptance step is needed to download them. Both the source model and the MLX conversion are released under the Apache-2.0 license.

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Qwen3-32B-4bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Thinking vs non-thinking mode

Qwen3 exposes a switchable reasoning mode. The canonical Qwen/Qwen3-32B model card documents an enable_thinking flag on apply_chat_template (default True), plus per-turn soft switches: append /think or /no_think to a message to flip the behaviour mid-conversation. Use thinking mode for math, code, and multi-step reasoning; turn it off for fast, direct chat. Because mlx-lm applies the model's own chat template, the soft-switch tokens work through the MLX path as well.

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~19.8 GB on disk):

# Ollama (simplest) — pulls the qwen3:32b build
ollama run qwen3:32b

Ollama lists qwen3:32b as a 20 GB build with a 40K context window (ollama.com/library/qwen3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at the official Qwen/Qwen3-32B-GGUF Qwen3-32B-Q4_K_M.gguf (19.76 GB) or the bartowski/Qwen_Qwen3-32B-GGUF mirror. LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

On Apple Silicon, MLX and the llama.cpp Metal backend are both first-class; MLX is generally the faster native runtime for token generation, while GGUF gives you the broadest tooling. Both the ~18.4 GB MLX 4-bit and the ~19.8 GB GGUF Q4_K_M sit inside the ~32 GB default-addressable pool on this 48 GB Mac, so no wired-limit raise is needed for either at normal context lengths.

Results

Speed: No first-party Apple M4 Max benchmark for this pair has been recorded yet — /check/qwen3-32b/m4-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M4 Max runs ~546 GB/s unified memory — the fastest in Apple's current lineup), and no chip-named first-party Qwen3-32B-on-M4-Max throughput number exists to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~18.4 GB resident for the MLX 4-bit weights, plus a KV-cache that grows with context. Fits the ~32 GB safe default-addressable pool of a 48 GB Mac with room to spare; a 64 GB Mac (~48 GB addressable) has even more.
Quality notes: The 4-bit quantization trades a small amount of quality for a much smaller footprint. If you have a 64 GB or larger Apple Silicon machine, the 8-bit mlx-community build is a higher-fidelity option that still fits comfortably. Qwen3-32B's native context is 32,768 tokens, extensible to 131,072 with YaRN per the model card.

For the full benchmark data (and to be the first to populate it), see /check/qwen3-32b/m4-max.

Troubleshooting

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Qwen3 tutorial tells you to pip install flash-attn, pass --load-in-4bit, or load a GPTQ/AWQ build, skip those steps entirely; the commands above are the complete Apple path.

Heavy swapping only at very long context

At ~18.4 GB the weights leave room under the ~32 GB safe default-addressable pool of a 48 GB Mac, so for ordinary chat you never need to touch the memory limit. Only if you push the context window very long (tens of thousands of tokens, where the KV-cache grows large) on top of other apps might you approach the ceiling. If you do, raise the GPU's wired-memory limit (macOS Sonoma 14 / Sequoia 15+):

sudo sysctl iogpu.wired_limit_mb=40960   # 40 GB; leaves ~8 GB for macOS

This lets the GPU address up to 40 GB. Always leave 8–16 GB of headroom for macOS — pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. For this 32B model at normal context this is rarely necessary — it matters far more for 70B-class models.

Reasoning mode produces overly long answers

If responses ramble with visible chain-of-thought when you wanted a quick reply, you are in thinking mode (the default). Append /no_think to your message, or set enable_thinking=False in apply_chat_template, to get direct answers — see the model card for the exact toggle.

No other widely-reported issues. Report problems via the submission form.