self-hosted/ai
§01·recipe · llm

Qwen3-14B on Apple M2 Pro: the strongest LLM that fits a 16 GB unified-memory Mac, via MLX 4-bit

llmintermediate9GB+ VRAMJun 25, 2026

This intermediate recipe sets up Qwen3 14B on the Apple M2 Pro, needing about 9 GB of VRAM.

models
tools
prerequisites
  • Apple M2 Pro with 16 GB unified memory (any Apple Silicon Mac with ≥16 GB also runs this)
  • macOS Sonoma 14 or Sequoia 15+
  • Python 3.10+
  • ~9 GB free disk for the MLX 4-bit weights (~9 GB for the GGUF Q4_K_M path)

What You'll Build

A fully-local Qwen3-14B chat endpoint running on an Apple M2 Pro with 16 GB unified memory, using Apple's native MLX runtime and 4-bit weights (~8.3 GB) — no NVIDIA GPU, no CUDA, no FlashAttention. At 14B parameters, this is the strongest LLM that fits a 16 GB Mac cleanly: heavier than an 8B but small enough to leave room for the KV-cache and macOS at modest context. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. Qwen3 ships a switchable thinking/non-thinking mode, so the same model can do step-by-step reasoning or fast direct answers.

Hardware data: Apple M2 Pro (16 GB unified memory) · MLX 4-bit weights ~8.3 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Pro has 16 GB of unified memory shared by CPU and GPU — it is not 16 GB of dedicated VRAM. On a 16 GB Mac macOS lets the GPU address only about two-thirds of it (~10.5 GB via Metal's recommendedMaxWorkingSetSize). At ~8.3 GB the 4-bit Qwen3-14B weights fit that ~10.5 GB pool with enough room left for a modest KV-cache and the OS — but the margin is real, not generous. Keep your context window modest (≤8k) for a no-fuss fit, or raise the wired-memory limit for long sessions (see Troubleshooting).

Requirements

ComponentMinimumTested
GPU / memory16 GB unified memory (~10.5 GB GPU-addressable by default)Apple M2 Pro (16 GB unified memory)
RAMSame pool — unified16 GB unified
Storage~9 GB (MLX 4-bit) / ~9 GB (GGUF Q4_K_M)~9 GB
SoftwarePython 3.10+, macOS Sonoma 14 / Sequoia 15+macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity. The MLX 4-bit weights are 8.32 GB on disk (two safetensors shards, 5.35 GB + 2.95 GB, per the HF tree API for mlx-community/Qwen3-14B-4bit). Against the ~10.5 GB the GPU can address by default on a 16 GB Mac, that leaves roughly 2 GB for the KV-cache plus macOS at short context — enough, but watch it as the context grows. Qwen3-14B is a dense 14B model (40 layers, hidden size 5120, config.json), so all weights are resident at once — there are no sparse-MoE expert tricks to worry about here. It also uses grouped-query attention (8 KV heads), which keeps the KV-cache small: roughly 0.16 MB per token at fp16, so an 8k-token context costs only ~1.3 GB. That is why 8k is the comfortable ceiling on this chip — push toward 16k (~2.7 GB KV) and you start crowding the ~10.5 GB addressable pool.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Qwen3-14B-4bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 4-bit weights (~8.3 GB, 2 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These weights were converted directly from Qwen/Qwen3-14B and are not gated, so no license-acceptance step is needed to download them. Both the source model and the MLX conversion are released under the Apache-2.0 license.

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Qwen3-14B-4bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Thinking vs non-thinking mode

Qwen3 exposes a switchable reasoning mode. The canonical Qwen/Qwen3-14B model card documents an enable_thinking flag on apply_chat_template (default True), plus per-turn soft switches: add /think or /no_think to a user prompt or system message to flip the behaviour from turn to turn. Use thinking mode for math, code, and multi-step reasoning; turn it off for fast, direct chat. Because mlx-lm applies the model's own chat template, the soft-switch tokens work through the MLX path as well.

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (9.0 GB on disk):

# Ollama (simplest) — pulls the qwen3:14b build
ollama run qwen3:14b

Ollama lists qwen3:14b as a 9.3 GB build with a 40K context window (ollama.com/library/qwen3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS"On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at the official Qwen/Qwen3-14B-GGUF Qwen3-14B-Q4_K_M.gguf (9.0 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal. The 9.0 GB GGUF Q4_K_M sits just under the ~10.5 GB addressable pool, the same as the MLX 4-bit path — keep context modest with either runtime.

Results

  • Speed: No first-party Apple M2 Pro benchmark for this pair has been recorded yet — /check/qwen3-14b/m2-pro currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M2 Pro runs ~200 GB/s unified memory — the slowest of Apple's Pro/Max chips), and no chip-named first-party Qwen3-14B-on-M2-Pro throughput number exists to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint.
  • Memory usage: ~8.3 GB resident for the MLX 4-bit weights, plus a KV-cache that grows with context (~0.16 MB/token, so ~1.3 GB at 8k). Fits the ~10.5 GB default-addressable pool of a 16 GB Mac at modest context; long context (≳16k) crowds the pool and may need a wired-limit raise.
  • Quality notes: The 4-bit quantization trades a small amount of quality for a footprint that fits 16 GB. Qwen3-14B's native context is 32,768 tokens, extensible to 131,072 with YaRN per the model card — but on a 16 GB Mac the KV-cache, not the model, is what limits how far you can push context.

For the full benchmark data (and to be the first to populate it), see /check/qwen3-14b/m2-pro.

Troubleshooting

Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Qwen3 tutorial tells you to pip install flash-attn, pass --load-in-4bit, or load a GPTQ/AWQ build, skip those steps entirely; the commands above are the complete Apple path.

Swapping or beachballs at long context

On a 16 GB Mac the ~8.3 GB weights leave only ~2 GB of the default ~10.5 GB addressable pool for everything else. For ordinary chat at ≤8k context you stay inside it. But if you push the context window long (the KV-cache grows ~0.16 MB/token, so tens of thousands of tokens add gigabytes) — especially with other apps open — you can hit the ceiling and start swapping. Two fixes: keep context modest, or raise the GPU's wired-memory limit (macOS Sonoma 14 / Sequoia 15+):

sudo sysctl iogpu.wired_limit_mb=13312   # 13 GB; leaves ~3 GB for macOS

This lets the GPU address up to 13 GB instead of the ~10.5 GB default. On a 16 GB machine this is genuinely tight — leave at least 2.5–3 GB for macOS and watch Activity Monitor's Memory-Pressure gauge; if it goes yellow/red, back the limit down or shorten your context. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default.

Reasoning mode produces overly long answers

If responses ramble with visible chain-of-thought when you wanted a quick reply, you are in thinking mode (the default). Append /no_think to your message, or set enable_thinking=False in apply_chat_template, to get direct answers — see the model card for the exact toggle.

No other widely-reported issues. Report problems via the submission form.

common questions
How much VRAM does Qwen3 14B need?

About 9 GB — the minimum this recipe targets.

Which GPUs is Qwen3 14B tested on?

Apple M2 Pro (16 GB).

How hard is this setup?

Intermediate — follow the steps above.