What You'll Build
A fully-local Llama 3.3 70B Instruct chat endpoint running on an Apple M2 Max with 64 GB unified memory, using Apple's native MLX runtime and 4-bit weights — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting.
Hardware data: Apple M2 Max (64 GB unified memory) · MLX 4-bit weights ~39.7 GB on disk · See benchmark data
ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — it is not 64 GB of dedicated VRAM. By default macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's
recommendedMaxWorkingSetSize). The 39.7 GB 4-bit model fits that ~48 GB pool, but tightly — see Requirements and Troubleshooting for the context budget and how to raise the limit.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU / memory | 64 GB unified memory (~48 GB GPU-addressable by default) | Apple M2 Max (64 GB unified memory) |
| RAM | Same pool — unified | 64 GB unified |
| Storage | ~40 GB (MLX 4-bit) / ~43 GB (GGUF Q4_K_M) | ~40 GB |
| Software | Python 3.10+, macOS Sonoma 14 / Sequoia 15+ | macOS Sequoia 15 |
The binding constraint here is addressable unified memory, not raw capacity. The MLX 4-bit weights are 39.69 GB on disk (HF tree API for mlx-community/Llama-3.3-70B-Instruct-4bit — 8 safetensors shards). Against the ~48 GB the GPU can address by default, that leaves roughly 8 GB for the KV-cache plus macOS itself. A 70B-class model's KV-cache runs ~1.25 MB/token in fp16 — about 10 GB at an 8k context — so keep the context window at or below 8k on the default limit, or raise the wired limit (Troubleshooting) before pushing longer prompts.
Installation
1. Install MLX-LM (the Apple-native path)
pip install mlx-lm
MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)
2. Run the model (weights download on first use)
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "Explain unified memory in one paragraph."
On first run, mlx-lm pulls the 4-bit weights (~39.7 GB, 8 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.3 Community License).
Running
For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):
mlx_lm.server --model mlx-community/Llama-3.3-70B-Instruct-4bit
This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)
Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)
If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~42.5 GB on disk):
# Ollama (simplest) — pulls a ~43 GB Q4_K_M build
ollama run llama3.3:70b
Ollama reports llama3.3:70b as a 43 GB build with a 128K context window (ollama.com/library/llama3.3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Llama-3.3-70B-Instruct-GGUF or unsloth/Llama-3.3-70B-Instruct-GGUF (Q4_K_M = 42.52 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.
The GGUF Q4_K_M (~42.5 GB) is tighter against the ~48 GB default-addressable pool than the MLX 4-bit (~39.7 GB) — on the GGUF path, raising the wired limit (Troubleshooting) is recommended rather than optional.
Results
- Speed: No first-party Apple M2 Max benchmark for this pair has been recorded yet — /check/llama-3-3-70b/m2-max currently returns
verdict: unknownwith no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M2 Max runs ~400 GB/s unified memory), and the only public 70B-on-M2-Max throughput figures come from a single community aggregator, which is not enough to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint. - Memory usage: ~39.7 GB resident for the 4-bit weights, plus KV-cache that grows with context. Fits the ~48 GB default-addressable pool with ctx ≤ 8k; see Troubleshooting to go longer.
- Quality notes: The 4-bit quantization trades a small amount of quality for the ability to fit 70B in 64 GB at all. If you have a 96 GB or 128 GB Apple Silicon machine, the 8-bit
mlx-communitybuild is a higher-fidelity option.
For the full benchmark data (and to be the first to populate it), see /check/llama-3-3-70b/m2-max.
Troubleshooting
Out of memory / heavy swapping with long prompts
The default GPU-addressable share of unified memory is ~75% (~48 GB on a 64 GB Mac). The 4-bit weights alone are ~39.7 GB, so a long context can push past the ceiling and trigger swapping. Two fixes:
-
Keep the context at or below 8k tokens — the simplest option, and enough for most chat use.
-
Raise the wired-memory limit (macOS Sonoma 14 / Sequoia 15+):
sudo sysctl iogpu.wired_limit_mb=57344 # 56 GB; leaves ~8 GB for macOSThis lets the GPU address up to 56 GB. Always leave 8–16 GB of headroom for macOS — pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via
/etc/sysctl.conf);sudo sysctl iogpu.wired_limit_mb=0restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. This is recommended for the GGUF Q4_K_M path (~42.5 GB), which is tighter than the MLX 4-bit path.
The meta-llama/Llama-3.3-70B-Instruct repo asks for access
The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.3 Community License still governs your use of the model.
Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed
None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.
No other widely-reported issues. Report problems via the submission form.