How much VRAM does Llama 3.3 70B need?

About 40 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Llama 3.3 70B on Apple M2 Max: 70B-class chat in 64 GB unified memory with MLX

What You'll Build

A fully-local Llama 3.3 70B Instruct chat endpoint running on an Apple M2 Max with 64 GB unified memory, using Apple's native MLX runtime and 4-bit weights — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting.

Hardware data: Apple M2 Max (64 GB unified memory) · MLX 4-bit weights ~39.7 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — it is not 64 GB of dedicated VRAM. By default macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's recommendedMaxWorkingSetSize). The 39.7 GB 4-bit model fits that ~48 GB pool, but tightly — see Requirements and Troubleshooting for the context budget and how to raise the limit.

Requirements

Component	Minimum	Tested
GPU / memory	64 GB unified memory (~48 GB GPU-addressable by default)	Apple M2 Max (64 GB unified memory)
RAM	Same pool — unified	64 GB unified
Storage	~40 GB (MLX 4-bit) / ~43 GB (GGUF Q4_K_M)	~40 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint here is addressable unified memory, not raw capacity. The MLX 4-bit weights are 39.69 GB on disk (HF tree API for mlx-community/Llama-3.3-70B-Instruct-4bit — 8 safetensors shards). Against the ~48 GB the GPU can address by default, that leaves roughly 8 GB for the KV-cache plus macOS itself. A 70B-class model's KV-cache runs ~1.25 MB/token in fp16 — about 10 GB at an 8k context — so keep the context window at or below 8k on the default limit, or raise the wired limit (Troubleshooting) before pushing longer prompts.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 4-bit weights (~39.7 GB, 8 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.3 Community License).

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Llama-3.3-70B-Instruct-4bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~42.5 GB on disk):

# Ollama (simplest) — pulls a ~43 GB Q4_K_M build
ollama run llama3.3:70b

Ollama reports llama3.3:70b as a 43 GB build with a 128K context window (ollama.com/library/llama3.3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Llama-3.3-70B-Instruct-GGUF or unsloth/Llama-3.3-70B-Instruct-GGUF (Q4_K_M = 42.52 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

The GGUF Q4_K_M (~42.5 GB) is tighter against the ~48 GB default-addressable pool than the MLX 4-bit (~39.7 GB) — on the GGUF path, raising the wired limit (Troubleshooting) is recommended rather than optional.

Results

Speed: No first-party Apple M2 Max benchmark for this pair has been recorded yet — /check/llama-3-3-70b/m2-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M2 Max runs ~400 GB/s unified memory), and the only public 70B-on-M2-Max throughput figures come from a single community aggregator, which is not enough to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~39.7 GB resident for the 4-bit weights, plus KV-cache that grows with context. Fits the ~48 GB default-addressable pool with ctx ≤ 8k; see Troubleshooting to go longer.
Quality notes: The 4-bit quantization trades a small amount of quality for the ability to fit 70B in 64 GB at all. If you have a 96 GB or 128 GB Apple Silicon machine, the 8-bit mlx-community build is a higher-fidelity option.

For the full benchmark data (and to be the first to populate it), see /check/llama-3-3-70b/m2-max.

Troubleshooting

Out of memory / heavy swapping with long prompts

The default GPU-addressable share of unified memory is ~75% (~48 GB on a 64 GB Mac). The 4-bit weights alone are ~39.7 GB, so a long context can push past the ceiling and trigger swapping. Two fixes:

Keep the context at or below 8k tokens — the simplest option, and enough for most chat use.
Raise the wired-memory limit (macOS Sonoma 14 / Sequoia 15+):
```
sudo sysctl iogpu.wired_limit_mb=57344   # 56 GB; leaves ~8 GB for macOS
```
This lets the GPU address up to 56 GB. Always leave 8–16 GB of headroom for macOS — pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. This is recommended for the GGUF Q4_K_M path (~42.5 GB), which is tighter than the MLX 4-bit path.

The `meta-llama/Llama-3.3-70B-Instruct` repo asks for access

The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.3 Community License still governs your use of the model.

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.

No other widely-reported issues. Report problems via the submission form.