How much VRAM does Llama 3.3 70B need?

About 40 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Llama 3.3 70B on Apple M4 Max: 70B-class chat in 48 GB unified memory with MLX

What You'll Build

A fully-local Llama 3.3 70B Instruct chat endpoint running on an Apple M4 Max with 48 GB unified memory, using Apple's native MLX runtime — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. The headline constraint on this chip is fit: 48 GB of unified memory leaves only ~32 GB the GPU can address by default, so this recipe leads with the 3-bit weights (which fit comfortably) and treats 4-bit as a higher-fidelity option that needs the memory limit raised.

Hardware data: Apple M4 Max (48 GB unified memory) · MLX 3-bit weights ~30.9 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by CPU and GPU — it is not 48 GB of dedicated VRAM. By default macOS only lets the GPU address roughly two-thirds of it (~32 GB safe / ~36 GB optimistic via Metal's recommendedMaxWorkingSetSize). The 3-bit model (~30.9 GB) fits that pool; the 4-bit model (~39.7 GB) does not fit by default and requires raising the wired-memory limit — see Requirements and Troubleshooting.

Requirements

Component	Minimum	Tested
GPU / memory	48 GB unified memory (~32 GB GPU-addressable by default)	Apple M4 Max (48 GB unified memory)
RAM	Same pool — unified	48 GB unified
Storage	~31 GB (MLX 3-bit) / ~40 GB (MLX 4-bit) / ~43 GB (GGUF Q4_K_M)	~31 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint here is addressable unified memory, not raw capacity. On a 48 GB Mac the GPU can address only ~32 GB safely (~36 GB optimistically) by default. The MLX 3-bit weights are 30.9 GB on disk (HF tree API for mlx-community/Llama-3.3-70B-Instruct-3bit — 6 safetensors shards; the model card lists 30.9 GB), which fits the ~32 GB safe pool with a small margin for the KV-cache. The MLX 4-bit weights are 39.69 GB on disk (HF tree API for mlx-community/Llama-3.3-70B-Instruct-4bit — 8 shards) — that exceeds the default-addressable pool, so the 4-bit path requires raising the wired-memory limit (Troubleshooting) before it will run. A 70B-class model's KV-cache runs ~1.25 MB/token in fp16 — about 5 GB at a 4k context — so keep the context window low (≤ 4k) on either path; the 3-bit weights leave only a ~2–3 GB sliver above the model for the KV-cache.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-3bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 3-bit weights (~30.9 GB, 6 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.3 Community License).

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Llama-3.3-70B-Instruct-3bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Higher fidelity: the 4-bit path (requires raising the wired limit)

The 3-bit weights are the comfortable default on a 48 GB Mac. If you want the higher-fidelity 4-bit weights (~39.7 GB), they do not fit the ~32 GB default-addressable pool — you must raise the wired-memory limit first (see Troubleshooting), then:

sudo sysctl iogpu.wired_limit_mb=43008   # ~42 GB; leaves ~6 GB for macOS
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "Explain unified memory in one paragraph."

With the limit raised to ~42 GB the 4-bit weights fit, but the headroom above them is small — keep the context window at or below 4k and watch Activity Monitor's Memory-Pressure gauge.

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~42.5 GB on disk):

# Ollama (simplest) — pulls a ~43 GB Q4_K_M build
ollama run llama3.3:70b

Ollama reports llama3.3:70b as a 43 GB build with a 128K context window (ollama.com/library/llama3.3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Llama-3.3-70B-Instruct-GGUF or unsloth/Llama-3.3-70B-Instruct-GGUF (Q4_K_M = 42.52 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

The GGUF Q4_K_M (~42.5 GB) is tighter than even the 4-bit MLX path against a 48 GB Mac's memory — running it needs the wired limit raised to ~44 GB with near-zero headroom. It is possible but not recommended on a 48 GB machine; it is better suited to a Mac with ≥ 64 GB unified memory. On the M4 Max, prefer the 3-bit MLX path (or 4-bit with the raise).

Results

Speed: No first-party Apple M4 Max benchmark for this pair has been recorded yet — /check/llama-3-3-70b/m4-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M4 Max runs ~546 GB/s unified memory — the fastest of the Apple chips we cover), and the only public 70B-on-M4-Max throughput figures come from a single community aggregator, which is not enough to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~30.9 GB resident for the 3-bit weights (or ~39.7 GB for 4-bit), plus KV-cache that grows with context. The 3-bit weights fit the ~32 GB default-addressable pool with ctx ≤ 4k; the 4-bit weights need the wired limit raised first — see Troubleshooting.
Quality notes: The 3-bit quantization trades more quality than 4-bit for the ability to fit comfortably in 48 GB without touching the memory limit. If you want higher fidelity, use the 4-bit build with the wired-limit raise; if you have a 64 GB or larger Apple Silicon machine, the 4-bit (or 8-bit) mlx-community build runs with more headroom.

For the full benchmark data (and to be the first to populate it), see /check/llama-3-3-70b/m4-max.

Troubleshooting

Out of memory / heavy swapping (or the 4-bit / GGUF weights won't load)

The default GPU-addressable share of unified memory on a 48 GB Mac is only ~32 GB safe (~36 GB optimistic). The 3-bit weights (~30.9 GB) fit this pool; the 4-bit weights (~39.7 GB) and the GGUF Q4_K_M (~42.5 GB) do not fit by default. Two fixes:

Use the 3-bit weights and keep the context at or below 4k tokens — the simplest option, and enough for most chat use. No memory-limit change needed.
Raise the wired-memory limit (macOS Sonoma 14 / Sequoia 15+) for the 4-bit or GGUF path:
```
sudo sysctl iogpu.wired_limit_mb=43008   # ~42 GB; leaves ~6 GB for macOS
```
This lets the GPU address up to ~42 GB, enough for the 4-bit weights. Always leave 6–16 GB of headroom for macOS — pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. The GGUF Q4_K_M path (~42.5 GB) is tighter still — it needs ~44 GB with almost no headroom and is better run on a ≥ 64 GB Mac.

The `meta-llama/Llama-3.3-70B-Instruct` repo asks for access

The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 3-bit / 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.3 Community License still governs your use of the model.

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.

No other widely-reported issues. Report problems via the submission form.