self-hosted/ai
§01·recipe · llm

Llama 3.3 70B on Apple M4 Max: 70B-class chat in 48 GB unified memory with MLX

llmadvanced40GB+ VRAMJun 23, 2026

This advanced recipe sets up Llama 3.3 70B on the Apple M4 Max, needing about 40 GB of VRAM.

models
tools
prerequisites
  • Apple M4 Max with 48 GB unified memory (or any Apple Silicon Mac with ≥ 48 GB)
  • macOS Sonoma 14 or Sequoia 15+
  • Python 3.10+
  • ~31 GB free disk for the 3-bit weights (~40 GB for 4-bit, ~43 GB for the GGUF path)

What You'll Build

A fully-local Llama 3.3 70B Instruct chat endpoint running on an Apple M4 Max with 48 GB unified memory, using Apple's native MLX runtime — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. The headline constraint on this chip is fit: 48 GB of unified memory leaves only ~32 GB the GPU can address by default, so this recipe leads with the 3-bit weights (which fit comfortably) and treats 4-bit as a higher-fidelity option that needs the memory limit raised.

Hardware data: Apple M4 Max (48 GB unified memory) · MLX 3-bit weights ~30.9 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by CPU and GPU — it is not 48 GB of dedicated VRAM. By default macOS only lets the GPU address roughly two-thirds of it (~32 GB safe / ~36 GB optimistic via Metal's recommendedMaxWorkingSetSize). The 3-bit model (~30.9 GB) fits that pool; the 4-bit model (~39.7 GB) does not fit by default and requires raising the wired-memory limit — see Requirements and Troubleshooting.

Requirements

ComponentMinimumTested
GPU / memory48 GB unified memory (~32 GB GPU-addressable by default)Apple M4 Max (48 GB unified memory)
RAMSame pool — unified48 GB unified
Storage~31 GB (MLX 3-bit) / ~40 GB (MLX 4-bit) / ~43 GB (GGUF Q4_K_M)~31 GB
SoftwarePython 3.10+, macOS Sonoma 14 / Sequoia 15+macOS Sequoia 15

The binding constraint here is addressable unified memory, not raw capacity. On a 48 GB Mac the GPU can address only ~32 GB safely (~36 GB optimistically) by default. The MLX 3-bit weights are 30.9 GB on disk (HF tree API for mlx-community/Llama-3.3-70B-Instruct-3bit — 6 safetensors shards; the model card lists 30.9 GB), which fits the ~32 GB safe pool with a small margin for the KV-cache. The MLX 4-bit weights are 39.69 GB on disk (HF tree API for mlx-community/Llama-3.3-70B-Instruct-4bit — 8 shards) — that exceeds the default-addressable pool, so the 4-bit path requires raising the wired-memory limit (Troubleshooting) before it will run. A 70B-class model's KV-cache runs ~1.25 MB/token in fp16 — about 5 GB at a 4k context — so keep the context window low (≤ 4k) on either path; the 3-bit weights leave only a ~2–3 GB sliver above the model for the KV-cache.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-3bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 3-bit weights (~30.9 GB, 6 shards) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.3 Community License).

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Llama-3.3-70B-Instruct-3bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Higher fidelity: the 4-bit path (requires raising the wired limit)

The 3-bit weights are the comfortable default on a 48 GB Mac. If you want the higher-fidelity 4-bit weights (~39.7 GB), they do not fit the ~32 GB default-addressable pool — you must raise the wired-memory limit first (see Troubleshooting), then:

sudo sysctl iogpu.wired_limit_mb=43008   # ~42 GB; leaves ~6 GB for macOS
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --prompt "Explain unified memory in one paragraph."

With the limit raised to ~42 GB the 4-bit weights fit, but the headroom above them is small — keep the context window at or below 4k and watch Activity Monitor's Memory-Pressure gauge.

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~42.5 GB on disk):

# Ollama (simplest) — pulls a ~43 GB Q4_K_M build
ollama run llama3.3:70b

Ollama reports llama3.3:70b as a 43 GB build with a 128K context window (ollama.com/library/llama3.3). For a hand-managed llama.cpp build, Metal is enabled by default on macOS"On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Llama-3.3-70B-Instruct-GGUF or unsloth/Llama-3.3-70B-Instruct-GGUF (Q4_K_M = 42.52 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

The GGUF Q4_K_M (~42.5 GB) is tighter than even the 4-bit MLX path against a 48 GB Mac's memory — running it needs the wired limit raised to ~44 GB with near-zero headroom. It is possible but not recommended on a 48 GB machine; it is better suited to a Mac with ≥ 64 GB unified memory. On the M4 Max, prefer the 3-bit MLX path (or 4-bit with the raise).

Results

  • Speed: No first-party Apple M4 Max benchmark for this pair has been recorded yet — /check/llama-3-3-70b/m4-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M4 Max runs ~546 GB/s unified memory — the fastest of the Apple chips we cover), and the only public 70B-on-M4-Max throughput figures come from a single community aggregator, which is not enough to publish as a measured result. If you run this, please contribute your tok/s so we can seed a real datapoint.
  • Memory usage: ~30.9 GB resident for the 3-bit weights (or ~39.7 GB for 4-bit), plus KV-cache that grows with context. The 3-bit weights fit the ~32 GB default-addressable pool with ctx ≤ 4k; the 4-bit weights need the wired limit raised first — see Troubleshooting.
  • Quality notes: The 3-bit quantization trades more quality than 4-bit for the ability to fit comfortably in 48 GB without touching the memory limit. If you want higher fidelity, use the 4-bit build with the wired-limit raise; if you have a 64 GB or larger Apple Silicon machine, the 4-bit (or 8-bit) mlx-community build runs with more headroom.

For the full benchmark data (and to be the first to populate it), see /check/llama-3-3-70b/m4-max.

Troubleshooting

Out of memory / heavy swapping (or the 4-bit / GGUF weights won't load)

The default GPU-addressable share of unified memory on a 48 GB Mac is only ~32 GB safe (~36 GB optimistic). The 3-bit weights (~30.9 GB) fit this pool; the 4-bit weights (~39.7 GB) and the GGUF Q4_K_M (~42.5 GB) do not fit by default. Two fixes:

  1. Use the 3-bit weights and keep the context at or below 4k tokens — the simplest option, and enough for most chat use. No memory-limit change needed.

  2. Raise the wired-memory limit (macOS Sonoma 14 / Sequoia 15+) for the 4-bit or GGUF path:

    sudo sysctl iogpu.wired_limit_mb=43008   # ~42 GB; leaves ~6 GB for macOS
    

    This lets the GPU address up to ~42 GB, enough for the 4-bit weights. Always leave 6–16 GB of headroom for macOS — pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. The GGUF Q4_K_M path (~42.5 GB) is tighter still — it needs ~44 GB with almost no headroom and is better run on a ≥ 64 GB Mac.

The meta-llama/Llama-3.3-70B-Instruct repo asks for access

The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 3-bit / 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.3 Community License still governs your use of the model.

Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.

No other widely-reported issues. Report problems via the submission form.

common questions
How much VRAM does Llama 3.3 70B need?

About 40 GB — the minimum this recipe targets.

Which GPUs is Llama 3.3 70B tested on?

Apple M4 Max (48 GB).

How hard is this setup?

Advanced — follow the steps above.