What You'll Build
A fully-local Llama 3.1 8B Instruct chat endpoint running on an Apple M4 Max with 48 GB unified memory, using Apple's native MLX runtime and 4-bit weights — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. At ~4.5 GB the weights are small enough that this is the easy on-ramp to local LLMs on a Mac — it fits with room to spare and needs no memory tuning.
Hardware data: Apple M4 Max (48 GB unified memory) · MLX 4-bit weights ~4.5 GB on disk · See benchmark data
ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by CPU and GPU — not 48 GB of dedicated VRAM. By default macOS only lets the GPU address roughly two-thirds of it (~32 GB safe / ~36 GB optimistic via Metal's
recommendedMaxWorkingSetSize). Unlike a 70B model, the 4.5 GB Llama 3.1 8B sits far below that ceiling, so no wired-limit tuning is needed here — this recipe runs comfortably on any Apple Silicon Mac, down to a 16 GB MacBook Air (~10.5 GB GPU-addressable).
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU / memory | 16 GB unified memory (~10.5 GB GPU-addressable) | Apple M4 Max (48 GB unified memory, ~32 GB addressable) |
| RAM | Same pool — unified | 48 GB unified |
| Storage | ~5 GB (MLX 4-bit) / ~5 GB (GGUF Q4_K_M) | ~5 GB |
| Software | Python 3.10+, macOS Sonoma 14 / Sequoia 15+ | macOS Sequoia 15 |
The binding constraint on Apple Silicon is addressable unified memory, not raw capacity — but for an 8B model at 4-bit it barely binds at all. The MLX 4-bit weights are 4.52 GB on disk (HF tree API for mlx-community/Meta-Llama-3.1-8B-Instruct-4bit — a single model.safetensors shard). Against the ~32 GB the M4 Max's GPU can address by default, that leaves ample headroom for the KV-cache and macOS itself. Even the smallest current Apple Silicon config (16 GB → ~10.5 GB addressable) has room for the weights plus a generous context window, so this recipe is not memory-bound on the M4 Max.
Installation
1. Install MLX-LM (the Apple-native path)
pip install mlx-lm
MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)
2. Run the model (weights download on first use)
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "Explain unified memory in one paragraph."
On first run, mlx-lm pulls the 4-bit weights (~4.5 GB, single shard) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.1 Community License).
Running
For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):
mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)
Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)
If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~4.9 GB on disk):
# Ollama (simplest) — pulls a ~4.9 GB Q4_K_M build
ollama run llama3.1:8b
Ollama reports llama3.1:8b as a 4.9 GB build with a 128K context window (ollama.com/library/llama3.1). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M = 4.92 GB) or unsloth/Llama-3.1-8B-Instruct-GGUF (Q4_K_M = 4.92 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.
Both runtimes' Q4 files (~4.9 GB) fit the M4 Max's ~32 GB default-addressable pool with enormous margin — neither the MLX nor the GGUF path needs the wired-limit raise that a 70B model would.
Results
- Speed: No first-party Apple M4 Max benchmark for this pair has been recorded yet — /check/llama-3-1-8b/m4-max currently returns
verdict: unknownwith no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M4 Max runs ~546 GB/s unified memory — the fastest of any Apple chip we cover), and no chip-named first-party throughput figure exists for this pair. If you run this, please contribute your tok/s so we can seed a real datapoint. - Memory usage: ~4.5 GB resident for the 4-bit weights, plus a KV-cache that grows with context. Fits the ~32 GB default-addressable pool many times over — memory is not the limiting factor on this hardware.
- Quality notes: The 4-bit quantization trades a small amount of quality for a smaller footprint. With ~32 GB addressable on the M4 Max you have plenty of headroom for the higher-fidelity 8-bit
mlx-communitybuild (mlx-community/Meta-Llama-3.1-8B-Instruct-8bit) if you want it — the 8B is small enough that even 8-bit leaves the memory budget untouched.
For the full benchmark data (and to be the first to populate it), see /check/llama-3-1-8b/m4-max.
Troubleshooting
The meta-llama/Llama-3.1-8B-Instruct repo asks for access
The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.1 Community License still governs your use of the model.
Tried to install FlashAttention / bitsandbytes / a cu12x wheel and it failed
None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.
Do I need to raise the unified-memory wired limit?
No. The sudo sysctl iogpu.wired_limit_mb raise matters only when a model's weights plus KV-cache exceed the default-addressable share (~32 GB on a 48 GB Mac) — that is a 70B-class problem. Llama 3.1 8B at ~4.5 GB sits far below the default ceiling on every Apple Silicon Mac the site covers, so the default limit is more than sufficient. Leave it alone.
No other widely-reported issues. Report problems via the submission form.