How much VRAM does Llama 3.1 8B need?

About 5 GB — the minimum this recipe targets.

How hard is this setup?

Beginner — follow the steps above.

Llama 3.1 8B on Apple M2 Pro: your first local LLM in 16 GB unified memory with MLX

What You'll Build

A fully-local Llama 3.1 8B Instruct chat endpoint running on an Apple M2 Pro with 16 GB unified memory, using Apple's native MLX runtime and 4-bit weights — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. At ~4.5 GB the weights fit comfortably inside what the GPU can address on a 16 GB Mac, which makes this the natural first local LLM on a 16 GB Mac — it runs with room to spare and needs no memory tuning.

Hardware data: Apple M2 Pro (16 GB unified memory) · MLX 4-bit weights ~4.5 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Pro has 16 GB of unified memory shared by CPU and GPU — not 16 GB of dedicated VRAM. By default macOS only lets the GPU address roughly two-thirds of it (~10.5 GB via Metal's recommendedMaxWorkingSetSize on a sub-64 GB Mac). Llama 3.1 8B at 4-bit is ~4.5 GB, so it sits well under that ~10.5 GB ceiling with room for the KV-cache and macOS — no wired-limit tuning is needed. This is the easy on-ramp: unlike a 20B or 70B model, the 8B never approaches the M2 Pro's fit wall.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable)	Apple M2 Pro (16 GB unified memory, ~10.5 GB addressable)
RAM	Same pool — unified	16 GB unified
Storage	~5 GB (MLX 4-bit) / ~5 GB (GGUF Q4_K_M)	~5 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity — and on a 16 GB Mac that addressable share is only ~10.5 GB, so it matters more here than on the larger Apple chips. For an 8B model at 4-bit it still binds comfortably: the MLX 4-bit weights are 4.53 GB on disk (HF tree API for mlx-community/Meta-Llama-3.1-8B-Instruct-4bit — a single model.safetensors shard). Against the ~10.5 GB the M2 Pro's GPU can address by default, that leaves roughly 6 GB for the KV-cache and macOS — ample for a generous context window at this model size. So this recipe is not memory-bound on the M2 Pro.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 4-bit weights (~4.5 GB, single shard) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.1 Community License).

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~4.9 GB on disk):

# Ollama (simplest) — pulls a ~4.9 GB Q4_K_M build
ollama run llama3.1:8b

Ollama reports llama3.1:8b as a 4.9 GB build with a 128K context window (ollama.com/library/llama3.1). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M = 4.92 GB) or unsloth/Llama-3.1-8B-Instruct-GGUF (Q4_K_M = 4.92 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

Both runtimes' Q4 files (~4.9 GB) fit the M2 Pro's ~10.5 GB default-addressable pool with margin to spare — neither the MLX nor the GGUF path needs the wired-limit raise that a 20B or 70B model would on this chip.

Results

Speed: No first-party Apple M2 Pro benchmark for this pair has been recorded yet — /check/llama-3-1-8b/m2-pro currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M2 Pro runs ~200 GB/s unified memory — the slowest of the Apple chips we cover), and no chip-named first-party throughput figure exists for this pair. Do not assume figures published for the faster M-series chips carry over. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~4.5 GB resident for the 4-bit weights, plus a KV-cache that grows with context. Fits the ~10.5 GB default-addressable pool with several GB of headroom — memory is not the limiting factor on this hardware at default context lengths.
Quality notes: The 4-bit quantization trades a small amount of quality for a smaller footprint. The higher-fidelity 8-bit mlx-community build (mlx-community/Meta-Llama-3.1-8B-Instruct-8bit) is ~8.5 GB on disk — it still fits the ~10.5 GB addressable share, but with much less headroom for the KV-cache, so keep the context modest if you switch to it on a 16 GB Mac.

For the full benchmark data (and to be the first to populate it), see /check/llama-3-1-8b/m2-pro.

Troubleshooting

The `meta-llama/Llama-3.1-8B-Instruct` repo asks for access

The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.1 Community License still governs your use of the model.

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.

Do I need to raise the unified-memory wired limit?

No — not for the 8B. The sudo sysctl iogpu.wired_limit_mb raise matters only when a model's weights plus KV-cache exceed the ~66% default-addressable share (~10.5 GB on a 16 GB Mac) — that is a 20B/70B-class problem on this chip. Llama 3.1 8B at ~4.5 GB sits well below the default ceiling, so the default limit is more than sufficient. Leave it alone.

No other widely-reported issues. Report problems via the submission form.