How much VRAM does Llama 3.1 8B need?

About 5 GB — the minimum this recipe targets.

How hard is this setup?

Beginner — follow the steps above.

Llama 3.1 8B on Apple M2 Max: the easy local-LLM on-ramp in unified memory with MLX

What You'll Build

A fully-local Llama 3.1 8B Instruct chat endpoint running on an Apple M2 Max with 64 GB unified memory, using Apple's native MLX runtime and 4-bit weights — no NVIDIA GPU, no CUDA, no FlashAttention. You get an OpenAI-compatible local server you can point any chat client at, plus a one-shot CLI for scripting. At ~4.5 GB the weights are small enough that this is the easy on-ramp to local LLMs on a Mac — it fits with room to spare and needs no memory tuning.

Hardware data: Apple M2 Max (64 GB unified memory) · MLX 4-bit weights ~4.5 GB on disk · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by CPU and GPU — not 64 GB of dedicated VRAM. By default macOS only lets the GPU address roughly 75% of it (~48 GB via Metal's recommendedMaxWorkingSetSize). Unlike a 70B model, the 4.5 GB Llama 3.1 8B sits far below that ceiling, so no wired-limit tuning is needed here — this recipe runs comfortably on any Apple Silicon Mac, down to a 16 GB MacBook Air (~10.5 GB GPU-addressable).

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable)	Apple M2 Max (64 GB unified memory, ~48 GB addressable)
RAM	Same pool — unified	64 GB unified
Storage	~5 GB (MLX 4-bit) / ~5 GB (GGUF Q4_K_M)	~5 GB
Software	Python 3.10+, macOS Sonoma 14 / Sequoia 15+	macOS Sequoia 15

The binding constraint on Apple Silicon is addressable unified memory, not raw capacity — but for an 8B model at 4-bit it barely binds at all. The MLX 4-bit weights are 4.52 GB on disk (HF tree API for mlx-community/Meta-Llama-3.1-8B-Instruct-4bit — a single model.safetensors shard). Against the ~48 GB the M2 Max's GPU can address by default, that leaves ample headroom for the KV-cache and macOS itself. Even the smallest current Apple Silicon config (16 GB → ~10.5 GB addressable) has room for the weights plus a generous context window, so this recipe is not memory-bound on the M2 Max.

Installation

1. Install MLX-LM (the Apple-native path)

pip install mlx-lm

MLX is Apple's array framework; mlx-lm is its LLM front-end. There is nothing CUDA-shaped to install — no torch build flags, no cu12x wheel, no FlashAttention. (ml-explore/mlx-lm)

2. Run the model (weights download on first use)

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "Explain unified memory in one paragraph."

On first run, mlx-lm pulls the 4-bit weights (~4.5 GB, single shard) from the mlx-community Hugging Face org and caches them under ~/.cache/huggingface. These mirror weights are not gated, so no license-acceptance step is needed to download them (the underlying model is governed by the Llama 3.1 Community License).

Running

For an interactive, OpenAI-compatible local server (so you can point Open WebUI, a chat client, or your own code at it):

mlx_lm.server --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

This starts a local server on 127.0.0.1:8080 exposing an OpenAI-style /v1/chat/completions endpoint. It is a development server — bind it to localhost only. (mlx-lm SERVER docs)

Alternative: the GGUF path (llama.cpp / Ollama / LM Studio)

If you prefer the portable GGUF ecosystem, the same model is available as a Q4_K_M GGUF (~4.9 GB on disk):

# Ollama (simplest) — pulls a ~4.9 GB Q4_K_M build
ollama run llama3.1:8b

Ollama reports llama3.1:8b as a 4.9 GB build with a 128K context window (ollama.com/library/llama3.1). For a hand-managed llama.cpp build, Metal is enabled by default on macOS — "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs) — so a standard cmake -B build && cmake --build build --config Release already runs on the GPU; point it at a Q4_K_M GGUF from bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M = 4.92 GB) or unsloth/Llama-3.1-8B-Instruct-GGUF (Q4_K_M = 4.92 GB). LM Studio runs both MLX and GGUF from a GUI if you prefer not to touch the terminal.

Both runtimes' Q4 files (~4.9 GB) fit the M2 Max's ~48 GB default-addressable pool with enormous margin — neither the MLX nor the GGUF path needs the wired-limit raise that a 70B model would.

Results

Speed: No first-party Apple M2 Max benchmark for this pair has been recorded yet — /check/llama-3-1-8b/m2-max currently returns verdict: unknown with no measurements. We are deliberately not quoting a token/sec figure: token generation on Apple Silicon is bandwidth-bound (the M2 Max runs ~400 GB/s unified memory), and no chip-named first-party throughput figure exists for this pair. If you run this, please contribute your tok/s so we can seed a real datapoint.
Memory usage: ~4.5 GB resident for the 4-bit weights, plus a KV-cache that grows with context. Fits the ~48 GB default-addressable pool many times over — memory is not the limiting factor on this hardware.
Quality notes: The 4-bit quantization trades a small amount of quality for a smaller footprint. With ~48 GB addressable on the M2 Max you have plenty of headroom for the higher-fidelity 8-bit mlx-community build (mlx-community/Meta-Llama-3.1-8B-Instruct-8bit) if you want it — the 8B is small enough that even 8-bit leaves the memory budget untouched.

For the full benchmark data (and to be the first to populate it), see /check/llama-3-1-8b/m2-max.

Troubleshooting

The `meta-llama/Llama-3.1-8B-Instruct` repo asks for access

The canonical Meta repo is gated (manual approval). You do not need it for this recipe: the mlx-community 4-bit weights and the bartowski/unsloth GGUF mirrors are ungated re-distributions of the same model and download without an access request. The Llama 3.1 Community License still governs your use of the model.

Tried to install FlashAttention / bitsandbytes / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FlashAttention, and no GPU bitsandbytes kernel on macOS — MLX uses its own Metal attention and its own 4-bit quantization, and llama.cpp uses Metal + GGUF K-quants. If a generic Llama tutorial tells you to pip install flash-attn or pass --load-in-4bit, skip those steps entirely; the commands above are the complete Apple path.

Do I need to raise the unified-memory wired limit?

No. The sudo sysctl iogpu.wired_limit_mb raise matters only when a model's weights plus KV-cache exceed the ~75% default-addressable share (~48 GB on a 64 GB Mac) — that is a 70B-class problem. Llama 3.1 8B at ~4.5 GB sits far below the default ceiling on every Apple Silicon Mac the site covers, so the default limit is more than sufficient. Leave it alone.

No other widely-reported issues. Report problems via the submission form.