How much VRAM does Mistral Small 3.2 24B need?

This recipe targets a machine with 64GB of unified memory. The recommended Q8_0 weights take 25.05GB of that, plus the KV cache.

How hard is this setup?

Intermediate: follow the steps below.

Mistral Small 3.2 24B on M2 Max (64GB): Local Private Assistant via llama.cpp / Ollama on Apple Metal

What You'll Build

A fully local, private general assistant: Mistral Small 3.2 24B — Mistral's newest generalist Small (release 2506, superseding 3.1 from 2503) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on an Apple M2 Max with 64GB of unified memory, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a chat/reasoning/writing model, not a coding agent: general Q&A, drafting and editing, multi-step reasoning, 23-language multilingual support, and — because the checkpoint carries a Pixtral vision tower — optional image understanding (send it an image, it answers in text). Everything runs on your own hardware, so prompts and documents never leave the machine.

Hardware data: Apple M2 Max (64GB unified memory, Metal) · Mistral Small 3.2 24B, GGUF Q8_0 (25.05GB, recommended — near-lossless) — or Q6_K (19.35GB) / Q5_K_M (16.76GB) / Q4_K_M (14.33GB) for more context headroom; bf16 (47.15GB) is possible on this 64GB machine as an opt-in power-user path · See benchmark data

ℹ️ This is a dense 24B generalist, not a MoE and not text-only. Mistral Small 3.2 is a Mistral3ForConditionalGeneration (model_type: mistral3) with hidden size 5120, 40 layers, and GQA with 32 query / 8 KV heads. That is the same base architecture as Devstral, so the quant byte-sizes are identical. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks memory. The Pixtral vision tower means it can analyze images in addition to text, but it is positioned and used here as a general assistant, not a coding agent. Context window is 128K (max_position_embeddings 131072). It uses Mistral's Tekken tokenizer (tekken.json), which needs mistral-common >= 1.6.2 on the Python serving paths.

ℹ️ Runs on current llama.cpp out of the box. Unlike some later Mistral 3 releases, this June-2025 model needs no special patch — bartowski quantized it with llama.cpp release b5697 (June 2025), and Mistral3/Pixtral text support has been mainline since mid-2025. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the chat template applies; if tool-calling misbehaves, additionally pass the bundled --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja.

ℹ️ Unified memory is shared with the OS — budget ~70-75% for the GPU. On Apple Silicon the 64GB is one pool shared between CPU, GPU and macOS. Roughly 46GB is realistically usable by the model at defaults; the rest stays reserved for the system. The near-lossless Q8_0 (25.05GB) therefore sits very comfortably — about 21GB of headroom under that ~46GB ceiling (46 − 25.05 ≈ 21GB) for the KV cache and context. Running the full-precision bf16 (47.15GB) is possible on this 64GB machine, but only by raising the wired-memory cap with sudo sysctl iogpu.wired_limit_mb=<value> above the ~46GB default — it's tight and opt-in, not the recommended default.

Requirements

Component	Minimum	Tested target
GPU	Apple Silicon with Metal, 64GB unified memory (this recipe's floor)	M2 Max (64GB unified, Metal)
Memory	64GB unified (shared CPU/GPU/OS)	64GB unified
Storage	~14GB (Q4_K_M) up to ~47GB (bf16)	~26GB for Q8_0
Software	Recent llama.cpp (Metal) or Ollama; optional Open WebUI chat client	`llama-server`, Open WebUI

Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Small-3.2-24B-Instruct-2506); the model is quantized to GGUF by the community. Primary source is bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF; unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF is a good alternative that also ships UD-*_XL "dynamic" quants. Byte-verified on-disk sizes (bartowski):

Quant	On-disk size	Fit on M2 Max (64GB unified)
Q4_K_M	14.33GB	Comfortable — frees the most unified memory for a large KV cache / long context
Q5_K_M	16.76GB	Comfortable — a small fidelity bump over Q4_K_M, still lots of room for context
Q6_K	19.35GB	Comfortable — near-lossless weights with generous context headroom
Q8_0	25.05GB	Recommended — near-lossless quality, ~21GB of headroom under the ~46GB usable ceiling (46 − 25.05 ≈ 21GB) for a large KV cache
bf16	47.15GB	Fits only by raising `iogpu.wired_limit_mb` above the ~46GB default — tight, opt-in power-user path, not the default
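
The "fit" column is plain subtraction against the ~46GB usable ceiling quoted above. A minimal sketch of that arithmetic follows, assuming the 46GB figure holds on your machine (it is this recipe's estimate, not a measured limit). `headroom_gb` is a hypothetical helper name, not part of llama.cpp or Ollama.

```shell
# Headroom (GB) left for the KV cache after loading a quant.
# $1 = usable GPU ceiling in GB (the recipe's ~46GB estimate)
# $2 = quant file size in GB
headroom_gb() {
    awk -v ceiling="$1" -v quant="$2" 'BEGIN { printf "%.2f\n", ceiling - quant }'
}

# On-disk sizes taken from the table above.
for entry in Q4_K_M:14.33 Q5_K_M:16.76 Q6_K:19.35 Q8_0:25.05 bf16:47.15; do
    name="${entry%%:*}"
    size="${entry##*:}"
    printf '%-7s %7s GB headroom\n' "$name" "$(headroom_gb 46 "$size")"
done
```

A negative result (the bf16 row) means the weights alone exceed the default ceiling, which is why that path needs the wired-limit change.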

Not model weights — don't count these in the memory math:

The mmproj-* file (~0.88GB) is the vision projector, not the LLM. It is loaded alongside a quant via --mmproj only if you want image input, and adds ~0.88GB on top of the quant — exclude it from the weight/memory budget unless you actually enable vision.
The .imatrix (~10 MB) is calibration data used to produce the quants — never load it as a model.

Licensing. Mistral Small 3.2 24B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization. Both use the Apple Metal GPU backend automatically on macOS — there are no CUDA flags or drivers to install.

Option A — llama.cpp with Metal

On Apple Silicon, Metal is the default GPU backend — -DGGML_METAL=ON is on by default on macOS, so a standard build already uses the GPU. Build a recent llama.cpp per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Metal is enabled by default on macOS; the flag is shown here for clarity
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

A recent release is all you need — Mistral3/Pixtral text has been mainline in llama.cpp since mid-2025 (bartowski built these GGUFs with release b5697). If you prefer a prebuilt binary, grab a current macOS build from the releases page.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. On a Mac it uses Metal automatically. Use a recent Ollama release and pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0

Swap the :Q8_0 tag for :Q6_K, :Q5_K_M, or :Q4_K_M if you want more context headroom. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant (llama-server docs):

# Q8_0 (recommended, near-lossless), offload all layers to the Metal GPU
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 24B quant file (25.05GB at Q8_0) sits in unified memory.
-c 16384 sets a 16K context. Q8_0 leaves ~21GB free under the ~46GB usable ceiling, so you have ample room for a large KV cache; quantize the cache (below) to push the window much higher.
--jinja applies the GGUF's built-in chat template so the assistant format parses correctly. If tool-calling misbehaves, add --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled with the repo).

Push toward the 128K context window. Mistral Small 3.2 advertises a 128K context (max_position_embeddings 131072). To reach long windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact:

# Longer context by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 65536 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

To free even more unified memory for very long context, drop to :Q6_K (19.35GB), :Q5_K_M (16.76GB), or :Q4_K_M (14.33GB) — each lighter quant trades a little weight fidelity for a larger context budget.
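
To see why quantizing the cache matters, the KV-cache size can be estimated from the model shape. The sketch below uses the 40 layers and 8 KV heads stated in this recipe; the head dimension of 128 is an assumption that is not given above, so check the model's config.json before relying on the numbers. `kv_cache_mib` is a hypothetical helper, and the estimate ignores compute buffers and the optional vision projector.

```shell
# Estimate KV-cache size in MiB for a context length and cache type.
# layers=40 and kv_heads=8 come from this recipe.
# head_dim=128 is an ASSUMPTION (not stated in the recipe): verify in config.json.
kv_cache_mib() {
    n_ctx="$1"
    cache_type="${2:-f16}"
    layers=40; kv_heads=8; head_dim=128
    # Values stored per token: one K and one V vector per layer and KV head.
    elems=$((2 * layers * kv_heads * head_dim))
    case "$cache_type" in
        f16)  bytes=$((elems * n_ctx * 2)) ;;
        # q8_0 stores each block of 32 values in 34 bytes (32 int8 + one f16 scale).
        q8_0) bytes=$((elems * n_ctx * 34 / 32)) ;;
        *)    echo "unknown cache type: $cache_type" >&2; return 1 ;;
    esac
    echo $((bytes / 1048576))
}

for c in 16384 65536 131072; do
    printf '%7s tokens: f16 %6s MiB, q8_0 %6s MiB\n' \
        "$c" "$(kv_cache_mib "$c" f16)" "$(kv_cache_mib "$c" q8_0)"
done
```

Under these assumptions a 64K window costs about 10 GiB at f16 and a little over 5 GiB at q8_0, which matches the "roughly halves" rule of thumb.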

Optional — full-precision bf16 (power-user path). With 64GB you can run the full-precision bf16 (47.15GB) weights, but it exceeds the ~46GB default usable ceiling, so it fits only after raising the wired-memory cap:

# Opt-in: raise the GPU wired-memory limit before loading bf16 (tight; not the default)
sudo sysctl iogpu.wired_limit_mb=57344
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:bf16 \
    --port 8000 -ngl 99 -c 8192 --jinja

This leaves little room for the KV cache and context, so keep -c modest — Q8_0 is the far more comfortable default and is near-lossless already.
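
The 57344 value above is 56GB expressed in MB, which leaves 8GB of the 64GB pool for macOS. A small sketch of that conversion, in case you want a different reserve; `wired_limit_mb` is a hypothetical helper name and the right reserve for your workload is a judgment call, not something this recipe measured.

```shell
# Convert "total unified memory minus an OS reserve" into the MB value
# that iogpu.wired_limit_mb expects.
# $1 = total unified memory in GB, $2 = GB to leave for macOS
wired_limit_mb() {
    echo $(( ($1 - $2) * 1024 ))
}

wired_limit_mb 64 8    # prints 57344, the value used in this recipe

# On the Mac itself: inspect the current cap, then raise it.
# sysctl iogpu.wired_limit_mb
# sudo sysctl iogpu.wired_limit_mb="$(wired_limit_mb 64 8)"
```

A value set with sysctl this way does not survive a reboot, so it has to be reapplied before each bf16 session.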

Optional — image input. The Pixtral vision tower lets the model read images. Download the mmproj-* file from the same GGUF repo and pass it alongside the quant; it adds ~0.88GB of memory on top of the weights:

llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 \
    --mmproj mmproj-mistralai_Mistral-Small-3.2-24B-Instruct-2506-f16.gguf \
    --port 8000 -ngl 99 -c 16384 --jinja
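
Once the server is running with the projector, an image can be sent inline as a base64 data URI in an OpenAI-style `image_url` content part. The sketch below only builds the request body; `image_payload` is a hypothetical helper, `photo.png` is a placeholder, and the code assumes a PNG file and a prompt with no double quotes or backslashes (it does no JSON escaping).

```shell
# Build a chat-completions body with one text part and one inline PNG.
# $1 = path to a PNG file, $2 = prompt text
# ASSUMPTION: the prompt contains no double quotes or backslashes.
image_payload() {
    # Strip newlines so the base64 text is valid inside a JSON string.
    img_b64="$(base64 < "$1" | tr -d '\n')"
    printf '{"model":"mistral-small-3.2-24b","messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"data:image/png;base64,%s"}}]}]}\n' \
        "$2" "$img_b64"
}

# Usage against the vision-enabled server started above:
# image_payload photo.png "Describe this image." |
#     curl -s http://localhost:8000/v1/chat/completions \
#         -H "Content-Type: application/json" -d @-
```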

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

# Point Open WebUI at your local llama-server (or Ollama on :11434)
docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral-small-3.2-24b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
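
A 25GB model takes a while to load, so scripts may want to wait until the endpoint answers before sending the first request. The sketch below polls `/v1/models`, on the assumption that your server exposes that OpenAI-style route (llama-server and Ollama are expected to, but confirm on your build). `wait_for_server` is a hypothetical helper name.

```shell
# Poll the OpenAI-compatible /v1/models route until it returns success.
# $1 = base URL (default http://localhost:8000), $2 = max attempts (default 30)
wait_for_server() {
    base_url="${1:-http://localhost:8000}"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # -f makes curl fail on HTTP errors such as 503 while the model loads.
        if curl -sf "$base_url/v1/models" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage:
# wait_for_server http://localhost:8000 60 && echo "server is ready"
# wait_for_server http://localhost:11434 60   # Ollama path
```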

Results

Memory usage: The dense 24B loads entirely as its GGUF file into unified memory — Q8_0 is 25.05GB on disk (byte-verified from the bartowski GGUF tree). On the M2 Max's 64GB unified memory (~46GB realistically usable by the GPU) that leaves roughly 21GB for the KV cache (46 − 25.05 ≈ 21GB) — comfortable for a large context at f16, or a much larger window with an 8-bit-quantized cache (see Running). Q6_K (19.35GB), Q5_K_M (16.76GB), and Q4_K_M (14.33GB) free even more memory. Full-precision bf16 (47.15GB) fits only by raising iogpu.wired_limit_mb above the ~46GB default — a tight, opt-in path, not the default. Enabling image input adds ~0.88GB for the mmproj projector. Watch actual pressure in Activity Monitor (GPU / memory) or sudo powermetrics — there is no nvidia-smi on a Mac.
Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU Pro 5-shot CoT 69.06%, MATH 69.42%, GPQA Diamond 46.13%, HumanEval Plus pass@5 92.90%, MBPP Plus 78.33%, plus a sharp instruction-following jump over 3.1 — Wildbench v2 65.33% and Arena Hard v2 43.1%. On vision: MMMU 62.50% and DocVQA 94.86%. It handles 23 languages. These are the vendor's benchmarks, not measurements on this GPU.
Speed: No community throughput benchmark for Mistral Small 3.2 24B on the M2 Max exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-small-3-2-24b/m2-max once contributed.

For the full benchmark data, see /check/mistral-small-3-2-24b/m2-max.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Small 3.2 uses Mistral's own Tekken tokenizer (tekken.json), and on the Python serving paths that needs mistral-common >= 1.6.2. If tool-calling in particular misbehaves, additionally pass --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled in the model repo) to override the embedded one.

Out of memory, or the machine gets sluggish under load

Unified memory is shared with macOS, so a very long f16 context on top of Q8_0 — or the full bf16 weights — can push the system into memory pressure. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache memory); lower -c; or drop to a lighter quant — Q6_K (19.35GB), Q5_K_M (16.76GB), or Q4_K_M (14.33GB) — to free more memory for context. If you deliberately raised iogpu.wired_limit_mb to run bf16, that's the tight case; Q8_0 with ~21GB of headroom is the comfortable default. If you enabled --mmproj for images, remember it's another ~0.88GB.

Image input doesn't work

Vision needs the mmproj projector loaded alongside the quant via --mmproj (see Running) — the quant alone is text-only. The mmproj-* file lives in the same GGUF repo as the weights; make sure you're on a recent llama.cpp/Ollama build with multimodal support, and that your client actually sends the image in the request. The projector is ~0.88GB of extra memory.

This is llama.cpp on Metal, not a Python ML stack

Serving Mistral Small 3.2 via llama.cpp or Ollama does not require PyTorch, a Python ML stack, or any GPU driver install — Metal ships with macOS and is the default backend on Apple Silicon. If the model runs on CPU only, confirm your llama.cpp build has Metal enabled (it is on by default; -DGGML_METAL=ON) and that you passed -ngl 99 to offload the layers. For very-large-memory production serving you could instead run the full-precision weights under a server like vLLM; on this 64GB M2 Max the GGUF + llama.cpp path is simplest, with bf16 reserved for the opt-in wired-limit path above.

Model or GPU 404 on /check

Mistral Small 3.2 24B is a new addition; if the /check/mistral-small-3-2-24b/m2-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.