How much VRAM does Devstral Small 2 (24B) need?

About 64 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Devstral Small 2 (24B) on Apple M2 Max: Local Agentic Coding via llama.cpp Metal + OpenHands (64GB Apple / Q8_0 default, bf16 opt-in)

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp built with Metal on an Apple M2 Max (64GB unified memory), driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The vendor names Apple as an explicit target — "With its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM" (Devstral-Small-2-24B-Instruct-2512 model card). On a 64GB Mac the near-lossless Q8_0 is the comfortable default, with a tight, opt-in path to full bf16 for power users.

Hardware data: Apple M2 Max (64GB unified memory, Metal) · Devstral Small 2 (24B), GGUF Q8_0 (25.06GB, recommended) or bf16 (47.15GB, opt-in) · See benchmark data

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks memory. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date llama-server (Metal) for now.

Requirements

Component	Minimum	Tested target
GPU	Apple Silicon with Metal, 64GB unified memory (this card's floor)	Apple M2 Max (64GB unified memory)
Memory	Unified memory shared with the OS — see the ceiling note below	64GB unified (recommend Q8_0 at 25.06GB)
Storage	~26GB (Q8_0) up to ~48GB (bf16)	~26GB for Q8_0
Software	llama.cpp incl. PR #17945 (Metal) or Ollama once it ships #17945; OpenHands or Mistral Vibe client	`llama-server` (Metal), OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

Quant	On-disk size	Fit on M2 Max (64GB unified)
Q4_K_M	14.33GB	Lighter option — frees the most memory for very long context
Q5_K_M	16.76GB	Lighter option — small fidelity bump over Q4_K_M
Q6_K	19.35GB	Lighter option — near-lossless weights with a smaller footprint than Q8_0
Q8_0	25.06GB	Recommended — comfortable default, ~21GB of headroom under the ~46GB GPU-usable ceiling; the near-lossless quality pick
bf16	47.15GB	Opt-in only — 47GB sits above the ~46GB default GPU-usable ceiling; fits only by raising `iogpu.wired_limit_mb` and closing other apps (tight; not the default)

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.

ℹ️ Unified memory is shared with the OS. On Apple Silicon the GPU draws from the same pool as the system; macOS caps the GPU-usable slice at roughly 70–75% of total (about 46GB on a 64GB machine) unless you raise iogpu.wired_limit_mb. Q8_0 (25.06GB) leaves ~21GB of headroom under that ceiling — the comfortable default. bf16 (47.15GB) is a tight, opt-in power-user path: it sits just above the ~46GB default ceiling, so it fits only if you raise the wired limit (sudo sysctl iogpu.wired_limit_mb=<value>) and close other apps to leave room for the KV cache and the OS. Q8_0 is the recommended quant; bf16 is an opt-in option on 64GB, not the default.

Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build with Metal (Option A) because of the PR #17945 requirement above.

Option A — llama.cpp with Metal (recommended for this release)

On Apple Silicon, llama.cpp builds with the Metal backend by default (-DGGML_METAL=ON is the macOS default). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# Metal is on by default on macOS; -DGGML_METAL=ON is explicit here.
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

If you use a prebuilt llama.cpp release instead, pick a macOS-arm64 build published after 2025-12-12 from the releases page so it contains the fix. You need Xcode command-line tools (xcode-select --install) for the Metal build; no CUDA toolkit is involved on Apple Silicon.

Option B — Ollama / LM Studio (only once they ship #17945)

Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):

# Q8_0 (recommended on 64GB) — near-lossless, offload all layers to the Metal GPU
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 24B quant file (25.06GB at Q8_0) is held in unified memory.
-c 65536 sets a 64K context. On 64GB, Q8_0 weights (25.06GB) plus a 64K KV cache sit far under the ~46GB GPU-usable ceiling; raise or lower -c while watching memory in Activity Monitor (or sudo powermetrics --samplers gpu_power for GPU-side detail).
--jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse.

Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). With Q8_0's ~21GB of headroom you can hold a large window, but the full 256K KV cache at f16 is still very large — to reach the longest windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact (llama-server docs):

# Longer context on Q8_0 by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

Opt into bf16 (power users). If you want full-precision weights, :bf16 (47.15GB) is possible on 64GB — but only by raising the GPU-usable ceiling first, since 47GB sits above the ~46GB default. Raise it and close other apps, then serve with a modest context so the KV cache still fits:

# One-time: raise the GPU-usable memory ceiling (leave headroom for the OS)
sudo sysctl iogpu.wired_limit_mb=57344   # ~56GB; adjust for your free memory

# bf16 (opt-in, tight) — full-precision weights, keep context modest
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:bf16 \
    --port 8000 -ngl 99 -c 16384 --jinja

This is a tight power-user path, not the default. For everyday use stay on Q8_0 — it is near-lossless and leaves far more room for context.

With Ollama

Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.

OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:

uv tool install mistral-vibe   # or: pip install mistral-vibe
vibe

The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.

Results

Memory usage: The dense 24B loads entirely as its GGUF file — Q8_0 is 25.06GB on disk (byte-verified from the bartowski GGUF tree). On the M2 Max's 64GB unified memory, Q8_0 is very comfortable — ~21GB of headroom under the ~46GB GPU-usable ceiling, enough for a large coding-session KV cache. Q6_K (19.35GB), Q5_K_M (16.76GB), and Q4_K_M (14.33GB) are lighter options; bf16 (47.15GB) fits only as an opt-in path — it sits above the ~46GB default ceiling, so it needs a raised iogpu.wired_limit_mb and other apps closed.
Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
Speed: No local throughput benchmark for Devstral Small 2 on the Apple M2 Max exists yet — this is a new model and /check/devstral-small-24b/m2-max has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/m2-max.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt macOS-arm64 release published after that date.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template isn't being applied.

Out of memory when raising the context (or trying bf16)

Unified memory is shared with the OS, and macOS caps the GPU-usable slice at roughly 70–75% of total (~46GB on a 64GB machine). Q8_0 weights (25.06GB) leave ~21GB for the KV cache — a very long window can still exhaust it. If you OOM after raising -c, either lower the context length, quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running), or drop to a lighter quant. For bf16 the ceiling is the whole story: at 47.15GB it sits above the ~46GB default, so it OOMs unless you first raise the ceiling with sudo sysctl iogpu.wired_limit_mb=<value> and close other apps — and even then keep context modest. Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.

`torch` / a Python ML stack not needed — this is llama.cpp

Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — the Metal GGUF path needs only the compiled llama-server. On Apple Silicon there is no CUDA toolkit; if cmake can't find Metal support, install Xcode command-line tools (xcode-select --install) and rebuild with -DGGML_METAL=ON.

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/m2-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.