What You'll Build
A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp built with Metal on an Apple M2 Max (64GB unified memory), driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The vendor names Apple as an explicit target — "With its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM" (Devstral-Small-2-24B-Instruct-2512 model card). On a 64GB Mac the near-lossless Q8_0 is the comfortable default, with a tight, opt-in path to full bf16 for power users.
Hardware data: Apple M2 Max (64GB unified memory, Metal) · Devstral Small 2 (24B), GGUF Q8_0 (25.06GB, recommended) or bf16 (47.15GB, opt-in) · See benchmark data
ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks memory. The checkpoint is a
Mistral3ForConditionalGenerationwith a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.
⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date
llama-server(Metal) for now.
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | Apple Silicon with Metal, 64GB unified memory (this card's floor) | Apple M2 Max (64GB unified memory) |
| Memory | Unified memory shared with the OS — see the ceiling note below | 64GB unified (recommend Q8_0 at 25.06GB) |
| Storage | ~26GB (Q8_0) up to ~48GB (bf16) | ~26GB for Q8_0 |
| Software | llama.cpp incl. PR #17945 (Metal) or Ollama once it ships #17945; OpenHands or Mistral Vibe client | llama-server (Metal), OpenHands |
Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):
| Quant | On-disk size | Fit on M2 Max (64GB unified) |
|---|---|---|
| Q4_K_M | 14.33GB | Lighter option — frees the most memory for very long context |
| Q5_K_M | 16.76GB | Lighter option — small fidelity bump over Q4_K_M |
| Q6_K | 19.35GB | Lighter option — near-lossless weights with a smaller footprint than Q8_0 |
| Q8_0 | 25.06GB | Recommended — comfortable default, ~21GB of headroom under the ~46GB GPU-usable ceiling; the near-lossless quality pick |
| bf16 | 47.15GB | Opt-in only — 47GB sits above the ~46GB default GPU-usable ceiling; fits only by raising iogpu.wired_limit_mb and closing other apps (tight; not the default) |
The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.
ℹ️ Unified memory is shared with the OS. On Apple Silicon the GPU draws from the same pool as the system; macOS caps the GPU-usable slice at roughly 70–75% of total (about 46GB on a 64GB machine) unless you raise
iogpu.wired_limit_mb. Q8_0 (25.06GB) leaves ~21GB of headroom under that ceiling — the comfortable default. bf16 (47.15GB) is a tight, opt-in power-user path: it sits just above the ~46GB default ceiling, so it fits only if you raise the wired limit (sudo sysctl iogpu.wired_limit_mb=<value>) and close other apps to leave room for the KV cache and the OS. Q8_0 is the recommended quant; bf16 is an opt-in option on 64GB, not the default.
Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build with Metal (Option A) because of the PR #17945 requirement above.
Option A — llama.cpp with Metal (recommended for this release)
On Apple Silicon, llama.cpp builds with the Metal backend by default (-DGGML_METAL=ON is the macOS default). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# Metal is on by default on macOS; -DGGML_METAL=ON is explicit here.
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8
If you use a prebuilt llama.cpp release instead, pick a macOS-arm64 build published after 2025-12-12 from the releases page so it contains the fix. You need Xcode command-line tools (xcode-select --install) for the Metal build; no CUDA toolkit is involved on Apple Silicon.
Option B — Ollama / LM Studio (only once they ship #17945)
Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):
# Q8_0 (recommended on 64GB) — near-lossless, offload all layers to the Metal GPU
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
--port 8000 \
-ngl 99 \
-c 65536 \
--jinja
-ngl 99(--n-gpu-layers) offloads every layer to the Metal GPU — the dense 24B quant file (25.06GB at Q8_0) is held in unified memory.-c 65536sets a 64K context. On 64GB, Q8_0 weights (25.06GB) plus a 64K KV cache sit far under the ~46GB GPU-usable ceiling; raise or lower-cwhile watching memory in Activity Monitor (orsudo powermetrics --samplers gpu_powerfor GPU-side detail).--jinjaapplies the GGUF's built-in chat template so reasoning/tool-call blocks parse.
Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). With Q8_0's ~21GB of headroom you can hold a large window, but the full 256K KV cache at f16 is still very large — to reach the longest windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact (llama-server docs):
# Longer context on Q8_0 by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
--port 8000 -ngl 99 -c 131072 --jinja \
-fa on -ctk q8_0 -ctv q8_0
Opt into bf16 (power users). If you want full-precision weights, :bf16 (47.15GB) is possible on 64GB — but only by raising the GPU-usable ceiling first, since 47GB sits above the ~46GB default. Raise it and close other apps, then serve with a modest context so the KV cache still fits:
# One-time: raise the GPU-usable memory ceiling (leave headroom for the OS)
sudo sysctl iogpu.wired_limit_mb=57344 # ~56GB; adjust for your free memory
# bf16 (opt-in, tight) — full-precision weights, keep context modest
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:bf16 \
--port 8000 -ngl 99 -c 16384 --jinja
This is a tight power-user path, not the default. For everyday use stay on Q8_0 — it is near-lossless and leaves far more room for context.
With Ollama
Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.
Connect a coding agent
Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.
OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:
pip install openhands-ai
# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY" # any non-empty string; local servers don't check it
openhands
Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:
uv tool install mistral-vibe # or: pip install mistral-vibe
vibe
The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.
Results
- Memory usage: The dense 24B loads entirely as its GGUF file — Q8_0 is 25.06GB on disk (byte-verified from the bartowski GGUF tree). On the M2 Max's 64GB unified memory, Q8_0 is very comfortable — ~21GB of headroom under the ~46GB GPU-usable ceiling, enough for a large coding-session KV cache. Q6_K (19.35GB), Q5_K_M (16.76GB), and Q4_K_M (14.33GB) are lighter options; bf16 (47.15GB) fits only as an opt-in path — it sits above the ~46GB default ceiling, so it needs a raised
iogpu.wired_limit_mband other apps closed. - Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
- Speed: No local throughput benchmark for Devstral Small 2 on the Apple M2 Max exists yet — this is a new model and
/check/devstral-small-24b/m2-maxhas no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.
For the full benchmark data, see /check/devstral-small-24b/m2-max.
Troubleshooting
Output is garbled, degraded, or the model won't load correctly
This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt macOS-arm64 release published after that date.
Tool calls come back as raw text / the agent can't call tools
Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template isn't being applied.
Out of memory when raising the context (or trying bf16)
Unified memory is shared with the OS, and macOS caps the GPU-usable slice at roughly 70–75% of total (~46GB on a 64GB machine). Q8_0 weights (25.06GB) leave ~21GB for the KV cache — a very long window can still exhaust it. If you OOM after raising -c, either lower the context length, quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running), or drop to a lighter quant. For bf16 the ceiling is the whole story: at 47.15GB it sits above the ~46GB default, so it OOMs unless you first raise the ceiling with sudo sysctl iogpu.wired_limit_mb=<value> and close other apps — and even then keep context modest. Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.
torch / a Python ML stack not needed — this is llama.cpp
Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — the Metal GGUF path needs only the compiled llama-server. On Apple Silicon there is no CUDA toolkit; if cmake can't find Metal support, install Xcode command-line tools (xcode-select --install) and rebuild with -DGGML_METAL=ON.
Model or GPU 404 on /check
Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/m2-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.