How much VRAM does Devstral Small 2 (24B) need?

About 48 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Devstral Small 2 (24B) on Apple M3 Max: Local Agentic Coding via llama.cpp Metal + OpenHands (48GB Apple / Q8_0 near-lossless)

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp built with Metal on an Apple M3 Max (48GB unified memory), driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The vendor names Apple as an explicit target — "With its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM" (Devstral-Small-2-24B-Instruct-2512 model card). On a 48GB Mac you go a full quant step further than the vendor's 32GB floor — big enough for the near-lossless Q8_0.

Hardware data: Apple M3 Max (48GB unified memory, Metal) · Devstral Small 2 (24B), GGUF Q8_0 (25.06GB, recommended) or lighter Q6_K/Q5_K_M/Q4_K_M · See benchmark data

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks memory. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date llama-server (Metal) for now.

Requirements

Component	Minimum	Tested target
GPU	Apple Silicon with Metal, 48GB unified memory (this card's floor)	Apple M3 Max (48GB unified memory)
Memory	Unified memory shared with the OS — see the ceiling note below	48GB unified (recommend Q8_0 at 25.06GB)
Storage	~15GB (Q4_K_M) up to ~26GB (Q8_0)	~26GB for Q8_0
Software	llama.cpp incl. PR #17945 (Metal) or Ollama once it ships #17945; OpenHands or Mistral Vibe client	`llama-server` (Metal), OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

Quant	On-disk size	Fit on M3 Max (48GB unified)
Q4_K_M	14.33GB	Lighter option — frees the most memory for very long context
Q5_K_M	16.76GB	Lighter option — small fidelity bump over Q4_K_M, still leaves ample context room
Q6_K	19.35GB	Lighter option — near-lossless weights with a smaller footprint than Q8_0
Q8_0	25.06GB	Recommended — comfortable on 48GB (well under the ~34–36GB GPU-usable ceiling); the near-lossless quality pick
bf16	47.15GB	Does not fit 48GB — exceeds the default GPU-usable ceiling (~34–36GB of 48); needs a larger Mac

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.

ℹ️ Unified memory is shared with the OS. On Apple Silicon the GPU draws from the same pool as the system; macOS caps the GPU-usable slice at roughly 70–75% of total (about 34–36GB on a 48GB machine) unless you raise iogpu.wired_limit_mb. Q8_0 (25.06GB) sits comfortably under that ceiling with room for the KV cache; bf16 (47.15GB) does not fit — it exceeds the default GPU-usable ceiling even before the KV cache, so 48GB is a Q8_0-and-below machine.

Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build with Metal (Option A) because of the PR #17945 requirement above.

Option A — llama.cpp with Metal (recommended for this release)

On Apple Silicon, llama.cpp builds with the Metal backend by default (-DGGML_METAL=ON is the macOS default). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# Metal is on by default on macOS; -DGGML_METAL=ON is explicit here.
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

If you use a prebuilt llama.cpp release instead, pick a macOS-arm64 build published after 2025-12-12 from the releases page so it contains the fix. You need Xcode command-line tools (xcode-select --install) for the Metal build; no CUDA toolkit is involved on Apple Silicon.

Option B — Ollama / LM Studio (only once they ship #17945)

Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):

# Q8_0 (recommended on 48GB) — near-lossless, offload all layers to the Metal GPU
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 24B quant file (25.06GB at Q8_0) is held in unified memory.
-c 65536 sets a 64K context. On 48GB, Q8_0 weights (25.06GB) plus a 64K KV cache sit under the ~34–36GB GPU-usable ceiling; raise or lower -c while watching memory in Activity Monitor (or sudo powermetrics --samplers gpu_power for GPU-side detail).
--jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse.

Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). Holding the full 256K KV cache alongside Q8_0 weights will exceed the GPU-usable slice — to reach much longer windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact (llama-server docs):

# Longer context on Q8_0 by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

To free the most memory for a very long context, drop to a lighter quant — :Q6_K (19.35GB, still near-lossless weights), :Q5_K_M (16.76GB), or :Q4_K_M (14.33GB) — each frees several GB for the KV cache at a small fidelity cost.

With Ollama

Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.

OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:

uv tool install mistral-vibe   # or: pip install mistral-vibe
vibe

The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.

Results

Memory usage: The dense 24B loads entirely as its GGUF file — Q8_0 is 25.06GB on disk (byte-verified from the bartowski GGUF tree). On the M3 Max's 48GB unified memory, Q8_0 is comfortable — it sits well under the ~34–36GB GPU-usable ceiling, leaving room for the KV cache at a large coding-session context. Q6_K (19.35GB), Q5_K_M (16.76GB), and Q4_K_M (14.33GB) are lighter options that free more memory for very long context; bf16 (47.15GB) does not fit — it exceeds the default GPU-usable ceiling on a 48GB machine.
Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
Speed: No local throughput benchmark for Devstral Small 2 on the Apple M3 Max exists yet — this is a new model and /check/devstral-small-24b/m3-max has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/m3-max.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt macOS-arm64 release published after that date.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template isn't being applied.

Out of memory when raising the context

Unified memory is shared with the OS, and macOS caps the GPU-usable slice at roughly 70–75% of total (~34–36GB on a 48GB machine). Q8_0 weights (25.06GB) leave a few GB under that ceiling for the KV cache; a very long window can still exhaust it. If you OOM after raising -c, either lower the context length, quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running), or drop to a lighter quant (Q6_K/Q5_K_M/Q4_K_M) to free several GB. You can also raise the GPU-usable ceiling with sudo sysctl iogpu.wired_limit_mb=<value> — but leave headroom for the OS. Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.

`torch` / a Python ML stack not needed — this is llama.cpp

Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — the Metal GGUF path needs only the compiled llama-server. On Apple Silicon there is no CUDA toolkit; if cmake can't find Metal support, install Xcode command-line tools (xcode-select --install) and rebuild with -DGGML_METAL=ON.

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/m3-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.