self-hosted/ai
§01·recipe · llm

Devstral Small 2 (24B) on Apple M3 Max: Local Agentic Coding via llama.cpp Metal + OpenHands (48GB Apple / Q8_0 near-lossless)

llmintermediate48GB+ VRAMJul 3, 2026

This intermediate recipe sets up Devstral Small 2 (24B) on the Apple M3 Max, needing about 48 GB of VRAM.

models
tools
prerequisites
  • Apple M3 Max with 48GB unified memory (Metal GPU) — the vendor names 'a Mac with 32GB RAM' as a target; 48GB comfortably runs the near-lossless Q8_0
  • macOS with Xcode command-line tools (for building llama.cpp with Metal)
  • ~26GB free disk for the Q8_0 GGUF (down to ~15GB for Q4_K_M)
  • A llama.cpp build that includes PR ggml-org/llama.cpp#17945 (see the critical note below); Python 3.10+ for the OpenHands agent

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp built with Metal on an Apple M3 Max (48GB unified memory), driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The vendor names Apple as an explicit target — "With its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM" (Devstral-Small-2-24B-Instruct-2512 model card). On a 48GB Mac you go a full quant step further than the vendor's 32GB floor — big enough for the near-lossless Q8_0.

Hardware data: Apple M3 Max (48GB unified memory, Metal) · Devstral Small 2 (24B), GGUF Q8_0 (25.06GB, recommended) or lighter Q6_K/Q5_K_M/Q4_K_M · See benchmark data

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks memory. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date llama-server (Metal) for now.

Requirements

ComponentMinimumTested target
GPUApple Silicon with Metal, 48GB unified memory (this card's floor)Apple M3 Max (48GB unified memory)
MemoryUnified memory shared with the OS — see the ceiling note below48GB unified (recommend Q8_0 at 25.06GB)
Storage~15GB (Q4_K_M) up to ~26GB (Q8_0)~26GB for Q8_0
Softwarellama.cpp incl. PR #17945 (Metal) or Ollama once it ships #17945; OpenHands or Mistral Vibe clientllama-server (Metal), OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

QuantOn-disk sizeFit on M3 Max (48GB unified)
Q4_K_M14.33GBLighter option — frees the most memory for very long context
Q5_K_M16.76GBLighter option — small fidelity bump over Q4_K_M, still leaves ample context room
Q6_K19.35GBLighter option — near-lossless weights with a smaller footprint than Q8_0
Q8_025.06GBRecommended — comfortable on 48GB (well under the ~34–36GB GPU-usable ceiling); the near-lossless quality pick
bf1647.15GBDoes not fit 48GB — exceeds the default GPU-usable ceiling (~34–36GB of 48); needs a larger Mac

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.

ℹ️ Unified memory is shared with the OS. On Apple Silicon the GPU draws from the same pool as the system; macOS caps the GPU-usable slice at roughly 70–75% of total (about 34–36GB on a 48GB machine) unless you raise iogpu.wired_limit_mb. Q8_0 (25.06GB) sits comfortably under that ceiling with room for the KV cache; bf16 (47.15GB) does not fit — it exceeds the default GPU-usable ceiling even before the KV cache, so 48GB is a Q8_0-and-below machine.

Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build with Metal (Option A) because of the PR #17945 requirement above.

Option A — llama.cpp with Metal (recommended for this release)

On Apple Silicon, llama.cpp builds with the Metal backend by default (-DGGML_METAL=ON is the macOS default). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# Metal is on by default on macOS; -DGGML_METAL=ON is explicit here.
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

If you use a prebuilt llama.cpp release instead, pick a macOS-arm64 build published after 2025-12-12 from the releases page so it contains the fix. You need Xcode command-line tools (xcode-select --install) for the Metal build; no CUDA toolkit is involved on Apple Silicon.

Option B — Ollama / LM Studio (only once they ship #17945)

Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):

# Q8_0 (recommended on 48GB) — near-lossless, offload all layers to the Metal GPU
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 24B quant file (25.06GB at Q8_0) is held in unified memory.
  • -c 65536 sets a 64K context. On 48GB, Q8_0 weights (25.06GB) plus a 64K KV cache sit under the ~34–36GB GPU-usable ceiling; raise or lower -c while watching memory in Activity Monitor (or sudo powermetrics --samplers gpu_power for GPU-side detail).
  • --jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse.

Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). Holding the full 256K KV cache alongside Q8_0 weights will exceed the GPU-usable slice — to reach much longer windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact (llama-server docs):

# Longer context on Q8_0 by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

To free the most memory for a very long context, drop to a lighter quant — :Q6_K (19.35GB, still near-lossless weights), :Q5_K_M (16.76GB), or :Q4_K_M (14.33GB) — each frees several GB for the KV cache at a small fidelity cost.

With Ollama

Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.

OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:

uv tool install mistral-vibe   # or: pip install mistral-vibe
vibe

The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.

Results

  • Memory usage: The dense 24B loads entirely as its GGUF file — Q8_0 is 25.06GB on disk (byte-verified from the bartowski GGUF tree). On the M3 Max's 48GB unified memory, Q8_0 is comfortable — it sits well under the ~34–36GB GPU-usable ceiling, leaving room for the KV cache at a large coding-session context. Q6_K (19.35GB), Q5_K_M (16.76GB), and Q4_K_M (14.33GB) are lighter options that free more memory for very long context; bf16 (47.15GB) does not fit — it exceeds the default GPU-usable ceiling on a 48GB machine.
  • Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
  • Speed: No local throughput benchmark for Devstral Small 2 on the Apple M3 Max exists yet — this is a new model and /check/devstral-small-24b/m3-max has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/m3-max.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt macOS-arm64 release published after that date.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template isn't being applied.

Out of memory when raising the context

Unified memory is shared with the OS, and macOS caps the GPU-usable slice at roughly 70–75% of total (~34–36GB on a 48GB machine). Q8_0 weights (25.06GB) leave a few GB under that ceiling for the KV cache; a very long window can still exhaust it. If you OOM after raising -c, either lower the context length, quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running), or drop to a lighter quant (Q6_K/Q5_K_M/Q4_K_M) to free several GB. You can also raise the GPU-usable ceiling with sudo sysctl iogpu.wired_limit_mb=<value> — but leave headroom for the OS. Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.

torch / a Python ML stack not needed — this is llama.cpp

Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — the Metal GGUF path needs only the compiled llama-server. On Apple Silicon there is no CUDA toolkit; if cmake can't find Metal support, install Xcode command-line tools (xcode-select --install) and rebuild with -DGGML_METAL=ON.

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/m3-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Devstral Small 2 (24B) need?

About 48 GB — the minimum this recipe targets.

Which GPUs is Devstral Small 2 (24B) tested on?

Apple M3 Max (48 GB).

How hard is this setup?

Intermediate — follow the steps above.