What You'll Build
A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp on a single 32GB RTX 5090, driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The RTX 5090 is the quality tier: its 32GB is the first consumer card that fits the near-lossless Q8_0 quant (25.06GB) with ~7GB to spare for the KV cache — a real fidelity step over the 24GB tier's Q6_K, and the reason to reach for 32GB.
Hardware data: RTX 5090 (32GB VRAM) · Devstral Small 2 (24B), GGUF Q8_0 (25.06GB) · See benchmark data
ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The checkpoint is a
Mistral3ForConditionalGenerationwith a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.
⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date
llama-serverfor now.
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 32GB VRAM (for Q8_0) | RTX 5090 (32GB, Blackwell GB202, sm_120) |
| RAM | 16GB system RAM | 32GB comfortable (agent + repo + OS) |
| Storage | ~25GB (Q8_0), ~15GB (Q4_K_M) | ~25GB for Q8_0 |
| Software | llama.cpp incl. PR #17945 (CUDA) or Ollama/LM Studio once they ship it; OpenHands or Mistral Vibe client | llama-server, OpenHands |
Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):
| Quant | On-disk size | Fit on RTX 5090 (32GB) |
|---|---|---|
| Q4_K_M | 14.33GB | Fits with huge headroom — leaves ~17GB for a very large KV cache |
| Q5_K_M | 16.76GB | Fits comfortably — leaves ~15GB for context |
| Q6_K | 19.35GB | Fits comfortably — near-lossless weights, ~12GB left for the KV cache |
| Q8_0 | 25.06GB | Recommended — near-lossless; leaves ~7GB for the KV cache. The reason to have 32GB |
| bf16 | 47.15GB | Does not fit 32GB — needs multi-GPU / datacenter |
The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.
ℹ️ 32GB unlocks the near-lossless Q8_0 — the top of the quant ladder. Because Devstral is dense (one full quant file, not an MoE with a fixed active slice), a bigger card lets you load a bigger quant, not just more context. The 24GB tier tops out at Q6_K; the RTX 5090's 32GB is the first consumer card that fits Q8_0 (25.06GB) — near-lossless weights, a genuine fidelity step over Q6_K — with ~7GB still left for the KV cache. Full bf16 (47.15GB) still does not fit 32GB; that needs multi-GPU or a datacenter card. Q8_0 is the practical ceiling here, and the reason to have 32GB.
Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build (Option A) because of the PR #17945 requirement above.
Option A — llama.cpp with CUDA (recommended for this release)
The RTX 5090 is Blackwell (GB202, sm_120). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, then compile for sm_120, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# RTX 5090 is Blackwell = compute capability 12.0 (sm_120); needs CUDA 12.8+
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j 8
If you use a prebuilt llama.cpp release instead, pick one published after 2025-12-12 from the releases page so it contains the fix. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); Blackwell (sm_120) needs the CUDA 12.8+ toolkit installed first.
Option B — Ollama / LM Studio (only once they ship #17945)
Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):
# Q8_0 (recommended on 32GB — near-lossless), offload all layers to the 5090, large context
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
--port 8000 \
-ngl 99 \
-c 49152 \
--jinja
-ngl 99(--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (25.06GB at Q8_0) must sit in VRAM.-c 49152sets a 48K context. Q8_0 leaves ~7GB of the 32GB for the KV cache after the weights; for a larger window at the same near-lossless quality, quantize the KV cache (below), or step down to Q6_K (19.35GB) to free ~5GB more. Watchnvidia-smiand adjust-c.--jinjaapplies the GGUF's built-in chat template so reasoning/tool-call blocks parse.
Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). You cannot hold the full 256K KV cache and the Q8_0 weights on 32GB at f16 — to reach much longer windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact (llama-server docs):
# Longer context on Q8_0 by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
--port 8000 -ngl 99 -c 98304 --jinja \
-fa on -ctk q8_0 -ctv q8_0
Want the longest possible window instead of maximum fidelity? Step down the quant: :Q6_K (19.35GB) leaves ~12GB for the KV cache, :Q5_K_M (16.76GB) ~15GB, :Q4_K_M (14.33GB) ~17GB. Full bf16 (47.15GB) does not fit 32GB — that's a multi-GPU / datacenter path.
With Ollama
Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.
Connect a coding agent
Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.
OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:
pip install openhands-ai
# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY" # any non-empty string; local servers don't check it
openhands
Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:
uv tool install mistral-vibe # or: pip install mistral-vibe
vibe
The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.
If you serve with vLLM instead (multi-GPU / large-VRAM path)
vLLM is the vendor-recommended reliable server and the cleanest path for Mistral's tokenizer and tool-call parsing — but it runs the model unquantized, so even a single 32GB 5090 is not enough (bf16 weights are ~47GB). The vendor's own example is a two-GPU invocation, shown here for completeness only; on a single 5090 stay on the GGUF + llama.cpp path above (Q8_0 gives you near-lossless quality anyway):
uv pip install -U vllm
pip install "mistral_common>=1.8.6"
vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
--max-model-len 262144 --tensor-parallel-size 2 \
--tool-call-parser mistral --enable-auto-tool-choice
Results
- VRAM usage: The dense 24B loads entirely as its GGUF file — Q8_0 is 25.06GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 5090's 32GB, Q8_0 is the recommended near-lossless choice — roughly ~7GB left for the KV cache; step down to Q6_K (19.35GB, ~12GB free), Q5_K_M (16.76GB, ~15GB free), or Q4_K_M (14.33GB, ~17GB free) for a larger window. bf16 (47.15GB) does not fit 32GB — that's a multi-GPU / datacenter path.
- Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
- Speed: No local throughput benchmark for Devstral Small 2 on the RTX 5090 exists yet — this is a new model and
/check/devstral-small-24b/rtx-5090has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.
For the full benchmark data, see /check/devstral-small-24b/rtx-5090.
Troubleshooting
Output is garbled, degraded, or the model won't load correctly
This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt release published after that date. On Blackwell (sm_120) also confirm you built against CUDA 12.8+; an older toolkit can fail to compile the sm_120 kernels.
Tool calls come back as raw text / the agent can't call tools
Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the vLLM path this means passing --tool-call-parser mistral --enable-auto-tool-choice (as in the vendor example above). On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template/parser isn't being applied.
Out of memory when raising the context
Q8_0 weights (25.06GB) leave ~7GB for the KV cache on 32GB; a very long window can still exhaust it. If you OOM after raising -c, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to reach toward the vendor's 256K window. Stepping down from Q8_0 to Q6_K frees ~5GB more for context (at a small fidelity cost). Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.
torch / CUDA not needed — this is llama.cpp
Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang paths on the card, which target large-VRAM or multi-GPU rigs (the vendor's vllm serve example uses --tensor-parallel-size 2). On a single RTX 5090 the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed the CUDA-enabled llama.cpp build (Option A, CUDA 12.8+ for Blackwell) rather than a CPU-only binary.
Model or GPU 404 on /check
Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/rtx-5090 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.