How much VRAM does Devstral Small 2 (24B) need?

About 16 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Devstral Small 2 (24B) on RTX 4080: Local Agentic Coding via llama.cpp + OpenHands (16GB Entry Tier)

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp on a single 16GB RTX 4080, driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The RTX 4080's 16GB is the model's entry tier — only the Q4_K_M quant fits, and context is tight — so this card is about getting the 24B running well within a bounded window; step up to a 24GB card for the larger Q5/Q6 quants and room to grow.

Hardware data: RTX 4080 (16GB VRAM) · Devstral Small 2 (24B), GGUF Q4_K_M (14.33GB) · See benchmark data

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date llama-server for now.

Requirements

Component	Minimum	Tested target
GPU	16GB VRAM (this model's floor)	RTX 4080 (16GB, Ada Lovelace AD103, sm_89)
RAM	16GB system RAM	32GB comfortable (agent + repo + OS)
Storage	~15GB (Q4_K_M)	~15GB for Q4_K_M
Software	llama.cpp incl. PR #17945 (CUDA) or Ollama/LM Studio once they ship it; OpenHands or Mistral Vibe client	`llama-server`, OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

Quant	On-disk size	Fit on RTX 4080 (16GB)
Q4_K_M	14.33GB	Recommended — the only quant that fits — leaves ~1.5–2GB for the KV cache, so context is tight; bound it (see Running)
Q5_K_M	16.76GB	Does not fit 16GB — exceeds the RTX 4080's VRAM; needs a 24GB+ card
Q6_K	19.35GB	Does not fit 16GB — needs a 24GB+ card
Q8_0	25.06GB	Does not fit 16GB — needs a 32GB+ card
bf16	47.15GB	Does not fit 16GB — datacenter-only

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.

ℹ️ Only Q4_K_M fits 16GB — a dense-model tradeoff. Because Devstral is dense (one full quant file, not an MoE with a fixed active slice), the whole weight file must sit in VRAM. On 16GB, Q4_K_M (14.33GB) is the only quant that leaves room to run; Q5_K_M (16.76GB) already exceeds the card. That is the story of this tier: 16GB is workable but context-constrained. A 24GB card (e.g. RTX 3090/4090) fits Q5_K_M and Q6_K for a fidelity bump plus a larger KV cache; a 32GB card (RTX 5090) reaches the near-lossless Q8_0.

Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build (Option A) because of the PR #17945 requirement above.

Option A — llama.cpp with CUDA (recommended for this release)

The RTX 4080 is Ada Lovelace (AD103, sm_89). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, then compile for sm_89, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# RTX 4080 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8

If you use a prebuilt llama.cpp release instead, pick one published after 2025-12-12 from the releases page so it contains the fix. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); the NVIDIA CUDA toolkit must be installed first.

Option B — Ollama / LM Studio (only once they ship #17945)

Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q4_K_M (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):

# Q4_K_M (the only fit on 16GB), offload all layers, BOUNDED context
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M \
    --port 8000 \
    -ngl 99 \
    -c 8192 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (14.33GB at Q4_K_M) must sit in VRAM.
-c 8192 starts with a small context. On 16GB the Q4_K_M weights leave only ~1.5–2GB for the KV cache, so context is tight — begin at -c 8192 (or -c 16384) and raise it only while watching nvidia-smi for headroom.
--jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse.

Stretch the context with a quantized KV cache. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). On 16GB you cannot get near that — but you can roughly double what the ~1.5–2GB KV budget holds by quantizing the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact (llama-server docs):

# More context on 16GB by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M \
    --port 8000 -ngl 99 -c 16384 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

On 16GB, stay on Q4_K_M — there is no headroom to move up a quant here. For Q5_K_M or Q6_K you need a 24GB card; for the near-lossless Q8_0, a 32GB card.

With Ollama

Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.

OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:

uv tool install mistral-vibe   # or: pip install mistral-vibe
vibe

The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.

If you serve with vLLM instead (multi-GPU / large-VRAM path)

vLLM is the vendor-recommended reliable server and the cleanest path for Mistral's tokenizer and tool-call parsing — but it runs the model unquantized, so a single 16GB 4080 is nowhere near enough (bf16 weights are ~47GB). The vendor's own example is a two-GPU invocation, shown here for completeness only; on a single 4080 stay on the GGUF + llama.cpp path above:

uv pip install -U vllm
pip install "mistral_common>=1.8.6"

vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --max-model-len 262144 --tensor-parallel-size 2 \
    --tool-call-parser mistral --enable-auto-tool-choice

Results

VRAM usage: The dense 24B loads entirely as its GGUF file — Q4_K_M is 14.33GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 4080's 16GB, Q4_K_M is the only quant that fits — roughly ~1.5–2GB left for the KV cache, so context is tight and should be bounded (start -c 8192/16384, quantize the KV cache to stretch it). Q5_K_M (16.76GB), Q6_K (19.35GB), Q8_0 (25.06GB), and bf16 (47.15GB) all do not fit 16GB.
Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
Speed: No local throughput benchmark for Devstral Small 2 on the RTX 4080 exists yet — this is a new model and /check/devstral-small-24b/rtx-4080 has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/rtx-4080.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt release published after that date.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the vLLM path this means passing --tool-call-parser mistral --enable-auto-tool-choice (as in the vendor example above). On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template/parser isn't being applied.

Out of memory — the tight-16GB trap

On 16GB, Q4_K_M weights (14.33GB) leave only ~1.5–2GB for the KV cache — the smallest headroom of any tier for this model. If you OOM, first lower -c (the context length), then quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to roughly double what that budget holds. There is no lower quant to fall back to here — Q4_K_M is already the floor. Devstral is a coding agent, and a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle; if you routinely need a big window, a 24GB card is the real fix.

`torch` / CUDA not needed — this is llama.cpp

Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang paths on the card, which target large-VRAM or multi-GPU rigs (the vendor's vllm serve example uses --tensor-parallel-size 2). On a single RTX 4080 the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed the CUDA-enabled llama.cpp build (Option A) rather than a CPU-only binary.

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/rtx-4080 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.