How much VRAM does Devstral Small 2 (24B) need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Devstral Small 2 (24B) on RTX 4090: Local Agentic Coding via llama.cpp + OpenHands (24GB, the Vendor's Named Target)

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp on a single 24GB RTX 4090, driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The vendor names this exact GPU as a target — "With its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM" (Devstral-Small-2-24B-Instruct-2512 model card).

Hardware data: RTX 4090 (24GB VRAM) · Devstral Small 2 (24B), GGUF Q4_K_M (14.33GB) or Q5_K_M (16.76GB) · See benchmark data

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date llama-server for now.

Requirements

Component	Minimum	Tested target
GPU	24GB VRAM (this starter's floor)	RTX 4090 (24GB, Ada Lovelace AD102, sm_89)
RAM	16GB system RAM	32GB comfortable (agent + repo + OS)
Storage	~15GB (Q4_K_M) up to ~17GB (Q5_K_M)	~15GB for Q4_K_M
Software	llama.cpp incl. PR #17945 (CUDA) or Ollama/LM Studio once they ship it; OpenHands or Mistral Vibe client	`llama-server`, OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

Quant	On-disk size	Fit on RTX 4090 (24GB)
Q4_K_M	14.33GB	Recommended — very comfortable; leaves ~9GB for a large KV cache / context
Q5_K_M	16.76GB	Comfortable — leaves ~7GB for context; a small fidelity bump over Q4_K_M
Q6_K	19.35GB	Fits with modest context — near-lossless weights, but only ~4GB left for the KV cache
Q8_0	25.06GB	Does not fit 24GB — exceeds the RTX 4090's VRAM; needs a 32GB+ card
bf16	47.15GB	Does not fit 24GB — datacenter-only

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.

Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build (Option A) because of the PR #17945 requirement above.

Option A — llama.cpp with CUDA (recommended for this release)

The RTX 4090 is Ada Lovelace (AD102, sm_89). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, then compile for sm_89, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# RTX 4090 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8

If you use a prebuilt llama.cpp release instead, pick one published after 2025-12-12 from the releases page so it contains the fix. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); the NVIDIA CUDA toolkit must be installed first.

Option B — Ollama / LM Studio (only once they ship #17945)

Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q4_K_M (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):

# Q4_K_M (recommended), offload all layers to the 4090, large context
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (14.33GB at Q4_K_M) must sit in VRAM.
-c 65536 sets a 64K context. Q4_K_M leaves ~9GB of the 24GB for the KV cache after the weights, comfortably enough for a large coding-session window; raise or lower -c while watching nvidia-smi.
--jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse.

Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). You cannot hold the full 256K KV cache and the weights on 24GB at f16 — to reach much longer windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact (llama-server docs):

# Longer context on Q4_K_M by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

For a small fidelity bump at the cost of context headroom, swap :Q4_K_M for :Q5_K_M (16.76GB) — it still fits 24GB but leaves ~7GB for the KV cache instead of ~9GB.

With Ollama

Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_M

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.

OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:

uv tool install mistral-vibe   # or: pip install mistral-vibe
vibe

The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.

If you serve with vLLM instead (multi-GPU / large-VRAM path)

vLLM is the vendor-recommended reliable server and the cleanest path for Mistral's tokenizer and tool-call parsing — but it runs the model unquantized, so a single 24GB 4090 is not enough (bf16 weights are ~47GB). The vendor's own example is a two-GPU invocation, shown here for completeness only; on a single 4090 stay on the GGUF + llama.cpp path above:

uv pip install -U vllm
pip install "mistral_common>=1.8.6"

vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --max-model-len 262144 --tensor-parallel-size 2 \
    --tool-call-parser mistral --enable-auto-tool-choice

Results

VRAM usage: The dense 24B loads entirely as its GGUF file — Q4_K_M is 14.33GB and Q5_K_M is 16.76GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 4090's 24GB, Q4_K_M is very comfortable — roughly ~9GB left over for the KV cache, enough for a large coding-session context at f16 without quantization. Q6_K (19.35GB) fits with only modest context; Q8_0 (25.06GB) and bf16 (47.15GB) do not fit 24GB.
Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
Speed: No local throughput benchmark for Devstral Small 2 on the RTX 4090 exists yet — this is a new model and /check/devstral-small-24b/rtx-4090 has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/rtx-4090.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt release published after that date.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the vLLM path this means passing --tool-call-parser mistral --enable-auto-tool-choice (as in the vendor example above). On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template/parser isn't being applied.

Out of memory when raising the context

Q4_K_M weights (14.33GB) leave ~9GB for the KV cache; a very long window can still exhaust it. If you OOM after raising -c, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to reach toward the vendor's 256K window. Dropping from Q5_K_M to Q4_K_M also frees ~2GB for context. Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.

`torch` / CUDA not needed — this is llama.cpp

Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang paths on the card, which target large-VRAM or multi-GPU rigs (the vendor's vllm serve example uses --tensor-parallel-size 2). On a single RTX 4090 the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed the CUDA-enabled llama.cpp build (Option A) rather than a CPU-only binary.

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/rtx-4090 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.