How much VRAM does Devstral Small 2 (24B) need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Devstral Small 2 (24B) on RTX 3090 Ti: Local Agentic Coding via llama.cpp + OpenHands (24GB, Faster Ampere)

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp on a single 24GB RTX 3090 Ti, driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The RTX 3090 Ti is the top of the Ampere 24GB tier: the same VRAM envelope and quant fit as the RTX 3090 — Q4_K_M very comfortable, and Q5_K_M and Q6_K also fit — with modestly higher clocks and memory bandwidth, so the same quant runs a little faster.

Hardware data: RTX 3090 Ti (24GB VRAM) · Devstral Small 2 (24B), GGUF Q5_K_M (16.76GB) or Q6_K (19.35GB) · See benchmark data

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — you need a recent llama.cpp (PR #17945). There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. Use a llama.cpp build newer than that merge. Wrappers such as Ollama and LM Studio bundle their own llama.cpp and may lag until they ship a build that includes #17945; if the model loads but produces garbled or degraded output on those, that lag is the likely cause — prefer an up-to-date llama-server for now.

Requirements

Component	Minimum	Tested target
GPU	24GB VRAM	RTX 3090 Ti (24GB, Ampere GA102, sm_86)
RAM	16GB system RAM	32GB comfortable (agent + repo + OS)
Storage	~15GB (Q4_K_M) up to ~20GB (Q6_K)	~17GB for Q5_K_M
Software	llama.cpp incl. PR #17945 (CUDA) or Ollama/LM Studio once they ship it; OpenHands or Mistral Vibe client	`llama-server`, OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

Quant	On-disk size	Fit on RTX 3090 Ti (24GB)
Q4_K_M	14.33GB	Very comfortable — leaves ~9GB for a large KV cache / context
Q5_K_M	16.76GB	Recommended — leaves ~7GB for context; a small fidelity bump over Q4_K_M
Q6_K	19.35GB	Fits with modest context — near-lossless weights, but only ~4GB left for the KV cache
Q8_0	25.06GB	Does not fit 24GB — exceeds the RTX 3090 Ti's VRAM; needs a 32GB+ card
bf16	47.15GB	Does not fit 24GB — datacenter-only

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it.

ℹ️ A dense model with a full quant ladder — 24GB fits Q4 through Q6. Because Devstral is dense (one full quant file, not an MoE with a fixed active slice), you pick the best quant that fits your VRAM. On 24GB you have real choice: Q4_K_M for maximum context, Q5_K_M as the balanced default, or Q6_K for near-lossless weights with a modest window. Q8_0 (25.06GB) does not fit 24GB — a 32GB card (RTX 5090) is what unlocks the near-lossless Q8_0.

ℹ️ Ampere has no FP8 tensor cores. The RTX 3090 Ti (GA102) predates the FP8 path that Ada/Hopper/Blackwell add — do not expect an FP8 tensor-core speedup here. This recipe uses GGUF, whose K-quants are integer formats that run on the standard CUDA/tensor path on Ampere; there is no special FP8 route to enable or miss. Versus the plain RTX 3090, the Ti's edge is its higher clocks and memory bandwidth on the same GA102 silicon and same 24GB — a modest throughput bump on identical quants, not a new capability. The fit table above is unchanged.

Licensing. Devstral Small 2 is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. For this release, the safe path is a current llama.cpp build (Option A) because of the PR #17945 requirement above.

Option A — llama.cpp with CUDA (recommended for this release)

The RTX 3090 Ti is Ampere (GA102, sm_86). Build a recent llama.cpp (one whose master is after the 2025-12-12 merge of PR #17945) so the Mistral 3 attention-factor fix is present, then compile for sm_86, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Confirm your checkout includes PR #17945 (merged 2025-12-12) — pull latest master.
# RTX 3090 Ti is Ampere = compute capability 8.6 (sm_86)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j 8

If you use a prebuilt llama.cpp release instead, pick one published after 2025-12-12 from the releases page so it contains the fix. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); the NVIDIA CUDA toolkit must be installed first.

Option B — Ollama / LM Studio (only once they ship #17945)

Ollama and LM Studio both list Devstral Small 2 and are built on llama.cpp. They are the fastest to stand up, but each bundles its own llama.cpp — use them only after their bundled engine includes PR #17945. If output looks broken on either, that engine lag is the first thing to check; fall back to an up-to-date llama-server (Option A) meanwhile.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q5_K_M (case-insensitive) to pick the quant — without a tag, llama-server defaults to Q4_K_M (llama-server docs):

# Q5_K_M (recommended on 24GB), offload all layers to the 3090 Ti, large context
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q5_K_M \
    --port 8000 \
    -ngl 99 \
    -c 49152 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (16.76GB at Q5_K_M) must sit in VRAM.
-c 49152 sets a 48K context. Q5_K_M leaves ~7GB of the 24GB for the KV cache after the weights; drop to :Q4_K_M (14.33GB) for ~9GB and a larger window, or watch nvidia-smi and adjust -c.
--jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse.

Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). You cannot hold the full 256K KV cache and the weights on 24GB at f16 — to reach much longer windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact (llama-server docs):

# Longer context on Q5_K_M by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q5_K_M \
    --port 8000 -ngl 99 -c 98304 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

Want maximum context instead of maximum fidelity? Swap :Q5_K_M for :Q4_K_M (14.33GB) — it leaves ~9GB for the KV cache. Want near-lossless weights? :Q6_K (19.35GB) fits with a modest window (~4GB left for KV). Q8_0 (25.06GB) needs a 32GB card.

With Ollama

Only after Ollama's bundled llama.cpp includes PR #17945 (see Installation), pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q5_K_M

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Point any OpenAI-compatible coding client at your local endpoint by setting its base URL and a dummy API key.

OpenHands (this catalogue's house choice). The README lists OpenHands among Devstral's supported agent clients. Point it at your local server:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

Mistral Vibe (Mistral's own first-party CLI). The README recommends its own agentic CLI for this model. Install and launch it, then point it at your local endpoint:

uv tool install mistral-vibe   # or: pip install mistral-vibe
vibe

The README also lists Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way, via the OpenAI-compatible base URL. Devstral's tool-call format is Mistral-specific (see the tokenizer note in Troubleshooting), so the --jinja/built-in-template path above is what makes tool calls parse in llama.cpp.

If you serve with vLLM instead (multi-GPU / large-VRAM path)

vLLM is the vendor-recommended reliable server and the cleanest path for Mistral's tokenizer and tool-call parsing — but it runs the model unquantized, so a single 24GB 3090 Ti is not enough (bf16 weights are ~47GB). The vendor's own example is a two-GPU invocation, shown here for completeness only; on a single 3090 Ti stay on the GGUF + llama.cpp path above:

uv pip install -U vllm
pip install "mistral_common>=1.8.6"

vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --max-model-len 262144 --tensor-parallel-size 2 \
    --tool-call-parser mistral --enable-auto-tool-choice

Results

VRAM usage: The dense 24B loads entirely as its GGUF file — Q4_K_M is 14.33GB, Q5_K_M is 16.76GB, and Q6_K is 19.35GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 3090 Ti's 24GB, Q5_K_M is the recommended balance — roughly ~7GB left for the KV cache; Q4_K_M frees ~9GB for a larger window, and Q6_K fits with only modest context (~4GB left). Q8_0 (25.06GB) and bf16 (47.15GB) do not fit 24GB.
Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
Speed: No local throughput benchmark for Devstral Small 2 on the RTX 3090 Ti exists yet — this is a new model and /check/devstral-small-24b/rtx-3090-ti has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/rtx-3090-ti.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. If you built or downloaded llama.cpp before that merge — or you're on an Ollama/LM Studio whose bundled engine predates it — pull/update to a build that includes it. Confirm your llama.cpp checkout is newer than 2025-12-12 (git log on master), or use a prebuilt release published after that date.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the vLLM path this means passing --tool-call-parser mistral --enable-auto-tool-choice (as in the vendor example above). On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template/parser isn't being applied.

Out of memory when raising the context

Q5_K_M weights (16.76GB) leave ~7GB for the KV cache; a very long window can still exhaust it. If you OOM after raising -c, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to reach toward the vendor's 256K window. Dropping from Q5_K_M to Q4_K_M also frees ~2GB for context (or from Q6_K to Q5_K_M). Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.

`torch` / CUDA not needed — this is llama.cpp

Serving Devstral via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang paths on the card, which target large-VRAM or multi-GPU rigs (the vendor's vllm serve example uses --tensor-parallel-size 2). On a single RTX 3090 Ti the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed the CUDA-enabled llama.cpp build (Option A) rather than a CPU-only binary.

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/rtx-3090-ti link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.