self-hosted/ai
§01·recipe · llm

Devstral Small 2 (24B) on RX 7900 XTX: Local Agentic Coding via llama.cpp-HIP + OpenHands (24GB ROCm)

llmadvanced24GB+ VRAMJul 3, 2026

This advanced recipe sets up Devstral Small 2 (24B) on the RX 7900 XTX, needing about 24 GB of VRAM.

models
tools
prerequisites
  • AMD Radeon RX 7900 XTX (24GB VRAM, RDNA3 / Navi 31 / gfx1100) — a 24GB card matches the vendor's own RTX 4090 fit envelope; Q4_K_M is very comfortable and Q5_K_M / Q6_K also fit (see the quant ladder below)
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm v7 driver installed via `amdgpu-install` — ROCm is NOT bundled with Ollama or the llama.cpp binaries
  • llama.cpp built from source with the HIP/ROCm backend (`-DGGML_HIP=ON -DGPU_TARGETS=gfx1100`) from a tree recent enough to include **PR ggml-org/llama.cpp#17945** (the Mistral 3 attention fix) — see the critical note below; a pre-#17945 build produces garbled output
  • Python 3.10+ (for the OpenHands agent client)
  • ~15GB free disk for the Q4_K_M GGUF (up to ~20GB for Q6_K)

What You'll Build

A fully local, private agentic-coding setup: Devstral Small 2 (24B) — Mistral's dedicated agentic-coding model, and the first Mistral in this catalogue — served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) on a single 24GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100), driven by a coding agent (OpenHands as this catalogue's house choice, or Mistral's own Mistral Vibe CLI). Devstral is fine-tuned for terminal-based coding agents: it plans, runs shell commands, reads your repo, and edits files through native tool calls. The vendor names the 24GB RTX 4090 as a single-GPU target — "With its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM" (Devstral-Small-2-24B-Instruct-2512 model card) — and the RX 7900 XTX is the 24GB AMD equivalent, running the same GGUF quants on ROCm instead of CUDA.

Hardware data: RX 7900 XTX (24GB VRAM) · Devstral Small 2 (24B), GGUF Q4_K_M (14.33GB) or Q5_K_M (16.76GB) · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this coding LLM the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card — those are NVIDIA-only.

ℹ️ This is a coding LLM (with a vision tower), not a chat generalist. Devstral Small 2 is Mistral's agentic-coding model, fine-tuned from Mistral-Small-3.1-24B-Base. It is a dense 24B transformer (32 query / 8 KV heads GQA, hidden size 5120, 40 layers) — not a Mixture-of-Experts, so its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The checkpoint is a Mistral3ForConditionalGeneration with a pixtral vision tower, so it can also analyze images and provide insights based on visual content, in addition to text (per the card) — it is not text-only — but it is positioned and used here as a coding model. Vendor coding evals (README table): SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified.

⚠️ CRITICAL — your HIP build must include PR #17945. There is no first-party GGUF for this 2512 release; you use the community GGUFs the official README itself links (bartowski or unsloth). The README is explicit that these need llama.cpp changes from PR ggml-org/llama.cpp#17945 to run correctly — that PR ("models : fix the attn_factor for mistral3 graphs + improve consistency", merged 2025-12-12) fixes the RoPE/YaRN attention factor for Mistral 3 graphs, which Devstral 2 depends on. This compounds with the AMD build step below: the ROCm/HIP binary you compile must come from a source tree newer than that merge. If the model loads but produces garbled or degraded output, a pre-#17945 build is the likely cause. Ollama bundles its own llama.cpp and may lag — use it only once its bundled engine includes #17945; until then prefer an up-to-date, HIP-built llama-server.

ℹ️ Apache-2.0 — commercial use is allowed. Devstral Small 2 ships under Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Requirements

ComponentMinimumTested
GPU24GB VRAM (this starter's floor)RX 7900 XTX (24GB, RDNA3 Navi 31, gfx1100)
RAM16GB system RAM32GB comfortable (agent + repo + OS)
Storage~15GB (Q4_K_M) up to ~20GB (Q6_K)~15GB for Q4_K_M
DriverAMD ROCm v7 (installed via amdgpu-install) on Linux
Softwarellama.cpp (HIP build) incl. PR #17945, or Ollama once it ships it; OpenHands or Mistral Vibe clientllama-server (HIP), OpenHands

Model weights (community GGUF — the README-linked bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF, byte-verified sizes):

QuantOn-disk sizeFit on RX 7900 XTX (24GB)
Q4_K_M14.33GBFits comfortably — very comfortable; leaves ~9GB for a large KV cache / context
Q5_K_M16.76GBRecommended — comfortable; leaves ~7GB for context; a fidelity bump over Q4_K_M
Q6_K19.35GBFits with modest context — near-lossless weights, but only ~4GB left for the KV cache
Q8_025.06GBDoes not fit 24GB — exceeds the RX 7900 XTX's VRAM; needs a 32GB+ card
bf1647.15GBDoes not fit 24GB — datacenter-only

The bartowski/...-imatrix.gguf (~10 MB) is calibration data, not a model — never load it as a quant. unsloth/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF is the other README-linked source if you prefer it. On a 24GB card the extra headroom over the vendor's floor means Q5_K_M (16.76GB) is the sweet spot — a fidelity bump over Q4_K_M while still leaving ~7GB for context — and Q6_K (19.35GB) is available if you accept a smaller KV cache. Q8_0 (25.06GB) and bf16 (47.15GB) do not fit.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).

1. Build llama.cpp with the HIP/ROCm backend and PR #17945

You have two GGUF runtimes; the safe path for this release is a current, HIP-built llama.cpp (this section) because of the PR #17945 requirement above. This compounds with the AMD build step: the ROCm/HIP binary must come from a source tree new enough to include PR #17945 (the Mistral 3 attention fix, merged 2025-12-12) — a stock or older ROCm llama.cpp will load the model but emit garbled output. Since you are building from source with HIP anyway, this is free: clone the current master (which already contains #17945) so the fix is present. Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Pull the current master — it already includes PR #17945 (Mistral 3 attn fix, merged 2025-12-12)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX"). The GGUF quants are integer formats, so the absence of FP8 tensor hardware on RDNA3 is irrelevant here — Devstral's Q4_K_M/Q5_K_M runs on standard HIP kernels. Building from a recent master is what guarantees the #17945 fix is present; confirm your checkout is newer than 2025-12-12 (git log on master).

2. (Alternative) Ollama — only once its bundled llama.cpp includes #17945

Ollama lists Devstral Small 2 and is built on llama.cpp; it uses the ROCm runtime you installed above and runs the gfx1100 card natively. It is the fastest to stand up, but it bundles its own llama.cpp — use it only after that bundled engine includes PR #17945. If output looks broken, that engine lag is the first thing to check; fall back to an up-to-date, HIP-built llama-server meanwhile.

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host. Devstral's README also lists Mistral Vibe (Mistral's own first-party CLI), plus Cline, Kilo Code, SWE-agent, and Claude Code as compatible clients — all connect the same way.

Running

1. Serve Devstral with llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face and caches it locally; append :Q5_K_M (case-insensitive) to pick the quant (llama-server docs):

# Q5_K_M (recommended for a 24GB card), offload all layers to the GPU, large context
./build/bin/llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q5_K_M \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (16.76GB at Q5_K_M) must sit in VRAM.
  • -c 65536 sets a 64K context. Q5_K_M leaves ~7GB of the 24GB for the KV cache after the weights, comfortably enough for a large coding-session window; raise or lower -c while watching rocm-smi (see Troubleshooting).
  • --jinja applies the GGUF's built-in chat template so reasoning/tool-call blocks parse (see the tokenizer note in Troubleshooting).

Prefer more context headroom? Swap :Q5_K_M for :Q4_K_M (14.33GB) — it leaves ~9GB for the KV cache instead of ~7GB. Prefer near-lossless weights? :Q6_K (19.35GB) fits with only modest context (~4GB left).

Push toward the vendor's 256K context. Devstral advertises a 256K context window (the vendor figure; the base config's max_position_embeddings is larger via YaRN, but 256K is what Mistral states). You cannot hold the full 256K KV cache and the weights on 24GB at f16 — to reach much longer windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact (llama-server docs):

# Longer context on Q5_K_M by 8-bit-quantizing the KV cache
./build/bin/llama-server -hf bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF:Q5_K_M \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

Note: Flash Attention on ROCm/RDNA3 is less mature than on CUDA — it works on this card, but test it on your build rather than assuming parity with NVIDIA. If a quantized KV cache misbehaves, drop back to the plain -c 65536 launch above and lower the context instead.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. Ollama is a valid alternative — it too needs a build with PR #17945 (see Installation), uses the ROCm runtime you installed above, and applies its own template, so verify tool-calling works before relying on it for agent edits.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/mistralai/Devstral-Small-2-24B-Instruct-2512"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use Devstral to plan, run shell commands, and edit files in your workspace via native tool calls. For Mistral's own first-party CLI instead, install and launch Mistral Vibe (uv tool install mistral-vibe — or pip install mistral-vibe — then vibe) and point it at the same local endpoint.

Results

  • VRAM usage: The dense 24B loads entirely as its GGUF file — Q4_K_M is 14.33GB, Q5_K_M is 16.76GB, and Q6_K is 19.35GB on disk (byte-verified from the bartowski GGUF tree). On the RX 7900 XTX's 24GB, Q5_K_M (the recommended quant) leaves roughly ~7GB for the KV cache — enough for a large coding-session context — and Q4_K_M leaves ~9GB. Q6_K fits with only modest context; Q8_0 (25.06GB) and bf16 (47.15GB) do not fit 24GB.
  • Model capability: The vendor's README reports SWE-bench Verified 68.0%, SWE-bench Multilingual 55.7%, and Terminal-Bench 2 22.5% — a 24B matching much larger models on SWE-bench Verified. These are Mistral's own agentic-coding evals, not hardware throughput on this GPU.
  • Speed: No local throughput benchmark for Devstral Small 2 on the RX 7900 XTX exists yet — this is a new model and /check/devstral-small-24b/rx-7900-xtx has no benchmark rows. We would rather omit a tok/s figure than invent one or borrow one from different hardware; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/devstral-small-24b/rx-7900-xtx.

Troubleshooting

Output is garbled, degraded, or the model won't load correctly

This is the PR #17945 trap. The 2512 release has no first-party GGUF; the community GGUFs need llama.cpp changes from PR ggml-org/llama.cpp#17945 (the Mistral 3 attention-factor fix, merged 2025-12-12) to run correctly. On AMD this is the single most common failure, because a stock or distro-packaged ROCm llama.cpp is often behind. If you built your HIP binary before that merge — or you're on an Ollama whose bundled engine predates it — rebuild from a recent git pull of llama.cpp (keep -DGGML_HIP=ON -DGPU_TARGETS=gfx1100) and confirm your checkout is newer than 2025-12-12 (git log on master). Note the tell: Devstral is a standard Mistral 3 architecture that llama.cpp already supports, so the model loads fine — the symptom is garbled or degraded output, not a load/arch failure.

Tool calls come back as raw text / the agent can't call tools

Devstral uses Mistral's own tokenizer and tool-call format — the Mistral Common tokenizer (tekken.json / tekken), which needs mistral-common >= 1.8.6 on the Python serving paths, not the generic ChatML/HF path. On the llama.cpp path, pass --jinja so the GGUF's built-in chat template is applied — a correctly-templated server surfaces tool calls as OpenAI-style tool_calls. If your client shows raw tool-call text, the template isn't being applied. (The mistral-common requirement bites the Python/vLLM serving paths; on the GGUF + llama.cpp path here, --jinja is what matters.)

Out of memory when raising the context

Q5_K_M weights (16.76GB) leave ~7GB for the KV cache; a very long window can still exhaust it. If you OOM after raising -c, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to reach toward the vendor's 256K window — but note Flash Attention on ROCm/RDNA3 is less mature than on CUDA, so test it. Dropping from Q5_K_M to Q4_K_M also frees ~2GB for context. Devstral is a coding agent — a long agent session with a large repo in context grows the KV cache mid-task, so size for the peak, not idle.

Ollama or llama.cpp runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON and that -ngl 99 is offloading layers. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card. (These are Llama-7B figures cited only to show the ROCm-vs-Vulkan gap on this GPU, not Devstral numbers. The Vulkan backend must also be built from a recent tree so it includes the PR #17945 Mistral 3 fix.)

Model or GPU 404 on /check

Devstral Small 2 (24B) is a new addition; if the /check/devstral-small-24b/rx-7900-xtx link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Devstral Small 2 (24B) need?

About 24 GB — the minimum this recipe targets.

Which GPUs is Devstral Small 2 (24B) tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Advanced — follow the steps above.