Mistral Small 3.2 24B on RX 7900 XTX: Local Private Assistant via llama.cpp-HIP / Ollama (24GB ROCm)

llm · intermediate · 24GB+ VRAM · Jul 3, 2026

This intermediate recipe sets up Mistral Small 3.2 24B on the RX 7900 XTX. The recommended Q6_K quant is 19.35 GB, so the card's 24 GB of VRAM holds the weights plus a modest KV cache.

prerequisites
  • AMD Radeon RX 7900 XTX (24GB VRAM, RDNA3 / Navi 31 / gfx1100) — a 24GB card matches the vendor's own 24GB fit envelope; Q4_K_M is very comfortable and Q5_K_M / Q6_K also fit (see the quant ladder below)
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm v7 driver installed via `amdgpu-install` — ROCm is NOT bundled with Ollama or the llama.cpp binaries
  • llama.cpp built from source with the HIP/ROCm backend (`-DGGML_HIP=ON -DGPU_TARGETS=gfx1100`), or Ollama with ROCm — a recent build is all you need; this June-2025 model has NO special patch requirement
  • 16GB+ system RAM (32GB comfortable)
  • ~15-20GB free disk for the GGUF (Q4_K_M ~14GB up to Q6_K ~19GB)
  • Optional: Open WebUI (or any OpenAI-compatible chat client) for a local chat front-end. Image input adds ~0.9GB of VRAM for the mmproj projector; mistral-common >= 1.6.2 matters only on the Python serving paths, not for llama.cpp or Ollama

What You'll Build

A fully local, private general assistant: Mistral Small 3.2 24B — Mistral's newest generalist Small (release 2506, superseding 3.1 from 2503) — served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) or Ollama on a single 24GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100), then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a chat/reasoning/writing model, not a coding agent: general Q&A, drafting and editing, multi-step reasoning, 23-language multilingual support, and — because the checkpoint carries a Pixtral vision tower — optional image understanding (send it an image, it answers in text). Everything runs on your own hardware on ROCm instead of CUDA, so prompts and documents never leave the machine.

Hardware data: RX 7900 XTX (24GB VRAM) · Mistral Small 3.2 24B, GGUF Q6_K (19.35GB, recommended) — or Q4_K_M (14.33GB) / Q5_K_M (16.76GB) for more context headroom · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack, so there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this generalist LLM the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card: those steps are written for NVIDIA/CUDA setups and are not part of this recipe.

ℹ️ This is a dense 24B generalist, not a MoE and not text-only. Mistral Small 3.2 is a Mistral3ForConditionalGeneration (model_type: mistral3) — hidden size 5120, 40 layers, GQA with 32 query / 8 KV heads. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The Pixtral vision tower means it can analyze images in addition to text, but it is positioned and used here as a general assistant (vertical llm), not a coding agent. Context window is 128K (max_position_embeddings 131072). It uses Mistral's Tekken tokenizer (tekken.json), which needs mistral-common >= 1.6.2 on the Python serving paths.

ℹ️ Runs on a current llama.cpp/Ollama out of the box — no special patch. Unlike some later Mistral 3 releases, this June-2025 model needs no source-branch gate: bartowski quantized it with llama.cpp release b5697 (June 2025), and Mistral3/Pixtral text support has been mainline since mid-2025. On AMD you still build llama.cpp yourself for the HIP backend (below), but you build from a plain recent master — there is no patch to cherry-pick. Just use a recent HIP-built llama-server (or a recent Ollama). Pass --jinja so the chat template applies; if tool-calling misbehaves, additionally pass the bundled --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja.

ℹ️ Apache-2.0 — commercial use is allowed. Mistral Small 3.2 24B ships under Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Requirements

| Component | Minimum | Tested target |
| --- | --- | --- |
| GPU | 24GB VRAM (this starter's floor) | RX 7900 XTX (24GB, RDNA3 Navi 31, gfx1100) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~15GB (Q4_K_M) up to ~20GB (Q6_K) | ~19GB for Q6_K |
| Driver | AMD ROCm v7 (installed via amdgpu-install) on Linux | |
| Software | llama.cpp (HIP build) or Ollama with ROCm; optional Open WebUI chat client | llama-server (HIP), Open WebUI |

Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Small-3.2-24B-Instruct-2506); the model is quantized to GGUF by the community. Primary source is bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF; unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF is a good alternative that also ships UD-*_XL "dynamic" quants. Byte-verified on-disk sizes (bartowski):

| Quant | On-disk size | Fit on RX 7900 XTX (24GB) |
| --- | --- | --- |
| Q4_K_M | 14.33GB | Comfortable: leaves ~9GB for a large KV cache / context |
| Q5_K_M | 16.76GB | Comfortable: leaves ~7GB for context; a small fidelity bump over Q4_K_M |
| Q6_K | 19.35GB | Recommended: near-lossless weights that still fit well; ~4GB left for the KV cache (modest context, extend it by quantizing the cache, see Running) |
| Q8_0 | 25.05GB | Does not fit 24GB: exceeds the RX 7900 XTX's VRAM; needs a 32GB+ card |
| bf16 | 47.15GB | Does not fit 24GB: datacenter-only |

Not model weights — don't count these in the VRAM math:

  • The mmproj-* file (~0.88GB) is the vision projector, not the LLM. It is loaded alongside a quant via --mmproj only if you want image input, and adds ~0.88GB on top of the quant — exclude it from the weight/VRAM budget unless you actually enable vision.
  • The .imatrix (~10 MB) is calibration data used to produce the quants — never load it as a model.
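
The fit figures above come down to subtraction, and a small helper makes it easy to redo the math for another quant or with the vision projector enabled. This is only a rough sketch: `headroom_gb` is a hypothetical helper name, it treats the card as exactly 24 GB, uses the on-disk sizes quoted above, and ignores compute buffers and any desktop usage, so real headroom will be somewhat lower.

```shell
# headroom_gb VRAM_GB QUANT_GB [MMPROJ_GB]
# Prints the VRAM left for the KV cache after the weights
# (and, optionally, the vision projector) are loaded.
headroom_gb() {
  awk -v vram="$1" -v quant="$2" -v mmproj="${3:-0}" \
    'BEGIN { printf "%.2f\n", vram - quant - mmproj }'
}

headroom_gb 24 19.35        # Q6_K, text only
headroom_gb 24 19.35 0.88   # Q6_K with the vision projector
headroom_gb 24 14.33        # Q4_K_M, text only
```

The table rounds these results down to leave a margin for runtime overhead.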

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
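
Before moving on, it can help to confirm that the group change took effect and that ROCm sees the card. The sketch below is one way to do that; `missing_groups` and `check_rocm` are hypothetical helper names, and it assumes `rocminfo` (installed with ROCm) reports the card under the name gfx1100.

```shell
# missing_groups "GROUP LIST"
# Prints which of render/video are absent from a space-separated
# group list (for example the output of `id -nG`).
missing_groups() {
  for g in render video; do
    case " $1 " in
      *" $g "*) ;;                    # group present, nothing to report
      *) printf '%s\n' "$g" ;;        # group missing
    esac
  done
}

# check_rocm: run manually after logging out and back in.
check_rocm() {
  missing=$(missing_groups "$(id -nG)")
  [ -z "$missing" ] || echo "not in group(s): $missing"
  # The 7900 XTX should show up as gfx1100 in the rocminfo agent list
  rocminfo | grep -q gfx1100 && echo "gfx1100 visible to ROCm"
}
```

If `check_rocm` prints nothing about gfx1100, revisit the driver install before building llama.cpp or starting Ollama.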

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization.

Option A — llama.cpp with the HIP/ROCm backend

Build a recent llama.cpp against the HIP backend for the RX 7900 XTX (RDNA3, gfx1100). Per the llama.cpp build docs, the Linux HIP build for this card is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RX 7900 XTX is RDNA3 / Navi 31 = gfx1100
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX"). A recent master is all you need — Mistral3/Pixtral text has been mainline in llama.cpp since mid-2025 (bartowski built these GGUFs with release b5697), so no source-branch cherry-pick or special patch is required for this model. The GGUF quants are integer formats, so the absence of FP8 tensor hardware on RDNA3 is irrelevant here — Mistral Small's Q4_K_M/Q5_K_M/Q6_K runs on standard HIP kernels.

Option B — Ollama with ROCm

Ollama is built on llama.cpp, uses the ROCm runtime you installed above, and runs the gfx1100 card natively — it is the fastest way to stand this model up. Use a recent Ollama release and pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K

Swap the :Q6_K tag for :Q4_K_M or :Q5_K_M if you want more context headroom. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients. (No PR-branch gate applies to this model, so a recent Ollama build works out of the box — just make sure ROCm is installed, or Ollama silently falls back to CPU; see Troubleshooting.)

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q6_K (case-insensitive) to pick the quant (llama-server docs):

# Q6_K (recommended), offload all layers to the 7900 XTX
./build/bin/llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (19.35GB at Q6_K) must sit in VRAM.
  • -c 16384 sets a 16K context. At Q6_K only ~4GB is left after the weights, so keep the context modest at f16, or quantize the KV cache (below) to push it much higher. Raise or lower -c while watching rocm-smi (see Troubleshooting).
  • --jinja applies the GGUF's built-in chat template so the assistant format parses correctly. If tool-calling misbehaves, add --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled with the repo).
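
The first launch downloads roughly 19GB and then loads it into VRAM, so the endpoint is not usable right away. A script that starts the server and then sends requests can poll llama-server's `/health` endpoint first. This is a minimal sketch; `wait_for_server` is a hypothetical helper, and the 120-second default timeout is an arbitrary assumption.

```shell
# wait_for_server URL [TIMEOUT_SECONDS]
# Polls URL about once per second until it returns an HTTP success
# status, or gives up when the timeout expires.
wait_for_server() {
  url=$1
  remaining=${2:-120}
  while [ "$remaining" -gt 0 ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 1
}

# Usage (not run here):
#   wait_for_server http://localhost:8000/health 300 && echo "server ready"
```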

Push toward the 128K context window. Mistral Small 3.2 advertises a 128K context (max_position_embeddings 131072). You can't hold a full-length f16 KV cache next to Q6_K weights on 24GB — to reach long windows, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:

# Longer context by 8-bit-quantizing the KV cache
./build/bin/llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 65536 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

Note: Flash Attention on ROCm/RDNA3 is less mature than on CUDA — it works on this card, but test it on your build rather than assuming parity with NVIDIA. If a quantized KV cache misbehaves, drop back to the plain -c 16384 launch above and lower the context instead. To trade a little weight fidelity for much more context headroom on 24GB, drop to :Q5_K_M (16.76GB, ~7GB free) or :Q4_K_M (14.33GB, ~9GB free).
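
To pick a `-c` value with some confidence, the KV-cache size can be estimated from the architecture figures given earlier (40 layers, 8 KV heads). The sketch below is an estimate under stated assumptions, not a measurement: the head dimension of 128 is not stated in this recipe and should be checked against the model's config.json, a q8_0 block is taken as 34 bytes per 32 values, and compute buffers are ignored. `kv_cache_bytes` is a hypothetical helper name.

```shell
# kv_cache_bytes CONTEXT_TOKENS BYTES_PER_32_VALUES
#   f16  cache: 64 bytes per 32 values
#   q8_0 cache: 34 bytes per 32 values (32 int8 values + one fp16 scale)
kv_cache_bytes() {
  layers=40      # from the model notes in this recipe
  kv_heads=8     # GQA: 8 KV heads
  head_dim=128   # ASSUMPTION: not stated in this recipe; check config.json
  # Leading 2 accounts for keys and values
  echo $(( 2 * layers * kv_heads * head_dim * $1 * $2 / 32 ))
}

kv_cache_bytes 16384 64    # 16K context, f16 cache
kv_cache_bytes 65536 34    # 64K context, q8_0 cache
kv_cache_bytes 131072 34   # full 128K context, q8_0 cache
```

Under these assumptions the 16K f16 cache is about 2.7 GB and the 64K q8_0 cache about 5.7 GB. The latter is close to or above the headroom quoted beside Q6_K, so confirm the 64K launch with rocm-smi, or pair long contexts with Q5_K_M or Q4_K_M.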

Optional — image input. The Pixtral vision tower lets the model read images. Download the mmproj-* file from the same GGUF repo and pass it alongside the quant; it adds ~0.88GB of VRAM on top of the weights:

./build/bin/llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K \
    --mmproj mmproj-mistralai_Mistral-Small-3.2-24B-Instruct-2506-f16.gguf \
    --port 8000 -ngl 99 -c 16384 --jinja
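
Once the projector is loaded, the client still has to put the image into the request. A common way with OpenAI-compatible servers is an `image_url` content part carrying a base64 data URL; the sketch below assumes llama-server accepts that shape on `/v1/chat/completions`. `image_data_url` and `ask_about_image` are hypothetical helpers, and the question text is pasted into the JSON without escaping.

```shell
# image_data_url FILE [MIME]
# Prints FILE as a base64 data URL (MIME defaults to image/png).
image_data_url() {
  printf 'data:%s;base64,%s' "${2:-image/png}" "$(base64 < "$1" | tr -d '\n')"
}

# ask_about_image FILE QUESTION  (not run here)
# QUESTION is inserted unescaped, so keep it free of quotes and backslashes.
# The body is sent on stdin (-d @-) so large images do not hit argument limits.
ask_about_image() {
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- <<EOF
{"model": "mistral-small-3.2-24b",
 "messages": [{"role": "user", "content": [
   {"type": "text", "text": "$2"},
   {"type": "image_url", "image_url": {"url": "$(image_data_url "$1")"}}
 ]}]}
EOF
}

# Usage (not run here): ask_about_image chart.png "What does this chart show?"
```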

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K

Ollama uses the ROCm runtime you installed above and serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

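# Point Open WebUI at your local llama-server (or Ollama on :11434)
# On Linux, host.docker.internal only resolves with the --add-host mapping below.
# llama-server listens on 127.0.0.1 by default; start it with --host 0.0.0.0
# if the container cannot reach it.
docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main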

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral-small-3.2-24b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
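
The same request works against either runtime; only the base URL and the model name change. The wrapper below is a sketch: `chat_body` and `chat` are hypothetical helpers, the prompt is inserted without JSON escaping, and the Ollama model name is assumed to be the tag pulled earlier.

```shell
# chat_body MODEL PROMPT
# Prints a minimal chat-completions request body.
# No JSON escaping is done: keep PROMPT free of quotes and backslashes.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# chat BASE_URL MODEL PROMPT  (not run here)
chat() {
  curl -s "$1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer EMPTY" \
    -d "$(chat_body "$2" "$3")"
}

# llama-server:
#   chat http://localhost:8000/v1 mistral-small-3.2-24b "Draft a two-line status update"
# Ollama (ASSUMPTION: model name equals the tag pulled earlier):
#   chat http://localhost:11434/v1 \
#     hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K \
#     "Draft a two-line status update"
```

For prompts that contain quotes or newlines, build the body with a JSON-aware tool rather than `printf`.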

Results

  • VRAM usage: The dense 24B loads entirely as its GGUF file — Q6_K is 19.35GB on disk (byte-verified from the bartowski GGUF tree). On the RX 7900 XTX's 24GB that leaves ~4GB for the KV cache — enough for a modest context at f16, or a much larger window with an 8-bit-quantized cache (see Running). Q4_K_M (14.33GB, ~9GB free) and Q5_K_M (16.76GB, ~7GB free) trade a little weight fidelity for more context headroom. Q8_0 (25.05GB) and bf16 (47.15GB) do not fit 24GB. Enabling image input adds ~0.88GB for the mmproj projector.
  • Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU Pro 5-shot CoT 69.06%, MATH 69.42%, GPQA Diamond 46.13%, and HumanEval Plus pass@5 92.90%, plus a sharp instruction-following jump over 3.1. It handles 23 languages. These are the vendor's benchmarks, not measurements on this GPU.
  • Speed: No community throughput benchmark for Mistral Small 3.2 24B on the RX 7900 XTX exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-small-3-2-24b/rx-7900-xtx once contributed.

For the full benchmark data, see /check/mistral-small-3-2-24b/rx-7900-xtx.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Small 3.2 uses Mistral's own Tekken tokenizer (tekken.json), and on the Python serving paths that needs mistral-common >= 1.6.2. If tool-calling in particular misbehaves, additionally pass --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled in the model repo) to override the embedded one.

Out of memory at Q6_K, or when raising the context

Q6_K weights (19.35GB) leave only ~4GB on a 24GB 7900 XTX for the KV cache, so a long f16 context can exhaust VRAM. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM — but Flash Attention on ROCm/RDNA3 is less mature than on CUDA, so test it on your build); lower -c; or drop to Q5_K_M (16.76GB, ~7GB free) or Q4_K_M (14.33GB, ~9GB free) for a lot more context headroom at a small fidelity cost. If you enabled --mmproj for images, remember it's another ~0.88GB.
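
While tuning `-c`, it helps to see how close the card is to full. The sketch below turns `rocm-smi --showmeminfo vram` output into a single percentage. The line format it parses is an assumption based on typical rocm-smi output and may differ between versions. `vram_used_pct` is a hypothetical helper that reads only the first GPU.

```shell
# vram_used_pct
# Reads `rocm-smi --showmeminfo vram` output on stdin and prints used VRAM
# as a percentage of total (first GPU only).
# ASSUMPTION: lines look like
#   GPU[0] : VRAM Total Memory (B): <bytes>
#   GPU[0] : VRAM Total Used Memory (B): <bytes>
vram_used_pct() {
  awk '
    /VRAM Total Memory \(B\)/      && !seen_total { total = $NF; seen_total = 1 }
    /VRAM Total Used Memory \(B\)/ && !seen_used  { used  = $NF; seen_used  = 1 }
    END { if (total > 0) printf "%.1f\n", 100 * used / total }
  '
}

# Usage (not run here):
#   rocm-smi --showmeminfo vram | vram_used_pct
```

A reading in the high 90s with the server idle suggests there is little room left to raise the context.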

Ollama or llama.cpp runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON and that -ngl 99 is offloading layers. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.

Image input doesn't work

Vision needs the mmproj projector loaded alongside the quant via --mmproj (see Running) — the quant alone is text-only. The mmproj-* file lives in the same GGUF repo as the weights; make sure you're on a recent llama.cpp/Ollama build with multimodal support, and that your client actually sends the image in the request. The projector is ~0.88GB of extra VRAM.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card. (These are Llama-7B figures cited only to show the ROCm-vs-Vulkan gap on this GPU, not Mistral Small numbers.)
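
To compare the two backends on your own card, one option is to build each into its own directory and run the same llama-bench command against both. This is a sketch under assumptions: `backend_flags` and `build_and_bench` are hypothetical helpers, the Vulkan build needs the Vulkan development packages installed, the HIP build may also need the HIPCXX/HIP_PATH variables from Option A, and the model path points at a GGUF you have already downloaded.

```shell
# backend_flags hip|vulkan
# Prints the CMake flags this recipe uses for each backend.
backend_flags() {
  case "$1" in
    hip)    echo "-DGGML_HIP=ON -DGPU_TARGETS=gfx1100" ;;
    vulkan) echo "-DGGML_VULKAN=ON" ;;
    *)      echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# build_and_bench BACKEND MODEL.gguf  (not run here)
# Builds into build-BACKEND so both binaries can coexist, then benchmarks
# prompt processing (-p 512) and generation (-n 128) with all layers on the GPU.
build_and_bench() {
  # Word splitting of the flags is intended here
  cmake -S . -B "build-$1" $(backend_flags "$1") -DCMAKE_BUILD_TYPE=Release &&
    cmake --build "build-$1" --config Release -j "$(nproc)" &&
    "build-$1/bin/llama-bench" -m "$2" -ngl 99 -p 512 -n 128
}

# Usage (not run here), from the llama.cpp checkout:
#   build_and_bench hip    /path/to/model-Q6_K.gguf
#   build_and_bench vulkan /path/to/model-Q6_K.gguf
```

Compare the generation rows of the two runs; whichever backend is faster on your build is the one to serve with.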

torch / ROCm errors — this is llama.cpp, not a Python ML stack

Serving Mistral Small 3.2 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a ROCm error, confirm you built (or are running) a HIP-enabled llama.cpp (Option A, -DGGML_HIP=ON -DGPU_TARGETS=gfx1100) rather than a CPU-only binary, and that the ROCm v7 driver is installed (rocm-smi lists the card). For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but that needs far more than 24GB (bf16 is ~47GB) — on a single 7900 XTX the GGUF + llama.cpp-HIP path is the right one.

Model or GPU 404 on /check

Mistral Small 3.2 24B is a new addition; if the /check/mistral-small-3-2-24b/rx-7900-xtx link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Mistral Small 3.2 24B need?

This recipe targets a 24 GB card. The weights alone take 14.33 GB (Q4_K_M) to 19.35 GB (Q6_K), and the remaining VRAM goes to the KV cache.

Which GPUs is Mistral Small 3.2 24B tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Intermediate — follow the steps above.