What You'll Build
A fully local, private general assistant: Mistral Small 3.2 24B — Mistral's newest generalist Small (release 2506, superseding 3.1 from 2503) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 16GB RTX 4080, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a chat/reasoning/writing model, not a coding agent: general Q&A, drafting and editing, multi-step reasoning, 23-language multilingual support, and — because the checkpoint carries a Pixtral vision tower — optional image understanding (send it an image, it answers in text). Everything runs on your own hardware, so prompts and documents never leave the machine.
Hardware data: RTX 4080 (16GB VRAM) · Mistral Small 3.2 24B, GGUF Q4_K_M (14.33GB, the only quant that fits 16GB) · See benchmark data
ℹ️ This is a dense 24B generalist, not a MoE and not text-only. Mistral Small 3.2 is a Mistral3ForConditionalGeneration (model_type: mistral3): hidden size 5120, 40 layers, GQA with 32 query / 8 KV heads. It shares its base architecture with Devstral, so the quant byte-sizes are identical. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The Pixtral vision tower means it can analyze images in addition to text, but it is positioned and used here as a general assistant (vertical llm), not a coding agent. Context window is 128K (max_position_embeddings 131072). It uses Mistral's Tekken tokenizer (tekken.json), which needs mistral-common >= 1.6.2 on the Python serving paths.
ℹ️ Runs on current llama.cpp out of the box. Unlike some later Mistral 3 releases, this June-2025 model needs no special patch: bartowski quantized it with llama.cpp release b5697 (June 2025), and Mistral3/Pixtral text support has been mainline since mid-2025. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the chat template applies; if tool-calling misbehaves, additionally pass the bundled template with --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja.
⚠️ 16GB is the floor for this dense 24B, and it is context-constrained. At 16GB VRAM only Q4_K_M (14.33GB) fits; the next step up, Q5_K_M (16.76GB), does NOT fit 16GB (it alone exceeds the card, before any KV cache). After the Q4_K_M weights you have only ~1.5–2GB left for the KV cache, so start with a bounded context (-c 8192 or -c 16384) and stretch it by quantizing the cache (-fa on -ctk q8_0 -ctv q8_0). If you want Q5_K_M / Q6_K, or a comfortable large context, step up to a 24GB card.
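To see why the context has to stay bounded, the KV-cache size can be estimated from the architecture figures above. This is a rough sketch, not a measurement: the layer and KV-head counts come from this guide, but the per-head dimension of 128 is an assumption (it is not stated here, so check the model's config.json), and llama.cpp's compute buffers are not included.

```python
# Rough KV-cache size estimate for a dense GQA model served by llama.cpp.
N_LAYERS = 40      # from this guide
N_KV_HEADS = 8     # from this guide
HEAD_DIM = 128     # ASSUMPTION: not stated in this guide; verify in config.json

# Bytes per cached element. q8_0 packs 32 int8 values plus one f16 scale
# into a 34-byte block, i.e. 34/32 bytes per element.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx: int, cache_type: str = "f16") -> float:
    """Approximate KV-cache size in GiB for n_ctx tokens (K and V, all layers)."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # 2 = keys + values
    return n_ctx * elems_per_token * BYTES_PER_ELEM[cache_type] / 2**30


if __name__ == "__main__":
    for n_ctx in (8192, 16384, 32768, 131072):
        f16 = kv_cache_gib(n_ctx, "f16")
        q8 = kv_cache_gib(n_ctx, "q8_0")
        print(f"-c {n_ctx:>6}: f16 ~{f16:.2f} GiB, q8_0 ~{q8:.2f} GiB")
```

Under these assumptions an 8K f16 cache is about 1.25 GiB and a 16K one about 2.5 GiB, while the full 128K window would need roughly 20 GiB at f16, which is consistent with the advice to start small and quantize the cache before raising -c.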
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 16GB VRAM (the floor for this model) | RTX 4080 (16GB, Ada Lovelace AD103, sm_89) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~15GB (Q4_K_M) | ~15GB for Q4_K_M |
| Software | Recent llama.cpp (CUDA) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Small-3.2-24B-Instruct-2506); the model is quantized to GGUF by the community. Primary source is bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF; unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF is a good alternative that also ships UD-*_XL "dynamic" quants. Byte-verified on-disk sizes (bartowski):
| Quant | On-disk size | Fit on RTX 4080 (16GB) |
|---|---|---|
| Q4_K_M | 14.33GB | Recommended — the only quant that fits 16GB. Leaves only ~1.5–2GB for the KV cache, so keep context bounded (see Running) |
| Q5_K_M | 16.76GB | Does not fit 16GB — the weights alone exceed the card's VRAM; needs a 24GB+ card |
| Q6_K | 19.35GB | Does not fit 16GB — needs a 24GB card |
| Q8_0 | 25.05GB | Does not fit 16GB — needs a 32GB+ card |
| bf16 | 47.15GB | Does not fit 16GB — datacenter-only |
Not model weights — don't count these in the VRAM math:
- The mmproj-* file (~0.88GB) is the vision projector, not the LLM. It is loaded alongside a quant via --mmproj only if you want image input, and adds ~0.88GB on top of the quant, so exclude it from the weight/VRAM budget unless you actually enable vision. On 16GB it eats into the already-tight KV-cache headroom, so drop context further if you enable it.
- The .imatrix file (~10 MB) is calibration data used to produce the quants; never load it as a model.
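The budget arithmetic behind the table and the two notes above can be written out as a small sketch. The sizes are the ones quoted in this guide; treating the card as exactly 16.0 in the same units as the file sizes is a simplification, and the function names are illustrative only.

```python
# VRAM headroom after loading the weights (and, optionally, the vision projector).
CARD_GB = 16.0  # RTX 4080, taken in the same units as the sizes below (simplification)

QUANT_GB = {
    "Q4_K_M": 14.33,
    "Q5_K_M": 16.76,
    "Q6_K": 19.35,
    "Q8_0": 25.05,
    "bf16": 47.15,
}
MMPROJ_GB = 0.88  # only counted when image input is enabled


def headroom_gb(quant: str, vision: bool = False, card_gb: float = CARD_GB) -> float:
    """VRAM left for the KV cache and buffers; negative means it does not fit."""
    used = QUANT_GB[quant] + (MMPROJ_GB if vision else 0.0)
    return round(card_gb - used, 2)


def fitting_quants(card_gb: float = CARD_GB) -> list[str]:
    """Quants whose weights alone are smaller than the card."""
    return [name for name, size in QUANT_GB.items() if size < card_gb]


if __name__ == "__main__":
    print(fitting_quants())                     # quants that fit a 16GB card
    print(headroom_gb("Q4_K_M"))                # text-only headroom
    print(headroom_gb("Q4_K_M", vision=True))   # headroom with mmproj loaded
```

With these numbers, Q4_K_M leaves about 1.67GB text-only and about 0.79GB once the projector is loaded, which is why enabling vision on this card means lowering the context further.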
Licensing. Mistral Small 3.2 24B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization.
Option A — llama.cpp with CUDA
The RTX 4080 is Ada Lovelace (AD103, sm_89). Build a recent llama.cpp and compile for sm_89, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 4080 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8
A recent release is all you need — Mistral3/Pixtral text has been mainline in llama.cpp since mid-2025 (bartowski built these GGUFs with release b5697). If you prefer a prebuilt binary, grab a current one from the releases page. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); install the NVIDIA CUDA toolkit first.
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Use a recent Ollama release and pull the community GGUF straight from Hugging Face (HF × Ollama docs):
ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
Q4_K_M is the quant to use on 16GB — Q5_K_M and larger do not fit. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q4_K_M (case-insensitive) to pick the quant (llama-server docs):
# Q4_K_M (the only quant that fits 16GB), offload all layers to the 4080
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
--port 8000 \
-ngl 99 \
-c 8192 \
--jinja
-ngl 99 (--n-gpu-layers) offloads every layer to the GPU; the dense 24B quant file (14.33GB at Q4_K_M) must sit in VRAM. -c 8192 sets an 8K context. On 16GB the Q4_K_M weights leave only ~1.5–2GB for the KV cache, so keep the f16 context small: start at -c 8192 (or -c 16384 if it fits), and quantize the cache (below) to go higher. --jinja applies the GGUF's built-in chat template so the assistant format parses correctly. If tool-calling misbehaves, add --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled with the repo).
Push toward the 128K context window. Mistral Small 3.2 advertises a 128K context (max_position_embeddings 131072). On 16GB you cannot hold anywhere near a full-length f16 KV cache next to the weights — to stretch the window, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:
# Longer context by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
--port 8000 -ngl 99 -c 16384 --jinja \
-fa on -ctk q8_0 -ctv q8_0
The 16GB budget is genuinely tight: Q4_K_M is the only quant that fits, and even it leaves little KV-cache room, so long contexts require the quantized cache above. If you need Q5_K_M / Q6_K weights or a comfortably large f16 context, that requires a 24GB card, not the 4080.
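As a rough guide to choosing -c, the sketch below inverts the cache-size arithmetic: given a KV-cache budget, it returns the largest context that fits for an f16 or q8_0 cache. It reuses the layer and KV-head counts from this guide and assumes a per-head dimension of 128, which is not stated here. It also ignores compute buffers, so treat the result as an upper bound rather than a guaranteed setting.

```python
# Largest context whose KV cache fits a given budget (upper bound, buffers ignored).
# 2 (K+V) * 40 layers * 8 KV heads * 128 head_dim; head_dim=128 is an ASSUMPTION.
ELEMS_PER_TOKEN = 2 * 40 * 8 * 128

# q8_0 stores 32 int8 values plus an f16 scale per 34-byte block.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def max_context(budget_gb: float, cache_type: str = "f16", step: int = 1024) -> int:
    """Largest context (a multiple of `step`) whose KV cache fits in budget_gb."""
    bytes_per_token = ELEMS_PER_TOKEN * BYTES_PER_ELEM[cache_type]
    tokens = int(budget_gb * 1e9 // bytes_per_token)
    return tokens // step * step


if __name__ == "__main__":
    for cache_type in ("f16", "q8_0"):
        print(cache_type, max_context(1.5, cache_type))
```

With a 1.5GB budget (the low end of the headroom quoted above), this gives 8192 tokens for an f16 cache and 16384 for q8_0, matching the two -c values used in the commands above.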
Optional — image input. The Pixtral vision tower lets the model read images. Download the mmproj-* file from the same GGUF repo and pass it alongside the quant; it adds ~0.88GB of VRAM on top of the weights (tight on 16GB — lower -c to make room):
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
--mmproj mmproj-mistralai_Mistral-Small-3.2-24B-Instruct-2506-f16.gguf \
--port 8000 -ngl 99 -c 8192 --jinja
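Once the server is running with --mmproj, a client has to send the image inside the chat request. The sketch below builds such a request body using the OpenAI-style image_url content part with a base64 data URL, which is the format OpenAI-compatible servers generally accept. The helper names and the model string are illustrative, and whether your particular build accepts this payload is an assumption to verify.

```python
import base64
import json


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one user message carrying text plus an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


def vision_payload(prompt: str, image_bytes: bytes,
                   model: str = "mistral-small-3.2-24b") -> str:
    """JSON body for POST /v1/chat/completions (model name is a placeholder)."""
    return json.dumps({"model": model, "messages": [image_message(prompt, image_bytes)]})


# Usage (hypothetical file name):
#   with open("chart.png", "rb") as f:
#       body = vision_payload("Describe this chart.", f.read())
#   then POST `body` to http://localhost:8000/v1/chat/completions
```

Large images inflate the request and the prompt, so on a 16GB card it may help to downscale them before encoding.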
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
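# Point Open WebUI at your local llama-server (or Ollama on :11434).
# On Linux, --add-host maps host.docker.internal to the host; llama-server must
# also listen beyond 127.0.0.1 (start it with --host 0.0.0.0) to be reachable.
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=EMPTY \
ghcr.io/open-webui/open-webui:main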
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-small-3.2-24b",
"messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
}'
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
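For scripts, the same request can be made from Python with only the standard library. This is a minimal sketch mirroring the curl call above: the base URL, port, and model name are the ones used in this guide, and the function names are illustrative. It assumes the server returns the usual OpenAI-style choices[0].message.content field.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # use http://localhost:11434/v1 for Ollama


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = "mistral-small-3.2-24b",
                       api_key: str = "EMPTY") -> urllib.request.Request:
    """Prepare a POST to /chat/completions without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=300) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage, with the server from the Running section up:
#   print(chat("Summarize this in three bullet points: ..."))
```

Keeping request construction separate from sending makes it easy to inspect or log the exact payload before it reaches the server.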
Results
- VRAM usage: The dense 24B loads entirely as its GGUF file. Q4_K_M is 14.33GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 4080's 16GB that leaves only ~1.5–2GB for the KV cache: enough for a small-to-modest context at f16, or a larger window with an 8-bit-quantized cache (see Running). Q4_K_M is the only quant that fits 16GB: Q5_K_M (16.76GB), Q6_K (19.35GB), Q8_0 (25.05GB) and bf16 (47.15GB) all exceed the card. Enabling image input adds ~0.88GB for the mmproj projector, tightening the already-limited context budget.
- Model capability (vendor evals: Mistral's own, NOT hardware throughput): Mistral reports MMLU Pro 5-shot CoT 69.06%, MATH 69.42%, GPQA Diamond 46.13%, HumanEval Plus pass@5 92.90%, MBPP Plus 78.33%, plus a sharp instruction-following jump over 3.1 (Wildbench v2 65.33% and Arena Hard v2 43.1%). On vision: MMMU 62.50% and DocVQA 94.86%. It handles 23 languages. These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Mistral Small 3.2 24B on the RTX 4080 exists yet, and we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-small-3-2-24b/rtx-4080 once contributed.
For the full benchmark data, see /check/mistral-small-3-2-24b/rtx-4080.
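Until a measured figure is published, you can time your own setup. The sketch below is one simple way to do it, not an official benchmark method: it sends a single request, reads the standard OpenAI-style usage.completion_tokens field, and divides by wall-clock time. The endpoint and model name follow this guide; the function names are illustrative. Because the timing includes prompt processing, it slightly understates pure generation speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by elapsed wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(prompt: str, base_url: str = "http://localhost:8000/v1",
            max_tokens: int = 256) -> float:
    """Time one chat completion against a local server and return tok/s."""
    body = json.dumps({
        "model": "mistral-small-3.2-24b",  # placeholder name, as in the curl example
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(reply["usage"]["completion_tokens"], elapsed)


# Usage, with the server running:
#   print(f"{measure('Write a short paragraph about GPUs.'):.1f} tok/s")
```

Run it a few times and discard the first result, since the first request after startup often includes one-off warm-up costs.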
Troubleshooting
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Small 3.2 uses Mistral's own Tekken tokenizer (tekken.json), and on the Python serving paths that needs mistral-common >= 1.6.2. If tool-calling in particular misbehaves, additionally pass --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled in the model repo) to override the embedded one.
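To check whether tool-calling works with the template you are running, a small smoke test can help. The sketch below builds a request in the standard OpenAI tools format and parses any tool calls out of the reply. The get_weather tool is purely hypothetical, the model name is a placeholder, and the reply shape is assumed to follow the OpenAI chat-completions schema.

```python
import json


def tool_smoke_payload(model: str = "mistral-small-3.2-24b") -> dict:
    """Request body that should provoke a single tool call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for testing only
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }


def extract_tool_calls(reply: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from an OpenAI-style reply, or [] if none."""
    message = reply["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]
```

If extract_tool_calls returns an empty list and the tool call instead appears as raw text in the message content, that suggests the chat template is not parsing tool calls, which is the case where the bundled template file is worth trying.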
Out of memory at Q4_K_M, or when raising the context
On a 16GB 4080, Q4_K_M weights (14.33GB) leave only ~1.5–2GB for the KV cache, so even a moderate f16 context can exhaust VRAM. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM); lower -c. There is no smaller-fitting step on this card — Q4_K_M is already the only quant that fits 16GB, so for Q5_K_M / Q6_K or a large f16 context you need a 24GB card. If you enabled --mmproj for images, remember it's another ~0.88GB.
Image input doesn't work
Vision needs the mmproj projector loaded alongside the quant via --mmproj (see Running) — the quant alone is text-only. The mmproj-* file lives in the same GGUF repo as the weights; make sure you're on a recent llama.cpp/Ollama build with multimodal support, and that your client actually sends the image in the request. The projector is ~0.88GB of extra VRAM, which is significant on this 16GB card — lower -c to make room.
torch / CUDA errors — this is llama.cpp, not a Python ML stack
Serving Mistral Small 3.2 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but that needs far more than 16GB (bf16 is ~47GB) — on a single 4080 the GGUF + llama.cpp path is the right one.
Model or GPU 404 on /check
Mistral Small 3.2 24B is a new addition; if the /check/mistral-small-3-2-24b/rtx-4080 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.