
Mistral Small 3.2 24B on RTX 4080: Local Private Assistant via llama.cpp / Ollama (16GB)

llm · intermediate · 16GB+ VRAM · Jul 3, 2026

This intermediate recipe sets up Mistral Small 3.2 24B on an RTX 4080. The Q4_K_M quant (14.33GB) takes up most of the card's 16GB of VRAM, so context length is the main thing you will tune.

prerequisites
  • NVIDIA RTX 4080 (16GB VRAM, Ada Lovelace AD103, sm_89)
  • 16GB+ system RAM (32GB comfortable)
  • ~15GB free disk for the GGUF (Q4_K_M ~14GB)
  • A recent llama.cpp build (CUDA) or Ollama — no special patch needed for this June-2025 model
  • Optional: Open WebUI (or any OpenAI-compatible chat client) for a local chat front-end. Image input adds the ~0.88GB mmproj projector file; mistral-common >= 1.6.2 matters only on Python serving paths, not for llama.cpp or Ollama

What You'll Build

A fully local, private general assistant: Mistral Small 3.2 24B — Mistral's newest generalist Small (release 2506, superseding 3.1 from 2503) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 16GB RTX 4080, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a chat/reasoning/writing model, not a coding agent: general Q&A, drafting and editing, multi-step reasoning, 23-language multilingual support, and — because the checkpoint carries a Pixtral vision tower — optional image understanding (send it an image, it answers in text). Everything runs on your own hardware, so prompts and documents never leave the machine.

Hardware data: RTX 4080 (16GB VRAM) · Mistral Small 3.2 24B, GGUF Q4_K_M (14.33GB, the only quant that fits 16GB) · See benchmark data

ℹ️ This is a dense 24B generalist, not a MoE and not text-only. Mistral Small 3.2 is a Mistral3ForConditionalGeneration (model_type: mistral3) — hidden size 5120, 40 layers, GQA with 32 query / 8 KV heads — the same base architecture as Devstral, so the quant byte-sizes are identical. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The Pixtral vision tower means it can analyze images in addition to text, but it is positioned and used here as a general assistant (vertical llm), not a coding agent. Context window is 128K (max_position_embeddings 131072). It uses Mistral's Tekken tokenizer (tekken.json), which needs mistral-common >= 1.6.2 on the Python serving paths.

ℹ️ Runs on current llama.cpp out of the box. Unlike some later Mistral 3 releases, this June-2025 model needs no special patch — bartowski quantized it with llama.cpp release b5697 (June 2025), and Mistral3/Pixtral text support has been mainline since mid-2025. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the chat template applies; if tool-calling misbehaves, additionally pass the bundled --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja.

⚠️ 16GB is the floor for this dense 24B — it is context-constrained. At 16GB VRAM only Q4_K_M (14.33GB) fits; the next step up, Q5_K_M (16.76GB), does NOT fit 16GB (it alone exceeds the card, before any KV cache). After the Q4_K_M weights you have only ~1.5–2GB left for the KV cache, so start with a bounded context (-c 8192 or -c 16384) and stretch it by quantizing the cache (-fa on -ctk q8_0 -ctv q8_0). If you want Q5_K_M / Q6_K, or a comfortable large context, step up to a 24GB card.

Requirements

| Component | Minimum | Tested target |
| --- | --- | --- |
| GPU | 16GB VRAM (this card's floor) | RTX 4080 (16GB, Ada Lovelace AD103, sm_89) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~15GB (Q4_K_M) | ~15GB for Q4_K_M |
| Software | Recent llama.cpp (CUDA) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI |

Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Small-3.2-24B-Instruct-2506); the model is quantized to GGUF by the community. Primary source is bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF; unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF is a good alternative that also ships UD-*_XL "dynamic" quants. Byte-verified on-disk sizes (bartowski):

| Quant | On-disk size | Fit on RTX 4080 (16GB) |
| --- | --- | --- |
| Q4_K_M | 14.33GB | Recommended: the only quant listed here that fits 16GB. Leaves only ~1.5–2GB for the KV cache, so keep context bounded (see Running) |
| Q5_K_M | 16.76GB | Does not fit 16GB: the weights alone exceed the card's VRAM; needs a 24GB+ card |
| Q6_K | 19.35GB | Does not fit 16GB; needs a 24GB card |
| Q8_0 | 25.05GB | Does not fit 16GB; needs a 32GB+ card |
| bf16 | 47.15GB | Does not fit 16GB; datacenter-only |

Not model weights — don't count these in the VRAM math:

  • The mmproj-* file (~0.88GB) is the vision projector, not the LLM. It is loaded alongside a quant via --mmproj only if you want image input, and adds ~0.88GB on top of the quant — exclude it from the weight/VRAM budget unless you actually enable vision. On 16GB it eats into the already-tight KV-cache headroom, so drop context further if you enable it.
  • The .imatrix (~10 MB) is calibration data used to produce the quants — never load it as a model.

Licensing. Mistral Small 3.2 24B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization.

Option A — llama.cpp with CUDA

The RTX 4080 is Ada Lovelace (AD103, sm_89). Build a recent llama.cpp and compile for sm_89, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 4080 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8

Install the NVIDIA CUDA toolkit before running cmake. Any recent release works: Mistral3/Pixtral text support has been mainline in llama.cpp since mid-2025 (bartowski built these GGUFs with release b5697). If you prefer a prebuilt binary, grab a current CUDA build from the releases page. On current llama.cpp the CUDA backend flag is -DGGML_CUDA=ON; the old LLAMA_CUDA name was retired in late 2024.
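
If you are unsure which architecture value to pass, one option is to derive it from the compute capability the driver reports. The helper below is a minimal sketch: cap_to_cuda_arch is a hypothetical name, and the compute_cap query field assumes a reasonably recent NVIDIA driver.

```shell
# Convert a compute capability such as "8.9" into the value CMake expects ("89").
# cap_to_cuda_arch is an illustrative helper, not part of llama.cpp.
cap_to_cuda_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Usage on a machine with the NVIDIA driver installed (assumes the driver
# supports the compute_cap query field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cap_to_cuda_arch "$cap")"
```

On an RTX 4080 the reported capability should be 8.9, which matches the hard-coded 89 in the build command above.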

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. Use a recent Ollama release and pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M

Q4_K_M is the quant to use on 16GB — Q5_K_M and larger do not fit. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q4_K_M (case-insensitive) to pick the quant (llama-server docs):

# Q4_K_M (the only quant that fits 16GB), offload all layers to the 4080
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
    --port 8000 \
    -ngl 99 \
    -c 8192 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (14.33GB at Q4_K_M) must sit in VRAM.
  • -c 8192 sets an 8K context. On 16GB the Q4_K_M weights leave only ~1.5–2GB for the KV cache, so keep the f16 context small — start at -c 8192 (or -c 16384 if it fits), and quantize the cache (below) to go higher.
  • --jinja applies the GGUF's built-in chat template so the assistant format parses correctly. If tool-calling misbehaves, add --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled with the repo).
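
Loading 14GB of weights takes a while, and llama-server answers its /health endpoint with an error status until the model is ready. A small polling loop can tell a script when it is safe to send requests. This is a sketch under assumptions: wait_for_server is a hypothetical helper, and the URL, retry count and 2-second pause are arbitrary defaults.

```shell
# Poll llama-server's /health endpoint until it answers or the tries run out.
# wait_for_server is an illustrative helper; URL and retry count are assumptions.
wait_for_server() {
    local url="${1:-http://localhost:8000/health}"
    local tries="${2:-60}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors, such as 503 while the model loads
        if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
            echo "server ready at $url"
            return 0
        fi
        i=$((i + 1))
        if [ "$i" -lt "$tries" ]; then sleep 2; fi
    done
    echo "server not ready after $tries attempts" >&2
    return 1
}

# Usage:
#   wait_for_server http://localhost:8000/health 60 && echo "start chatting"
```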

Push toward the 128K context window. Mistral Small 3.2 advertises a 128K context (max_position_embeddings 131072). On 16GB you cannot hold anywhere near a full-length f16 KV cache next to the weights — to stretch the window, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:

# Longer context by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
    --port 8000 -ngl 99 -c 16384 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

The 16GB budget is genuinely tight: Q4_K_M is the only quant that fits, and even it leaves little KV-cache room, so long contexts require the quantized cache above. If you need Q5_K_M / Q6_K weights or a comfortably large f16 context, that requires a 24GB card, not the 4080.
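
To see why 8K to 16K is the practical range, a rough estimate of the KV-cache size helps. The sketch below multiplies out the architecture numbers quoted earlier (40 layers, 8 KV heads). The head dimension of 128 is an assumption that this recipe does not state, and kv_cache_mib is a hypothetical helper name. It ignores llama.cpp's compute buffers, so treat the result as a lower bound.

```shell
# Estimate KV-cache size in MiB for a given context length.
# Layer and KV-head counts come from the architecture note in this recipe;
# HEAD_DIM=128 is an assumption, so verify it against the model's config.json.
N_LAYERS=40
N_KV_HEADS=8
HEAD_DIM=128

# kv_cache_mib CONTEXT_TOKENS BYTES_PER_32_ELEMENTS
#   f16  stores 32 elements in 64 bytes
#   q8_0 stores 32 elements in 34 bytes (32 int8 values plus one f16 scale)
kv_cache_mib() {
    local ctx="$1" bytes_per_32="$2"
    # The leading 2 accounts for one K tensor and one V tensor per layer
    echo $(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx * bytes_per_32 / 32 / 1048576 ))
}

echo "f16  @  8192 tokens: $(kv_cache_mib 8192 64) MiB"
echo "f16  @ 16384 tokens: $(kv_cache_mib 16384 64) MiB"
echo "q8_0 @ 16384 tokens: $(kv_cache_mib 16384 34) MiB"
```

Under those assumptions it prints 1280 MiB, 2560 MiB and 1360 MiB. An f16 cache at 16K overshoots the ~1.5–2GB of headroom, while the q8_0 cache at the same length is about 53% of that size, in line with "roughly halves".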

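Optional: image input. The Pixtral vision tower lets the model read images. Download the mmproj-* file from the same GGUF repo and pass it alongside the quant with --mmproj; it adds ~0.88GB of VRAM on top of the weights, which is tight on 16GB, so lower -c to make room. Recent llama.cpp builds can also fetch a repo's projector automatically when the model is loaded with -hf; if yours does, pass --no-mmproj on text-only runs to keep that ~0.88GB free (check llama-server --help for your build):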

llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M \
    --mmproj mmproj-mistralai_Mistral-Small-3.2-24B-Instruct-2506-f16.gguf \
    --port 8000 -ngl 99 -c 8192 --jinja
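
Once the projector is loaded, clients send images in the OpenAI-style image_url content part, typically as a base64 data URI. The following is a minimal sketch of building such a request in shell. build_image_request is a hypothetical helper, the model name is the same placeholder used elsewhere in this recipe, and the prompt is pasted into the JSON unescaped, so it must not contain double quotes or backslashes.

```shell
# Build an OpenAI-style chat request that carries a local image as a base64 data URI.
# build_image_request is an illustrative helper, not part of llama.cpp.
build_image_request() {
    local image_path="$1" prompt="$2" mime="${3:-image/png}"
    local b64
    # Strip newlines so the data URI stays on one line on every platform
    b64=$(base64 < "$image_path" | tr -d '\n')
    printf '{"model":"mistral-small-3.2-24b","messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"data:%s;base64,%s"}}]}]}\n' \
        "$prompt" "$mime" "$b64"
}

# Usage against a llama-server started with the projector loaded:
#   build_image_request photo.png "Describe this image." |
#       curl -s http://localhost:8000/v1/chat/completions \
#            -H "Content-Type: application/json" -d @-
```

For prompts with arbitrary text, a JSON-aware tool is the safer way to build the body.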

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
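
Ollama applies its own default context length, which may differ from the -c value used on the llama.cpp path. One way to pin it is a small Modelfile layered on the same Hugging Face GGUF. This is a sketch, assuming FROM accepts the hf.co reference the same way ollama run does; write_modelfile and the local model name are hypothetical.

```shell
# Emit a Modelfile that pins the context length for the Hugging Face GGUF.
# write_modelfile is an illustrative helper; 8192 mirrors the -c value used with llama.cpp.
write_modelfile() {
    local num_ctx="${1:-8192}"
    cat <<EOF
FROM hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
PARAMETER num_ctx $num_ctx
EOF
}

# Usage (mistral-small-3.2-8k is an arbitrary local label):
#   write_modelfile 8192 > Modelfile
#   ollama create mistral-small-3.2-8k -f Modelfile
#   ollama run mistral-small-3.2-8k
```

After the model loads, ollama ps shows whether it sits fully on the GPU. Ollama's counterpart to the KV-cache flags is set through server environment variables (OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 in the Ollama FAQ at the time of writing); verify them against your installed version.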

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

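# Point Open WebUI at your local llama-server (or Ollama on :11434).
# On Linux, --add-host makes host.docker.internal resolve to the host, and
# llama-server must listen on a reachable address (e.g. --host 0.0.0.0).
docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main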

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral-small-3.2-24b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

By default neither llama-server nor Ollama checks the key, so any non-empty string (e.g. EMPTY) works where a client requires one. llama-server enforces a key only if you start it with --api-key.

Results

  • VRAM usage: The dense 24B loads entirely as its GGUF file — Q4_K_M is 14.33GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 4080's 16GB that leaves only ~1.5–2GB for the KV cache — enough for a small-to-modest context at f16, or a larger window with an 8-bit-quantized cache (see Running). Q4_K_M is the only quant that fits 16GB: Q5_K_M (16.76GB), Q6_K (19.35GB), Q8_0 (25.05GB) and bf16 (47.15GB) all exceed the card. Enabling image input adds ~0.88GB for the mmproj projector, tightening the already-limited context budget.
  • Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU Pro 5-shot CoT 69.06%, MATH 69.42%, GPQA Diamond 46.13%, HumanEval Plus pass@5 92.90%, MBPP Plus 78.33%, plus a sharp instruction-following jump over 3.1 — Wildbench v2 65.33% and Arena Hard v2 43.1%. On vision: MMMU 62.50% and DocVQA 94.86%. It handles 23 languages. These are the vendor's benchmarks, not measurements on this GPU.
  • Speed: No community throughput benchmark for Mistral Small 3.2 24B on the RTX 4080 exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-small-3-2-24b/rtx-4080 once contributed.
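
Until a measurement is published, you can take your own. A rough approach is to time one request and divide the completion token count by the elapsed time. The helper below is a minimal sketch: tok_per_sec is a hypothetical name, and wall-clock timing also includes prompt processing, so it understates pure generation speed.

```shell
# Turn a token count and an elapsed time in milliseconds into tokens per second.
# tok_per_sec is an illustrative helper; awk handles the floating-point division.
tok_per_sec() {
    awk -v tokens="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", tokens / (ms / 1000) }'
}

# Usage idea: time one request yourself and read usage.completion_tokens from the
# JSON reply, then feed both numbers in. The values below are placeholders that
# only show the call shape; they are not measurements from any GPU.
#   tok_per_sec 256 8000
```

llama.cpp's llama-bench binary is the more rigorous option if you have the GGUF path at hand.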

For the full benchmark data, see /check/mistral-small-3-2-24b/rtx-4080.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Small 3.2 uses Mistral's own Tekken tokenizer (tekken.json), and on the Python serving paths that needs mistral-common >= 1.6.2. If tool-calling in particular misbehaves, additionally pass --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled in the model repo) to override the embedded one.

Out of memory at Q4_K_M, or when raising the context

On a 16GB 4080, Q4_K_M weights (14.33GB) leave only ~1.5–2GB for the KV cache, so even a moderate f16 context can exhaust VRAM. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM), then lower -c. Of the quants listed above, Q4_K_M is already the smallest, so the table offers no step down. Community repos usually also carry smaller quants (Q4_K_S, IQ4_XS and below) that free some headroom at a quality cost, but they are outside what this recipe covers. For Q5_K_M / Q6_K or a large f16 context you need a 24GB card. If you enabled --mmproj for images, remember it's another ~0.88GB.
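
Before restarting the server it helps to check how much VRAM is actually free, since a desktop session or another process may already hold some. The sketch below compares a planned allocation with the free figure from nvidia-smi. has_headroom is a hypothetical helper, and the 256 MiB safety margin and the 15500 MiB figure in the usage comment are assumptions, not measured values.

```shell
# Decide whether a planned allocation fits into the VRAM that is currently free.
# has_headroom is an illustrative helper; the 256 MiB default margin is an assumption.
has_headroom() {
    local free_mib="$1" needed_mib="$2" margin_mib="${3:-256}"
    [ $(( needed_mib + margin_mib )) -le "$free_mib" ]
}

# Usage: read free VRAM, then compare it with what you plan to load
# (model file size plus your KV-cache estimate, both in MiB):
#   free=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
#   has_headroom "$free" 15500 && echo "should fit" || echo "lower -c or quantize the cache"
```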

Image input doesn't work

Vision needs the mmproj projector loaded alongside the quant via --mmproj (see Running) — the quant alone is text-only. The mmproj-* file lives in the same GGUF repo as the weights; make sure you're on a recent llama.cpp/Ollama build with multimodal support, and that your client actually sends the image in the request. The projector is ~0.88GB of extra VRAM, which is significant on this 16GB card — lower -c to make room.

torch / CUDA errors — this is llama.cpp, not a Python ML stack

Serving Mistral Small 3.2 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but that needs far more than 16GB (bf16 is ~47GB) — on a single 4080 the GGUF + llama.cpp path is the right one.

Model or GPU 404 on /check

Mistral Small 3.2 24B is a new addition; if the /check/mistral-small-3-2-24b/rtx-4080 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Mistral Small 3.2 24B need?

About 16 GB at Q4_K_M: the 14.33GB weights plus a bounded KV cache. The larger quants listed above need a 24GB or bigger card.

Which GPUs is Mistral Small 3.2 24B tested on?

RTX 4080 (16 GB).

How hard is this setup?

Intermediate: you build llama.cpp with CUDA or install Ollama, then tune the context length to fit 16 GB.