How much VRAM does Mistral Small 3.2 24B need?

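About 32 GB for the recommended Q8_0 quant (25.05GB of weights plus the KV cache), which is the minimum this recipe targets. Smaller quants such as Q4_K_M (14.33GB) need less.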

How hard is this setup?

Intermediate: follow the steps below.

Mistral Small 3.2 24B on RTX 5090: Local Private Assistant via llama.cpp / Ollama (32GB)

What You'll Build

A fully local, private general assistant: Mistral Small 3.2 24B — Mistral's newest generalist Small (release 2506, superseding 3.1 from 2503) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 32GB RTX 5090, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a chat/reasoning/writing model, not a coding agent: general Q&A, drafting and editing, multi-step reasoning, 23-language multilingual support, and — because the checkpoint carries a Pixtral vision tower — optional image understanding (send it an image, it answers in text). Everything runs on your own hardware, so prompts and documents never leave the machine.

Hardware data: RTX 5090 (32GB VRAM) · Mistral Small 3.2 24B, GGUF Q8_0 (25.05GB, recommended — near-lossless) — or Q6_K (19.35GB) / Q5_K_M (16.76GB) / Q4_K_M (14.33GB) for even more context headroom · See benchmark data

ℹ️ This is a dense 24B generalist, not a MoE and not text-only. Mistral Small 3.2 is a Mistral3ForConditionalGeneration (model_type: mistral3) with hidden size 5120, 40 layers, and GQA with 32 query / 8 KV heads. It shares its base architecture with Devstral, so the quant byte-sizes are identical. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut that shrinks VRAM. The Pixtral vision tower means it can analyze images in addition to text, but it is positioned and used here as a general assistant, not a coding agent. The context window is 128K (max_position_embeddings 131072). It uses Mistral's Tekken tokenizer (tekken.json), which needs mistral-common >= 1.6.2 on the Python serving paths.

ℹ️ Runs on current llama.cpp out of the box. Unlike some later Mistral 3 releases, this June-2025 model needs no special patch — bartowski quantized it with llama.cpp release b5697 (June 2025), and Mistral3/Pixtral text support has been mainline since mid-2025. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the chat template applies; if tool-calling misbehaves, additionally pass the bundled --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja.

ℹ️ 32GB unlocks Q8_0 — a real step over the 24GB tier's Q6_K. The RTX 5090's 32GB fits Q8_0 (25.05GB), the near-lossless integer quant that a 24GB card cannot hold, leaving ~7GB for the KV cache. That is the reason to reach for this card: essentially full weight fidelity plus a comfortable context. Note the ceiling: full-precision bf16 (47.15GB) still does NOT fit 32GB — that remains datacenter-only. The 5090 is Blackwell (GB202, sm_120) and needs a CUDA 12.8+ toolchain to build for sm_120 (see Installation).

Requirements

Component	Minimum	Tested target
GPU	32GB VRAM	RTX 5090 (32GB, Blackwell GB202, sm_120)
RAM	16GB system RAM	32GB comfortable
Storage	~26GB free (Q8_0 is 25.05GB; add ~0.88GB for the optional mmproj)	~26GB for Q8_0
Software	Recent llama.cpp (CUDA 12.8+, sm_120) or Ollama; optional Open WebUI chat client	`llama-server`, Open WebUI

Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Small-3.2-24B-Instruct-2506); the model is quantized to GGUF by the community. Primary source is bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF; unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF is a good alternative that also ships UD-*_XL "dynamic" quants. Byte-verified on-disk sizes (bartowski):

Quant	On-disk size	Fit on RTX 5090 (32GB)
Q4_K_M	14.33GB	Comfortable — leaves ~17GB for a very large KV cache / context
Q5_K_M	16.76GB	Comfortable — leaves ~15GB for context
Q6_K	19.35GB	Comfortable: high-fidelity weights with ~12GB free for context
Q8_0	25.05GB	Recommended: the near-lossless integer quant, which fits on 32GB but not on 24GB; ~7GB left for the KV cache (a comfortable context; extend it by quantizing the cache, see Running)
bf16	47.15GB	Does not fit 32GB — full precision is datacenter-only

Not model weights — don't count these in the VRAM math:

The mmproj-* file (~0.88GB) is the vision projector, not the LLM. It is loaded alongside a quant via --mmproj only if you want image input, and adds ~0.88GB on top of the quant — exclude it from the weight/VRAM budget unless you actually enable vision.
The .imatrix (~10 MB) is calibration data used to produce the quants — never load it as a model.
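
The budget arithmetic behind the table can be sketched in a few lines of shell. This is a rough estimate only: it treats the card as exactly 32GB, uses the on-disk sizes from the table above, and ignores the CUDA context and compute buffers, so real headroom will be somewhat lower. The function name `vram_left` is illustrative, not part of any tool.

```shell
# Sizes are in hundredths of a GB so plain integer arithmetic stays exact.
CARD_CGB=3200     # 32GB RTX 5090 (assumed to be fully available)
MMPROJ_CGB=88     # ~0.88GB vision projector, only counted with "vision"

# vram_left QUANT_CGB [vision]: prints the GB left for KV cache and buffers
vram_left() {
    local left=$(( CARD_CGB - $1 ))
    if [ "${2:-}" = "vision" ]; then
        left=$(( left - MMPROJ_CGB ))
    fi
    local sign=""
    if [ "$left" -lt 0 ]; then
        sign="-"
        left=$(( -left ))
    fi
    printf '%s%d.%02d\n' "$sign" $(( left / 100 )) $(( left % 100 ))
}

vram_left 2505           # Q8_0             -> 6.95
vram_left 2505 vision    # Q8_0 plus mmproj -> 6.07
vram_left 4715           # bf16             -> -15.15 (does not fit)
```

A negative result means the weights alone exceed the card, which is the bf16 case in the table.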

Licensing. Mistral Small 3.2 24B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization.

Option A — llama.cpp with CUDA

The RTX 5090 is Blackwell (GB202, sm_120). Build a recent llama.cpp with a CUDA 12.8+ toolkit (the first CUDA release with sm_120 support) and compile for sm_120, per the official build guide:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 5090 is Blackwell = compute capability 12.0 (sm_120); needs CUDA 12.8+
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j 8

A recent release is all you need — Mistral3/Pixtral text has been mainline in llama.cpp since mid-2025 (bartowski built these GGUFs with release b5697). Blackwell (sm_120) requires the CUDA 12.8+ toolkit; an older toolkit won't emit sm_120 code and the GPU will fall back or fail. If you prefer a prebuilt binary, grab a current CUDA build from the releases page. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024).

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. Use a recent Ollama release (one new enough to ship Blackwell/sm_120 CUDA kernels) and pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0

Swap the :Q8_0 tag for :Q6_K, :Q5_K_M, or :Q4_K_M if you want even more context headroom. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant (llama-server docs):

# Q8_0 (recommended, near-lossless), offload all layers to the 5090
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 32768 \
    --jinja

-ngl 99 (--n-gpu-layers) offloads every layer to the GPU — the dense 24B quant file (25.05GB at Q8_0) must sit in VRAM.
-c 32768 sets a 32K context. At Q8_0 about 7GB remains after the weights, versus roughly 4.6GB on the 24GB tier at Q6_K (24 - 19.35). Keep the KV cache at the default f16 for this size, or quantize it (below) to go much higher.
--jinja applies the GGUF's built-in chat template so the assistant format parses correctly. If tool-calling misbehaves, add --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled with the repo).
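
Loading a 25GB quant takes a while, so scripts that talk to the server usually need to wait until it is ready. A minimal sketch, assuming llama-server's /health endpoint (it answers with a non-200 status while the model is still loading); the function name `wait_for_server` and its defaults are illustrative:

```shell
# wait_for_server BASE_URL [TRIES] [DELAY_SECONDS]
# Polls BASE_URL/health until it returns a successful HTTP status.
wait_for_server() {
    local base="$1" tries="${2:-60}" delay="${3:-2}" i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors such as 503 during model load
        if curl -sf --max-time 2 "$base/health" >/dev/null 2>&1; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    return 1
}

# Usage:
# wait_for_server http://localhost:8000 && echo "server is ready"
```

This applies to the llama.cpp path only; Ollama does not expose the same endpoint.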

Push toward the 128K context window. Mistral Small 3.2 advertises a 128K context (max_position_embeddings 131072). To hold a very long window next to Q8_0 weights on 32GB, quantize the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:

# Longer context by 8-bit-quantizing the KV cache
llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 98304 --jinja \
    -fa on -ctk q8_0 -ctv q8_0
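
To judge whether a given -c value is plausible before launching, you can estimate the KV-cache size yourself. The sketch below uses the 40 layers and 8 KV heads stated earlier; the head dimension of 128 is an assumption not given in this recipe (check head_dim in the model's config.json), and the result excludes compute buffers and other runtime overhead. The q8_0 figure assumes the usual layout of a 34-byte block per 32 values.

```shell
LAYERS=40       # from the architecture notes in this recipe
KV_HEADS=8      # from the architecture notes in this recipe
HEAD_DIM=128    # ASSUMPTION: verify against head_dim in config.json

# kv_cache_bytes CTX_TOKENS [f16|q8_0]: estimated KV-cache size in bytes
kv_cache_bytes() {
    local ctx="$1" ctype="${2:-f16}"
    # K and V each hold LAYERS * KV_HEADS * HEAD_DIM values per token
    local values=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * ctx ))
    case "$ctype" in
        f16)  echo $(( values * 2 )) ;;         # 2 bytes per value
        q8_0) echo $(( values * 34 / 32 )) ;;   # 34 bytes per 32 values
        *)    echo "unknown cache type: $ctype" >&2; return 1 ;;
    esac
}

kv_cache_bytes 32768 f16     # -> 5368709120 (about 5.4GB)
kv_cache_bytes 65536 q8_0    # -> 5704253440 (about 5.7GB)
```

Under these assumptions, q8_0 costs about 53% of f16 per token, which is where "roughly halves" comes from. Compare the result with the free VRAM left after the weights, and leave a margin for buffers.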

If you'd rather spend the 32GB on context than weight fidelity, drop to :Q6_K (19.35GB, ~12GB free), :Q5_K_M (16.76GB, ~15GB free), or :Q4_K_M (14.33GB, ~17GB free) — but on this card Q8_0 is the natural pick, since it's near-lossless and still leaves a comfortable KV budget.

Optional — image input. The Pixtral vision tower lets the model read images. Download the mmproj-* file from the same GGUF repo and pass it alongside the quant; it adds ~0.88GB of VRAM on top of the weights:

llama-server -hf bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 \
    --mmproj mmproj-mistralai_Mistral-Small-3.2-24B-Instruct-2506-f16.gguf \
    --port 8000 -ngl 99 -c 32768 --jinja
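
Once the server is running with --mmproj, the client has to attach the image to the request. A hedged sketch of one way to do that, assuming the server accepts OpenAI-style image_url content parts carrying a base64 data URI; the helper names and the file name photo.png are hypothetical:

```shell
# image_data_uri FILE [MIME]: prints a base64 data URI for an image file
image_data_uri() {
    local file="$1" mime="${2:-image/png}"
    # tr removes the line wrapping some base64 implementations add
    printf 'data:%s;base64,%s' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# vision_payload FILE PROMPT: one user message with a text part and an image part.
# In this sketch PROMPT must not contain double quotes or backslashes.
vision_payload() {
    printf '{"model":"mistral-small-3.2-24b","messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"%s"}}]}]}' \
        "$2" "$(image_data_uri "$1")"
}

# Usage (photo.png is a placeholder):
# vision_payload photo.png "Describe this image." |
#     curl -s http://localhost:8000/v1/chat/completions \
#          -H "Content-Type: application/json" -d @-
```

Piping the body to curl with -d @- avoids command-line length limits, which a base64-encoded image can easily exceed.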

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
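
Ollama applies its own default context length, which is typically much smaller than the 32K used in the llama.cpp example. One way to pin a larger window is a Modelfile with a num_ctx parameter. The sketch below is an assumption-laden illustration: the helper `write_modelfile` and the local model name mistral-small-32k are hypothetical, and it assumes your Ollama release accepts an hf.co reference in FROM.

```shell
# write_modelfile DIR [CTX]: writes DIR/Modelfile with a fixed context length
write_modelfile() {
    local dir="$1" ctx="${2:-32768}"
    cat > "$dir/Modelfile" <<EOF
FROM hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0
PARAMETER num_ctx $ctx
EOF
}

# Usage ("mistral-small-32k" is a made-up local name):
# write_modelfile . 32768
# ollama create mistral-small-32k -f ./Modelfile
# ollama run mistral-small-32k
```

The same VRAM limits apply as on the llama.cpp path, so a larger num_ctx still has to fit in what the weights leave free.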

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

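# Point Open WebUI at your local llama-server (or Ollama on :11434).
# --add-host lets the container resolve host.docker.internal on Linux.
# On Linux, also start llama-server with --host 0.0.0.0 so the container can reach it.
docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main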

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.)

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral-small-3.2-24b",
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
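
For repeated use from scripts, the same call can be wrapped in a few small functions. This is a sketch, not a full client: the function names are illustrative, the escaping handles only backslashes and double quotes (not newlines), and it assumes the two endpoints used in this recipe.

```shell
# base_url RUNTIME: OpenAI-compatible base URL for each runtime in this recipe
base_url() {
    case "$1" in
        llama)  echo "http://localhost:8000/v1" ;;
        ollama) echo "http://localhost:11434/v1" ;;
        *)      echo "unknown runtime: $1" >&2; return 1 ;;
    esac
}

# json_escape STRING: escapes backslashes and double quotes only
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload MODEL PROMPT: minimal chat-completions request body
chat_payload() {
    printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# ask RUNTIME MODEL PROMPT: sends one prompt and prints the raw JSON reply
ask() {
    curl -s "$(base_url "$1")/chat/completions" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer EMPTY" \
        -d "$(chat_payload "$2" "$3")"
}

# Usage:
# ask llama mistral-small-3.2-24b "Summarize this in three bullet points: ..."
# ask ollama hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q8_0 "Hello"
```

On the Ollama path the model field should name a model you have actually pulled, which is why the second usage line passes the full hf.co reference.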

Results

VRAM usage: The dense 24B loads entirely as its GGUF file — Q8_0 is 25.05GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 5090's 32GB that leaves ~7GB for the KV cache — a comfortable context at f16, or a much larger window with an 8-bit-quantized cache (see Running). This is the payoff of 32GB: Q8_0 (near-lossless) fits here but not on a 24GB card, a real step up from that tier's Q6_K. Q6_K (19.35GB, ~12GB free), Q5_K_M (16.76GB, ~15GB free) and Q4_K_M (14.33GB, ~17GB free) trade weight fidelity for even more context. Full-precision bf16 (47.15GB) does not fit 32GB. Enabling image input adds ~0.88GB for the mmproj projector.
Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU Pro 5-shot CoT 69.06%, MATH 69.42%, GPQA Diamond 46.13%, HumanEval Plus pass@5 92.90%, MBPP Plus 78.33%, plus a sharp instruction-following jump over 3.1 — Wildbench v2 65.33% and Arena Hard v2 43.1%. On vision: MMMU 62.50% and DocVQA 94.86%. It handles 23 languages. These are the vendor's benchmarks, not measurements on this GPU.
Speed: No community throughput benchmark for Mistral Small 3.2 24B on the RTX 5090 exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-small-3-2-24b/rtx-5090 once contributed.

For the full benchmark data, see /check/mistral-small-3-2-24b/rtx-5090.

Troubleshooting

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Small 3.2 uses Mistral's own Tekken tokenizer (tekken.json), and on the Python serving paths that needs mistral-common >= 1.6.2. If tool-calling in particular misbehaves, additionally pass --chat-template-file Mistral-Small-3.2-24B-Instruct-2506.jinja (the template bundled in the model repo) to override the embedded one.

Out of memory at Q8_0, or when raising the context

Q8_0 weights (25.05GB) leave ~7GB on a 32GB 5090 for the KV cache, so a very long f16 context can still exhaust VRAM. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM); lower -c; or drop to Q6_K (19.35GB, ~12GB free) or Q5_K_M (16.76GB, ~15GB free) for a lot more context headroom at a small fidelity cost. If you enabled --mmproj for images, remember it's another ~0.88GB.
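
Before changing flags, it helps to see how much VRAM is actually in use. A small sketch using nvidia-smi's CSV query output; the helper name `free_mib` is illustrative, and the live reading only runs if nvidia-smi is installed:

```shell
# free_mib "USED, TOTAL": MiB still free, given one line of nvidia-smi CSV output
free_mib() {
    local used total
    used="${1%%,*}"     # text before the first comma
    total="${1##*,}"    # text after the last comma
    echo $(( $total - $used ))
}

# Live reading for the first GPU, skipped when nvidia-smi is unavailable
if command -v nvidia-smi >/dev/null 2>&1; then
    line="$(nvidia-smi --query-gpu=memory.used,memory.total \
                       --format=csv,noheader,nounits 2>/dev/null | head -n 1)"
    if [ -n "$line" ]; then
        echo "free VRAM: $(free_mib "$line") MiB"
    fi
fi
```

Run it once with the server stopped and once with it loaded: the difference shows what the weights, cache, and buffers really cost on your machine.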

Blackwell / sm_120 build errors

The RTX 5090 is Blackwell (sm_120), which needs the CUDA 12.8+ toolkit — an older toolkit can't emit sm_120 code, so the build either errors or the GPU falls back. Confirm nvcc --version reports 12.8 or newer, build with -DCMAKE_CUDA_ARCHITECTURES=120, and if you use a prebuilt binary make sure it's a recent CUDA build that includes Blackwell kernels. On the Ollama path, use a recent Ollama release for the same reason.
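
The toolkit check can be scripted so it runs before every build. A minimal sketch, assuming nvcc prints its usual "release X.Y" line and that your sort supports -V (version sort); the function names are illustrative:

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# nvcc_release TEXT: extracts "12.8" from a line like "... release 12.8, V12.8.61"
nvcc_release() {
    printf '%s\n' "$1" |
        sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Check the installed toolkit, skipped when nvcc is not on PATH
if command -v nvcc >/dev/null 2>&1; then
    rel="$(nvcc_release "$(nvcc --version)")"
    if version_ge "$rel" 12.8; then
        echo "CUDA $rel: can target sm_120"
    else
        echo "CUDA $rel: too old for sm_120, need 12.8+"
    fi
fi
```

Version sort matters here: a plain numeric comparison would rank 12.10 below 12.8, while sort -V orders them correctly.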

Image input doesn't work

Vision needs the mmproj projector loaded alongside the quant via --mmproj (see Running) — the quant alone is text-only. The mmproj-* file lives in the same GGUF repo as the weights; make sure you're on a recent llama.cpp/Ollama build with multimodal support, and that your client actually sends the image in the request. The projector is ~0.88GB of extra VRAM.

`torch` / CUDA errors — this is llama.cpp, not a Python ML stack

Serving Mistral Small 3.2 via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON, CUDA 12.8+ for sm_120) rather than a CPU-only binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but that needs far more than 32GB (bf16 is ~47GB) — on a single 5090 the GGUF + llama.cpp path is the right one.

Model or GPU 404 on /check

Mistral Small 3.2 24B is a new addition; if the /check/mistral-small-3-2-24b/rtx-5090 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.