What You'll Build
A fully local, private general assistant: Mistral Nemo 12B — Mistral AI and NVIDIA's Apache-2.0 generalist (Instruct, release 2407) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 24GB RTX 3090, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model: general Q&A, drafting and editing, multi-step reasoning, function calling, and strong multilingual support. Positioned as a drop-in upgrade to Mistral 7B, it's a capable 12B that runs on modest hardware — on a 24GB RTX 3090 it's very comfortable, running a near-lossless Q8_0 with ~11GB left for a very large KV cache. Everything runs on your own hardware, so prompts and documents never leave the machine.
Hardware data: RTX 3090 (24GB VRAM) · Mistral Nemo 12B, GGUF Q8_0 (13.02GB, recommended — near-lossless) — or Q6_K (10.06GB) / Q4_K_M (7.48GB) for even more KV-cache / context headroom · See benchmark data
ℹ️ This is a dense, text-only 12B generalist: no MoE, no vision. Mistral Nemo is a MistralForCausalLM (model_type: mistral) with 40 layers, hidden size 5120, GQA with 32 query / 8 KV heads, and head_dim 128. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model, with no vision tower and no image input. The context window is 128K (max_position_embeddings 131072). It was the first model to use Mistral's Tekken tokenizer (tekken.json), which needs mistral-common on the Python serving paths; the GGUF / llama.cpp path uses the embedded tokenizer, so no extra install is required there. Nemo was trained with quantization awareness for FP8 inference and tuned for function calling and multilingual use.
ℹ️ Runs on current llama.cpp out of the box. Mistral Nemo shipped in July 2024 and has long been supported; there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the embedded chat template applies.
⚠️ Use a low sampling temperature (~0.3). Mistral recommends a low temperature (~0.3) for Nemo; the usual default of 0.7 noticeably degrades output quality on this model. Set it explicitly — this is a real, easy-to-miss gotcha.
ℹ️ Ampere has no FP8 tensor cores — and you don't need them here. The RTX 3090 is Ampere (GA102, sm_86), which predates the FP8 hardware path introduced on Ada/Hopper. Although Nemo was trained with FP8-inference awareness, this GGUF route is integer quantization (Q4/Q6/Q8) — it does not use FP8 and needs no FP8 tensor cores, so the RTX 3090 runs it exactly like any other CUDA card. Don't expect (or look for) an FP8 fast path on this GPU.
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 8GB VRAM (Q4_K_M floor — the matrix reaches down this far) | RTX 3090 (24GB, Ampere GA102, sm_86) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~7.5GB (Q4_K_M) up to ~13GB (Q8_0) | ~13GB for Q8_0 |
| Software | Recent llama.cpp (CUDA) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Nemo-Instruct-2407); the model is quantized to GGUF by the community. Primary source is bartowski/Mistral-Nemo-Instruct-2407-GGUF; unsloth/Mistral-Nemo-Instruct-2407-GGUF is a good alternative that also ships smaller Q2_K / Q3_K_M quants. Byte-verified on-disk sizes (bartowski):
| Quant | On-disk size | Fit on RTX 3090 (24GB) |
|---|---|---|
| Q4_K_M | 7.48GB | Tiny footprint — huge KV-cache / context headroom; also the quant that fits an 8GB card |
| Q6_K | 10.06GB | Comfortable — near-lossless-feeling with lots of room for a large KV cache |
| Q8_0 | 13.02GB | Recommended — near-lossless weights with ~11GB left for a very large KV cache / long context; the practical best-quality choice on 24GB |
| f16 | 24.50GB | Full precision. The weights alone barely fit a 24GB card, leaving almost no room for the KV cache. Not recommended: at 12B, Q8_0 is already near-lossless |
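The fit column above can be turned into a rough rule of thumb. The sketch below is only an illustration: `pick_quant` is a hypothetical helper, the sizes are the on-disk figures from the table, and the "reserve" argument (space set aside for the KV cache and runtime buffers) is an assumption you should tune for your own context length. It ignores the difference between decimal GB and GiB, so treat borderline results with caution.

```shell
#!/bin/sh
# pick_quant VRAM_GB RESERVE_GB
# Prints the largest quant from the table whose file size plus the reserve
# still fits in VRAM_GB, or "none" (exit status 1) if nothing fits.
pick_quant() {
  awk -v vram="$1" -v reserve="$2" 'BEGIN {
    # Quants ordered from highest to lowest quality, with on-disk size in GB.
    n = split("Q8_0:13.02 Q6_K:10.06 Q4_K_M:7.48", quants, " ")
    for (i = 1; i <= n; i++) {
      split(quants[i], item, ":")
      if (item[2] + reserve <= vram) { print item[1]; exit 0 }
    }
    print "none"
    exit 1
  }'
}

# Example: a 24GB card with 8GB held back for KV cache and buffers.
pick_quant 24 8
```

On the figures in the table, a 24GB card with an 8GB reserve selects Q8_0, and an 8GB card with a 0.5GB reserve selects Q4_K_M.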
Not model weights — don't count this in the VRAM math:
- The .imatrix (~7 MB) is calibration data used to produce the quants. Never load it as a model.
Licensing. Mistral Nemo 12B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization.
Option A — llama.cpp with CUDA
The RTX 3090 is Ampere (GA102, sm_86). Build a recent llama.cpp and compile for sm_86, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 3090 is Ampere = compute capability 8.6 (sm_86)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j 8
A recent release is all you need: Mistral Nemo has been mainline in llama.cpp since its July 2024 launch. If you prefer a prebuilt binary, grab a current one from the releases page. If you build from source, install the NVIDIA CUDA toolkit first. The CUDA backend flag on current llama.cpp is -DGGML_CUDA=ON; the older LLAMA_CUDA name is deprecated. (Ampere has no FP8 tensor cores, but that is irrelevant to this integer-GGUF path, so no special build flag is needed for it.)
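If you are unsure which architecture value to pass, you can ask the driver rather than hard-coding it. This is a minimal sketch, assuming your installed nvidia-smi supports the `compute_cap` query field (older drivers may not); `cap_to_arch` is a hypothetical helper name. When nvidia-smi is missing or the query fails, it falls back to the RTX 3090 value used in this guide.

```shell
#!/bin/sh
# Convert a compute capability such as "8.6" into the form CMake expects ("86").
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '. '
}

# Ask the driver for the first GPU's compute capability when possible.
cap=""
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
fi

# Fall back to the RTX 3090 (Ampere, sm_86) if the query returned nothing.
cap=${cap:-8.6}

# Print the configure command rather than running it, so you can review it.
echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
```

On an RTX 3090 this should print the same configure line as the build steps above.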
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Either use the curated tag (ollama run mistral-nemo) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):
ollama run hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0
Swap the :Q8_0 tag for :Q6_K or :Q4_K_M if you want an even smaller footprint. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to pick the quant (llama-server docs):
# Q8_0 (recommended, near-lossless), offload all layers to the 3090, low temperature per Mistral's guidance
llama-server -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0 \
--port 8000 \
-ngl 99 \
-c 16384 \
--temp 0.3 \
--jinja
- -ngl 99 (--n-gpu-layers) offloads every layer to the GPU; the dense 12B quant file (13.02GB at Q8_0) sits entirely in VRAM with room to spare.
- -c 16384 sets a 16K context. At Q8_0 you have ~11GB free after the weights, so you can raise this a lot; quantize the KV cache (below) to push toward the full 128K.
- --temp 0.3 sets the low sampling temperature Mistral recommends for Nemo; leaving it at the usual 0.7 noticeably degrades output. Set it explicitly (many clients default higher).
- --jinja applies the GGUF's built-in chat template so the assistant format parses correctly.
Push toward the 128K context window. Mistral Nemo advertises a 128K context (max_position_embeddings 131072). At Q8_0 on 24GB you have plenty of room, and you can go further by quantizing the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache VRAM versus f16 with minimal quality impact:
# Longer context by 8-bit-quantizing the KV cache
llama-server -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0 \
--port 8000 -ngl 99 -c 131072 --temp 0.3 --jinja \
-fa on -ctk q8_0 -ctv q8_0
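To see how much VRAM a given context actually costs, you can estimate the KV cache from the architecture figures quoted earlier in this guide (40 layers, 8 KV heads, head_dim 128). The sketch below is a back-of-the-envelope estimate, not a measurement: `kv_cache_bytes` and `to_gib` are hypothetical helper names, the q8_0 cost assumes the usual 34 bytes per 32-value block, and compute buffers and other runtime overhead are not included.

```shell
#!/bin/sh
# Architecture figures quoted earlier in this guide for Mistral Nemo 12B.
N_LAYERS=40
N_KV_HEADS=8
HEAD_DIM=128

# kv_cache_bytes CTX NUM DEN
#   CTX     = context length in tokens
#   NUM/DEN = bytes per cached element (f16: 2/1; q8_0: 34/32, assuming each
#             32-value block stores 32 int8 values plus one 2-byte scale)
kv_cache_bytes() {
  # The leading 2 accounts for one K and one V tensor per layer.
  echo $(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * $1 * $2 / $3 ))
}

# Print a byte count in GiB with one decimal place.
to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f GiB\n", b / (1024 * 1024 * 1024) }'
}

for ctx in 16384 131072; do
  echo "ctx=$ctx f16:  $(to_gib "$(kv_cache_bytes "$ctx" 2 1)")"
  echo "ctx=$ctx q8_0: $(to_gib "$(kv_cache_bytes "$ctx" 34 32)")"
done
```

Under these assumptions the cache costs 160 KiB per token at f16, which is 2.5 GiB at 16K and 20 GiB at the full 131072 tokens; the q8_0 cache comes to roughly 10.6 GiB at 131072. Next to 13.02GB of Q8_0 weights that leaves little slack on a 24GB card, so if the 128K command fails to load, lower -c or step down to Q6_K or Q4_K_M.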
Because Nemo is only 12B, you have generous headroom on a 24GB card — this same model also fits far smaller GPUs (Q4_K_M at 7.48GB runs on an 8GB card), so the matrix reaches well below this tier.
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0
Remember to set a low temperature (~0.3) in your client or Modelfile — Ollama's default sampling can be higher, and Nemo degrades at 0.7. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
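One way to make the low temperature stick is to bake it into a Modelfile, so every client that uses the model inherits it. The following is a sketch under a few assumptions: `nemo-lowtemp` is a hypothetical local model name, the 16K `num_ctx` simply mirrors the llama-server example, and the `FROM` line assumes Ollama can resolve the same Hugging Face reference used in the run command above.

```shell
#!/bin/sh
# Write a Modelfile that pins the sampling temperature and context length.
# "nemo-lowtemp" below is a hypothetical name chosen for this example.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
EOF

# Register the variant only if Ollama is installed on this machine.
if command -v ollama >/dev/null 2>&1; then
  ollama create nemo-lowtemp -f Modelfile || echo "ollama create failed" >&2
fi
```

After that, `ollama run nemo-lowtemp` starts the model with the temperature already set, and API clients can request it by that name.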
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
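# Point Open WebUI at your local llama-server (or Ollama on :11434).
# --add-host lets host.docker.internal resolve on Linux; Docker Desktop provides it already.
# On Linux the server must also listen on an address the container can reach
# (for llama-server, add --host 0.0.0.0).
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=EMPTY \
ghcr.io/open-webui/open-webui:main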
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.) Set the temperature to ~0.3 in the model's parameters.
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-nemo-12b",
"temperature": 0.3,
"messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
}'
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
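When you call the endpoint from a script, it helps to wait until the model has finished loading before sending the first request. The sketch below polls llama-server's /health endpoint; `wait_for_server` is a hypothetical helper name, and the one-second polling interval is an arbitrary choice. It targets llama-server specifically, so it is not a drop-in check for Ollama.

```shell
#!/bin/sh
# wait_for_server BASE_URL TRIES
# Polls GET BASE_URL/health about once per second. Returns 0 as soon as the
# server answers with a success status, or 1 after TRIES failed attempts.
wait_for_server() {
  base_url=$1
  tries=$2
  while [ "$tries" -gt 0 ]; do
    # -f turns HTTP errors (such as 503 while the model loads) into failures.
    if curl -fsS --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep 1
    fi
  done
  return 1
}
```

A script would typically call `wait_for_server http://localhost:8000 60` and only run the chat-completions request if it succeeds.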
Results
- VRAM usage: The dense 12B loads entirely as its GGUF file — Q8_0 is 13.02GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 3090's 24GB that leaves ~11GB for the KV cache — plenty for a long context, and even more with an 8-bit-quantized cache (see Running). Q6_K (10.06GB) and Q4_K_M (7.48GB) shrink the footprint further for even larger context or smaller cards. The full-precision f16 GGUF (24.50GB) fits a 24GB card only tightly, with almost no room for the KV cache — not recommended, and at 12B Q8_0 is already near-lossless.
- Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU 68.0% and HellaSwag (0-shot) 83.5%, with strong multilingual results — MMLU French 62.3%, German 62.7%, Spanish 64.6%. These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Mistral Nemo 12B on the RTX 3090 exists yet. We would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-nemo-12b/rtx-3090 once contributed.
For the full benchmark data, see /check/mistral-nemo-12b/rtx-3090.
Troubleshooting
Output quality is poor / rambling / incoherent — check the temperature
Mistral recommends a low sampling temperature of ~0.3 for Nemo. The common default of 0.7 noticeably degrades this model's output — if responses feel off, this is the first thing to fix. Set --temp 0.3 on llama-server, or the equivalent temperature parameter in your client / Ollama Modelfile.
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Nemo uses Mistral's Tekken tokenizer (tekken.json) — it was the first Tekken model. On the Python serving paths that needs mistral-common, but the GGUF / llama.cpp path uses the embedded tokenizer, so no extra install is required there.
Out of memory at load time or when raising the context
Q8_0 weights (13.02GB) leave ~11GB on a 24GB 3090 for the KV cache, so OOM is unlikely at sane context sizes — but a full 128K f16 cache can still be large. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache VRAM); lower -c; or drop to Q6_K (10.06GB) or Q4_K_M (7.48GB) for even more headroom. Avoid the f16 GGUF (24.50GB) on 24GB — it barely fits the weights with almost no room for a KV cache, and Q8_0 is already near-lossless.
Is there an FP8 fast path on the RTX 3090? No
Nemo was trained with FP8-inference awareness, but Ampere (GA102, sm_86) has no FP8 tensor cores — that hardware path arrived with Ada/Hopper. It doesn't matter here: the GGUF route is integer quantization (Q4/Q6/Q8), not FP8, so the RTX 3090 runs Nemo like any other CUDA card. Don't look for an FP8 flag or a special build — there isn't one for this path.
torch / CUDA errors — this is llama.cpp, not a Python ML stack
Serving Mistral Nemo via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. For large-VRAM or multi-GPU production serving you could instead run the full-precision weights under a server like vLLM, but on a single 3090 the GGUF + llama.cpp path is the right one — and at 12B, Q8_0 is already near-lossless.
Model or GPU 404 on /check
Mistral Nemo 12B is a new addition to the catalogue; if the /check/mistral-nemo-12b/rtx-3090 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.