What You'll Build
A fully local, private general assistant: Mistral Nemo 12B — Mistral AI and NVIDIA's Apache-2.0 generalist (Instruct, release 2407) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on a single 8GB RTX 4060, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model: general Q&A, drafting and editing, multi-step reasoning, function calling, and strong multilingual support. Positioned as a drop-in upgrade to Mistral 7B, it's a capable 12B that runs on modest hardware — the 8GB RTX 4060 is this model's floor, where a Q4_K_M quant fits with context deliberately kept short. Everything runs on your own hardware, so prompts and documents never leave the machine.
Hardware data: RTX 4060 (8GB VRAM) · Mistral Nemo 12B, GGUF Q4_K_M (7.48GB, recommended — the 8GB fit) — or the unsloth Q3_K_M (6.08GB) / Q2_K (4.79GB) for more KV-cache / context room · See benchmark data
ℹ️ This is a dense, text-only 12B generalist: no MoE, no vision. Mistral Nemo is a MistralForCausalLM (model_type: mistral) with 40 layers, hidden size 5120, GQA with 32 query / 8 KV heads, and head_dim 128. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model: there is no vision tower and no image input. The context window is 128K (max_position_embeddings 131072), though on an 8GB card you will run far short of that (see below). It was the first model to use Mistral's Tekken tokenizer (tekken.json), which needs mistral-common on the Python serving paths; the GGUF / llama.cpp path uses the embedded tokenizer, so no extra install is required there. Nemo was trained with quantization awareness for FP8 inference and tuned for function calling and multilingual use.
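To put the 128K figure in perspective, the architecture numbers above are enough for a back-of-the-envelope KV-cache estimate. The sketch below assumes the usual GQA accounting (one K and one V vector per layer, per KV head, per token) and an unquantized f16 cache at 2 bytes per value. The helper name is illustrative, and runtime overhead such as compute buffers is ignored.

```shell
# Rough KV-cache size for a GQA transformer with an f16 cache.
# Assumption: bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * tokens.
# kv_cache_bytes_f16 is a hypothetical helper name, not a llama.cpp tool.
kv_cache_bytes_f16() {
  local layers=$1 kv_heads=$2 head_dim=$3 tokens=$4
  echo $(( 2 * layers * kv_heads * head_dim * 2 * tokens ))
}

# Mistral Nemo: 40 layers, 8 KV heads, head_dim 128.
per_token=$(kv_cache_bytes_f16 40 8 128 1)
full_ctx=$(kv_cache_bytes_f16 40 8 128 131072)

echo "per token:        $(( per_token / 1024 )) KiB"
echo "full 128K window: $(( full_ctx / 1024 / 1024 / 1024 )) GiB"
```

Under these assumptions the cache costs about 160 KiB per token, so the full 128K window would need roughly 20 GiB for the cache alone, far beyond an 8GB card.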
ℹ️ Runs on current llama.cpp out of the box. Mistral Nemo shipped in July 2024 and has long been supported; there is no special patch or PR gate. Just use a recent llama.cpp (or Ollama) build. Pass --jinja so the embedded chat template applies.
⚠️ Use a low sampling temperature (~0.3). Mistral recommends it for Nemo; the usual default of 0.7 noticeably degrades output quality on this model. Set it explicitly, because this gotcha is easy to miss.
Requirements
| Component | Minimum | Tested target |
|---|---|---|
| GPU | 8GB VRAM (Q4_K_M floor — this is the matrix's entry tier) | RTX 4060 (8GB, Ada Lovelace AD107, sm_89) |
| RAM | 16GB system RAM | 32GB comfortable |
| Storage | ~5GB (Q2_K) up to ~7.5GB (Q4_K_M) | ~7.5GB for Q4_K_M |
| Software | Recent llama.cpp (CUDA) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI |
Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Nemo-Instruct-2407); the model is quantized to GGUF by the community. Primary source is bartowski/Mistral-Nemo-Instruct-2407-GGUF; unsloth/Mistral-Nemo-Instruct-2407-GGUF is a good alternative that also ships the smaller Q2_K / Q3_K_M quants you want on an 8GB card. Byte-verified on-disk sizes:
| Quant | On-disk size | Fit on RTX 4060 (8GB) |
|---|---|---|
| Q2_K (unsloth) | 4.79GB | Fits with the most KV-cache room on 8GB — smallest/lowest-quality; use it when you need more context |
| Q3_K_M (unsloth) | 6.08GB | Fits with a comfortable margin for the KV cache — a good middle option on 8GB |
| Q4_K_M | 7.48GB | Recommended — best quality that fits 8GB, but tight: only ~0.5GB is left for the KV cache, so context must be kept short (see Running) |
| Q5_K_M | 8.73GB | Does NOT fit 8GB — the weights alone exceed the card; use Q4_K_M or smaller here |
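The fit column follows from simple subtraction. As a minimal sketch, the snippet below takes the on-disk sizes from the table and subtracts them from a nominal 8.00GB. Treating the card as exactly 8.00GB and ignoring driver and compute-buffer overhead are simplifying assumptions, and the helper name is illustrative.

```shell
# Headroom left on an 8GB card after loading each quant: 8.00 - file size (GB).
# Sizes are the on-disk figures from the table; "8.00" is a nominal value that
# ignores driver and compute-buffer overhead (an assumption of this sketch).
headroom_gb() {
  awk -v vram="$1" -v size="$2" 'BEGIN { printf "%.2f\n", vram - size }'
}

for entry in Q2_K:4.79 Q3_K_M:6.08 Q4_K_M:7.48 Q5_K_M:8.73; do
  quant=${entry%%:*}   # text before the colon
  size=${entry##*:}    # text after the colon
  echo "$quant: $(headroom_gb 8.00 "$size") GB left for KV cache"
done
```

A negative result, as for Q5_K_M, means the weights alone exceed the card.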
Not model weights — don't count this in the VRAM math:
- The .imatrix (~7 MB) is calibration data used to produce the quants; never load it as a model.
Licensing. Mistral Nemo 12B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).
Installation
You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization (which matters a lot on an 8GB card).
Option A — llama.cpp with CUDA
The RTX 4060 is Ada Lovelace (AD107, sm_89). Build a recent llama.cpp and compile for sm_89, per the official build guide:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 4060 is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8
A recent release is all you need — Mistral Nemo has been mainline in llama.cpp since its July 2024 launch. If you prefer a prebuilt binary, grab a current one from the releases page. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024); install the NVIDIA CUDA toolkit first.
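If you are unsure which value to pass to CMAKE_CUDA_ARCHITECTURES, one option is to derive it from the compute capability the driver reports. This is a sketch only: the helper name is hypothetical, and the usage lines assume an nvidia-smi recent enough to support the compute_cap query field.

```shell
# Turn a compute capability such as "8.9" into the value CMake expects ("89").
# cuda_arch_from_cap is a hypothetical helper name used only for this sketch.
cuda_arch_from_cap() {
  printf '%s\n' "$1" | tr -d '. '
}

# Usage sketch (assumes nvidia-smi supports the compute_cap query field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cuda_arch_from_cap "$cap")"
cuda_arch_from_cap "8.9"   # RTX 4060 (Ada Lovelace)
```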
Option B — Ollama
Ollama is built on llama.cpp and is the fastest way to stand this model up. Either use the curated tag (ollama run mistral-nemo) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):
ollama run hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q4_K_M
On 8GB, keep to :Q4_K_M or smaller, and use the unsloth repo (hf.co/unsloth/Mistral-Nemo-Instruct-2407-GGUF:Q3_K_M) if you want the Q3_K_M / Q2_K quants for more context room. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
Running
With llama.cpp
Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q4_K_M (case-insensitive) to pick the quant (llama-server docs):
# Q4_K_M (the 8GB fit), offload all layers to the 4060, SHORT context + quantized KV cache, low temperature
llama-server -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q4_K_M \
--port 8000 \
-ngl 99 \
-c 4096 \
--temp 0.3 \
--jinja \
-fa on -ctk q8_0 -ctv q8_0
- -ngl 99 (--n-gpu-layers) offloads every layer to the GPU. The dense 12B Q4_K_M file (7.48GB) fills most of the 8GB card, leaving only ~0.5GB for the KV cache.
- -c 4096 sets a short 4K context on purpose. With Q4_K_M leaving so little headroom, context is the constraint on this card. Start at -c 4096; if it runs stable you can try -c 8192, but do not expect to reach 128K on 8GB.
- -fa on -ctk q8_0 -ctv q8_0 turns on Flash Attention and 8-bit-quantizes the KV cache, roughly halving its VRAM versus f16. This is essentially required here to fit any usable context alongside the weights.
- --temp 0.3 sets the low sampling temperature Mistral recommends for Nemo; leaving it at the usual 0.7 noticeably degrades output. Set it explicitly (many clients default higher).
- --jinja applies the GGUF's built-in chat template so the assistant format parses correctly.
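The effect of the cache flags can be estimated from the model's shape. The sketch below is a rough calculation, not a measurement: it assumes one K and one V vector per layer, per KV head, per token, 2 bytes per value for f16, and 34 bytes per block of 32 values for q8_0 (32 int8 values plus one f16 scale). Compute buffers and other runtime overhead are not counted, and the helper name is illustrative.

```shell
# KV-cache size in MiB for Mistral Nemo (40 layers, 8 KV heads, head_dim 128).
# Assumptions: one K and one V vector per layer per KV head per token;
# f16 stores 2 bytes per value; q8_0 stores 34 bytes per block of 32 values.
# kv_cache_mib is a hypothetical helper name, not part of llama.cpp.
kv_cache_mib() {
  local tokens=$1 cache_type=$2
  local values_per_token=$(( 2 * 40 * 8 * 128 ))
  local bytes
  case "$cache_type" in
    f16)  bytes=$(( values_per_token * 2 * tokens )) ;;
    q8_0) bytes=$(( values_per_token / 32 * 34 * tokens )) ;;
    *)    echo "unknown cache type: $cache_type" >&2; return 1 ;;
  esac
  echo $(( bytes / 1024 / 1024 ))
}

for ctx in 4096 8192; do
  echo "-c $ctx: f16 $(kv_cache_mib "$ctx" f16) MiB, q8_0 $(kv_cache_mib "$ctx" q8_0) MiB"
done
```

Under these assumptions a 4K context costs about 640 MiB at f16 and about 340 MiB at q8_0, which is the "roughly halving" the flag description refers to.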
Need more context room? On 8GB the lever is a smaller quant, not a bigger cache. Drop to the unsloth Q3_K_M (6.08GB) or Q2_K (4.79GB) to free 1.4-2.7GB for the KV cache and a larger -c:
# More context headroom on 8GB via a smaller (unsloth) quant
llama-server -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:Q3_K_M \
--port 8000 -ngl 99 -c 8192 --temp 0.3 --jinja \
-fa on -ctk q8_0 -ctv q8_0
The 8GB RTX 4060 is the floor for this 12B: a genuinely useful generalist on an entry-tier consumer card, at the cost of a tightly bounded context window. Cards with more VRAM (12GB and up) lift that constraint and take higher-quality quants.
With Ollama
Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):
ollama run hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q4_K_M
On 8GB, stay at :Q4_K_M or smaller and keep the context modest. Remember to set a low temperature (~0.3) in your client or Modelfile — Ollama's default sampling can be higher, and Nemo degrades at 0.7. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
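One way to make the low temperature and modest context stick is to bake them into a derived model with a Modelfile. The sketch below assumes that FROM accepts the same hf.co reference that ollama run does. The model name nemo-8gb and the helper function are hypothetical, and the num_ctx value simply mirrors the 4K context used on the llama.cpp path.

```shell
# Write an Ollama Modelfile that pins the low temperature and a short context.
# "nemo-8gb" and write_nemo_modelfile are hypothetical names for this sketch.
write_nemo_modelfile() {
  local path=${1:-Modelfile}
  cat > "$path" <<'EOF'
FROM hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q4_K_M
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
EOF
}

# Usage sketch (requires Ollama to be installed):
#   write_nemo_modelfile Modelfile
#   ollama create nemo-8gb -f Modelfile
#   ollama run nemo-8gb
```

Clients that send their own temperature in a request can still override this per call, so it is worth setting the value there as well.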
Use it as a chat assistant
Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.
Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:
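# Point Open WebUI at your local llama-server (or Ollama on :11434)
# --add-host makes host.docker.internal resolve on Linux Docker Engine (Docker Desktop provides it already)
# On Linux, also start llama-server with --host 0.0.0.0 so the container can reach it (it binds 127.0.0.1 by default)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=EMPTY \
ghcr.io/open-webui/open-webui:main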
Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.) Set the temperature to ~0.3 in the model's parameters.
Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-nemo-12b",
"temperature": 0.3,
"messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
}'
Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.
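Before wiring up a client, it can help to confirm that the endpoint answers at all. The following is a minimal sketch, assuming the server exposes GET /v1/models (as OpenAI-compatible servers generally do) and that curl is installed. The helper name is hypothetical, and the EMPTY key is the same placeholder used above.

```shell
# Return success if an OpenAI-compatible server answers at the given base URL.
# Assumptions: the base URL already ends in /v1 and the server exposes GET /models
# beneath it. endpoint_up is a hypothetical helper name.
endpoint_up() {
  local base_url=$1
  curl --silent --fail --max-time 2 --output /dev/null \
    -H "Authorization: Bearer EMPTY" \
    "$base_url/models"
}

# Usage sketch: check both runtimes on the ports used in this guide.
#   endpoint_up http://localhost:8000/v1  && echo "llama-server is up"
#   endpoint_up http://localhost:11434/v1 && echo "Ollama is up"
```

The function only reports reachability through its exit status, so it can gate a script that would otherwise fail with a less obvious connection error.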
Results
- VRAM usage: The dense 12B loads entirely as its GGUF file — Q4_K_M is 7.48GB on disk (byte-verified from the bartowski GGUF tree). On the RTX 4060's 8GB that leaves only ~0.5GB for the KV cache, which is why context must be kept short and the KV cache 8-bit-quantized (see Running). The unsloth Q3_K_M (6.08GB) and Q2_K (4.79GB) free more room for context at lower quality. Q5_K_M (8.73GB) does not fit 8GB — its weights alone exceed the card.
- Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU 68.0% and HellaSwag (0-shot) 83.5%, with strong multilingual results — MMLU French 62.3%, German 62.7%, Spanish 64.6%. These are the vendor's benchmarks, not measurements on this GPU.
- Speed: No community throughput benchmark for Mistral Nemo 12B on the RTX 4060 exists yet. We would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-nemo-12b/rtx-4060 once contributed.
For the full benchmark data, see /check/mistral-nemo-12b/rtx-4060.
Troubleshooting
Output quality is poor / rambling / incoherent — check the temperature
Mistral recommends a low sampling temperature of ~0.3 for Nemo. The common default of 0.7 noticeably degrades this model's output — if responses feel off, this is the first thing to fix. Set --temp 0.3 on llama-server, or the equivalent temperature parameter in your client / Ollama Modelfile.
The chat template looks wrong / responses are malformed
Pass --jinja to llama-server so the GGUF's built-in chat template is applied; without it the assistant format won't parse. The tokenizer is not the cause on this path: Mistral Nemo uses Mistral's Tekken tokenizer (tekken.json), which needs mistral-common on the Python serving paths, but the GGUF / llama.cpp path uses the embedded tokenizer, so no extra install is required there.
Out of memory — this is the main constraint on 8GB
On the 8GB RTX 4060 the Q4_K_M weights (7.48GB) leave almost nothing for the KV cache, so OOM here is usually a context-size problem. Options, in order: keep -fa on -ctk q8_0 -ctv q8_0 on (roughly halves cache VRAM); lower -c (start at 4096); or drop to the unsloth Q3_K_M (6.08GB) or Q2_K (4.79GB) to free 1.4-2.7GB for the cache. Q5_K_M (8.73GB) and larger will not load on 8GB — the weights exceed the card before any KV cache.
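When tuning -c, it helps to watch how much VRAM is actually left once the model is loaded. The sketch below only parses text: it expects a "used, total" line in MiB, which is the shape nvidia-smi prints for the query shown in the usage comment. The helper name is hypothetical and the sample line is made up for illustration.

```shell
# Print free VRAM in MiB from a "used, total" line (values in MiB, no units).
# vram_free_mib is a hypothetical helper; it only parses text and never calls the GPU.
vram_free_mib() {
  awk -F',' '{ printf "%d\n", $2 - $1 }'
}

# Usage sketch (run while llama-server is loaded):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_free_mib
echo "7421, 8188" | vram_free_mib   # made-up sample line, not a measurement
```

If the printed figure is close to zero at your current -c, lower the context or move to a smaller quant before retrying.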
torch / CUDA errors — this is llama.cpp, not a Python ML stack
Serving Mistral Nemo via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack. If you hit a CUDA error, confirm you built (or downloaded) the CUDA-enabled llama.cpp (Option A, -DGGML_CUDA=ON) rather than a CPU-only binary. On an 8GB card the GGUF + llama.cpp path is the only sensible one — the full-precision weights are far too large for this GPU.
Model or GPU 404 on /check
Mistral Nemo 12B is a new addition; if the /check/mistral-nemo-12b/rtx-4060 link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.