Mistral Nemo 12B on Apple M3 Max (48GB): Full-Precision Local Assistant via llama.cpp / Ollama (Metal)

llm · intermediate · 48GB+ VRAM · Jul 3, 2026

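This intermediate recipe sets up Mistral Nemo 12B on an Apple M3 Max with 48 GB of unified memory. The weights themselves take roughly 13 GB (Q8_0) to 24.5 GB (f16); the rest is headroom for the KV cache and the OS.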

prerequisites
  • Apple M3 Max Mac with 48GB unified memory (Metal GPU)
  • macOS with the Metal backend (default on Apple Silicon — no CUDA)
  • ~13-25GB free disk for the GGUF (Q8_0 ~13GB up to f16 ~24.5GB)
  • A recent llama.cpp build (Metal) or Ollama — no special patch needed for this July-2024 model
  • Optional: Open WebUI (or any OpenAI-compatible chat client) for a local chat front-end

What You'll Build

A fully local, private general assistant: Mistral Nemo 12B — Mistral AI and NVIDIA's Apache-2.0 generalist (Instruct, release 2407) — served as an OpenAI-compatible endpoint by llama.cpp or Ollama on an Apple M3 Max with 48GB unified memory, then used from a chat UI (Open WebUI is a good local front-end) or directly via the API. This is a text-only chat/reasoning/writing model: general Q&A, drafting and editing, multi-step reasoning, function calling, and strong multilingual support. Positioned as a drop-in upgrade to Mistral 7B, it's a capable 12B — and with 48GB of unified memory the M3 Max has so much headroom for a 12B that the standout move here is to run the full-precision f16 GGUF for maximum fidelity, with room to spare for a long context. Everything runs on your own Mac, so prompts and documents never leave the machine.

Hardware data: Apple M3 Max (48GB unified memory, Metal) · Mistral Nemo 12B, GGUF f16 (24.50GB, recommended for maximum fidelity) — or Q8_0 (13.02GB) as a near-lossless lighter option that frees memory for very long context · See benchmark data

ℹ️ This is a dense, text-only 12B generalist — no MoE, no vision. Mistral Nemo is a MistralForCausalLM (model_type: mistral) — 40 layers, hidden size 5120, GQA with 32 query / 8 KV heads, head_dim 128. Because it is dense, its footprint is simply the quant file you load plus the KV cache; there is no "active-parameters" shortcut. It is a pure text model — there is no vision tower and no image input. Context window is 128K (max_position_embeddings 131072). It was the first model to use Mistral's Tekken tokenizer (tekken.json), which needs mistral-common on the Python serving paths — but the GGUF / llama.cpp path uses the embedded tokenizer, so no extra install is required there. Nemo was trained with quantization awareness for FP8 inference and tuned for function calling and multilingual use.
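The architecture numbers above are enough for a back-of-the-envelope KV-cache estimate. What follows is a minimal sketch, assuming the usual layout in which every layer stores one K and one V vector of n_kv_heads × head_dim values per token, and an unquantized f16 cache (2 bytes per value). The function names are hypothetical, and real usage will be somewhat higher because llama.cpp also allocates compute buffers.

```shell
# Rough f16 KV-cache estimate for Mistral Nemo from its architecture numbers.
# Sketch only: compute buffers and allocator overhead are not included.
kv_bytes_per_token() (
  n_layers=40 n_kv_heads=8 head_dim=128
  bytes_per_value=2   # assumption: f16 cache
  # 2x for K and V, stored per layer for every token
  echo $((2 * n_layers * n_kv_heads * head_dim * bytes_per_value))
)

kv_cache_mib() (
  ctx_tokens=$1
  echo $((ctx_tokens * $(kv_bytes_per_token) / 1024 / 1024))
)

kv_bytes_per_token     # prints 163840 (160 KiB per token)
kv_cache_mib 16384     # prints 2560   (16K context)
kv_cache_mib 131072    # prints 20480  (full 128K context)
```

Under these assumptions a 16K context costs about 2.5GiB and the full 128K window about 20GiB at f16, which is why the long-context setup later in this recipe quantizes the cache.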

ℹ️ Runs on current llama.cpp out of the box. Mistral Nemo shipped in July 2024 and has long been supported, so there is no special patch or PR gate: just use a recent llama.cpp (or Ollama) build. On Apple Silicon the Metal backend is on by default. Pass --jinja so the embedded chat template is applied.

⚠️ Use a low sampling temperature (~0.3). Mistral recommends a low temperature (~0.3) for Nemo; the usual default of 0.7 noticeably degrades output quality on this model. Set it explicitly — this is a real, easy-to-miss gotcha.

Requirements

Component | Minimum | Tested target
GPU | Apple Silicon with Metal (unified memory) | Apple M3 Max (48GB unified, Metal)
Memory | 48GB unified memory | 48GB unified (shared with the OS)
Storage | ~13GB (Q8_0) up to ~25GB (f16) | ~24.5GB for f16
Software | Recent llama.cpp (Metal) or Ollama; optional Open WebUI chat client | llama-server, Open WebUI

Model weights (community GGUF — there is NO first-party GGUF). Mistral publishes only the full-precision weights (mistralai/Mistral-Nemo-Instruct-2407); the model is quantized to GGUF by the community. Primary source is bartowski/Mistral-Nemo-Instruct-2407-GGUF; unsloth/Mistral-Nemo-Instruct-2407-GGUF is a good alternative that also ships smaller Q2_K / Q3_K_M quants. Byte-verified on-disk sizes (bartowski):

Quant | On-disk size | Fit on M3 Max (48GB unified)
Q4_K_M | 7.48GB | Tiny footprint: vast KV-cache / context headroom, but you have room for far higher quality here
Q6_K | 10.06GB | Comfortable and near-lossless-feeling, with lots of room for a large KV cache
Q8_0 | 13.02GB | Near-lossless: the lighter option that frees most of the 48GB for a very long context
f16 | 24.50GB | Recommended: full precision, fits comfortably under the ~34-36GB GPU-usable ceiling with room to spare for the KV cache; maximum fidelity

Not model weights — don't count this in the memory math:

  • The .imatrix (~7 MB) is calibration data used to produce the quants — never load it as a model.

Unified memory, honestly. On Apple Silicon the GPU shares the same physical RAM as the OS and apps. On a 48GB Mac roughly 34-36GB is realistically usable by the GPU once macOS reserves its share, so plan around that ceiling, not the full 48GB. That's still ample for this 12B: the f16 GGUF (24.50GB) fits with roughly 10GB left for the KV cache, which is why full precision is the recommended pick here. If you ever want to raise the wired-memory limit for an unusually large KV cache, use sudo sysctl iogpu.wired_limit_mb=<MB>, but leave the OS a few GB free.
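To make the budget concrete, here is a small sketch that subtracts each quant's on-disk size from an assumed GPU-usable ceiling. The 34000MB default is an assumption (the low end of the ~34-36GB estimate), the sizes are the on-disk figures quoted in this recipe with 1GB taken as 1000MB, and headroom_mb is a hypothetical helper name.

```shell
# Headroom left for the KV cache under an assumed GPU-usable ceiling.
headroom_mb() (
  weights_mb=$1
  ceiling_mb=${2:-34000}   # assumption: conservative end of the ~34-36GB range
  echo $((ceiling_mb - weights_mb))
)

# name:size pairs use the GGUF sizes from the quant table, in MB
for entry in Q4_K_M:7480 Q6_K:10060 Q8_0:13020 f16:24500; do
  name=${entry%%:*}
  size=${entry##*:}
  printf '%-7s weights %5d MB -> headroom %5d MB\n' \
    "$name" "$size" "$(headroom_mb "$size")"
done
```

Pass a second argument (for example 36000) to see how the picture changes at the optimistic end of the range or after raising the wired limit.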

Licensing. Mistral Nemo 12B is Apache-2.0 — free for commercial and non-commercial use, no revenue caps (model card).

Installation

You have two GGUF runtimes; pick one. Both are fine for this model — there is no patch requirement — so choose Ollama for the fastest start, or llama.cpp for the most control over context and KV-cache quantization. Both use Apple's Metal GPU backend; there is no CUDA on a Mac.

Option A — llama.cpp with Metal

Build a recent llama.cpp with the Metal backend, per the official build guide. On Apple Silicon Metal is enabled by default, so a plain build already targets the M3 Max GPU:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Metal is ON by default on macOS/Apple Silicon; the flag is shown here explicitly
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8

A recent release is all you need — Mistral Nemo has been mainline in llama.cpp since its July 2024 launch. If you prefer a prebuilt binary, grab a current macOS build from the releases page. No CUDA toolkit is involved on a Mac — the Metal backend ships with the build.

Option B — Ollama

Ollama is built on llama.cpp and is the fastest way to stand this model up. On Apple Silicon it uses Metal automatically. Either use the curated tag (ollama run mistral-nemo) or pull the community GGUF straight from Hugging Face (HF × Ollama docs):

ollama run hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0

The full-precision f16 is best served via llama.cpp below; for Ollama, :Q8_0 is the near-lossless lighter option. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :f16 (case-insensitive) to pick the quant (llama-server docs):

# f16 full precision (recommended on 48GB), offload all layers to the M3 Max GPU, low temperature per Mistral's guidance
llama-server -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:f16 \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    --temp 0.3 \
    --jinja
  • -ngl 99 (--n-gpu-layers) offloads every layer to the Metal GPU — the dense 12B f16 file (24.50GB) sits in unified memory well under the ~34-36GB GPU-usable ceiling.
  • -c 16384 sets a 16K context. With ~10GB-plus free after the f16 weights you can raise this substantially; quantize the KV cache (below) to push toward the full 128K.
  • --temp 0.3 sets the low sampling temperature Mistral recommends for Nemo — leaving it at the usual 0.7 noticeably degrades output. Set it explicitly (many clients default higher).
  • --jinja applies the GGUF's built-in chat template so the assistant format parses correctly.

Push toward the 128K context window. Mistral Nemo advertises a 128K context (max_position_embeddings 131072). At f16 on 48GB you still have room, and you can go much further by switching to the lighter Q8_0 (13.02GB) to free memory for the cache, and/or quantizing the KV cache: add -fa on (Flash Attention, required for a quantized cache) and -ctk q8_0 -ctv q8_0, which roughly halves KV-cache memory versus f16 with minimal quality impact:

# Very long context: Q8_0 weights free memory for a large, 8-bit-quantized KV cache
llama-server -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 131072 --temp 0.3 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

Because Nemo is only 12B, the M3 Max's 48GB gives you a rare luxury: run the model at full f16 fidelity for everyday chat, and drop to Q8_0 only when you want to trade a sliver of quality for a very long context.
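The "roughly halves" claim can be checked with a little arithmetic. This sketch assumes the q8_0 cache uses the same block layout as Q8_0 weights (32 values packed into 34 bytes: 32 int8 values plus one f16 scale) and reuses the architecture numbers from the model notes. quantized_kv_mib is a hypothetical helper, and compute buffers are not counted.

```shell
# Compare f16 and q8_0 KV-cache sizes for Mistral Nemo at a given context.
quantized_kv_mib() (
  ctx_tokens=$1
  cache_type=$2                            # "f16" or "q8_0"
  values_per_token=$((2 * 40 * 8 * 128))   # (K+V) * layers * KV heads * head_dim
  case $cache_type in
    f16)  per_token=$((values_per_token * 2)) ;;
    q8_0) per_token=$((values_per_token / 32 * 34)) ;;   # assumed 34-byte blocks
    *)    echo "unknown cache type: $cache_type" >&2; exit 1 ;;
  esac
  echo $((ctx_tokens * per_token / 1024 / 1024))
)

quantized_kv_mib 131072 f16    # prints 20480
quantized_kv_mib 131072 q8_0   # prints 10880
```

Under these assumptions, Q8_0 weights (13.02GB) plus a q8_0 cache of about 10.6GiB at 128K stay well below the ~34-36GB ceiling, which is consistent with the long-context command above.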

With Ollama

Pull and run the community GGUF directly from Hugging Face; append a :quant tag to choose the quant (HF × Ollama docs):

ollama run hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0

Remember to set a low temperature (~0.3) in your client or Modelfile — Ollama's default sampling can be higher, and Nemo degrades at 0.7. Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for chat clients.
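One way to pin the temperature on the Ollama side is a small Modelfile. This is a sketch, assuming Modelfile's FROM accepts the same hf.co reference that ollama run does; nemo-lowtemp is a hypothetical local model name, and the num_ctx value simply mirrors the 16K context used in the llama.cpp example.

```shell
# Write a Modelfile that pins temperature 0.3 for the community Q8_0 GGUF.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q8_0
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
EOF

# Then register and run it (needs a working Ollama install):
#   ollama create nemo-lowtemp -f Modelfile
#   ollama run nemo-lowtemp
```

Clients that send their own temperature in the request will still override this default, so it is worth setting 0.3 in the client as well.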

Use it as a chat assistant

Point any OpenAI-compatible chat client at your local endpoint by setting its base URL and a dummy API key — no cloud, no per-token cost.

Open WebUI (optional local chat front-end). A self-hosted, ChatGPT-style UI that talks to any OpenAI-compatible server. Run it and point it at your local endpoint:

# Point Open WebUI at your local llama-server (or Ollama on :11434)
docker run -d -p 3000:8080 \
    -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
    -e OPENAI_API_KEY=EMPTY \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and chat. (Open WebUI also autodetects a local Ollama install, so with the Ollama path you can skip the base-URL wiring entirely.) Set the temperature to ~0.3 in the model's parameters.

Directly via the API. Any OpenAI SDK or curl works against the same endpoint — use it for scripts, writing tools, or your own app:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral-nemo-12b",
      "temperature": 0.3,
      "messages": [{"role": "user", "content": "Summarize this in three bullet points: ..."}]
    }'

Local servers don't check the key, so any non-empty string (e.g. EMPTY) works where a client requires one.

Results

  • Memory usage: The dense 12B loads entirely as its GGUF file — f16 is 24.50GB on disk (byte-verified from the bartowski GGUF tree). On the M3 Max's 48GB unified memory ~34-36GB is usable by the GPU once the OS takes its share, so full-precision f16 fits comfortably with ~10GB-plus left for the KV cache — and much more headroom if you drop to the near-lossless Q8_0 (13.02GB), Q6_K (10.06GB), or Q4_K_M (7.48GB). This is why full precision is the standout choice on this tier: you can afford maximum fidelity without giving up a usable context.
  • Model capability (vendor evals — Mistral's own, NOT hardware throughput): Mistral reports MMLU 68.0% and HellaSwag (0-shot) 83.5%, with strong multilingual results — MMLU French 62.3%, German 62.7%, Spanish 64.6%. These are the vendor's benchmarks, not measurements on this GPU.
  • Speed: No community throughput benchmark for Mistral Nemo 12B on the Apple M3 Max exists yet — we would rather omit a tok/s figure than invent one or borrow it from different hardware. Live measurements will appear at /check/mistral-nemo-12b/m3-max once contributed.

For the full benchmark data, see /check/mistral-nemo-12b/m3-max.

Troubleshooting

Output quality is poor / rambling / incoherent — check the temperature

Mistral recommends a low sampling temperature of ~0.3 for Nemo. The common default of 0.7 noticeably degrades this model's output — if responses feel off, this is the first thing to fix. Set --temp 0.3 on llama-server, or the equivalent temperature parameter in your client / Ollama Modelfile.

The chat template looks wrong / responses are malformed

Pass --jinja to llama-server so the GGUF's built-in chat template is applied — without it the assistant format won't parse. Mistral Nemo uses Mistral's Tekken tokenizer (tekken.json) — it was the first Tekken model. On the Python serving paths that needs mistral-common, but the GGUF / llama.cpp path uses the embedded tokenizer, so no extra install is required there.

Out of memory, especially when raising the context

On a 48GB Mac the GPU can use ~34-36GB, so the f16 weights (24.50GB) leave roughly 10GB for the KV cache and OOM is unlikely at moderate context sizes. A full 128K cache is another matter: going by the architecture numbers (40 layers, 8 KV heads, head_dim 128), an f16 cache costs about 160KiB per token, or roughly 20GiB at 131072 tokens, which does not fit beside the f16 weights. Options, in order: quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (roughly halves cache memory); switch to the lighter Q8_0 (13.02GB) to free memory for a much larger cache; or lower -c. Q6_K (10.06GB) and Q4_K_M (7.48GB) free even more. If you need an unusually large wired cache, raise the limit with sudo sysctl iogpu.wired_limit_mb=<MB>, but leave the OS a few GB.
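To choose a value for -c, you can turn a memory budget into an upper bound on context length. This is a sketch under the same assumptions as the rest of the recipe's memory math: per-token cache costs of 163840 bytes (f16) and 87040 bytes (q8_0) derived from the architecture numbers, a 34GB ceiling, and no allowance for compute buffers, so treat the results as optimistic. max_ctx_tokens is a hypothetical helper name.

```shell
# Largest context (in tokens) whose KV cache fits in a given memory budget.
max_ctx_tokens() (
  budget_mb=$1          # memory left after weights, in MB (1MB = 10^6 bytes)
  bytes_per_token=$2    # 163840 for an f16 cache, 87040 for q8_0 (assumed)
  echo $((budget_mb * 1000000 / bytes_per_token))
)

# f16 weights under an assumed 34GB ceiling leave about 9500MB
max_ctx_tokens 9500 163840     # f16 cache:  prints 57983
max_ctx_tokens 9500 87040      # q8_0 cache: prints 109145
# Q8_0 weights leave about 20980MB
max_ctx_tokens 20980 87040     # q8_0 cache: prints 241038
```

Read this way, f16 weights top out somewhere below the full 131072-token window even with a quantized cache at the conservative ceiling, while Q8_0 weights with a q8_0 cache clear it with room to spare.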

There's no nvidia-smi — this is Apple Metal, not CUDA

On a Mac there is no nvidia-smi and no CUDA — the GPU is Apple's, driven by the Metal backend (default on Apple Silicon). To watch GPU and memory pressure use Activity Monitor (Window → GPU History) or sudo powermetrics --samplers gpu_power in a terminal. Serving Mistral Nemo via llama.cpp or Ollama does not require PyTorch or a Python ML stack; if the model won't use the GPU, confirm you built (or downloaded) a Metal-enabled llama.cpp (Option A, -DGGML_METAL=ON). At 12B on 48GB, running the f16 GGUF is well within reach.

Model or GPU 404 on /check

Mistral Nemo 12B is a new addition; if the /check/mistral-nemo-12b/m3-max link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Mistral Nemo 12B need?

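The weights take about 13 GB (Q8_0) to 24.5 GB (f16), plus the KV cache. This recipe targets a Mac with 48 GB of unified memory, of which roughly 34-36 GB is usable by the GPU.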

Which GPUs is Mistral Nemo 12B tested on?

Apple M3 Max (48 GB).

How hard is this setup?

Intermediate — follow the steps above.