Mistral Nemo 12B
Mistral Nemo 12B (Instruct, release 2407) is a dense 12-billion-parameter model built by Mistral AI in collaboration with NVIDIA. It is text-only, has a 128K-token context window, and is the first model to use the Tekken tokenizer. It is licensed under Apache-2.0. Trained with quantization awareness for FP8 inference and tuned for function calling and multilingual use, it was positioned as a drop-in upgrade to Mistral 7B. Mistral reports MMLU 68.0% and HellaSwag 83.5%, with solid multilingual MMLU (roughly 62-65% for French, German, and Spanish).

Mistral does not publish a first-party GGUF; community GGUF builds from bartowski and unsloth load on current llama.cpp without special patches. Q4_K_M (~7.5 GB) fits an 8 GB card, while Q6_K or Q8_0 suit 12-24 GB cards. The recommended sampling temperature is low, around 0.3.
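
The quantization sizes above can be roughly reproduced with simple arithmetic: file size is about parameters × bits per weight ÷ 8. The sketch below is only an estimate under stated assumptions. The parameter count (12.2 billion), the bits-per-weight figures, and the 0.5 GB headroom are approximations chosen for illustration, not values taken from this page, and `estimate_gguf_gb` and `pick_quant` are hypothetical helper names.

```python
from typing import Optional

# Assumption: approximate parameter count for Mistral Nemo 12B.
PARAMS = 12.2e9

# Assumption: approximate effective bits per weight for each GGUF quant type.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def estimate_gguf_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Estimate the GGUF file size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def pick_quant(vram_gb: float, headroom_gb: float = 0.5) -> Optional[str]:
    """Return the largest quant whose estimated size plus headroom fits in VRAM.

    headroom_gb is a placeholder for KV cache and runtime buffers; long
    contexts need considerably more than the default used here.
    """
    best = None
    # Walk from smallest to largest so the last fit is the largest one.
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: kv[1]):
        if estimate_gguf_gb(bpw) + headroom_gb <= vram_gb:
            best = name
    return best


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(f"{vram} GB -> {pick_quant(vram)}")
```

With these assumed figures the Q4_K_M estimate lands near the ~7.5 GB quoted above. The fit on an 8 GB card is tight, so in practice the usable context length depends on how much memory is left for the KV cache.
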
| GPU | VRAM | Series | Best speed | Min VRAM | Works | Benchmarks | Recipe | |
|---|---|---|---|---|---|---|---|---|
| Apple M2 Pro | 16GB | apple | | | ~ | 0 | recipe | check ↗ |
| Apple M3 Max | 48GB | apple | | | ~ | 0 | recipe | check ↗ |
| RTX 3090 | 24GB | 30 | | | ~ | 0 | recipe | check ↗ |
| RTX 4060 | 8GB | 40 | | | ~ | 0 | recipe | check ↗ |
| RTX 4070 | 12GB | 40 | | | ~ | 0 | recipe | check ↗ |
| RTX 4080 | 16GB | 40 | | | ~ | 0 | recipe | check ↗ |
| RTX 4090 | 24GB | 40 | | | ~ | 0 | recipe | check ↗ |
| RX 7800 XT | 16GB | amd | | | ~ | 0 | recipe | check ↗ |

✓ benchmarked · ~ runs via recipe (not benchmarked) · — untested · ✕ doesn't fit