What You'll Build
A local install of StableLM 2 12B, Stability AI's multilingual 12-billion-parameter base language model, running entirely on your own RTX 3060 Ti through Ollama, with no cloud and no API key. The whole 12B model fits on an 8 GB card thanks to the Q4 quant, and generates text at roughly 18.73 tokens/s in the benchmark cited below.
Hardware data: RTX 3060 Ti (8GB VRAM) · ~18.73 tokens/s generation (Q4, Ollama 0.5.4) · fills the 8 GB ceiling · See benchmark data
ℹ️ This is a base model, not a chat assistant. The plain stablelm2:12b tag (the one this recipe and the benchmark use) is the foundational base model, cited against the canonical stabilityai/stablelm-2-12b repo. The model card states it plainly: "The model is intended to be used as a foundational base model for application-specific fine-tuning." Base models complete text rather than follow instructions or hold a conversation out of the box. If you want a chat-style assistant, pull stablelm2:12b-chat instead (also 7.0 GB); the tags are listed on ollama.com/library/stablelm2/tags. The numbers below are for the base stablelm2:12b tag specifically.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8GB VRAM | RTX 3060 Ti (8GB) |
| RAM | 16GB | — |
| Storage | ~7 GB (Q4 weights) | 7.0 GB model pull |
| Software | Ollama, NVIDIA driver + CUDA | Ollama 0.5.4 |
StableLM 2 12B is a multilingual decoder-only language model from Stability AI, released under the Stability AI Community License. The model card describes it directly: "Stable LM 2 12B is a 12.1 billion parameter decoder-only language model pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs." (huggingface.co/stabilityai/stablelm-2-12b) Ollama describes the family as a "state-of-the-art 1.6B and 12B parameter language model trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch." (ollama.com/library/stablelm2)
⚠️ Right at the 8 GB wall: this 12B is a tight fit. Unlike the 7-to-9B models that sit comfortably on an 8 GB card, a 12B model at Q4 leaves almost no headroom. The cited benchmark lists the RTX 3060 Ti's VRAM utilization at 90% for this model, and DatabaseMart's own write-up notes that "StableLM 2 (12b) and Falcon 2 (11b) pushed the limits of the RTX 3060 Ti's 8GB VRAM, leading to slower inference speeds." (databasemart.com) The backend records an 8.0 GB peak, which is effectively full. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit in 8 GB.
Installation
1. Install Ollama
Download and install Ollama for your OS from ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:
ollama --version
nvidia-smi
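Because this recipe leaves almost no VRAM to spare, it can help to see how much memory is already in use before loading the model. The following is a minimal sketch, not part of the original recipe: vram_free_mib is a hypothetical helper, and the script assumes a single GPU and uses nvidia-smi's CSV query output.

```shell
#!/bin/sh
# Hypothetical helper: free VRAM in MiB, given total and used MiB.
vram_free_mib() {
  echo $(( $1 - $2 ))
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Ask for total and used memory of the first GPU as bare numbers, e.g. "8192, 500".
  line=$(nvidia-smi --query-gpu=memory.total,memory.used \
                    --format=csv,noheader,nounits | head -n 1)
  if [ -n "$line" ]; then
    total=${line%%,*}   # text before the first comma
    used=${line##*, }   # text after the last ", "
    echo "Free VRAM: $(vram_free_mib "$total" "$used") MiB of $total MiB"
  fi
else
  echo "nvidia-smi not found; install the NVIDIA driver first" >&2
fi
```

If the reported free figure is well below the card's total before Ollama has loaded anything, something else is already holding VRAM.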
2. Pull the StableLM 2 12B weights
The stablelm2:12b tag is a 7.0 GB Q4 download (ollama.com/library/stablelm2/tags):
ollama pull stablelm2:12b
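To confirm the pull finished, you can look for the tag in the output of ollama list. A small sketch, assuming the listing has a header row and the tag name in the first column; has_model is a hypothetical helper introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the tag in $1 appears in the first column
# of an `ollama list`-style table read from stdin (the header row is skipped).
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

if command -v ollama >/dev/null 2>&1; then
  if ollama list | has_model "stablelm2:12b"; then
    echo "stablelm2:12b is installed"
  else
    echo "stablelm2:12b not found; run: ollama pull stablelm2:12b" >&2
  fi
fi
```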
Running
Because this is a base model, you prompt it as a text completion rather than a chat turn — give it the start of something and let it continue:
ollama run stablelm2:12b "The three laws of thermodynamics are:"
You can still open an interactive session, but remember the base model continues text rather than answering instructions conversationally:
ollama run stablelm2:12b
>>> Once upon a time in a quiet village,
The model loads onto the GPU and streams its continuation token by token. The first run after a fresh pull spends a moment loading the 7.0 GB of weights into VRAM, which on this card is nearly the whole budget; subsequent prompts in the same session reply without reloading.
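If you want to compare your own throughput with the benchmark figure, ollama run accepts a --verbose flag that prints timing statistics after the response. The sketch below filters out the generation rate. It assumes the statistics include a line of the form "eval rate: 18.73 tokens/s" and that they may arrive on stderr; eval_rate is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: extract the generation rate from --verbose stats on stdin.
# Anchoring at the start of the line skips the separate "prompt eval rate:" line.
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}

if command -v ollama >/dev/null 2>&1; then
  # Merge stderr into stdout so the stats reach the filter either way.
  ollama run --verbose stablelm2:12b "The three laws of thermodynamics are:" 2>&1 \
    | eval_rate
fi
```

The number printed is in tokens per second, the same unit as DatabaseMart's "Eval Rate(tokens/s)" column.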
Results
- Speed: ~18.73 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM this is the generation speed — the rate at which the model writes its output. It is noticeably slower than the 7-to-8B models on the same table (which run ~57–73 tokens/s) because a 12B model is larger and fills the card, but it is comfortably faster than the 13B Llama 2 row's ~9 tokens/s.
- VRAM usage: The backend records an 8.0 GB peak on this 8 GB card, i.e. effectively full. DatabaseMart's own table lists StableLM 2's GPU vRAM utilization at 90% on the card it benchmarked (note: that figure is a percentage of card VRAM, not gigabytes). Either way, plan for essentially no spare VRAM. See /check
- Quality notes: This is a single commercial benchmark source, and StableLM 2 12B is a base model. For instruction-following or chat quality you would either use the 12b-chat tag or fine-tune it yourself, as the model card recommends. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
For the full benchmark data, see /check/stablelm2-12b/rtx-3060-ti.
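To put the figures above in practical terms, here is a small arithmetic sketch using only numbers from this page (18.73 tokens/s, 90% utilization, an 8 GB card). The 500-token output length is an arbitrary example value, and both helpers are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: seconds needed to generate N tokens at a given tokens/s rate.
gen_seconds() {
  awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n / rate }'
}

# Hypothetical helper: gigabytes that a utilization percentage represents on a card.
pct_to_gb() {
  awk -v pct="$1" -v card="$2" 'BEGIN { printf "%.1f\n", pct / 100 * card }'
}

gen_seconds 500 18.73   # prints 26.7 (seconds for a 500-token continuation)
pct_to_gb 90 8          # prints 7.2 (GB corresponding to 90% of an 8 GB card)
```

So a 500-token continuation takes a little under half a minute at the benchmarked rate, and even the lower 90% reading leaves well under 1 GB free.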
Troubleshooting
Out of memory / model falls back to CPU
At 8.0 GB peak on an 8 GB card there is zero margin, and a 12B model is tighter than the 7B models people usually run on this card. DatabaseMart found that "Running all models in 4-bit precision proved crucial for fitting into the RTX 3060 Ti" — so do not reach for a heavier quant here. If you see an OOM error or generation suddenly crawls, run nvidia-smi to see what else is resident, close it, and retry. If you still cannot fit it, drop to the much smaller stablelm2:1.6b tag (983 MB), which leaves ample room on an 8 GB card.
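One way to tell a silent CPU fallback from a full-GPU load is the PROCESSOR column of ollama ps, checked while the model is still loaded. The sketch below assumes that column reads either "100% GPU", "100% CPU", or a split such as "25%/75% CPU/GPU"; that format is an assumption about your Ollama version, and placement is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU/GPU placement for the tag in $1,
# reading `ollama ps`-style output from stdin.
placement() {
  awk -v tag="$1" '$1 == tag {
    if (match($0, /[0-9]+%\/[0-9]+% CPU\/GPU|100% (GPU|CPU)/))
      print substr($0, RSTART, RLENGTH)
  }'
}

if command -v ollama >/dev/null 2>&1; then
  ollama ps | placement "stablelm2:12b"
fi
```

Anything other than "100% GPU" means part of the model is running on the CPU, which would explain generation that suddenly crawls.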
The model rambles or ignores my instructions
That is expected — stablelm2:12b is the base model, which completes text rather than following chat instructions. The model card is explicit that it is "intended to be used as a foundational base model for application-specific fine-tuning." For an assistant that answers questions, pull stablelm2:12b-chat (ollama.com/library/stablelm2/tags) instead — the chat variant is also Q4 at 7.0 GB and fits the same 8 GB budget.
Short answers or context runs out
A longer context needs a larger KV cache, which eats into the already-tight 8 GB budget on this card: the 7.0 GB of weights alone fill most of the VRAM. If you raise the context length and hit an OOM, lower it again.
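The context length can be set per request through Ollama's local HTTP API by passing num_ctx in the options object. The sketch below is illustrative only: 2048 is an example value rather than a tested recommendation, gen_payload is a hypothetical helper, and the prompt is assumed to contain no double quotes or backslashes because no JSON escaping is done.

```shell
#!/bin/sh
# Hypothetical helper: build a /api/generate request body with an explicit
# context length. Arguments: model tag, prompt, num_ctx.
gen_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' \
    "$1" "$2" "$3"
}

payload=$(gen_payload "stablelm2:12b" "Once upon a time in a quiet village," 2048)
echo "$payload"

# Send it to a locally running Ollama server on its default port, if reachable.
if command -v curl >/dev/null 2>&1; then
  curl -s --connect-timeout 5 http://localhost:11434/api/generate -d "$payload" \
    || echo "Ollama server not reachable on localhost:11434" >&2
fi
```

Inside an interactive ollama run session, the equivalent is the /set parameter num_ctx command followed by the value.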
No other widely-reported issues for this pair. Report problems via the submission form.