How much VRAM does StableLM 2 12B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

StableLM 2 12B on RTX 3060 Ti: Run a Multilingual Base LLM at the 8 GB Wall

What You'll Build

A local install of StableLM 2 12B — Stability AI's multilingual 12-billion-parameter base language model — running entirely on your own RTX 3060 Ti through Ollama, with no cloud and no API key. The whole 12B model fits on an 8 GB card thanks to the Q4 quant, generating text at the rate the benchmark below records.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~18.73 tokens/s generation (Q4, Ollama 0.5.4) · fills the 8 GB ceiling · See benchmark data

ℹ️ This is a base model, not a chat assistant. The plain stablelm2:12b tag — the one this recipe and the benchmark use — is the foundational base model, cited against the canonical stabilityai/stablelm-2-12b repo. The model card states it plainly: "The model is intended to be used as a foundational base model for application-specific fine-tuning." Base models complete text rather than follow instructions or hold a conversation out of the box. If you want a chat-style assistant, pull stablelm2:12b-chat instead (also 7.0 GB) — the tags are listed on ollama.com/library/stablelm2/tags. The numbers below are for the base stablelm2:12b tag specifically.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~7 GB (Q4 weights)	7.0 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

StableLM 2 12B is a multilingual decoder-only language model from Stability AI, released under the Stability AI Community License. The model card describes it directly: "Stable LM 2 12B is a 12.1 billion parameter decoder-only language model pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs." (huggingface.co/stabilityai/stablelm-2-12b) Ollama describes the family as a "state-of-the-art 1.6B and 12B parameter language model trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch." (ollama.com/library/stablelm2)

⚠️ Right at the 8 GB wall: this 12B is a tight fit. Unlike the 7-to-9B models that sit comfortably on an 8 GB card, a 12B model at Q4 leaves almost no headroom. In the cited benchmark, VRAM utilization on the RTX 3060 Ti peaks at 90%, and DatabaseMart's own write-up notes that "StableLM 2 (12b) and Falcon 2 (11b) pushed the limits of the RTX 3060 Ti's 8GB VRAM, leading to slower inference speeds." (databasemart.com) The backend records an 8.0 GB peak, which is effectively full. Close other GPU consumers before you run: browsers with hardware acceleration, other models, and even a second monitor's compositor can tip you into an out-of-memory error or a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit in 8 GB.
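
The headroom arithmetic behind this warning can be checked in a few lines. This is a minimal sketch, not part of the benchmark: the helper names gb_at_percent and headroom_gb are hypothetical, the 7.0 GB download size is used as a stand-in for weight memory, and the KV cache, runtime buffers, and GB-versus-GiB differences are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints card_gb * percent / 100 with one decimal place.
gb_at_percent() {
  awk -v gb="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", gb * pct / 100 }'
}

# Hypothetical helper: prints the VRAM left after the weights, one decimal place.
# Assumption: the download size approximates the memory the weights occupy.
headroom_gb() {
  awk -v card="$1" -v weights="$2" 'BEGIN { printf "%.1f\n", card - weights }'
}

echo "90% of an 8 GB card: $(gb_at_percent 8 90) GB"
echo "Left after 7.0 GB of weights: $(headroom_gb 8 7.0) GB"
```

On these inputs, 90% of 8 GB works out to 7.2 GB, and the weights alone leave roughly 1 GB for everything else.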

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi
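
Because the fit is so tight, it may help to check free VRAM before pulling the model. The sketch below is one possible pre-flight check, under stated assumptions: the helper name has_room_mib is hypothetical, and the 7168 MiB threshold simply treats the 7.0 GB of weights as 7 × 1024 MiB, which is a conservative approximation rather than a measured requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if free VRAM (MiB) covers the needed amount (MiB).
has_room_mib() {
  local free_mib="$1" needed_mib="$2"
  [ "$free_mib" -ge "$needed_mib" ]
}

# Assumption: 7.0 GB of weights approximated as 7 * 1024 = 7168 MiB.
NEEDED_MIB=7168

if command -v nvidia-smi >/dev/null 2>&1; then
  # Free memory of the first GPU, in MiB, as a bare number.
  free_mib="$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)"
  if has_room_mib "$free_mib" "$NEEDED_MIB"; then
    echo "OK: ${free_mib} MiB free"
  else
    echo "Tight: only ${free_mib} MiB free; close other GPU consumers first"
  fi
fi
```

The check covers the weights only, so passing it does not guarantee the model will load with a large context.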

2. Pull the StableLM 2 12B weights

The stablelm2:12b tag is a 7.0 GB Q4 download (ollama.com/library/stablelm2/tags):

ollama pull stablelm2:12b

Running

Because this is a base model, you prompt it as a text completion rather than a chat turn — give it the start of something and let it continue:

ollama run stablelm2:12b "The three laws of thermodynamics are:"

You can still open an interactive session, but remember the base model continues text rather than answering instructions conversationally:

ollama run stablelm2:12b
>>> Once upon a time in a quiet village,

The model loads onto the GPU and streams its continuation token by token. The first run after a fresh pull spends a moment loading the 7.0 GB of weights into VRAM, which on this card is nearly the whole budget; subsequent prompts in the same session reply without reloading.
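
The same completion can also be requested from a script through Ollama's local HTTP API. This is a minimal sketch that assumes the server is listening on its default port, 11434; the helper name generate_payload is hypothetical, and it does no JSON escaping, so it only suits prompts without double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds a non-streaming /api/generate request body.
# Assumption: the prompt contains no double quotes or backslashes,
# since this sketch does no JSON escaping.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

# Assumption: Ollama is listening on its default local port, 11434.
if command -v curl >/dev/null 2>&1 && command -v ollama >/dev/null 2>&1; then
  curl -s http://localhost:11434/api/generate \
    -d "$(generate_payload "stablelm2:12b" "The three laws of thermodynamics are:")" \
    || echo "Ollama server not reachable"
fi
```

With "stream" set to false the reply arrives as a single JSON object instead of a token-by-token stream.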

Results

Speed: ~18.73 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM this is the generation speed — the rate at which the model writes its output. It is noticeably slower than the 7-to-8B models on the same table (which run ~57–73 tokens/s) because a 12B model is larger and fills the card, but it is comfortably faster than the 13B Llama 2 row's ~9 tokens/s.
VRAM usage: The backend records an 8.0 GB peak on this 8 GB card — i.e. effectively full. DatabaseMart's own table lists StableLM 2's GPU vRAM utilization at 90% on the card it benchmarked (note: that figure is a percentage of card VRAM, not gigabytes). Either way, plan for essentially no spare VRAM. See /check
Quality notes: This is a single commercial benchmark source, and StableLM 2 12B is a base model — for instruction-following or chat quality you would either use the 12b-chat tag or fine-tune it yourself, as the model card recommends. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
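
To measure your own generation rate, one option is to read the eval_count and eval_duration fields that Ollama's /api/generate reply reports (the duration is in nanoseconds) and divide one by the other. The sketch below assumes the default port 11434 and that jq is installed; the helper name tokens_per_second is hypothetical. Running ollama run stablelm2:12b --verbose should print a comparable eval rate without any scripting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tokens per second from a token count and a duration in nanoseconds.
# Prints "n/a" when the duration is missing or not positive.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns + 0 <= 0) { print "n/a"; exit }
    printf "%.2f\n", n * 1e9 / ns
  }'
}

# Assumptions: Ollama on its default port 11434, and jq available to read the JSON reply.
if command -v ollama >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then
  reply="$(curl -s http://localhost:11434/api/generate \
    -d '{"model": "stablelm2:12b", "prompt": "Once upon a time in a quiet village,", "stream": false}')"
  count="$(printf '%s' "$reply" | jq -r '.eval_count')"
  duration="$(printf '%s' "$reply" | jq -r '.eval_duration')"
  echo "Generation rate: $(tokens_per_second "$count" "$duration") tokens/s"
fi
```

A single short prompt gives a noisy number, so averaging a few runs is a reasonable precaution before contributing a datapoint.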

For the full benchmark data, see /check/stablelm2-12b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is zero margin, and a 12B model is tighter than the 7B models people usually run on this card. DatabaseMart found that "Running all models in 4-bit precision proved crucial for fitting into the RTX 3060 Ti" — so do not reach for a heavier quant here. If you see an OOM error or generation suddenly crawls, run nvidia-smi to see what else is resident, close it, and retry. If you still cannot fit it, drop to the much smaller stablelm2:1.6b tag (983 MB), which leaves ample room on an 8 GB card.
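
A quick way to tell a CPU fallback from a merely slow run is to look at where Ollama placed the model. The sketch below assumes that ollama ps lists a PROCESSOR column reading "100% GPU" when the model is fully resident on the card; the helper name fully_on_gpu is hypothetical, and the exact column text may differ between Ollama versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if an `ollama ps` row reports the model fully on the GPU.
# Assumption: the PROCESSOR column reads "100% GPU" in that case.
fully_on_gpu() {
  printf '%s\n' "$1" | grep -q '100% GPU'
}

if command -v ollama >/dev/null 2>&1; then
  # Pick out the row for this recipe's model, if it is currently loaded.
  row="$(ollama ps 2>/dev/null | grep 'stablelm2:12b')"
  if [ -n "$row" ] && ! fully_on_gpu "$row"; then
    echo "stablelm2:12b is partly on the CPU; free some VRAM and reload"
  fi
fi
```

If the warning prints, closing other GPU consumers and reloading the model is the first thing to try.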

The model rambles or ignores my instructions

That is expected — stablelm2:12b is the base model, which completes text rather than following chat instructions. The model card is explicit that it is "intended to be used as a foundational base model for application-specific fine-tuning." For an assistant that answers questions, pull stablelm2:12b-chat (ollama.com/library/stablelm2/tags) instead — the chat variant is also Q4 at 7.0 GB and fits the same 8 GB budget.

Short answers or context runs out

A longer context length allocates a larger KV cache, which eats into the already-tight 8 GB budget on this card. If you raise the context length and hit an OOM, lower it again: on an 8 GB card the 7.0 GB of weights alone fill most of the VRAM, leaving little room for a large KV cache.
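
One way to pin the context length is a small Modelfile that sets num_ctx on top of the base tag. This is a sketch under assumptions: 2048 is only an illustrative value, the local model name stablelm2-12b-ctx2k and the helper write_modelfile are hypothetical, and the right setting for your card has to be found by trial.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a Modelfile that pins the context window.
write_modelfile() {
  local path="$1" num_ctx="$2"
  printf 'FROM stablelm2:12b\nPARAMETER num_ctx %s\n' "$num_ctx" > "$path"
}

workdir="$(mktemp -d)"
# Assumption: 2048 is an illustrative context length, not a tested recommendation.
write_modelfile "$workdir/Modelfile" 2048

if command -v ollama >/dev/null 2>&1; then
  # "stablelm2-12b-ctx2k" is an arbitrary local name chosen for this sketch.
  ollama create stablelm2-12b-ctx2k -f "$workdir/Modelfile" \
    || echo "ollama create failed; is the server running?"
fi
```

Inside an interactive session, /set parameter num_ctx followed by a number changes the same setting for that session only.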

No other widely-reported issues for this pair. Report problems via the submission form.