How much VRAM does WizardLM-2 7B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner level. Follow the installation steps below.

WizardLM-2 7B on RTX 3060 Ti: Run Microsoft's Evol-Instruct Chat LLM Locally at the 8 GB Floor

What You'll Build

A local chat assistant powered by WizardLM-2 7B: a small instruction-following language model you talk to entirely on your own machine through Ollama — no cloud, no API key. You ask it questions, have it draft text, reason through a problem, or hold a multi-turn conversation, and it streams answers back. The whole thing runs on an 8 GB RTX 3060 Ti at the Q4 quant.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~70.8 tokens/s generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data

ℹ️ About this model's provenance. WizardLM-2 was released by Microsoft and then withdrawn — the team pulled the official repository pending a toxicity re-test that was never re-published. The model survives through community mirrors; this site catalogues it against the mirror dreamgen/WizardLM-2-7B (Apache-2.0, the original WizardLM-2 license). Ollama still serves the weights under the wizardlm2 library tag. Treat it as a community-preserved model, not a live first-party release.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~5 GB (Q4 weights)	4.1 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

WizardLM-2 7B is a 7B Mistral-architecture instruction-following LLM trained by Microsoft with Evol-Instruct, distributed under the Apache-2.0 license. Ollama describes the line as "a next generation state-of-the-art large language model with improved performance on complex chat, multilingual, reasoning and agent use cases" and calls the 7B the "fastest model, comparable performance with 10x larger open-source models." (ollama.com/library/wizardlm2)

⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card — there is no headroom. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit 8 GB.

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi
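Because this recipe leaves no VRAM headroom, it can help to see how much memory is already in use before loading the model. The following is a minimal sketch, assuming a single GPU and nvidia-smi's CSV query mode; the helper name vram_free_mib is hypothetical and exists only for this example.

```shell
# Print free VRAM in MiB from one "total, used" CSV line (values in MiB).
# vram_free_mib is a hypothetical helper name used only in this sketch.
vram_free_mib() {
  # awk splits on the comma and treats " 312" as the number 312.
  printf '%s\n' "$1" | awk -F',' '{ print $1 - $2 }'
}

# Usage (needs the NVIDIA driver; assumes exactly one GPU is listed):
#   vram_free_mib "$(nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits)"
```

If the reported free memory is already well below the card's total before Ollama starts, something else is holding VRAM and is worth closing first.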

2. Pull the WizardLM-2 7B weights

The wizardlm2:7b tag is a 4.1 GB Q4 download (ollama.com/library/wizardlm2):

ollama pull wizardlm2:7b
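To confirm the pull succeeded, you can look for the tag in the output of ollama list. The sketch below assumes the listing has one header row and the model name in the first column; has_model is a hypothetical helper, not part of Ollama.

```shell
# Succeed if the given tag appears in the first column of `ollama list` output.
# Reads the listing from stdin and skips the header row.
# has_model is a hypothetical helper name used only in this sketch.
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Usage:
#   ollama list | has_model wizardlm2:7b && echo "weights present"
```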

Running

Start an interactive chat session:

ollama run wizardlm2:7b
>>> Explain how a transformer attention head works, in two sentences.

The model streams its answer token by token. You can keep asking follow-up questions in the same session; type /bye to exit.

For a one-shot prompt from the shell (handy for scripting), pass the prompt as an argument:

ollama run wizardlm2:7b "Write a haiku about local LLMs."
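For scripting, Ollama also exposes a local HTTP API (by default on port 11434) with a /api/generate endpoint. The sketch below only builds the JSON request body; generate_payload is a hypothetical helper, and its escaping is deliberately minimal, so treat it as an illustration rather than a general JSON encoder.

```shell
# Build a JSON body for Ollama's /api/generate endpoint.
# Escapes backslashes and double quotes only; prompts containing newlines or
# control characters need a real JSON encoder (jq, for example).
# generate_payload is a hypothetical helper name used only in this sketch.
generate_payload() {
  escaped=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$escaped"
}

# Usage (assumes the Ollama server is running on the default port):
#   curl -s http://localhost:11434/api/generate \
#     -d "$(generate_payload wizardlm2:7b 'Write a haiku about local LLMs.')"
```

Setting "stream" to false asks for a single JSON response instead of a token-by-token stream, which is usually easier to handle in a script.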

Results

Speed: ~70.8 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). Eval rate is the speed at which the model emits output tokens, so for a text-only LLM like this one it is the generation speed.
VRAM usage: The site's backend records an 8.0 GB peak on this 8 GB card, i.e. effectively full. DatabaseMart's own table reports the GPU vRAM figure as a utilization percentage (70%) rather than a GB value, so this recipe anchors on the backend's measured 8.0 GB peak. Either way, plan for no spare VRAM. See /check
Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
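To produce a comparable throughput number of your own, one option is ollama run wizardlm2:7b --verbose, which prints an eval rate after each answer. Alternatively, the final /api/generate response carries eval_count (tokens generated) and eval_duration (nanoseconds), and the rate is their ratio. The sketch below shows that arithmetic; tokens_per_second is a hypothetical helper, and whether it matches DatabaseMart's methodology exactly is an assumption.

```shell
# tokens/s = eval_count / eval_duration * 1e9  (eval_duration is in nanoseconds)
# tokens_per_second is a hypothetical helper name used only in this sketch.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1          # avoid dividing by zero on a bad input
    printf "%.1f\n", n / ns * 1e9
  }'
}

# Usage: pass the two fields from the API response, for example
#   tokens_per_second 708 10000000000
```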

For the full benchmark data, see /check/wizardlm2-7b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't reach for the larger wizardlm2:8x22b tag on this card — it is an 80 GB download (ollama.com/library/wizardlm2) and will not run on a single consumer GPU. The 7B Q4 is the only WizardLM-2 variant that fits 8 GB.
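To tell whether a loaded model is fully on the GPU, ollama ps lists running models with a PROCESSOR column; per Ollama's FAQ, "100% GPU" means the model sits entirely in VRAM, while a CPU/GPU split means part of it spilled to system memory. The sketch below checks for that string; fully_on_gpu is a hypothetical helper, and the exact column layout may differ between Ollama versions.

```shell
# Succeed if the `ollama ps` row for the given tag reports "100% GPU".
# Reads the `ollama ps` output from stdin.
# fully_on_gpu is a hypothetical helper name used only in this sketch.
fully_on_gpu() {
  awk -v tag="$1" '$1 == tag && /100% GPU/ { ok = 1 } END { exit !ok }'
}

# Usage (run while the model is loaded, e.g. from a second terminal):
#   ollama ps | fully_on_gpu wizardlm2:7b || echo "model is partly or fully on CPU"
```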

"Is this the real Microsoft release?"

Not exactly. Microsoft published WizardLM-2, then withdrew the official repository pending a toxicity re-test. The weights you pull here are preserved by the community — the site catalogues the model against the dreamgen/WizardLM-2-7B mirror (Apache-2.0), and Ollama serves the same line under wizardlm2. The model works as documented; just be aware the original first-party repo is no longer live.

Slow generation or short answers

Generation throughput depends on your Ollama version, driver, and how much context you feed it. The ~70.8 tokens/s figure was measured on Ollama 0.5.4; a much older or newer build, a long prompt, or a near-full KV cache can pull it down. If your numbers differ materially, report them via /contribute.
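To get a feel for how context length eats into VRAM, here is a rough KV-cache estimate. It assumes the standard Mistral 7B shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; none of these numbers come from the benchmark, and kv_cache_mib is a hypothetical helper.

```shell
# Rough KV-cache size in MiB for a given context length (in tokens).
# Assumed shape: 32 layers, 8 KV heads, head dim 128, 2 bytes per value.
# The leading 2 accounts for storing both keys and values.
# kv_cache_mib is a hypothetical helper name used only in this sketch.
kv_cache_mib() {
  echo $(( 2 * 32 * 8 * 128 * 2 * $1 / 1048576 ))
}

# Usage:
#   kv_cache_mib 2048
#   kv_cache_mib 8192
```

Under these assumptions each token of context costs 128 KiB, so a longer context adds memory on top of the weights. On a card with no spare VRAM, lowering the context (for example with /set parameter num_ctx 2048 inside an ollama run session) is one lever worth trying.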

No other widely reported issues for this pair. Report problems via the submission form.