What You'll Build
A local chat assistant powered by WizardLM-2 7B: a small instruction-following language model you talk to entirely on your own machine through Ollama — no cloud, no API key. You ask it questions, have it draft text, reason through a problem, or hold a multi-turn conversation, and it streams answers back. The whole thing runs on an 8 GB RTX 3060 Ti at the Q4 quant.
Hardware data: RTX 3060 Ti (8GB VRAM) · ~70.8 tokens/s generation (Q4, Ollama 0.5.4) · sits right at the 8 GB ceiling · See benchmark data
ℹ️ About this model's provenance. WizardLM-2 was released by Microsoft and then withdrawn: the team pulled the official repository pending a toxicity re-test and never re-published it. The model survives through community mirrors; this site catalogues it against the mirror dreamgen/WizardLM-2-7B (Apache-2.0, the original WizardLM-2 license). Ollama still serves the weights under the wizardlm2 library tag. Treat it as a community-preserved model, not a live first-party release.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 8GB VRAM | RTX 3060 Ti (8GB) |
| RAM | 16GB | — |
| Storage | ~5 GB (Q4 weights) | 4.1 GB model pull |
| Software | Ollama, NVIDIA driver + CUDA | Ollama 0.5.4 |
WizardLM-2 7B is a 7B Mistral-architecture instruction-following LLM trained by Microsoft with Evol-Instruct, distributed under the Apache-2.0 license. Ollama describes the line as "a next generation state-of-the-art large language model with improved performance on complex chat, multilingual, reasoning and agent use cases" and calls the 7B the "fastest model, comparable performance with 10x larger open-source models." (ollama.com/library/wizardlm2)
⚠️ Right at the 8 GB wall. The cited benchmark peaks at 8.0 GB on this 8 GB card — there is no headroom. Close other GPU consumers before you run: browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or force a slow CPU fallback. This recipe documents the Q4 quant specifically because heavier quants do not fit 8 GB.
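To see how much VRAM is actually free before loading the model, a small check along these lines may help. It is a minimal sketch: the helper names `free_mib` and `gpu_free_mib` are made up for illustration, and it assumes `nvidia-smi` is on your PATH and that you care about the first GPU only.

```shell
# Given one "used, total" CSV line (values in MiB), print the free amount in MiB.
free_mib() {
  echo "$1" | awk -F', *' '{ print $2 - $1 }'
}

# Ask the driver for used/total memory on the first GPU and print what is left.
gpu_free_mib() {
  line=$(nvidia-smi --query-gpu=memory.used,memory.total \
           --format=csv,noheader,nounits | head -n 1)
  free_mib "$line"
}

# Usage: gpu_free_mib
```

If the number it prints is well below the full 8 GB before Ollama has loaded anything, something else is already resident on the card.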
Installation
1. Install Ollama
Download and install Ollama for your OS from ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:
ollama --version
nvidia-smi
2. Pull the WizardLM-2 7B weights
The wizardlm2:7b tag is a 4.1 GB Q4 download (ollama.com/library/wizardlm2):
ollama pull wizardlm2:7b
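If you want a script to confirm the pull finished, one option is to look for the tag in the output of `ollama list`. This is a sketch only; `has_model` is a hypothetical helper name, and it assumes the tag appears in the first column of that output.

```shell
# Read `ollama list` output on stdin; succeed if the given tag is in the first column.
has_model() {
  awk -v tag="$1" '$1 == tag { found = 1 } END { exit !found }'
}

# Usage: ollama list | has_model wizardlm2:7b && echo "weights are present"
```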
Running
Start an interactive chat session:
ollama run wizardlm2:7b
>>> Explain how a transformer attention head works, in two sentences.
The model streams its answer token by token. You can keep asking follow-up questions in the same session; type /bye to exit.
For a one-shot prompt from the shell (handy for scripting), pass the prompt as an argument:
ollama run wizardlm2:7b "Write a haiku about local LLMs."
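For scripting against the running server rather than the CLI, Ollama also exposes a local HTTP API. The sketch below posts a single prompt to `/api/generate` with streaming turned off. The helper names and the `OLLAMA_URL` variable are invented for this example, the default port 11434 is assumed, and the escaping only covers backslashes and double quotes in a single-line prompt.

```shell
# Assumption: Ollama is listening on its default local address.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Escape backslashes and double quotes so a single-line prompt is safe inside JSON.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the request body for POST /api/generate (model, prompt, no streaming).
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$(json_escape "$2")"
}

# Send one prompt and print the raw JSON reply; the text is in its "response" field.
ask() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(build_payload wizardlm2:7b "$1")"
}

# Usage: ask "Write a haiku about local LLMs."
```

With `"stream":false` the server returns one JSON object instead of a token-by-token stream, which is easier to handle in a shell script. Prompts containing newlines or control characters would need a real JSON encoder.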
Results
- Speed: ~70.8 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). This is the rate at which the model writes its answer — for a plain text LLM like this one, that is the generation speed.
- VRAM usage: The backend records an 8.0 GB peak on this 8 GB card, i.e. effectively full. (DatabaseMart's own table lists the GPU vRAM figure as a utilization percentage, 70%, not a GB value, so anchor on the backend's measured 8.0 GB peak.) Either way, plan for no spare VRAM. See /check
- Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
For the full benchmark data, see /check/wizardlm2-7b/rtx-3060-ti.
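To measure your own generation rate, the quickest route is `ollama run wizardlm2:7b --verbose`, which prints timing statistics, including an eval rate, after each answer. The same number can be derived from the HTTP API, whose non-streaming reply reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below does that arithmetic. The helper names are hypothetical, the default port is assumed, and the JSON parsing is a deliberately crude `grep` rather than a proper parser.

```shell
# Print the integer value of a numeric key from a JSON string (first match only).
json_int() {
  printf '%s' "$2" | grep -o "\"$1\": *[0-9]*" | grep -o '[0-9][0-9]*$' | head -n 1
}

# tokens/s = eval_count / (eval_duration in ns / 1e9), rounded to one decimal.
tokens_per_sec() {
  awk -v c="$1" -v d="$2" 'BEGIN { printf "%.1f\n", c / (d / 1e9) }'
}

# Run one non-streaming generation and print the measured rate.
measure() {
  resp=$(curl -s http://localhost:11434/api/generate \
    -d '{"model":"wizardlm2:7b","prompt":"Write a haiku about local LLMs.","stream":false}')
  tokens_per_sec "$(json_int eval_count "$resp")" "$(json_int eval_duration "$resp")"
}

# Usage: measure
```

A single short prompt gives a noisy reading; averaging a few runs with longer answers should land closer to a steady-state figure.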
Troubleshooting
Out of memory / model falls back to CPU
At 8.0 GB peak on an 8 GB card there is no margin. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Don't reach for the larger wizardlm2:8x22b tag on this card — it is an 80 GB download (ollama.com/library/wizardlm2) and will not run on a single consumer GPU. The 7B Q4 is the only WizardLM-2 variant that fits 8 GB.
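Two quick checks can narrow this down: which processes hold VRAM, and whether Ollama has split the model between CPU and GPU. The sketch below is illustrative only. `vram_holders` and `on_gpu` are made-up helper names, and the second one assumes that `ollama ps` reports a fully offloaded model as `100% GPU` in its PROCESSOR column.

```shell
# List compute processes currently holding VRAM (pid, name, memory used).
vram_holders() {
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
}

# Read `ollama ps` output on stdin; succeed only if the named model is fully on the GPU.
on_gpu() {
  grep "^$1" | grep -q '100% GPU'
}

# Usage:
#   vram_holders
#   ollama ps | on_gpu wizardlm2:7b && echo "fully on GPU" || echo "partly on CPU or not loaded"
```

The compute-apps query does not list graphics clients such as browsers or desktop compositors; the plain `nvidia-smi` process table shows both kinds.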
"Is this the real Microsoft release?"
Not exactly. Microsoft published WizardLM-2, then withdrew the official repository pending a toxicity re-test. The weights you pull here are preserved by the community — the site catalogues the model against the dreamgen/WizardLM-2-7B mirror (Apache-2.0), and Ollama serves the same line under wizardlm2. The model works as documented; just be aware the original first-party repo is no longer live.
Slow generation or short answers
Generation throughput depends on your Ollama version, driver, and how much context you feed it. The ~70.8 tokens/s figure was measured on Ollama 0.5.4; a much older or newer build, a long prompt, or a near-full KV cache can pull it down. If your numbers differ materially, report them via /contribute.
No other widely-reported issues for this pair. Report problems via the submission form.