How much VRAM does Falcon2 11B need?

About 8 GB — the minimum this recipe targets.

How hard is this setup?

Beginner: follow the steps below.

Falcon2 11B on RTX 3060 Ti: Run TII's 11B LLM Locally at the 8 GB Ceiling

What You'll Build

A local text LLM: Falcon2 11B, TII's 11-billion-parameter causal decoder model, generating and continuing text on your own machine with no cloud and no API key. It runs through Ollama at the Q4 quant on an 8 GB RTX 3060 Ti.

Hardware data: RTX 3060 Ti (8GB VRAM) · ~31.2 tokens/s generation (Q4, Ollama 0.5.4) · the largest model in the 8 GB class, fitting with tight headroom · See benchmark data

ℹ️ This is a raw, pretrained base model — not an instruction-tuned chat model. The tiiuae/falcon-11B card states plainly: "This is a raw, pretrained model, which should be further finetuned for most usecases." It excels at text continuation and few-shot prompting, but it will not follow chat-style instructions the way an -Instruct model does. If you want turn-taking chat, prompt it few-shot or pick an instruction-tuned model instead — the numbers below are for the base Falcon2 11B specifically.

Requirements

Component	Minimum	Tested
GPU	8GB VRAM	RTX 3060 Ti (8GB)
RAM	16GB	—
Storage	~7 GB (Q4 weights)	6.4 GB model pull
Software	Ollama, NVIDIA driver + CUDA	Ollama 0.5.4

Falcon2 11B is a causal decoder-only large language model from the Technology Innovation Institute (TII). Per the model card, it has 11B parameters and was trained on over 5,000B tokens of RefinedWeb enhanced with curated corpora. It is released under the "TII Falcon License 2.0", which the card describes as "the permissive Apache 2.0-based software license" (huggingface.co/tiiuae/falcon-11B). Ollama's listing agrees, describing Falcon2 as "an 11B parameters causal decoder-only model built by TII and trained over 5T tokens" (ollama.com/library/falcon2).

⚠️ Right at the 8 GB wall. At 11B parameters, Falcon2 is the largest model you can reasonably run on an 8 GB card, and only at Q4. The falcon2:11b weights are a 6.4 GB Q4_0 download (ollama.com/library/falcon2); the cited benchmark peaks at 8.0 GB on this 8 GB card — effectively full, with no spare headroom. The modest ~31.2 tokens/s reflects exactly this tight fit: a bigger model squeezed onto a small card. Close other GPU consumers before you run — browsers with hardware acceleration, other models, even a second monitor's compositor can tip you into an out-of-memory error or a slow CPU fallback. This recipe documents the Q4 quant specifically because anything heavier does not fit 8 GB.
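
To make the tight fit concrete, here is a minimal sketch of the arithmetic, using only the two figures quoted above (a 6.4 GB download on an 8 GB card). It treats both as decimal GB, which is an assumption. Whatever is left after the weights has to hold the KV cache, the CUDA context, and compute buffers.

```python
# Figures quoted in this recipe; treating both as decimal GB is an assumption.
CARD_VRAM_GB = 8.0   # RTX 3060 Ti
WEIGHTS_GB = 6.4     # falcon2:11b Q4_0 download size


def vram_budget(card_gb: float, weights_gb: float) -> tuple[float, float]:
    """Return (GB left after the weights, share of the card the weights occupy)."""
    left = round(card_gb - weights_gb, 2)
    share = round(weights_gb / card_gb, 3)
    return left, share


left_gb, weights_share = vram_budget(CARD_VRAM_GB, WEIGHTS_GB)
print(f"weights take {weights_share:.0%} of the card, leaving about {left_gb} GB")
```

The weights alone take roughly four fifths of the card before a single token of context is cached.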

Installation

1. Install Ollama

Download and install Ollama for your OS from ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Confirm that Ollama is installed and that the NVIDIA driver sees your GPU:

ollama --version
nvidia-smi
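
If you would rather check free VRAM from a script than read the nvidia-smi table by eye, the sketch below wraps nvidia-smi's CSV query mode. The helper names are made up for illustration, and the sample row in the docstring is a hypothetical value, not a measurement.

```python
import subprocess

# nvidia-smi's CSV query mode; with "nounits" the memory columns are plain MiB numbers.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row such as 'NVIDIA GeForce RTX 3060 Ti, 412, 8192' (MiB)."""
    # Split from the right so a comma inside the GPU name cannot shift the columns.
    name, used, total = [field.strip() for field in line.rsplit(",", 2)]
    used_mib, total_mib = int(used), int(total)
    return {"name": name, "used_mib": used_mib, "total_mib": total_mib,
            "free_mib": total_mib - used_mib}


def read_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU; needs the NVIDIA driver installed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]
```

Calling read_gpus() before pulling the model shows how much of the 8 GB is already held by the desktop and other applications.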

2. Pull the Falcon2 11B weights

The falcon2:11b tag is a 6.4 GB Q4_0 download, and latest resolves to the same 11B build (ollama.com/library/falcon2):

ollama pull falcon2:11b

Running

Start an interactive session:

ollama run falcon2:11b
>>> The three laws of thermodynamics are

Because this is a base model, it shines at continuation rather than chat. Feed it the start of what you want and let it complete:

ollama run falcon2:11b "Translate to German. English: The weather is nice today. German:"

The model loads onto the GPU and streams its output token by token. The first run after a fresh pull spends a moment loading weights into VRAM; subsequent prompts in the same session skip that load and start streaming sooner.

Results

Speed: ~31.2 tokens/s generation at Q4 on the RTX 3060 Ti, measured by DatabaseMart under their "Eval Rate(tokens/s)" column (Ollama 0.5.4). For a plain text LLM like Falcon2, this is the generation speed — the rate at which the model writes its output. It is modest because an 11B model fills an 8 GB card to the brim, leaving the GPU little slack.
VRAM usage: The cited benchmark data records an 8.0 GB peak on this 8 GB card, i.e. effectively full. DatabaseMart's own table lists Falcon2's GPU vRAM utilization at 85% on the card it benchmarked. Either way, plan for essentially no spare VRAM. See /check
Quality notes: This is a single commercial benchmark source. Numbers will vary with your Ollama version, driver, and context length. If you measure your own throughput or peak VRAM on a 3060 Ti, please contribute it via /contribute so the next reader gets a corroborating datapoint.
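
To produce your own datapoint, one option is to compute the eval rate from Ollama's local REST API. The sketch below assumes the server is running on its default port, 11434. It relies on the eval_count and eval_duration fields of a non-streaming /api/generate response, where eval_duration is in nanoseconds. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(prompt: str, model: str = "falcon2:11b") -> dict:
    """Send one non-streaming request to a running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server up, tokens_per_second(generate("The three laws of thermodynamics are")) gives a figure comparable to the eval rate quoted above. Running ollama run falcon2:11b --verbose prints the same statistic without any code.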

For the full benchmark data, see /check/falcon2-11b/rtx-3060-ti.

Troubleshooting

Out of memory / model falls back to CPU

At 8.0 GB peak on an 8 GB card there is zero margin, and Falcon2 11B is the biggest model in this class — it is the most likely of any 8 GB-class LLM to tip over. If you see an OOM error or generation suddenly crawls, something else is holding VRAM. Run nvidia-smi to see what is resident, close it, and retry. Do not try a heavier quant on this card — the Q4_0 build is already the only one that fits. If you need reliable headroom, drop to a smaller 7B-class model instead.
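
A crawl in generation speed usually means part of the model has been offloaded to the CPU. One way to confirm this, sketched below, is to compare the size and size_vram fields that Ollama's /api/ps endpoint reports for each loaded model. The default port and the helper names are assumptions for illustration.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # lists models currently loaded by Ollama


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model that sits in VRAM, from size_vram and size (bytes)."""
    return model_entry["size_vram"] / model_entry["size"]


def loaded_models() -> list[dict]:
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(PS_URL) as reply:
        return json.load(reply)["models"]


def report(models: list[dict]) -> list[str]:
    """Summarise, per model, how much of it is resident on the GPU."""
    lines = []
    for entry in models:
        frac = gpu_fraction(entry)
        status = "fully on GPU" if frac >= 0.999 else "partly offloaded to CPU"
        lines.append(f"{entry['name']}: {frac:.0%} in VRAM, {status}")
    return lines
```

Run report(loaded_models()) while the model is loaded. Anything below 100% in VRAM means some other process is holding memory that Falcon2 needs.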

Short context / context runs out

The falcon2:11b Ollama tag ships with a 2K context window (ollama.com/library/falcon2). Raising the context length loads more KV cache, which eats into the already-tight 8 GB budget on this card. If you raise the context and hit an OOM, lower it again — on an 8 GB card the 11B weights alone fill most of the VRAM, leaving little room for a large KV cache.
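
For a rough sense of how quickly the KV cache grows with context, the sketch below uses the standard estimate of keys plus values, per layer, per token. The architecture defaults (60 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions recalled from the published Falcon2 11B configuration, not figures from this recipe. Check them against the model's config.json before relying on the output.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 60, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: keys + values, for every layer, for every token.

    The defaults are ASSUMED Falcon2 11B values with a 16-bit cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx


for ctx in (2048, 4096, 8192):
    print(f"num_ctx={ctx}: ~{kv_cache_bytes(ctx) / 2**20:.0f} MiB of KV cache")
```

Under these assumptions the cache grows from about 480 MiB at 2K to about 1920 MiB at 8K, which is more than the room left beside a 6.4 GB model on this card. In Ollama the context length is the num_ctx option, set per API request under options or with /set parameter num_ctx inside an interactive session.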

It won't follow my chat instructions

That is expected — Falcon2 11B is a base model, not instruction-tuned. The model card is explicit: "This is a raw, pretrained model, which should be further finetuned for most usecases." (huggingface.co/tiiuae/falcon-11B) For best results, prompt it few-shot (show it one or two examples of the pattern you want), or use a fine-tuned / instruction model if you need conversational behaviour.
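
A few-shot prompt is just the pattern repeated with the last answer left open. The sketch below builds one for the translation example used earlier. The helper name and the two example sentence pairs are made up for illustration.

```python
def few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str,
                    src: str = "English", dst: str = "German") -> str:
    """Build a continuation-style prompt: task line, solved examples, open query."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{src}: {source}")
        lines.append(f"{dst}: {target}")
    lines.append(f"{src}: {query}")
    lines.append(f"{dst}:")  # left open so the base model completes it
    return "\n".join(lines)


prompt = few_shot_prompt(
    "Translate to German.",
    [("Good morning.", "Guten Morgen."), ("Thank you very much.", "Vielen Dank.")],
    "The weather is nice today.",
)
print(prompt)
```

Pass the resulting string to ollama run falcon2:11b as the prompt. A base model tends to keep generating further pairs after the answer, so stop reading at the first line break.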

No other widely-reported issues for this pair. Report problems via the submission form.