How much VRAM does Laguna XS 2.1 need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Laguna XS 2.1 on RTX 3090 Ti: Local Agentic Coding via Ollama / llama.cpp + OpenHands (24GB Tier)

What You'll Build

A local, private agentic-coding setup: Laguna XS 2.1 — Poolside's open (OpenMDW-1.1) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint on a single 24GB RTX 3090 Ti and driven by the OpenHands coding agent (with Aider as an alternative) so the model can read your repo, run shell commands, and edit files. Laguna is a reasoning model with native interleaved thinking and tool-calling, built specifically for "agentic coding and long-horizon work on a local machine" per the Laguna-XS-2.1 model card. This recipe uses the Q4_K_M GGUF (20.27 GB on disk) — the smallest official quant, and the reason 24 GB is the floor for this model. The RTX 3090 Ti shares that 24 GB budget with the plain 3090, with marginally higher clocks and memory bandwidth — faster on the same amount of VRAM.

Hardware data: RTX 3090 Ti (24GB VRAM) · Laguna XS 2.1 Q4_K_M GGUF (20.27 GB weights) · bounded-context, tight fit · See benchmark data

⚠️ Upstream llama.cpp does not yet support this model — Ollama is the turnkey path. The official GGUF card states plainly: "llama.cpp support is not yet upstreamed." A stock brew install llama.cpp / release binary will not load Laguna until ggml-org/llama.cpp#25165 merges (as of this writing that PR is still open). Poolside ships a first-party Ollama build that works today (ollama pull laguna-xs-2.1), so this recipe leads with Ollama and gives the llama.cpp PR-branch build as the second path. See Installation.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. Laguna XS 2.1 is a 33B-total-parameter Mixture-of-Experts with 3B activated per token (256 experts + 1 shared, 8 active/token), per the model card. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in VRAM, so the memory footprint is the full quant file — 20.27 GB at Q4_K_M, not some smaller "3B active" fraction. Do not expect the low active-parameter count to shrink the VRAM requirement.

ℹ️ Card too small? Route to a smaller model. There is no official Q3/Q2/Q5/Q6/Q8 build — the GGUF repo ships exactly two files: Q4_K_M (20.27 GB) and BF16 (66.93 GB). Q4_K_M is the floor, so a card with less than 24 GB cannot run this model from the official GGUF. If your GPU has under 24 GB, pick a smaller coding model rather than a lower quant that does not exist.

Requirements

Component	Minimum	Tested
GPU	24GB VRAM (this is the Q4_K_M floor)	RTX 3090 Ti (24GB, Ampere GA102, sm_86)
RAM	16GB system RAM (32GB comfortable for the agent + repo)	—
Storage	~19GB (the Q4_K_M GGUF is 20.27 GB)	18.88 GiB (`Laguna-XS-2.1-Q4_K_M.gguf`)
Software	Ollama, or a PR-branch llama.cpp (CUDA) build; Python 3.10+ for OpenHands / Aider	Ollama, OpenHands
License	OpenMDW-1.1 (commercial-OK)	—

The Q4_K_M GGUF file is 20.27 GB (20,274,304,032 bytes) per the Laguna-XS-2.1-GGUF file tree; the only other published quant is BF16 66.93 GB (66,930,230,080 bytes) — which needs a 64–80 GB card — so Q4_K_M is the only official quant that fits a 24 GB GPU. The model is licensed under OpenMDW-1.1 (a permissive, commercial-OK license), per the Laguna-XS-2.1-GGUF model card.

Installation

Pick one of the two serving paths below, then install the agent client. Ollama is the recommended path because upstream llama.cpp does not yet support the laguna architecture (see the Known-issue box above).

Path A — Ollama (recommended, turnkey)

Poolside publishes a first-party Ollama build. One command pulls the Q4_K_M weights and registers the model, per the Laguna-XS-2.1 model card and the official Ollama library page:

ollama pull laguna-xs-2.1

The default laguna-xs-2.1 / q4_K_M tag is the 20 GB quant this card needs; Ollama also lists a q8_0 (36 GB) and bf16 (67 GB) tag, but neither fits a 24 GB RTX 3090 Ti — stay on the default. Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 once the model is pulled and running.

Path B — llama.cpp (build from the support PR)

Upstream llama.cpp cannot yet load Laguna. The GGUF card's own instruction is to build from the PR branch that adds support — verbatim from the Laguna-XS-2.1-GGUF model card:

# Build llama.cpp from the PR branch that adds Laguna support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/25165/head:laguna && git checkout laguna
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build -j

# Download the Q4_K_M GGUF
huggingface-cli download poolside/Laguna-XS-2.1-GGUF \
  Laguna-XS-2.1-Q4_K_M.gguf --local-dir ~/models/Laguna-XS-2.1-GGUF

-DCMAKE_CUDA_ARCHITECTURES=86 targets the RTX 3090 Ti's Ampere (GA102, sm_86). The Q4_K_M GGUF is an integer quant format, so no special path is needed — Ampere has no FP8 tensor cores, and none are required to run this quant. Once ggml-org/llama.cpp#25165 merges, a stock build (or brew install llama.cpp) will work without the PR checkout — verify the PR's state before assuming you still need the branch.

Install the agent client

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Aider is a lighter terminal-based alternative that also speaks the OpenAI API:

pip install aider-install && aider-install

Running

1. Serve Laguna with a bounded context

Ollama (Path A) — start the model server; Ollama exposes the OpenAI-compatible API at http://localhost:11434/v1:

ollama run laguna-xs-2.1

Ollama's default context is small; raise it deliberately (e.g. /set parameter num_ctx 32768 in the interactive session, or an OLLAMA_CONTEXT_LENGTH env var) rather than jumping to the model's full 262,144-token window, which no 24 GB card can hold once the weights are resident.

llama.cpp (Path B) — serve the downloaded GGUF, applying the model's built-in chat template so reasoning and tool-calling work. This is the GGUF card's own llama-server invocation, adapted to a bounded context, per the Laguna-XS-2.1-GGUF model card:

./build/bin/llama-server \
  -m ~/models/Laguna-XS-2.1-GGUF/Laguna-XS-2.1-Q4_K_M.gguf \
  --jinja \
  -ngl 99 \
  -c 32768 \
  --port 8000

--jinja applies the model's bundled chat_template.jinja — required for correct reasoning and tool-calling; without it agentic edits misbehave.
-ngl 99 offloads all layers to the GPU (the 20.27 GB Q4 file must sit in VRAM — see the MoE note above).
-c 32768 caps context at 32K. The card documents up to 262,144, but a bounded value keeps KV-cache memory reasonable once ~20 GB of weights are already resident on a 24 GB card. Laguna quantizes its KV cache to FP8 natively and uses sliding-window attention (a 512-token window on 30 of its 40 layers, per the model card), so its KV growth is gentler than a dense full-attention model — but the weights still dominate the budget, so start bounded and raise -c while watching nvidia-smi.

Both paths expose an OpenAI-compatible API (:11434/v1 for Ollama, :8000/v1 for llama.cpp) that the agent client below points at.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/laguna-xs-2.1"                 # match your served-model name
export LLM_BASE_URL="http://localhost:11434/v1"         # Ollama; use :8000/v1 for llama.cpp
export LLM_API_KEY="EMPTY"                              # any non-empty string; local servers don't check it

openhands

OpenHands will now use Laguna to plan, run shell commands, and edit files in your workspace. Its interleaved reasoning drives planning and its native tool-calling drives the file/shell actions. (Aider works the same way — point --openai-api-base at the same URL and pass --model openai/laguna-xs-2.1.)

Results

VRAM usage: The Q4_K_M weights are 20.27 GB on disk (18.88 GiB) and must be held resident, leaving only ~3–4 GB of the RTX 3090 Ti's 24 GB for the KV cache and activations — which is why context is bounded above. The native FP8 KV cache and sliding-window attention ease KV pressure relative to a dense model, but do not change the fact that the weights alone nearly fill the card. Weight and per-quant file sizes are verified via the Laguna-XS-2.1-GGUF file tree.
Quality notes: Poolside reports Laguna XS 2.1 scoring SWE-bench Verified 70.9%, SWE-bench Multilingual 63.1%, SWE-Bench Pro (public) 47.6%, and Terminal-Bench 2.0 37.5% — per the vendor benchmark table on the Laguna-XS-2.1 model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. The card's benchmarking used temperature 1.0, top_k 20, top_p 1 with thinking enabled.

There is no community throughput benchmark for Laguna XS 2.1 on the RTX 3090 Ti yet — /check/laguna-xs-2-1/rtx-3090-ti has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is thinking content you discard.)

For the full benchmark data, see /check/laguna-xs-2-1/rtx-3090-ti.

Troubleshooting

`llama-server` reports "unknown model architecture 'laguna'" / won't load the GGUF

Your llama.cpp build predates Laguna support. Upstream llama.cpp cannot load this model yet — the GGUF card states "llama.cpp support is not yet upstreamed." Either build from ggml-org/llama.cpp#25165 (see Path B), or use the turnkey Ollama path (Path A), which ships a first-party build that works today. Once the PR merges, upgrade to a stock build.

Out of memory at launch, or the KV cache won't fit

The 20.27 GB of weights leave only ~3–4 GB on a 24GB card, so OOM at startup usually means the context is set too high. Lower -c (try -c 16384 or -c 8192 on llama.cpp; lower num_ctx on Ollama) and close any other GPU app before launching. Watch nvidia-smi during a real agent task — a hard coding problem produces a long thinking block that grows the KV cache mid-generation, so size for the peak, not the idle load. The RTX 3090 Ti's higher clocks do not enlarge its 24 GB budget; if you need the model's full 256K context, that requires a 32/48 GB (e.g. RTX 5090) or larger card.

The agent botches tool calls / doesn't emit reasoning

Make sure the built-in chat template is applied. On llama.cpp that means passing --jinja (Laguna ships a chat_template.jinja for reasoning and tool-calling); without it the model's native tool-call and thinking blocks are not formatted correctly and agent clients misbehave. Ollama applies the packaged template automatically.

No other widely-reported issues on the RTX 3090 Ti yet. If you run Laguna XS 2.1 on this card, report your throughput and any problems via the submission form so we can seed real benchmark data.