How much VRAM does Laguna XS 2.1 need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Laguna XS 2.1 on RTX 5090: Local Agentic Coding via Ollama / llama.cpp + OpenHands (32GB Tier)

What You'll Build

A local, private agentic-coding setup: Laguna XS 2.1 — Poolside's open (OpenMDW-1.1) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint on a single 32GB RTX 5090 and driven by the OpenHands coding agent (with Aider as an alternative) so the model can read your repo, run shell commands, and edit files. Laguna is a reasoning model with native interleaved thinking and tool-calling, built specifically for "agentic coding and long-horizon work on a local machine" per the Laguna-XS-2.1 model card. This recipe uses the Q4_K_M GGUF (20.27 GB on disk) — the smallest official quant. On the 5090's 32 GB the weights leave a comfortable ~10–12 GB for the KV cache, so this is the card that lets you push context meaningfully past the tight 24 GB tier.

Hardware data: RTX 5090 (32GB VRAM) · Laguna XS 2.1 Q4_K_M GGUF (20.27 GB weights) · roomy KV headroom · See benchmark data

⚠️ Upstream llama.cpp does not yet support this model — Ollama is the turnkey path. The official GGUF card states plainly: "llama.cpp support is not yet upstreamed." A stock brew install llama.cpp / release binary will not load Laguna until ggml-org/llama.cpp#25165 merges (as of this writing that PR is still open). Poolside ships a first-party Ollama build that works today (ollama pull laguna-xs-2.1), so this recipe leads with Ollama and gives the llama.cpp PR-branch build as the second path. See Installation.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. Laguna XS 2.1 is a 33B-total-parameter Mixture-of-Experts with 3B activated per token (256 experts + 1 shared, 8 active/token), per the model card. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in VRAM, so the memory footprint is the full quant file — 20.27 GB at Q4_K_M, not some smaller "3B active" fraction. Do not expect the low active-parameter count to shrink the VRAM requirement.

ℹ️ The extra VRAM does not unlock a bigger quant on this card. There is no official Q3/Q2/Q5/Q6/Q8 build on Hugging Face — the GGUF repo ships exactly two files: Q4_K_M (20.27 GB) and BF16 (66.93 GB). BF16 needs a 64–80 GB card, so it is out of reach on 32 GB. Ollama does list a q8_0 tag at 36 GB — but that is larger than the 5090's 32 GB and will not fit, so it is a trap on this card. Q4_K_M is the right choice on the 5090; the 32 GB simply gives you more room for context than the 24 GB tier, not a heavier quant.

Requirements

Component	Minimum	Tested
GPU	24GB VRAM (this is the Q4_K_M floor)	RTX 5090 (32GB, Blackwell GB202, sm_120)
RAM	16GB system RAM (32GB comfortable for the agent + repo)	—
Storage	~19GB (the Q4_K_M GGUF is 20.27 GB)	18.88 GiB (`Laguna-XS-2.1-Q4_K_M.gguf`)
Software	Ollama, or a PR-branch llama.cpp (CUDA) build; Python 3.10+ for OpenHands / Aider	Ollama, OpenHands
License	OpenMDW-1.1 (commercial-OK)	—

The Q4_K_M GGUF file is 20.27 GB (20,274,304,032 bytes) per the Laguna-XS-2.1-GGUF file tree; the only other Hugging Face quant is BF16 66.93 GB (66,930,230,080 bytes) — which needs a 64–80 GB card. So even on the 5090's 32 GB, Q4_K_M is the quant to run. The model is licensed under OpenMDW-1.1 (a permissive, commercial-OK license), per the Laguna-XS-2.1-GGUF model card.

Installation

Pick one of the two serving paths below, then install the agent client. Ollama is the recommended path because upstream llama.cpp does not yet support the laguna architecture (see the Known-issue box above).

Path A — Ollama (recommended, turnkey)

Poolside publishes a first-party Ollama build. One command pulls the Q4_K_M weights and registers the model, per the Laguna-XS-2.1 model card and the official Ollama library page:

ollama pull laguna-xs-2.1

The default laguna-xs-2.1 / q4_K_M tag is the 20 GB quant to use. Ollama also lists a q8_0 (36 GB) and bf16 (67 GB) tag, but neither fits the 5090's 32 GB — the 36 GB q8_0 is the tempting near-miss, so stay on the default Q4_K_M. Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 once the model is pulled and running.

Path B — llama.cpp (build from the support PR)

Upstream llama.cpp cannot yet load Laguna. The GGUF card's own instruction is to build from the PR branch that adds support — verbatim from the Laguna-XS-2.1-GGUF model card:

# Build llama.cpp from the PR branch that adds Laguna support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/25165/head:laguna && git checkout laguna
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build -j

# Download the Q4_K_M GGUF
huggingface-cli download poolside/Laguna-XS-2.1-GGUF \
  Laguna-XS-2.1-Q4_K_M.gguf --local-dir ~/models/Laguna-XS-2.1-GGUF

-DCMAKE_CUDA_ARCHITECTURES=120 targets the RTX 5090's Blackwell (GB202, sm_120). The Q4_K_M GGUF is an integer quant format, so no special path is needed to run it. Once ggml-org/llama.cpp#25165 merges, a stock build (or brew install llama.cpp) will work without the PR checkout — verify the PR's state before assuming you still need the branch.

Install the agent client

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Aider is a lighter terminal-based alternative that also speaks the OpenAI API:

pip install aider-install && aider-install

Running

1. Serve Laguna with a generous but bounded context

Ollama (Path A) — start the model server; Ollama exposes the OpenAI-compatible API at http://localhost:11434/v1:

ollama run laguna-xs-2.1

Ollama's default context is small; raise it deliberately (e.g. /set parameter num_ctx 131072 in the interactive session, or an OLLAMA_CONTEXT_LENGTH env var). The 5090's ~10–12 GB of post-weights headroom lets you push toward 64K–128K, but do not assume the model's full 262,144-token window fits comfortably even here — start high and watch memory.

llama.cpp (Path B) — serve the downloaded GGUF, applying the model's built-in chat template so reasoning and tool-calling work. This is the GGUF card's own llama-server invocation, adapted to a generous but bounded context, per the Laguna-XS-2.1-GGUF model card:

./build/bin/llama-server \
  -m ~/models/Laguna-XS-2.1-GGUF/Laguna-XS-2.1-Q4_K_M.gguf \
  --jinja \
  -ngl 99 \
  -c 131072 \
  --port 8000

--jinja applies the model's bundled chat_template.jinja — required for correct reasoning and tool-calling; without it agentic edits misbehave.
-ngl 99 offloads all layers to the GPU (the 20.27 GB Q4 file must sit in VRAM — see the MoE note above).
-c 131072 caps context at 128K. The card documents up to 262,144; on the 5090's 32 GB the ~10–12 GB left after the weights supports a much larger window than the 24 GB tier, though still not the full 262K comfortably. Laguna quantizes its KV cache to FP8 natively and uses sliding-window attention (a 512-token window on 30 of its 40 layers, per the model card), so its KV growth is gentler than a dense full-attention model — push -c up while watching nvidia-smi and back off if a long thinking block pressures memory.

Both paths expose an OpenAI-compatible API (:11434/v1 for Ollama, :8000/v1 for llama.cpp) that the agent client below points at.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/laguna-xs-2.1"                 # match your served-model name
export LLM_BASE_URL="http://localhost:11434/v1"         # Ollama; use :8000/v1 for llama.cpp
export LLM_API_KEY="EMPTY"                              # any non-empty string; local servers don't check it

openhands

OpenHands will now use Laguna to plan, run shell commands, and edit files in your workspace. Its interleaved reasoning drives planning and its native tool-calling drives the file/shell actions. (Aider works the same way — point --openai-api-base at the same URL and pass --model openai/laguna-xs-2.1.)

Results

VRAM usage: The Q4_K_M weights are 20.27 GB on disk (18.88 GiB) and must be held resident, leaving ~10–12 GB of the RTX 5090's 32 GB for the KV cache and activations — enough to run a much larger context than the tight 24 GB tier, which is why the context values above are higher. The native FP8 KV cache and sliding-window attention stretch that headroom further. Weight and per-quant file sizes are verified via the Laguna-XS-2.1-GGUF file tree.
Quality notes: Poolside reports Laguna XS 2.1 scoring SWE-bench Verified 70.9%, SWE-bench Multilingual 63.1%, SWE-Bench Pro (public) 47.6%, and Terminal-Bench 2.0 37.5% — per the vendor benchmark table on the Laguna-XS-2.1 model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. The card's benchmarking used temperature 1.0, top_k 20, top_p 1 with thinking enabled.

There is no community throughput benchmark for Laguna XS 2.1 on the RTX 5090 yet — /check/laguna-xs-2-1/rtx-5090 has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is thinking content you discard.)

For the full benchmark data, see /check/laguna-xs-2-1/rtx-5090.

Troubleshooting

`llama-server` reports "unknown model architecture 'laguna'" / won't load the GGUF

Your llama.cpp build predates Laguna support. Upstream llama.cpp cannot load this model yet — the GGUF card states "llama.cpp support is not yet upstreamed." Either build from ggml-org/llama.cpp#25165 (see Path B), or use the turnkey Ollama path (Path A), which ships a first-party build that works today. Once the PR merges, upgrade to a stock build.

Out of memory when pushing context high

The 20.27 GB of weights leave ~10–12 GB on the 32 GB 5090 — generous, but not unlimited. If you OOM, you set the context too high for a peak thinking block. Lower -c (drop from 131072 toward 65536 on llama.cpp; lower num_ctx on Ollama) and close any other GPU app before launching. Watch nvidia-smi during a real agent task — a hard coding problem produces a long thinking block that grows the KV cache mid-generation, so size for the peak, not the idle load. Note the model's full 256K context still may not fit comfortably even on 32 GB; that headroom is for large-but-bounded windows, not the full window.

The agent botches tool calls / doesn't emit reasoning

Make sure the built-in chat template is applied. On llama.cpp that means passing --jinja (Laguna ships a chat_template.jinja for reasoning and tool-calling); without it the model's native tool-call and thinking blocks are not formatted correctly and agent clients misbehave. Ollama applies the packaged template automatically.

No other widely-reported issues on the RTX 5090 yet. If you run Laguna XS 2.1 on this card, report your throughput and any problems via the submission form so we can seed real benchmark data.