How much VRAM does Laguna XS 2.1 need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced. Follow the steps below.

Laguna XS 2.1 on Apple M3 Max: Local Agentic Coding via Ollama + OpenHands (48GB Apple / q8_0-capable)

What You'll Build

A local, private agentic-coding setup: Laguna XS 2.1 — Poolside's open (OpenMDW-1.1) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint on an Apple M3 Max (48GB unified memory) via Ollama on Metal, and driven by the OpenHands coding agent (with Aider as an alternative) so the model can read your repo, run shell commands, and edit files. Laguna is a reasoning model with native interleaved thinking and tool-calling, built specifically for "agentic coding and long-horizon work on a local machine" per the Laguna-XS-2.1 model card. On a 48 GB Apple machine the unified memory is large enough to step above the Q4_K_M floor: this recipe recommends the q8_0 tag (36 GB) as a quality upgrade, with Q4_K_M (20.27 GB) as the lighter option that frees memory for very large context.

Hardware data: Apple M3 Max (48GB unified memory, Metal) · Laguna XS 2.1 q8_0 GGUF (36 GB, recommended) or Q4_K_M (20.27 GB) · bounded-context · See benchmark data

⚠️ Upstream llama.cpp does not yet support this model — Ollama is the turnkey path. The official GGUF card states plainly: "llama.cpp support is not yet upstreamed." A stock brew install llama.cpp / release binary will not load Laguna until ggml-org/llama.cpp#25165 merges (as of this writing that PR is still open). Poolside ships a first-party Ollama build that works today (ollama pull laguna-xs-2.1), and on macOS Ollama uses Metal automatically — no CUDA, no manual GPU flags. This recipe leads with Ollama and gives the llama.cpp Metal PR-branch build as the second path. See Installation.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost. Laguna XS 2.1 is a 33B-total-parameter Mixture-of-Experts with ~3B activated per token (256 experts + 1 shared, 8 active/token), per the model card. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in (unified) memory, so the footprint is the full quant file — 36 GB at q8_0 or 20.27 GB at Q4_K_M, not some smaller "3B active" fraction. Do not expect the low active-parameter count to shrink the memory requirement.
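As a rough sanity check on the point above, dividing each file size by the total parameter count gives the effective bits per weight for that quant. This is a minimal sketch, assuming decimal gigabytes and the rounded 33B total from the model card, so the results are approximate; `bits_per_weight` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: effective bits per weight = file size in bits / total parameters.
# Assumes decimal GB and the rounded 33B total-parameter figure, so results are approximate.
bits_per_weight() {
  # $1 = file size in GB, $2 = total parameters in billions
  awk -v gb="$1" -v params_b="$2" 'BEGIN { printf "%.1f\n", gb * 8 / params_b }'
}

bits_per_weight 20.27 33   # Q4_K_M
bits_per_weight 36 33      # q8_0
bits_per_weight 66.93 33   # BF16
```

The three results (about 4.9, 8.7, and 16.2 bits) only look like sensible quant widths when the divisor is the full 33B, which is consistent with every expert being stored in the file and therefore loaded into memory.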

ℹ️ Apple unified memory is shared with the OS — respect the GPU cap. On Apple Silicon the CPU and GPU share one memory pool, but macOS reserves a slice and by default only lets the GPU wire down roughly 70–75% of total (raiseable via sudo sysctl iogpu.wired_limit_mb=<value>). On a 48 GB machine that is roughly 34–36 GB of GPU-usable memory by default — the 36 GB q8_0 weights sit right at that line, so close other heavy apps and, if a q8_0 load fails, either raise iogpu.wired_limit_mb or drop to Q4_K_M. Q4_K_M (20.27 GB) leaves generous headroom.
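The cap arithmetic above can be sketched in a few lines of shell. This is a minimal sketch, assuming the 70–75% default range and the sizes quoted in this recipe, all treated as plain GB; the helper names and the 8 GB reserve passed to `wired_limit_mb` are illustrative assumptions, not values from Apple or Ollama.

```shell
# Hypothetical helpers for the unified-memory budget; all figures in GB as quoted in this recipe.
gpu_cap_gb() {
  # $1 = total unified memory (GB), $2 = percent the GPU may wire down
  awk -v total="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", total * pct / 100 }'
}

headroom_gb() {
  # $1 = GPU-usable cap (GB), $2 = weight file size (GB)
  awk -v cap="$1" -v weights="$2" 'BEGIN { printf "%.2f\n", cap - weights }'
}

wired_limit_mb() {
  # $1 = total unified memory (GB), $2 = GB to leave for macOS and other apps (an assumption)
  echo $(( ($1 - $2) * 1024 ))
}

gpu_cap_gb 48 70        # lower end of the default cap
gpu_cap_gb 48 75        # upper end of the default cap
headroom_gb 36 36       # q8_0 against the upper-end cap
headroom_gb 36 20.27    # Q4_K_M against the same cap
wired_limit_mb 48 8     # candidate value for iogpu.wired_limit_mb
```

With these inputs the default cap works out to 33.6 to 36.0 GB, q8_0 leaves 0.00 GB under the upper-end cap while Q4_K_M leaves 15.73 GB, and an 8 GB reserve gives 40960 for iogpu.wired_limit_mb. A value set with sudo sysctl does not survive a reboot, so it has to be reapplied after restarting.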

ℹ️ Quant reality: Q4_K_M is the HF floor, q8_0 lives on Ollama. The HF GGUF repo ships exactly two files — Q4_K_M (20.27 GB) and BF16 (66.93 GB). There is no official Q3/Q2/Q5/Q6/Q8 in the HF repo. The q8_0 (36 GB) quant this recipe recommends exists only on the Ollama library (3 tags: q4_K_M/latest = 20 GB, q8_0 = 36 GB, bf16 = 67 GB). Apple's unified memory is exactly what makes q8_0 worthwhile here — a mid-quant quality bump that a 24 GB CUDA card can't hold.

Requirements

Component	Minimum	Tested
GPU	24GB (Q4_K_M floor); 36GB+ usable to run q8_0	Apple M3 Max (48GB unified memory, Metal)
RAM	Unified with GPU (48GB total on this config)	48GB unified
Storage	~36GB for q8_0 (or ~20GB for Q4_K_M)	36 GB (`laguna-xs-2.1:q8_0`)
Software	Ollama (Metal), or a PR-branch llama.cpp (Metal) build; Python 3.10+ for OpenHands / Aider	Ollama, OpenHands
License	OpenMDW-1.1 (commercial-OK)	—

The Q4_K_M GGUF file is 20.27 GB (20,274,304,032 bytes) and the only other HF-published quant is BF16 66.93 GB (66,930,230,080 bytes), per the Laguna-XS-2.1-GGUF file tree. The q8_0 (36 GB) tag used here is published on the Ollama library, not in the HF GGUF repo. The model is licensed under OpenMDW-1.1 (a permissive, commercial-OK license), per the Laguna-XS-2.1-GGUF model card.

Installation

Pick one of the two serving paths below, then install the agent client. Ollama is the recommended path because upstream llama.cpp does not yet support the laguna architecture (see the Known-issue box above), and on macOS Ollama drives Metal for you.

Path A — Ollama (recommended, turnkey, Metal)

Poolside publishes a first-party Ollama build. On a 48 GB M3 Max the q8_0 tag (36 GB) is the recommended quality upgrade over the default Q4_K_M; both are on the official Ollama library page:

# Recommended on 48GB: the q8_0 (36 GB) tag — higher quality than Q4_K_M
ollama pull laguna-xs-2.1:q8_0

# Lighter option (frees memory for very large context): the default Q4_K_M (20 GB)
ollama pull laguna-xs-2.1

Ollama uses Metal automatically on macOS and serves an OpenAI-compatible endpoint at http://localhost:11434/v1 once the model is pulled and running. There are no CUDA flags to set.

Path B — llama.cpp (build from the support PR, Metal)

Upstream llama.cpp cannot yet load Laguna. The GGUF card's own instruction is to build from the PR branch that adds support; on macOS you build it with Metal (which is the default), per the Laguna-XS-2.1-GGUF model card:

# Build llama.cpp from the PR branch that adds Laguna support (Metal is default on macOS)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/25165/head:laguna && git checkout laguna
cmake -B build -DGGML_METAL=ON
cmake --build build -j

# Download the Q4_K_M GGUF (the only GGUF the HF repo publishes)
huggingface-cli download poolside/Laguna-XS-2.1-GGUF \
  Laguna-XS-2.1-Q4_K_M.gguf --local-dir ~/models/Laguna-XS-2.1-GGUF

-DGGML_METAL=ON is the default on macOS (no CUDA on Apple Silicon), so Metal handles GPU offload. Note the HF GGUF repo only ships Q4_K_M and BF16 — the q8_0 tag is Ollama-only, so on the llama.cpp path you run Q4_K_M unless you quantize BF16 yourself. Once ggml-org/llama.cpp#25165 merges, a stock build (or brew install llama.cpp) will work without the PR checkout — verify the PR's state before assuming you still need the branch.
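After the download, the byte count published in the file tree (20,274,304,032 bytes for Q4_K_M) gives a quick integrity check before you try to load the file. The sketch below assumes the download path used above; `verify_size` and `bytes_to_gib` are hypothetical helper names.

```shell
# Hypothetical helper: compare a file's byte count with the size published in the file tree.
verify_size() {
  # $1 = path to the file, $2 = expected size in bytes
  actual=$(wc -c < "$1" | tr -d ' ')
  if [ "$actual" = "$2" ]; then
    echo "ok: $actual bytes"
  else
    echo "mismatch: got $actual bytes, expected $2" >&2
    return 1
  fi
}

# Hypothetical helper: convert bytes to binary gibibytes (2^30 bytes).
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

bytes_to_gib 20274304032   # Q4_K_M
bytes_to_gib 66930230080   # BF16

# Only check the file if it has actually been downloaded to the assumed path.
gguf="$HOME/models/Laguna-XS-2.1-GGUF/Laguna-XS-2.1-Q4_K_M.gguf"
if [ -f "$gguf" ]; then
  verify_size "$gguf" 20274304032
fi
```

The two conversions print 18.88 and 62.33, which is why tools that report GiB show smaller numbers than the 20.27 GB and 66.93 GB quoted here.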

Install the agent client

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Aider is a lighter terminal-based alternative that also speaks the OpenAI API:

pip install aider-install && aider-install

Running

1. Serve Laguna with a bounded context

Ollama (Path A) — start the model server; Ollama exposes the OpenAI-compatible API at http://localhost:11434/v1 and drives Metal automatically:

ollama run laguna-xs-2.1:q8_0

Ollama's default context is small; raise it deliberately (e.g. /set parameter num_ctx 32768 in the interactive session, or an OLLAMA_CONTEXT_LENGTH env var) rather than jumping to the model's full 262,144-token window. With 36 GB of q8_0 weights resident on a 48 GB machine, the full 256K context will not fit alongside the weights and the OS reservation, so start bounded and raise num_ctx while watching memory in Activity Monitor. If you want maximum context headroom, use the Q4_K_M tag instead — its smaller weights leave more of the 48 GB for the KV cache.
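The /set parameter command applies to the interactive session it is typed into. One way to give API clients such as OpenHands the same bounded context is a derived model built from a Modelfile. This is a minimal sketch: the name laguna-xs-2.1-32k and the helper `write_modelfile` are hypothetical, and 32768 is just the starting value suggested above.

```shell
# Sketch: bake a bounded context into a derived Ollama model.
# "laguna-xs-2.1-32k" is a hypothetical name chosen for this example.
write_modelfile() {
  # $1 = base tag, $2 = context length in tokens, $3 = output path
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

write_modelfile laguna-xs-2.1:q8_0 32768 ./Modelfile
cat ./Modelfile

# Then, with Ollama installed and the base tag already pulled:
#   ollama create laguna-xs-2.1-32k -f ./Modelfile
#   ollama run laguna-xs-2.1-32k
```

If you take this route, the served model name changes, so any client configuration has to reference the derived name rather than the base tag.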

llama.cpp (Path B) — serve the downloaded GGUF, applying the model's built-in chat template so reasoning and tool-calling work. This is the GGUF card's own llama-server invocation, adapted to a bounded context, per the Laguna-XS-2.1-GGUF model card:

./build/bin/llama-server \
  -m ~/models/Laguna-XS-2.1-GGUF/Laguna-XS-2.1-Q4_K_M.gguf \
  --jinja \
  -ngl 99 \
  -c 32768 \
  --port 8000

--jinja applies the model's bundled chat_template.jinja — required on llama.cpp for correct reasoning and tool-calling; without it agentic edits misbehave. (Ollama applies its template automatically, so this flag is a Path-B concern only.)
-ngl 99 offloads all layers to the GPU via Metal (the quant file must sit in unified memory — see the MoE note above).
-c 32768 caps context at 32K. The card documents up to 262,144, but a bounded value keeps KV-cache memory reasonable once the weights are resident. Laguna quantizes its KV cache to FP8 natively and uses sliding-window attention (a 512-token window on 30 of its 40 layers, per the model card), so its KV growth is gentler than a dense full-attention model. The weights still dominate the budget, so start bounded and raise -c while watching memory in Activity Monitor's Memory tab (sudo powermetrics --samplers gpu_power adds GPU load and power readings, but not memory use).
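To see why the sliding-window layers soften KV growth, the sketch below counts cached token slots summed over layers, using the 30-of-40 layer split and the 512-token window quoted above. It assumes sliding-window layers keep at most 512 tokens and the remaining 10 layers keep the full context; whether a given runtime actually sizes its cache this way is an assumption, and the helper names are hypothetical.

```shell
# Sketch: cached token slots summed over layers, under the assumptions stated above.
kv_token_layers() {
  # $1 = context length in tokens
  n=$1
  window=512; swa_layers=30; full_layers=10
  if [ "$n" -lt "$window" ]; then swa_tokens=$n; else swa_tokens=$window; fi
  echo $(( full_layers * n + swa_layers * swa_tokens ))
}

dense_token_layers() {
  # Same 40 layers, but with every layer caching the full context
  echo $(( 40 * $1 ))
}

for ctx in 16384 32768 262144; do
  echo "$ctx tokens: $(kv_token_layers "$ctx") slots vs dense $(dense_token_layers "$ctx")"
done
```

Under these assumptions the cache holds roughly a quarter of the slots a dense 40-layer model would need at these context lengths, and nearly all of the growth comes from the 10 full-attention layers.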

Both paths expose an OpenAI-compatible API (:11434/v1 for Ollama, :8000/v1 for llama.cpp) that the agent client below points at.
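Before pointing an agent at either endpoint, it can help to confirm that the server answers. This is a minimal sketch, assuming curl is installed and that the server implements the standard OpenAI-style model listing at /v1/models; `endpoint_up` and `report_endpoint` are hypothetical helper names.

```shell
# Sketch: check whether an OpenAI-compatible endpoint responds.
endpoint_up() {
  # $1 = base URL ending in /v1; succeeds only on an HTTP success response
  curl -fsS --max-time 5 "$1/models" > /dev/null 2>&1
}

report_endpoint() {
  if endpoint_up "$1"; then
    echo "up: $1"
  else
    echo "down: $1"
  fi
}

report_endpoint "http://localhost:11434/v1"   # Ollama (Path A)
report_endpoint "http://localhost:8000/v1"    # llama.cpp (Path B)
```

Only the path you actually started should report up; a down result for that path usually means the server is not running or is listening on a different port.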

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/laguna-xs-2.1:q8_0"            # openai/ + served model name; on Ollama the bare name resolves to the default Q4_K_M tag
export LLM_BASE_URL="http://localhost:11434/v1"         # Ollama; use :8000/v1 for llama.cpp
export LLM_API_KEY="EMPTY"                              # any non-empty string; local servers don't check it

openhands

OpenHands will now use Laguna to plan, run shell commands, and edit files in your workspace. The model's interleaved reasoning drives planning and its native tool-calling drives the file/shell actions. (Aider works the same way: point --openai-api-base at the same URL and pass the same openai/-prefixed model name to --model, e.g. --model openai/laguna-xs-2.1.)
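For Aider, the same settings can be supplied through environment variables instead of flags. This is a sketch assuming the Ollama endpoint from Path A and the q8_0 tag; OPENAI_API_BASE and OPENAI_API_KEY are the variables Aider reads for OpenAI-compatible servers, and the model name is whatever your server actually exposes.

```shell
# Sketch: Aider configuration for a local OpenAI-compatible server.
export OPENAI_API_BASE="http://localhost:11434/v1"   # Ollama; use :8000/v1 for llama.cpp
export OPENAI_API_KEY="EMPTY"                        # placeholder; local servers don't check it

# Assumed model name: openai/ prefix plus the tag served by Ollama.
AIDER_MODEL="openai/laguna-xs-2.1:q8_0"
echo "aider --model $AIDER_MODEL"

# Run the printed command from the root of the git repository you want Aider to edit.
```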

Results

Memory usage: The recommended q8_0 weights are 36 GB and must be held resident in unified memory, leaving roughly 12 GB of the M3 Max's 48 GB for the KV cache, activations, and the OS. That is why context is bounded above and why q8_0 sits near the default GPU-usable ceiling (raise iogpu.wired_limit_mb or fall back to Q4_K_M if a load fails). The native FP8 KV cache and sliding-window attention ease KV pressure relative to a dense model, but the weights still dominate the budget. Q4_K_M weights are 20.27 GB (18.88 GiB), leaving far more room for context. Per-quant file sizes are verified via the Laguna-XS-2.1-GGUF file tree (HF quants) and the Ollama tags page (q8_0).
Quality notes: Poolside reports Laguna XS 2.1 scoring SWE-bench Verified 70.9%, SWE-bench Multilingual 63.1%, SWE-Bench Pro (public) 47.6%, and Terminal-Bench 2.0 37.5%, per the vendor benchmark table on the Laguna-XS-2.1 model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. The card's benchmarking used temperature 1.0, top_k 20, top_p 1 with thinking enabled. Running q8_0 should keep quality closer to those numbers than the entry-tier Q4_K_M does.

There is no community throughput benchmark for Laguna XS 2.1 on the M3 Max yet: /check/laguna-xs-2-1/m3-max has no benchmark data, and this is a brand-new model. We therefore quote no tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is thinking content you discard.)

For the full benchmark data, see /check/laguna-xs-2-1/m3-max.

Troubleshooting

`llama-server` reports "unknown model architecture 'laguna'" / won't load the GGUF

Your llama.cpp build predates Laguna support. Upstream llama.cpp cannot load this model yet — the GGUF card states "llama.cpp support is not yet upstreamed." Either build from ggml-org/llama.cpp#25165 with Metal (see Path B), or use the turnkey Ollama path (Path A), which ships a first-party build that works today and drives Metal automatically. Once the PR merges, upgrade to a stock build.

q8_0 fails to load, or the KV cache won't fit

On a 48 GB machine, the 36 GB q8_0 weights sit near the default GPU-usable ceiling (macOS wires down ~70–75% of total unified memory by default). If a q8_0 load fails or OOMs, either raise the cap (sudo sysctl iogpu.wired_limit_mb=<value>, e.g. 40960 for ~40 GB) and close other heavy apps, or drop to the Q4_K_M tag (ollama run laguna-xs-2.1) which leaves generous headroom. If the model loads but a hard task OOMs mid-generation, the context is set too high — lower num_ctx on Ollama (or -c on llama.cpp, e.g. -c 16384). Watch the Memory tab in Activity Monitor during a real agent task — a hard coding problem produces a long thinking block that grows the KV cache mid-generation, so size for the peak, not the idle load.

The agent botches tool calls / doesn't emit reasoning

Make sure the built-in chat template is applied. On llama.cpp that means passing --jinja (Laguna ships a chat_template.jinja for reasoning and tool-calling); without it the model's native tool-call and thinking blocks are not formatted correctly and agent clients misbehave. Ollama applies the packaged template automatically.

No other widely-reported issues on the M3 Max yet. If you run Laguna XS 2.1 on this chip, report your throughput and any problems via the submission form so we can seed real benchmark data.