How much VRAM does Laguna XS 2.1 need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Laguna XS 2.1 on Apple M2 Max: Local Agentic Coding via Ollama + OpenHands (64GB Apple)

What You'll Build

A local, private agentic-coding setup: Laguna XS 2.1 — Poolside's open (OpenMDW-1.1) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint on an Apple M2 Max (64GB unified memory) via Ollama on Metal, and driven by the OpenHands coding agent (with Aider as an alternative) so the model can read your repo, run shell commands, and edit files. Laguna is a reasoning model with native interleaved thinking and tool-calling, built specifically for "agentic coding and long-horizon work on a local machine" per the Laguna-XS-2.1 model card. On a 64 GB Apple machine the unified memory comfortably clears the Q4_K_M floor: this recipe recommends the q8_0 tag (36 GB) as the top usable quant, running it with generous context headroom, with Q4_K_M (20.27 GB) as the lighter option.

Hardware data: Apple M2 Max (64GB unified memory, Metal) · Laguna XS 2.1 q8_0 GGUF (36 GB, recommended) or Q4_K_M (20.27 GB) · bounded-context, comfortable headroom · See benchmark data

⚠️ Upstream llama.cpp does not yet support this model — Ollama is the turnkey path. The official GGUF card states plainly: "llama.cpp support is not yet upstreamed." A stock brew install llama.cpp / release binary will not load Laguna until ggml-org/llama.cpp#25165 merges (as of this writing that PR is still open). Poolside ships a first-party Ollama build that works today (ollama pull laguna-xs-2.1), and on macOS Ollama uses Metal automatically — no CUDA, no manual GPU flags. This recipe leads with Ollama and gives the llama.cpp Metal PR-branch build as the second path. See Installation.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost. Laguna XS 2.1 is a 33B-total-parameter Mixture-of-Experts with ~3B activated per token (256 experts + 1 shared, 8 active/token), per the model card. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in (unified) memory, so the footprint is the full quant file — 36 GB at q8_0 or 20.27 GB at Q4_K_M, not some smaller "3B active" fraction. Do not expect the low active-parameter count to shrink the memory requirement.

⚠️ BF16 (67 GB) does NOT fit a 64 GB machine — q8_0 is the top usable quant here. The BF16 GGUF is 66.93 GB, which exceeds the M2 Max's 64 GB of total unified memory before macOS even takes its OS reservation — so it cannot be served on this config at all. On a 64 GB Apple machine the largest quant you can actually run is the q8_0 (36 GB) tag, which this recipe recommends; it leaves ~28 GB for the OS, KV cache, and large context.

ℹ️ Apple unified memory is shared with the OS — respect the GPU cap. On Apple Silicon the CPU and GPU share one memory pool, but macOS reserves a slice and by default only lets the GPU wire down roughly 70–75% of total (raiseable via sudo sysctl iogpu.wired_limit_mb=<value>). On a 64 GB machine that is roughly 45–48 GB of GPU-usable memory by default — comfortably above the 36 GB q8_0 weights, so q8_0 runs with real headroom to spare. Q4_K_M (20.27 GB) leaves even more.

ℹ️ Quant reality: Q4_K_M is the HF floor, q8_0 lives on Ollama. The HF GGUF repo ships exactly two files — Q4_K_M (20.27 GB) and BF16 (66.93 GB). There is no official Q3/Q2/Q5/Q6/Q8 in the HF repo. The q8_0 (36 GB) quant this recipe recommends exists only on the Ollama library (3 tags: q4_K_M/latest = 20 GB, q8_0 = 36 GB, bf16 = 67 GB). Apple's 64 GB unified memory is exactly what makes q8_0 the sweet spot here — a mid-quant quality bump that a 24 GB CUDA card can't hold, without the un-fittable BF16.

Requirements

Component	Minimum	Tested
GPU	24GB (Q4_K_M floor); 36GB+ usable to run q8_0	Apple M2 Max (64GB unified memory, Metal)
RAM	Unified with GPU (64GB total on this config)	64GB unified
Storage	~36GB for q8_0 (or ~19GB for Q4_K_M)	36 GB (`laguna-xs-2.1:q8_0`)
Software	Ollama (Metal), or a PR-branch llama.cpp (Metal) build; Python 3.10+ for OpenHands / Aider	Ollama, OpenHands
License	OpenMDW-1.1 (commercial-OK)	—

The Q4_K_M GGUF file is 20.27 GB (20,274,304,032 bytes) and the only other HF-published quant is BF16 66.93 GB (66,930,230,080 bytes) — which exceeds this machine's 64 GB and cannot be served here — per the Laguna-XS-2.1-GGUF file tree. The q8_0 (36 GB) tag used here is published on the Ollama library, not in the HF GGUF repo. The model is licensed under OpenMDW-1.1 (a permissive, commercial-OK license), per the Laguna-XS-2.1-GGUF model card.

Installation

Pick one of the two serving paths below, then install the agent client. Ollama is the recommended path because upstream llama.cpp does not yet support the laguna architecture (see the Known-issue box above), and on macOS Ollama drives Metal for you.

Path A — Ollama (recommended, turnkey, Metal)

Poolside publishes a first-party Ollama build. On a 64 GB M2 Max the q8_0 tag (36 GB) is the recommended top quant — it runs with large context headroom; both tags are on the official Ollama library page:

# Recommended on 64GB: the q8_0 (36 GB) tag — top usable quant, comfortable headroom
ollama pull laguna-xs-2.1:q8_0

# Lighter option: the default Q4_K_M (20 GB)
ollama pull laguna-xs-2.1

Ollama uses Metal automatically on macOS and serves an OpenAI-compatible endpoint at http://localhost:11434/v1 once the model is pulled and running. There are no CUDA flags to set. (Do not pull the bf16 tag on a 64 GB machine — at 67 GB it exceeds total unified memory and won't load.)

Path B — llama.cpp (build from the support PR, Metal)

Upstream llama.cpp cannot yet load Laguna. The GGUF card's own instruction is to build from the PR branch that adds support; on macOS you build it with Metal (which is the default), per the Laguna-XS-2.1-GGUF model card:

# Build llama.cpp from the PR branch that adds Laguna support (Metal is default on macOS)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/25165/head:laguna && git checkout laguna
cmake -B build -DGGML_METAL=ON
cmake --build build -j

# Download the Q4_K_M GGUF (the only GGUF the HF repo publishes)
huggingface-cli download poolside/Laguna-XS-2.1-GGUF \
  Laguna-XS-2.1-Q4_K_M.gguf --local-dir ~/models/Laguna-XS-2.1-GGUF

-DGGML_METAL=ON is the default on macOS (no CUDA on Apple Silicon), so Metal handles GPU offload. Note the HF GGUF repo only ships Q4_K_M and BF16 — the q8_0 tag is Ollama-only, so on the llama.cpp path you run Q4_K_M unless you quantize BF16 yourself. Once ggml-org/llama.cpp#25165 merges, a stock build (or brew install llama.cpp) will work without the PR checkout — verify the PR's state before assuming you still need the branch.

Install the agent client

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Aider is a lighter terminal-based alternative that also speaks the OpenAI API:

pip install aider-install && aider-install

Running

1. Serve Laguna with a bounded context

Ollama (Path A) — start the model server; Ollama exposes the OpenAI-compatible API at http://localhost:11434/v1 and drives Metal automatically:

ollama run laguna-xs-2.1:q8_0

Ollama's default context is small; raise it deliberately (e.g. /set parameter num_ctx 65536 in the interactive session, or an OLLAMA_CONTEXT_LENGTH env var) rather than jumping to the model's full 262,144-token window. With 36 GB of q8_0 weights resident on a 64 GB machine you have real room for a large KV cache, but the full 256K context still will not fit alongside the weights and the OS reservation, so start bounded and raise num_ctx while watching memory in Activity Monitor. The 64 GB budget lets you push context much higher than a 48 GB machine can before hitting the cap.

llama.cpp (Path B) — serve the downloaded GGUF, applying the model's built-in chat template so reasoning and tool-calling work. This is the GGUF card's own llama-server invocation, adapted to a bounded context, per the Laguna-XS-2.1-GGUF model card:

./build/bin/llama-server \
  -m ~/models/Laguna-XS-2.1-GGUF/Laguna-XS-2.1-Q4_K_M.gguf \
  --jinja \
  -ngl 99 \
  -c 65536 \
  --port 8000

--jinja applies the model's bundled chat_template.jinja — required on llama.cpp for correct reasoning and tool-calling; without it agentic edits misbehave. (Ollama applies its template automatically, so this flag is a Path-B concern only.)
-ngl 99 offloads all layers to the GPU via Metal (the quant file must sit in unified memory — see the MoE note above).
-c 65536 caps context at 64K — a comfortable value on a 64 GB machine running Q4_K_M. The card documents up to 262,144, but a bounded value keeps KV-cache memory reasonable. Laguna quantizes its KV cache to FP8 natively and uses sliding-window attention (a 512-token window on 30 of its 40 layers, per the model card), so its KV growth is gentler than a dense full-attention model — but the weights still dominate the budget, so start bounded and raise -c while watching memory (sudo powermetrics --samplers gpu_power, or Activity Monitor's Memory tab).

Both paths expose an OpenAI-compatible API (:11434/v1 for Ollama, :8000/v1 for llama.cpp) that the agent client below points at.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/laguna-xs-2.1"                 # match your served-model name
export LLM_BASE_URL="http://localhost:11434/v1"         # Ollama; use :8000/v1 for llama.cpp
export LLM_API_KEY="EMPTY"                              # any non-empty string; local servers don't check it

openhands

OpenHands will now use Laguna to plan, run shell commands, and edit files in your workspace. Its interleaved reasoning drives planning and its native tool-calling drives the file/shell actions. (Aider works the same way — point --openai-api-base at the same URL and pass --model openai/laguna-xs-2.1.)

Results

Memory usage: The recommended q8_0 weights are 36 GB and must be held resident in unified memory, leaving roughly 28 GB of the M2 Max's 64 GB for the KV cache, activations, and the OS — comfortable headroom that lets you run a large context. macOS wires down ~45–48 GB of GPU-usable memory by default on a 64 GB machine, well above the 36 GB weights, so q8_0 loads without touching iogpu.wired_limit_mb. The native FP8 KV cache and sliding-window attention ease KV pressure relative to a dense model. Q4_K_M weights are 20.27 GB (18.88 GiB), leaving even more room. Note that BF16 (66.93 GB) exceeds this machine's total memory and cannot be served here. Per-quant file sizes are verified via the Laguna-XS-2.1-GGUF file tree (HF quants) and the Ollama tags page (q8_0).
Quality notes: Poolside reports Laguna XS 2.1 scoring SWE-bench Verified 70.9%, SWE-bench Multilingual 63.1%, SWE-Bench Pro (public) 47.6%, and Terminal-Bench 2.0 37.5% — per the vendor benchmark table on the Laguna-XS-2.1 model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. The card's benchmarking used temperature 1.0, top_k 20, top_p 1 with thinking enabled. Running q8_0 rather than Q4_K_M keeps quality closer to those numbers than the entry-tier Q4_K_M does.

There is no community throughput benchmark for Laguna XS 2.1 on the M2 Max yet — /check/laguna-xs-2-1/m2-max has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is thinking content you discard.)

For the full benchmark data, see /check/laguna-xs-2-1/m2-max.

Troubleshooting

`llama-server` reports "unknown model architecture 'laguna'" / won't load the GGUF

Your llama.cpp build predates Laguna support. Upstream llama.cpp cannot load this model yet — the GGUF card states "llama.cpp support is not yet upstreamed." Either build from ggml-org/llama.cpp#25165 with Metal (see Path B), or use the turnkey Ollama path (Path A), which ships a first-party build that works today and drives Metal automatically. Once the PR merges, upgrade to a stock build.

bf16 won't load / the KV cache won't fit

Don't pull the bf16 tag on a 64 GB machine — at 66.93 GB it exceeds total unified memory and cannot be served here; q8_0 (36 GB) is the top usable quant. q8_0 itself loads comfortably (macOS wires down ~45–48 GB of GPU-usable memory by default, well above 36 GB), so a q8_0 OOM almost always means the context is set too high rather than a weight-fit problem — lower num_ctx on Ollama (or -c on llama.cpp, e.g. -c 32768). Watch the Memory tab in Activity Monitor during a real agent task — a hard coding problem produces a long thinking block that grows the KV cache mid-generation, so size for the peak, not the idle load. If you need still more context room, drop to the Q4_K_M tag.

The agent botches tool calls / doesn't emit reasoning

Make sure the built-in chat template is applied. On llama.cpp that means passing --jinja (Laguna ships a chat_template.jinja for reasoning and tool-calling); without it the model's native tool-call and thinking blocks are not formatted correctly and agent clients misbehave. Ollama applies the packaged template automatically.

No other widely-reported issues on the M2 Max yet. If you run Laguna XS 2.1 on this chip, report your throughput and any problems via the submission form so we can seed real benchmark data.