self-hosted/ai
§01·recipe · llm

Laguna XS 2.1 on RX 7900 XTX: Local Agentic Coding via Ollama (ROCm) / llama.cpp + OpenHands (24GB Tier)

llmadvanced24GB+ VRAMJul 3, 2026

This advanced recipe sets up Laguna XS 2.1 on the RX 7900 XTX, needing about 24 GB of VRAM.

models
tools
prerequisites
  • AMD Radeon RX 7900 XTX (24GB VRAM, RDNA3 / Navi 31 / gfx1100) — this is the 24GB tier for this 33B model; the Q4_K_M GGUF at 20.27 GB is a tight fit and leaves only ~3–4 GB for the KV cache, so context is the binding constraint (see below)
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm v7 driver installed via `amdgpu-install` — ROCm is NOT bundled with Ollama or the llama.cpp binaries
  • Python 3.10+ (for the OpenHands or Aider agent client)
  • Ollama with the ROCm runtime (turnkey — recommended), or a PR-branch build of llama.cpp with the HIP/ROCm backend — see the llama.cpp note below (upstream llama.cpp does not yet support the `laguna` architecture)
  • ~19GB free disk for the Q4_K_M GGUF weight file

What You'll Build

A local, private agentic-coding setup: Laguna XS 2.1 — Poolside's open (OpenMDW-1.1) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint on a single 24GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) via AMD's ROCm stack, and driven by the OpenHands coding agent (with Aider as an alternative) so the model can read your repo, run shell commands, and edit files. Laguna is a reasoning model with native interleaved thinking and tool-calling, built specifically for "agentic coding and long-horizon work on a local machine" per the Laguna-XS-2.1 model card. This recipe uses the Q4_K_M GGUF (20.27 GB on disk) — the smallest official quant, and the reason 24 GB is the floor for this model.

Hardware data: RX 7900 XTX (24GB VRAM) · Laguna XS 2.1 Q4_K_M GGUF (20.27 GB weights) · ROCm 7 · bounded-context, tight fit · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this coding LLM the reliable path is GGUF via Ollama (ROCm) or llama.cpp built with the HIP backend. Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card — those are NVIDIA-only.

⚠️ Upstream llama.cpp does not yet support this model — Ollama is the turnkey path. The official GGUF card states plainly: "llama.cpp support is not yet upstreamed." A stock ROCm llama.cpp build will not load Laguna until ggml-org/llama.cpp#25165 merges (as of this writing that PR is still open). Poolside ships a first-party Ollama build that works today (ollama pull laguna-xs-2.1), and Ollama's ROCm backend runs natively on gfx1100 — so this recipe leads with Ollama and gives the llama.cpp HIP PR-branch build as the second path. See Installation.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. Laguna XS 2.1 is a 33B-total-parameter Mixture-of-Experts with ~3B activated per token (256 experts + 1 shared, 8 active/token), per the model card. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in VRAM, so the memory footprint is the full quant file — 20.27 GB at Q4_K_M, not some smaller "3B active" fraction. Do not expect the low active-parameter count to shrink the VRAM requirement.

ℹ️ Card too small? Route to a smaller model. There is no official Q3/Q2/Q5/Q6/Q8 build — the GGUF repo ships exactly two files: Q4_K_M (20.27 GB) and BF16 (66.93 GB). Ollama's library adds a q8_0 (36 GB) build, but q8_0 does not fit a 24 GB card — only Q4_K_M does. Q4_K_M is the floor, so a card with less than 24 GB cannot run this model. If your GPU has under 24 GB, pick a smaller coding model rather than a lower quant that does not exist.

Requirements

ComponentMinimumTested
GPU24GB VRAM (this is the Q4_K_M floor)RX 7900 XTX (24GB, RDNA3 Navi 31, gfx1100)
RAM16GB system RAM (32GB comfortable for the agent + repo)
Storage~19GB (the Q4_K_M GGUF is 20.27 GB)18.88 GiB (Laguna-XS-2.1-Q4_K_M.gguf)
DriverAMD ROCm v7 (installed via amdgpu-install) on Linux
SoftwareOllama (ROCm), or a PR-branch llama.cpp (HIP) build; Python 3.10+ for OpenHands / AiderOllama
LicenseOpenMDW-1.1 (commercial-OK)

The Q4_K_M GGUF file is 20.27 GB (20,274,304,032 bytes) per the Laguna-XS-2.1-GGUF file tree; the only other published Hugging Face quant is BF16 66.93 GB (66,930,230,080 bytes) — which needs a 64–80 GB card. Ollama's library lists three tags — q4_K_M/latest (20 GB), q8_0 (36 GB), and bf16 (67 GB) — per the official Ollama library page; the q8_0 build is Ollama-only and at 36 GB it overflows a 24 GB card, so Q4_K_M is the only quant that fits the RX 7900 XTX. The model is licensed under OpenMDW-1.1 (a permissive, commercial-OK license), per the Laguna-XS-2.1-GGUF model card.

Installation

Pick one of the two serving paths below, then install the agent client. Ollama is the recommended path because upstream llama.cpp does not yet support the laguna architecture (see the Known-issue box above), and Ollama bundles a first-party Laguna build that runs on the ROCm runtime you install below.

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).

Path A — Ollama (recommended, turnkey)

Poolside publishes a first-party Ollama build; Ollama uses the ROCm runtime you installed above and runs the gfx1100 card natively. One command pulls the Q4_K_M weights and registers the model, per the Laguna-XS-2.1 model card and the official Ollama library page:

ollama pull laguna-xs-2.1

Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1 once the model is pulled and running. (The default latest tag is the 20 GB Q4_K_M build — the one that fits; do not pull the q8_0 (36 GB) tag on a 24 GB card.)

Path B — llama.cpp (build from the support PR, with the HIP backend)

Upstream llama.cpp cannot yet load Laguna, so this path compounds two build requirements: you need both the laguna architecture support from the open PR and the ROCm/HIP backend for the AMD card. Since you're building from source for HIP anyway, the PR checkout is free — check out the PR branch, then compile it with -DGGML_HIP=ON. Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX targets gfx1100:

# Clone llama.cpp and check out the PR branch that adds Laguna support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/25165/head:laguna && git checkout laguna

# Build with the HIP/ROCm backend, pinned to the 7900 XTX's gfx1100
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

# Download the Q4_K_M GGUF
huggingface-cli download poolside/Laguna-XS-2.1-GGUF \
  Laguna-XS-2.1-Q4_K_M.gguf --local-dir ~/models/Laguna-XS-2.1-GGUF

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX"). The GGUF quant is an integer format, so the absence of FP8 tensor hardware on RDNA3 is irrelevant here — Laguna's Q4_K_M runs on standard HIP kernels. The laguna arch comes from the PR branch, not master; once ggml-org/llama.cpp#25165 merges, a stock ROCm build will work without the PR checkout — verify the PR's state before assuming you still need the branch.

Install the agent client

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Aider is a lighter terminal-based alternative that also speaks the OpenAI API:

pip install aider-install && aider-install

Running

1. Serve Laguna with a bounded context

Ollama (Path A) — start the model server; Ollama exposes the OpenAI-compatible API at http://localhost:11434/v1:

ollama run laguna-xs-2.1

Ollama's default context is small; raise it deliberately (e.g. /set parameter num_ctx 32768 in the interactive session, or an OLLAMA_CONTEXT_LENGTH env var) rather than jumping to the model's full 262,144-token window, which no 24 GB card can hold once the weights are resident.

llama.cpp (Path B) — serve the downloaded GGUF, applying the model's built-in chat template so reasoning and tool-calling work. This is the GGUF card's own llama-server invocation, adapted to a bounded context, per the Laguna-XS-2.1-GGUF model card:

./build/bin/llama-server \
  -m ~/models/Laguna-XS-2.1-GGUF/Laguna-XS-2.1-Q4_K_M.gguf \
  --jinja \
  -ngl 99 \
  -c 32768 \
  --port 8000
  • --jinja applies the model's bundled chat_template.jinjarequired for correct reasoning and tool-calling; without it agentic edits misbehave.
  • -ngl 99 offloads all layers to the GPU (the 20.27 GB Q4 file must sit in VRAM — see the MoE note above).
  • -c 32768 caps context at 32K. The card documents up to 262,144, but a bounded value keeps KV-cache memory reasonable once ~20 GB of weights are already resident on a 24 GB card. Laguna quantizes its KV cache to FP8 natively and uses sliding-window attention (a 512-token window on 30 of its 40 layers, per the model card), so its KV growth is gentler than a dense full-attention model — but the weights still dominate the budget, so start bounded and raise -c while watching rocm-smi.

Both paths expose an OpenAI-compatible API (:11434/v1 for Ollama, :8000/v1 for llama.cpp) that the agent client below points at.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/laguna-xs-2.1"                 # match your served-model name
export LLM_BASE_URL="http://localhost:11434/v1"         # Ollama; use :8000/v1 for llama.cpp
export LLM_API_KEY="EMPTY"                              # any non-empty string; local servers don't check it

openhands

OpenHands will now use Laguna to plan, run shell commands, and edit files in your workspace. Its interleaved reasoning drives planning and its native tool-calling drives the file/shell actions. (Aider works the same way — point --openai-api-base at the same URL and pass --model openai/laguna-xs-2.1.)

Results

  • VRAM usage: The Q4_K_M weights are 20.27 GB on disk (18.88 GiB) and must be held resident, leaving only ~3–4 GB of the RX 7900 XTX's 24 GB for the KV cache and activations — which is why context is bounded above. The native FP8 KV cache and sliding-window attention ease KV pressure relative to a dense model, but do not change the fact that the weights alone nearly fill the card. Weight and per-quant file sizes are verified via the Laguna-XS-2.1-GGUF file tree.
  • Quality notes: Poolside reports Laguna XS 2.1 scoring SWE-bench Verified 70.9%, SWE-bench Multilingual 63.1%, SWE-Bench Pro (public) 47.6%, and Terminal-Bench 2.0 37.5% — per the vendor benchmark table on the Laguna-XS-2.1 model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. The card's benchmarking used temperature 1.0, top_k 20, top_p 1 with thinking enabled.

There is no community throughput benchmark for Laguna XS 2.1 on the RX 7900 XTX yet — /check/laguna-xs-2-1/rx-7900-xtx has no benchmark data, and this is a brand-new model, so we omit the tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is thinking content you discard.)

For the full benchmark data, see /check/laguna-xs-2-1/rx-7900-xtx.

Troubleshooting

llama-server reports "unknown model architecture 'laguna'" / won't load the GGUF

Your llama.cpp build predates Laguna support. Upstream llama.cpp cannot load this model yet — the GGUF card states "llama.cpp support is not yet upstreamed." Either build from ggml-org/llama.cpp#25165 with the HIP backend (see Path B — keep -DGGML_HIP=ON -DGPU_TARGETS=gfx1100), or use the turnkey Ollama path (Path A), which ships a first-party build that works today on the ROCm runtime. Once the PR merges, upgrade to a stock ROCm build.

Out of memory at launch, or the KV cache won't fit

The 20.27 GB of weights leave only ~3–4 GB on a 24GB card, so OOM at startup usually means the context is set too high. Lower -c (try -c 16384 or -c 8192 on llama.cpp; lower num_ctx on Ollama) and close any other GPU app before launching. Watch rocm-smi during a real agent task — a hard coding problem produces a long thinking block that grows the KV cache mid-generation, so size for the peak, not the idle load. The RX 7900 XTX's speed does not enlarge its 24 GB budget; if you need the model's full 256K context, that requires a 32/48 GB card or larger. The q8_0 Ollama tag (36 GB) does not fit this card at all — stay on Q4_K_M.

The agent botches tool calls / doesn't emit reasoning

Make sure the built-in chat template is applied. On llama.cpp that means passing --jinja (Laguna ships a chat_template.jinja for reasoning and tool-calling); without it the model's native tool-call and thinking blocks are not formatted correctly and agent clients misbehave. Ollama applies the packaged template automatically.

Ollama or llama.cpp runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON and that -ngl 99 is offloading layers. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card. (These are Llama-7B figures cited only to show the ROCm-vs-Vulkan gap on this GPU, not Laguna numbers. Note the Vulkan backend must also be built from the same laguna PR branch so it includes the arch.)

No other widely-reported issues on the RX 7900 XTX yet. If you run Laguna XS 2.1 on this card, report your throughput and any problems via the submission form so we can seed real benchmark data.

common questions
How much VRAM does Laguna XS 2.1 need?

About 24 GB — the minimum this recipe targets.

Which GPUs is Laguna XS 2.1 tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Advanced — follow the steps above.