How much VRAM does North Mini Code 1.0 need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

North Mini Code 1.0 on RX 7900 XTX: Local Agentic Coding via llama.cpp-HIP + OpenHands (24GB ROCm Entry Tier)

What You'll Build

A local, private agentic-coding setup: North Mini Code 1.0 — Cohere Labs' open Apache-2.0 Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) on a single 24GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100), driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. North Mini Code is a 30B-A3B MoE (30B total parameters, ~3B active per token) built for pure agentic coding: it emits native JSON-schema tool calls and supports interleaved thinking (a reasoning stream alongside its tool calls). This recipe uses the third-party Q4_K_M GGUF (~17.5 GB on disk) — the quant that fits a 24GB card with real working context.

Hardware data: RX 7900 XTX (24GB VRAM) · North Mini Code 1.0 Q4_K_M GGUF (~17.5 GB weights) · ROCm 7 · moderate-context, comfortable fit · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this coding LLM the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card — those are NVIDIA-only.

ℹ️ Apache-2.0 — commercial use is allowed. North Mini Code 1.0 ships under Apache-2.0, which is notable because Cohere's open weights usually land under the non-commercial CC-BY-NC. Apache-2.0 is a permissive license: you may use, modify, and deploy this model commercially. Confirm the terms against the model card before you ship.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. North Mini Code is a 128-expert Mixture-of-Experts that activates 8 experts per token (~3B active parameters). The low active count is a speed property — it does not shrink VRAM. All 128 experts stay loaded, so the memory footprint is the full quant file: ~17.5 GB at Q4_K_M, not some smaller "active" fraction. Do not expect the ~3B-active figure to reduce the memory requirement.

⚠️ 256K context does NOT fit on a 24GB card. The model's context window is 256K tokens (config max_position_embeddings 500000), and its attention interleaves a 4096-token sliding window with periodic global attention — which makes the KV cache at long context large. With ~17.5 GB of weights resident, the RX 7900 XTX has roughly 6 GB left for the KV cache. That is enough for a healthy short-to-moderate working context, but not for the full 256K — that needs a 32GB+ card or aggressive KV-cache quantization. This recipe caps context deliberately; read the Running section before you launch.

Requirements

Component	Minimum	Tested
GPU	24GB VRAM (this is North Mini Code's floor at Q4_K_M)	RX 7900 XTX (24GB, RDNA3 Navi 31, gfx1100)
RAM	16GB system RAM (32GB comfortable for the agent + repo)	—
Storage	~18GB (the Q4_K_M GGUF is ~17.5 GB)	18.74 GB file (`North-Mini-Code-1.0-Q4_K_M.gguf`)
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Software	llama.cpp (HIP build) or Ollama with cohere2_moe support; Python 3.10+ for OpenHands	llama.cpp `llama-server` (b9626+), OpenHands

There is no first-party GGUF — Cohere Labs ships safetensors plus fp8 and w4a16 quants only, and points to community quantizations, per the North Mini Code 1.0 model card. This recipe uses the vetted imatrix GGUF from bartowski: the Q4_K_M file is 18.74 GB (18,744,024,640 bytes → ~17.5 GiB) per the bartowski/North-Mini-Code-1.0-GGUF file tree. Neighbouring quants there are Q3_K_M ~14.2 GB, IQ4_XS 16.58 GB, Q4_K_M 18.74 GB, Q5_K_M 21.85 GB, Q6_K 26.4 GB, and Q8_0 32.44 GB — of these, Q5_K_M (20.34 GiB) is a tight fit on a 24GB card that leaves almost no KV headroom, so Q4_K_M is the recommended quant for this GPU.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

The RX 7900 XTX is on Ollama's supported AMD Radeon RX list and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).

1. Build llama.cpp with the HIP/ROCm backend and cohere2_moe support

North Mini Code uses the cohere2_moe architecture (Cohere2MoeForCausalLM). Support for this arch was added to llama.cpp in PR #24260 and first shipped in the b9626 build; bartowski quantized the GGUF with release b9630, per the bartowski GGUF model card. This compounds with the AMD build step: the ROCm/HIP binary you compile must come from a source tree new enough to include cohere2_moe — a stock or older ROCm llama.cpp will fail to load the model with unknown model architecture: cohere2moe. Since you are building from source with HIP anyway, this is free: clone the current master (which already contains PR #24260) so the arch table is up to date. Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Clone/pull the current master — it already includes PR #24260 (cohere2_moe, b9626+)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX"). The GGUF quants are integer formats, so the absence of FP8 tensor hardware on RDNA3 is irrelevant here — North Mini Code's Q4_K_M runs on standard HIP kernels. Building from a recent master is what guarantees the cohere2_moe arch is present; a b9626-or-later checkout is the minimum.

2. Download the Q4_K_M GGUF

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>:<quant>:

# Downloads North-Mini-Code-1.0-Q4_K_M.gguf (~17.5 GB) into the llama.cpp cache on first use
./build/bin/llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:Q4_K_M --port 8000

The first launch downloads ~18 GB; subsequent launches reuse the cached file. To download explicitly first, grab North-Mini-Code-1.0-Q4_K_M.gguf from the GGUF repo files tab and pass it with -m <path> instead.

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host. Aider and Cline are drop-in alternatives; both drive the same OpenAI-compatible endpoint.

Running

1. Serve North Mini Code with the correct chat template

This is the step that makes or breaks tool-calling. North Mini Code ships a custom chat template (chat_template.jinja) that carries tool_call_id and the interleaved-thinking format; the model card documents its tool-calling as JSON-schema function calls with a reasoning field, per the model card. A generic ChatML path will break tool-calling — the model won't emit clean tool calls and OpenHands won't be able to edit files.

Two gotchas the maintainers called out on the arch-support PR (#24260): (a) Cohere ships two conflicting templates — an outdated one in tokenizer_config.json and the current one in chat_template.jinja; some GGUFs embed the wrong one. (b) The reference template for GGUFs lives at models/templates/Cohere2-MoE.jinja in the llama.cpp tree. Serve with --jinja so llama.cpp applies the GGUF's embedded Jinja template rather than a built-in fallback, and if tool-calling still misbehaves, point it explicitly at the reference template with --chat-template-file:

./build/bin/llama-server \
    -hf bartowski/North-Mini-Code-1.0-GGUF:Q4_K_M \
    --port 8000 \
    --jinja \
    -ngl 99 \
    -c 32768 \
    -fa on \
    -ctk q8_0 -ctv q8_0

--jinja enables the model's own Jinja chat template — required for the interleaved-thinking + tool-call format to work in an agent client.
-ngl 99 offloads all layers to the GPU (the ~17.5 GB Q4 file plus the KV cache must sit in VRAM — see the MoE note above).
-c 32768 caps context at 32K. This is a deliberate, comfortable starting point for the 7900 XTX — well under the model's 256K ceiling, which does not fit here. Raise or lower it while watching rocm-smi (see Troubleshooting).
-fa on enables Flash Attention, and -ctk q8_0 -ctv q8_0 quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which buys back usable context.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. Ollama is a valid alternative — it too needs a build with cohere2_moe support, it uses the ROCm runtime you installed above and runs the gfx1100 card natively, and it applies its own template, so verify tool-calling works before relying on it for agent edits.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/North-Mini-Code-1.0"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use North Mini Code to plan, run shell commands, and edit files in your workspace. Its interleaved thinking drives planning and its native tool-calling drives the file/shell actions.

Results

VRAM usage: The Q4_K_M weights are ~17.5 GB (18.74 GB file) and must stay resident, leaving roughly 6 GB of the RX 7900 XTX's 24 GB for the KV cache and activations — which is why context is capped at 32K and the KV cache is 8-bit-quantized above. File sizes are verified via the bartowski GGUF file tree.
Quality notes: Cohere reports North Mini Code 1.0 scoring SWE-Bench Verified 80.2% pass@10, Terminal-Bench v2 55.1% pass@10, and mini-SWE-Agent 61.0% pass@1 on agentic-coding evals, per the model card. Those are the vendor's own benchmarks, not a measurement on this GPU. Recommended sampling per the model card is temperature 1.0, top_p 0.95.
Speed: North Mini Code 1.0 is brand-new, so there is no community throughput benchmark for it on the RX 7900 XTX yet — /check/north-mini-code-1-0/rx-7900-xtx has no benchmark data. We omit the tok/s figure rather than invent one or borrow one from different hardware.

For the full benchmark data, see /check/north-mini-code-1-0/rx-7900-xtx.

Troubleshooting

"unknown model architecture: cohere2moe" (or the model won't load)

Your llama.cpp is older than the build that added cohere2_moe support — the single most common failure on AMD, because a stock or distro-packaged ROCm llama.cpp is often behind. The arch landed in PR #24260 (first in the b9626 build); the GGUF itself was made with release b9630, per the bartowski GGUF card. Rebuild your HIP binary from a recent git pull of llama.cpp (keep -DGGML_HIP=ON -DGPU_TARGETS=gfx1100), or download a b9626-or-later ROCm release binary. The same applies to Ollama — you need a version whose bundled llama.cpp includes the arch.

The agent won't edit files / tool calls come out malformed

Almost always a chat-template problem. North Mini Code needs its custom Jinja template for tool-calling; a generic ChatML fallback breaks it. First, make sure you launched llama-server with --jinja (above). If it still misbehaves, the GGUF may carry the outdated template that Cohere ships in tokenizer_config.json rather than the current chat_template.jinja — a mismatch the maintainers flagged on PR #24260. The fix is to pass the reference template explicitly:

# Download the reference template from the llama.cpp tree, then point the server at it
curl -L -o Cohere2-MoE.jinja \
  https://raw.githubusercontent.com/ggml-org/llama.cpp/master/models/templates/Cohere2-MoE.jinja
./build/bin/llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:Q4_K_M --port 8000 \
    --jinja --chat-template-file Cohere2-MoE.jinja -ngl 99 -c 32768 -fa on -ctk q8_0 -ctv q8_0

Out of memory at launch, or the KV cache won't fit

The ~17.5 GB of weights leave ~6 GB on a 24GB card. OOM at startup usually means the context is set too high. Lower -c (try -c 16384), keep -ctk q8_0 -ctv q8_0 and -fa on enabled, and close any other GPU app before launching. Watch rocm-smi during a real agent task — a hard coding problem produces a long interleaved-thinking stream that grows the KV cache mid-generation, so size for the peak, not the idle load. If you need the full 256K context, that requires a 32GB-class card or heavier KV-cache quantization.

Ollama or llama.cpp runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON and that -ngl 99 is offloading layers. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card. (These are Llama-7B figures cited only to show the ROCm-vs-Vulkan gap on this GPU, not North Mini Code numbers. Note the Vulkan backend must also be built from a recent tree so it includes cohere2_moe.)