self-hosted/ai
§01·recipe · llm

North Mini Code 1.0 on RTX 4080: Reduced-Quant Local Agentic Coding in 16GB via llama.cpp + OpenHands

llmadvanced16GB+ VRAMJul 3, 2026

This advanced recipe sets up North Mini Code 1.0 on the RTX 4080, needing about 16 GB of VRAM.

models
tools
prerequisites
  • NVIDIA RTX 4080 (16GB VRAM) — a reduced-quant tier for North Mini Code; the Q4_K_M GGUF (18.7 GB) does NOT fit 16 GB, so this recipe uses IQ4_XS (16.58 GB) or Q3_K_M (14.2 GB) with a short context. 24 GB + Q4_K_M is recommended for full quality (see below)
  • Python 3.10+ (for the OpenHands agent client)
  • A recent llama.cpp or Ollama build that includes cohere2_moe architecture support — the arch landed in the b9626 build (see Installation); the GGUF was quantized with release b9630
  • ~17GB free disk for the IQ4_XS GGUF (or ~15GB for Q3_K_M)

What You'll Build

A local, private agentic-coding setup: North Mini Code 1.0 — Cohere Labs' open Apache-2.0 Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp on a single 16GB RTX 4080, driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. North Mini Code is a 30B-A3B MoE (30B total parameters, ~3B active per token) built for pure agentic coding: it emits native JSON-schema tool calls and supports interleaved thinking (a reasoning stream alongside its tool calls). At 16 GB this is a reduced-quant tier — the recommended Q4_K_M GGUF is too large, so this recipe uses IQ4_XS (16.58 GB) or Q3_K_M (14.2 GB) with a short context. It runs, but read the honest tradeoff note below first.

Hardware data: RTX 4080 (16GB VRAM) · North Mini Code 1.0 IQ4_XS (16.58 GB) or Q3_K_M (14.2 GB) GGUF · short-context, reduced-quant tier · See benchmark data

⚠️ 16 GB is a quality/context tradeoff — 24 GB + Q4_K_M is the recommended path. The recommended Q4_K_M GGUF is 18.74 GB, which does not fit a 16 GB card. To run North Mini Code on the RTX 4080 you must drop to a smaller quant — IQ4_XS (16.58 GB) is a tight fit that leaves almost no room for the KV cache, or Q3_K_M (14.2 GB) leaves ~1.5 GB of headroom for a short context. Both cost quality versus Q4_K_M (Q3_K_M more so than IQ4_XS), and both force a short working context. If you want full-quality weights and real working context, this model wants a 24 GB card at Q4_K_M — a 3090, 3090 Ti, 4090, or 5090. Treat the 16 GB path as a "make it run" option, not the fidelity path.

ℹ️ Apache-2.0 — commercial use is allowed. North Mini Code 1.0 ships under Apache-2.0, which is notable because Cohere's open weights usually land under the non-commercial CC-BY-NC. Apache-2.0 is a permissive license: you may use, modify, and deploy this model commercially. Confirm the terms against the model card before you ship.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. North Mini Code is a 128-expert Mixture-of-Experts that activates 8 experts per token (~3B active parameters). The low active count is a speed property — it does not shrink VRAM. All 128 experts stay loaded, so the memory footprint is the full quant file: 16.58 GB at IQ4_XS or 14.2 GB at Q3_K_M, not some smaller "active" fraction. Do not expect the ~3B-active figure to reduce the memory requirement — this is exactly why a 30B-A3B model still overflows a 16 GB card at Q4_K_M.

⚠️ 256K context does NOT fit on a 16GB card. The model's context window is 256K tokens (config max_position_embeddings 500000), and its attention interleaves a 4096-token sliding window with periodic global attention — which makes the KV cache at long context large. With ~14–16.5 GB of weights resident, the RTX 4080 has only ~1–2 GB left for the KV cache. That is enough for a short working context only — not the full 256K, which needs a 32GB+ card or aggressive KV-cache quantization. This recipe caps context aggressively; read the Running section before you launch.

Requirements

ComponentMinimumTested
GPU16GB VRAM (reduced-quant tier — Q4_K_M's floor is 24 GB)RTX 4080 (16GB, Ada Lovelace AD103, sm_89)
RAM16GB system RAM (32GB comfortable for the agent + repo)
Storage~17GB (IQ4_XS is 16.58 GB) or ~15GB (Q3_K_M is 14.2 GB)16.58 GB (...-IQ4_XS.gguf) / 14.21 GB (...-Q3_K_M.gguf)
Softwarellama.cpp or Ollama with cohere2_moe support; Python 3.10+ for OpenHandsllama.cpp llama-server (b9626+), OpenHands

There is no first-party GGUF — Cohere Labs ships safetensors plus fp8 and w4a16 quants only, and points to community quantizations, per the North Mini Code 1.0 model card. This recipe uses the vetted imatrix GGUF from bartowski. The full ladder there is Q2_K ~11.1 GB, IQ3_XXS ~13.0 GB, Q3_K_M ~14.2 GB, IQ4_XS 16.58 GB, Q4_K_M 18.74 GB, Q5_K_M 21.85 GB, Q6_K 26.4 GB, Q8_0 32.44 GB, per the bartowski/North-Mini-Code-1.0-GGUF file tree. On a 16 GB card the recommended Q4_K_M (18.74 GB) does not fit — use IQ4_XS (16,583,843,392 bytes → 15.44 GiB) for a tight fit closest to Q4_K_M quality, or Q3_K_M (14,209,048,128 bytes → 13.23 GiB) for a little KV-cache headroom at some quality cost.

Installation

1. Install llama.cpp with cohere2_moe support

North Mini Code uses the cohere2_moe architecture (Cohere2MoeForCausalLM). Support for this arch was added to llama.cpp in PR #24260 and first shipped in the b9626 build; bartowski quantized the GGUF with release b9630, per the bartowski GGUF model card. You need a llama.cpp new enough to include that arch — an older binary will fail to load the model with an unknown-architecture error. Build from source at a recent commit, or use a b9626-or-later release binary, per the llama.cpp README:

# Build from source with CUDA (guarantees a recent-enough arch table)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

The RTX 4080 is Ada Lovelace (AD103, sm_89) and is fully supported by a standard CUDA build — no special wheel is needed (the GGUF quants are integer formats, so the RTX 4080's FP8 tensor cores are not used on this path; a plain CUDA build is all you need).

2. Download the IQ4_XS (or Q3_K_M) GGUF

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>:<quant>:

# IQ4_XS (16.58 GB) — tight fit, closest to Q4_K_M quality
llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:IQ4_XS --port 8000

# Or Q3_K_M (14.2 GB) — a little more KV headroom, a step down in quality
# llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:Q3_K_M --port 8000

The first launch downloads the file; subsequent launches reuse the cached copy. To download explicitly first, grab the matching file from the GGUF repo files tab and pass it with -m <path> instead.

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host. Aider and Cline are drop-in alternatives; both drive the same OpenAI-compatible endpoint.

Running

1. Serve North Mini Code with the correct chat template

This is the step that makes or breaks tool-calling. North Mini Code ships a custom chat template (chat_template.jinja) that carries tool_call_id and the interleaved-thinking format; the model card documents its tool-calling as JSON-schema function calls with a reasoning field, per the model card. A generic ChatML path will break tool-calling — the model won't emit clean tool calls and OpenHands won't be able to edit files.

Two gotchas the maintainers called out on the arch-support PR (#24260): (a) Cohere ships two conflicting templates — an outdated one in tokenizer_config.json and the current one in chat_template.jinja; some GGUFs embed the wrong one. (b) The reference template for GGUFs lives at models/templates/Cohere2-MoE.jinja in the llama.cpp tree. Serve with --jinja so llama.cpp applies the GGUF's embedded Jinja template rather than a built-in fallback, and if tool-calling still misbehaves, point it explicitly at the reference template with --chat-template-file:

llama-server \
    -hf bartowski/North-Mini-Code-1.0-GGUF:IQ4_XS \
    --port 8000 \
    --jinja \
    -ngl 99 \
    -c 8192 \
    -fa on \
    -ctk q8_0 -ctv q8_0
  • --jinja enables the model's own Jinja chat template — required for the interleaved-thinking + tool-call format to work in an agent client.
  • -ngl 99 offloads all layers to the GPU (the 16.58 GB IQ4_XS file — or 14.2 GB Q3_K_M — plus the KV cache must sit in VRAM; see the MoE note above).
  • -c 8192 caps context at 8K. This is a deliberately short starting point for the 4080 — IQ4_XS leaves almost no KV headroom on 16 GB, so start small. On IQ4_XS you may need to drop lower (-c 4096) or fall back to Q3_K_M to fit any real context; on Q3_K_M you have ~1.5 GB more headroom to raise it. The model's 256K ceiling does not fit here. Adjust while watching nvidia-smi (see Troubleshooting).
  • -fa on enables Flash Attention, and -ctk q8_0 -ctv q8_0 quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which is what makes any working context possible in this tier.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. Ollama is a valid alternative — it too needs a build with cohere2_moe support, and it applies its own template, so verify tool-calling works before relying on it for agent edits.

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs:

export LLM_MODEL="openai/North-Mini-Code-1.0"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use North Mini Code to plan, run shell commands, and edit files in your workspace. Its interleaved thinking drives planning and its native tool-calling drives the file/shell actions.

Results

  • VRAM usage: The IQ4_XS weights are 16.58 GB (or Q3_K_M 14.2 GB) and must stay resident, leaving only ~1–2 GB of the RTX 4080's 16 GB for the KV cache and activations — which is why context is capped at 8K (or lower) and the KV cache is 8-bit-quantized above. If IQ4_XS won't leave enough room for a usable context, drop to Q3_K_M. File sizes are verified via the bartowski GGUF file tree.
  • Quality notes: Cohere reports North Mini Code 1.0 scoring SWE-Bench Verified 80.2% pass@10, Terminal-Bench v2 55.1% pass@10, and mini-SWE-Agent 61.0% pass@1 on agentic-coding evals, per the model card. Those are the vendor's own benchmarks at full precision, not a measurement on this GPU — the reduced IQ4_XS/Q3_K_M quants used here will land somewhat below full-quality Q4_K_M+. Recommended sampling per the model card is temperature 1.0, top_p 0.95.
  • Speed: North Mini Code 1.0 is brand-new, so there is no community throughput benchmark for it on the RTX 4080 yet — /check/north-mini-code-1-0/rtx-4080 has no benchmark data. We omit the tok/s figure rather than invent one or borrow one from different hardware.

For the full benchmark data, see /check/north-mini-code-1-0/rtx-4080.

Troubleshooting

"unknown model architecture: cohere2moe" (or the model won't load)

Your llama.cpp is older than the build that added cohere2_moe support. The arch landed in PR #24260 (first in the b9626 build); the GGUF itself was made with release b9630, per the bartowski GGUF card. Rebuild from a recent git pull of llama.cpp, or download a b9626-or-later release binary. The same applies to Ollama — you need a version whose bundled llama.cpp includes the arch.

The agent won't edit files / tool calls come out malformed

Almost always a chat-template problem. North Mini Code needs its custom Jinja template for tool-calling; a generic ChatML fallback breaks it. First, make sure you launched llama-server with --jinja (above). If it still misbehaves, the GGUF may carry the outdated template that Cohere ships in tokenizer_config.json rather than the current chat_template.jinja — a mismatch the maintainers flagged on PR #24260. The fix is to pass the reference template explicitly:

# Download the reference template from the llama.cpp tree, then point the server at it
curl -L -o Cohere2-MoE.jinja \
  https://raw.githubusercontent.com/ggml-org/llama.cpp/master/models/templates/Cohere2-MoE.jinja
llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:IQ4_XS --port 8000 \
    --jinja --chat-template-file Cohere2-MoE.jinja -ngl 99 -c 8192 -fa on -ctk q8_0 -ctv q8_0

Out of memory at launch, or the KV cache won't fit

This is the common failure on 16 GB. The IQ4_XS weights (16.58 GB) barely fit 16 GB and leave almost nothing for the KV cache. If you OOM at launch: first lower -c (try -c 4096), keep -ctk q8_0 -ctv q8_0 and -fa on enabled, and close any other GPU app. If it still won't fit a usable context, drop to Q3_K_M (14.2 GB) — that frees ~2.4 GB for the KV cache. Watch nvidia-smi during a real agent task — a hard coding problem produces a long interleaved-thinking stream that grows the KV cache mid-generation, so size for the peak, not the idle load. For full-quality Q4_K_M with real working context, this model wants a 24 GB card (3090/3090 Ti/4090); the full 256K context needs a 32GB (e.g. RTX 5090) or larger card.

torch/CUDA or llama.cpp reports no GPU

Confirm your llama.cpp build has CUDA enabled (GGML_CUDA=ON when building from source) and that -ngl 99 is offloading layers. The RTX 4080 (Ada Lovelace sm_89) needs no special flags beyond a standard CUDA build; the GGUF quants are integer formats, so you do not need FP8 support on this path and you do not need to install flash-attn separately — llama.cpp's -fa on uses its own built-in attention kernels.

common questions
How much VRAM does North Mini Code 1.0 need?

About 16 GB — the minimum this recipe targets.

Which GPUs is North Mini Code 1.0 tested on?

RTX 4080 (16 GB).

How hard is this setup?

Advanced — follow the steps above.