How much VRAM does North Mini Code 1.0 need?

About 48 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

North Mini Code 1.0 on Apple M3 Max: Local Agentic Coding via llama.cpp Metal + OpenHands (48GB Unified Memory)

What You'll Build

A local, private agentic-coding setup: North Mini Code 1.0 — Cohere Labs' open Apache-2.0 Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp using Apple's Metal backend on an Apple M3 Max with 48 GB unified memory, driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. There is no NVIDIA GPU, no CUDA, no FlashAttention. North Mini Code is a 30B-A3B MoE (30B total parameters, ~3B active per token) built for pure agentic coding: it emits native JSON-schema tool calls and supports interleaved thinking (a reasoning stream alongside its tool calls). Unlike the tight-fit 24 GB NVIDIA cards — where Q4_K_M (~17.5 GB) is the only quant that fits with working context — the M3 Max's 48 GB unified pool is roomy enough to lead with the near-lossless Q6_K quant (24.59 GB) and still leave room for the KV cache and a real context window.

Hardware data: Apple M3 Max (48 GB unified memory) · North Mini Code 1.0 Q6_K GGUF (24.59 GB weights) · roomy fit, real context · See benchmark data

ℹ️ Apache-2.0 — commercial use is allowed. North Mini Code 1.0 ships under Apache-2.0, which is notable because Cohere's open weights usually land under the non-commercial CC-BY-NC. Apache-2.0 is a permissive license: you may use, modify, and deploy this model commercially. Confirm the terms against the model card before you ship.

ℹ️ Unified memory is not VRAM. The M3 Max has 48 GB of unified memory shared by the OS, CPU, and GPU — it is not 48 GB of dedicated VRAM. By default macOS lets the GPU address only roughly two-thirds of it (~32 GB safe / ~36 GB optimistic via Metal's recommendedMaxWorkingSetSize) on a sub-64 GB Mac. That is still roomy for the 30B-A3B — the recommended Q6_K (24.59 GB) fits within the ~32 GB safe pool with room for the KV cache and a real context window; Q8_0 (30.21 GB) fits with headroom too but is near the safe-pool edge and may want the wired limit raised for a large KV cache; Q5_K_M (20.34 GB) / Q4_K_M (~17.5 GB) run with even more context to spare. This is the opposite of the 24 GB cards, where Q4_K_M was a squeeze. Loading a large model may require raising the GPU wired-memory limit once — see Troubleshooting.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost. North Mini Code is a 128-expert Mixture-of-Experts that activates 8 experts per token (~3B active parameters). The low active count is a speed property — it does not shrink memory. All 128 experts stay loaded, so the footprint is the full quant file — 24.59 GB at Q6_K, not some smaller "active" fraction. Do not expect the ~3B-active figure to reduce the memory requirement.

⚠️ 256K context is the KV-cache constraint, not the weight file. The model's context window is 256K tokens (config max_position_embeddings 500000), and its attention interleaves a 4096-token sliding window with periodic global attention — which makes the KV cache at long context large. With ~24.6 GB of Q6_K weights resident, the 48 GB Mac has a genuinely usable context budget where the 24 GB cards were pinned lower — but the full 256K KV cache still costs many GB on top of the weights, so cap context deliberately and quantize the KV cache (below). The full 256K wants a 64 GB+ Mac or aggressive KV-cache quantization.

Requirements

Component	Minimum	Tested
GPU / memory	48 GB unified memory (~32 GB GPU-addressable by default, raisable)	Apple M3 Max (40-core GPU, 48 GB unified memory)
RAM	Same pool — unified	48 GB unified
Storage	~26 GB (the Q6_K GGUF is 26.4 GB)	26.40 GB (`North-Mini-Code-1.0-Q6_K.gguf`)
Software	llama.cpp (Metal, the macOS default) or Ollama with cohere2_moe support; Python 3.10+ for OpenHands	llama.cpp `llama-server` (b9626+), OpenHands

There is no first-party GGUF — Cohere Labs ships safetensors plus fp8 and w4a16 quants only, and points to community quantizations, per the North Mini Code 1.0 model card. This recipe uses the vetted imatrix GGUF from bartowski, per the bartowski/North-Mini-Code-1.0-GGUF file tree. The relevant quants are Q4_K_M 18.74 GB (18,744,024,640 bytes → ~17.5 GiB), Q5_K_M 21.85 GB (21,845,253,696 bytes → 20.34 GiB), Q6_K 26.40 GB (26,402,856,512 bytes → 24.59 GiB), and Q8_0 32.44 GB (32,437,263,936 bytes → 30.21 GiB). On the M3 Max the binding constraint is addressable unified memory, not raw capacity: the GPU addresses only ~32 GB safely (~36 GB optimistically) by default on a sub-64 GB Mac. Q6_K (24.59 GB) is the recommended default — near-lossless and fitting the ~32 GB safe pool with room for the KV cache and a real context; Q8_0 (30.21 GB) fits but sits near the safe-pool edge, so it may want the wired limit raised once a large KV cache is added; Q5_K_M / Q4_K_M run with even larger context. The bf16 build (~61 GB) does not fit on a 48 GB Mac. The model is Apache-2.0-licensed, per the model card.

Installation

1. Install llama.cpp with Metal (the macOS default) and cohere2_moe support

On Apple Silicon there is nothing CUDA-shaped to install — no CUDA toolkit, no FP8 path, no FlashAttention wheel. llama.cpp's Metal backend runs the model on the Apple GPU and is enabled by default on macOS: "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs). But North Mini Code uses the cohere2_moe architecture (Cohere2MoeForCausalLM), so you also need a recent build: support landed in PR #24260, first shipping in the b9626 build; bartowski quantized the GGUF with release b9630, per the bartowski GGUF model card. A Homebrew binary or built-from-source clone that predates b9626 will fail to load the model with an unknown-architecture error, so use a current build:

# Homebrew — ships a Metal-enabled build on macOS (ensure it is recent enough for cohere2_moe)
brew install llama.cpp

# …or build from source (Metal is on by default on macOS; a fresh clone has the arch)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Metal is the macOS default, so a plain cmake -B build already targets the GPU — there is no CUDA flag to set. The same recent-build requirement holds on Ollama: you need a version whose bundled llama.cpp includes the cohere2_moe arch.

2. Download the Q6_K GGUF

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>:<quant>; name the Q6_K quant explicitly since it is the recommended default for this roomy tier:

# Downloads North-Mini-Code-1.0-Q6_K.gguf (~24.6 GB) into the llama.cpp cache on first use
llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:Q6_K --port 8000

The first launch downloads ~26 GB; subsequent launches reuse the cached file. To download explicitly first, grab North-Mini-Code-1.0-Q6_K.gguf from the GGUF repo files tab and pass it with -m <path> instead. For maximum fidelity on this 48 GB Mac use :Q8_0 (30.21 GB, near the safe-pool edge); for very large context drop to :Q5_K_M (20.34 GB) or :Q4_K_M (~17.5 GB).

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint — it is runtime-agnostic, so the same wiring works against llama.cpp's Metal server as against a CUDA one:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host. Aider and Cline are drop-in alternatives; both drive the same OpenAI-compatible endpoint.

Running

1. Serve North Mini Code with the correct chat template

This is the step that makes or breaks tool-calling. North Mini Code ships a custom chat template (chat_template.jinja) that carries tool_call_id and the interleaved-thinking format; the model card documents its tool-calling as JSON-schema function calls with a reasoning field, per the model card. A generic ChatML path will break tool-calling — the model won't emit clean tool calls and OpenHands won't be able to edit files.

Two gotchas the maintainers called out on the arch-support PR (#24260): (a) Cohere ships two conflicting templates — an outdated one in tokenizer_config.json and the current one in chat_template.jinja; some GGUFs embed the wrong one. (b) The reference template for GGUFs lives at models/templates/Cohere2-MoE.jinja in the llama.cpp tree. Serve with --jinja so llama.cpp applies the GGUF's embedded Jinja template rather than a built-in fallback, and if tool-calling still misbehaves, point it explicitly at the reference template with --chat-template-file:

llama-server \
    -hf bartowski/North-Mini-Code-1.0-GGUF:Q6_K \
    --port 8000 \
    --jinja \
    -ngl 99 \
    -c 32768 \
    -fa on \
    -ctk q8_0 -ctv q8_0

--jinja enables the model's own Jinja chat template — required for the interleaved-thinking + tool-call format to work in an agent client.
-ngl 99 offloads all layers to the Apple GPU via Metal (the ~24.6 GB Q6_K file must sit in memory — see the MoE note above). On macOS this is the Metal offload; there are no CUDA semantics involved.
-c 32768 gives a generous 32K context — far beyond what the 24 GB cards allowed — while keeping the KV cache within the ~32 GB safe pool alongside the Q6_K weights. To push further, raise -c and keep the KV cache quantized (below); watch Activity Monitor's Memory-Pressure gauge as you climb, because a large KV cache costs many GB on top of the 24.59 GB of weights.
-fa on enables Flash Attention, and -ctk q8_0 -ctv q8_0 quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which is what lets you push context toward the model's 256K ceiling without exhausting unified memory. Interleaved-thinking turns spend many tokens in the reasoning stream, so KV pressure is higher than a plain chat model at the same context setting.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. (Ollama is a valid alternative — it too needs a build with cohere2_moe support and applies its own template, and on macOS uses Metal automatically; its own context default is small and must be raised deliberately. Verify tool-calling works before relying on it for agent edits.)

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs. The wiring is identical to any other backend — point it at the llama.cpp Metal server (or Ollama's http://localhost:11434/v1):

export LLM_MODEL="openai/North-Mini-Code-1.0"
export LLM_BASE_URL="http://localhost:8000/v1"   # or http://localhost:11434/v1 for Ollama
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use North Mini Code to plan, run shell commands, and edit files in your workspace. Its interleaved thinking drives planning and its native tool-calling drives the file/shell actions.

Results

Speed: No community benchmark for North Mini Code 1.0 on the Apple M3 Max exists yet — /check/north-mini-code-1-0/m3-max has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. Token generation on Apple Silicon is bandwidth-bound (memory-bandwidth-limited), so throughput tracks the unified-memory bandwidth rather than any core count; we do not publish a figure without a cited measurement.
Memory usage: The Q6_K weights are 24.59 GB (26.40 GB file) and must be held resident, fitting within the M3 Max's ~32 GB safe unified pool with room for the KV cache and activations — which is why this tier runs a real (32K+) context where the 24 GB cards were pinned lower. Q8_0 (30.21 GB) fits but sits near the safe-pool edge and may want the wired limit raised once a large KV cache is added; the bf16 build (~61 GB) does not fit on 48 GB. Weight and per-quant file sizes are verified via the bartowski GGUF file tree.
Quality notes: Cohere reports North Mini Code 1.0 scoring SWE-Bench Verified 80.2% pass@10, Terminal-Bench v2 55.1% pass@10, and mini-SWE-Agent 61.0% pass@1 on agentic-coding evals, per the model card. Those are the vendor's own benchmarks, not a measurement on this GPU. On this roomy tier the near-lossless Q6_K default preserves essentially all of that quality. Recommended sampling per the model card is temperature 1.0, top_p 0.95.

For the full benchmark data, see /check/north-mini-code-1-0/m3-max.

Troubleshooting

"unknown model architecture: cohere2moe" (or the model won't load)

Your llama.cpp (or Ollama) is older than the build that added cohere2_moe support. The arch landed in PR #24260 (first in the b9626 build); the GGUF itself was made with release b9630, per the bartowski GGUF card. On macOS the same "recent build" requirement holds — brew upgrade llama.cpp for a fresh Metal binary, or rebuild from a recent git pull of llama.cpp. For Ollama, use a version whose bundled llama.cpp includes the arch.

The agent won't edit files / tool calls come out malformed

Almost always a chat-template problem. North Mini Code needs its custom Jinja template for tool-calling; a generic ChatML fallback breaks it. First, make sure you launched llama-server with --jinja (above). If it still misbehaves, the GGUF may carry the outdated template that Cohere ships in tokenizer_config.json rather than the current chat_template.jinja — a mismatch the maintainers flagged on PR #24260. The fix is to pass the reference template explicitly:

# Download the reference template from the llama.cpp tree, then point the server at it
curl -L -o Cohere2-MoE.jinja \
  https://raw.githubusercontent.com/ggml-org/llama.cpp/master/models/templates/Cohere2-MoE.jinja
llama-server -hf bartowski/North-Mini-Code-1.0-GGUF:Q6_K --port 8000 \
    --jinja --chat-template-file Cohere2-MoE.jinja -ngl 99 -c 32768 -fa on -ctk q8_0 -ctv q8_0

Loading a large model fails, or OOM / heavy swapping at long context (raise the GPU wired-memory limit)

By default macOS caps how much unified memory the GPU may wire — roughly two-thirds of the 48 GB (~32 GB safe / ~36 GB optimistic). Loading the ~24.6 GB Q6_K model fits that default, but Q8_0 (30.21 GB) with a large KV cache, or a very long context on Q6_K, can push against the cap. The fix is to raise the GPU wired-memory limit (macOS Sonoma 14 / Sequoia 15+):

sudo sysctl iogpu.wired_limit_mb=43008   # ~42 GB; leaves ~6 GB for macOS

This lets the GPU wire up to ~42 GB — enough for the Q6_K weights plus a healthy KV cache, or the tighter Q8_0 path. Always leave 6–16 GB of headroom for macOS; pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf if you want it to survive a restart); sudo sysctl iogpu.wired_limit_mb=0 restores the default. On macOS Monterey 12 / Ventura 13 the knob is sudo sysctl debug.iogpu.wired_limit=<bytes> instead. Watch Activity Monitor's Memory-Pressure gauge during a real agent task — a hard coding problem produces a long interleaved-thinking stream that grows the KV cache mid-generation, so size for the peak, not the idle load. If you routinely need the full 256K context, a Mac with ≥ 64 GB unified memory gives more headroom.

Tried to install FlashAttention / a CUDA toolkit / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FP8 path, and no FlashAttention wheel on macOS — llama.cpp uses its own Metal attention kernels (enabled by -fa on) and GGUF K-quants. Confirm your llama.cpp build is a Metal build (Homebrew's is; from source, Metal is on by default on macOS) and that -ngl 99 is offloading all layers to the GPU. If a generic tutorial tells you to pass -DGGML_CUDA=ON, install flash-attn, or set up an FP8 path, skip those steps entirely — the commands above are the complete Apple path.