self-hosted/ai
§01·model · /models

North Mini Code 1.0

llmactiveApache-2.0

Cohere Labs' first developer-focused model — an open, Apache-2.0 licensed agentic-coding LLM built for software engineering and terminal tasks (not general chat). A 30B-A3B Mixture-of-Experts (128 experts, 8 active per token; ~3B active) with a 256K context window, interleaved sliding-window/global attention, native tool-calling via JSON-schema, and interleaved thinking for multi-step agent loops. Vendor evals: 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2 (SFT), 61.0% pass@1 with mini-SWE-Agent. Runs locally via llama.cpp/Ollama serving a GGUF quant behind an OpenAI-compatible API, paired with an agent client (OpenHands, Aider, Cline). Because an MoE keeps all 128 experts resident in VRAM, footprint is the full quant file: Q4_K_M is ~17.5 GB (comfortable on 24 GB), down to Q2_K ~10.3 GB for tighter cards; long 256K context needs 32 GB+ or KV-cache quantization. Notably permissive vs Cohere Labs' usual non-commercial releases.

§02·GPUs that run this model
8 total
GPUVRAMSeriesBest speedMin VRAMWorksBenchmarksRecipe
Apple M2 Max64GBapple~0recipecheck ↗
Apple M3 Max48GBapple~0recipecheck ↗
RTX 309024GB30~0recipecheck ↗
RTX 3090 Ti24GB30~0recipecheck ↗
RTX 408016GB40~0recipecheck ↗
RTX 409024GB40~0recipecheck ↗
RTX 509032GB50~0recipecheck ↗
RX 7900 XTX24GBamd~0recipecheck ↗

benchmarked·~ runs via recipe (not benchmarked)· untested·doesn't fit