How much VRAM does Ornith 1.0 9B need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Ornith 1.0 9B on Apple M2 Pro: Local Agentic Coding in 16GB Unified Memory via llama.cpp Metal + OpenHands

What You'll Build

A fully local, private agentic-coding setup: Ornith 1.0 9B — DeepReinforce's open (MIT) ~9B dense coding model — served as an OpenAI-compatible endpoint by llama.cpp using Apple's Metal backend (or Ollama) on an Apple M2 Pro with 16 GB unified memory, and driven by a coding agent (OpenHands as the lead, Aider as a lighter alternative) so the model can read your repo, run shell commands, and edit files. There is no NVIDIA GPU, no CUDA, no FlashAttention. Ornith produces <think> reasoning traces and native tool calls, so the agent can plan and act. This is the 9B — the family's smallest member, sized for a 16 GB Mac. The larger Ornith 1.0 35B does not fit a 16 GB machine (its Q4_K_M floor is 21.2 GB, and there is no smaller build) — see the note below.

Hardware data: Apple M2 Pro (16 GB unified memory) · Ornith 1.0 9B, GGUF Q8_0 (9.53GB) or Q6_K (7.36GB) · See benchmark data

ℹ️ This is a coding LLM, not a chat generalist. Ornith 1.0 is a self-improving family of open-source models for agentic coding (per the Ornith-1.0-9B-GGUF model card). The 9B is the family's smallest member — the card calls it the most lightweight member of the Ornith family, designed for efficient single-GPU deployment. It is post-trained on top of Gemma 4 and Qwen 3.5, is MIT-licensed, and reports SWE-bench Verified 69.4 on the model card's own evaluation table.

ℹ️ Unified memory is not VRAM. The M2 Pro has 16 GB of unified memory shared by the OS, CPU, and GPU — it is not 16 GB of dedicated VRAM, and not a pool the GPU gets all of. By default macOS only lets the GPU address roughly two-thirds of it (~10.5 GB via Metal's recommendedMaxWorkingSetSize on a sub-64 GB Mac). At 9.53 GB the max-fidelity Q8_0 weights fit that ~10.5 GB pool, but the margin is real, not generous — keep the context window modest (≤ 8k for a no-fuss fit), or raise the wired-memory limit for longer sessions (see Troubleshooting). Dropping to Q6_K (7.36 GB) frees ~2 GB more for the KV cache and macOS.

⚠️ The 35B does NOT fit a 16 GB Mac — use this 9B. The larger Ornith 1.0 35B is a different, 48 GB+ model. Its smallest build is Q4_K_M at 21.2 GB (there is no official Q3/Q2), which exceeds a 16 GB Mac's entire physical memory — let alone the ~10.5 GB the GPU can address. On a 16 GB M2 Pro the 9B is the right member of the family; the 35B needs a 48 GB or larger Apple Silicon Mac. Do not confuse the two.

Requirements

Component	Minimum	Tested
GPU / memory	16 GB unified memory (~10.5 GB GPU-addressable by default)	Apple M2 Pro (16-/19-core GPU, 16 GB unified memory)
RAM	Same pool — unified	16 GB unified
Storage	~10 GB for the Q8_0 GGUF (~5.6 GB for Q4_K_M)	9.53 GB (`ornith-1.0-9b-Q8_0.gguf`)
Software	llama.cpp (Metal, the macOS default) or Ollama; OpenHands or Aider client	llama.cpp `llama-server`, OpenHands

Model weights (GGUF, byte-verified from the Ornith-1.0-9B-GGUF repo tree):

Quant	On-disk size	Fit on M2 Pro (16GB unified, ~10.5GB addressable)
Q4_K_M	5.63GB	Fits with wide context room — the safe pick on 16 GB
Q5_K_M	6.47GB	Fits with large context room
Q6_K	7.36GB	Fits — near-lossless, leaves ~3 GB for KV cache and macOS
Q8_0	9.53GB	Recommended for max fidelity — fits the ~10.5 GB pool, but the margin is real; keep context ≤ 8k or raise the wired limit
BF16	17.92GB	Does not fit a 16 GB Mac — needs a much larger unified pool

Licensing. Ornith 1.0 is MIT-licensed (per the model card's license: mit and its highlight line: MIT licensed, globally accessible, and free from regional limitations). You can use it commercially and privately without revenue caps.

Installation

You have two runtimes. Pick one. On Apple Silicon there is nothing CUDA-shaped to install — no CUDA toolkit, no FP8 path, no FlashAttention wheel. llama.cpp gives you the most control over context and KV-cache flags; Ollama is the fastest to stand up. Both run on Apple's Metal GPU backend.

Option A — llama.cpp with Metal (recommended for full control)

llama.cpp's Metal backend runs the model on the Apple GPU and is enabled by default on macOS: "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs). Install a prebuilt binary via Homebrew (which ships a Metal build) or build from source per the llama.cpp README:

# Homebrew — ships a Metal-enabled build on macOS
brew install llama.cpp

# …or build from source (Metal is on by default on macOS)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Metal is the macOS default, so a plain cmake -B build already targets the GPU — there is no CUDA flag to set. llama.cpp pulls the GGUF straight from Hugging Face at launch (next section), so there is no separate download step.

Option B — Ollama

Install Ollama from ollama.com. Ollama is built on llama.cpp, uses Metal automatically on macOS, and can run any GGUF on the Hub directly — no Modelfile needed. No manual weight download; the run command below pulls the quant for you.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to select the quant — without a tag, llama.cpp defaults to Q4_K_M (llama-server docs):

# Q8_0 quant (max fidelity), offload all layers to the Apple GPU via Metal, generous context
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 32768 \
    --jinja

The model card's own quickstart shows the same server invocation with the full context window:

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF --port 8000 -c 262144

-ngl 99 (alias --n-gpu-layers) offloads every layer to the Apple GPU via Metal — there are no CUDA semantics involved; -c (--ctx-size) sets the context length; --jinja uses the model's built-in chat template so the <think> reasoning and tool-call blocks parse correctly.

How much context actually fits in 16 GB. Ornith 1.0 is a reasoning model — per the model card, the assistant turn opens with a <think>...</think> reasoning block before the final answer, and its context window is a large 262,144 tokens. Context is stored in the KV cache, which grows with the number of tokens and lives in unified memory on top of the weights. With Q8_0 (9.53 GB) loaded against the ~10.5 GB the GPU addresses by default, the margin is real — the command above sets a modest -c 32768 (32K) that stays inside the pool. To push toward the model's full 262,144-token window, quantize the KV cache: add -fa on (Flash Attention, required for quantized cache) and -ctk q8_0 -ctv q8_0 (--cache-type-k / --cache-type-v), which roughly halves KV-cache memory versus the default f16 with minimal quality impact (llama-server docs) — and raise the wired-memory limit (Troubleshooting) so the larger cache has room:

# Push toward a longer context on Q8_0 by quantizing the KV cache (raise the wired limit first)
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

If you want more context headroom without touching the memory limit, drop to Q6_K (7.36 GB) — that frees ~2 GB more of the 16 GB pool for the KV cache while staying near-lossless:

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 32768 --jinja

With Ollama

Pull and chat with the same GGUF straight from Hugging Face (the card documents this exact form). Ollama uses Metal automatically on macOS. Append a :quant tag to choose the quant; the default is Q4_K_M (HF × Ollama docs):

ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients. Its own context default is small; raise it deliberately rather than setting the model's full 256K on a 16 GB Mac.

Connect a coding agent

Ornith is optimized for terminal-based coding agents (verbatim from the model card), which directs you to point any OpenAI-compatible coding CLI at your Ornith endpoint by setting its base URL and API key. The wiring is identical to any other backend — point it at the llama.cpp Metal server (or Ollama's http://localhost:11434/v1).

OpenHands (lead choice). OpenHands is the harness DeepReinforce used to measure the 9B's headline SWE-bench Verified 69.4 (the model card footnote reads: SWE-Bench Verified, Pro and Multilingual: using OpenHands harness), so it is the officially-exercised agentic path for this model. Point it at your local server exactly as the card's OpenHands example does:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-9B"
export LLM_BASE_URL="http://localhost:8000/v1"   # or http://localhost:11434/v1 for Ollama
export LLM_API_KEY="EMPTY"

openhands

Aider (lighter alternative). For a simpler terminal pair-programmer against the same endpoint, Aider connects to any OpenAI-compatible API:

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY
aider --model openai/deepreinforce-ai/Ornith-1.0-9B

Recommended sampling for Ornith is temperature=0.6, top_p=0.95, top_k=20 (per the model card's quickstart note).

Results

Memory usage: The dense 9B loads entirely as its GGUF file — Q8_0 is 9.53 GB and Q6_K is 7.36 GB on disk (byte-verified from the GGUF repo tree). On the M2 Pro's 16 GB unified memory the GPU addresses only ~10.5 GB by default, so the max-fidelity Q8_0 fits but with a real margin — enough for a modest context (≤ 8k) at the default f16 cache without touching the memory limit. Q6_K (7.36 GB) frees ~2 GB more for context; Q4_K_M (5.63 GB) leaves the most headroom of all. BF16 (17.92 GB) does not fit a 16 GB Mac.
Model capability: On its own evaluation table the 9B reports SWE-bench Verified 69.4, Terminal-Bench 2.1 (Terminus-2) 43.1, and NL2Repo 27.2 (model card) — state-of-the-art among open models of comparable size, per the card. These are the model's own coding-benchmark scores, not hardware throughput on this GPU.
Speed: No local throughput benchmark exists for Ornith 1.0 9B on the Apple M2 Pro yet — this is a brand-new model and /check/ornith-1-0-9b/m2-pro has no benchmark rows. Token generation on Apple Silicon is bandwidth-bound (memory-bandwidth-limited), so throughput tracks the unified-memory bandwidth rather than any core count; we do not quote a tok/s figure rather than invent one or borrow one from different hardware.

For the full benchmark data, see /check/ornith-1-0-9b/m2-pro.

Troubleshooting

The reply is full of raw `<think>` / `<tool_call>` tags

Ornith is a reasoning model with native tool-calling: the assistant turn opens with a <think> … </think> block and emits <tool_call> blocks. If your client shows these as raw text, make sure the server applies the model's chat template — pass --jinja to llama-server, or use Ollama (which reads the GGUF's built-in tokenizer.chat_template). A correctly-templated server surfaces the reasoning as a separate reasoning_content field and the tool calls as OpenAI-style tool_calls, per the model card's quickstart note.

Out of memory / heavy swapping at long context (raise the GPU wired-memory limit)

At Q8_0 (9.53 GB) the weights sit close to the ~10.5 GB the M2 Pro GPU addresses by default, so the KV cache is what tips you over. If you OOM or start swapping when raising -c past a modest context, either lower -c, drop to Q6_K (7.36 GB) to free ~2 GB, quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0, or raise the GPU wired-memory limit (macOS Sonoma 14 / Sequoia 15+):

sudo sysctl iogpu.wired_limit_mb=13312   # 13 GB; leaves ~3 GB for macOS on a 16 GB Mac

This lets the GPU wire up to 13 GB instead of the ~10.5 GB default. On a 16 GB machine this is genuinely tight — leave at least 2.5–3 GB for macOS and watch Activity Monitor's Memory-Pressure gauge; if it goes yellow/red, back the limit down or shorten your context. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf); sudo sysctl iogpu.wired_limit_mb=0 restores the default. On macOS Monterey 12 / Ventura 13 the knob is sudo sysctl debug.iogpu.wired_limit=<bytes> instead. For ordinary agent sessions at Q8_0 with a modest context you will usually not need this; it matters when you push toward the model's long context window.

The model gets stuck in a recursive loop / repeats tokens

Community users report on the GGUF model card discussions that Ornith runs can fall into repetitive/recursive generation once the conversation grows long. This model-level failure mode is runtime-agnostic — it applies to the Metal backend just as to CUDA. If you hit it, keep sessions short and the effective context per turn modest — which on a 16 GB Mac you want to do for memory reasons anyway. Treat this as a community-reported behaviour, not a vendor-confirmed fix.

The agent won't edit files / botches tool calls

Ornith's chat template is Qwen-shaped (it is post-trained on a Gemma 4 + Qwen 3.5 lineage). If agentic editing misbehaves with the default template in an agent client, the community workaround is to pass a corrected Qwen-family chat template to llama-server via --chat-template-file <path-to-template.jinja>. This is a model-level issue independent of the Metal runtime — treat it as a community-reported workaround, not an official fix.

Tried to install FlashAttention / a CUDA toolkit / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FP8 path, and no FlashAttention wheel on macOS — llama.cpp uses its own Metal attention kernels (enabled by -fa on) and GGUF K-quants. Confirm your llama.cpp build is a Metal build (Homebrew's is; from source, Metal is on by default on macOS) and that -ngl 99 is offloading all layers to the GPU. If a generic Ornith or llama.cpp tutorial tells you to pass -DGGML_CUDA=ON, install flash-attn, or set up an FP8 path, skip those steps entirely — the commands above are the complete Apple path.