How much VRAM does Ornith 1.0 35B need?

About 48 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Ornith 1.0 35B on Apple M2 Max: Local Agentic Coding via llama.cpp Metal + OpenHands (64GB Unified Memory)

What You'll Build

A local, private agentic-coding setup: Ornith 1.0 35B — DeepReinforce's open (MIT) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp using Apple's Metal backend on an Apple M2 Max with 64 GB unified memory, driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. There is no NVIDIA GPU, no CUDA, no FlashAttention. Ornith is a reasoning model: each turn opens with a <think> chain-of-thought before the answer, and it emits native <tool_call> blocks for agentic workflows. Unlike the tight-fit 24 GB NVIDIA cards — where Q4_K_M (21.2 GB) is the only quant that fits — the M2 Max's 64 GB unified pool is the roomiest Apple tier for the 35B: it runs the near-lossless Q6_K quant (28.51 GB) with a long context to spare, and even the higher-fidelity Q8_0 (36.90 GB) fits with real headroom.

Hardware data: Apple M2 Max (64 GB unified memory) · Ornith 1.0 35B Q6_K GGUF (28.51 GB) or Q8_0 (36.90 GB) · roomy fit, long context · See benchmark data

ℹ️ Unified memory is not VRAM. The M2 Max has 64 GB of unified memory shared by the OS, CPU, and GPU — it is not 64 GB of dedicated VRAM. By default macOS lets the GPU address roughly 75% of it (~48 GB via Metal's recommendedMaxWorkingSetSize on a ≥ 64 GB Mac). That ~48 GB pool is roomy for the 35B: the recommended Q6_K (28.51 GB) fits with a long context to spare, and Q8_0 (36.90 GB) fits for maximum fidelity with real headroom left for the KV cache — the first Apple tier where Q8_0 is comfortable rather than a squeeze. BF16 (69.38 GB) is out of reach — it exceeds even the ~48 GB addressable pool and would need a machine with ~80 GB+ unified memory. Loading Q8_0 or pushing a very long context may still want the GPU wired-memory limit raised once — see Troubleshooting.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost. Ornith 1.0 35B is a Mixture-of-Experts. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in memory, so the footprint is the full quant file — 28.51 GB at Q6_K, 36.90 GB at Q8_0, not some smaller "active" fraction. Do not expect a low active-parameter count to shrink the memory requirement.

ℹ️ Under 48 GB of unified memory? Use the 9B. There is no official Q3/Q2 build of the 35B — Q4_K_M (21.2 GB) is the floor. On a smaller Apple Silicon Mac (e.g. an 8–24 GB unified-memory chip), use Ornith 1.0 9B instead (Q4_K_M is ~5.6 GB); it is the same agentic-coding family sized for smaller machines. A 24 GB-tier unified-memory Mac can run the 35B at Q4_K_M (tight, like the 24 GB NVIDIA cards); the 48 GB Macs run Q6_K comfortably; this 64 GB M2 Max is the tier that runs Q8_0 comfortably too.

Requirements

Component	Minimum	Tested
GPU / memory	48 GB unified memory (Q6_K); 64 GB for a comfortable Q8_0	Apple M2 Max (38-core GPU, 64 GB unified memory)
RAM	Same pool — unified	64 GB unified
Storage	~30 GB (the Q6_K GGUF is 28.51 GB)	28.51 GB (`ornith-1.0-35b-Q6_K.gguf`), 36.90 GB for Q8_0
Software	llama.cpp (Metal, the macOS default) or Ollama; Python 3.10+ for OpenHands	llama.cpp `llama-server`, OpenHands

Published quant sizes per the Ornith-1.0-35B-GGUF file tree are Q4_K_M 21.17 GB (21,166,757,760 bytes), Q5_K_M 24.73 GB, Q6_K 28.51 GB, Q8_0 36.90 GB, and BF16 69.38 GB. On the M2 Max's 64 GB unified memory, the binding constraint is addressable unified memory, not raw capacity: the GPU addresses ~48 GB (~75%) by default. Q6_K (28.51 GB) is the recommended default — near-lossless and leaving ~20 GB of the addressable pool for a long KV cache; Q8_0 (36.90 GB) fits comfortably for maximum fidelity with ~11 GB still free for the KV cache — the first Apple tier where Q8_0 is roomy; Q4_K_M / Q5_K_M run with the model's largest contexts. BF16 (69.38 GB) does not fit — it exceeds the ~48 GB addressable pool and needs ~80 GB+ of unified memory. The model is MIT-licensed, per the Ornith-1.0-35B-GGUF model card.

Installation

1. Install llama.cpp with Metal (the macOS default)

On Apple Silicon there is nothing CUDA-shaped to install — no CUDA toolkit, no FP8 path, no FlashAttention wheel. llama.cpp's Metal backend runs the model on the Apple GPU and is enabled by default on macOS: "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs). Install a prebuilt binary via Homebrew (which ships a Metal build) or build from source per the llama.cpp README:

# Homebrew — ships a Metal-enabled build on macOS
brew install llama.cpp

# …or build from source (Metal is on by default on macOS)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Metal is the macOS default, so a plain cmake -B build already targets the GPU — there is no CUDA flag to set.

2. Download the Q6_K GGUF (or Q8_0 for max fidelity)

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>[:quant] per the canonical llama.cpp flag definitions; name the quant explicitly (the flag's own default is Q4_K_M, which you would use only for maximum context):

# Downloads ornith-1.0-35b-Q6_K.gguf (28.51 GB) into the llama.cpp cache on first use
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q6_K --port 8000

The first launch downloads ~28.5 GB; subsequent launches reuse the cached file. If you prefer to download the file explicitly first, grab ornith-1.0-35b-Q6_K.gguf from the GGUF repo files tab and pass it with -m <path> instead. On this 64 GB Mac, Q8_0 (:Q8_0, 36.90 GB) is a comfortable maximum-fidelity option — the first Apple tier where it fits with real headroom; for the model's very largest contexts drop to :Q4_K_M (21.2 GB) or :Q5_K_M (24.73 GB).

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint — it is runtime-agnostic, so the same wiring works against llama.cpp's Metal server as against a CUDA one. The Ornith-1.0-35B-GGUF model card documents the CLI install:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host.

Running

1. Serve Ornith with a long, KV-quantized context

The model card's llama.cpp example, per the Ornith-1.0-35B-GGUF model card, is llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF --port 8000 -c 262144 — the full 256K context. On the M2 Max the 64 GB unified pool allows a much larger context than the 24 GB NVIDIA cards (which were capped near 16K), but the full 256K KV cache still costs many GB on top of the weights, so size it deliberately. All flags below are from the canonical llama.cpp flag definitions (-c/--ctx-size, -ctk/--cache-type-k, -ctv/--cache-type-v, -fa/--flash-attn, -ngl/--gpu-layers):

llama-server \
    -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 98304 \
    -fa on \
    -ctk q8_0 -ctv q8_0

-ngl 99 offloads all layers to the Apple GPU via Metal (the 28.51 GB Q6_K file must sit in memory — see the MoE note above). On macOS this is the Metal offload; there are no CUDA semantics involved.
-c 98304 gives a very large 96K context — far beyond what the 24 GB cards allowed — while the ~48 GB addressable pool still leaves room over the Q6_K weights. To push toward the full 262K, raise -c further and keep the KV cache quantized (below); watch Activity Monitor's Memory-Pressure gauge as you climb, because a 256K KV cache still costs many GB on top of the 28.51 GB of weights. On Q8_0 (36.90 GB) the same 64 GB pool holds a healthy — if smaller — context.
-fa on enables Flash Attention, and -ctk q8_0 -ctv q8_0 quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which is what lets you push context toward the model's 262K ceiling without exhausting unified memory. Reasoning models spend thousands of tokens per turn inside <think> blocks, so KV pressure is higher than a plain chat model at the same context setting.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. (Ollama is a valid alternative — ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF, per the same model card — and on macOS Ollama uses Metal automatically; but Ollama's own context default is small and must likewise be raised deliberately rather than set to the model's full 256K.)

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs. The wiring is identical to any other backend — point it at the llama.cpp Metal server (or Ollama's http://localhost:11434/v1). Using the CLI env-var form documented on the Ornith-1.0-35B-GGUF model card:

export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-35B"
export LLM_BASE_URL="http://localhost:8000/v1"   # or http://localhost:11434/v1 for Ollama
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use Ornith to plan, run shell commands, and edit files in your workspace. The model's <think> reasoning drives its planning and its native tool-calling drives the file/shell actions.

Results

Speed: No community benchmark for Ornith 1.0 35B on the Apple M2 Max exists yet — /check/ornith-1-0-35b/m2-max has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. Token generation on Apple Silicon is bandwidth-bound (memory-bandwidth-limited), so throughput tracks the unified-memory bandwidth rather than any core count; we do not publish a figure without a cited measurement. (Note also that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is <think> content you discard.)
Memory usage: The Q6_K weights are 28.51 GB and must be held resident, leaving ~20 GB of the M2 Max's ~48 GB addressable pool for the KV cache and activations — which is why this tier runs the model's near-full context where the 24 GB cards were capped near 16K. Q8_0 (36.90 GB) fits comfortably for maximum fidelity with ~11 GB still free; BF16 (69.38 GB) does not fit — it needs ~80 GB+ of unified memory. Weight and per-quant file sizes are verified via the Ornith-1.0-35B-GGUF file tree.
Quality notes: DeepReinforce reports the 35B scoring SWE-bench Verified 75.6 and Terminal-Bench 2.1 (Terminus-2) 64.2 — state-of-the-art among similarly-sized open models — per the vendor benchmark table on the Ornith-1.0-35B-GGUF model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. On this roomy tier the near-lossless Q6_K default preserves essentially all of that quality, and Q8_0 preserves practically all of it; Q4_K_M (which you would pick only for maximum context) trades a little. Recommended sampling per the model card is temperature 0.6, top_p 0.95, top_k 20.

For the full benchmark data, see /check/ornith-1-0-35b/m2-max.

Troubleshooting

Loading a large model fails or falls back to a slow path (raise the GPU wired-memory limit)

By default macOS caps how much unified memory the GPU may wire — roughly 75% of the 64 GB (~48 GB). Loading the ~28.51 GB Q6_K model fits that default with ease, but Q8_0 at 36.90 GB plus a large KV cache, or a very long context on Q6_K, can approach the cap. The fix is to raise the GPU wired-memory limit (macOS Sonoma 14 / Sequoia 15+):

sudo sysctl iogpu.wired_limit_mb=57344   # 56 GB; leaves ~8 GB for macOS

This lets the GPU wire up to 56 GB — enough for the Q8_0 weights plus a healthy KV cache, or a very long context on Q6_K. Always leave 8–16 GB of headroom for macOS; pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf if you want it to survive a restart); sudo sysctl iogpu.wired_limit_mb=0 restores the default. On macOS Monterey 12 / Ventura 13 the knob is sudo sysctl debug.iogpu.wired_limit=<bytes> instead. Watch Activity Monitor's Memory-Pressure gauge while loading.

Out of memory / heavy swapping at long context

The 28.51 GB Q6_K weights leave comfortable room in the ~48 GB addressable pool, so ordinary use rarely pressures memory. If you push context toward the model's 262K ceiling — or run Q8_0 (36.90 GB) with a long context — the KV cache grows into many GB on top of the weights and can approach the limit. Keep -fa on with -ctk q8_0 -ctv q8_0 to halve KV-cache memory, lower -c if pressure climbs, and raise the wired limit (above) before reaching for the largest contexts. Watch Activity Monitor during a real agent task — a hard coding problem produces a long <think> block that grows the KV cache mid-generation, so size for the peak, not the idle load. This 64 GB tier gives the most Apple headroom short of a 96 GB+ machine.

The model gets stuck in a recursive loop / repeats tokens

Multiple community users report on the GGUF model card discussions that llama.cpp runs can fall into repetitive/recursive generation once the conversation grows past roughly 22K tokens. This model-level failure mode is runtime-agnostic — it applies to the Metal backend just as to CUDA. If you hit it, keep sessions short and the effective context per turn well under that threshold. These are community reports on the model card discussions, not a vendor-confirmed fix.

The agent won't edit files / botches tool calls

Community users report (on the GGUF model card discussions) that with the default chat template the model sometimes fails to edit files or emit clean tool calls in agent clients. The community workaround shared in that thread is to pass a corrected Qwen-family chat template to llama-server via --chat-template-file <path-to-template.jinja>. Because Ornith is post-trained on a Gemma 4 + Qwen 3.5 lineage, its chat template is Qwen-shaped; if agentic editing misbehaves, try a corrected template as those users did. This is a model-level issue independent of the Metal runtime. Treat it as a community-reported workaround, not an official fix.

Tried to install FlashAttention / a CUDA toolkit / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FP8 path, and no FlashAttention wheel on macOS — llama.cpp uses its own Metal attention kernels (enabled by -fa on) and GGUF K-quants. Confirm your llama.cpp build is a Metal build (Homebrew's is; from source, Metal is on by default on macOS) and that -ngl 99 is offloading all layers to the GPU. If a generic Ornith or llama.cpp tutorial tells you to pass -DGGML_CUDA=ON, install flash-attn, or set up an FP8 path, skip those steps entirely — the commands above are the complete Apple path.