How much VRAM does Ornith 1.0 35B need?

About 48 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Ornith 1.0 35B on Apple M4 Max: Local Agentic Coding via llama.cpp Metal + OpenHands (48GB Unified Memory)

What You'll Build

A local, private agentic-coding setup: Ornith 1.0 35B — DeepReinforce's open (MIT) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp using Apple's Metal backend on an Apple M4 Max with 48 GB unified memory, driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. There is no NVIDIA GPU, no CUDA, no FlashAttention. Ornith is a reasoning model: each turn opens with a <think> chain-of-thought before the answer, and it emits native <tool_call> blocks for agentic workflows. Unlike the tight-fit 24 GB NVIDIA cards — where Q4_K_M (21.2 GB) is the only quant that fits — the M4 Max's 48 GB unified pool is finally roomy for the 35B: this recipe leads with the near-lossless Q6_K quant (28.51 GB) and leaves comfortable room for the KV cache and a real context window.

Hardware data: Apple M4 Max (48 GB unified memory) · Ornith 1.0 35B Q6_K GGUF (28.51 GB weights) · roomy fit, real context · See benchmark data

ℹ️ Unified memory is not VRAM. The M4 Max has 48 GB of unified memory shared by the OS, CPU, and GPU — it is not 48 GB of dedicated VRAM. By default macOS lets the GPU wire only roughly two-thirds to three-quarters of it (~32 GB safe / ~36 GB optimistic via Metal's recommendedMaxWorkingSetSize); realistically ~40 GB is usable for the model after the system's share. That is still roomy for the 35B — the recommended Q6_K (28.51 GB) fits with comfortable room for the KV cache and a real context window; Q8_0 (36.90 GB) fits for max fidelity but leaves little headroom on 48 GB (tight). This is the opposite of the 24 GB cards, where Q4_K_M was a squeeze. Loading a large model may require raising the GPU wired-memory limit once — see Troubleshooting.

ℹ️ An MoE keeps all experts resident — the file size is the memory cost. Ornith 1.0 35B is a Mixture-of-Experts. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in memory, so the footprint is the full quant file — 28.51 GB at Q6_K, not some smaller "active" fraction. Do not expect a low active-parameter count to shrink the memory requirement.

ℹ️ Under 48 GB of unified memory? Use the 9B. There is no official Q3/Q2 build of the 35B — Q4_K_M (21.2 GB) is the floor. On a smaller Apple Silicon Mac (e.g. an 8–24 GB unified-memory chip), use Ornith 1.0 9B instead (Q4_K_M is ~5.6 GB); it is the same agentic-coding family sized for smaller machines. A 24 GB-tier unified-memory Mac can run the 35B at Q4_K_M (tight, like the 24 GB NVIDIA cards); the 48 GB M4 Max is the first tier that runs it comfortably.

Requirements

Component	Minimum	Tested
GPU / memory	48 GB unified memory (~32 GB GPU-addressable by default, raisable)	Apple M4 Max (up to 40-core GPU, unified memory)
RAM	Same pool — unified	48 GB unified
Storage	~30 GB (the Q6_K GGUF is 28.51 GB)	28.51 GB (`ornith-1.0-35b-Q6_K.gguf`)
Software	llama.cpp (Metal, the macOS default) or Ollama; Python 3.10+ for OpenHands	llama.cpp `llama-server`, OpenHands

Published quant sizes per the Ornith-1.0-35B-GGUF file tree are Q4_K_M 21.17 GB (21,166,757,760 bytes), Q5_K_M 24.73 GB, Q6_K 28.51 GB, Q8_0 36.90 GB, and BF16 69.38 GB. On the M4 Max's 48 GB unified memory, the binding constraint is addressable unified memory, not raw capacity: the GPU wires ~32 GB safely by default (~40 GB usable after the system's share once the limit is raised). Q6_K (28.51 GB) is the recommended default — near-lossless and leaving comfortable room for the KV cache and a real context; Q8_0 (36.90 GB) fits for maximum fidelity but leaves little headroom (tight — plan to raise the wired limit and keep context modest); Q4_K_M / Q5_K_M run with very large context to spare. BF16 (69.38 GB) does not fit on a 48 GB Mac. The model is MIT-licensed, per the Ornith-1.0-35B-GGUF model card.

Installation

1. Install llama.cpp with Metal (the macOS default)

On Apple Silicon there is nothing CUDA-shaped to install — no CUDA toolkit, no FP8 path, no FlashAttention wheel. llama.cpp's Metal backend runs the model on the Apple GPU and is enabled by default on macOS: "On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU." (llama.cpp build docs). Install a prebuilt binary via Homebrew (which ships a Metal build) or build from source per the llama.cpp README:

# Homebrew — ships a Metal-enabled build on macOS
brew install llama.cpp

# …or build from source (Metal is on by default on macOS)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

-DGGML_METAL=ON is the macOS default, so a plain cmake -B build already targets the GPU; it is spelled out here only to be explicit.

2. Download the Q6_K GGUF

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>[:quant] per the canonical llama.cpp flag definitions; name the Q6_K quant explicitly since it is the recommended default for this roomy tier (the flag's own default is Q4_K_M, which you would use only for maximum context):

# Downloads ornith-1.0-35b-Q6_K.gguf (28.51 GB) into the llama.cpp cache on first use
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q6_K --port 8000

The first launch downloads ~28.5 GB; subsequent launches reuse the cached file. If you prefer to download the file explicitly first, grab ornith-1.0-35b-Q6_K.gguf from the GGUF repo files tab and pass it with -m <path> instead. For maximum fidelity on this 48 GB Mac use :Q8_0 (36.90 GB, tight); for very large context drop to :Q4_K_M (21.2 GB) or :Q5_K_M (24.73 GB).

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint — it is runtime-agnostic, so the same wiring works against llama.cpp's Metal server as against a CUDA one. The Ornith-1.0-35B-GGUF model card documents the CLI install:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host.

Running

1. Serve Ornith with a generous, KV-quantized context

The model card's llama.cpp example, per the Ornith-1.0-35B-GGUF model card, is llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF --port 8000 -c 262144 — the full 256K context. On the M4 Max the 48 GB unified pool allows a much larger context than the 24 GB NVIDIA cards (which were capped near 16K), but the full 256K KV cache still costs many GB on top of the weights, so stay honest and size it deliberately. All flags below are from the canonical llama.cpp flag definitions (-c/--ctx-size, -ctk/--cache-type-k, -ctv/--cache-type-v, -fa/--flash-attn, -ngl/--gpu-layers):

llama-server \
    -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    -fa on \
    -ctk q8_0 -ctv q8_0

-ngl 99 offloads all layers to the Apple GPU via Metal (the 28.51 GB Q6_K file must sit in memory — see the MoE note above). On macOS this is the Metal offload; there are no CUDA semantics involved.
-c 65536 gives a generous 64K context — far beyond what the 24 GB cards allowed — while keeping room to spare on Q6_K. To push toward the full 262K, raise -c further and keep the KV cache quantized (below); watch Activity Monitor's Memory-Pressure gauge as you climb, because a 256K KV cache still costs many GB on top of the 28.51 GB of weights.
-fa on enables Flash Attention, and -ctk q8_0 -ctv q8_0 quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which is what lets you push context toward the model's 262K ceiling without exhausting unified memory. Reasoning models spend thousands of tokens per turn inside <think> blocks, so KV pressure is higher than a plain chat model at the same context setting.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. (Ollama is a valid alternative — ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF, per the same model card — and on macOS Ollama uses Metal automatically; but Ollama's own context default is small and must likewise be raised deliberately rather than set to the model's full 256K.)

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs. The wiring is identical to any other backend — point it at the llama.cpp Metal server (or Ollama's http://localhost:11434/v1). Using the CLI env-var form documented on the Ornith-1.0-35B-GGUF model card:

export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-35B"
export LLM_BASE_URL="http://localhost:8000/v1"   # or http://localhost:11434/v1 for Ollama
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use Ornith to plan, run shell commands, and edit files in your workspace. The model's <think> reasoning drives its planning and its native tool-calling drives the file/shell actions.

Results

Speed: No community benchmark for Ornith 1.0 35B on the Apple M4 Max exists yet — /check/ornith-1-0-35b/m4-max has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. Token generation on Apple Silicon is bandwidth-bound (memory-bandwidth-limited), so throughput tracks the unified-memory bandwidth rather than any core count; we do not publish a figure without a cited measurement. (Note also that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is <think> content you discard.)
Memory usage: The Q6_K weights are 28.51 GB and must be held resident, leaving comfortable room within the M4 Max's ~40 GB usable unified pool for the KV cache and activations — which is why this tier runs a real (64K+) context where the 24 GB cards were capped near 16K. Q8_0 (36.90 GB) fits for maximum fidelity but is tight; BF16 (69.38 GB) does not fit on 48 GB. Weight and per-quant file sizes are verified via the Ornith-1.0-35B-GGUF file tree.
Quality notes: DeepReinforce reports the 35B scoring SWE-bench Verified 75.6 and Terminal-Bench 2.1 (Terminus-2) 64.2 — state-of-the-art among similarly-sized open models — per the vendor benchmark table on the Ornith-1.0-35B-GGUF model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. On this roomy tier the near-lossless Q6_K default preserves essentially all of that quality; Q4_K_M (which you would pick only for maximum context) trades a little. Recommended sampling per the model card is temperature 0.6, top_p 0.95, top_k 20.

For the full benchmark data, see /check/ornith-1-0-35b/m4-max.

Troubleshooting

Loading a large model fails or falls back to a slow path (raise the GPU wired-memory limit)

By default macOS caps how much unified memory the GPU may wire — roughly two-thirds of the 48 GB. Loading a ~28.51 GB model (Q6_K), let alone Q8_0 at 36.90 GB, can fail or fall back to slow paths against that default cap. The fix is to raise the GPU wired-memory limit (macOS Sonoma 14 / Sequoia 15+):

sudo sysctl iogpu.wired_limit_mb=40960   # 40 GB; leaves ~8 GB for macOS

This lets the GPU wire up to 40 GB — enough for the Q6_K weights plus a healthy KV cache. Always leave 8–16 GB of headroom for macOS; pushing to 100% causes instability. The setting is temporary and resets on reboot (persist it via /etc/sysctl.conf if you want it to survive a restart); sudo sysctl iogpu.wired_limit_mb=0 restores the default. Watch Activity Monitor's Memory-Pressure gauge while loading. For the tight Q8_0 path set the limit higher (e.g. 43008 ≈ 42 GB) and keep the context modest, since the weights alone are 36.90 GB.

Out of memory / heavy swapping at long context

The 28.51 GB Q6_K weights leave comfortable room in the ~40 GB usable pool, so ordinary use rarely pressures memory. If you push context toward the model's 262K ceiling, the KV cache grows into many GB on top of the weights and can approach the limit. Keep -fa on with -ctk q8_0 -ctv q8_0 to halve KV-cache memory, lower -c if pressure climbs, and raise the wired limit (above) before reaching for the largest contexts. Watch Activity Monitor during a real agent task — a hard coding problem produces a long <think> block that grows the KV cache mid-generation, so size for the peak, not the idle load. If you routinely need the full 256K context, a Mac with ≥ 64 GB unified memory gives more headroom.

The model gets stuck in a recursive loop / repeats tokens

Multiple community users report on the GGUF model card discussions that llama.cpp runs can fall into repetitive/recursive generation once the conversation grows past roughly 22K tokens. This model-level failure mode is runtime-agnostic — it applies to the Metal backend just as to CUDA. If you hit it, keep sessions short and the effective context per turn well under that threshold. These are community reports on the model card discussions, not a vendor-confirmed fix.

The agent won't edit files / botches tool calls

Community users report (on the GGUF model card discussions) that with the default chat template the model sometimes fails to edit files or emit clean tool calls in agent clients. The community workaround shared in that thread is to pass a corrected Qwen-family chat template to llama-server via --chat-template-file <path-to-template.jinja>. Because Ornith is post-trained on a Gemma 4 + Qwen 3.5 lineage, its chat template is Qwen-shaped; if agentic editing misbehaves, try a corrected template as those users did. This is a model-level issue independent of the Metal runtime. Treat it as a community-reported workaround, not an official fix.

Tried to install FlashAttention / a CUDA toolkit / a `cu12x` wheel and it failed

None of those apply on Apple Silicon. There is no CUDA, no FP8 path, and no FlashAttention wheel on macOS — llama.cpp uses its own Metal attention kernels (enabled by -fa on) and GGUF K-quants. Confirm your llama.cpp build is a Metal build (Homebrew's is; from source, Metal is on by default on macOS) and that -ngl 99 is offloading all layers to the GPU. If a generic Ornith or llama.cpp tutorial tells you to pass -DGGML_CUDA=ON, install flash-attn, or set up an FP8 path, skip those steps entirely — the commands above are the complete Apple path.