self-hosted/ai
§01·recipe · llm

Ornith 1.0 35B on RTX 5090: Comfortable Local Agentic Coding via llama.cpp + OpenHands (32GB Tier)

llmadvanced32GB+ VRAMJul 3, 2026

This advanced recipe sets up Ornith 1.0 35B on the RTX 5090, needing about 32 GB of VRAM.

models
tools
prerequisites
  • NVIDIA RTX 5090 (32GB VRAM) — the comfortable 35B tier: Q4_K_M with a genuinely useful context, or Q5_K_M for higher fidelity (see below)
  • Python 3.10+ (for the OpenHands agent client)
  • CUDA toolkit 12.8+ and a recent llama.cpp build with CUDA support — Blackwell (sm_120) needs current tooling (see Troubleshooting); or Ollama on a recent release
  • ~22GB free disk for the Q4_K_M GGUF (or ~25GB for Q5_K_M)

What You'll Build

A local, private agentic-coding setup: Ornith 1.0 35B — DeepReinforce's open (MIT) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp on a single 32GB RTX 5090, driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. Ornith is a reasoning model: each turn opens with a <think> chain-of-thought before the answer, and it emits native <tool_call> blocks for agentic workflows. On a 32GB card the 35B stops being a tight squeeze: you can run the Q4_K_M quant (21.2 GB on disk) with a genuinely useful working context, or step up to Q5_K_M (24.73 GB) for higher fidelity with a moderate context.

Hardware data: RTX 5090 (32GB VRAM) · Ornith 1.0 35B Q4_K_M GGUF (21.2 GB weights) or Q5_K_M (24.73 GB) · comfortable fit, real context headroom · See benchmark data

32GB lifts the binding constraint — context is no longer the wall. The 24GB cards (RTX 3090/4090) hold the 21.2 GB Q4_K_M weights with only ~2–3 GB left for the KV cache, which forces context down to ~16K. The RTX 5090's 32GB leaves roughly ~10 GB free after the Q4 weights, so you can run a genuinely useful working context (-c 65536 and up) instead of a cramped 16K cap — or trade that headroom for fidelity by moving up to Q5_K_M (24.73 GB) at a moderate context. Read the Running section for the exact flags.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. Ornith 1.0 35B is a Mixture-of-Experts. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in VRAM, so the memory footprint is the full quant file — 21.2 GB at Q4_K_M, 24.73 GB at Q5_K_M — not some smaller "active" fraction. Do not expect a low active-parameter count to shrink the VRAM requirement.

ℹ️ Card too small? Use the 9B. There is no official Q3/Q2 build of the 35B — Q4_K_M (21.2 GB) is the floor, so cards below 24GB cannot run this model from the official GGUF, and 24GB cards run it only tight. On an 8–16 GB card, use Ornith 1.0 9B instead (Q4_K_M is ~5.6 GB); it is the same agentic-coding family sized for smaller rigs.

Requirements

ComponentMinimumTested
GPU32GB VRAM (for Q4 with real context, or Q5_K_M)RTX 5090 (32GB, Blackwell GB202, sm_120)
RAM16GB system RAM (32GB comfortable for the agent + repo)
Storage~22GB (Q4_K_M is 21.2 GB) or ~25GB (Q5_K_M is 24.73 GB)21.17 GB (ornith-1.0-35b-Q4_K_M.gguf)
Softwarellama.cpp (recent CUDA 12.8+ build) or Ollama; Python 3.10+ for OpenHandsllama.cpp llama-server, OpenHands

The Q4_K_M GGUF file is 21.17 GB (21,166,757,760 bytes) per the Ornith-1.0-35B-GGUF file tree; the other published quants are Q5_K_M 24.73 GB, Q6_K 28.51 GB, Q8_0 36.90 GB, and BF16 69.38 GB. On the 32GB RTX 5090, Q4_K_M and Q5_K_M fit comfortably, Q6_K (28.51 GB) fits but is tight (little room left for the KV cache), and Q8_0 (36.90 GB) and BF16 (69.38 GB) exceed 32 GB and do not fit. The model is MIT-licensed, per the Ornith-1.0-35B-GGUF model card.

Installation

1. Install llama.cpp with CUDA support (Blackwell needs recent tooling)

The RTX 5090 is Blackwell (GB202, compute capability 12.0 / sm_120). This is the one card-specific gotcha: Blackwell's sm_120 is only supported by a recent CUDA toolkit (12.8 or newer) and a recent llama.cpp build — older prebuilt binaries or older CUDA versions predate Blackwell and will fail to use the GPU (see Troubleshooting). The GGUF quants are integer formats, so Blackwell's FP4/FP8 tensor cores are irrelevant here — no special quant or FP8 path is needed. Install a prebuilt binary via Homebrew (macOS/Linux) or, to be sure of Blackwell support, build from source against a current CUDA per the llama.cpp README:

# Homebrew (Linux/macOS) — ships a CUDA-enabled build where available
brew install llama.cpp

# …or build from source with CUDA, targeting Blackwell sm_120 (needs CUDA 12.8+)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j

2. Download the Q4_K_M GGUF

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>[:quant], and per the canonical llama.cpp flag definitions the quant is optional and defaults to Q4_K_M — which is the quant this recipe recommends as the default (it leaves the most context headroom on 32GB):

# Downloads ornith-1.0-35b-Q4_K_M.gguf (21.2 GB) into the llama.cpp cache on first use
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M --port 8000

The first launch downloads ~21 GB; subsequent launches reuse the cached file. If you prefer to download the file explicitly first, grab ornith-1.0-35b-Q4_K_M.gguf from the GGUF repo files tab and pass it with -m <path> instead. To run the higher-fidelity Q5_K_M (24.73 GB) instead, swap the quant suffix: -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q5_K_M.

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint. The Ornith-1.0-35B-GGUF model card documents the CLI install:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host.

Running

1. Serve Ornith with a real working context

The model card's llama.cpp example, per the Ornith-1.0-35B-GGUF model card, is llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF --port 8000 -c 262144 — but -c 262144 is the full 256K context; the KV cache for that runs to tens of GB and will not fit on a 32GB card alongside the 21.2 GB of Q4 weights. The RTX 5090's headroom means you don't have to cramp context to 16K the way a 24GB card does — a 64K working window fits comfortably at Q4. All flags below are from the canonical llama.cpp flag definitions (-c/--ctx-size, -ctk/--cache-type-k, -ctv/--cache-type-v, -fa/--flash-attn, -ngl/--gpu-layers):

llama-server \
    -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    -fa on
  • -ngl 99 offloads all layers to the GPU (the 21.2 GB Q4 file must sit in VRAM — see the MoE note above).
  • -c 65536 gives a genuinely useful 64K working context — far beyond the ~16K a 24GB card is forced into. Reasoning models spend thousands of tokens per turn inside <think> blocks, so KV pressure is higher than a plain chat model at the same context setting; watch nvidia-smi (see Troubleshooting) and raise or lower -c to taste.
  • -fa on enables Flash Attention. On 32GB the fp16 KV cache fits at 64K, so KV quantization isn't required at this context — but to push further (e.g. -c 131072 or beyond) add -ctk q8_0 -ctv q8_0 to quantize the K and V caches to 8-bit, roughly halving KV-cache memory. Even so, the full 262K window still needs KV quantization and eats heavily into 32 GB — don't expect the whole context to run unquantized on this card.

For higher fidelity, run Q5_K_M (24.73 GB) instead — swap the quant suffix and drop context to a moderate window (e.g. -c 32768, adding -ctk q8_0 -ctv q8_0 if needed), since the larger weights leave less KV room:

llama-server \
    -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q5_K_M \
    --port 8000 \
    -ngl 99 \
    -c 32768 \
    -fa on \
    -ctk q8_0 -ctv q8_0

Both expose an OpenAI-compatible API at http://localhost:8000/v1. (Ollama is a valid alternative — ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF, per the same model card — but Ollama's own context default is small and must likewise be raised deliberately rather than set to the model's full 256K.)

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs. Using the CLI env-var form documented on the Ornith-1.0-35B-GGUF model card:

export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-35B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use Ornith to plan, run shell commands, and edit files in your workspace. The model's <think> reasoning drives its planning and its native tool-calling drives the file/shell actions.

Results

  • Speed: No community benchmark for Ornith 1.0 35B on the RTX 5090 exists yet — /check/ornith-1-0-35b/rtx-5090 has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is <think> content you discard.)
  • VRAM usage: The Q4_K_M weights are 21.2 GB on disk and must be held resident, leaving roughly ~10 GB of the RTX 5090's 32 GB for the KV cache and activations — enough for a real 64K context at fp16, which is why this card doesn't need the 16K cap a 24GB card does. Stepping up to Q5_K_M spends 24.73 GB on weights, trading some of that headroom for fidelity. Weight and per-quant file sizes are verified via the Ornith-1.0-35B-GGUF file tree.
  • Quality notes: DeepReinforce reports the 35B scoring SWE-bench Verified 75.6 and Terminal-Bench 2.1 (Terminus-2) 64.2 — state-of-the-art among similarly-sized open models — per the vendor benchmark table on the Ornith-1.0-35B-GGUF model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. Recommended sampling per the model card is temperature 0.6, top_p 0.95, top_k 20.

For the full benchmark data, see /check/ornith-1-0-35b/rtx-5090.

Troubleshooting

llama.cpp runs on the CPU / ignores the RTX 5090 (Blackwell not supported by old builds)

The RTX 5090 is Blackwell (GB202, compute capability 12.0 / sm_120), which is only supported by a recent CUDA toolkit (12.8+) and a recent llama.cpp build. An older prebuilt binary or an older CUDA version predates Blackwell — llama.cpp will then fall back to the CPU (or fail to allocate on the GPU) even with -ngl 99 set, and generation will be very slow. Fix: install CUDA 12.8 or newer, then build llama.cpp from source with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 (or grab a fresh prebuilt release that advertises Blackwell/sm_120 support). Confirm the GPU is actually in use by watching VRAM climb in nvidia-smi while the model loads. Note the integer GGUF quants mean you do not need Blackwell's FP4/FP8 tensor cores — a standard recent CUDA build is enough.

Out of memory at launch, or the KV cache won't fit

On 32GB the Q4_K_M weights (21.2 GB) leave ~10 GB for the KV cache, so a 64K context fits at fp16 — but a very large -c still OOMs. If you hit it, lower -c, add -ctk q8_0 -ctv q8_0 and -fa on to halve KV-cache memory, and close any other GPU app before launching. Watch nvidia-smi during a real agent task — a hard coding problem produces a long <think> block that grows the KV cache mid-generation, so size for the peak, not the idle load. Running Q5_K_M (24.73 GB) leaves less room, so keep its context more moderate; Q6_K (28.51 GB) fits but leaves very little KV room, and Q8_0/BF16 exceed 32 GB — if you need those, that requires a 48 GB+ card or Apple unified memory.

The model gets stuck in a recursive loop / repeats tokens

Multiple community users report on the GGUF model card discussions that llama.cpp runs can fall into repetitive/recursive generation once the conversation grows past roughly 22K tokens. Even though 32GB lets you set a large context window, this failure mode is about conversation length, not VRAM — so keep individual agent sessions focused and don't let a single conversation balloon far past ~22K tokens of history. These are community reports on the model card discussions, not a vendor-confirmed fix; if you hit it, keep sessions short.

The agent won't edit files / botches tool calls

Community users report (on the GGUF model card discussions) that with the default chat template the model sometimes fails to edit files or emit clean tool calls in agent clients. The community workaround shared in that thread is to pass a corrected Qwen-family chat template to llama-server via --chat-template-file <path-to-template.jinja>. Because Ornith is post-trained on a Gemma 4 + Qwen 3.5 lineage, its chat template is Qwen-shaped; if agentic editing misbehaves, try a corrected template as those users did. Treat this as a community-reported workaround, not an official fix.

common questions
How much VRAM does Ornith 1.0 35B need?

About 32 GB — the minimum this recipe targets.

Which GPUs is Ornith 1.0 35B tested on?

RTX 5090 (32 GB).

How hard is this setup?

Advanced — follow the steps above.