How much VRAM does Ornith 1.0 35B need?

About 24 GB — the minimum this recipe targets.

How hard is this setup?

Advanced — follow the steps above.

Ornith 1.0 35B on RTX 3090 Ti: Local Agentic Coding via llama.cpp + OpenHands (24GB Entry Tier)

What You'll Build

A local, private agentic-coding setup: Ornith 1.0 35B — DeepReinforce's open (MIT) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp on a single 24GB RTX 3090 Ti, driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. Ornith is a reasoning model: each turn opens with a <think> chain-of-thought before the answer, and it emits native <tool_call> blocks for agentic workflows. This recipe uses the Q4_K_M quant (21.2 GB on disk) — the smallest official build, and the reason the RTX 3090 Ti is the entry card for the 35B.

Hardware data: RTX 3090 Ti (24GB VRAM) · Ornith 1.0 35B Q4_K_M GGUF (21.2 GB weights) · short-context, tight fit · See benchmark data

⚠️ This is a tight fit, and context is the binding constraint. The Q4_K_M weights alone occupy 21.2 GB of the RTX 3090 Ti's 24 GB, leaving only ~2–3 GB for the KV cache. The 3090 Ti is a faster card than the 3090 — a fully-enabled GA102 with more SMs, higher clocks, and higher memory bandwidth — but it still has exactly 24 GB, so the memory ceiling is unchanged. Ornith's context window is 262,144 tokens, but you cannot run the full 256K context on a 24GB card — the KV cache for that would need tens of GB. This recipe caps context and quantizes the KV cache so a useful working window fits. Read the Running section before you launch.

ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. Ornith 1.0 35B is a Mixture-of-Experts. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in VRAM, so the memory footprint is the full quant file — 21.2 GB at Q4_K_M, not some smaller "active" fraction. Do not expect a low active-parameter count to shrink the VRAM requirement.

ℹ️ Card too small? Use the 9B. There is no official Q3/Q2 build of the 35B — Q4_K_M (21.2 GB) is the floor, so cards below 24GB cannot run this model from the official GGUF. On an 8–16 GB card, use Ornith 1.0 9B instead (Q4_K_M is ~5.6 GB); it is the same agentic-coding family sized for smaller rigs.

Requirements

Component	Minimum	Tested
GPU	24GB VRAM (this is the 35B's floor)	RTX 3090 Ti (24GB, Ampere GA102, sm_86)
RAM	16GB system RAM (32GB comfortable for the agent + repo)	—
Storage	~22GB (the Q4_K_M GGUF is 21.2 GB)	21.17 GB (`ornith-1.0-35b-Q4_K_M.gguf`)
Software	llama.cpp (CUDA build) or Ollama; Python 3.10+ for OpenHands	llama.cpp `llama-server`, OpenHands

The Q4_K_M GGUF file is 21.17 GB (21,166,757,760 bytes) per the Ornith-1.0-35B-GGUF file tree; the other published quants are Q5_K_M 24.73 GB, Q6_K 28.51 GB, Q8_0 36.90 GB, and BF16 69.38 GB — all of which exceed the RTX 3090 Ti's 24 GB, so Q4_K_M is the only official quant that fits this card. The model is MIT-licensed, per the Ornith-1.0-35B-GGUF model card.

Installation

1. Install llama.cpp with CUDA support

The RTX 3090 Ti is Ampere (GA102, sm_86) and is fully supported by a standard CUDA build of llama.cpp — no special wheel or FP8 path is needed (the GGUF quants are integer formats, so the absence of FP8 tensor cores on Ampere is irrelevant here). Install a prebuilt binary via Homebrew (macOS/Linux) or build from source per the llama.cpp README:

# Homebrew (Linux/macOS) — ships a CUDA-enabled build where available
brew install llama.cpp

# …or build from source with CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j

2. Download the Q4_K_M GGUF

llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>[:quant], and per the canonical llama.cpp flag definitions the quant is optional and defaults to Q4_K_M — which is exactly the quant this recipe wants:

# Downloads ornith-1.0-35b-Q4_K_M.gguf (21.2 GB) into the llama.cpp cache on first use
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M --port 8000

The first launch downloads ~21 GB; subsequent launches reuse the cached file. If you prefer to download the file explicitly first, grab ornith-1.0-35b-Q4_K_M.gguf from the GGUF repo files tab and pass it with -m <path> instead.

3. Install the OpenHands coding agent

OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint. The Ornith-1.0-35B-GGUF model card documents the CLI install:

pip install openhands-ai

Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host.

Running

1. Serve Ornith with a capped, KV-quantized context

The model card's llama.cpp example, per the Ornith-1.0-35B-GGUF model card, is llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF --port 8000 -c 262144 — but -c 262144 is the full 256K context and will not fit on a 24GB card once the 21.2 GB of weights are resident. On the RTX 3090 Ti, cap the context and quantize the KV cache so it fits in the ~2–3 GB that remains. All flags below are from the canonical llama.cpp flag definitions (-c/--ctx-size, -ctk/--cache-type-k, -ctv/--cache-type-v, -fa/--flash-attn, -ngl/--gpu-layers):

llama-server \
    -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    -fa on \
    -ctk q8_0 -ctv q8_0

-ngl 99 offloads all layers to the GPU (the 21.2 GB Q4 file must sit in VRAM — see the MoE note above).
-c 16384 caps context at 16K. This is a deliberate, tight starting point; raise or lower it while watching nvidia-smi (see Troubleshooting). Reasoning models spend thousands of tokens per turn inside <think> blocks, so KV pressure is higher than a plain chat model at the same context setting.
-fa on enables Flash Attention, and -ctk q8_0 -ctv q8_0 quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which is what buys back usable context on a card this tight.

This exposes an OpenAI-compatible API at http://localhost:8000/v1. (Ollama is a valid alternative — ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF, per the same model card — but Ollama's own context default is small and must likewise be raised deliberately rather than set to the model's full 256K.)

2. Point OpenHands at the local server

OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs. Using the CLI env-var form documented on the Ornith-1.0-35B-GGUF model card:

export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-35B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"   # any non-empty string; local servers don't check it

openhands

OpenHands will now use Ornith to plan, run shell commands, and edit files in your workspace. The model's <think> reasoning drives its planning and its native tool-calling drives the file/shell actions.

Results

Speed: No community benchmark for Ornith 1.0 35B on the RTX 3090 Ti exists yet — /check/ornith-1-0-35b/rtx-3090-ti has no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is <think> content you discard.)
VRAM usage: The Q4_K_M weights are 21.2 GB on disk and must be held resident, leaving ~2–3 GB of the RTX 3090 Ti's 24 GB for the KV cache and activations — which is why context is capped and the KV cache is 8-bit-quantized above. Weight and per-quant file sizes are verified via the Ornith-1.0-35B-GGUF file tree.
Quality notes: DeepReinforce reports the 35B scoring SWE-bench Verified 75.6 and Terminal-Bench 2.1 (Terminus-2) 64.2 — state-of-the-art among similarly-sized open models — per the vendor benchmark table on the Ornith-1.0-35B-GGUF model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. Recommended sampling per the model card is temperature 0.6, top_p 0.95, top_k 20.

For the full benchmark data, see /check/ornith-1-0-35b/rtx-3090-ti.

Troubleshooting

Out of memory at launch, or the KV cache won't fit

The 21.2 GB of weights leave almost no room on a 24GB card, so OOM at startup usually means the context is set too high. The RTX 3090 Ti's higher clocks and bandwidth do not change this — it still has 24 GB. Lower -c (try -c 8192), keep -ctk q8_0 -ctv q8_0 and -fa on enabled, and close any other GPU app before launching. Watch nvidia-smi during a real agent task — a hard coding problem produces a long <think> block that grows the KV cache mid-generation, so size for the peak, not the idle load. If you need a larger working context than the 3090 Ti can give at Q4, that requires a 32GB (e.g. RTX 5090) or larger card.

The model gets stuck in a recursive loop / repeats tokens

Multiple community users report on the GGUF model card discussions that llama.cpp runs can fall into repetitive/recursive generation once the conversation grows past roughly 22K tokens. This is another reason the recipe caps context (-c 16384) — staying well under that threshold sidesteps the reported failure mode. These are community reports on the model card discussions, not a vendor-confirmed fix; if you hit it, keep sessions short and the context capped.

The agent won't edit files / botches tool calls

Community users report (on the GGUF model card discussions) that with the default chat template the model sometimes fails to edit files or emit clean tool calls in agent clients. The community workaround shared in that thread is to pass a corrected Qwen-family chat template to llama-server via --chat-template-file <path-to-template.jinja>. Because Ornith is post-trained on a Gemma 4 + Qwen 3.5 lineage, its chat template is Qwen-shaped; if agentic editing misbehaves, try a corrected template as those users did. Treat this as a community-reported workaround, not an official fix.

`torch`/CUDA or llama.cpp reports no GPU

Confirm your llama.cpp build has CUDA enabled (GGML_CUDA=ON when building from source) and that -ngl 99 is offloading layers. The RTX 3090 Ti (Ampere sm_86) needs no special flags beyond a standard CUDA build; you do not need FP8 support (the GGUF quants are integer) and you do not need to install flash-attn separately — llama.cpp's -fa on uses its own built-in attention kernels.