self-hosted/ai
§01·recipe · llm

Ornith 1.0 9B on RTX 5060 (8GB): Local Agentic Coding at the Fit Boundary via llama.cpp + OpenHands

llmintermediate8GB+ VRAMJul 3, 2026

This intermediate recipe sets up Ornith 1.0 9B on the RTX 5060, needing about 8 GB of VRAM.

models
tools
prerequisites
  • NVIDIA RTX 5060 (8GB VRAM, Blackwell GB206, compute capability 12.0 / sm_120) or any 8GB+ consumer GPU
  • 16GB+ system RAM
  • ~6GB free disk for the recommended Q4_K_M GGUF
  • A recent CUDA toolkit (12.8+, for Blackwell sm_120 support) and a recent llama.cpp build if you compile yourself; or a current prebuilt CUDA release

What You'll Build

A fully local agentic-coding setup on an 8GB RTX 5060: Ornith 1.0 9B — DeepReinforce's open (MIT) ~9B dense coding model — served as an OpenAI-compatible endpoint by llama.cpp (or Ollama), and driven by a coding agent (OpenHands as the lead, Aider as a lighter alternative). Ornith produces <think> reasoning traces and native tool calls, so the agent can read your repo, run shell commands, and edit files. Unlike this catalogue's image and video recipes, there is no ComfyUI here: the runtime is a text-generation server plus a coding client.

Hardware data: RTX 5060 (8GB VRAM, Blackwell GB206, sm_120, ~448 GB/s GDDR7) · Ornith 1.0 9B, GGUF Q4_K_M (5.63GB) — the 8GB-friendly quant · See benchmark data

ℹ️ This is a coding LLM, not a chat generalist. Ornith 1.0 is a self-improving family of open-source models for agentic coding (per the Ornith-1.0-9B-GGUF model card). The 9B is the family's smallest member — the card calls it the most lightweight member of the Ornith family, designed for efficient single-GPU deployment. It is post-trained on top of Gemma 4 and Qwen 3.5, is MIT-licensed, and reports SWE-bench Verified 69.4 on the model card's own evaluation table.

⚠️ 8GB is the fit boundary — plan the recipe around Q4_K_M and a capped context. The 9B is a dense model, so its whole footprint is the quant file you load. Only Q4_K_M (5.63GB) leaves meaningful room on 8GB; Q5_K_M (6.47GB) fits but tightly, and Q6_K (7.36GB) / Q8_0 (9.53GB) do not leave usable context room on 8GB. With ~5.6GB of weights resident, only ~2GB of the card remains for the KV cache, so the context window must be capped and KV-quantized (see Running). The RTX 5060's newer Blackwell architecture and ~448 GB/s of GDDR7 bandwidth do not buy back VRAM — the 8GB capacity is what binds, so the fit story is the constrained one: you get the model, but not the full 262K window. If you want higher quant or a large context, step up to a 12GB card (RTX 3060 / 4070 / 5070). (The larger Ornith 1.0 35B is a different, 24GB+ model — do not confuse the two; on any 8GB card you want the 9B, and specifically its Q4_K_M quant.)

Requirements

ComponentMinimumTested
GPU8GB VRAM (Q4_K_M)RTX 5060 (8GB, Blackwell GB206, sm_120)
RAM16GB system RAM
Storage5.63GB (Q4_K_M)~6GB for Q4_K_M + headroom
Softwarellama.cpp (recent CUDA build) or Ollama; OpenHands or Aider client

Model weights (GGUF, byte-verified from the Ornith-1.0-9B-GGUF repo tree):

QuantOn-disk sizeFit on RTX 5060 (8GB)
Q4_K_M5.63GBRecommended — the only quant that leaves usable context room on 8GB (cap the context, see Running)
Q5_K_M6.47GBFits but tight — ~1.5GB left for KV cache; small context only
Q6_K7.36GBDoes not leave usable context room on 8GB — use a 12GB card
Q8_09.53GBDoes not fit 8GB — needs 12GB+
BF1617.92GBDoes not fit 8GB — needs 24GB+

Licensing. Ornith 1.0 is MIT-licensed (per the model card's license: mit and its highlight line: MIT licensed, globally accessible, and free from regional limitations). You can use it commercially and privately without revenue caps.

Installation

You have two runtimes. Pick one. llama.cpp gives you the most control over context and KV-cache flags — which matters on 8GB; Ollama is the fastest to stand up.

Option A — llama.cpp (recommended for full control)

1. Get llama.cpp with CUDA. Either download a prebuilt CUDA release, or build from source. llama.cpp publishes prebuilt binaries whose asset names follow llama-<version>-bin-<platform>-<backend>-<arch> — e.g. llama-b9859-bin-win-cuda-12.4-x64.zip for Windows CUDA, plus Ubuntu x64 CUDA packages, from the llama.cpp releases page.

⚠️ Blackwell (RTX 5060) needs a recent CUDA toolkit and a recent llama.cpp build. The 5060's Blackwell GB206 die is compute capability 12.0 (sm_120) — support for it landed only in CUDA 12.8+ and correspondingly recent llama.cpp releases. Older prebuilt CUDA binaries (and older toolkits) predate Blackwell: they will either fail to run or silently fall back to CPU, ignoring the GPU. Pick a current release/toolkit; don't reuse a stale binary from an earlier card.

To build from source instead, the official build guide gives:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 5060 is Blackwell = compute capability 12.0 (sm_120); needs CUDA 12.8+
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j 8

Note. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024). You need the NVIDIA CUDA toolkit (12.8+ for Blackwell) installed first. These are integer GGUF quants, so Blackwell's new FP4/FP8 tensor cores are irrelevant here — nothing special is needed beyond a standard recent CUDA build.

2. That's it for install — llama.cpp pulls the GGUF straight from Hugging Face at launch (next section). No separate download step.

Option B — Ollama

Install Ollama from ollama.com. Ollama is built on llama.cpp and can run any GGUF on the Hub directly — no Modelfile needed. No manual weight download; the run command below pulls the quant for you. (Use a current Ollama build so its bundled llama.cpp includes Blackwell sm_120 support.)

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q4_K_M (case-insensitive) to select the quant — without a tag, llama.cpp defaults to Q4_K_M anyway (llama-server docs):

# Q4_K_M quant (the 8GB pick), offload all layers to the 5060, capped 16k context
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M \
    --port 8000 \
    -ngl 99 \
    -c 16384 \
    --jinja

The model card's own quickstart shows the same server invocation with the full context window — but that full window is not attainable on 8GB, so treat it as reference only:

# Reference only — the full 262K window does NOT fit an 8GB card.
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF --port 8000 -c 262144

-ngl 99 (alias --n-gpu-layers) offloads every layer to the GPU; -c (--ctx-size) sets the context length; --jinja uses the model's built-in chat template so the <think> reasoning and tool-call blocks parse correctly.

How much context actually fits in 8GB — cap it and quantize the KV cache. Ornith 1.0 is a reasoning model — per the model card, by default the assistant turn opens with a <think>...</think> reasoning block before the final answer, and its context window is a large 262,144 tokens. Context is stored in the KV cache, which grows with the number of tokens and lives in VRAM on top of the weights. With Q4_K_M (5.63GB) loaded, only about 2GB of the 8GB remains for the KV cache — enough for a capped coding session, but nowhere near the full 262,144-token window, and the <think> reasoning inflates KV pressure further because every reasoning token also lands in the cache. So: cap the context (start at -c 16384, push toward -c 32768 only with the KV cache quantized) and add -fa on (Flash Attention, required for quantized cache) plus -ctk q8_0 -ctv q8_0 (--cache-type-k / --cache-type-v), which roughly halves KV-cache VRAM versus the default f16 with minimal quality impact (llama-server docs):

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M \
    --port 8000 -ngl 99 -c 32768 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

If you need a bigger context window or a higher-fidelity quant than Q4_K_M, an 8GB card is the wrong tier — step up to a 12GB card (RTX 3060 / 4070 / 5070), where Q6_K/Q8_0 fit with real context headroom.

With Ollama

Pull and chat with the same GGUF straight from Hugging Face (the card documents this exact form). Append a :quant tag to choose the quant; the default is Q4_K_M — which is exactly what you want on 8GB (HF × Ollama docs):

ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Ornith is optimized for terminal-based coding agents (verbatim from the model card), which directs you to point any OpenAI-compatible coding CLI at your Ornith endpoint by setting its base URL and API key.

OpenHands (lead choice). OpenHands is the harness DeepReinforce used to measure the 9B's headline SWE-bench Verified 69.4 (the model card footnote reads: SWE-Bench Verified, Pro and Multilingual: using OpenHands harness), so it is the officially-exercised agentic path for this model. Point it at your local server exactly as the card's OpenHands example does:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-9B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"

openhands

Aider (lighter alternative). For a simpler terminal pair-programmer against the same endpoint, Aider connects to any OpenAI-compatible API:

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY
aider --model openai/deepreinforce-ai/Ornith-1.0-9B

Recommended sampling for Ornith is temperature=0.6, top_p=0.95, top_k=20 (per the model card's quickstart note).

Results

  • VRAM usage: The dense 9B loads entirely as its GGUF file — the recommended Q4_K_M is 5.63GB on disk (byte-verified from the GGUF repo tree), fitting inside the RTX 5060's 8GB and leaving ~2GB for a capped KV cache. Q5_K_M (6.47GB) fits but tight; Q6_K (7.36GB) and Q8_0 (9.53GB) do not leave usable context room on 8GB — those quants belong on a 12GB+ card.
  • Model capability: On its own evaluation table the 9B reports SWE-bench Verified 69.4, Terminal-Bench 2.1 (Terminus-2) 43.1, and NL2Repo 27.2 (model card) — state-of-the-art among open models of comparable size, per the card. These are the model's own coding-benchmark scores, not hardware throughput on this GPU.
  • Speed: No local throughput benchmark exists for Ornith 1.0 9B on the RTX 5060 yet — this is a brand-new model and /check/ornith-1-0-9b/rtx-5060 has no benchmark rows. We omit a tok/s figure rather than invent one; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/ornith-1-0-9b/rtx-5060.

Troubleshooting

The reply is full of raw <think> / <tool_call> tags

Ornith is a reasoning model with native tool-calling: the assistant turn opens with a <think> … </think> block and emits <tool_call> blocks. If your client shows these as raw text, make sure the server applies the model's chat template — pass --jinja to llama-server, or use Ollama (which reads the GGUF's built-in tokenizer.chat_template). A correctly-templated server surfaces the reasoning as a separate reasoning_content field and the tool calls as OpenAI-style tool_calls, per the model card's quickstart note.

The GPU is ignored / falls back to CPU on the RTX 5060

Blackwell (RTX 5060, GB206, sm_120) is only recognised by CUDA 12.8+ and recent llama.cpp/Ollama builds. If llama-server runs but is slow and nvidia-smi shows near-zero GPU utilisation, you are almost certainly on a pre-Blackwell binary or toolkit: the layers silently ran on CPU. Upgrade to a current prebuilt CUDA release (or rebuild with -DCMAKE_CUDA_ARCHITECTURES=120 against CUDA 12.8+), and confirm the startup log lists the RTX 5060 as an offload target. Because these are integer GGUF quants, you do not need to configure Blackwell's FP4/FP8 tensor cores — a standard recent CUDA build uses the GPU fully.

Out of memory at a long context

On 8GB the weights leave only ~2GB for the KV cache, so context is the first thing to blow up — especially because the <think> reasoning fills the cache with reasoning tokens too. If you OOM, lower -c (drop back toward -c 16384) and quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running). If you still can't get the context you need, the 8GB tier is the constraint — move to a 12GB card (RTX 3060 / 4070 / 5070). Do not try to reach for Q6_K/Q8_0 on 8GB to "fix" this; the higher quant leaves even less room for context. Blackwell's newer compute does not change this — the ceiling is the 8GB capacity, not throughput.

torch / CUDA not needed — this is llama.cpp

Serving Ornith via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang/Transformers paths on the card, which target 80GB datacenter GPUs. On the RTX 5060 the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed a recent CUDA-enabled llama.cpp build (Option A) with Blackwell support rather than a CPU-only or pre-Blackwell binary.

Model or GPU 404 on /check

Ornith 1.0 9B is a new addition; if the /check link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Ornith 1.0 9B need?

About 8 GB — the minimum this recipe targets.

Which GPUs is Ornith 1.0 9B tested on?

RTX 5060 (8 GB).

How hard is this setup?

Intermediate — follow the steps above.