self-hosted/ai
§01·recipe · llm

Ornith 1.0 9B on RTX 3060 (12GB): A Local Agentic-Coding Model on a Budget Card via llama.cpp + OpenHands

llmintermediate12GB+ VRAMJul 3, 2026

This intermediate recipe sets up Ornith 1.0 9B on the RTX 3060, needing about 12 GB of VRAM.

models
tools
prerequisites
  • NVIDIA RTX 3060 (12GB VRAM, Ampere GA106, sm_86) or any 12GB+ consumer GPU
  • 16GB+ system RAM
  • ~7-10GB free disk for the chosen GGUF quant
  • A recent CUDA toolkit (12.x) if you compile llama.cpp yourself; or use a prebuilt CUDA release

What You'll Build

A fully local agentic-coding setup on a 12GB RTX 3060: Ornith 1.0 9B — DeepReinforce's open (MIT) ~9B dense coding model — served as an OpenAI-compatible endpoint by llama.cpp (or Ollama), and driven by a coding agent (OpenHands as the lead, Aider as a lighter alternative). Ornith produces <think> reasoning traces and native tool calls, so the agent can read your repo, run shell commands, and edit files. Unlike this catalogue's image and video recipes, there is no ComfyUI here: the runtime is a text-generation server plus a coding client. The RTX 3060 12GB is one of the most popular budget GPUs, which makes it an ideal entry point into local agentic coding.

Hardware data: RTX 3060 (12GB VRAM) · Ornith 1.0 9B, GGUF Q6_K (7.36GB) or Q8_0 (9.53GB) · See benchmark data

ℹ️ This is a coding LLM, not a chat generalist. Ornith 1.0 is a self-improving family of open-source models for agentic coding (per the Ornith-1.0-9B-GGUF model card). The 9B is the family's smallest member — the card calls it the most lightweight member of the Ornith family, designed for efficient single-GPU deployment. It is post-trained on top of Gemma 4 and Qwen 3.5, is MIT-licensed, and reports SWE-bench Verified 69.4 on the model card's own evaluation table.

Why the RTX 3060 12GB is a capable tier for the 9B. The 9B is a dense model — its whole footprint is the quant file you load, and even the largest non-BF16 quant (Q8_0, 9.53GB) fits inside 12GB. That leaves room for a genuinely useful context window on top of the weights, which is the story below. The RTX 3060 is the slower card of the 12GB tier — its Ampere GA106 has notably lower memory bandwidth than a newer Ada 4070, so effective throughput is lower — but it holds exactly the same quants with the same context math, and remains a very capable local coding rig. (The larger Ornith 1.0 35B is a different, 24GB+ model — do not confuse the two; on a 12GB card you want the 9B.)

Requirements

ComponentMinimumTested
GPU8GB VRAM (Q4_K_M)RTX 3060 (12GB, Ampere GA106, sm_86)
RAM16GB system RAM
Storage5.63GB (Q4_K_M) up to 9.53GB (Q8_0)~10GB for Q8_0 + headroom
Softwarellama.cpp (CUDA) or Ollama; OpenHands or Aider client

Model weights (GGUF, byte-verified from the Ornith-1.0-9B-GGUF repo tree):

QuantOn-disk sizeFit on RTX 3060 (12GB)
Q4_K_M5.63GBFits with large context room (also runs on 8GB cards)
Q5_K_M6.47GBFits comfortably
Q6_K7.36GBRecommended — near-lossless, leaves several GB for KV cache
Q8_09.53GBFits — highest-fidelity quant, less context headroom
BF1617.92GBDoes not fit 12GB — needs 24GB+

Licensing. Ornith 1.0 is MIT-licensed (per the model card's license: mit and its highlight line: MIT licensed, globally accessible, and free from regional limitations). You can use it commercially and privately without revenue caps.

Installation

You have two runtimes. Pick one. llama.cpp gives you the most control over context and KV-cache flags; Ollama is the fastest to stand up.

Option A — llama.cpp (recommended for full control)

1. Get llama.cpp with CUDA. Either download a prebuilt CUDA release, or build from source. llama.cpp publishes prebuilt binaries whose asset names follow llama-<version>-bin-<platform>-<backend>-<arch> — e.g. llama-b9859-bin-win-cuda-12.4-x64.zip for Windows CUDA, plus Ubuntu x64 CUDA packages, from the llama.cpp releases page.

To build from source instead, the official build guide gives:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 3060 is Ampere = compute capability 8.6 (sm_86)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j 8

Note. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024). You need the NVIDIA CUDA toolkit installed first. The RTX 3060 uses integer GGUF quants — there is no FP8 path here, so a standard CUDA build is all you need.

2. That's it for install — llama.cpp pulls the GGUF straight from Hugging Face at launch (next section). No separate download step.

Option B — Ollama

Install Ollama from ollama.com. Ollama is built on llama.cpp and can run any GGUF on the Hub directly — no Modelfile needed. No manual weight download; the run command below pulls the quant for you.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q6_K (case-insensitive) to select the quant — without a tag, llama.cpp defaults to Q4_K_M (llama-server docs):

# Q6_K quant, offload all layers to the 3060, 65k-token context
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q6_K \
    --port 8000 \
    -ngl 99 \
    -c 65536 \
    --jinja

The model card's own quickstart shows the same server invocation with the full context window:

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF --port 8000 -c 262144

-ngl 99 (alias --n-gpu-layers) offloads every layer to the GPU; -c (--ctx-size) sets the context length; --jinja uses the model's built-in chat template so the <think> reasoning and tool-call blocks parse correctly.

How much context actually fits in 12GB. Ornith 1.0 is a reasoning model — per the model card, by default the assistant turn opens with a <think>...</think> reasoning block before the final answer, and its context window is a large 262,144 tokens. Context is stored in the KV cache, which grows with the number of tokens and lives in VRAM on top of the weights. With Q6_K (7.36GB) loaded, several GB of the 12GB remain for the KV cache — plenty for the tens-of-thousands of tokens a typical coding session uses, but not the full 262,144-token window at once. To push context further, quantize the KV cache: add -fa on (Flash Attention, required for quantized cache) and -ctk q8_0 -ctv q8_0 (--cache-type-k / --cache-type-v), which roughly halves KV-cache VRAM versus the default f16 with minimal quality impact (llama-server docs):

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 131072 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

With Ollama

Pull and chat with the same GGUF straight from Hugging Face (the card documents this exact form). Append a :quant tag to choose the quant; the default is Q4_K_M (HF × Ollama docs):

ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q6_K

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Ornith is optimized for terminal-based coding agents (verbatim from the model card), which directs you to point any OpenAI-compatible coding CLI at your Ornith endpoint by setting its base URL and API key.

OpenHands (lead choice). OpenHands is the harness DeepReinforce used to measure the 9B's headline SWE-bench Verified 69.4 (the model card footnote reads: SWE-Bench Verified, Pro and Multilingual: using OpenHands harness), so it is the officially-exercised agentic path for this model. Point it at your local server exactly as the card's OpenHands example does:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-9B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"

openhands

Aider (lighter alternative). For a simpler terminal pair-programmer against the same endpoint, Aider connects to any OpenAI-compatible API:

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY
aider --model openai/deepreinforce-ai/Ornith-1.0-9B

Recommended sampling for Ornith is temperature=0.6, top_p=0.95, top_k=20 (per the model card's quickstart note).

Results

  • VRAM usage: The dense 9B loads entirely as its GGUF file — Q6_K is 7.36GB and Q8_0 is 9.53GB on disk (byte-verified from the GGUF repo tree), both fitting inside the RTX 3060's 12GB with room for the KV cache. Q4_K_M (5.63GB) leaves the most context headroom and also runs on 8GB cards.
  • Model capability: On its own evaluation table the 9B reports SWE-bench Verified 69.4, Terminal-Bench 2.1 (Terminus-2) 43.1, and NL2Repo 27.2 (model card) — state-of-the-art among open models of comparable size, per the card. These are the model's own coding-benchmark scores, not hardware throughput on this GPU.
  • Speed: No local throughput benchmark exists for Ornith 1.0 9B on the RTX 3060 yet — this is a brand-new model and /check/ornith-1-0-9b/rtx-3060 has no benchmark rows. Note that the RTX 3060's memory bandwidth is lower than newer 12GB cards like the 4070, so expect somewhat lower tokens/sec at the same quant — but we do not quote a tok/s figure rather than invent one; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/ornith-1-0-9b/rtx-3060.

Troubleshooting

The reply is full of raw <think> / <tool_call> tags

Ornith is a reasoning model with native tool-calling: the assistant turn opens with a <think> … </think> block and emits <tool_call> blocks. If your client shows these as raw text, make sure the server applies the model's chat template — pass --jinja to llama-server, or use Ollama (which reads the GGUF's built-in tokenizer.chat_template). A correctly-templated server surfaces the reasoning as a separate reasoning_content field and the tool calls as OpenAI-style tool_calls, per the model card's quickstart note.

Out of memory at a long context

The weights fit 12GB easily; the KV cache is what grows. If you OOM when raising -c, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running). Dropping from Q8_0 to Q6_K also frees ~2GB for context.

torch / CUDA not needed — this is llama.cpp

Serving Ornith via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang/Transformers paths on the card, which target 80GB datacenter GPUs. On the RTX 3060 the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed the CUDA-enabled llama.cpp build (Option A) rather than a CPU-only binary.

Model or GPU 404 on /check

Ornith 1.0 9B is a new addition; if the /check link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Ornith 1.0 9B need?

About 12 GB — the minimum this recipe targets.

Which GPUs is Ornith 1.0 9B tested on?

RTX 3060 (12 GB).

How hard is this setup?

Intermediate — follow the steps above.