self-hosted/ai
§01·recipe · llm

Ornith 1.0 9B on RTX 4070 Ti SUPER: Max-Fidelity Local Agentic Coding in 16GB via llama.cpp + OpenHands

llmintermediate12GB+ VRAMJul 3, 2026

This intermediate recipe sets up Ornith 1.0 9B on the RTX 4070 Ti Super, needing about 12 GB of VRAM.

models
tools
prerequisites
  • NVIDIA RTX 4070 Ti SUPER (16GB VRAM, Ada AD103, sm_89) or any 16GB+ consumer GPU
  • 16GB+ system RAM
  • ~10-12GB free disk for the chosen GGUF quant
  • A recent CUDA toolkit (12.x) if you compile llama.cpp yourself; or use a prebuilt CUDA release

What You'll Build

A fully local agentic-coding setup on a 16GB RTX 4070 Ti SUPER: Ornith 1.0 9B — DeepReinforce's open (MIT) ~9B dense coding model — served as an OpenAI-compatible endpoint by llama.cpp (or Ollama), and driven by a coding agent (OpenHands as the lead, Aider as a lighter alternative). Ornith produces <think> reasoning traces and native tool calls, so the agent can read your repo, run shell commands, and edit files. Unlike this catalogue's image and video recipes, there is no ComfyUI here: the runtime is a text-generation server plus a coding client.

Hardware data: RTX 4070 Ti SUPER (16GB VRAM, Ada AD103, sm_89, ~672 GB/s) · Ornith 1.0 9B, GGUF Q8_0 (9.53GB) or Q6_K (7.36GB) · See benchmark data

ℹ️ This is a coding LLM, not a chat generalist. Ornith 1.0 is a self-improving family of open-source models for agentic coding (per the Ornith-1.0-9B-GGUF model card). The 9B is the family's smallest member — the card calls it the most lightweight member of the Ornith family, designed for efficient single-GPU deployment. It is post-trained on top of Gemma 4 and Qwen 3.5, is MIT-licensed, and reports SWE-bench Verified 69.4 on the model card's own evaluation table.

Why the RTX 4070 Ti SUPER is a roomy tier for the 9B. The 9B is a dense model — its whole footprint is the quant file you load. On 16GB you can run the highest-fidelity non-BF16 quant, Q8_0 (9.53GB), and keep a large context window resident at the same time — no KV-cache quantization needed for typical coding sessions. That extra ~4GB over a 12GB card is pure context headroom: where a 12GB card running Q8_0 leaves less room for the KV cache, the 4070 Ti SUPER lets you serve max-fidelity weights and a generous window together, which is the story below. (The larger Ornith 1.0 35B is a different, 24GB+ model — do not confuse the two; on a 16GB card you want the 9B, since the 35B's Q4_K_M floor is 21.2GB and does not fit 16GB.)

Requirements

ComponentMinimumTested
GPU8GB VRAM (Q4_K_M)RTX 4070 Ti SUPER (16GB, Ada AD103, sm_89)
RAM16GB system RAM
Storage5.63GB (Q4_K_M) up to 9.53GB (Q8_0)~12GB for Q8_0 + headroom
Softwarellama.cpp (CUDA) or Ollama; OpenHands or Aider client

Model weights (GGUF, byte-verified from the Ornith-1.0-9B-GGUF repo tree):

QuantOn-disk sizeFit on RTX 4070 Ti SUPER (16GB)
Q4_K_M5.63GBFits with huge context room (also runs on 8GB cards)
Q5_K_M6.47GBFits with very large context room
Q6_K7.36GBFits — near-lossless, leaves ~8GB for KV cache and long context
Q8_09.53GBRecommended — highest-fidelity quant; on 16GB it still leaves ~6GB for a large context
BF1617.92GBDoes not fit 16GB — needs 24GB+

Licensing. Ornith 1.0 is MIT-licensed (per the model card's license: mit and its highlight line: MIT licensed, globally accessible, and free from regional limitations). You can use it commercially and privately without revenue caps.

Installation

You have two runtimes. Pick one. llama.cpp gives you the most control over context and KV-cache flags; Ollama is the fastest to stand up.

Option A — llama.cpp (recommended for full control)

1. Get llama.cpp with CUDA. Either download a prebuilt CUDA release, or build from source. llama.cpp publishes prebuilt binaries whose asset names follow llama-<version>-bin-<platform>-<backend>-<arch> — e.g. llama-b9859-bin-win-cuda-12.4-x64.zip for Windows CUDA, plus Ubuntu x64 CUDA packages, from the llama.cpp releases page.

To build from source instead, the official build guide gives:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# RTX 4070 Ti SUPER is Ada Lovelace = compute capability 8.9 (sm_89)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j 8

Note. The CUDA backend flag is -DGGML_CUDA=ON on current llama.cpp (the old LLAMA_CUDA name was retired in late 2024). You need the NVIDIA CUDA toolkit installed first. The RTX 4070 Ti SUPER uses the AD103 die and the Ada sm_89 compute capability shared across the whole Ada consumer line, so -DCMAKE_CUDA_ARCHITECTURES=89 is identical whether you build for a 4070, a 4080, or this card.

2. That's it for install — llama.cpp pulls the GGUF straight from Hugging Face at launch (next section). No separate download step.

Option B — Ollama

Install Ollama from ollama.com. Ollama is built on llama.cpp and can run any GGUF on the Hub directly — no Modelfile needed. No manual weight download; the run command below pulls the quant for you.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to select the quant — without a tag, llama.cpp defaults to Q4_K_M (llama-server docs):

# Q8_0 quant (max fidelity), offload all layers to the 4070 Ti SUPER, 131k-token context
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 131072 \
    --jinja

The model card's own quickstart shows the same server invocation with the full context window:

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF --port 8000 -c 262144

-ngl 99 (alias --n-gpu-layers) offloads every layer to the GPU; -c (--ctx-size) sets the context length; --jinja uses the model's built-in chat template so the <think> reasoning and tool-call blocks parse correctly.

How much context actually fits in 16GB. Ornith 1.0 is a reasoning model — per the model card, by default the assistant turn opens with a <think>...</think> reasoning block before the final answer, and its context window is a large 262,144 tokens. Context is stored in the KV cache, which grows with the number of tokens and lives in VRAM on top of the weights. With Q8_0 (9.53GB) loaded, roughly ~6GB of the 16GB remain for the KV cache — enough to serve a 131,072-token window at default f16 cache without any quantization, which is why the command above sets -c 131072 and needs no KV-quant flags. To push toward the model's full 262,144-token window, quantize the KV cache: add -fa on (Flash Attention, required for quantized cache) and -ctk q8_0 -ctv q8_0 (--cache-type-k / --cache-type-v), which roughly halves KV-cache VRAM versus the default f16 with minimal quality impact (llama-server docs):

# Push toward the full 262k context on Q8_0 by quantizing the KV cache
llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 262144 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

If you want even more context headroom without touching the KV cache, drop to Q6_K (7.36GB) — that frees ~2GB more of the 16GB for the f16 KV cache while staying near-lossless:

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 131072 --jinja

With Ollama

Pull and chat with the same GGUF straight from Hugging Face (the card documents this exact form). Append a :quant tag to choose the quant; the default is Q4_K_M (HF × Ollama docs):

ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients.

Connect a coding agent

Ornith is optimized for terminal-based coding agents (verbatim from the model card), which directs you to point any OpenAI-compatible coding CLI at your Ornith endpoint by setting its base URL and API key.

OpenHands (lead choice). OpenHands is the harness DeepReinforce used to measure the 9B's headline SWE-bench Verified 69.4 (the model card footnote reads: SWE-Bench Verified, Pro and Multilingual: using OpenHands harness), so it is the officially-exercised agentic path for this model. Point it at your local server exactly as the card's OpenHands example does:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-9B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"

openhands

Aider (lighter alternative). For a simpler terminal pair-programmer against the same endpoint, Aider connects to any OpenAI-compatible API:

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY
aider --model openai/deepreinforce-ai/Ornith-1.0-9B

Recommended sampling for Ornith is temperature=0.6, top_p=0.95, top_k=20 (per the model card's quickstart note).

Results

  • VRAM usage: The dense 9B loads entirely as its GGUF file — Q8_0 is 9.53GB and Q6_K is 7.36GB on disk (byte-verified from the GGUF repo tree). On the RTX 4070 Ti SUPER's 16GB, the max-fidelity Q8_0 fits with roughly ~6GB left over for the KV cache — enough for a large context window at the default f16 cache, no KV quantization required. Q4_K_M (5.63GB) leaves the most context headroom and also runs on 8GB cards.
  • Model capability: On its own evaluation table the 9B reports SWE-bench Verified 69.4, Terminal-Bench 2.1 (Terminus-2) 43.1, and NL2Repo 27.2 (model card) — state-of-the-art among open models of comparable size, per the card. These are the model's own coding-benchmark scores, not hardware throughput on this GPU.
  • Speed: No local throughput benchmark exists for Ornith 1.0 9B on the RTX 4070 Ti SUPER yet — this is a brand-new model and /check/ornith-1-0-9b/rtx-4070-ti-super has no benchmark rows. We would rather omit tok/s than invent it; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/ornith-1-0-9b/rtx-4070-ti-super.

Troubleshooting

The reply is full of raw <think> / <tool_call> tags

Ornith is a reasoning model with native tool-calling: the assistant turn opens with a <think> … </think> block and emits <tool_call> blocks. If your client shows these as raw text, make sure the server applies the model's chat template — pass --jinja to llama-server, or use Ollama (which reads the GGUF's built-in tokenizer.chat_template). A correctly-templated server surfaces the reasoning as a separate reasoning_content field and the tool calls as OpenAI-style tool_calls, per the model card's quickstart note.

Out of memory at a long context

Even at Q8_0 the weights fit 16GB easily; the KV cache is what grows. If you OOM when raising -c beyond ~131k, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to reach toward the full 262k window. Dropping from Q8_0 to Q6_K also frees ~2GB for context.

torch / CUDA not needed — this is llama.cpp

Serving Ornith via llama.cpp or Ollama does not require PyTorch, flash-attn wheels, or a Python ML stack — those belong to the vLLM/SGLang/Transformers paths on the card, which target 80GB datacenter GPUs. On the RTX 4070 Ti SUPER the GGUF + llama.cpp path is the right one; if you hit a CUDA error, confirm you installed the CUDA-enabled llama.cpp build (Option A) rather than a CPU-only binary.

Model or GPU 404 on /check

Ornith 1.0 9B is a new addition; if the /check link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.

common questions
How much VRAM does Ornith 1.0 9B need?

About 12 GB — the minimum this recipe targets.

Which GPUs is Ornith 1.0 9B tested on?

RTX 4070 Ti Super (16 GB).

How hard is this setup?

Intermediate — follow the steps above.