How much VRAM does Ornith 1.0 9B need?

About 12 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate — follow the steps above.

Ornith 1.0 9B on RX 7800 XT: Max-Fidelity Local Agentic Coding in 16GB via llama.cpp-HIP + OpenHands (ROCm)

What You'll Build

A fully local agentic-coding setup on a 16GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101): Ornith 1.0 9B — DeepReinforce's open (MIT) ~9B dense coding model — served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) or Ollama, and driven by a coding agent (OpenHands as the lead, Aider as a lighter alternative). Ornith produces <think> reasoning traces and native tool calls, so the agent can read your repo, run shell commands, and edit files. Unlike this catalogue's image and video recipes, there is no ComfyUI here: the runtime is a text-generation server plus a coding client.

Hardware data: RX 7800 XT (16GB VRAM) · Ornith 1.0 9B, GGUF Q8_0 (9.53GB) or Q6_K (7.36GB) · ROCm 7 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel and no FlashAttention prebuilt-wheel step here. For this coding LLM the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you to pip install flash-attn, pick a cu12x wheel, or use ExLlamaV2/Marlin for this card — those are NVIDIA-only.

ℹ️ This is a coding LLM, not a chat generalist. Ornith 1.0 is a self-improving family of open-source models for agentic coding (per the Ornith-1.0-9B-GGUF model card). The 9B is the family's smallest member — the card calls it the most lightweight member of the Ornith family, designed for efficient single-GPU deployment. It is post-trained on top of Gemma 4 and Qwen 3.5, is MIT-licensed, and reports SWE-bench Verified 69.4 on the model card's own evaluation table.

✅ Why the RX 7800 XT is a roomy tier for the 9B. The 9B is a dense model — its whole footprint is the quant file you load. On 16GB you can run the highest-fidelity non-BF16 quant, Q8_0 (9.53GB), and keep a large context window resident at the same time — no KV-cache quantization needed for typical coding sessions. That headroom over a 12GB card is pure context room. (The larger Ornith 1.0 35B is a different, 24GB+ model — do not confuse the two; on a 16GB card you want the 9B, since the 35B's Q4_K_M floor is 21.2GB and does not fit 16GB. For the 35B on AMD you need a 24GB RX 7900 XTX.)

Requirements

Component	Minimum	Tested
GPU	8GB VRAM (Q4_K_M)	RX 7800 XT (16GB, RDNA3 Navi 32, gfx1101)
RAM	16GB system RAM	—
Storage	5.63GB (Q4_K_M) up to 9.53GB (Q8_0)	~12GB for Q8_0 + headroom
Driver	AMD ROCm v7 (installed via `amdgpu-install`) on Linux	—
Software	llama.cpp (HIP build) or Ollama; OpenHands or Aider client	—

Model weights (GGUF, byte-verified from the Ornith-1.0-9B-GGUF repo tree):

Quant	On-disk size	Fit on RX 7800 XT (16GB)
Q4_K_M	5.63GB	Fits with huge context room (also runs on 8GB cards)
Q5_K_M	6.47GB	Fits with very large context room
Q6_K	7.36GB	Fits — near-lossless, leaves ~8GB for KV cache and long context
Q8_0	9.53GB	Recommended — highest-fidelity quant; on 16GB it still leaves ~6GB for a large context
BF16	17.92GB	Does not fit 16GB — needs 24GB+

Licensing. Ornith 1.0 is MIT-licensed (per the model card's license: mit and its highlight line: MIT licensed, globally accessible, and free from regional limitations). You can use it commercially and privately without revenue caps.

Installation

Prerequisite — install the AMD ROCm v7 driver

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU — it is listed with LLVM target gfx1101 in AMD's ROCm Linux system-requirements matrix — but ROCm is not bundled with Ollama or the llama.cpp release binaries; you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):

# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm

# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME

Because the RX 7800 XT is on the supported-GPU matrix as gfx1101, you should not normally need an HSA_OVERRIDE_GFX_VERSION masquerade. If a tool ships only gfx1100 kernels and refuses to start on your card, the documented Linux fallback is to export HSA_OVERRIDE_GFX_VERSION=11.0.0 so the gfx1101 card presents as gfx1100 — treat that as a fallback, not a default.

You have two runtimes. Pick one. llama.cpp gives you the most control over context and KV-cache flags; Ollama is the fastest to stand up.

Option A — llama.cpp built with HIP/ROCm (recommended for full control)

1. Build llama.cpp with the HIP backend. Per the llama.cpp build docs, the Linux HIP build pattern is the same as for any RDNA3 card — only the GPU_TARGETS value changes. For the RX 7800 XT, pin it to gfx1101 (the card's LLVM target per the AMD ROCm system-requirements matrix):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1101 pins the kernels to the RX 7800 XT's architecture (Navi 32). Do not copy a gfx1100 value from a 7900-series guide — that is the wrong target for this card. The GGUF quants are integer formats, so RDNA3's lack of FP8/FP4 tensor hardware is irrelevant here.

2. That's it for install — llama.cpp pulls the GGUF straight from Hugging Face at launch (next section). No separate download step.

Option B — Ollama

Install Ollama with the Linux one-liner. Ollama is built on llama.cpp, detects the ROCm runtime you installed above, runs the gfx1101 card without any manual architecture flag, and can run any GGUF on the Hub directly — no Modelfile needed:

curl -fsSL https://ollama.com/install.sh | sh

Per the Ollama AMD preview blog, all of Ollama's features can be accelerated by AMD graphics cards on Linux and Windows. No manual weight download; the run command below pulls the quant for you.

Running

With llama.cpp

Serve an OpenAI-compatible API on port 8000. The -hf flag pulls the GGUF from Hugging Face; append :Q8_0 (case-insensitive) to select the quant — without a tag, llama.cpp defaults to Q4_K_M (llama-server docs):

# Q8_0 quant (max fidelity), offload all layers to the 7800 XT, 131k-token context
./build/bin/llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0 \
    --port 8000 \
    -ngl 99 \
    -c 131072 \
    --jinja

The model card's own quickstart shows the same server invocation with the full context window:

llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF --port 8000 -c 262144

-ngl 99 (alias --n-gpu-layers) offloads every layer to the GPU; -c (--ctx-size) sets the context length; --jinja uses the model's built-in chat template so the <think> reasoning and tool-call blocks parse correctly.

How much context actually fits in 16GB. Ornith 1.0 is a reasoning model — per the model card, by default the assistant turn opens with a <think>...</think> reasoning block before the final answer, and its context window is a large 262,144 tokens. Context is stored in the KV cache, which grows with the number of tokens and lives in VRAM on top of the weights. With Q8_0 (9.53GB) loaded, roughly ~6GB of the 16GB remain for the KV cache — enough to serve a 131,072-token window at default f16 cache without any quantization, which is why the command above sets -c 131072 and needs no KV-quant flags. To push toward the model's full 262,144-token window, quantize the KV cache: add -fa on (Flash Attention, required for quantized cache) and -ctk q8_0 -ctv q8_0 (--cache-type-k / --cache-type-v), which roughly halves KV-cache VRAM versus the default f16 with minimal quality impact (llama-server docs):

# Push toward the full 262k context on Q8_0 by quantizing the KV cache
./build/bin/llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0 \
    --port 8000 -ngl 99 -c 262144 --jinja \
    -fa on -ctk q8_0 -ctv q8_0

If you want even more context headroom without touching the KV cache, drop to Q6_K (7.36GB) — that frees ~2GB more of the 16GB for the f16 KV cache while staying near-lossless:

./build/bin/llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q6_K \
    --port 8000 -ngl 99 -c 131072 --jinja

With Ollama

Pull and chat with the same GGUF straight from Hugging Face (the card documents this exact form). Append a :quant tag to choose the quant; the default is Q4_K_M (HF × Ollama docs):

ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q8_0

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 for agent clients and uses the ROCm runtime on the gfx1101 card automatically.

Connect a coding agent

Ornith is optimized for terminal-based coding agents (verbatim from the model card), which directs you to point any OpenAI-compatible coding CLI at your Ornith endpoint by setting its base URL and API key.

OpenHands (lead choice). OpenHands is the harness DeepReinforce used to measure the 9B's headline SWE-bench Verified 69.4 (the model card footnote reads: SWE-Bench Verified, Pro and Multilingual: using OpenHands harness), so it is the officially-exercised agentic path for this model. Point it at your local server exactly as the card's OpenHands example does:

pip install openhands-ai

# OpenHands routes through LiteLLM; the "openai/" prefix selects the OpenAI-compatible path.
export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-9B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY"

openhands

Aider (lighter alternative). For a simpler terminal pair-programmer against the same endpoint, Aider connects to any OpenAI-compatible API:

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY
aider --model openai/deepreinforce-ai/Ornith-1.0-9B

Recommended sampling for Ornith is temperature=0.6, top_p=0.95, top_k=20 (per the model card's quickstart note).

Results

VRAM usage: The dense 9B loads entirely as its GGUF file — Q8_0 is 9.53GB and Q6_K is 7.36GB on disk (byte-verified from the GGUF repo tree). On the RX 7800 XT's 16GB, the max-fidelity Q8_0 fits with roughly ~6GB left over for the KV cache — enough for a large context window at the default f16 cache, no KV quantization required. Q4_K_M (5.63GB) leaves the most context headroom and also runs on 8GB cards.
Model capability: On its own evaluation table the 9B reports SWE-bench Verified 69.4, Terminal-Bench 2.1 (Terminus-2) 43.1, and NL2Repo 27.2 (model card) — state-of-the-art among open models of comparable size, per the card. These are the model's own coding-benchmark scores, not hardware throughput on this GPU.
Speed: No local throughput benchmark exists for Ornith 1.0 9B on the RX 7800 XT yet — this is a brand-new model and /check/ornith-1-0-9b/rx-7800-xt has no benchmark rows. We do not quote a tok/s figure rather than invent one or borrow one from a different card; live measurements will appear at that link once contributed.

For the full benchmark data, see /check/ornith-1-0-9b/rx-7800-xt.

Troubleshooting

The reply is full of raw `<think>` / `<tool_call>` tags

Ornith is a reasoning model with native tool-calling: the assistant turn opens with a <think> … </think> block and emits <tool_call> blocks. If your client shows these as raw text, make sure the server applies the model's chat template — pass --jinja to llama-server, or use Ollama (which reads the GGUF's built-in tokenizer.chat_template). A correctly-templated server surfaces the reasoning as a separate reasoning_content field and the tool calls as OpenAI-style tool_calls, per the model card's quickstart note.

Ollama or llama.cpp runs on the CPU instead of the GPU

Confirm the ROCm v7 driver is installed (rocm-smi should list the 7800 XT) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON -DGPU_TARGETS=gfx1101. The RX 7800 XT (gfx1101) is on the supported-GPU matrix, so you should not normally need HSA_OVERRIDE_GFX_VERSION — reach for the HSA_OVERRIDE_GFX_VERSION=11.0.0 masquerade only if a specific tool ships gfx1100-only kernels and refuses to start.

Out of memory at a long context

Even at Q8_0 the weights fit 16GB easily; the KV cache is what grows. If you OOM when raising -c beyond ~131k, either lower the context length or quantize the KV cache with -fa on -ctk q8_0 -ctv q8_0 (see Running) to reach toward the full 262k window. Dropping from Q8_0 to Q6_K also frees ~2GB for context.

Token generation feels slower than expected — try the Vulkan backend

On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, measured on a 7900-series RDNA3 card, the Vulkan (RADV) backend outpaced ROCm for pure generation on a small Q4_0 model across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm on the 7800 XT, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on RDNA3.

Model or GPU 404 on /check

Ornith 1.0 9B is a new addition; if the /check link 404s, the catalogue row is still being registered. The recipe's install and run steps are independent of the benchmark endpoint.