What You'll Build
A local, private agentic-coding setup: Ornith 1.0 35B — DeepReinforce's open (MIT) Mixture-of-Experts coding model — served as an OpenAI-compatible endpoint by llama.cpp (built against AMD's HIP/ROCm backend) on a single 24GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100), driven by the OpenHands coding agent so the model can read your repo, run shell commands, and edit files. Ornith is a reasoning model: each turn opens with a <think> chain-of-thought before the answer, and it emits native <tool_call> blocks for agentic workflows. This recipe uses the Q4_K_M quant (21.2 GB on disk) — the smallest official build, and the reason the RX 7900 XTX is the entry AMD card for the 35B.
Hardware data: RX 7900 XTX (24GB VRAM) · Ornith 1.0 35B Q4_K_M GGUF (21.2 GB weights) · ROCm 7 · short-context, tight fit · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no
cu124/cu128wheel and no FlashAttention prebuilt-wheel step here. For this coding LLM the reliable path is GGUF via llama.cpp-HIP (or Ollama, which bundles llama.cpp). Do not follow a guide that tells you topip install flash-attn, pick acu12xwheel, or use ExLlamaV2/Marlin for this card — those are NVIDIA-only.
⚠️ This is a tight fit, and context is the binding constraint. The Q4_K_M weights alone occupy 21.2 GB of the RX 7900 XTX's 24 GB, leaving only ~2–3 GB for the KV cache. Ornith's context window is 262,144 tokens, but you cannot run the full 256K context on a 24GB card — the KV cache for that would need tens of GB. This recipe caps context and quantizes the KV cache so a useful working window fits. Read the Running section before you launch.
ℹ️ An MoE keeps all experts resident — the file size is the VRAM cost. Ornith 1.0 35B is a Mixture-of-Experts. An MoE activates only some experts per token (a throughput property), but all experts stay loaded in VRAM, so the memory footprint is the full quant file — 21.2 GB at Q4_K_M, not some smaller "active" fraction. Do not expect a low active-parameter count to shrink the VRAM requirement.
ℹ️ Card too small? Use the 9B. There is no official Q3/Q2 build of the 35B — Q4_K_M (21.2 GB) is the floor, so AMD cards below 24GB cannot run this model from the official GGUF. On a 16 GB card like the RX 7800 XT, use Ornith 1.0 9B instead (Q4_K_M is ~5.6 GB); it is the same agentic-coding family sized for smaller rigs.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24GB VRAM (this is the 35B's floor) | RX 7900 XTX (24GB, RDNA3 Navi 31, gfx1100) |
| RAM | 16GB system RAM (32GB comfortable for the agent + repo) | — |
| Storage | ~22GB (the Q4_K_M GGUF is 21.2 GB) | 21.17 GB (ornith-1.0-35b-Q4_K_M.gguf) |
| Driver | AMD ROCm v7 (installed via amdgpu-install) on Linux | — |
| Software | llama.cpp (HIP build) or Ollama; Python 3.10+ for OpenHands | llama.cpp llama-server, OpenHands |
The Q4_K_M GGUF file is 21.17 GB (21,166,757,760 bytes) per the Ornith-1.0-35B-GGUF file tree; the other published quants are Q5_K_M 24.73 GB, Q6_K 28.51 GB, Q8_0 36.90 GB, and BF16 69.38 GB — all of which exceed the RX 7900 XTX's 24 GB, so Q4_K_M is the only official quant that fits this card. The model is MIT-licensed, per the Ornith-1.0-35B-GGUF model card.
Installation
Prerequisite — install the AMD ROCm v7 driver
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, but ROCm is not bundled with Ollama or the llama.cpp release binaries — you install it once at the OS level. Per the Ollama AMD GPU docs, Ollama requires the AMD ROCm v7 driver on Linux, installed or upgraded with the amdgpu-install utility. On Ubuntu 24.04 (Noble), install ROCm 7.2.1 via the standard amdgpu-install flow (AMD's Radeon ROCm install docs cover the current packages):
# 1. Add the amdgpu-install package and install ROCm
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/noble/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update
sudo amdgpu-install -y --usecase=graphics,rocm
# 2. Add yourself to the render/video groups (log out/in afterward)
sudo usermod -a -G render,video $LOGNAME
The RX 7900 XTX is on Ollama's supported AMD Radeon RX list and gfx1100 is in its supported LLVM-target list — so no HSA_OVERRIDE_GFX_VERSION masquerade is needed for this card (that override is only for cards ROCm doesn't ship kernels for).
1. Build llama.cpp with the HIP/ROCm backend
The RX 7900 XTX is RDNA3 (Navi 31, gfx1100) and is a first-class HIP target. Per the llama.cpp build docs, the Linux HIP build for an RDNA3 card like the RX 7900 XTX is:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16
-DGGML_HIP=ON selects the ROCm backend; -DGPU_TARGETS=gfx1100 pins the kernels to the 7900 XTX's architecture (the build docs use gfx1100 as the explicit example for the "Radeon RX 7900XTX"). The GGUF quants are integer formats, so the absence of FP8/FP4 tensor hardware on RDNA3 is irrelevant here — Ornith's Q4_K_M runs on standard HIP kernels.
2. Download the Q4_K_M GGUF
llama-server can pull the GGUF straight from Hugging Face and cache it locally. The -hf flag takes <user>/<model>[:quant], and per the canonical llama.cpp flag definitions the quant is optional and defaults to Q4_K_M — which is exactly the quant this recipe wants:
# Downloads ornith-1.0-35b-Q4_K_M.gguf (21.2 GB) into the llama.cpp cache on first use
./build/bin/llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M --port 8000
The first launch downloads ~21 GB; subsequent launches reuse the cached file. If you prefer to download the file explicitly first, grab ornith-1.0-35b-Q4_K_M.gguf from the GGUF repo files tab and pass it with -m <path> instead.
3. Install the OpenHands coding agent
OpenHands is an open-source agentic-coding client that drives any OpenAI-compatible endpoint. The Ornith-1.0-35B-GGUF model card documents the CLI install:
pip install openhands-ai
Alternatively, run the OpenHands Docker image — in that case point its base URL at http://host.docker.internal:8000/v1 so the container can reach the llama.cpp server running on your host.
Running
1. Serve Ornith with a capped, KV-quantized context
The model card's llama.cpp example, per the Ornith-1.0-35B-GGUF model card, is llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF --port 8000 -c 262144 — but -c 262144 is the full 256K context and will not fit on a 24GB card once the 21.2 GB of weights are resident. On the RX 7900 XTX, cap the context and quantize the KV cache so it fits in the ~2–3 GB that remains. All flags below are from the canonical llama.cpp flag definitions (-c/--ctx-size, -ctk/--cache-type-k, -ctv/--cache-type-v, -fa/--flash-attn, -ngl/--gpu-layers):
./build/bin/llama-server \
-hf deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M \
--port 8000 \
-ngl 99 \
-c 16384 \
-fa on \
-ctk q8_0 -ctv q8_0
-ngl 99offloads all layers to the GPU (the 21.2 GB Q4 file must sit in VRAM — see the MoE note above).-c 16384caps context at 16K. This is a deliberate, tight starting point; raise or lower it while watchingrocm-smi(see Troubleshooting). Reasoning models spend thousands of tokens per turn inside<think>blocks, so KV pressure is higher than a plain chat model at the same context setting.-fa onenables Flash Attention, and-ctk q8_0 -ctv q8_0quantize the K and V caches to 8-bit — roughly halving KV-cache memory versus the fp16 default, which is what buys back usable context on a card this tight.
This exposes an OpenAI-compatible API at http://localhost:8000/v1. (Ollama is a valid alternative — ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF, per the same model card — but Ollama's own context default is small and must likewise be raised deliberately rather than set to the model's full 256K; Ollama uses the ROCm runtime you installed above and runs the gfx1100 card natively.)
2. Point OpenHands at the local server
OpenHands routes through LiteLLM, so a custom OpenAI-compatible endpoint uses an openai/ model prefix, per the OpenHands local-LLM docs. Using the CLI env-var form documented on the Ornith-1.0-35B-GGUF model card:
export LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-35B"
export LLM_BASE_URL="http://localhost:8000/v1"
export LLM_API_KEY="EMPTY" # any non-empty string; local servers don't check it
openhands
OpenHands will now use Ornith to plan, run shell commands, and edit files in your workspace. The model's <think> reasoning drives its planning and its native tool-calling drives the file/shell actions.
Results
- Speed: No benchmark for Ornith 1.0 35B on the RX 7900 XTX exists yet —
/check/ornith-1-0-35b/rx-7900-xtxhas no benchmark data, and this is a brand-new model, so we do not quote a tok/s figure rather than invent one or borrow one from different hardware. (Note that a reasoning model's effective throughput is lower than a raw tok/s number suggests, because much of each turn is<think>content you discard.) - VRAM usage: The Q4_K_M weights are 21.2 GB on disk and must be held resident, leaving ~2–3 GB of the RX 7900 XTX's 24 GB for the KV cache and activations — which is why context is capped and the KV cache is 8-bit-quantized above. Weight and per-quant file sizes are verified via the Ornith-1.0-35B-GGUF file tree.
- Quality notes: DeepReinforce reports the 35B scoring SWE-bench Verified 75.6 and Terminal-Bench 2.1 (Terminus-2) 64.2 — state-of-the-art among similarly-sized open models — per the vendor benchmark table on the Ornith-1.0-35B-GGUF model card. Those are the vendor's own agentic-coding evals, not a measurement on this GPU. Recommended sampling per the model card is temperature 0.6, top_p 0.95, top_k 20.
For the full benchmark data, see /check/ornith-1-0-35b/rx-7900-xtx.
Troubleshooting
Out of memory at launch, or the KV cache won't fit
The 21.2 GB of weights leave almost no room on a 24GB card, so OOM at startup usually means the context is set too high. Lower -c (try -c 8192), keep -ctk q8_0 -ctv q8_0 and -fa on enabled, and close any other GPU app before launching. Watch rocm-smi during a real agent task — a hard coding problem produces a long <think> block that grows the KV cache mid-generation, so size for the peak, not the idle load. If you need a larger working context than the 7900 XTX can give at Q4, that requires a 32GB-class card.
Ollama or llama.cpp runs on the CPU instead of the GPU
Confirm the ROCm v7 driver is installed (rocm-smi should list the 7900 XTX) and that your user is in the render and video groups (groups should show both — log out and back in after the usermod step). Per the Ollama AMD GPU docs, ROCm is a separate install from Ollama; if it's missing, Ollama silently falls back to CPU. For a source llama.cpp build, confirm you compiled with -DGGML_HIP=ON and that -ngl 99 is offloading layers. The RX 7900 XTX (gfx1100) is natively supported, so you should not need HSA_OVERRIDE_GFX_VERSION — only unsupported cards need that masquerade.
Token generation feels slower than expected — try the Vulkan backend
On RDNA3 the ROCm/HIP backend can be slower at token generation than llama.cpp's Vulkan backend. Per llama.cpp issue #20934, on the RX 7900 XTX (gfx1100) Vulkan (RADV) reached ~167–177 tok/s on Llama 7B Q4_0 while ROCm landed at ~129–144 tok/s across ROCm 6.4.4–7.x. If your generation rate disappoints under ROCm, build llama.cpp with -DGGML_VULKAN=ON instead of -DGGML_HIP=ON and re-benchmark with llama-bench — Vulkan often wins for pure generation on this card. (These are Llama-7B figures cited only to show the ROCm-vs-Vulkan gap on this GPU, not Ornith 35B numbers.)
The model gets stuck in a recursive loop / repeats tokens
Multiple community users report on the GGUF model card discussions that llama.cpp runs can fall into repetitive/recursive generation once the conversation grows past roughly 22K tokens. This is another reason the recipe caps context (-c 16384) — staying well under that threshold sidesteps the reported failure mode. These are community reports on the model card discussions, not a vendor-confirmed fix; if you hit it, keep sessions short and the context capped.
The agent won't edit files / botches tool calls
Community users report (on the GGUF model card discussions) that with the default chat template the model sometimes fails to edit files or emit clean tool calls in agent clients. The community workaround shared in that thread is to pass a corrected Qwen-family chat template to llama-server via --chat-template-file <path-to-template.jinja>. Because Ornith is post-trained on a Gemma 4 + Qwen 3.5 lineage, its chat template is Qwen-shaped; if agentic editing misbehaves, try a corrected template as those users did. Treat this as a community-reported workaround, not an official fix.