self-hosted/ai
§01·recipe · llm

DeepSeek-R1-Distill-Qwen-14B on RTX 4090 via Ollama Q4_K_M GGUF

llmbeginner10GB+ VRAMMay 20, 2026
models
tools
prerequisites
  • NVIDIA RTX 4090 (24 GB VRAM) or equivalent Ada-class 24 GB card
  • Python 3.10+ (only if you use the llama-cpp-python path)
  • Ollama 0.5.7+ installed, OR a recent llama.cpp build (b4514 or newer)

What You'll Build

A local DeepSeek-R1-Distill-Qwen-14B reasoning chatbot running on a single RTX 4090, served via Ollama with the default Q4_K_M GGUF quantization. The 24 GB card leaves plenty of headroom for the model's characteristic long <think> chain-of-thought traces — up to 64K context.

Hardware data: RTX 4090 (24 GB VRAM) · ~58.62 tok/s eval rate at Q4_K_M · See benchmark data

ℹ️ This is the Qwen2.5-14B distill, NOT Qwen3-14B. Per the official DeepSeek-R1 model card, DeepSeek-R1-Distill-Qwen-14B is fine-tuned from Qwen/Qwen2.5-14B with 800K samples generated by the full DeepSeek-R1 671B teacher. It is a different model from DeepSeek-R1-Distill-Qwen-1.5B, -Qwen-32B, and from the original DeepSeek-R1 (671B MoE). Slug/title disambiguation matters — copying a 1.5B or 32B install snippet against this 14B variant will silently fetch the wrong weights.

Requirements

ComponentMinimumTested
GPU12 GB VRAM (for Q4_K_M GGUF, default context)RTX 4090 (24 GB)
RAM16 GB system RAM
Storage~10 GB (Q4_K_M GGUF, 8.99 GB per bartowski's per-tier table)
SoftwareOllama 0.5.7+ or llama.cpp b4514+Ollama 0.5.7

Installation

1. Install Ollama

If you don't already have Ollama, follow the official install guide at ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

2. Pull the Q4_K_M GGUF

The default ollama pull deepseek-r1:14b already fetches the Q4_K_M quantization of DeepSeek-R1-Distill-Qwen-14B (~9 GB on disk, per the databasemart Ollama 0.5.7 benchmark):

ollama pull deepseek-r1:14b

If you prefer the explicit Unsloth GGUF mirror (same upstream, identical Q4_K_M file size of 8.99 GB):

ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M

3. (Alternative) Use llama.cpp directly

If you want finer control over context length and KV-cache quantization than Ollama exposes, use llama.cpp (b4514 or newer) with the bartowski GGUF:

llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
  --ctx-size 32768 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn \
  --n-gpu-layers -1

--cache-type-k/v q8_0 plus --flash-attn keeps the KV cache compact — important for a reasoning model that routinely emits 4K+ token <think> blocks (see Troubleshooting).

Running

With Ollama:

ollama run deepseek-r1:14b

You'll get an interactive REPL. Because DeepSeek-R1 is a reasoning model, do not add a system prompt — the official model card is explicit: "Avoid adding a system prompt; all instructions should be contained within the user prompt." Recommended sampling per the model card: temperature 0.6 (range 0.5–0.7), top_p 0.95, to "prevent endless repetitions or incoherent outputs."

Every response will open with a <think>...</think> block where the model reasons step-by-step, then emit its final answer below. For math, the model card recommends appending: "Please reason step by step, and put your final answer within \boxed{}."

Results

  • Speed: ~58.62 tok/s eval rate at Q4_K_M, single-stream, measured by databasemart's Ollama 0.5.7 benchmark which explicitly tested "Deepseek-R1, 14b, 9GB, Q4" on an RTX 4090. Single-source measurement — please contribute corroborating numbers via /contribute.
  • VRAM usage: ~9 GB measured at Q4_K_M / default Ollama context per the same databasemart benchmark; Q4_K_M on-disk file size is 8.99 GB per bartowski's per-quant-tier table. 24 GB leaves enormous headroom for KV cache growth (see Troubleshooting).
  • Quality notes: The model card reports AIME 2024 pass@1 = 69.7, MATH-500 pass@1 = 93.9, GPQA Diamond pass@1 = 59.1 — strong math/reasoning benchmarks for a 14B-parameter open-weights model. Quality at Q4_K_M is degraded vs. the reference FP16, but Q4_K_M is the default Ollama tag for this model (ollama.com/library/deepseek-r1:14b) and the standard "balanced size/quality" K-quant tier for 14B-class models on consumer hardware. If you want to step up quality at the cost of throughput, bartowski's per-tier table lists Q5_K_M (10.5 GB), Q6_K (12.1 GB), and Q8_0 (15.7 GB) — all fit RTX 4090 24 GB with room for context.

For the full benchmark data, see /check/deepseek-r1-distill-qwen-14b/rtx-4090.

Troubleshooting

Long <think> traces eat KV cache far faster than a regular chat model

The R1-distill family emits explicit chain-of-thought wrapped in <think>...</think> before answering. Single-question <think> blocks routinely run 2K–4K tokens (and on hard math/code problems, much longer), so your effective KV cache pressure is 5–10× a plain Q&A model at the same context-window setting. On a 24 GB card you can push to 64K context at Q4_K_M with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn per jamesflare's RTX 4090 deployment writeup ("If running the same configuration for a 14B q4_K_M quantized model, it can achieve a context length of 64K"), but at default 32K context you'll usually be fine. If you start seeing OOM mid-generation, lower --ctx-size first before downgrading the quant.

Model produces empty <think> blocks or skips reasoning

Per the official model card: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with <think>\n at the beginning of every output." If your chat client/template strips the leading <think>\n, the model may bypass reasoning entirely. Ollama's built-in template handles this correctly; if you're using llama-cpp-python or transformers directly, set the assistant message prefix to <think>\n explicitly.

Adding a system prompt degrades responses

Same model card: "Avoid adding a system prompt; all instructions should be contained within the user prompt." This is unusual versus most chat-tuned models. If you're routing through a wrapper (LangChain, LiteLLM, etc.) that auto-injects a default system message, disable it for this model.

License clarification

The model is released under the MIT License — commercial use, redistribution, and derivative works (including further distillation) are permitted. Note that the base Qwen2.5-14B it's distilled from is Apache 2.0; the distilled weights inherit MIT terms per the DeepSeek-R1 repository.