What You'll Build
A local DeepSeek-R1-Distill-Qwen-14B reasoning chatbot running on a single RTX 4090, served via Ollama with the default Q4_K_M GGUF quantization. The 24 GB card leaves plenty of headroom for the model's characteristic long <think> chain-of-thought traces — up to 64K context.
Hardware data: RTX 4090 (24 GB VRAM) · ~58.62 tok/s eval rate at Q4_K_M · See benchmark data
ℹ️ This is the Qwen2.5-14B distill, NOT Qwen3-14B. Per the official DeepSeek-R1 model card,
DeepSeek-R1-Distill-Qwen-14Bis fine-tuned fromQwen/Qwen2.5-14Bwith 800K samples generated by the full DeepSeek-R1 671B teacher. It is a different model fromDeepSeek-R1-Distill-Qwen-1.5B,-Qwen-32B, and from the originalDeepSeek-R1(671B MoE). Slug/title disambiguation matters — copying a 1.5B or 32B install snippet against this 14B variant will silently fetch the wrong weights.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (for Q4_K_M GGUF, default context) | RTX 4090 (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~10 GB (Q4_K_M GGUF, 8.99 GB per bartowski's per-tier table) | — |
| Software | Ollama 0.5.7+ or llama.cpp b4514+ | Ollama 0.5.7 |
Installation
1. Install Ollama
If you don't already have Ollama, follow the official install guide at ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
2. Pull the Q4_K_M GGUF
The default ollama pull deepseek-r1:14b already fetches the Q4_K_M quantization of DeepSeek-R1-Distill-Qwen-14B (~9 GB on disk, per the databasemart Ollama 0.5.7 benchmark):
ollama pull deepseek-r1:14b
If you prefer the explicit Unsloth GGUF mirror (same upstream, identical Q4_K_M file size of 8.99 GB):
ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M
3. (Alternative) Use llama.cpp directly
If you want finer control over context length and KV-cache quantization than Ollama exposes, use llama.cpp (b4514 or newer) with the bartowski GGUF:
llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
--ctx-size 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers -1
--cache-type-k/v q8_0 plus --flash-attn keeps the KV cache compact — important for a reasoning model that routinely emits 4K+ token <think> blocks (see Troubleshooting).
Running
With Ollama:
ollama run deepseek-r1:14b
You'll get an interactive REPL. Because DeepSeek-R1 is a reasoning model, do not add a system prompt — the official model card is explicit: "Avoid adding a system prompt; all instructions should be contained within the user prompt." Recommended sampling per the model card: temperature 0.6 (range 0.5–0.7), top_p 0.95, to "prevent endless repetitions or incoherent outputs."
Every response will open with a <think>...</think> block where the model reasons step-by-step, then emit its final answer below. For math, the model card recommends appending: "Please reason step by step, and put your final answer within \boxed{}."
Results
- Speed: ~58.62 tok/s eval rate at Q4_K_M, single-stream, measured by databasemart's Ollama 0.5.7 benchmark which explicitly tested "Deepseek-R1, 14b, 9GB, Q4" on an RTX 4090. Single-source measurement — please contribute corroborating numbers via /contribute.
- VRAM usage: ~9 GB measured at Q4_K_M / default Ollama context per the same databasemart benchmark; Q4_K_M on-disk file size is 8.99 GB per bartowski's per-quant-tier table. 24 GB leaves enormous headroom for KV cache growth (see Troubleshooting).
- Quality notes: The model card reports AIME 2024 pass@1 = 69.7, MATH-500 pass@1 = 93.9, GPQA Diamond pass@1 = 59.1 — strong math/reasoning benchmarks for a 14B-parameter open-weights model. Quality at Q4_K_M is degraded vs. the reference FP16, but Q4_K_M is the default Ollama tag for this model (ollama.com/library/deepseek-r1:14b) and the standard "balanced size/quality" K-quant tier for 14B-class models on consumer hardware. If you want to step up quality at the cost of throughput, bartowski's per-tier table lists Q5_K_M (10.5 GB), Q6_K (12.1 GB), and Q8_0 (15.7 GB) — all fit RTX 4090 24 GB with room for context.
For the full benchmark data, see /check/deepseek-r1-distill-qwen-14b/rtx-4090.
Troubleshooting
Long <think> traces eat KV cache far faster than a regular chat model
The R1-distill family emits explicit chain-of-thought wrapped in <think>...</think> before answering. Single-question <think> blocks routinely run 2K–4K tokens (and on hard math/code problems, much longer), so your effective KV cache pressure is 5–10× a plain Q&A model at the same context-window setting. On a 24 GB card you can push to 64K context at Q4_K_M with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn per jamesflare's RTX 4090 deployment writeup ("If running the same configuration for a 14B q4_K_M quantized model, it can achieve a context length of 64K"), but at default 32K context you'll usually be fine. If you start seeing OOM mid-generation, lower --ctx-size first before downgrading the quant.
Model produces empty <think> blocks or skips reasoning
Per the official model card: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with <think>\n at the beginning of every output." If your chat client/template strips the leading <think>\n, the model may bypass reasoning entirely. Ollama's built-in template handles this correctly; if you're using llama-cpp-python or transformers directly, set the assistant message prefix to <think>\n explicitly.
Adding a system prompt degrades responses
Same model card: "Avoid adding a system prompt; all instructions should be contained within the user prompt." This is unusual versus most chat-tuned models. If you're routing through a wrapper (LangChain, LiteLLM, etc.) that auto-injects a default system message, disable it for this model.
License clarification
The model is released under the MIT License — commercial use, redistribution, and derivative works (including further distillation) are permitted. Note that the base Qwen2.5-14B it's distilled from is Apache 2.0; the distilled weights inherit MIT terms per the DeepSeek-R1 repository.