What You'll Build
A local DeepSeek-R1-Distill-Qwen-14B reasoning chatbot running on a single RTX 3090, served via Ollama with the default Q4_K_M GGUF quantization. The 24 GB Ampere card leaves ample headroom for the model's characteristic long <think> chain-of-thought traces — comfortably up to 64K context with KV-cache quantization.
Hardware data: RTX 3090 (24 GB VRAM) · ~35-40 tok/s eval rate at Q4_K_M · See benchmark data
ℹ️ This is the Qwen2.5-14B distill, NOT Qwen3-14B. Per the official DeepSeek-R1 model card,
DeepSeek-R1-Distill-Qwen-14Bis fine-tuned fromQwen/Qwen2.5-14Bwith 800K samples generated by the full DeepSeek-R1 671B teacher. It is a different model fromDeepSeek-R1-Distill-Qwen-1.5B,-Qwen-32B, and from the originalDeepSeek-R1(671B MoE). Slug/title disambiguation matters — copying a 1.5B or 32B install snippet against this 14B variant will silently fetch the wrong weights.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM (for Q4_K_M GGUF, default context) | RTX 3090 (24 GB) |
| RAM | 16 GB system RAM | — |
| Storage | ~10 GB (Q4_K_M GGUF, 8.99 GB per bartowski's per-tier table) | — |
| Software | Ollama 0.5.7+ or llama.cpp b4514+ | Ollama 0.5.7 |
Installation
1. Install Ollama
If you don't already have Ollama, follow the official install guide at ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
2. Pull the Q4_K_M GGUF
The default ollama pull deepseek-r1:14b fetches the Q4_K_M quantization of DeepSeek-R1-Distill-Qwen-14B (9.0 GB on disk per the official Ollama library tag):
ollama pull deepseek-r1:14b
If you prefer the explicit Unsloth GGUF mirror (same upstream, identical Q4_K_M file size of 8.99 GB):
ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M
3. (Alternative) Use llama.cpp directly
If you want finer control over context length and KV-cache quantization than Ollama exposes, use llama.cpp (b4514 or newer) with the bartowski GGUF:
llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
--ctx-size 16384 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers -1
--cache-type-k/v q8_0 plus --flash-attn keeps the KV cache compact — important for a reasoning model that routinely emits 4K+ token <think> blocks (see Troubleshooting). FlashAttention-2 kernels are fully supported on Ampere sm_86, so the --flash-attn flag delivers real KV-memory and throughput wins on the 3090.
Running
With Ollama:
ollama run deepseek-r1:14b
You'll get an interactive REPL. Because DeepSeek-R1 is a reasoning model, do not add a system prompt — the official model card is explicit: "Avoid adding a system prompt; all instructions should be contained within the user prompt." Recommended sampling per the model card: temperature 0.6 (range 0.5–0.7), top_p 0.95, to "prevent endless repetitions or incoherent outputs."
Every response will open with a <think>...</think> block where the model reasons step-by-step, then emit its final answer below. For math, the model card recommends appending: "Please reason step by step, and put your final answer within \boxed{}."
Results
- Speed: ~35–40 tok/s eval rate at Q4_K_M, single-stream, per Groundy's "Real Throughput Numbers Across Configurations" table (row: RTX 3090 (24 GB) · R1-Distill-14B · Q4_K_M · ~35–40 · Ollama, article updated 2026-05-09). Single-source measurement; please contribute corroborating numbers via /contribute. For context, the same article measures ~58 tok/s on RTX 4090 at the same model/quant — roughly consistent with the 3090's ~93% memory bandwidth (936 GB/s vs 1008 GB/s).
- VRAM usage: ~9 GB weights-resident at Q4_K_M (8.99 GB on-disk file size per bartowski's per-quant-tier table; 9.0 GB per Ollama's deepseek-r1:14b tag). On a 24 GB card you have ~15 GB of headroom for KV cache, activations, and context growth — see Troubleshooting for sizing the actual peak under reasoning workloads.
- Quality notes: The model card reports AIME 2024 pass@1 = 69.7, MATH-500 pass@1 = 93.9, GPQA Diamond pass@1 = 59.1 — strong math/reasoning benchmarks for a 14B-parameter open-weights model. Quality at Q4_K_M is degraded vs. the reference FP16, but Q4_K_M is the default Ollama tag for this model and the standard "balanced size/quality" K-quant tier for 14B-class models on consumer hardware. If you want to step up quality at the cost of throughput, bartowski's per-tier table lists Q5_K_M (10.51 GB), Q6_K (12.12 GB), and Q8_0 (15.70 GB) — all fit RTX 3090 24 GB with room for context.
For the full benchmark data, see /check/deepseek-r1-distill-qwen-14b/rtx-3090.
Troubleshooting
Long <think> traces eat KV cache far faster than a regular chat model
The R1-distill family emits explicit chain-of-thought wrapped in <think>...</think> before answering. Single-question <think> blocks routinely run 2K–4K tokens (and on hard math/code problems, much longer), so your effective KV cache pressure is 5–10× a plain Q&A model at the same context-window setting. On a 24 GB card running Q4_K_M (~9 GB weights), you have ~15 GB free for KV + activations — plenty for 16K context with default fp16 KV. To push to 32K or 64K, quantize the KV cache (24 GB card required — this exceeds the recipe's documented 12 GB minimum):
# Requires RTX 3090 / 4090 / 5090-class 24 GB card; not for the 12 GB minimum tier.
llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
--ctx-size 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers -1
The --cache-type-k/v q8_0 flags halve KV memory at no perceptible quality loss for reasoning workloads, and --flash-attn (FA2) further reduces activation memory — both supported on Ampere sm_86. If you start seeing OOM mid-generation, lower --ctx-size first before downgrading the quant.
Model produces empty <think> blocks or skips reasoning
Per the official model card: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with <think>\n at the beginning of every output." If your chat client/template strips the leading <think>\n, the model may bypass reasoning entirely. Ollama's built-in template handles this correctly; if you're using llama-cpp-python or transformers directly, set the assistant message prefix to <think>\n explicitly.
Adding a system prompt degrades responses
Same model card: "Avoid adding a system prompt; all instructions should be contained within the user prompt." This is unusual versus most chat-tuned models. If you're routing through a wrapper (LangChain, LiteLLM, etc.) that auto-injects a default system message, disable it for this model.
Chat-class tok/s numbers over-estimate reasoning throughput
The ~35-40 tok/s figure measures raw token emission rate. Because most of the output is <think> content the user ultimately discards, effective "answer tokens per second" is typically 30-50% lower. If you're benchmarking against a chat-tuned model, compare end-to-end answer latency rather than raw tok/s — DeepSeek-R1-distill spends most of its output budget on the reasoning trace.
License clarification
The model is released under the MIT License — commercial use, redistribution, and derivative works (including further distillation) are permitted. Note that the base Qwen2.5-14B it's distilled from is Apache 2.0; the distilled weights inherit MIT terms per the DeepSeek-R1 repository.