DeepSeek-R1-Distill-Qwen-14B on RTX 3090 via Ollama Q4_K_M GGUF

What You'll Build

A local DeepSeek-R1-Distill-Qwen-14B reasoning chatbot running on a single RTX 3090, served via Ollama with the default Q4_K_M GGUF quantization. The 24 GB Ampere card leaves ample headroom for the model's characteristic long <think> chain-of-thought traces — comfortably up to 64K context with KV-cache quantization.

Hardware data: RTX 3090 (24 GB VRAM) · ~35-40 tok/s eval rate at Q4_K_M · See benchmark data

ℹ️ This is the Qwen2.5-14B distill, NOT Qwen3-14B. Per the official DeepSeek-R1 model card, DeepSeek-R1-Distill-Qwen-14B is fine-tuned from Qwen/Qwen2.5-14B with 800K samples generated by the full DeepSeek-R1 671B teacher. It is a different model from DeepSeek-R1-Distill-Qwen-1.5B, -Qwen-32B, and from the original DeepSeek-R1 (671B MoE). Slug/title disambiguation matters — copying a 1.5B or 32B install snippet against this 14B variant will silently fetch the wrong weights.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (for Q4_K_M GGUF, default context)	RTX 3090 (24 GB)
RAM	16 GB system RAM	—
Storage	~10 GB (Q4_K_M GGUF, 8.99 GB per bartowski's per-tier table)	—
Software	Ollama 0.5.7+ or llama.cpp b4514+	Ollama 0.5.7

Installation

1. Install Ollama

If you don't already have Ollama, follow the official install guide at ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

2. Pull the Q4_K_M GGUF

The default ollama pull deepseek-r1:14b fetches the Q4_K_M quantization of DeepSeek-R1-Distill-Qwen-14B (9.0 GB on disk per the official Ollama library tag):

ollama pull deepseek-r1:14b

If you prefer the explicit Unsloth GGUF mirror (same upstream, identical Q4_K_M file size of 8.99 GB):

ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M

3. (Alternative) Use llama.cpp directly

If you want finer control over context length and KV-cache quantization than Ollama exposes, use llama.cpp (b4514 or newer) with the bartowski GGUF:

llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
  --ctx-size 16384 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn \
  --n-gpu-layers -1

--cache-type-k/v q8_0 plus --flash-attn keeps the KV cache compact — important for a reasoning model that routinely emits 4K+ token <think> blocks (see Troubleshooting). FlashAttention-2 kernels are fully supported on Ampere sm_86, so the --flash-attn flag delivers real KV-memory and throughput wins on the 3090.

Running

With Ollama:

ollama run deepseek-r1:14b

You'll get an interactive REPL. Because DeepSeek-R1 is a reasoning model, do not add a system prompt — the official model card is explicit: "Avoid adding a system prompt; all instructions should be contained within the user prompt." Recommended sampling per the model card: temperature 0.6 (range 0.5–0.7), top_p 0.95, to "prevent endless repetitions or incoherent outputs."

Every response will open with a <think>...</think> block where the model reasons step-by-step, then emit its final answer below. For math, the model card recommends appending: "Please reason step by step, and put your final answer within \boxed{}."

Results

Speed: ~35–40 tok/s eval rate at Q4_K_M, single-stream, per Groundy's "Real Throughput Numbers Across Configurations" table (row: RTX 3090 (24 GB) · R1-Distill-14B · Q4_K_M · ~35–40 · Ollama, article updated 2026-05-09). Single-source measurement; please contribute corroborating numbers via /contribute. For context, the same article measures ~58 tok/s on RTX 4090 at the same model/quant — roughly consistent with the 3090's ~93% memory bandwidth (936 GB/s vs 1008 GB/s).
VRAM usage: ~9 GB weights-resident at Q4_K_M (8.99 GB on-disk file size per bartowski's per-quant-tier table; 9.0 GB per Ollama's deepseek-r1:14b tag). On a 24 GB card you have ~15 GB of headroom for KV cache, activations, and context growth — see Troubleshooting for sizing the actual peak under reasoning workloads.
Quality notes: The model card reports AIME 2024 pass@1 = 69.7, MATH-500 pass@1 = 93.9, GPQA Diamond pass@1 = 59.1 — strong math/reasoning benchmarks for a 14B-parameter open-weights model. Quality at Q4_K_M is degraded vs. the reference FP16, but Q4_K_M is the default Ollama tag for this model and the standard "balanced size/quality" K-quant tier for 14B-class models on consumer hardware. If you want to step up quality at the cost of throughput, bartowski's per-tier table lists Q5_K_M (10.51 GB), Q6_K (12.12 GB), and Q8_0 (15.70 GB) — all fit RTX 3090 24 GB with room for context.

For the full benchmark data, see /check/deepseek-r1-distill-qwen-14b/rtx-3090.

Troubleshooting

Long `<think>` traces eat KV cache far faster than a regular chat model

The R1-distill family emits explicit chain-of-thought wrapped in <think>...</think> before answering. Single-question <think> blocks routinely run 2K–4K tokens (and on hard math/code problems, much longer), so your effective KV cache pressure is 5–10× a plain Q&A model at the same context-window setting. On a 24 GB card running Q4_K_M (~9 GB weights), you have ~15 GB free for KV + activations — plenty for 16K context with default fp16 KV. To push to 32K or 64K, quantize the KV cache (24 GB card required — this exceeds the recipe's documented 12 GB minimum):

# Requires RTX 3090 / 4090 / 5090-class 24 GB card; not for the 12 GB minimum tier.
llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
  --ctx-size 32768 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn \
  --n-gpu-layers -1

The --cache-type-k/v q8_0 flags halve KV memory at no perceptible quality loss for reasoning workloads, and --flash-attn (FA2) further reduces activation memory — both supported on Ampere sm_86. If you start seeing OOM mid-generation, lower --ctx-size first before downgrading the quant.

Model produces empty `<think>` blocks or skips reasoning

Per the official model card: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with <think>\n at the beginning of every output." If your chat client/template strips the leading <think>\n, the model may bypass reasoning entirely. Ollama's built-in template handles this correctly; if you're using llama-cpp-python or transformers directly, set the assistant message prefix to <think>\n explicitly.

Adding a system prompt degrades responses

Same model card: "Avoid adding a system prompt; all instructions should be contained within the user prompt." This is unusual versus most chat-tuned models. If you're routing through a wrapper (LangChain, LiteLLM, etc.) that auto-injects a default system message, disable it for this model.

Chat-class tok/s numbers over-estimate reasoning throughput

The ~35-40 tok/s figure measures raw token emission rate. Because most of the output is <think> content the user ultimately discards, effective "answer tokens per second" is typically 30-50% lower. If you're benchmarking against a chat-tuned model, compare end-to-end answer latency rather than raw tok/s — DeepSeek-R1-distill spends most of its output budget on the reasoning trace.

License clarification

The model is released under the MIT License — commercial use, redistribution, and derivative works (including further distillation) are permitted. Note that the base Qwen2.5-14B it's distilled from is Apache 2.0; the distilled weights inherit MIT terms per the DeepSeek-R1 repository.