What You'll Build
A local DeepSeek-R1-Distill-Qwen-14B reasoning chatbot running on a single RTX 5090, served via Ollama with the default Q4_K_M GGUF quantization. Unlike the 24 GB sibling cards where 32K-context reasoning is comfortable but 128K is gated on aggressive KV-cache quantization, the 5090's 32 GB envelope clears the model's full native 131,072-token context with FP16 KV and still leaves room for a co-resident smaller model — or alternately, lets you step up to Q8_0 for the best quality your reasoning chain can get on a single consumer card.
Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · ~9 GB resident at Q4_K_M · headroom for 128K reasoning context · See benchmark data
ℹ️ This is the Qwen2.5-14B distill, NOT Qwen3-14B. Per the official DeepSeek-R1 model card,
DeepSeek-R1-Distill-Qwen-14Bis fine-tuned fromQwen/Qwen2.5-14Bwith 800K samples generated by the full DeepSeek-R1 671B teacher. It is a different model fromDeepSeek-R1-Distill-Qwen-1.5B,-Qwen-32B, and from the originalDeepSeek-R1(671B MoE). Slug/title disambiguation matters — copying a 1.5B or 32B install snippet against this 14B variant will silently fetch the wrong weights.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 10 GB VRAM (Q4_K_M GGUF, default context) | RTX 5090 (32 GB, Blackwell sm_120) |
| RAM | 16 GB system RAM | — |
| Storage | ~10 GB (Q4_K_M GGUF, 8.99 GB per bartowski's per-tier table); ~16 GB if you opt for Q8_0 | — |
| Software | CUDA 12.8+ runtime, Ollama 0.5.7+ or llama.cpp b4514+ | Ollama 0.5.7 |
Installation
1. Install Ollama (with CUDA 12.8 runtime for Blackwell)
If you don't already have Ollama, follow the official install guide at ollama.com/download. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Ollama's bundled CUDA 12.8 runtime supports Blackwell sm_120 natively as of recent releases. If you build llama.cpp from source instead, you must compile with -DCMAKE_CUDA_ARCHITECTURES=120 against CUDA Toolkit 12.8 or later — older toolchains do not emit sm_120 kernels.
2. Pull the Q4_K_M GGUF
The default ollama pull deepseek-r1:14b fetches the Q4_K_M quantization of DeepSeek-R1-Distill-Qwen-14B (9.0 GB on disk per the official Ollama library tag):
ollama pull deepseek-r1:14b
If you prefer the explicit Unsloth GGUF mirror (same upstream, identical Q4_K_M file size of 8.99 GB):
ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M
3. (Quality upgrade) Pull Q8_0 instead — 32 GB unlocks it
The 5090's 32 GB envelope fits Q8_0 (15.7 GB weights per bartowski's per-tier table) with comfortable room for KV cache and activations — a configuration that's tight on 24 GB cards once you push context past 32K. Q8_0 is near-lossless versus the reference BF16 and is the quant of choice when single-stream quality matters more than peak throughput:
ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q8_0
4. (Long-context option) Use llama.cpp directly with KV-cache controls
If you want to push past Ollama's default context window, use llama.cpp (b4514 or newer, built with CUDA 12.8) with the bartowski GGUF:
llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
--ctx-size 65536 \
--cache-type-k f16 \
--cache-type-v f16 \
--flash-attn \
--n-gpu-layers -1
At 64K context with FP16 KV on a 5090: ~9 GB weights + ~12 GB KV cache + activations ≈ 21–23 GB peak — well under the 32 GB envelope, no KV quantization needed. See Results for the math on pushing to 128K.
Running
With Ollama:
ollama run deepseek-r1:14b
You'll get an interactive REPL. Because DeepSeek-R1 is a reasoning model, do not add a system prompt — the official model card is explicit: "Avoid adding a system prompt; all instructions should be contained within the user prompt." Recommended sampling per the model card: "Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs."
Every response will open with a <think>...</think> block where the model reasons step-by-step, then emit its final answer below. For math, the model card recommends appending: "Please reason step by step, and put your final answer within \boxed{}."
Results
- Speed: No first-party RTX 5090 measurement for DeepSeek-R1-Distill-Qwen-14B Q4_K_M exists at the time of this writing — neither Hardware Corner's RTX 5090 LLM benchmark page (updated March 2026) nor LocalScore's RTX 5090 accelerator page carries a DeepSeek-R1 14B row. For an architecture-equivalent reference point: LocalScore's RTX 5090 page measures the base
Qwen2.5 14B Instruct (Q4_K - Medium)model — which DeepSeek-R1-Distill-Qwen-14B is fine-tuned directly from, with identical layer count, hidden size, and KV-head topology per the model's config.json — at 45.5 tok/s generation, 3678 tok/s prompt processing, 536 ms time-to-first-token, LocalScore 708. Token-generation throughput at a fixed quant is bandwidth-bound and architecture-bound, not weights-bound; the distill should land in the same neighborhood. Please contribute corroborating direct measurements via /contribute. - VRAM usage (Q4_K_M, this recipe's installed path): ~9 GB weights-resident at Q4_K_M — the same databasemart Ollama 0.5.7 benchmark cited for the RTX 4090 sibling recipe lists 9 GB for the 4090, and on a 32 GB 5090 the binding constraint has cleared by a wide margin. On-disk file size is 8.99 GB per bartowski's per-quant-tier table. With ~23 GB of headroom you can fully unlock KV cache or load a co-resident model (see Troubleshooting).
- Quality notes: The model card reports AIME 2024 pass@1 = 69.7, AIME 2024 cons@64 = 80.0, MATH-500 pass@1 = 93.9, GPQA Diamond pass@1 = 59.1, LiveCodeBench pass@1 = 53.1, CodeForces rating = 1481 — strong math/reasoning benchmarks for a 14B-parameter open-weights model. Quality at Q4_K_M is degraded vs. the reference BF16, but on a 5090 you don't have to settle for Q4_K_M to fit — Q8_0 (15.7 GB) is near-lossless versus BF16 and fits comfortably with full 128K context room (see Installation step 3 and Troubleshooting).
For the full benchmark data, see /check/deepseek-r1-distill-qwen-14b/rtx-5090.
Troubleshooting
Spending the headroom — unlock the full 128K reasoning context
The R1-distill family emits explicit chain-of-thought wrapped in <think>...</think> before answering. Single-question <think> blocks routinely run 2K–4K tokens (and on hard math/code problems, much longer), so your effective KV cache pressure is 5–10× a plain Q&A model at the same context-window setting. On a 24 GB 3090/4090, that's why the sibling recipes cap practical context at 32K-with-Q8_0-KV or 64K-with-Q8_0-KV. On the 5090's 32 GB envelope the math is materially looser. The model's native context is 131,072 tokens per the model's config.json (max_position_embeddings: 131072). With 48 layers × 8 GQA KV heads × 128-dim × 2 (k,v), FP16 KV is ~0.19 MB per token. Derived envelopes:
| Context | Q4_K_M weights + FP16 KV | Q4_K_M weights + Q8_0 KV | Q8_0 weights + FP16 KV | Q8_0 weights + Q8_0 KV |
|---|---|---|---|---|
| 32K | ~9 GB + ~6 GB = ~15 GB | ~9 GB + ~3 GB = ~12 GB | ~16 GB + ~6 GB = ~22 GB | ~16 GB + ~3 GB = ~19 GB |
| 64K | ~9 GB + ~12 GB = ~21 GB | ~9 GB + ~6 GB = ~15 GB | ~16 GB + ~12 GB = ~28 GB | ~16 GB + ~6 GB = ~22 GB |
| 128K | ~9 GB + ~24 GB = ~33 GB (tight, use Q8_0 KV) | ~9 GB + ~12 GB = ~21 GB | ~16 GB + ~24 GB = ~40 GB (use Q8_0 KV) | ~16 GB + ~12 GB = ~28 GB |
These are derived envelopes (weight file sizes from bartowski's per-tier table; KV math from the model's config.json GQA shape) and don't account for activation memory and CUDA workspace (typically +1–2 GB). Practical recipe: at Q4_K_M, run 64K context with FP16 KV (default --cache-type-k/v f16) — leaves ~10 GB free. To push to 128K cleanly, switch to --cache-type-k q8_0 --cache-type-v q8_0:
llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn \
--n-gpu-layers -1
If you start seeing OOM mid-generation at 128K, lower --ctx-size first before downgrading the weights quant — the KV cache scales linearly with --ctx-size and almost always dominates the OOM picture for this model on this card.
Co-locating a second model in the spare ~23 GB
With Q4_K_M and default context, the 5090 has roughly 23 GB free after DeepSeek-R1-Distill-14B is resident. Concrete colocation options:
- Llama 3.1 8B Q4_K_M (~4.9 GB) for fast routing / draft generation — pair the reasoning model with a faster non-reasoning model that handles trivial queries; combined footprint ~14 GB leaves room for a 64K context on the distill.
- Whisper-large-v3 (~3 GB) for voice → reasoning pipelines — DeepSeek-R1 answers spoken questions with full chain-of-thought; combined ~12 GB still leaves over half the card free for KV.
- Kokoro-82M (~1 GB) for spoken responses — round-trip the reasoning model's answer through TTS without a second GPU.
Each combination is sized from the cited Q4_K_M file footprints; verify combined VRAM behavior on first run since real-world overhead (activations, CUDA workspace, fragmentation) adds 1–2 GB per model.
Model produces empty <think> blocks or skips reasoning
Per the official model card: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with <think>\n at the beginning of every output." If your chat client/template strips the leading <think>\n, the model may bypass reasoning entirely. Ollama's built-in template handles this correctly; if you're using llama-cpp-python or transformers directly, set the assistant message prefix to <think>\n explicitly.
Adding a system prompt degrades responses
Same model card: "Avoid adding a system prompt; all instructions should be contained within the user prompt." This is unusual versus most chat-tuned models. If you're routing through a wrapper (LangChain, LiteLLM, etc.) that auto-injects a default system message, disable it for this model.
FlashAttention-2 may fail on Blackwell sm_120 in non-GGUF paths
Ollama and llama.cpp's --flash-attn flag run their own attention kernels and work cleanly on the 5090 with a CUDA 12.8 build. However, if you bypass GGUF and run the BF16 weights via raw transformers with attn_implementation="flash_attention_2", the Dao-AILab/flash-attention wheels may not ship sm_120 kernels yet — the canonical tracking issue is Dao-AILab/flash-attention#2168 ("[Blackwell/RTX 5090] CUDA error with flash-attention on RTX 5090 in WSL2") which is still open. On the transformers path, use attn_implementation="sdpa" (PyTorch's scaled dot-product attention) — which has full sm_120 coverage via cu128 wheels — as the always-works fallback. The GGUF/Ollama path documented in this recipe is unaffected.
License clarification
The model is released under the MIT License — commercial use, redistribution, and derivative works (including further distillation) are permitted. Note that the base Qwen2.5-14B it's distilled from is Apache 2.0; the distilled weights inherit MIT terms per the DeepSeek-R1 repository.