self-hosted/ai
§01·recipe · llm

DeepSeek-R1-Distill-Qwen-14B on RTX 5090: 128K Reasoning Context Unlocked

llmbeginner10GB+ VRAMMay 24, 2026
models
tools
prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM, Blackwell sm_120) or any 24 GB+ Ampere/Ada/Blackwell card
  • CUDA 12.8+ runtime (required for Blackwell sm_120)
  • Ollama 0.5.7+ OR a recent llama.cpp build (b4514 or newer) compiled with CUDA 12.8

What You'll Build

A local DeepSeek-R1-Distill-Qwen-14B reasoning chatbot running on a single RTX 5090, served via Ollama with the default Q4_K_M GGUF quantization. Unlike the 24 GB sibling cards where 32K-context reasoning is comfortable but 128K is gated on aggressive KV-cache quantization, the 5090's 32 GB envelope clears the model's full native 131,072-token context with FP16 KV and still leaves room for a co-resident smaller model — or alternately, lets you step up to Q8_0 for the best quality your reasoning chain can get on a single consumer card.

Hardware data: RTX 5090 (32 GB VRAM, Blackwell sm_120) · ~9 GB resident at Q4_K_M · headroom for 128K reasoning context · See benchmark data

ℹ️ This is the Qwen2.5-14B distill, NOT Qwen3-14B. Per the official DeepSeek-R1 model card, DeepSeek-R1-Distill-Qwen-14B is fine-tuned from Qwen/Qwen2.5-14B with 800K samples generated by the full DeepSeek-R1 671B teacher. It is a different model from DeepSeek-R1-Distill-Qwen-1.5B, -Qwen-32B, and from the original DeepSeek-R1 (671B MoE). Slug/title disambiguation matters — copying a 1.5B or 32B install snippet against this 14B variant will silently fetch the wrong weights.

Requirements

ComponentMinimumTested
GPU10 GB VRAM (Q4_K_M GGUF, default context)RTX 5090 (32 GB, Blackwell sm_120)
RAM16 GB system RAM
Storage~10 GB (Q4_K_M GGUF, 8.99 GB per bartowski's per-tier table); ~16 GB if you opt for Q8_0
SoftwareCUDA 12.8+ runtime, Ollama 0.5.7+ or llama.cpp b4514+Ollama 0.5.7

Installation

1. Install Ollama (with CUDA 12.8 runtime for Blackwell)

If you don't already have Ollama, follow the official install guide at ollama.com/download. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Ollama's bundled CUDA 12.8 runtime supports Blackwell sm_120 natively as of recent releases. If you build llama.cpp from source instead, you must compile with -DCMAKE_CUDA_ARCHITECTURES=120 against CUDA Toolkit 12.8 or later — older toolchains do not emit sm_120 kernels.

2. Pull the Q4_K_M GGUF

The default ollama pull deepseek-r1:14b fetches the Q4_K_M quantization of DeepSeek-R1-Distill-Qwen-14B (9.0 GB on disk per the official Ollama library tag):

ollama pull deepseek-r1:14b

If you prefer the explicit Unsloth GGUF mirror (same upstream, identical Q4_K_M file size of 8.99 GB):

ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M

3. (Quality upgrade) Pull Q8_0 instead — 32 GB unlocks it

The 5090's 32 GB envelope fits Q8_0 (15.7 GB weights per bartowski's per-tier table) with comfortable room for KV cache and activations — a configuration that's tight on 24 GB cards once you push context past 32K. Q8_0 is near-lossless versus the reference BF16 and is the quant of choice when single-stream quality matters more than peak throughput:

ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q8_0

4. (Long-context option) Use llama.cpp directly with KV-cache controls

If you want to push past Ollama's default context window, use llama.cpp (b4514 or newer, built with CUDA 12.8) with the bartowski GGUF:

llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
  --ctx-size 65536 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --flash-attn \
  --n-gpu-layers -1

At 64K context with FP16 KV on a 5090: ~9 GB weights + ~12 GB KV cache + activations ≈ 21–23 GB peak — well under the 32 GB envelope, no KV quantization needed. See Results for the math on pushing to 128K.

Running

With Ollama:

ollama run deepseek-r1:14b

You'll get an interactive REPL. Because DeepSeek-R1 is a reasoning model, do not add a system prompt — the official model card is explicit: "Avoid adding a system prompt; all instructions should be contained within the user prompt." Recommended sampling per the model card: "Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs."

Every response will open with a <think>...</think> block where the model reasons step-by-step, then emit its final answer below. For math, the model card recommends appending: "Please reason step by step, and put your final answer within \boxed{}."

Results

  • Speed: No first-party RTX 5090 measurement for DeepSeek-R1-Distill-Qwen-14B Q4_K_M exists at the time of this writing — neither Hardware Corner's RTX 5090 LLM benchmark page (updated March 2026) nor LocalScore's RTX 5090 accelerator page carries a DeepSeek-R1 14B row. For an architecture-equivalent reference point: LocalScore's RTX 5090 page measures the base Qwen2.5 14B Instruct (Q4_K - Medium) model — which DeepSeek-R1-Distill-Qwen-14B is fine-tuned directly from, with identical layer count, hidden size, and KV-head topology per the model's config.json — at 45.5 tok/s generation, 3678 tok/s prompt processing, 536 ms time-to-first-token, LocalScore 708. Token-generation throughput at a fixed quant is bandwidth-bound and architecture-bound, not weights-bound; the distill should land in the same neighborhood. Please contribute corroborating direct measurements via /contribute.
  • VRAM usage (Q4_K_M, this recipe's installed path): ~9 GB weights-resident at Q4_K_M — the same databasemart Ollama 0.5.7 benchmark cited for the RTX 4090 sibling recipe lists 9 GB for the 4090, and on a 32 GB 5090 the binding constraint has cleared by a wide margin. On-disk file size is 8.99 GB per bartowski's per-quant-tier table. With ~23 GB of headroom you can fully unlock KV cache or load a co-resident model (see Troubleshooting).
  • Quality notes: The model card reports AIME 2024 pass@1 = 69.7, AIME 2024 cons@64 = 80.0, MATH-500 pass@1 = 93.9, GPQA Diamond pass@1 = 59.1, LiveCodeBench pass@1 = 53.1, CodeForces rating = 1481 — strong math/reasoning benchmarks for a 14B-parameter open-weights model. Quality at Q4_K_M is degraded vs. the reference BF16, but on a 5090 you don't have to settle for Q4_K_M to fit — Q8_0 (15.7 GB) is near-lossless versus BF16 and fits comfortably with full 128K context room (see Installation step 3 and Troubleshooting).

For the full benchmark data, see /check/deepseek-r1-distill-qwen-14b/rtx-5090.

Troubleshooting

Spending the headroom — unlock the full 128K reasoning context

The R1-distill family emits explicit chain-of-thought wrapped in <think>...</think> before answering. Single-question <think> blocks routinely run 2K–4K tokens (and on hard math/code problems, much longer), so your effective KV cache pressure is 5–10× a plain Q&A model at the same context-window setting. On a 24 GB 3090/4090, that's why the sibling recipes cap practical context at 32K-with-Q8_0-KV or 64K-with-Q8_0-KV. On the 5090's 32 GB envelope the math is materially looser. The model's native context is 131,072 tokens per the model's config.json (max_position_embeddings: 131072). With 48 layers × 8 GQA KV heads × 128-dim × 2 (k,v), FP16 KV is ~0.19 MB per token. Derived envelopes:

ContextQ4_K_M weights + FP16 KVQ4_K_M weights + Q8_0 KVQ8_0 weights + FP16 KVQ8_0 weights + Q8_0 KV
32K~9 GB + ~6 GB = ~15 GB~9 GB + ~3 GB = ~12 GB~16 GB + ~6 GB = ~22 GB~16 GB + ~3 GB = ~19 GB
64K~9 GB + ~12 GB = ~21 GB~9 GB + ~6 GB = ~15 GB~16 GB + ~12 GB = ~28 GB~16 GB + ~6 GB = ~22 GB
128K~9 GB + ~24 GB = ~33 GB (tight, use Q8_0 KV)~9 GB + ~12 GB = ~21 GB~16 GB + ~24 GB = ~40 GB (use Q8_0 KV)~16 GB + ~12 GB = ~28 GB

These are derived envelopes (weight file sizes from bartowski's per-tier table; KV math from the model's config.json GQA shape) and don't account for activation memory and CUDA workspace (typically +1–2 GB). Practical recipe: at Q4_K_M, run 64K context with FP16 KV (default --cache-type-k/v f16) — leaves ~10 GB free. To push to 128K cleanly, switch to --cache-type-k q8_0 --cache-type-v q8_0:

llama-server -hf bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn \
  --n-gpu-layers -1

If you start seeing OOM mid-generation at 128K, lower --ctx-size first before downgrading the weights quant — the KV cache scales linearly with --ctx-size and almost always dominates the OOM picture for this model on this card.

Co-locating a second model in the spare ~23 GB

With Q4_K_M and default context, the 5090 has roughly 23 GB free after DeepSeek-R1-Distill-14B is resident. Concrete colocation options:

  • Llama 3.1 8B Q4_K_M (~4.9 GB) for fast routing / draft generation — pair the reasoning model with a faster non-reasoning model that handles trivial queries; combined footprint ~14 GB leaves room for a 64K context on the distill.
  • Whisper-large-v3 (~3 GB) for voice → reasoning pipelines — DeepSeek-R1 answers spoken questions with full chain-of-thought; combined ~12 GB still leaves over half the card free for KV.
  • Kokoro-82M (~1 GB) for spoken responses — round-trip the reasoning model's answer through TTS without a second GPU.

Each combination is sized from the cited Q4_K_M file footprints; verify combined VRAM behavior on first run since real-world overhead (activations, CUDA workspace, fragmentation) adds 1–2 GB per model.

Model produces empty <think> blocks or skips reasoning

Per the official model card: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with <think>\n at the beginning of every output." If your chat client/template strips the leading <think>\n, the model may bypass reasoning entirely. Ollama's built-in template handles this correctly; if you're using llama-cpp-python or transformers directly, set the assistant message prefix to <think>\n explicitly.

Adding a system prompt degrades responses

Same model card: "Avoid adding a system prompt; all instructions should be contained within the user prompt." This is unusual versus most chat-tuned models. If you're routing through a wrapper (LangChain, LiteLLM, etc.) that auto-injects a default system message, disable it for this model.

FlashAttention-2 may fail on Blackwell sm_120 in non-GGUF paths

Ollama and llama.cpp's --flash-attn flag run their own attention kernels and work cleanly on the 5090 with a CUDA 12.8 build. However, if you bypass GGUF and run the BF16 weights via raw transformers with attn_implementation="flash_attention_2", the Dao-AILab/flash-attention wheels may not ship sm_120 kernels yet — the canonical tracking issue is Dao-AILab/flash-attention#2168 ("[Blackwell/RTX 5090] CUDA error with flash-attention on RTX 5090 in WSL2") which is still open. On the transformers path, use attn_implementation="sdpa" (PyTorch's scaled dot-product attention) — which has full sm_120 coverage via cu128 wheels — as the always-works fallback. The GGUF/Ollama path documented in this recipe is unaffected.

License clarification

The model is released under the MIT License — commercial use, redistribution, and derivative works (including further distillation) are permitted. Note that the base Qwen2.5-14B it's distilled from is Apache 2.0; the distilled weights inherit MIT terms per the DeepSeek-R1 repository.