
Qwen3.5 27B on RTX 5090: Q4_K GGUF local chat via llama.cpp

multimodal · intermediate · 20 GB+ VRAM · Jun 27, 2026

This intermediate recipe sets up Qwen3.5 27B on the RTX 5090, needing about 20 GB of VRAM.

prerequisites
  • NVIDIA RTX 5090 (32 GB VRAM) or equivalent — the Q4_K envelope fits comfortably with large KV headroom
  • Recent NVIDIA driver with CUDA 12.8+ (Blackwell sm_120 — see the build note below)
  • ~17 GB free disk for the Q4_K_M GGUF
  • llama.cpp (recent build, b9222+) or LM Studio installed
  • Python 3.10+ (only for the optional huggingface-cli download step)

What You'll Build

A local Qwen3.5 27B assistant running on a single RTX 5090 (32 GB VRAM) through llama.cpp with a Q4_K GGUF quant (~17 GB on disk). Qwen3.5 27B is Alibaba's dense 27B-parameter model. At Q4_K the weights occupy roughly half of the 5090's 32 GB, so unlike a 24 GB card this pairing leaves about 15 GB for the KV cache and long contexts. The binding constraint is no longer "does it fit" but "how much context can you afford", and this recipe runs a 32K context without quantizing the KV cache.

Hardware data: RTX 5090 (32 GB VRAM) · Q4_K GGUF · 58.8 tok/s generation @ 4K context · See benchmark data

ℹ️ Multimodal model, run here as a text LLM. Per the Qwen3.5 27B model card, Qwen3.5 27B ships with a vision projector (mmproj) for image input, which is why it sits in our multimodal vertical. This recipe documents the text-chat path via the main GGUF, and that path is what the RTX 5090 benchmark figures measure. Image input needs the separate mmproj file and is covered briefly in Troubleshooting.

⚠️ Disambiguation: Qwen3.5 27B, not 35B and not Qwen3. This recipe targets Qwen/Qwen3.5-27B, the dense 27B model in the Qwen3.5 family. It is distinct from any 35B variant and from the older Qwen3 family (no ".5"). The hardware figures below apply to the 27B only.

Requirements

Component | Minimum | Tested
GPU | 20 GB VRAM (Q4_K + KV) | RTX 5090 (32 GB)
RAM | 32 GB system |
Storage | ~17 GB for the Q4_K_M GGUF per unsloth/Qwen3.5-27B-GGUF |
Driver | CUDA 12.8+ runtime (Blackwell sm_120) |
Runtime | llama.cpp / LM Studio | llama.cpp b9222+

The model is released under Apache 2.0 — commercial use permitted. The Q4_K_M GGUF is ~16.7 GB on disk per the unsloth tier table, so the weights take up roughly half of the 5090's 32 GB and leave ~15 GB for the KV cache, activations, and a long context window.

⚠️ Blackwell (sm_120) build note. The RTX 5090 is Blackwell — compute capability sm_120. You need a CUDA 12.8+ toolkit and a llama.cpp / driver stack that targets sm_120. The pre-built CUDA-12.8 release binaries and Ollama's bundled cu12.8 runtime both cover this; if you build llama.cpp from source, pass -DCMAKE_CUDA_ARCHITECTURES=120. Do not pip install flash-attn — FlashAttention-2 wheels do not yet ship sm_120 kernels. Use llama.cpp's own --flash-attn (-fa) flag instead, which has native Blackwell support (details in Installation and Troubleshooting).

ℹ️ llama.cpp build matters. Qwen3.5 support landed recently in llama.cpp; bartowski's GGUFs were quantized with release b9222. Use a recent build, not a year-old binary.

Installation

1. Install or build llama.cpp (with sm_120 support)

# Linux — pre-built CUDA 12.8 binary (covers Blackwell sm_120)
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.8-x64.zip" from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support targeting Blackwell instead, add -DCMAKE_CUDA_ARCHITECTURES=120 so the kernels are compiled for sm_120:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

The -DCMAKE_CUDA_ARCHITECTURES=120 flag requires a CUDA 12.8+ toolkit — older toolkits do not know the sm_120 target and the build will fail. Do not add pip install flash-attn to this flow; FlashAttention-2 has no sm_120 kernels and llama.cpp's own --flash-attn already covers Blackwell.

2. Get a Q4_K GGUF

The simplest path lets llama.cpp pull the file for you with -hf (shown in the Running section). To download explicitly instead, pick one Q4_K file from either mirror — both link back to the canonical Qwen/Qwen3.5-27B:

pip install -U "huggingface_hub[cli]"

# Option A — Unsloth (UD-Q4_K_XL = 17.62 GB on disk, or Q4_K_M = 16.74 GB)
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  --include "Qwen3.5-27B-Q4_K_M.gguf" --local-dir ./Qwen3.5-27B-GGUF

# Option B — bartowski imatrix (Q4_K_M = 17.98 GB, Q4_K_S = 16.93 GB)
huggingface-cli download bartowski/Qwen_Qwen3.5-27B-GGUF \
  --include "Qwen_Qwen3.5-27B-Q4_K_M.gguf" --local-dir ./Qwen3.5-27B-GGUF

Per the per-tier file-size tables on unsloth/Qwen3.5-27B-GGUF and bartowski/Qwen_Qwen3.5-27B-GGUF, the Q4_K tier lands at ~16.7–18 GB on disk. On a 32 GB card you have headroom to go higher if you want more quality: Q5_K_M is ~19.6 GB and Q6_K is ~22.5 GB on the unsloth mirror, and both still leave room for a usable KV cache on the 5090 — see "Going bigger than Q4_K" in Troubleshooting.
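A pre-download disk check can save a failed multi-gigabyte transfer. The sketch below is illustrative only: the helper names has_room and check_download_dir, the 1 GB safety margin, and the assumption that the quoted file sizes are decimal gigabytes are all mine, not something the mirrors document.

```python
import shutil

GB = 1000 ** 3  # assumption: the file sizes quoted above are decimal gigabytes


def has_room(free_bytes: int, file_gb: float, margin_gb: float = 1.0) -> bool:
    """Return True if free_bytes covers the file plus a safety margin (margin is arbitrary)."""
    return free_bytes >= (file_gb + margin_gb) * GB


def check_download_dir(path: str = ".", file_gb: float = 17.98) -> bool:
    """Check the filesystem holding `path`; 17.98 GB is the largest Q4_K file listed above."""
    free = shutil.disk_usage(path).free
    print(f"{free / GB:.1f} GB free at {path!r}; need about {file_gb:.2f} GB")
    return has_room(free, file_gb)
```

Calling check_download_dir("./Qwen3.5-27B-GGUF") after creating that directory reports the free space on the volume the huggingface-cli commands above write to.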

Running

Quick start — let llama.cpp fetch the weights

The -hf flag downloads and caches the GGUF on first run. Thinking mode (the default), general tasks:

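# LLAMA_CACHE sets the directory where -hf stores downloads (a relative path here)
export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
# Recent llama.cpp builds expect a value after --flash-attn: on, off, or auto
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --flash-attn on \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00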

--flash-attn (-fa) enables llama.cpp's native flash-attention path, which has Blackwell sm_120 kernels. This is the correct way to get flash attention on the 5090, not a flash-attn pip wheel. Recent builds expect a value after the flag (on, off, or auto), so write --flash-attn on if a bare --flash-attn is rejected. --n-gpu-layers 99 offloads all layers to the GPU, which the 32 GB card handles comfortably for the Q4_K weights.

OpenAI-compatible HTTP server

Non-thinking (instruct) mode, general tasks:

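# LLAMA_CACHE sets the directory where -hf stores downloads (a relative path here)
export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
# Recent llama.cpp builds expect a value after --flash-attn: on, off, or auto
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --flash-attn on \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.00 \
    --chat-template-kwargs '{"enable_thinking":false}'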

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint (default port 8080). Because the Q4_K weights leave ~15 GB of headroom on the 5090, you can run a generous --ctx-size (32768 shown above) without quantizing the KV cache — a luxury the 24 GB cards don't have. Push it further if your workload needs it.
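To reason about how far --ctx-size can go, it helps to estimate the KV cache size. The sketch below uses the textbook formula for a transformer whose every layer keeps a full f16 key/value cache. The layer count, KV-head count, and head dimension in PLACEHOLDER are hypothetical values chosen for illustration and are not Qwen3.5 27B's real dimensions; read the real ones from the model metadata llama.cpp prints at load time. If the model's attention layout differs from this assumption, the estimate does not carry over directly.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_elem=2):
    """Estimate KV cache size, assuming every layer caches full K and V tensors.

    bytes_per_elem=2 corresponds to an unquantized f16 cache.
    """
    # The leading 2 covers keys and values; each holds
    # n_kv_heads * head_dim elements per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_elem


# HYPOTHETICAL dimensions for illustration only, not taken from the model card.
PLACEHOLDER = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 65536):
    size_gb = kv_cache_bytes(ctx_size=ctx, **PLACEHOLDER) / 1e9
    print(f"ctx={ctx:>6}: ~{size_gb:.2f} GB KV cache (placeholder dims)")
```

The cache grows linearly with context, so doubling --ctx-size doubles this term while the weight footprint stays fixed.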

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain Mixture-of-Experts in three sentences."}],
    "temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0
  }'
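The same request can be sent from Python with only the standard library. This is a minimal sketch assuming the server from the command above is listening on localhost:8080 and returns the usual OpenAI-style response shape (choices[0].message.content). The function names build_request and chat are illustrative, not part of llama.cpp.

```python
import json
import urllib.request


def build_request(prompt, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0):
    """Build the JSON payload; defaults mirror the non-thinking settings above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "min_p": min_p,
    }


def chat(prompt, base_url="http://localhost:8080", timeout=300):
    """POST one user message to /v1/chat/completions and return the reply text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    # Assumes the OpenAI-compatible response layout.
    return reply["choices"][0]["message"]["content"]
```

With the server running, print(chat("Explain Mixture-of-Experts in three sentences.")) issues the same request as the curl command.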

Sampling parameters

Per the Qwen3.5 27B model card Best Practices:
  • Thinking mode, general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0
  • Thinking mode, precise coding: temperature=0.6
  • Non-thinking (instruct) mode, general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0
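If you switch between these modes often, keeping the values in one place avoids copy-paste drift. The sketch below stores the settings quoted above and turns them into llama.cpp flags. The preset names are labels invented for this example, and the coding preset carries only the temperature because that is the only value quoted for it.

```python
# Keys are the llama.cpp flag names without the leading "--".
PRESETS = {
    "thinking-general": {"temp": 1.0, "top-p": 0.95, "top-k": 20, "min-p": 0.0},
    "thinking-coding": {"temp": 0.6},  # only the temperature is quoted above
    "instruct-general": {"temp": 0.7, "top-p": 0.8, "top-k": 20, "min-p": 0.0},
}


def to_cli_flags(name):
    """Return the preset as a flat list of llama.cpp command-line arguments."""
    flags = []
    for key, value in PRESETS[name].items():
        flags += [f"--{key}", str(value)]
    return flags


print(" ".join(to_cli_flags("instruct-general")))
# prints: --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0
```

The returned list can be appended to a subprocess argument list that launches llama-cli or llama-server.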

Results

  • Speed: Per Hardware Corner's RTX 5090 LLM benchmark page, Qwen3.5 27B at Q4_K generates 58.8 tokens/s at 4K context, and prompt-processing prefill runs at 3,004.1 tokens/s at 4K. Both figures are confirmed by /check/qwen3-5-27b/rtx-5090 (Hardware Corner-sourced, last verified 2026-05-15). Generation is the rate you see while the model writes its reply; prefill is the much faster prompt-ingestion rate, so don't conflate the two. The generation rate is roughly 1.75× the RTX 3090's 33.5 t/s on the same quant, as you'd expect from Blackwell's higher memory bandwidth.
  • VRAM usage: The Q4_K_M GGUF is ~16.74 GB on disk (unsloth tier table); with the KV cache and activations on top, plan on ~20 GB resident for a moderate context, leaving comfortable headroom inside the 5090's 32 GB. See the live data at /check/qwen3-5-27b/rtx-5090.
  • Quality notes: On a 32 GB card, Q4_K is no longer the quality ceiling the way it is on a 24 GB card — you have room to step up to Q5_K_M (~19.6 GB) or Q6_K (~22.5 GB) and still fit a usable context (see Troubleshooting). bartowski's GGUFs are imatrix-calibrated; Unsloth's UD-Q4_K_XL (17.62 GB) applies per-layer dynamic bit-allocation per the Unsloth Dynamic 2.0 docs.
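To see how the two rates combine, the sketch below turns the quoted figures into a rough response-time estimate. It assumes both rates stay constant, which is an idealization: the figures were measured at 4K context, and throughput generally changes as context grows. The function name response_seconds is illustrative.

```python
GEN_TPS = 58.8          # generation, tokens/s at 4K context (quoted above)
PREFILL_TPS = 3004.1    # prompt processing, tokens/s at 4K context (quoted above)
RTX3090_GEN_TPS = 33.5  # RTX 3090 generation rate on the same quant (quoted above)


def response_seconds(prompt_tokens, output_tokens):
    """Rough wall-clock estimate: prefill time plus generation time."""
    return prompt_tokens / PREFILL_TPS + output_tokens / GEN_TPS


speedup = GEN_TPS / RTX3090_GEN_TPS
print(f"5090 vs 3090 generation speedup: {speedup:.3f}x")
print(f"4000-token prompt, 500-token reply: ~{response_seconds(4000, 500):.1f} s")
```

For a 4,000-token prompt and a 500-token reply, generation accounts for most of the wait, which is why the 58.8 t/s figure is the one that shapes the interactive feel.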

For the full benchmark data and other-GPU comparisons, see /check/qwen3-5-27b/rtx-5090.

Troubleshooting

CUDA error / no kernel image on first inference (Blackwell sm_120)

The RTX 5090 is Blackwell (sm_120). A llama.cpp binary built for older architectures, or against a CUDA toolkit earlier than 12.8, can throw no kernel image is available for execution on the device at the first GPU call. Fix: use a CUDA-12.8 release binary, or rebuild from source with -DCMAKE_CUDA_ARCHITECTURES=120 on a CUDA 12.8+ toolkit (see Installation). Do not try to "fix" attention by installing flash-attn from pip — FlashAttention-2 has no sm_120 kernels yet; use llama.cpp's --flash-attn flag, which ships native Blackwell support.
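Before rebuilding, it can help to confirm what compute capability the driver reports. The sketch below shells out to nvidia-smi with its name and compute_cap query fields and parses the CSV rows. The helper names are illustrative, and the sample row format in the comment ("name, 12.0") is an assumption about how your driver prints it.

```python
import subprocess


def parse_gpu_line(line):
    """Split one 'name, compute_cap' CSV row into (name, (major, minor)).

    Assumed row format, e.g.: "NVIDIA GeForce RTX 5090, 12.0"
    """
    name, cap = (part.strip() for part in line.rsplit(",", 1))
    major, minor = (int(x) for x in cap.split("."))
    return name, (major, minor)


def needs_sm120(cap):
    """True when the GPU reports compute capability 12.0 (sm_120)."""
    return cap == (12, 0)


def query_gpus():
    """Ask nvidia-smi for every GPU's name and compute capability."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Running query_gpus() on the target machine and passing each capability to needs_sm120 tells you whether the sm_120 build guidance above applies to that card.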

Going bigger than Q4_K

With 32 GB you are not stuck at Q4_K. Per the unsloth tier table, Q5_K_M is ~19.61 GB and Q6_K is ~22.45 GB on disk — both fit the 5090 with room for a KV cache. Swap the -hf tag (e.g. :Q5_K_M or :Q6_K) to trade a little speed for higher fidelity. The benchmark figures above are for the Q4_K row specifically; a heavier quant will generate somewhat slower.
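The trade-off is easier to see as arithmetic. The sketch below subtracts each quoted file size from the card's 32 GB. It treats the on-disk size as the resident weight size and ignores driver, display, and runtime buffer overhead, so read the results as upper bounds on what is left for the KV cache.

```python
VRAM_GB = 32.0

# On-disk sizes quoted in this recipe for the unsloth mirror.
QUANT_SIZES_GB = {
    "Q4_K_M": 16.74,
    "UD-Q4_K_XL": 17.62,
    "Q5_K_M": 19.61,
    "Q6_K": 22.45,
}


def headroom_gb(quant, vram_gb=VRAM_GB):
    """VRAM left after loading the weights, ignoring all other overhead."""
    return round(vram_gb - QUANT_SIZES_GB[quant], 2)


for quant, size in QUANT_SIZES_GB.items():
    print(f"{quant:<11} {size:>6.2f} GB weights -> {headroom_gb(quant):>5.2f} GB left")
```

Stepping from Q4_K_M to Q6_K gives up a bit under 6 GB of that headroom, which comes straight out of the context budget.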

Want image input (vision)?

Qwen3.5 27B is multimodal, but the commands above load only the main GGUF, so they handle text only. To feed images, download the vision projector (mmproj-F16.gguf, ~0.93 GB, on unsloth/Qwen3.5-27B-GGUF) and run the multimodal CLI, e.g. llama-mtmd-cli --model <q4_k.gguf> --mmproj mmproj-F16.gguf .... The vision encoder adds VRAM on top of the text envelope. On the 5090 you have the headroom for it, unlike the 24 GB cards.

Measured something different on your card?

Speed and VRAM vary with quant, context length, and llama.cpp build. If your numbers differ, contribute them via the submission form — community measurements are what keep /check/qwen3-5-27b/rtx-5090 accurate for the next person.

common questions
How much VRAM does Qwen3.5 27B need?

About 20 GB — the minimum this recipe targets.

Which GPUs is Qwen3.5 27B tested on?

RTX 5090 (32 GB).

How hard is this setup?

Intermediate — follow the steps above.