What You'll Build
A local Qwen3.5-27B assistant running on a single RTX 3090 (24 GB VRAM) through llama.cpp with a Q4_K GGUF quant (~17 GB on disk). Qwen3.5-27B is Alibaba's February-2026 native-multimodal model; its 27B-parameter language model uses a hybrid Gated DeltaNet + Gated Attention + sparse Mixture-of-Experts layout. At Q4_K it sits right at the edge of single-RTX-3090 territory: the quant fits with usable KV-cache headroom at short context, but peak VRAM already reaches the 24 GB ceiling, so context-window discipline matters.
Hardware data: RTX 3090 (24 GB VRAM) · Q4_K GGUF · 33.5 tok/s generation @ 4K context · See benchmark data
ℹ️ Native-multimodal model, run here as a text LLM. Per the Qwen3.5-27B model card, Qwen3.5 is a "Causal Language Model with Vision Encoder": it understands images and video in addition to text. This recipe documents the text-chat path via the main GGUF (image input needs the separate mmproj vision projector, covered briefly in Troubleshooting). Qwen3.5-27B is in our llm vertical because that path is what fits and benchmarks on a 24 GB consumer card.
⚠️ Disambiguation: Qwen3.5-27B, not 35B and not Qwen3. This recipe targets Qwen/Qwen3.5-27B (Qwen3.5, the 27B model). It is a distinct model from Qwen3.5-35B-A3B (a larger sparse-MoE sibling) and from the older Qwen3 family (no ".5"). The hardware figures below are for the 27B only.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 24 GB VRAM (Q4_K) | RTX 3090 (24 GB) |
| RAM | 32 GB system | — |
| Storage | ~17 GB for the Q4_K_M GGUF per unsloth/Qwen3.5-27B-GGUF | — |
| Driver | CUDA 12.x runtime (Ampere sm_86) | — |
| Runtime | llama.cpp / LM Studio | llama.cpp b9222+ |
The model is released under Apache 2.0 — commercial use permitted. Per the Qwen3.5-27B model card, the language model is 27B parameters across 64 layers (hidden dimension 5120) with a native context length of 262,144 tokens, extensible up to ~1,010,000 via rope scaling.
ℹ️ llama.cpp build matters. Qwen3.5 support landed recently in llama.cpp; bartowski's GGUFs were quantized with release b9222, and the optional MTP (multi-token-prediction) layers need release b9180 or newer. Use a recent build, not a year-old binary.
Installation
1. Install or build llama.cpp
# macOS (Homebrew) — quickest
brew install llama.cpp
# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" from:
# https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.
To build from source with CUDA support instead, per Unsloth's Qwen3.5 run guide:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
2. Get a Q4_K GGUF
The simplest path lets llama.cpp pull the file for you with -hf (shown in the Running section). To download explicitly instead, pick one Q4_K file from either mirror — both link back to the canonical Qwen/Qwen3.5-27B:
pip install -U "huggingface_hub[cli]"
# Option A — Unsloth (UD-Q4_K_XL = 17.6 GB on disk, or Q4_K_M = 16.7 GB)
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
--include "Qwen3.5-27B-Q4_K_M.gguf" --local-dir ./Qwen3.5-27B-GGUF
# Option B — bartowski imatrix (Q4_K_M = 17.98 GB, Q4_K_S = 16.93 GB)
huggingface-cli download bartowski/Qwen_Qwen3.5-27B-GGUF \
--include "Qwen_Qwen3.5-27B-Q4_K_M.gguf" --local-dir ./Qwen3.5-27B-GGUF
Per the per-tier file-size tables on unsloth/Qwen3.5-27B-GGUF and bartowski/Qwen_Qwen3.5-27B-GGUF, the Q4_K tier lands at ~16.7–18 GB on disk, which makes it the largest quant that leaves room for a KV cache on a 24 GB card. Q5_K_M (~19.6–21 GB) and Q6_K (~22.5–24 GB) are too tight alongside any context; IQ4_XS (~15.8 GB) trims a bit more if you want extra headroom (see "Generation slows or OOMs past 8K–16K context" under Troubleshooting).
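The quant choice above is a simple budget check, which can be sketched in a few lines of Python. This is an illustration, not a sizing tool: the file sizes are the ones quoted in this section (low end of each range), while the function name and the 6 GB reserve for KV cache, activations, and CUDA overhead are assumptions made for the example.

```python
# File sizes in GB on disk, as quoted in this section (low end of each range).
QUANT_SIZES_GB = {
    "IQ4_XS": 15.8,
    "Q4_K_M": 16.7,
    "UD-Q4_K_XL": 17.6,
    "Q5_K_M": 19.6,
    "Q6_K": 22.5,
}


def largest_quant_that_fits(vram_gb, reserve_gb, sizes=QUANT_SIZES_GB):
    """Return the largest quant whose file fits in vram_gb - reserve_gb, or None."""
    budget = vram_gb - reserve_gb
    fitting = {name: size for name, size in sizes.items() if size <= budget}
    if not fitting:
        return None
    # Largest file that still fits is treated as the highest-quality option.
    return max(fitting, key=fitting.get)


# reserve_gb=6.0 is an ASSUMED allowance for KV cache, activations and CUDA
# overhead; it is not a measured figure from the benchmark page.
choice = largest_quant_that_fits(vram_gb=24.0, reserve_gb=6.0)
print(choice)
```

Raising the reserve (for a longer context) pushes the pick down toward IQ4_XS; lowering it only helps if the real runtime overhead on your card is smaller than assumed.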
Running
Quick start — let llama.cpp fetch the weights
Per Unsloth's Qwen3.5 run guide, the -hf flag downloads and caches the GGUF on first run. Thinking mode (the default), general tasks:
export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00
For precise coding/WebDev tasks, Unsloth recommends dropping the temperature to 0.6 (same top-p/top-k/min-p).
OpenAI-compatible HTTP server
Non-thinking (instruct) mode, general tasks — the canonical llama-server invocation from Unsloth's run guide:
export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
./llama.cpp/llama-server \
-hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--chat-template-kwargs '{"enable_thinking":false}'
llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint (default port 8080). On a 24 GB card you should add --ctx-size 8192 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn --n-gpu-layers 99 to cap context and quantize the KV cache — the Q4_K envelope is full at the 24 GB ceiling, so this is what keeps you off the OOM line (see Troubleshooting).
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Explain Mixture-of-Experts in three sentences."}],
"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0
}'
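The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above; the function names are hypothetical, and the reply parsing assumes the standard OpenAI chat-completions response shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Default llama-server address from the text above; change it if you pass
# a different host or port to llama-server.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        # Non-thinking, general-task sampling values from the curl example.
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url=BASE_URL, timeout=300):
    """Send the request to a running llama-server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the standard OpenAI chat-completions response shape.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain Mixture-of-Experts in three sentences."))` should print the model's answer. The generous timeout is deliberate: at roughly 33 tok/s a long reply takes tens of seconds.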
Sampling parameters
Per the Qwen3.5-27B model card Best Practices: thinking mode, general tasks uses temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5; thinking mode, precise coding uses temperature=0.6; non-thinking (instruct) mode, general tasks uses temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5. The card notes you can adjust presence_penalty between 0 and 2 to reduce endless repetitions: "However, using a higher value may occasionally result in language mixing and a slight decrease in model performance."
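If you call the server from code, it can help to keep these presets in one place. The sketch below transcribes the values from the paragraph above; the dictionary layout and the `sampling_params` helper are hypothetical. The source states only the temperature for the thinking/coding preset, so that entry reuses the general top_p/top_k/min_p (as Unsloth's note suggests) and leaves `presence_penalty` unset rather than guessing it.

```python
# Values transcribed from the paragraph above (model card Best Practices).
SAMPLING_PRESETS = {
    ("thinking", "general"): {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5,
    },
    # Only temperature=0.6 is stated for this preset; top_p/top_k/min_p are
    # carried over from the general preset, presence_penalty is left unset.
    ("thinking", "coding"): {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
    },
    ("non-thinking", "general"): {
        "temperature": 0.7, "top_p": 0.8, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5,
    },
}


def sampling_params(mode, task="general", presence_penalty=None):
    """Return a copy of the preset, optionally overriding presence_penalty."""
    key = (mode, task)
    if key not in SAMPLING_PRESETS:
        raise ValueError(f"no preset for mode={mode!r}, task={task!r}")
    params = dict(SAMPLING_PRESETS[key])
    if presence_penalty is not None:
        # The model card describes an adjustable range of 0 to 2.
        if not 0.0 <= presence_penalty <= 2.0:
            raise ValueError("presence_penalty must be between 0 and 2")
        params["presence_penalty"] = presence_penalty
    return params
```

The returned dictionary can be merged straight into a chat-completions request body. Returning a copy keeps a per-request override from leaking into the shared presets.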
Results
- Speed: Per Hardware Corner's RTX 3090 LLM benchmark page, Qwen3.5 27B at Q4_K records 33.5 tokens/s generation at 4K context (32.3 t/s @ 16K, 31.0 t/s @ 32K), with prompt-processing prefill at 1,104.2 tokens/s @ 4K (977.4 @ 16K, 848.2 @ 32K). Confirmed by /check/qwen3-5-27b/rtx-3090 (Hardware Corner-sourced, last verified 2026-05-15). The 33.5 t/s figure is end-user token generation; prefill (prompt ingestion) is the much faster 1,104.2 t/s number, so don't conflate the two.
- VRAM usage: 24 GB peak at Q4_K on the 24 GB card per the backend benchmark at /check/qwen3-5-27b/rtx-3090, i.e. the full envelope is in use at 4K context. The Q4_K_M GGUF is ~16.7 GB on disk (unsloth tier table); the remaining headroom holds the KV cache and activations. Unsloth's hardware table lists the 27B model at 4-bit needing 17 GB of memory.
- Quality notes: Q4_K is the quality ceiling you can run on a single 24 GB card, because Q5_K_M (~19.6–21 GB) and Q6_K (~22.5–24 GB) leave no room for KV cache. If you want more quality you need a larger GPU or a multi-GPU setup. bartowski's GGUFs are imatrix-calibrated; Unsloth's UD-Q4_K_XL (17.6 GB) applies per-layer dynamic bit-allocation per the Unsloth Dynamic 2.0 docs.
For the full benchmark data and other-GPU comparisons, see /check/qwen3-5-27b/rtx-3090.
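To see why the prefill and generation rates should not be conflated, it helps to turn them into wall-clock time for one request. The sketch below uses the Hardware Corner figures quoted under Speed. The function name is hypothetical, and treating each rate as constant over the whole request is a simplifying assumption, so read the result as a rough estimate.

```python
# context label -> (prefill tok/s, generation tok/s), from the Speed figures.
BENCH = {
    "4K": (1104.2, 33.5),
    "16K": (977.4, 32.3),
    "32K": (848.2, 31.0),
}


def estimate_seconds(prompt_tokens, output_tokens, context="4K"):
    """Return (prefill_seconds, generation_seconds) for one request.

    Assumes both rates stay constant for the whole request, which is a
    simplification of how real throughput behaves.
    """
    prefill_tps, generation_tps = BENCH[context]
    prefill_s = prompt_tokens / prefill_tps
    generation_s = output_tokens / generation_tps
    return prefill_s, generation_s


prefill_s, generation_s = estimate_seconds(3000, 500, context="4K")
print(f"prefill {prefill_s:.1f} s, generation {generation_s:.1f} s")
```

For a 3,000-token prompt and a 500-token reply, generation dominates: the reply takes several times longer to produce than the prompt takes to ingest, even though the prompt is six times larger.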
Troubleshooting
Ollama doesn't load these GGUFs
This is a Qwen3.5-specific gotcha. Per Unsloth's Qwen3.5 run guide: "Currently no Qwen3.5 GGUF works in Ollama due to separate mmproj vision files." Use llama.cpp (or an LM Studio build with current llama.cpp) instead — that is why this recipe pins llama.cpp rather than the usual one-line ollama run.
Generation slows or OOMs past 8K–16K context
The Q4_K footprint already pushes peak VRAM to the 24 GB ceiling at 4K per /check/qwen3-5-27b/rtx-3090, so the KV cache is the binding constraint as context grows. KV-cache discipline ladder, in order of how much it helps:
- Cap --ctx-size at 8192 to start, then raise it only as far as it fits.
- Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn (halves KV memory at minor quality cost); --cache-type-k q4_0 cuts it further.
- Drop to a smaller quant: IQ4_XS (~15.8 GB on disk per the unsloth tier table) frees ~1 GB for KV cache versus Q4_K_M.
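The effect of the first two rungs can be estimated with the usual KV-cache formula for standard attention layers: context length × layers × KV heads × head dimension × bytes for K plus V. The sketch below is a rough illustration under stated assumptions. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.5-27B's real attention configuration, and in a hybrid layout presumably only the full-attention layers contribute a cache that grows with context. The bytes-per-element values follow the ggml block formats (q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes).

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(ctx, n_attn_layers, n_kv_heads, head_dim,
                 type_k="f16", type_v="f16"):
    """Estimate KV-cache size in GiB for standard attention layers."""
    elems_per_token = n_attn_layers * n_kv_heads * head_dim
    bytes_per_token = elems_per_token * (
        BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v]
    )
    return ctx * bytes_per_token / 2**30


# HYPOTHETICAL shape for illustration only; NOT the real Qwen3.5-27B config.
SHAPE = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

for ctx in (8192, 32768):
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gib(ctx, type_k=cache_type, type_v=cache_type, **SHAPE)
        print(f"ctx={ctx:>6} cache={cache_type:<5} ~{size:.2f} GiB")
```

Whatever the real shape turns out to be, the ratios hold: q8_0 needs about 53% of the f16 cache and q4_0 about 28%, and the cache grows linearly with --ctx-size. That is why capping context comes first on the ladder.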
Want image input (vision)?
Qwen3.5-27B is multimodal, but the GGUF text path above is text-only. To feed images, download the vision projector (mmproj-F16.gguf, ~0.93 GB, on unsloth/Qwen3.5-27B-GGUF) and run the multimodal CLI, e.g. llama-mtmd-cli --model <q4_k.gguf> --mmproj mmproj-F16.gguf .... The vision encoder adds VRAM on top of the text envelope — on a full-at-24 GB Q4_K text load you will likely need to lower --ctx-size to make room.
Ampere (RTX 3090) vs Ada/Hopper — anything special?
No. The RTX 3090 is Ampere (sm_86), which is fully supported by mainline CUDA and by llama.cpp's CUDA backend, including the flash-attention kernels that --flash-attn enables (these ship inside llama.cpp, so no separate FlashAttention package is needed). The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special wheel selection or attention-implementation override is required. Ampere has no FP8 tensor cores, but this recipe ships only Q4_K GGUF weights, so that limitation never bites. There is also a Qwen/Qwen3.5-27B-FP8 mirror; it is intended for newer-architecture vLLM/SGLang serving, not single-3090 llama.cpp, so stick with the Q4_K GGUF here.
Measured something different on your card?
Speed and VRAM vary with quant, context length, and llama.cpp build. If your numbers differ, contribute them via the submission form — community measurements are what keep /check/qwen3-5-27b/rtx-3090 accurate for the next person.