
Qwen3.5-27B on RTX 3090: Q4_K GGUF local chat via llama.cpp

llm · intermediate · 24GB+ VRAM · Jun 27, 2026

This intermediate recipe sets up Qwen3.5 27B on the RTX 3090, needing about 24 GB of VRAM.

prerequisites
  • NVIDIA RTX 3090 (24 GB VRAM) — required, not optional (the Q4_K envelope is full at 4K context)
  • Recent NVIDIA driver with CUDA 12.x support (Ampere sm_86 — default pre-built llama.cpp CUDA binaries work)
  • ~17 GB free disk for the Q4_K_M GGUF
  • llama.cpp (recent build, b9222+) or LM Studio installed
  • Python 3.10+ (only for the optional huggingface-cli download step)

What You'll Build

A local Qwen3.5-27B assistant running on a single RTX 3090 (24 GB VRAM) through llama.cpp with a Q4_K GGUF quant (~17 GB on disk). Qwen3.5-27B is Alibaba's February-2026 native-multimodal model; its 27B-parameter language model uses a hybrid Gated DeltaNet + Gated Attention layout (dense, unlike the sparse-MoE Qwen3.5-35B-A3B sibling). At Q4_K it sits right at the edge of what a single RTX 3090 can hold: the ~17 GB of weights leave a few GB for the KV cache at short context, but peak usage reaches the 24 GB ceiling at 4K context, so context-window discipline matters.

Hardware data: RTX 3090 (24 GB VRAM) · Q4_K GGUF · 33.5 tok/s generation @ 4K context · See benchmark data

ℹ️ Native-multimodal model, run here as a text LLM. Per the Qwen3.5-27B model card, Qwen3.5 is a "Causal Language Model with Vision Encoder" — it understands images and video in addition to text. This recipe documents the text-chat path via the main GGUF (image input needs the separate mmproj vision projector, covered briefly in Troubleshooting). Qwen3.5-27B is in our llm vertical because that path is what fits and benchmarks on a 24 GB consumer card.

⚠️ Disambiguation — Qwen3.5-27B, not 35B and not Qwen3. This recipe targets Qwen/Qwen3.5-27B (Qwen3.5, the 27B model). It is a distinct model from Qwen3.5-35B-A3B (a larger sparse-MoE sibling) and from the older Qwen3 family (no ".5"). The hardware figures below are for the 27B only.

Requirements

Component | Minimum                                                  | Tested
GPU       | 24 GB VRAM (Q4_K)                                        | RTX 3090 (24 GB)
RAM       | 32 GB system                                             |
Storage   | ~17 GB for the Q4_K_M GGUF per unsloth/Qwen3.5-27B-GGUF  |
Driver    | CUDA 12.x runtime (Ampere sm_86)                         |
Runtime   | llama.cpp / LM Studio                                    | llama.cpp b9222+

The model is released under Apache 2.0 — commercial use permitted. Per the Qwen3.5-27B model card, the language model is 27B parameters across 64 layers (hidden dimension 5120) with a native context length of 262,144 tokens, extensible up to ~1,010,000 via rope scaling.

ℹ️ llama.cpp build matters. Qwen3.5 support landed recently in llama.cpp; bartowski's GGUFs were quantized with release b9222, and the optional MTP (multi-token-prediction) layers need release b9180 or newer. Use a recent build, not a year-old binary.

Installation

1. Install or build llama.cpp

# macOS (Homebrew): quickest install, but a Mac cannot drive an RTX 3090, so use this only to try the CLI
brew install llama.cpp

# Linux — pre-built CUDA binary
# Download the latest "llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip" from:
#   https://github.com/ggml-org/llama.cpp/releases
# Extract and add the bin/ directory to PATH.

To build from source with CUDA support instead, per Unsloth's Qwen3.5 run guide:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
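
Before pulling ~17 GB of weights, it can help to confirm what the driver actually reports. The following is a minimal preflight sketch and not part of the Unsloth guide: the vram_ok helper and the MIN_VRAM_MIB threshold are hypothetical names and values chosen for illustration, and it assumes nvidia-smi is on PATH.

```shell
# Preflight sketch: check that the first GPU reports enough VRAM for the Q4_K recipe.
# MIN_VRAM_MIB is an illustrative threshold, not a figure from the recipe's sources.
MIN_VRAM_MIB=24000

# vram_ok <total_mib>: succeeds when the reported total meets the threshold.
vram_ok() {
    [ "$1" -ge "$MIN_VRAM_MIB" ] 2>/dev/null
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # memory.total is reported in MiB; only the first GPU is inspected.
    total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    if vram_ok "$total_mib"; then
        echo "GPU reports ${total_mib} MiB: OK for the Q4_K recipe"
    else
        echo "GPU reports '${total_mib}' MiB: below ${MIN_VRAM_MIB} MiB, expect OOM"
    fi
else
    echo "nvidia-smi not found: install the NVIDIA driver first"
fi
```

An RTX 3090 normally reports 24576 MiB, so the threshold leaves a little slack for drivers that round differently.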

2. Get a Q4_K GGUF

The simplest path lets llama.cpp pull the file for you with -hf (shown in the Running section). To download explicitly instead, pick one Q4_K file from either mirror — both link back to the canonical Qwen/Qwen3.5-27B:

pip install -U "huggingface_hub[cli]"

# Option A — Unsloth (UD-Q4_K_XL = 17.6 GB on disk, or Q4_K_M = 16.7 GB)
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  --include "Qwen3.5-27B-Q4_K_M.gguf" --local-dir ./Qwen3.5-27B-GGUF

# Option B — bartowski imatrix (Q4_K_M = 17.98 GB, Q4_K_S = 16.93 GB)
huggingface-cli download bartowski/Qwen_Qwen3.5-27B-GGUF \
  --include "Qwen_Qwen3.5-27B-Q4_K_M.gguf" --local-dir ./Qwen3.5-27B-GGUF

Per the per-tier file-size tables on unsloth/Qwen3.5-27B-GGUF and bartowski/Qwen_Qwen3.5-27B-GGUF, the Q4_K tier lands at ~16.7–18 GB on disk, making it the largest quant that leaves room for a KV cache on a 24 GB card. Q5_K_M (~19.6–21 GB) and Q6_K (~22.5–24 GB) are too tight alongside any context; IQ4_XS (~15.8 GB) frees roughly another 1 GB if you want extra headroom (see the KV-cache ladder under Troubleshooting).
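
The headroom argument is simple subtraction, sketched below as a rough estimate. It assumes the weights occupy about as much VRAM as the file does on disk and treats the card as a flat 24 GB, ignoring the GB/GiB difference and CUDA runtime overhead. headroom_gb is a hypothetical helper name.

```shell
# headroom_gb <vram_gb> <gguf_gb>: VRAM left for KV cache, activations, and overhead.
headroom_gb() {
    awk -v vram="$1" -v file="$2" 'BEGIN { printf "%.1f\n", vram - file }'
}

# File sizes are the on-disk figures quoted above; 24 is the RTX 3090's VRAM in GB.
for entry in "IQ4_XS:15.8" "Q4_K_M(unsloth):16.7" "UD-Q4_K_XL:17.6" "Q4_K_M(bartowski):17.98"; do
    name=${entry%%:*}   # text before the colon
    size=${entry##*:}   # text after the colon
    echo "$name leaves about $(headroom_gb 24 "$size") GB"
done
```

Under these assumptions every Q4-class file leaves roughly 6 to 8 GB, which is why the KV cache, not the weights, decides how far the context can grow.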

Running

Quick start — let llama.cpp fetch the weights

Per Unsloth's Qwen3.5 run guide, the -hf flag downloads and caches the GGUF on first run. Thinking mode (the default), general tasks:

export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00

For precise coding/WebDev tasks, Unsloth recommends dropping the temperature to 0.6 (same top-p/top-k/min-p).

OpenAI-compatible HTTP server

Non-thinking (instruct) mode, general tasks — the canonical llama-server invocation from Unsloth's run guide:

export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.00 \
    --chat-template-kwargs '{"enable_thinking":false}'

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint (default port 8080). On a 24 GB card, add --ctx-size 8192 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --n-gpu-layers 99 to cap the context, quantize the KV cache, and offload every layer to the GPU. Recent llama.cpp builds expect an explicit on, off, or auto value after --flash-attn. Because Q4_K already peaks at the 24 GB ceiling, these flags are what keep you off the OOM line (see Troubleshooting).
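
Putting the memory flags together with the non-thinking sampling settings gives a single command line. The sketch below only prints that command so it can be reviewed before running. server_cmd and its two parameters are hypothetical names for this example, and it assumes a recent build in which --flash-attn takes an on/off/auto value.

```shell
# server_cmd [ctx_size] [kv_type]: print a llama-server command line for a 24 GB card.
server_cmd() {
    ctx="${1:-8192}"   # context cap in tokens
    kv="${2:-q8_0}"    # element type for both K and V caches
    printf '%s ' ./llama.cpp/llama-server \
        -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
        --ctx-size "$ctx" \
        --cache-type-k "$kv" --cache-type-v "$kv" \
        --flash-attn on \
        --n-gpu-layers 99 \
        --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00
    # Single quotes keep the JSON intact when the printed line is pasted into a shell.
    printf '%s\n' "--chat-template-kwargs '{\"enable_thinking\":false}'"
}

server_cmd              # defaults: 8192-token context, q8_0 KV cache
server_cmd 16384 q4_0   # a larger context to try only if the default fits
```

Printing rather than executing keeps the sketch safe to run anywhere; paste the resulting line into a shell once the values look right.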

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain Mixture-of-Experts in three sentences."}],
    "temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0
  }'

Sampling parameters

Per the Qwen3.5-27B model card Best Practices:

  • Thinking mode, general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5
  • Thinking mode, precise coding: temperature=0.6
  • Non-thinking (instruct) mode, general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5

The card notes you can adjust presence_penalty between 0 and 2 to reduce endless repetitions: "However, using a higher value may occasionally result in language mixing and a slight decrease in model performance."
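
To avoid retyping these values, the presets can be wrapped in a small helper that emits llama.cpp flags. This is an illustrative sketch: sampling_flags and the preset labels are hypothetical names. The values are the ones quoted above, and the coding preset sets no presence penalty because none is given for it here.

```shell
# sampling_flags <preset>: print llama.cpp sampling flags for one model-card preset.
sampling_flags() {
    case "$1" in
        thinking-general)
            echo "--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5" ;;
        thinking-coding)
            # Same top-p/top-k/min-p as the general preset, lower temperature.
            echo "--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0" ;;
        instruct-general)
            echo "--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5" ;;
        *)
            echo "unknown preset: $1" >&2
            return 1 ;;
    esac
}

# Show what each preset expands to.
for preset in thinking-general thinking-coding instruct-general; do
    echo "$preset: $(sampling_flags "$preset")"
done
```

In use, the output is left unquoted so the shell splits it into separate arguments, for example ./llama.cpp/llama-cli -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL $(sampling_flags thinking-coding).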

Results

  • Speed: Per Hardware Corner's RTX 3090 LLM benchmark page, Qwen3.5 27B at Q4_K records 33.5 tokens/s generation at 4K context (32.3 t/s @ 16K, 31.0 t/s @ 32K), with prompt-processing prefill at 1,104.2 tokens/s @ 4K (977.4 @ 16K, 848.2 @ 32K). Confirmed by /check/qwen3-5-27b/rtx-3090 (Hardware Corner-sourced, last verified 2026-05-15). The 33.5 t/s figure is end-user token generation; prefill (prompt ingestion) is the much faster 1,104.2 t/s number — don't conflate the two.
  • VRAM usage: 24 GB peak at Q4_K on the 24 GB card per the backend benchmark at /check/qwen3-5-27b/rtx-3090 — i.e. the full envelope is in use at 4K context. The Q4_K_M GGUF is ~16.7 GB on disk (unsloth tier table); the remaining headroom holds the KV cache and activations. Unsloth's hardware table lists the 27B model at 4-bit needing 17 GB of memory.
  • Quality notes: Q4_K is the quality ceiling you can run on a single 24 GB card — Q5_K_M (~19.6–21 GB) and Q6_K (~22.5–24 GB) leave no room for KV cache. If you want more quality you need a larger GPU or a multi-GPU setup. bartowski's GGUFs are imatrix-calibrated; Unsloth's UD-Q4_K_XL (17.6 GB) applies per-layer dynamic bit-allocation per the Unsloth Dynamic 2.0 docs.
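
To see why prefill and generation should not be conflated, it helps to turn the two rates into wall-clock time. The sketch below is back-of-the-envelope arithmetic on the 4K-context figures above. est_seconds is a hypothetical helper, and the 3,500-token prompt and 500-token reply are made-up example sizes.

```shell
# est_seconds <prompt_tokens> <reply_tokens> <prefill_tps> <gen_tps>
# Prints prefill time, generation time, and their sum, in seconds.
est_seconds() {
    awk -v p="$1" -v r="$2" -v pf="$3" -v gen="$4" 'BEGIN {
        prefill = p / pf    # time to ingest the prompt
        decode  = r / gen   # time to emit the reply token by token
        printf "prefill=%.1fs generate=%.1fs total=%.1fs\n", prefill, decode, prefill + decode
    }'
}

# 4K-context rates quoted above: 1,104.2 t/s prefill, 33.5 t/s generation.
est_seconds 3500 500 1104.2 33.5
```

With these inputs the script prints prefill=3.2s generate=14.9s total=18.1s: the reply dominates even though it is one seventh the length of the prompt.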

For the full benchmark data and other-GPU comparisons, see /check/qwen3-5-27b/rtx-3090.

Troubleshooting

Ollama doesn't load these GGUFs

This is a Qwen3.5-specific gotcha. Per Unsloth's Qwen3.5 run guide: "Currently no Qwen3.5 GGUF works in Ollama due to separate mmproj vision files." Use llama.cpp (or an LM Studio build with current llama.cpp) instead — that is why this recipe pins llama.cpp rather than the usual one-line ollama run.

Generation slows or OOMs past 8K–16K context

The Q4_K footprint already pushes peak VRAM to the 24 GB ceiling at 4K per /check/qwen3-5-27b/rtx-3090, so the KV cache is the binding constraint as context grows. KV-cache discipline ladder, in order of how much it helps:

  1. Cap --ctx-size at 8192 to start, then raise it only as far as it fits.
  2. Quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on (roughly halves KV memory at minor quality cost; recent builds expect an explicit on/off/auto value after --flash-attn); --cache-type-k q4_0 cuts it further.
  3. Drop to a smaller quant — IQ4_XS (~15.8 GB on disk per the unsloth tier table) frees ~1 GB for KV cache versus Q4_K_M.
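
The "halves KV memory" figure in step 2 follows from the storage cost per cached value. The sketch below assumes an f16 baseline (16 bits per value) and ggml's block layouts, in which q8_0 and q4_0 store 32 values plus a 2-byte scale per block. kv_ratio is a hypothetical helper name.

```shell
# kv_ratio <cache_type>: KV-cache size relative to an f16 cache.
kv_ratio() {
    case "$1" in
        f16)  bits=16 ;;
        q8_0) bits=8.5 ;;   # 34 bytes per 32 values
        q4_0) bits=4.5 ;;   # 18 bytes per 32 values
        *) echo "unsupported cache type: $1" >&2; return 1 ;;
    esac
    awk -v b="$bits" 'BEGIN { printf "%.2f\n", b / 16 }'
}

for t in f16 q8_0 q4_0; do
    echo "$t -> $(kv_ratio "$t")x the size of an f16 KV cache"
done
```

So q8_0 lands at about 0.53x and q4_0 at about 0.28x of the f16 footprint, which is the room that lets --ctx-size grow on a card that is already full.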

Want image input (vision)?

Qwen3.5-27B is multimodal, but the GGUF text path above is text-only. To feed images, download the vision projector (mmproj-F16.gguf, ~0.93 GB, on unsloth/Qwen3.5-27B-GGUF) and run the multimodal CLI, e.g. llama-mtmd-cli --model <q4_k.gguf> --mmproj mmproj-F16.gguf .... The vision encoder adds VRAM on top of the text envelope — on a full-at-24 GB Q4_K text load you will likely need to lower --ctx-size to make room.

Ampere (RTX 3090) vs Ada/Hopper — anything special?

No. The RTX 3090 is Ampere (sm_86), which is fully supported by mainline CUDA and by llama.cpp's CUDA backend, including the built-in flash-attention kernels that --flash-attn enables. The default pre-built llama-bXXXX-bin-ubuntu-cuda-12.x-x64.zip releases work out of the box; no special build selection or attention-implementation override is required. Ampere has no FP8 tensor cores, but this recipe ships only Q4_K GGUF weights, so that limitation never bites. There is also a Qwen/Qwen3.5-27B-FP8 mirror; it is intended for newer-architecture vLLM/SGLang serving, not single-3090 llama.cpp, so stick with the Q4_K GGUF here.

Measured something different on your card?

Speed and VRAM vary with quant, context length, and llama.cpp build. If your numbers differ, contribute them via the submission form — community measurements are what keep /check/qwen3-5-27b/rtx-3090 accurate for the next person.

common questions
How much VRAM does Qwen3.5 27B need?

About 24 GB — the minimum this recipe targets.

Which GPUs is Qwen3.5 27B tested on?

RTX 3090 (24 GB).

How hard is this setup?

Intermediate — follow the steps above.