Voxtral Mini 3B on RTX 4060 Ti 16GB: local speech understanding in ~9.5 GB

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 4060 Ti 16GB. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice in eight languages — unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.

Hardware data: RTX 4060 Ti 16GB · ~9.5 GB of GPU RAM in bf16/fp16 per the official model card · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio Q&A.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM consumer card	RTX 4060 Ti (16 GB, Ada sm_89)
RAM	16 GB	—
Storage	~10 GB for weights + cache	—
Software	Python 3.10+, PyTorch with CUDA, `transformers >= 4.54.0`, `mistral-common[audio] >= 1.8.1`	—

Installation

1. Install Transformers and mistral-common

The Transformers integration shipped in v4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Verify the installed mistral-common version:

python -c "import mistral_common; print(mistral_common.__version__)"

You should see 1.8.1 or newer; the model card lists this version as the minimum for the audio tokenizer.
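The same check can be scripted for both packages at once. The following is a minimal sketch, not part of the model card: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and the version parser is deliberately simple (it compares only the leading numeric components, so a pre-release such as `4.54.0.dev0` is treated as `4.54.0`).

```python
import re
from importlib import metadata

# Minimum versions stated in this guide.
MINIMUMS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def parse_version(text: str) -> tuple:
    """Turn '4.54.0' or '4.55.0.dev0' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(n) for n in match.group(0).split(".")) if match else ()


def meets_minimum(installed: str, required: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Map each package to True/False, or None if it is not installed."""
    results = {}
    for package, required in MINIMUMS.items():
        try:
            results[package] = meets_minimum(metadata.version(package), required)
        except metadata.PackageNotFoundError:
            results[package] = None
    return results


if __name__ == "__main__":
    for package, ok in check_environment().items():
        print(package, "OK" if ok else "missing or too old")
```

This only confirms version numbers; it does not confirm that the optional audio dependencies were installed alongside mistral-common.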

Unlike Blackwell GPUs (sm_120), the 4060 Ti needs no special CUDA wheel selection: the default pip install torch wheels already ship sm_89 kernels, including PyTorch's built-in FlashAttention-based attention path, so the standard install works out of the box.

2. (Optional) Install vLLM for high-throughput serving

vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 16 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API:

uv pip install -U "vllm[audio]" --system

This pulls vllm >= 0.10.0 and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will likely need on a 16 GB card.

3. (Optional) Use the FP8 mirror to halve VRAM

For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card, it ships symmetric-per-channel weights and dynamic-per-token activations (linear layers only — the audio tower and multi-modal projector stay in full precision):

vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The mirror documents a ~50% reduction in both GPU memory and disk size versus the bf16 release, plus a ~2× matmul throughput uplift on hardware with native FP8 support — Ada Lovelace (RTX 4060 Ti / 4090) qualifies. See the RedHatAI card for the full quantization recipe.

Running

Transformers — single-file audio Q&A

The canonical example adapted from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation).to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

The model handles audio clips up to 30 minutes for transcription, or 40 minutes for understanding, within a 32k-token context window (Mistral announcement).
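Before loading a long recording, it can be worth checking its duration against those limits. The sketch below is illustrative only: the function names are hypothetical, the limits are the 30- and 40-minute figures quoted above, and the duration reader handles plain PCM WAV files through the standard library (other formats such as MP3 would need a separate audio library).

```python
import wave

# Limits quoted in this guide from the Mistral announcement, in minutes.
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_limit(duration_seconds: float, task: str) -> bool:
    """True if a clip of this length fits the quoted limit for the given task."""
    return duration_seconds <= MAX_MINUTES[task] * 60
```

For example, `within_limit(wav_duration_seconds("meeting.wav"), "transcription")` (with a hypothetical file name) tells you whether a recording needs to be split before transcription.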

vLLM — server mode

For batched inference or multi-client setups:

vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 16 GB card you will almost certainly need to cap context — see Troubleshooting.
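A client for that endpoint can be sketched with the standard library alone. Treat the following as a hedged example rather than the documented client: `build_payload` and `ask` are hypothetical helper names, the `/v1/chat/completions` path is the usual OpenAI-compatible route, and the `input_audio` content part is an assumption about the payload shape your vLLM version accepts, so check the vLLM documentation before relying on it.

```python
import base64
import json
import urllib.request

# Assumed defaults: the server address and model name used in this guide.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "mistralai/Voxtral-Mini-3B-2507"


def build_payload(audio_bytes: bytes, audio_format: str, question: str) -> dict:
    """Build a chat-completions request with one base64 audio part and one text part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumption: OpenAI-style "input_audio" content part.
                    {
                        "type": "input_audio",
                        "input_audio": {"data": encoded, "format": audio_format},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def ask(audio_path: str, question: str, audio_format: str = "mp3") -> str:
    """POST the request to the local server and return the model's reply text."""
    with open(audio_path, "rb") as f:
        payload = build_payload(f.read(), audio_format, question)
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("your-clip.mp3", "Summarise this clip.")` would return the model's text reply. Base64 keeps the example self-contained; for large files, passing a URL avoids inflating the request body.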

Results

VRAM usage: Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16, per the model card (quoted in its vLLM serving section). Independent corroboration from the Transformers runtime: a user on the official HF discussion reported VRAM sitting "around 10 GB during normal use" on a 12 GB GPU. On the 16 GB 4060 Ti that leaves roughly 6 GB of headroom for long audio and the KV-cache growth that comes with the 30-minute transcription window.
Speed: Measured speed for this exact GPU is not yet available. Single-stream token generation is typically memory-bandwidth-bound, so the 4060 Ti's 288 GB/s will likely trail higher-tier Ada cards; the model's small size should keep latency practical, but that is an expectation rather than a measurement. Submit a benchmark via /contribute once you've measured it.
Quality notes: Mistral's announcement claims Voxtral "comprehensively outperforms Whisper large-v3, the current leading open-source Speech Transcription model" and supports 30-minute transcription windows. A community report from the same HF discussion notes transcription quality "starts to slip a bit when the audio is noisy or mixes multiple languages" and that "Whisper Large v3 still feels a bit more robust in those tricky cases" (HF discussion).
License: Apache-2.0.

For the full benchmark data once community submissions land, see /check/voxtral/rtx-4060-ti-16gb.

Troubleshooting

vLLM consumes far more than 9.5 GB

Reported on the HF model discussion: vLLM can grow to "almost 40 GB VRAM" because of its KV-cache reservation policy. A separate vLLM-side bug report (vllm-project/vllm#38233) tracks 16 GB users hitting encoder_cache saturation on the realtime variant — same root cause class.

The ~9.5 GB figure on the Mistral card describes what the model itself needs, not vLLM's pre-allocated KV reservation. To bring vLLM into a 16 GB budget on the 4060 Ti, cap --max-model-len (the 32k default reserves the full audio-encoder cache up front) and consider --gpu-memory-utilization 0.85 to leave activation headroom. The exact ceiling depends on your concurrent-stream count: start with --max-model-len 8192 and raise it in steps until the server fails to start or runs out of memory, then back off one step. Alternatively, stick with the Transformers backend for single-user desktop work.
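To make the capped invocation repeatable, the flags can be assembled in a small script. This is a sketch under stated assumptions: `build_serve_command` and `audio_minutes` are hypothetical helpers, the defaults are the starting values suggested above, and the 12.5 audio tokens per second used for the estimate is an assumption about Voxtral's audio frame rate that you should verify against Mistral's documentation. The estimate also ignores the text prompt and generated tokens, so real capacity is somewhat lower.

```python
import shlex


def build_serve_command(max_model_len: int = 8192,
                        gpu_memory_utilization: float = 0.85) -> list:
    """Assemble the vllm serve invocation from this guide with memory caps added."""
    return [
        "vllm", "serve", "mistralai/Voxtral-Mini-3B-2507",
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


def audio_minutes(max_model_len: int, tokens_per_second: float = 12.5) -> float:
    """Rough upper bound on audio length for a context cap.

    Assumption: audio costs about 12.5 tokens per second; prompt and
    output tokens are not subtracted.
    """
    return max_model_len / tokens_per_second / 60


if __name__ == "__main__":
    print(shlex.join(build_serve_command()))
    print(f"~{audio_minutes(8192):.1f} minutes of audio at most")
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues when you iterate on the cap.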

`ImportError` or version mismatch on import

Voxtral support was added in transformers 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old; upgrade with pip install -U transformers.

GGUF / llama.cpp builds

Per the HF discussion thread, GGUF conversion only works with decoder-only architectures: "we can't convert the full Voxtral with audio encoder yet. The text-only quant works, but it's not useful for transcription." Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror covers the "smaller weights" use case without requiring GGUF.

Should I use the 24B variant instead?

No, not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card quotes ~55 GB of GPU RAM in bf16/fp16. That is about 3.4× the 4060 Ti's 16 GB envelope, and it stays out of reach with FP8, since even halving the footprint leaves roughly 27 GB. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.
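The arithmetic behind that answer is easy to reproduce. The sketch below uses only the figures quoted in this guide; the FP8 estimate applies the ~50% reduction documented for the Mini mirror to both models, which is an assumption for the 24B variant, and the fit check ignores KV cache and activations.

```python
GPU_VRAM_GB = 16.0  # RTX 4060 Ti 16GB

# bf16/fp16 figures quoted in this guide from the respective model cards.
BF16_GB = {"Voxtral Mini 3B": 9.5, "Voxtral Small 24B": 55.0}

# Assumption: the ~50% FP8 saving documented for the Mini mirror applies to both.
FP8_FACTOR = 0.5


def summarize(vram_gb: float = GPU_VRAM_GB) -> list:
    """Compare each model's quoted footprint with the available VRAM."""
    rows = []
    for name, bf16 in BF16_GB.items():
        fp8 = bf16 * FP8_FACTOR
        rows.append({
            "model": name,
            "bf16_ratio": round(bf16 / vram_gb, 2),  # footprint / VRAM
            "fits_bf16": bf16 <= vram_gb,  # weights only, no KV cache
            "fits_fp8": fp8 <= vram_gb,
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Passing a different `vram_gb` value lets you rerun the comparison for another card.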