Voxtral Mini 3B on RTX 5060 Ti: local speech understanding in ~9.5 GB

tts · intermediate · 10GB+ VRAM · May 18, 2026
prerequisites
  • NVIDIA RTX 5060 Ti (16 GB VRAM) or any consumer GPU with 12 GB+ VRAM
  • Python 3.10+
  • PyTorch with CUDA support, transformers >= 4.54.0, and mistral-common[audio] >= 1.8.1

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 5060 Ti. The model handles speech transcription, speech translation, audio Q&A, summarization, and function calling from voice in eight languages. Unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.

Hardware data: RTX 5060 Ti (16 GB VRAM) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio Q&A.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM consumer card | RTX 5060 Ti (16 GB)
RAM | 16 GB |
Storage | ~10 GB for weights + cache |
Software | Python 3.10+, PyTorch with CUDA, transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 |

Installation

1. Install Transformers and mistral-common

The Transformers integration shipped in v4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Confirm that mistral-common imports cleanly and reports version 1.8.1 or later:

python -c "import mistral_common; print(mistral_common.__version__)"
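
To check both minimum versions in one go, a small script along these lines may help. It is a sketch rather than part of the official instructions: the helper names (parse_version, meets_minimum, check) are hypothetical, and the version parsing is deliberately naive, reading only the leading numeric components and ignoring pre-release suffixes.

```python
from importlib import metadata

# Minimum versions quoted in this recipe.
MINIMUMS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def parse_version(text):
    """Turn '4.54.0' or '4.55.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check():
    """Return {package: (installed_version_or_None, ok)} for each requirement."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check().items():
        status = "OK" if ok else "upgrade needed"
        print(f"{name}: {installed or 'not installed'} -> {status}")
```

This reads installed package metadata only, so it does not prove the audio extras themselves are usable; a successful run of the Transformers example further down is the real test.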

2. (Optional) Install vLLM for high-throughput serving

vLLM is the higher-throughput option for batched or multi-client serving, but it reserves a large KV cache up front. On a 16 GB card you may need --max-model-len 4864 to fit, per the DataCamp tutorial:

uv pip install -U "vllm[audio]" --system

This pulls vllm >= 0.10.0 and a compatible mistral_common >= 1.8.1.

3. (Optional) Use the FP8 mirror to halve VRAM

For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0:

vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

It reduces VRAM and disk by approximately 50% versus the bf16 release per the model card.
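
As a back-of-the-envelope check, the figures quoted in this recipe can be combined as below. This is arithmetic on the stated numbers only (~9.5 GB for bf16, a reduction of roughly 50% for FP8, 16 GB on the card), not a measurement. Applying the 50% figure to the full bf16 peak is an assumption; KV cache and activations are not halved by weight quantization.

```python
BF16_PEAK_GB = 9.5    # bf16/fp16 figure from the model card
FP8_REDUCTION = 0.50  # "approximately 50%" per the FP8 model card
CARD_VRAM_GB = 16.0   # RTX 5060 Ti


def estimate_fp8_gb(bf16_gb=BF16_PEAK_GB, reduction=FP8_REDUCTION):
    """Rough FP8 footprint, assuming the reduction applies to the whole figure."""
    return bf16_gb * (1.0 - reduction)


def headroom_gb(used_gb, total_gb=CARD_VRAM_GB):
    """VRAM left over on the card after the model's footprint."""
    return total_gb - used_gb


fp8 = estimate_fp8_gb()
print(f"bf16: ~{BF16_PEAK_GB:.2f} GB used, ~{headroom_gb(BF16_PEAK_GB):.2f} GB free")
print(f"fp8:  ~{fp8:.2f} GB used, ~{headroom_gb(fp8):.2f} GB free")
```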

Running

Transformers — single-file audio Q&A

The canonical example from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (audio + text preprocessing) and the bf16 model onto the GPU.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn: an audio clip followed by a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

# Build the model inputs from the conversation, then move them to the GPU in bf16.
inputs = processor.apply_chat_template(conversation).to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt portion.
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

vLLM — server mode

For batched inference or multi-client setups:

vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload.
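
A minimal client sketch for the base64 route, using only the standard library. The input_audio content part follows the OpenAI chat-completions audio format; that vLLM accepts exactly this shape for Voxtral is an assumption here, so check it against your vLLM version. The function names build_payload and ask are hypothetical.

```python
import base64
import json
import urllib.request


def build_payload(audio_bytes, question, audio_format="mp3",
                  model="mistralai/Voxtral-Mini-3B-2507"):
    """Build a chat-completions request body with one audio part and one text part."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Assumed content-part schema (OpenAI-style input_audio).
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": audio_format}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 500,
    }


def ask(audio_path, question, base_url="http://localhost:8000/v1"):
    """POST the clip and question to a running server and return the reply text."""
    with open(audio_path, "rb") as fh:
        payload = build_payload(fh.read(), question)
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, ask("your-clip.mp3", "Transcribe and summarise this clip.") would send the same request as the Transformers example above.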

Results

  • VRAM usage: Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16, per the model card. Independent confirmation: a user on the official HF discussion reported running the Transformers version with VRAM sitting "around 10 GB during normal use" on a 12 GB GPU. On the 16 GB 5060 Ti, that leaves roughly 6 GB of headroom for long audio.
  • Quality notes: Mistral's announcement claims Voxtral Mini "outperforms Whisper large-v3 on transcription tasks" and supports 30–40 minute audio contexts. A community report notes transcription quality "starts to slip a bit when the audio is noisy or mixes multiple languages" and that Whisper Large v3 remains slightly more robust in those edge cases (HF discussion).
  • License: Apache-2.0.
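
To compare the VRAM figure above with your own run, PyTorch's allocator statistics give a quick reading. The sketch below is illustrative; the helper names are hypothetical, and torch is imported lazily so the conversion helper works on a machine without a GPU stack.

```python
def bytes_to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


def peak_vram_gb(device=0):
    """Peak memory allocated by PyTorch tensors on the given CUDA device, in GB."""
    import torch  # lazy import: only needed when actually measuring
    return bytes_to_gb(torch.cuda.max_memory_allocated(device))
```

Call torch.cuda.reset_peak_memory_stats() before model.generate(...) and peak_vram_gb() afterwards. This counts only memory allocated through PyTorch's caching allocator, so nvidia-smi will typically show a higher number because it also includes the CUDA context and cached blocks.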

For the full benchmark data once community submissions land, see /check/voxtral/rtx-5060-ti.

Troubleshooting

vLLM consumes far more than 9.5 GB

Reported on the HF model discussion: vLLM can grow to "almost 40 GB VRAM" because of its KV-cache reservation policy. The ~9.5 GB figure on the model card refers to the Transformers runtime. To bring vLLM into a 16 GB budget on the 5060 Ti, pass --max-model-len 4864 (or smaller). For ad-hoc local use, the Transformers backend is preferred; reach for vLLM only when you need batched throughput.
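
The capped launch can be scripted as follows. This sketch simply combines the serve flags shown earlier with --max-model-len; the function name vllm_serve_command is hypothetical, and 4864 is the value from the tutorial cited above, not a tuned figure.

```python
import shlex


def vllm_serve_command(model="mistralai/Voxtral-Mini-3B-2507", max_model_len=4864):
    """Return the vllm serve argument list with a capped context length."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be a positive number of tokens")
    return [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
    ]


# Print the command so it can be pasted into a shell.
print(shlex.join(vllm_serve_command()))
```

To launch directly from Python, pass the list to subprocess.run(vllm_serve_command(), check=True). A smaller cap shrinks the KV-cache reservation further, at the cost of shorter usable audio.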

ImportError or version mismatch on import

Voxtral support was added in transformers 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers install is too old; upgrade with pip install -U transformers.

GGUF / llama.cpp builds

Per the HF discussion thread, GGUF conversion is limited for encoder-decoder audio-text architectures like Voxtral; stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror (step 3 of Installation) covers the "smaller weights" use case without requiring GGUF.