What You'll Build
A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 5060 Ti. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice in eight languages — unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.
Hardware data: RTX 5060 Ti (16 GB VRAM) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data
ℹ️ Not a TTS model. Voxtral understands audio; it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is a speech-to-text and audio Q&A model.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM consumer card | RTX 5060 Ti (16 GB) |
| RAM | 16 GB | — |
| Storage | ~10 GB for weights + cache | — |
| Software | Python 3.10+, PyTorch with CUDA, transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 | — |
Installation
1. Install Transformers and mistral-common
The Transformers integration shipped in v4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:
pip install -U transformers
pip install --upgrade "mistral-common[audio]"
Confirm that mistral-common imports cleanly and check that its version is at least 1.8.1:
python -c "import mistral_common; print(mistral_common.__version__)"
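The one-liner above reports a single package. As a rough sketch, the two version floors from the Requirements table can be checked together with the standard library alone. The helper names (`parse_version`, `meets_minimum`, `check_requirements`) are illustrative, not part of either package, and the comparison is deliberately simple: pre-release suffixes such as `.dev0` are ignored.

```python
from importlib import metadata

# Minimum versions taken from the Requirements table in this guide.
MINIMUMS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def parse_version(text):
    """Return the leading numeric components, e.g. '1.8.1.post1' -> (1, 8, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) < len(piece):
            break  # stop at a suffix such as '0rc1'
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numeric components, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS):
    """Map each package name to (installed_version_or_None, is_new_enough)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        status = "ok" if ok else "missing or too old"
        print(f"{name}: {installed} ({status})")
```

This only inspects installed package metadata; it does not confirm that the audio extras themselves are importable.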
2. (Optional) Install vLLM for high-throughput serving
vLLM gives the fastest token throughput but reserves a large KV cache. On a 16 GB card you may need --max-model-len 4864 to fit, per the DataCamp tutorial:
uv pip install -U "vllm[audio]" --system
This pulls vllm >= 0.10.0 and a compatible mistral_common >= 1.8.1.
3. (Optional) Use the FP8 mirror to halve VRAM
For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0:
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
--tokenizer_mode mistral --config_format mistral --load_format mistral
It reduces VRAM and disk by approximately 50% versus the bf16 release per the model card.
Running
Transformers — single-file audio Q&A
The canonical example from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor and the model weights in bf16.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn: an audio clip plus a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation).to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
vLLM — server mode
For batched inference or multi-client setups:
vllm serve mistralai/Voxtral-Mini-3B-2507 \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload.
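As an illustration of the base64 route, the sketch below builds a chat-completions payload and posts it with the standard library. It assumes the server accepts the OpenAI-style `input_audio` content part; the function names, the `max_tokens` value, and the default base URL are placeholders chosen for this example, so check them against your vLLM version before relying on it.

```python
import base64
import json
import urllib.request


def build_audio_chat_payload(audio_bytes, question,
                             model="mistralai/Voxtral-Mini-3B-2507",
                             audio_format="mp3"):
    """Build a chat-completions body with one audio part and one text part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumption: OpenAI-style base64 audio content part.
                    {"type": "input_audio",
                     "input_audio": {"data": encoded, "format": audio_format}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 500,
    }


def post_chat_completion(payload, base_url="http://localhost:8000"):
    """POST the payload to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, a call might look like `post_chat_completion(build_audio_chat_payload(open("your-clip.mp3", "rb").read(), "Summarise this clip."))`.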
Results
- VRAM usage: Voxtral-Mini-3B-2507 needs ~9.5 GB of GPU RAM in bf16 or fp16, per the model card. Independent confirmation: a user on the official HF discussion reported running the Transformers version on a 12 GB GPU with VRAM sitting "around 10 GB during normal use". The 5060 Ti's 16 GB therefore leaves roughly 6 GB of headroom for long audio.
- Quality notes: Mistral's announcement claims Voxtral Mini "outperforms Whisper large-v3 on transcription tasks" and supports 30–40 minute audio contexts. A community report notes transcription quality "starts to slip a bit when the audio is noisy or mixes multiple languages" and that Whisper Large v3 remains slightly more robust in those edge cases (HF discussion).
- License: Apache-2.0.
For the full benchmark data once community submissions land, see /check/voxtral/rtx-5060-ti.
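To compare your own run against the ~9.5 GB figure, one option is to read PyTorch's peak-allocation counter around a single generation call. This is a minimal sketch: `measure_peak_vram` and `bytes_to_gib` are hypothetical helper names, and the counter covers only tensors allocated by PyTorch, so `nvidia-smi` will typically show a higher number because it also counts the CUDA context and cached blocks.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


def measure_peak_vram(run_inference):
    """Run `run_inference()` once and return PyTorch's peak allocated bytes."""
    import torch  # imported lazily so bytes_to_gib works without a GPU

    # Reset the peak counter; already-loaded weights still count toward it.
    torch.cuda.reset_peak_memory_stats()
    run_inference()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()
```

With `model` and `inputs` from the Transformers example, usage could be `peak = measure_peak_vram(lambda: model.generate(**inputs, max_new_tokens=500))` followed by `print(f"{bytes_to_gib(peak):.2f} GiB")`.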
Troubleshooting
vLLM consumes far more than 9.5 GB
Reported on the HF model discussion: vLLM can grow to "almost 40 GB VRAM" because of its KV-cache reservation policy. The ~9.5 GB figure on the model card refers to the Transformers runtime. To bring vLLM into a 16 GB budget on the 5060 Ti, pass --max-model-len 4864 (or smaller). For ad-hoc local use, the Transformers backend is preferred; reach for vLLM only when you need batched throughput.
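For reference, here is one way to assemble the capped launch command from a script, combining the server flags shown earlier with the `--max-model-len` value quoted above. The helper name `build_vllm_serve_command` is hypothetical, and 4864 is the figure cited in this guide rather than a measured optimum.

```python
import shlex


def build_vllm_serve_command(model="mistralai/Voxtral-Mini-3B-2507",
                             max_model_len=4864):
    """Return the argv list for `vllm serve` with a capped context length."""
    return [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        # A smaller context length shrinks the KV cache vLLM needs to fit.
        "--max-model-len", str(max_model_len),
    ]


if __name__ == "__main__":
    # Print the command for copy-pasting; pass the list to subprocess.run to launch it.
    print(shlex.join(build_vllm_serve_command()))
```

Lowering `max_model_len` further trades maximum clip length for memory, so pick the largest value that still fits.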
ImportError or version mismatch on import
Voxtral support was added in transformers 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls out both requirements explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers install is too old; upgrade with pip install -U transformers.
GGUF / llama.cpp builds
Per the HF discussion thread, GGUF conversion is limited for encoder-decoder audio-text architectures like Voxtral; stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror (a community quantization, not an official Mistral release) covers the "smaller weights" use case without requiring GGUF.