What You'll Build
A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 5080. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice in eight languages — unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.
Hardware data: RTX 5080 (16 GB VRAM) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data
ℹ️ Not a TTS model. Voxtral understands audio; it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio Q&A.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM consumer card | RTX 5080 (16 GB, Blackwell sm_120) |
| RAM | 16 GB | — |
| Storage | ~10 GB for weights + cache | ~9.4 GB bf16 weights per HF Files tab |
| Software | Python 3.10+, PyTorch with CUDA (cu128), transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 | — |
Installation
1. Install a Blackwell-ready PyTorch (cu128)
The RTX 5080 is a Blackwell card (compute capability sm_120). Install a PyTorch build that ships sm_120 kernels — the CUDA 12.8 wheel — before anything else, so the audio tower and decoder both get native Blackwell kernels:
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128
Unlike some Blackwell recipes, Voxtral does not need an attn_implementation override: the model card's reference snippet calls from_pretrained(...) without forcing flash_attention_2, so PyTorch's built-in SDPA kernels (which have full sm_120 coverage) are used by default. There is no FlashAttention-2 sm_120 wheel gap to work around here.
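To confirm that the installed wheel actually ships sm_120 kernels, a quick check along these lines may help. This is a minimal sketch: `has_native_kernels` and `check_installed_torch` are hypothetical helper names, and the check assumes that an `sm_XY` entry in `torch.cuda.get_arch_list()` is a reasonable proxy for native kernel coverage.

```python
def has_native_kernels(arch_list, capability):
    """Return True if the compiled arch list contains the GPU's sm_XY target."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch():
    """Describe whether the local PyTorch build covers the detected GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed but no CUDA device is visible"
    capability = torch.cuda.get_device_capability(0)  # (12, 0) on an RTX 5080
    arch_list = torch.cuda.get_arch_list()            # e.g. ['sm_80', ..., 'sm_120']
    ok = has_native_kernels(arch_list, capability)
    name = torch.cuda.get_device_name(0)
    return f"{name}: capability {capability}, native kernels: {ok}"


if __name__ == "__main__":
    print(check_installed_torch())
```

If the last field reads `False`, the wheel was probably installed from the default index rather than the cu128 one.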
2. Install Transformers and mistral-common
The Transformers integration shipped in v4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:
pip install -U transformers
pip install --upgrade "mistral-common[audio]"
Check the installed mistral-common version:
python -c "import mistral_common; print(mistral_common.__version__)"
You should see 1.8.1 or newer; the model card lists 1.8.1 as the minimum version. This confirms the package version only, not that the optional [audio] dependencies were installed.
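Both version floors can be checked in one pass. The sketch below uses only the standard library; `FLOORS`, `version_tuple`, `meets_floor`, and `report` are hypothetical names, and the parser is deliberately simple (it compares leading integers only, so a pre-release such as `1.8.1rc1` is treated like `1.8.1`).

```python
import re
from importlib import metadata

# Minimum versions quoted in this guide.
FLOORS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def version_tuple(version):
    """Turn '4.54.0' or '4.55.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_floor(installed, floor):
    """True if the installed version is at least the required floor."""
    return version_tuple(installed) >= version_tuple(floor)


def report():
    """One status line per package, based on installed distribution metadata."""
    lines = []
    for package, floor in FLOORS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            lines.append(f"{package}: not installed (need >= {floor})")
            continue
        status = "ok" if meets_floor(installed, floor) else f"too old, need >= {floor}"
        lines.append(f"{package} {installed}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```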
3. (Optional) Install vLLM for high-throughput serving
vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 16 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API:
uv pip install -U "vllm[audio]" --system
This pulls vllm >= 0.10.0 and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will likely need on a 16 GB card.
4. (Optional) Use the FP8 mirror to halve VRAM
For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card, weights are quantized with a "symmetric static per-channel scheme" and activations with a "symmetric dynamic per-token scheme" (linear layers only):
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The mirror's card states the FP8 optimization is "reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x)", plus a matching ~50% disk-size reduction. Blackwell (RTX 5080, sm_120) has native FP8 tensor cores, so it sees the throughput uplift in addition to the memory saving. See the RedHatAI card for the full quantization recipe.
Running
Transformers — single-file audio Q&A
The canonical example adapted from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (tokenizer + audio front end) and the model in bf16.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn containing an audio clip and a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation).to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
Per the model card, "With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding", so a single clip can cover roughly half to two-thirds of an hour-long meeting.
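Before feeding a long recording to the model, it can be worth checking its length against those two limits. The sketch below is a rough pre-flight check under stated assumptions: the helper names are hypothetical, the duration reader handles uncompressed WAV only (an MP3 would need a separate decoder), and splitting into equal pieces is one possible strategy rather than anything the model card prescribes.

```python
import math
import wave

# Limits quoted from the model card, in minutes.
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_in_one_pass(duration_seconds, task):
    """True if a clip of this length is within the quoted limit for the task."""
    return duration_seconds <= MAX_MINUTES[task] * 60


def chunks_needed(duration_seconds, task):
    """Smallest number of equal-length pieces that each stay within the limit."""
    limit = MAX_MINUTES[task] * 60
    return max(1, math.ceil(duration_seconds / limit))
```

For a one-hour recording, `chunks_needed(3600, "transcription")` gives 2, so the clip would have to be split before transcription.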
vLLM — server mode
For batched inference or multi-client setups:
vllm serve mistralai/Voxtral-Mini-3B-2507 \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 16 GB card you will almost certainly need to cap context — see Troubleshooting.
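A request with base64 audio might be assembled as follows. Treat this as a sketch rather than the server's documented contract: the `input_audio` content part follows the OpenAI chat-completions convention and is an assumption here, as are the `/v1/chat/completions` path and the helper names `build_payload` and `ask`. Check the vLLM documentation for the exact shape your version accepts.

```python
import base64
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed default path
MODEL = "mistralai/Voxtral-Mini-3B-2507"


def build_payload(audio_bytes, audio_format, question, max_tokens=500):
    """Build a chat-completions body with one audio part and one text part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumed OpenAI-style audio part; not confirmed by this guide.
                    {"type": "input_audio",
                     "input_audio": {"data": encoded, "format": audio_format}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def ask(audio_path, question):
    """POST a local audio file plus a question to a running vLLM server."""
    with open(audio_path, "rb") as handle:
        audio_format = audio_path.rsplit(".", 1)[-1]
        payload = build_payload(handle.read(), audio_format, question)
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("your-clip.wav", "Summarise this clip.")` would return the model's text reply.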
Results
- VRAM usage: The model card states "Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16" (per the card's vLLM serving section). Independent corroboration from the Transformers runtime: a user reported on the official HF discussion that "I’m running the Hugging Face version on a 12GB GPU with no problem VRAM sits around 10GB during normal use." The 16 GB on the RTX 5080 leaves comfortable headroom for long audio and the KV growth that comes with the 30-minute transcription window.
- Speed: Empirical speed for this exact GPU is not yet available. The RTX 5080's memory bandwidth sits below the flagship 5090 but well above prior-gen midrange cards, which matters because single-stream token generation on a 3B model is typically limited by memory bandwidth rather than compute. No source publishes a 5080-named Voxtral measurement, so we omit a number rather than extrapolate. Submit a benchmark via /contribute once you've measured it.
- Quality notes: Mistral's announcement positions Voxtral as a state-of-the-art open speech-understanding model with multilingual transcription, translation and audio Q&A. A community report from the HF discussion notes that transcription "starts to slip a bit when the audio is noisy or mixes multiple languages" and that "Whisper Large v3 still feels a bit more robust in those tricky cases" (HF discussion).
- License: Apache-2.0.
For the full benchmark data once community submissions land, see /check/voxtral/rtx-5080.
Troubleshooting
vLLM consumes far more than 9.5 GB
A user on the HF model discussion reported that vLLM "takes up almost 40GB VRAM for me". vLLM pre-allocates a large KV cache, claiming a fixed fraction of total GPU memory set by --gpu-memory-utilization, which is why it overshoots the ~9.5 GB figure on the Mistral card. To fit vLLM into a 16 GB budget on the 5080, cap --max-model-len so that a full-length sequence fits in the KV cache left over after the weights load, and consider --gpu-memory-utilization 0.85 to leave activation headroom. The exact ceiling depends on your concurrent-stream count: start with --max-model-len 8192 and raise it until you hit OOM, or stick with the Transformers backend for single-user desktop work.
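The "start low and raise it" approach can be scripted by generating one launch command per candidate context length. This is a minimal sketch: `serve_command` and `CONTEXT_LADDER` are hypothetical names, the ladder values above 8192 are illustrative guesses, and the script only prints commands rather than launching anything.

```python
import shlex

# Base command taken from the server-mode section of this guide.
BASE = [
    "vllm", "serve", "mistralai/Voxtral-Mini-3B-2507",
    "--tokenizer_mode", "mistral",
    "--config_format", "mistral",
    "--load_format", "mistral",
]

# Candidate context lengths to try in order (illustrative values).
CONTEXT_LADDER = [8192, 16384, 32768]


def serve_command(max_model_len, gpu_memory_utilization=0.85):
    """Return the argv list for one vLLM launch with a capped context."""
    return BASE + [
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


for length in CONTEXT_LADDER:
    print(shlex.join(serve_command(length)))
```

Run each printed command in turn and keep the largest context length that starts without an out-of-memory error.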
ImportError or version mismatch on import
Voxtral was added in transformers >= 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old — upgrade with pip install -U transformers.
GGUF / llama.cpp builds
Per the HF discussion thread, a community user notes that "GGUF only works with decoder-only models like LLaMA, so we can't convert the full Voxtral with audio encoder yet." Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror (a community quantization, not an official Mistral release) covers the "smaller weights" use case without requiring GGUF.
Should I use the 24B variant instead?
No, not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card states "Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16". That is about 3.4× the 5080's 16 GB envelope, and out of reach even with FP8: halving 55 GB still leaves roughly 27 GB. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.
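The arithmetic behind that answer can be reproduced with a few lines. This is a back-of-the-envelope sketch: the helper names are hypothetical, the bf16 figures are the ones quoted from the two model cards, and applying the RedHatAI card's "approximately 50%" FP8 reduction to the 24B model is an assumption, since that figure was published for the Mini mirror.

```python
VRAM_GB = 16.0  # RTX 5080

# bf16/fp16 footprints quoted from the two model cards.
BF16_FOOTPRINT_GB = {
    "Voxtral-Mini-3B-2507": 9.5,
    "Voxtral-Small-24B-2507": 55.0,
}

# "approximately 50%" per the RedHatAI card; assumed to carry over to the 24B.
FP8_FACTOR = 0.5


def ratio_to_vram(footprint_gb, vram_gb=VRAM_GB):
    """How many times the card's VRAM the footprint amounts to."""
    return footprint_gb / vram_gb


def fits(footprint_gb, vram_gb=VRAM_GB):
    """True if the footprint is within the card's total VRAM."""
    return footprint_gb <= vram_gb


for name, bf16 in BF16_FOOTPRINT_GB.items():
    fp8 = bf16 * FP8_FACTOR
    print(
        f"{name}: bf16 {bf16:.1f} GB = {ratio_to_vram(bf16):.2f}x VRAM "
        f"(fits={fits(bf16)}), fp8 ~{fp8:.2f} GB (fits={fits(fp8)})"
    )
```

Note that `fits` compares against total VRAM only; it ignores KV cache and activation memory, so a passing result is a necessary condition rather than a guarantee.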