What You'll Build
A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 4080 SUPER. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice — unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.
Hardware data: RTX 4080 SUPER (16 GB VRAM, Ada AD103 sm_89) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data
ℹ️ Not a TTS model. Voxtral understands audio; it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio understanding.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM consumer card | RTX 4080 SUPER (16 GB, Ada AD103 sm_89) |
| RAM | 16 GB | — |
| Storage | ~10 GB for weights + cache | ~9.4 GB of bf16 safetensors per the HF Files tab |
| Software | Python 3.10+, PyTorch with CUDA, transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 | — |
Installation
1. Install Transformers and mistral-common
Voxtral is supported natively in Transformers from version 4.54.0 onward. Both packages are required because Voxtral uses mistral-common's audio tokenizer:
pip install -U transformers
pip install --upgrade "mistral-common[audio]"
Check the installed mistral-common version (a successful import also confirms the package itself is present):
python -c "import mistral_common; print(mistral_common.__version__)"
You should see 1.8.1 or newer; the model card names this version as the minimum for the audio tokenizer.
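Both version floors can be checked in one pass. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical, and the comparison deliberately ignores pre-release suffixes, so treat it as a quick sanity check and not a full version parser.

```python
import re
from importlib import metadata

# Minimum versions quoted in the Requirements table above.
MINIMUMS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def version_tuple(version, width=3):
    """Turn '4.54.0.dev0' into (4, 54, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (width - len(parts))  # pad '4.54' to (4, 54, 0)
    return tuple(parts[:width])


def meets_minimum(installed, minimum):
    """True when the installed release is at least the minimum release."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Map each package to (installed_version_or_None, is_new_enough)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "upgrade needed"
        print(f"{name}: {installed or 'not installed'} ({status})")
```

Run it inside the same environment you installed into; a package reported as "not installed" or "upgrade needed" points back to the two pip commands above.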
Unlike Blackwell GPUs (sm_120), no special CUDA wheel selection is required for the RTX 4080 SUPER — the default pip install torch already ships sm_89 kernels with full FlashAttention-2 support, so the standard install path works out of the box.
2. (Optional) Install vLLM for high-throughput serving
vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 16 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API:
uv pip install -U "vllm[audio]" --system
This pulls a recent vLLM and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will likely need on a 16 GB card.
3. (Optional) Use the FP8 mirror to halve VRAM
For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card, only the linear operators inside the language model's transformer blocks are quantized — weights with a symmetric static per-channel scheme and activations with a symmetric dynamic per-token scheme — while the audio tower and multi-modal projector stay in full precision:
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The mirror documents roughly a 50% reduction in both GPU memory and disk size versus the bf16 release. The RTX 4080 SUPER's 4th-generation Ada tensor cores have native E4M3/E5M2 FP8 support, so the FP8 path is both smaller and faster on this card. See the RedHatAI card for the full quantization recipe.
Running
Transformers — audio Q&A
The canonical example adapted from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:
import torch
from transformers import AutoProcessor, VoxtralForConditionalGeneration

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (chat template + audio features) and the bf16 weights.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn: an audio clip followed by a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)

# Slice off the prompt tokens so only the newly generated text is decoded.
new_tokens = outputs[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
Per the model card's Key Features, Voxtral has a 32k-token context length and handles audio up to 30 minutes long for transcription, or 40 minutes for understanding (model card, Mistral announcement).
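For recordings longer than those limits, one option is to split the audio into windows that each fit. The sketch below only plans the window boundaries from a known duration; the function name `plan_chunks` is hypothetical, and splitting at fixed boundaries is an assumption made for illustration, not something the model card prescribes (a real pipeline would likely cut at silences).

```python
# Per-task audio limits, in minutes, as quoted from the model card above.
LIMITS_MIN = {"transcription": 30, "understanding": 40}


def plan_chunks(duration_s, task="transcription"):
    """Return (start, end) second offsets covering the clip within the task limit."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    limit_s = LIMITS_MIN[task] * 60
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + limit_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    # A 70-minute recording needs three transcription windows.
    for start, end in plan_chunks(70 * 60):
        print(f"{start / 60:.0f} min -> {end / 60:.0f} min")
```

Each planned window would then be cut from the source file with your audio tool of choice and passed to the model as its own request.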
vLLM — server mode
For batched inference or multi-client setups:
vllm serve mistralai/Voxtral-Mini-3B-2507 \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 16 GB card you will almost certainly need to cap context — see Troubleshooting.
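A request against the running server might look like the sketch below, which uses only the standard library. It assumes the server accepts an OpenAI-style `input_audio` content part carrying base64 data; check that shape against the vLLM and mistral-common versions you installed. The helper names `build_payload` and `ask` are hypothetical.

```python
import base64
import json
import urllib.request

# Default address from the text above; the model name matches the serve command.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "mistralai/Voxtral-Mini-3B-2507"


def build_payload(audio_bytes, audio_format, question, max_tokens=500):
    """Build a chat-completions body carrying base64 audio plus a text prompt."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumption: OpenAI-style "input_audio" content part.
                    {
                        "type": "input_audio",
                        "input_audio": {"data": audio_b64, "format": audio_format},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def ask(audio_path, question, audio_format="mp3"):
    """POST the payload to the local server and return the reply text."""
    with open(audio_path, "rb") as f:
        payload = build_payload(f.read(), audio_format, question)
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `ask("your-clip.mp3", "Summarise this clip.")` would return the model's text reply. Keeping payload construction separate from the network call makes the request shape easy to inspect or adjust.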
Results
- VRAM usage: Per the model card, running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16 (the figure appears in the card's vLLM serving section). The Transformers runtime is corroborated independently: a user on the official HF discussion reported VRAM sitting around 10 GB during normal use on a 12 GB GPU. The bf16 weights are ~9.4 GB on disk across two safetensors shards (HF Files tab), consistent with the ~9.5 GB resident figure. The RTX 4080 SUPER's full 16 GB leaves comfortable room for long audio and the KV growth that comes with the 30-minute transcription window.
- Speed: Empirical speed for the RTX 4080 SUPER specifically is not yet available — there is no benchmark on /check/voxtral/rtx-4080-super and no first-party 4080 SUPER measurement in the model's discussion or issue trackers. The 4080 SUPER's 736 GB/s memory bandwidth is well ahead of lower-tier Ada cards on memory-bound audio decoding, and Voxtral's 3B parameter count keeps inference latency manageable. Submit a benchmark via /contribute once you've measured it.
- Quality notes: Mistral's announcement positions Voxtral as outperforming Whisper large-v3 on speech transcription. A community report from the official HF discussion notes transcription quality can slip on noisy audio or recordings that mix multiple languages, where Whisper large-v3 still feels more robust.
- License: Apache-2.0 (model card).
For the full benchmark data once community submissions land, see /check/voxtral/rtx-4080-super.
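To produce the speed number that is still missing above, a simple timing harness is enough. This is a minimal sketch: `measure_tokens_per_second` is a hypothetical helper, and `generate_fn` stands for any zero-argument callable you write that runs one generation and returns the number of newly generated tokens.

```python
import time


def measure_tokens_per_second(generate_fn, runs=3):
    """Call generate_fn several times and return the best tokens/second seen.

    generate_fn is a hypothetical zero-argument callable that performs one
    generation and returns how many new tokens it produced.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, new_tokens / elapsed)
    return best
```

With the Transformers example, `generate_fn` could wrap `model.generate(...)` and return `outputs.shape[1] - inputs.input_ids.shape[1]`. Calling `torch.cuda.synchronize()` before returning is a reasonable precaution so pending GPU work is included in the timing, and discarding a first warm-up run keeps one-off startup costs out of the result.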
Troubleshooting
vLLM consumes far more than 9.5 GB
Reported on the HF model discussion: vLLM can grow to nearly 40 GB of VRAM because of its KV-cache reservation policy. A separate vLLM-side bug report (vllm-project/vllm#38233) tracks 16 GB users hitting encoder_cache saturation on the Realtime Voxtral variant — a related cache-reservation failure mode rather than a report against this exact 3B-2507 release.
The ~9.5 GB figure on the Mistral card describes resident weight memory, not vLLM's pre-allocated KV reservation. To bring vLLM into a 16 GB budget on the 4080 SUPER, cap --max-model-len (the 32k default reserves the full audio-encoder cache up front) and consider --gpu-memory-utilization 0.85 to leave activation headroom. The exact ceiling depends on your concurrent-stream count: start with --max-model-len 8192 and raise it in steps, backing off to the last value that worked once you hit OOM. Alternatively, stick with the Transformers backend for single-user desktop work.
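The capped launch command and a stepping schedule can be generated as follows. This sketch only assembles command lines from the flags already named in this guide; the helper names are hypothetical, the doubling schedule is one arbitrary way to step upward, and 32768 is assumed as the exact value behind "32k".

```python
import shlex

MODEL = "mistralai/Voxtral-Mini-3B-2507"


def build_serve_command(max_model_len=8192, gpu_memory_utilization=0.85, model=MODEL):
    """Return the vllm serve argv with the context cap and memory fraction applied."""
    return [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


def candidate_lengths(start=8192, ceiling=32768):
    """Doubling schedule of context caps to try, from start up to the ceiling."""
    lengths = []
    n = start
    while n <= ceiling:
        lengths.append(n)
        n *= 2
    return lengths


if __name__ == "__main__":
    # Print one command per candidate; try them in order and keep the last that starts.
    for n in candidate_lengths():
        print(shlex.join(build_serve_command(max_model_len=n)))
```

Returning a list instead of a single string means the command can be passed to `subprocess.run` unchanged, without shell quoting concerns.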
ImportError or version mismatch on import
Voxtral support was added in transformers 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old; upgrade with pip install -U transformers.
GGUF / llama.cpp builds
Per the HF discussion thread, GGUF conversion currently only covers decoder-only architectures, so the full Voxtral with its audio encoder cannot be converted yet; a text-only quant builds but is useless for transcription. Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror (a community quantization, not an official Mistral release) covers the "smaller weights" use case without requiring GGUF.
Should I use the 24B variant instead?
No — not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card quotes ~55 GB of GPU RAM in bf16/fp16 — roughly 3.4× the RTX 4080 SUPER's 16 GB envelope, and out of reach even with FP8. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.
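The arithmetic behind that answer is short enough to script. The sketch below uses the bf16 figures quoted from the two model cards; halving the footprint for FP8 is an assumption borrowed from the roughly 50% reduction the FP8 mirror documents for the 3B model, and the helper names are hypothetical.

```python
CARD_VRAM_GB = 16.0  # RTX 4080 SUPER

# bf16/fp16 GPU RAM figures quoted from the model cards.
MODELS_BF16_GB = {
    "Voxtral Mini 3B": 9.5,
    "Voxtral Small 24B": 55.0,
}

# Assumption: FP8 roughly halves the footprint, as documented for the 3B mirror.
FP8_FACTOR = 0.5


def fits(required_gb, vram_gb=CARD_VRAM_GB):
    """True when the quoted footprint is within the card's VRAM."""
    return required_gb <= vram_gb


def overshoot(required_gb, vram_gb=CARD_VRAM_GB):
    """Ratio of the required memory to the available VRAM."""
    return required_gb / vram_gb


if __name__ == "__main__":
    for name, bf16_gb in MODELS_BF16_GB.items():
        fp8_gb = bf16_gb * FP8_FACTOR
        print(
            f"{name}: bf16 {bf16_gb} GB ({overshoot(bf16_gb):.1f}x VRAM, "
            f"fits={fits(bf16_gb)}), FP8 ~{fp8_gb} GB (fits={fits(fp8_gb)})"
        )
```

These are weight-level estimates only; KV cache and activations come on top, so a model that barely fits on paper may still run out of memory in practice.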