What You'll Build
A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 5070 Ti. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice in eight languages — unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.
Hardware data: RTX 5070 Ti (16 GB VRAM) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data
ℹ️ Not a TTS model. Voxtral understands audio; it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is listed in our tts vertical because the wider catalogue groups audio-input and audio-output models together; the model card is explicit that this is speech-to-text plus audio Q&A.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM consumer card | RTX 5070 Ti (16 GB GDDR7, Blackwell GB203 sm_120) |
| RAM | 16 GB | — |
| Storage | ~10 GB for weights + cache | ~9.4 GB bf16 weights per HF Files tab |
| Software | Python 3.10+, PyTorch with CUDA (cu128), transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 | — |
Installation
1. Install a Blackwell-ready PyTorch (cu128)
The RTX 5070 Ti is a Blackwell card (GB203 die, compute capability sm_120, 8960 CUDA cores, ~896 GB/s memory bandwidth, 300 W). Install a PyTorch build that ships sm_120 kernels — the CUDA 12.8 wheel — before anything else, so the audio tower and decoder both get native Blackwell kernels:
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128
Unlike some Blackwell recipes, Voxtral does not need an attn_implementation override: the model card's reference snippet calls from_pretrained(...) without forcing flash_attention_2, so PyTorch's built-in SDPA kernels (which have full sm_120 coverage) are used by default. There is no FlashAttention-2 sm_120 wheel gap to work around here.
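To confirm that the installed wheel actually targets the card, you can compare the GPU's compute capability with the architectures compiled into the build. The sketch below is a minimal check and is not taken from the model card; the helper names are hypothetical, and it assumes `torch.cuda.get_arch_list()` reports entries in the `sm_XY` form.

```python
def build_supports_gpu(arch_list, capability):
    """Return True if the build lists native kernels for this compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_local_gpu():
    """Report whether the local PyTorch build covers GPU 0 (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch cannot see a CUDA device"
    capability = torch.cuda.get_device_capability(0)  # (12, 0) corresponds to sm_120
    archs = torch.cuda.get_arch_list()
    if build_supports_gpu(archs, capability):
        verdict = "native kernels present"
    else:
        verdict = "no native kernels for this GPU"
    return f"{torch.cuda.get_device_name(0)}: capability {capability}, {verdict}"


print(describe_local_gpu())
```

If the verdict is negative on a 5070 Ti, the most likely cause is a wheel from a different CUDA index, so reinstalling from the cu128 index is the first thing to try.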
2. Install Transformers and mistral-common
The Transformers integration shipped in v4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:
pip install -U transformers
pip install --upgrade "mistral-common[audio]"
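Confirm the installed mistral-common version (this prints the version only; it does not by itself prove that the audio extras were installed):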
python -c "import mistral_common; print(mistral_common.__version__)"
You should see 1.8.1 or newer; the model card gives 1.8.1 as the minimum mistral-common version.
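Both version floors can be checked in one pass. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the simplified parser treats pre-release tags such as `.dev0` as equal to the release they precede, which is an assumption rather than full PEP 440 handling.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated in this guide.
FLOORS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def parse_version(text):
    """Keep the leading numeric components only: '4.54.0.dev0' -> (4, 54, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_floor(installed, floor):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report():
    """Return one status line per required package."""
    lines = []
    for name, floor in FLOORS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            lines.append(f"{name}: not installed (need >= {floor})")
            continue
        status = "ok" if meets_floor(installed, floor) else f"too old, need >= {floor}"
        lines.append(f"{name} {installed}: {status}")
    return lines


print("\n".join(report()))
```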
3. (Optional) Install vLLM for high-throughput serving
vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 16 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API:
uv pip install -U "vllm[audio]" --system
This pulls vllm >= 0.10.0 and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will likely need on a 16 GB card.
4. (Optional) Use the FP8 mirror to halve VRAM
For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card, weights are quantized with a "symmetric static per-channel scheme" and activations with a "symmetric dynamic per-token scheme" (linear layers only):
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The mirror's card states the FP8 optimization is "reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x)", plus a matching ~50% disk-size reduction. Blackwell (RTX 5070 Ti, GB203 sm_120) has native FP8 tensor cores, so it sees the throughput uplift in addition to the memory saving. See the RedHatAI card for the full quantization recipe.
Running
Transformers — single-file audio Q&A
The canonical example adapted from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:
import torch
from transformers import AutoProcessor, VoxtralForConditionalGeneration

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (chat template + audio features) and the bf16 weights.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn: an audio clip followed by a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation).to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
Per the model card, "With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding", so a single clip can cover half to two-thirds of an hour-long meeting.
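For recordings longer than those limits, one option is to split the audio into windows that each fit. The sketch below only plans the window boundaries from a known duration; the function name is hypothetical, and cutting at hard time boundaries (which can split a word) is a simplifying assumption, not something the model card prescribes.

```python
# Per-task limits quoted from the model card, in seconds.
LIMITS_S = {"transcription": 30 * 60, "understanding": 40 * 60}


def plan_chunks(duration_s, task="transcription"):
    """Return (start, end) second offsets covering the clip in limit-sized windows."""
    limit = LIMITS_S[task]
    chunks = []
    start = 0
    while start < duration_s:
        end = min(start + limit, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


# A one-hour meeting needs two transcription passes.
for start, end in plan_chunks(60 * 60, "transcription"):
    print(f"{start // 60:>3} min -> {end // 60:>3} min")
```

Each window would then be cut out with an audio tool of your choice and passed to the model as its own clip.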
vLLM — server mode
For batched inference or multi-client setups:
vllm serve mistralai/Voxtral-Mini-3B-2507 \
--tokenizer_mode mistral --config_format mistral --load_format mistral
The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 16 GB card you will almost certainly need to cap context — see Troubleshooting.
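A request body for the base64 route might look like the sketch below. It uses only the standard library; the `input_audio` content part follows the OpenAI chat-completions convention, and whether your vLLM version accepts exactly this shape is an assumption to verify against its documentation. The helper names and the `/v1/chat/completions` URL default are illustrative.

```python
import base64
import json
import urllib.request


def build_audio_request(audio_bytes, audio_format, question,
                        model="mistralai/Voxtral-Mini-3B-2507"):
    """Build a chat-completions payload with one audio part and one text part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumed content-part shape (OpenAI-style input_audio).
                    {"type": "input_audio",
                     "input_audio": {"data": encoded, "format": audio_format}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 500,
    }


def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, usage would be along the lines of `send(build_audio_request(open("your-clip.wav", "rb").read(), "wav", "Summarise this clip."))`.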
Results
- VRAM usage: The model card states "Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16" (per the card's vLLM serving section). Independent corroboration from the Transformers runtime: a user reported on the official HF discussion that "I’m running the Hugging Face version on a 12GB GPU with no problem VRAM sits around 10GB during normal use." The 16 GB on the RTX 5070 Ti leaves comfortable headroom for long audio and the KV growth that comes with the 30-minute transcription window.
- Speed: Empirical speed for this exact GPU is not yet available. No source publishes an RTX 5070 Ti-named Voxtral measurement, and we do not extrapolate one from a different card — so we omit a number here rather than guess. Submit a benchmark via /contribute once you've measured it.
- Quality notes: Mistral's announcement positions Voxtral as a state-of-the-art open speech-understanding model with multilingual transcription, translation and audio Q&A. A community report from the HF discussion notes that transcription "starts to slip a bit when the audio is noisy or mixes multiple languages" and that "Whisper Large v3 still feels a bit more robust in those tricky cases" (HF discussion).
- License: Apache-2.0.
For the full benchmark data once community submissions land, see /check/voxtral/rtx-5070-ti.
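If you want to measure speed yourself, a small timing wrapper is enough to get tokens per second and a real-time factor. This is a minimal sketch with hypothetical names; the `generate` callable is a stand-in that must return the number of newly generated tokens, and the metric definitions are our assumption rather than a required submission format.

```python
import time


def measure(generate, audio_seconds, clock=time.perf_counter):
    """Time one generation call and derive throughput figures.

    generate: zero-argument callable returning the count of new tokens.
    audio_seconds: length of the input clip in seconds.
    """
    start = clock()
    new_tokens = generate()
    elapsed = clock() - start
    return {
        "elapsed_s": elapsed,
        "tokens_per_s": new_tokens / elapsed,
        # Below 1.0 means the clip is processed faster than it plays.
        "real_time_factor": elapsed / audio_seconds,
    }


def fake_generate():
    """Stand-in workload so the sketch runs without a GPU."""
    time.sleep(0.05)
    return 100


print(measure(fake_generate, audio_seconds=60.0))
```

With the Transformers example, `generate` would wrap `model.generate(**inputs, max_new_tokens=500)` and return `outputs.shape[1] - inputs.input_ids.shape[1]`; calling `torch.cuda.synchronize()` before the wrapper returns keeps asynchronous GPU work inside the timed region.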
Troubleshooting
vLLM consumes far more than 9.5 GB
A user on the HF model discussion reported that vLLM "takes up almost 40GB VRAM for me". This is expected: by default vLLM claims a fixed fraction of total VRAM (--gpu-memory-utilization, 0.9 unless overridden) and pre-allocates whatever remains after loading the weights as KV cache, which is why it overshoots the ~9.5 GB figure on the Mistral card. To run vLLM within the 5070 Ti's 16 GB, cap --max-model-len so that a full-length sequence fits in the KV cache that is left (at the 32k default, startup may fail if it does not), and consider --gpu-memory-utilization 0.85 to leave headroom for the desktop and other processes. The workable ceiling depends on your concurrent-stream count: start with --max-model-len 8192 and raise it until you hit OOM, or stick with the Transformers backend for single-user desktop work.
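The memory budget can be estimated with simple arithmetic before launching the server. The sketch below treats the model card's ~9.5 GB as a fixed cost and ignores activation and CUDA-graph overhead, which is a simplifying assumption; the helper names are hypothetical, and the command only combines flags already mentioned in this guide.

```python
def kv_cache_budget_gb(total_vram_gb, gpu_memory_utilization, fixed_cost_gb):
    """Rough VRAM left for KV cache once vLLM's share and the model are accounted for."""
    return total_vram_gb * gpu_memory_utilization - fixed_cost_gb


def serve_command(max_model_len, gpu_memory_utilization):
    """Assemble the vllm serve invocation with a context cap and a memory cap."""
    return [
        "vllm", "serve", "mistralai/Voxtral-Mini-3B-2507",
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


budget = kv_cache_budget_gb(total_vram_gb=16, gpu_memory_utilization=0.85, fixed_cost_gb=9.5)
print(f"Approximate KV-cache budget: {budget:.1f} GB")
print(" ".join(serve_command(8192, 0.85)))
```

The estimate is only a starting point: if the server refuses to start or runs out of memory, lower `--max-model-len` further before touching the utilization fraction.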
ImportError or version mismatch on import
Voxtral was added in transformers >= 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old — upgrade with pip install -U transformers.
GGUF / llama.cpp builds
There is no full GGUF build of Voxtral yet. Per the HF discussion thread, a community user explains that GGUF currently only supports decoder-only architectures like LLaMA, so the audio encoder cannot be converted; only a text-only quant exists, which defeats the purpose of an audio model. Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror covers the "smaller weights" use case without requiring GGUF.
Should I use the 24B variant instead?
No, not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card states "Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16", which is about 3.4× the 5070 Ti's 16 GB and still out of reach at roughly half that footprint with FP8. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.
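The fit question reduces to comparing each variant's stated requirement with the card's VRAM. The sketch below uses the bf16/fp16 figures quoted from the two model cards; the FP8 rows simply halve them, following the "approximately 50%" claim on the RedHatAI Mini mirror, and applying that same ratio to the 24B model is our assumption.

```python
VRAM_GB = 16.0  # RTX 5070 Ti

# bf16/fp16 requirements as stated on the respective model cards.
BF16_GB = {"Voxtral-Mini-3B-2507": 9.5, "Voxtral-Small-24B-2507": 55.0}
FP8_RATIO = 0.5  # assumed from the "approximately 50%" figure on the FP8 mirror


def fits(required_gb, vram_gb=VRAM_GB):
    """True if the stated requirement is within the card's VRAM."""
    return required_gb <= vram_gb


def fit_table(vram_gb=VRAM_GB):
    """Return (label, required_gb, fits) rows for bf16 and estimated FP8."""
    rows = []
    for name, bf16 in BF16_GB.items():
        rows.append((f"{name} bf16", bf16, fits(bf16, vram_gb)))
        fp8 = bf16 * FP8_RATIO
        rows.append((f"{name} fp8 (estimate)", fp8, fits(fp8, vram_gb)))
    return rows


for label, required, ok in fit_table():
    print(f"{label:<40} {required:>6.2f} GB  {'fits' if ok else 'does not fit'}")
```

These figures cover the stated requirement only; KV cache for long audio comes on top, so a variant that barely fits on paper may still run out of memory in practice.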