Voxtral Mini 3B on RTX 4080: local speech understanding in ~9.5 GB

tts · intermediate · 10GB+ VRAM · May 30, 2026
prerequisites
  • NVIDIA RTX 4080 (16 GB VRAM) or any consumer GPU with 12 GB+ VRAM
  • Python 3.10+
  • transformers >= 4.54.0 and mistral-common[audio] >= 1.8.1, on a CUDA-capable PyTorch build

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 4080. The model handles speech transcription, speech translation, audio Q&A, summarization, and function calling from voice. Unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM: it consumes audio and produces text.

Hardware data: RTX 4080 (16 GB VRAM, Ada sm_89) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio understanding.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM consumer card | RTX 4080 (16 GB, Ada sm_89)
RAM | 16 GB |
Storage | ~10 GB for weights + cache | ~9.4 GB of bf16 safetensors per the HF Files tab
Software | Python 3.10+, PyTorch with CUDA, transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 |

Installation

1. Install Transformers and mistral-common

Voxtral runs natively in Transformers from version 4.54.0 onward. Both packages are required because Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Check the installed mistral-common version:

python -c "import mistral_common; print(mistral_common.__version__)"

You should see 1.8.1 or newer; the model card lists 1.8.1 as the minimum version for the audio tokenizer.
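The same check can be scripted for both packages at once. The following is a minimal sketch, assuming plain numeric version strings; `FLOORS`, `parse`, `meets_floor`, and `check_environment` are illustrative names, not part of either library:

```python
from importlib import metadata

# Version floors quoted in this recipe (taken from the model card).
FLOORS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def parse(version: str) -> tuple:
    """Turn '4.54.0' into (4, 54, 0); suffixes such as 'rc1' or '.dev0' are ignored."""
    parts = []
    for chunk in version.split(".")[:3]:
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '4.54' to (4, 54, 0)
        parts.append(0)
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    """True when the installed version is at or above the floor."""
    return parse(installed) >= parse(floor)


def check_environment() -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, floor in FLOORS.items():
        try:
            report[name] = meets_floor(metadata.version(name), floor)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    labels = {True: "ok", False: "too old", None: "missing"}
    for name, ok in check_environment().items():
        print(f"{name}: {labels[ok]}")
```

Running it before loading the model catches a stale transformers install early, without waiting for the import to fail.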

Unlike Blackwell GPUs (sm_120), no special CUDA wheel selection is required for the RTX 4080. The default pip install torch wheels already run on Ada (sm_89), including the FlashAttention-2 kernels built into PyTorch's scaled-dot-product attention, so the standard install path works out of the box.

2. (Optional) Install vLLM for high-throughput serving

vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 16 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API:

uv pip install -U "vllm[audio]" --system

This pulls a recent vLLM and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will likely need on a 16 GB card.

3. (Optional) Use the FP8 mirror to halve VRAM

For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card it ships symmetric-per-channel weights and dynamic-per-token activations on the linear layers only (the audio tower and multi-modal projector stay in full precision):

vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The mirror documents roughly a 50% reduction in both GPU memory and disk size versus the bf16 release. The RTX 4080's 4th-generation Ada tensor cores have native E4M3/E5M2 FP8 support, so the FP8 path is both smaller and faster on this card. See the RedHatAI card for the full quantization recipe.

Running

Transformers — audio Q&A

This example, adapted from the Voxtral model card, loads the model in bf16 and feeds it an audio clip plus a text question:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (chat template + audio tokenizer) and the bf16 weights onto the GPU.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn carrying an audio clip and a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

# Tokenize text and audio together, then move the tensors to the GPU in bf16.
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

# Generate, then decode only the newly produced tokens (the prompt is sliced off).
outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Per the model card's Key Features, Voxtral has a 32k-token context length and handles audio up to 30 minutes long for transcription, or 40 minutes for understanding (model card, Mistral announcement).
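A quick pre-flight check against those limits can save a wasted run on an over-long recording. This is a minimal sketch that reads the length of an uncompressed WAV file with the standard library only; MP3 and other formats would need a third-party decoder. `MAX_MINUTES`, `wav_duration_seconds`, and `fits_window` are hypothetical helper names:

```python
import wave

# Limits quoted from the model card: minutes of audio per task.
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def wav_duration_seconds(path: str) -> float:
    """Length of an uncompressed WAV file, computed from its header."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def fits_window(duration_s: float, task: str) -> bool:
    """True when the clip is within the documented limit for `task`."""
    return duration_s <= MAX_MINUTES[task] * 60
```

For example, `fits_window(wav_duration_seconds("meeting.wav"), "transcription")` tells you whether a recording should be split before transcription; the file name here is a placeholder.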

vLLM — server mode

For batched inference or multi-client setups:

vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 16 GB card you will almost certainly need to cap context — see Troubleshooting.
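As a rough illustration of the base64 route, the sketch below builds a chat-completions payload and posts it with the standard library. The `input_audio` content part follows the OpenAI chat-completions schema, and it is an assumption here that your vLLM version accepts it for Voxtral; the `/v1/chat/completions` path and both helper names are likewise illustrative, so check the vLLM documentation for your release:

```python
import base64
import json
import urllib.request


def build_audio_request(audio_bytes: bytes, question: str,
                        model: str = "mistralai/Voxtral-Mini-3B-2507",
                        audio_format: str = "mp3") -> dict:
    """Build a chat-completions payload with one audio part and one text part."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Assumed schema: OpenAI-style input_audio part carrying base64 data.
                {"type": "input_audio",
                 "input_audio": {"data": encoded, "format": audio_format}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 500,
    }


def post_request(payload: dict,
                 url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """Send the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `post_request(build_audio_request(open("your-clip.mp3", "rb").read(), "Summarise this clip."))` would return the completion as a dict.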

Results

  • VRAM usage: Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16, per the model card (the figure is published in the card's vLLM serving section). The Transformers runtime corroborates this independently: a user on the official HF discussion reported VRAM sitting around 10 GB during normal use on a 12 GB GPU. The bf16 weights are ~9.4 GB on disk across two safetensors shards (HF Files tab), consistent with the ~9.5 GB resident figure. The RTX 4080's full 16 GB leaves comfortable room for long audio and the KV growth that comes with the 30-minute transcription window.
  • Speed: Empirical speed for the RTX 4080 specifically is not yet available — there is no benchmark on /check/voxtral/rtx-4080 and no first-party 4080 measurement in the model's discussion or issue trackers. The 4080's 716.8 GB/s memory bandwidth is well ahead of lower-tier Ada cards on memory-bound audio decoding, and Voxtral's 3B parameter count keeps inference latency manageable. Submit a benchmark via /contribute once you've measured it.
  • Quality notes: Mistral's announcement positions Voxtral as outperforming Whisper large-v3 on speech transcription. A community report from the official HF discussion notes transcription quality can slip on noisy audio or recordings that mix multiple languages, where Whisper large-v3 still feels more robust.
  • License: Apache-2.0 (model card).
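Since no RTX 4080 speed figure exists yet, here is one way you might measure it yourself. This is a minimal sketch, not the site's benchmark harness: `benchmark` times any callable that returns the number of newly generated tokens, and `peak_vram_gb` reports PyTorch's own peak allocation, which is usually lower than what nvidia-smi shows because it excludes the CUDA context and allocator overhead. Both function names are illustrative:

```python
import time
from statistics import median


def benchmark(generate, runs: int = 3) -> dict:
    """Time `generate()` several times; it must return the count of new tokens."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(new_tokens / elapsed)
    return {"runs": runs, "median_tokens_per_s": median(rates)}


def peak_vram_gb():
    """Peak memory allocated by PyTorch on the current GPU in GB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.max_memory_allocated() / 1e9


# Usage with the Transformers example (assumes `model` and `inputs` already exist):
# def run():
#     out = model.generate(**inputs, max_new_tokens=500)
#     return out.shape[1] - inputs.input_ids.shape[1]
# print(benchmark(run), peak_vram_gb())
```

Discarding the first run as warm-up, or raising `runs`, gives a steadier median.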

For the full benchmark data once community submissions land, see /check/voxtral/rtx-4080.

Troubleshooting

vLLM consumes far more than 9.5 GB

Reported on the HF model discussion: vLLM can grow to nearly 40 GB of VRAM because of its KV-cache reservation policy. A separate vLLM-side bug report (vllm-project/vllm#38233) tracks 16 GB users hitting encoder_cache saturation on the Realtime Voxtral variant — a related cache-reservation failure mode rather than a report against this exact 3B-2507 release.

The ~9.5 GB figure on the Mistral card describes resident weight memory, not vLLM's pre-allocated KV reservation. To bring vLLM into a 16 GB budget on the 4080, cap --max-model-len (the 32k default reserves the full audio-encoder cache up front) and consider --gpu-memory-utilization 0.85 to leave activation headroom. The exact ceiling depends on your concurrent-stream count — start with --max-model-len 8192 and raise it until you hit OOM, or stick with the Transformers backend for single-user desktop work.
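The capped launch described above can be assembled programmatically, which helps when scripting the raise-until-OOM search. This is a sketch under stated assumptions: the flag values are the starting points suggested in this section, the 32768 ceiling is an assumed token count for the 32k context, and `vllm_serve_command` and `context_ladder` are hypothetical helpers:

```python
import shlex


def vllm_serve_command(model: str = "mistralai/Voxtral-Mini-3B-2507",
                       max_model_len: int = 8192,
                       gpu_memory_utilization: float = 0.85) -> list:
    """Return the argv for a context-capped `vllm serve` launch."""
    return [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


def context_ladder(start: int = 8192, ceiling: int = 32768) -> list:
    """Doubling sequence of context lengths to try, from `start` up to `ceiling`."""
    steps = []
    length = start
    while length <= ceiling:
        steps.append(length)
        length *= 2
    return steps


if __name__ == "__main__":
    # Print one ready-to-paste command per candidate context length.
    for length in context_ladder():
        print(shlex.join(vllm_serve_command(max_model_len=length)))
```

Try the printed commands in order and keep the last one that starts without an out-of-memory error.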

ImportError or version mismatch on import

Voxtral support landed in transformers 4.54.0 and needs mistral-common[audio] >= 1.8.1; the HF card calls out both explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old: upgrade with pip install -U transformers.

GGUF / llama.cpp builds

Per the HF discussion thread, GGUF conversion currently only covers decoder-only architectures, so the full Voxtral with its audio encoder cannot be converted yet. A text-only quant builds but is useless for transcription. Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror covers the "smaller weights" use case without requiring GGUF.

Should I use the 24B variant instead?

No — not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card quotes ~55 GB of GPU RAM in bf16/fp16 — roughly 3.4× the RTX 4080's 16 GB envelope, and out of reach even with FP8. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.
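The arithmetic behind that answer is short enough to check directly. In this sketch the bf16 figures come from the two model cards, while the FP8 estimate assumes the roughly 50% reduction documented for the Mini mirror would carry over to the 24B model, which is not confirmed by the source:

```python
RTX_4080_VRAM_GB = 16.0
MINI_3B_BF16_GB = 9.5     # model card figure for Voxtral Mini 3B
SMALL_24B_BF16_GB = 55.0  # model card figure for Voxtral Small 24B


def fits(required_gb: float, vram_gb: float = RTX_4080_VRAM_GB) -> bool:
    """True when the requirement fits in the card's VRAM, ignoring runtime overhead."""
    return required_gb <= vram_gb


# How far the 24B bf16 requirement overshoots the card: 55 / 16 = 3.4375.
ratio = SMALL_24B_BF16_GB / RTX_4080_VRAM_GB

# Assumption: FP8 halves the footprint, as the Mini mirror reports for the 3B model.
small_fp8_estimate_gb = SMALL_24B_BF16_GB * 0.5

print(f"24B bf16 is {ratio:.1f}x the card")
print(f"24B FP8 estimate {small_fp8_estimate_gb} GB fits: {fits(small_fp8_estimate_gb)}")
print(f"3B bf16 fits: {fits(MINI_3B_BF16_GB)}")
```

Even under the optimistic halving assumption, the 24B model needs about 27.5 GB, well past 16 GB.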