Voxtral Mini 3B on RTX 5070: local speech understanding in ~9.5 GB

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 5070. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice in eight languages — unlike pure text-to-speech models (Kokoro, VoxCPM), Voxtral is a multimodal audio+text LLM that consumes audio and produces text.

Hardware data: RTX 5070 (12 GB VRAM) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio Q&A.

Mistral's card puts the 3B model at "~9.5 GB of GPU RAM in bf16 or fp16". That fits inside the RTX 5070's 12 GB, but a desktop card driving a display typically exposes only ~10.5–11.3 GB usable, so expect roughly 1.0–1.8 GB of margin: slim but workable (see Results for the on-card report).
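
The margin can be worked out directly from the figures quoted in this guide. The sketch below is plain arithmetic; the function name is illustrative, and the result ignores KV cache growth, which is covered later.

```python
# Figures quoted in this guide; all values in GB.
MODEL_PEAK_GB = 9.5             # model card: bf16/fp16 requirement
USABLE_RANGE_GB = (10.5, 11.3)  # typical usable VRAM with a display attached


def headroom_gb(usable_gb: float, model_gb: float = MODEL_PEAK_GB) -> float:
    """Return the VRAM left over after the model's stated footprint."""
    return round(usable_gb - model_gb, 2)


low, high = (headroom_gb(usable) for usable in USABLE_RANGE_GB)
print(f"Headroom: {low:.1f} to {high:.1f} GB")
```

Whatever else is running on the GPU (a browser, a compositor, a second model) comes out of that same headroom.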

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM consumer card	RTX 5070 (12 GB GDDR7, Blackwell GB205 sm_120)
RAM	16 GB	—
Storage	~10 GB for weights + cache	~9.4 GB bf16 weights per HF Files tab
Software	Python 3.10+, PyTorch with CUDA (cu128), `transformers >= 4.54.0`, `mistral-common[audio] >= 1.8.1`	—

Installation

1. Install a Blackwell-ready PyTorch (cu128)

The RTX 5070 is a Blackwell card (GB205 die, compute capability sm_120, 6144 CUDA cores, ~672 GB/s memory bandwidth, 250 W). Install a PyTorch build that ships sm_120 kernels — the CUDA 12.8 wheel — before anything else, so the audio tower and decoder both get native Blackwell kernels:

pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128

Unlike some Blackwell recipes, Voxtral does not need an attn_implementation override: the model card's reference snippet calls from_pretrained(...) without forcing flash_attention_2, so PyTorch's built-in SDPA kernels (which have full sm_120 coverage) are used by default. There is no FlashAttention-2 sm_120 wheel gap to work around here — the cu128 wheel is the only Blackwell-specific step.
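
After installing, a quick check along these lines can confirm that the wheel actually carries kernels for the card. This is a minimal sketch: the helper name describe_gpu is ours, and it only reads what PyTorch reports, so treat the output as a hint and not as proof that every operator has a native kernel.

```python
def describe_gpu() -> str:
    """Summarise what the installed PyTorch build can see of the GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    arch = f"sm_{major}{minor}"
    # get_arch_list() reports the architectures this wheel was compiled for.
    has_kernels = arch in torch.cuda.get_arch_list()
    return (
        f"{name}: compute capability {arch}, "
        f"CUDA {torch.version.cuda}, native kernels in wheel: {has_kernels}"
    )


print(describe_gpu())
```

On an RTX 5070 with the cu128 wheel, the capability reported should be sm_120.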

2. Install Transformers and mistral-common

The Transformers integration shipped in transformers >= 4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Check which version of mistral-common is installed:

python -c "import mistral_common; print(mistral_common.__version__)"

You should see 1.8.1 or newer; the model card names 1.8.1 as the minimum version for the audio tokenizer.
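
Both version floors can be checked in one pass. The sketch below uses only the standard library; the helper names and the simplified version parsing are our own assumptions (it compares the first three numeric components and ignores pre-release suffixes), so it is a convenience check and not a replacement for pip's resolver.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions named in this guide.
MINIMUMS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def parse(v: str) -> tuple:
    """Turn '4.54.0' into (4, 54, 0), ignoring suffixes such as 'rc1'."""
    numbers = []
    for piece in v.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    return parse(installed) >= parse(minimum)


def check_environment(minimums: dict = MINIMUMS) -> dict:
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[package] = "ok"
        else:
            report[package] = f"too old ({installed} < {minimum})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```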

3. (Optional) Install vLLM for high-throughput serving

vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 12 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API:

uv pip install -U "vllm[audio]" --system

This pulls vllm >= 0.10.0 and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will need on a 12 GB card.

Running

Transformers — single-file audio Q&A

The canonical example from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

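# Tokenise the text and encode the audio into model-ready tensors.
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
# Generate up to 500 new tokens, then decode only the tokens after the prompt.
outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])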
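
If you run the same prompt over many files, it can help to build the conversation structure with a small helper. The function below is a hypothetical convenience wrapper, not part of Transformers; it simply reproduces the dictionary layout used above. Passing more than one clip in a single turn is an assumption you should confirm against the model card before relying on it.

```python
from typing import Dict, List, Sequence


def build_conversation(audio_paths: Sequence[str], prompt: str) -> List[Dict]:
    """Build a single user turn: every clip first, then the text instruction."""
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    content = [{"type": "audio", "path": str(path)} for path in audio_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


conversation = build_conversation(
    ["your-clip.mp3"], "Transcribe and summarise this clip."
)
print(conversation)
```

The result can be passed to processor.apply_chat_template(...) exactly like the hand-written list.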

Per the model card, "With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding", so a single clip can cover half to two-thirds of an hour-long meeting. On a 12 GB card, keep an eye on memory as the audio window grows: the resident-weights figure is the floor, and the KV cache scales with the context you actually fill.
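
To see how the KV cache scales, the standard estimate is two tensors (keys and values) per layer, each holding one vector per token per KV head. The sketch below applies that formula. The layer count, KV-head count and head dimension are PLACEHOLDER values chosen for illustration, not figures from the model card; read the real ones from the model's config.json before drawing conclusions.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys + values: 2 tensors per layer, each tokens x kv_heads x head_dim.

    bytes_per_value=2 corresponds to bf16/fp16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# PLACEHOLDER shape values for illustration only (an assumption, not card data).
EXAMPLE_SHAPE = dict(layers=30, kv_heads=8, head_dim=128)

for tokens in (4_096, 16_384, 32_768):
    gib = kv_cache_bytes(tokens, **EXAMPLE_SHAPE) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB of KV cache")
```

The point is the linear growth with filled context, not the specific numbers: doubling the audio you feed in roughly doubles the cache.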

vLLM — server mode

For batched inference or multi-client setups:

vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 12 GB card you will need to cap context — see Troubleshooting.
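
A request with base64 audio can be assembled with the standard library alone. This is a sketch under assumptions: it uses the OpenAI-style input_audio content part, which we assume the running vLLM build accepts, and the helper names and the max_tokens value are ours. Nothing is sent until you call send(...) against a live server.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # default port noted above
MODEL = "mistralai/Voxtral-Mini-3B-2507"


def build_audio_request(audio_bytes: bytes, audio_format: str, question: str) -> dict:
    """Build a chat-completions payload carrying one base64-encoded clip."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                # ASSUMPTION: OpenAI-style "input_audio" part is accepted by the server.
                {"type": "input_audio",
                 "input_audio": {"data": encoded, "format": audio_format}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 500,
    }


def send(payload: dict) -> str:
    """POST the payload and return the assistant's text reply."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
# with open("your-clip.mp3", "rb") as clip:
#     print(send(build_audio_request(clip.read(), "mp3", "Summarise this clip.")))
```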

Results

VRAM usage: The model card states "Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16." (per the card's vLLM serving section). That figure is directly relevant here: a community user reported on the official HF discussion that "I’m running the Hugging Face version on a 12GB GPU with no problem VRAM sits around 10GB during normal use." — the RTX 5070 is a 12 GB card, so this is a same-VRAM-class report. The margin is real but slim once a display claims a slice of VRAM, so close other GPU apps before long transcription jobs.
Speed: Empirical speed for this exact GPU is not yet available. No source publishes an RTX 5070-named Voxtral measurement, and we do not extrapolate one from a different card — the RTX 5070 has ~25% less memory bandwidth and ~31% fewer CUDA cores than the RTX 5070 Ti, so a Ti number would not transfer cleanly. We omit a speed figure here rather than guess; submit a benchmark via /contribute once you've measured it.
Quality notes: Mistral's announcement positions Voxtral as a state-of-the-art open speech-understanding model with multilingual transcription, translation and audio Q&A. A community report from the HF discussion notes that transcription "starts to slip a bit when the audio is noisy or mixes multiple languages" and that "Whisper Large v3 still feels a bit more robust in those tricky cases" (HF discussion).
License: Apache-2.0.

For the full benchmark data once community submissions land, see /check/voxtral/rtx-5070.
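
If you want to measure throughput yourself, a timing loop like the following is one way to do it. It is a generic sketch, not an official benchmark harness: tokens_per_second is a hypothetical helper that times any callable returning the number of newly generated tokens, with a warm-up pass so one-off start-up costs are excluded.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Time a callable that performs one generation and returns its new-token count."""
    for _ in range(warmup):
        generate()  # first call pays one-off costs such as cache allocation
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total_tokens / elapsed


# Example callable for the Transformers path (assumes `model` and `inputs`
# from the Running section are already defined):
# def run_once() -> int:
#     out = model.generate(**inputs, max_new_tokens=500)
#     torch.cuda.synchronize()  # make sure GPU work has finished before timing stops
#     return out.shape[1] - inputs.input_ids.shape[1]
#
# print(f"{tokens_per_second(run_once):.1f} tokens/s")
```

Report the clip length and max_new_tokens alongside the figure, since both affect the result.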

Troubleshooting

vLLM consumes far more than 9.5 GB

A user on the HF model discussion reported that vLLM "takes up almost 40GB VRAM for me". That is expected behaviour: vLLM claims a fixed fraction of the card up front (--gpu-memory-utilization, 0.9 by default) and fills whatever the weights leave free with KV cache, so its footprint tracks the size of the GPU and overshoots the ~9.5 GB figure on the Mistral card. On the RTX 5070 the problem is the reverse: once the weights are loaded, little room is left for KV cache, and the 32k default context may not fit. Cap --max-model-len, and consider --gpu-memory-utilization 0.85 to leave headroom for the display and other allocations outside vLLM. The exact ceiling depends on your concurrent-stream count: start with --max-model-len 8192 and raise it until the server fails to start or runs out of memory, or stick with the Transformers backend for single-user desktop work.
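
A back-of-envelope budget shows why the margin is tight. The sketch below assumes vLLM's memory cap is simply total VRAM multiplied by --gpu-memory-utilization and treats the card's ~9.5 GB figure as the weight footprint; both are simplifications, and activation memory is ignored, so the real KV budget will be somewhat smaller.

```python
def vllm_kv_budget_gb(total_vram_gb: float, gpu_memory_utilization: float,
                      weights_gb: float) -> float:
    """GB left for KV cache under vLLM's memory cap, after the model weights."""
    return round(total_vram_gb * gpu_memory_utilization - weights_gb, 2)


for utilization in (0.90, 0.85):
    budget = vllm_kv_budget_gb(12.0, utilization, 9.5)
    print(f"--gpu-memory-utilization {utilization:.2f}: about {budget:.1f} GB for KV cache")
```

Lowering the utilization frees memory for everything outside vLLM but shrinks the cache, which is why --max-model-len usually has to come down with it.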

`ImportError` or version mismatch on import

Voxtral was added in transformers >= 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old — upgrade with pip install -U transformers.

Want smaller weights? The FP8 mirror

For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card, the quantization is "reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x)", with weights quantized using a "symmetric static per-channel scheme" and activations using a "symmetric dynamic per-token scheme" (linear layers only). Blackwell (RTX 5070, GB205 sm_120) has native FP8 tensor cores, so it sees the throughput uplift in addition to the memory saving. On a 12 GB card the bf16 path already fits, so treat FP8 as optional headroom for long-audio runs or colocating a second model, not a requirement:

vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

GGUF / llama.cpp builds

There is no full GGUF build of Voxtral yet. Per the HF discussion thread, a community user explains that GGUF currently only supports decoder-only architectures like LLaMA, so the audio encoder cannot be converted; only a text-only quant exists, which defeats the purpose of an audio model. Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror covers the "smaller weights" use case without requiring GGUF.

Should I use the 24B variant instead?

No, not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card states "Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.", which is about 4.6× the RTX 5070's 12 GB envelope. Even a roughly 50% saving from FP8 would leave it near 27 GB, still out of reach. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.
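
The comparison can be laid out as a quick fit check. The bf16 figures are the ones quoted from the model cards; the FP8 rows apply the FP8 mirror card's "approximately 50%" reduction and are estimates only. Applying that same ratio to Voxtral Small is our assumption, not a published figure.

```python
VRAM_GB = 12.0  # RTX 5070

# bf16 values are quoted from the model cards; FP8 values are ESTIMATES
# (bf16 figure x 0.5), and the Small 24B FP8 row is an assumption.
REQUIREMENTS_GB = {
    "Voxtral Mini 3B, bf16": 9.5,
    "Voxtral Mini 3B, FP8 (est.)": 9.5 * 0.5,
    "Voxtral Small 24B, bf16": 55.0,
    "Voxtral Small 24B, FP8 (est.)": 55.0 * 0.5,
}


def fits(required_gb: float, vram_gb: float = VRAM_GB) -> bool:
    """True when the stated requirement is within the card's total VRAM."""
    return required_gb <= vram_gb


for name, required in REQUIREMENTS_GB.items():
    verdict = "fits" if fits(required) else "does not fit"
    print(f"{name}: ~{required:.1f} GB, {verdict} in {VRAM_GB:.0f} GB")
```

This compares against total VRAM; subtract what your display and other apps use for a stricter check.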