Voxtral Mini 3B on RTX 4070: local speech understanding in ~9.5 GB

tts · intermediate · 10GB+ VRAM · Jun 6, 2026
prerequisites
  • NVIDIA RTX 4070 (12 GB VRAM) or any consumer GPU with 12 GB+ VRAM
  • Python 3.10+
  • transformers >= 4.54.0 and mistral-common[audio] >= 1.8.1 (CUDA-capable PyTorch)

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on an RTX 4070. The model handles speech transcription, speech translation, audio Q&A, summarization, and function-calling from voice in eight languages. It consumes audio and produces text: it transcribes and understands speech, but it does not synthesize it, so it is not a text-to-speech model.

Hardware data: RTX 4070 (12 GB VRAM, Ada sm_89) · ~9.5 GB peak in bf16/fp16 per the official model card · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it does not synthesize speech. For text-to-speech on this GPU, see Kokoro or VoxCPM. Voxtral is in our tts vertical because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio Q&A.

Mistral's card puts the 3B model at "~9.5 GB of GPU RAM in bf16 or fp16", which fits inside the RTX 4070's 12 GB. A desktop card driving a display typically exposes only ~10.5–11.3 GB usable, so expect a slim but workable margin (see Results for a community report from a 12 GB card).
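The margin described above is simple subtraction, and a minimal sketch makes it concrete. The numbers are the ones quoted in this recipe; the function name `headroom_gb` is illustrative, and the result ignores KV cache and activations, which grow with the audio you feed in.

```python
# Figures quoted in this recipe: ~9.5 GB for the model in bf16/fp16, and
# roughly 10.5-11.3 GB usable on a 12 GB card that also drives a display.
MODEL_GB = 9.5


def headroom_gb(usable_gb: float, model_gb: float = MODEL_GB) -> float:
    """Return the VRAM left after loading the model, in GB.

    This is a floor-level estimate: KV cache and activations come on top.
    """
    return round(usable_gb - model_gb, 2)


for usable in (10.5, 11.3, 12.0):
    print(f"usable {usable:>4} GB -> headroom {headroom_gb(usable):.1f} GB")
```

With a display attached, that leaves somewhere between about 1 and 2 GB for everything beyond the weights.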

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM consumer card | RTX 4070 (12 GB GDDR6X, Ada AD104 sm_89)
RAM | 16 GB |
Storage | ~10 GB for weights + cache | ~9.4 GB bf16 weights per HF Files tab
Software | Python 3.10+, PyTorch with CUDA, transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 |

Installation

1. Install Transformers and mistral-common

Voxtral runs natively in Transformers starting with transformers >= 4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Confirm which mistral-common version was installed:

python -c "import mistral_common; print(mistral_common.__version__)"

You should see 1.8.1 or newer; the model card names 1.8.1 as the minimum version for the audio tokenizer.
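To check both version floors in one go, something like the following sketch may help. It uses only the standard library; the helper names `parse_version` and `check_floors` are illustrative, and the parser is deliberately simple (it reads leading integers and ignores suffixes such as `.dev0`), so treat it as a convenience check rather than a full version comparator.

```python
from importlib import metadata

# Minimum versions quoted from the model card in this recipe.
FLOORS = {"transformers": (4, 54, 0), "mistral-common": (1, 8, 1)}


def parse_version(text: str) -> tuple:
    """Turn '4.54.0' or '4.55.0.dev0' into a 3-tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad '1.8' to (1, 8, 0) so tuple comparison is fair
    return tuple(parts[:3])


def check_floors(floors=FLOORS) -> dict:
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, floor in floors.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= floor else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_floors().items():
        print(f"{name}: {status}")
```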

The RTX 4070 is an Ada Lovelace card (AD104 die, compute capability sm_89). Recent CUDA builds of PyTorch installed with pip install torch support sm_89, so no special CUDA wheel selection should be required. The model card's reference snippet calls from_pretrained(...) without forcing an attention backend, so the default backend (PyTorch's built-in SDPA kernels) is used; there is no attention-backend override to apply on this card.
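Before downloading weights, it can be worth confirming what PyTorch actually sees. The sketch below is one way to do that, assuming a CUDA build of PyTorch is installed; `describe_gpu` and `format_capability` are illustrative names, and the function returns None when torch or CUDA is unavailable.

```python
def format_capability(capability) -> str:
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def describe_gpu(index: int = 0):
    """Return (name, sm string, free GiB, total GiB), or None without CUDA."""
    try:
        import torch  # imported lazily so format_capability works without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(index)
    gib = 1024 ** 3
    return (
        torch.cuda.get_device_name(index),
        format_capability(torch.cuda.get_device_capability(index)),
        round(free_bytes / gib, 2),
        round(total_bytes / gib, 2),
    )


print(describe_gpu())
```

On an RTX 4070 the second field should read sm_89, and the free figure shows how much the display and other apps have already claimed.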

2. (Optional) Install vLLM for high-throughput serving

vLLM gives the fastest token throughput but reserves a large KV cache. The Transformers backend is recommended for local desktop use on a 12 GB card; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API. Per the model card, install vllm >= 0.10.0 (the card recommends uv):

uv pip install -U "vllm[audio]" --system

This pulls a recent vLLM and a compatible mistral_common >= 1.8.1. See the Troubleshooting section for the --max-model-len cap you will need on a 12 GB card.

3. (Optional) Use the FP8 mirror for tighter memory

For tighter memory, the RedHatAI FP8-dynamic mirror is a community FP8 quantization of the same Mistral base weights, also Apache-2.0. Per the mirror's card, the optimization is "reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x)", with weights quantized using a "symmetric static per-channel scheme" and activations using a "symmetric dynamic per-token scheme" (linear layers only). The RTX 4070's 4th-generation Ada tensor cores have native E4M3/E5M2 FP8 support, so it sees the throughput uplift in addition to the memory saving. On a 12 GB card the bf16 path already fits, so treat FP8 as optional headroom for long-audio runs or colocating a second model, not a requirement:

vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

Running

Transformers — audio Q&A

The canonical example from the Voxtral model card loads the model in bf16 and feeds it an audio clip plus a text question:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (chat template + audio tokenizer) and the bf16 weights.
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

# One user turn: an audio clip followed by a text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

# Tokenize the conversation and move the tensors to the GPU in bf16.
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Per the model card, "With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding", so a single clip can cover a 30-minute meeting recording in one pass. On a 12 GB card, keep an eye on memory as the audio window grows: the resident-weights figure is the floor, and the KV cache scales with the context you actually fill.
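For recordings longer than the 30-minute transcription limit quoted above, one option is to split the audio into windows and transcribe each in turn. The sketch below only plans the window boundaries; `plan_chunks` is an illustrative helper, the 5-second overlap is a hypothetical default rather than a documented value, and the actual cutting (for example with ffmpeg) is left out.

```python
MAX_TRANSCRIBE_SECONDS = 30 * 60  # 30-minute transcription limit from the model card


def plan_chunks(total_seconds: float,
                max_seconds: float = MAX_TRANSCRIBE_SECONDS,
                overlap_seconds: float = 5.0):
    """Return (start, end) windows, in seconds, that cover the recording.

    overlap_seconds is a hypothetical default so that words cut at a
    boundary appear in both neighbouring windows.
    """
    if max_seconds <= overlap_seconds:
        raise ValueError("max_seconds must exceed overlap_seconds")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds  # step back slightly for the overlap
    return windows


# A one-hour recording needs more than two windows once overlap is included.
for start, end in plan_chunks(3600):
    print(f"{start:7.1f}s -> {end:7.1f}s")
```

On a 12 GB card, shorter windows than the maximum may be the safer choice, since a smaller window also means a smaller KV cache.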

vLLM — server mode

For batched inference or multi-client setups:

vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. On a 12 GB card you will need to cap context — see Troubleshooting.
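As a rough illustration of the base64 route, the sketch below builds a chat-completions payload with standard-library Python. The `input_audio` content-part shape follows the OpenAI chat-completions convention; whether your vLLM version accepts exactly this shape for Voxtral is an assumption to check against the vLLM documentation. `build_audio_request` is an illustrative helper, not part of any library.

```python
import base64
import json


def build_audio_request(audio_bytes: bytes, question: str,
                        model: str = "mistralai/Voxtral-Mini-3B-2507",
                        audio_format: str = "mp3") -> dict:
    """Build a chat-completions payload carrying base64-encoded audio.

    The 'input_audio' part mirrors the OpenAI convention (assumed here).
    """
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": encoded, "format": audio_format}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 500,
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").read().
payload = build_audio_request(b"not-real-audio", "Summarise this clip.")
print(json.dumps(payload)[:120])
```

The payload would then be POSTed as JSON to the server's chat-completions endpoint (conventionally /v1/chat/completions on localhost:8000) with any HTTP client.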

Results

  • VRAM usage: The model card's vLLM serving section states "Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16." A community user on the official HF discussion reports a similar figure on the same VRAM class as the RTX 4070: "I’m running the Hugging Face version on a 12GB GPU with no problem VRAM sits around 10GB during normal use." The bf16 weights are ~9.4 GB on disk across two safetensors shards (HF Files tab), consistent with the ~9.5 GB resident figure. The margin is real but slim once a display claims a slice of VRAM, so close other GPU apps before long transcription jobs.
  • Speed: Empirical speed for the RTX 4070 specifically is not yet available — there is no benchmark on /check/voxtral/rtx-4070 and no first-party RTX 4070 measurement in the model's discussion or issue trackers. We do not extrapolate one from a different card. Submit a benchmark via /contribute once you've measured it.
  • Quality notes: Mistral's announcement positions Voxtral as a state-of-the-art open speech-understanding model with multilingual transcription, translation and audio Q&A. A community report from the official HF discussion notes that transcription "starts to slip a bit when the audio is noisy or mixes multiple languages" and that "Whisper Large v3 still feels a bit more robust in those tricky cases".
  • License: Apache-2.0 (model card).

For the full benchmark data once community submissions land, see /check/voxtral/rtx-4070.
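If you want to measure speed and peak memory yourself, a minimal timing harness could look like the sketch below. `benchmark` is an illustrative helper; it reports PyTorch's own allocator peak via torch.cuda.max_memory_allocated(), which is usually lower than the figure nvidia-smi shows, and it falls back to timing only when torch or CUDA is missing.

```python
import time


def benchmark(generate, new_token_count):
    """Time one call to generate() and report tokens/second and peak VRAM.

    generate: zero-argument callable that runs a single generation.
    new_token_count: callable mapping generate()'s result to the number
        of newly generated tokens.
    """
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False
    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # make sure earlier GPU work is finished
    start = time.perf_counter()
    result = generate()
    if cuda:
        torch.cuda.synchronize()  # wait for the generation kernels to finish
    elapsed = time.perf_counter() - start
    tokens = new_token_count(result)
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3 if cuda else None
    return {
        "tokens": tokens,
        "seconds": elapsed,
        "tokens_per_second": tokens / elapsed if elapsed > 0 else float("inf"),
        "peak_gib": peak_gib,
    }


# With the Transformers example from the Running section, usage might be:
# stats = benchmark(
#     lambda: model.generate(**inputs, max_new_tokens=500),
#     lambda out: out.shape[1] - inputs.input_ids.shape[1],
# )
```

Run it a few times and discard the first result, since the first call typically includes one-off warm-up cost.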

Troubleshooting

vLLM consumes far more than 9.5 GB

A user on the HF model discussion reported that vLLM "takes up almost 40GB VRAM for me" — vLLM pre-allocates a large KV cache, which is why it overshoots the ~9.5 GB resident-weights figure on the Mistral card. A separate vLLM-side bug report (vllm-project/vllm#38233) tracks 16 GB users hitting encoder_cache saturation on the Realtime Voxtral variant — a related cache-reservation failure mode rather than a report against this exact 3B-2507 release. To bring vLLM into a 12 GB budget on the RTX 4070, cap --max-model-len (the 32k default reserves the full audio-encoder cache up front) and consider --gpu-memory-utilization 0.85 to leave activation headroom. The exact ceiling depends on your concurrent-stream count — start with --max-model-len 8192 and raise it until you hit OOM, or stick with the Transformers backend for single-user desktop work.
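Putting the two flags above together with the serve command from the Running section, a capped launch could be assembled like this. The sketch only builds and prints the command; the flag names and starting values are the ones mentioned in this section, while `vllm_serve_command` is an illustrative helper.

```python
import shlex


def vllm_serve_command(model: str = "mistralai/Voxtral-Mini-3B-2507",
                       max_model_len: int = 8192,
                       gpu_memory_utilization: float = 0.85) -> list:
    """Build the argv list for a memory-capped `vllm serve` launch."""
    return [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Print the command for copy-paste; pass the list to subprocess.run to launch it.
print(shlex.join(vllm_serve_command()))
```

Raising max_model_len step by step and relaunching is a simple way to find the ceiling for your own stream count.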

ImportError or version mismatch on import

Voxtral was added in transformers >= 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old — upgrade with pip install -U transformers.

GGUF / llama.cpp builds

There is no full GGUF build of Voxtral yet. Per the HF discussion thread, a community user explains that GGUF currently only supports decoder-only models like LLaMA, so the audio encoder cannot be converted; only a text-only quant exists, which defeats the purpose of an audio model. Stick with the Transformers or vLLM paths above. The RedHatAI FP8 mirror (a community quantization, not an official Mistral release) covers the "smaller weights" use case without requiring GGUF.

Should I use the 24B variant instead?

No, not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card states "Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16." That is about 4.6× the RTX 4070's 12 GB envelope, and it stays out of reach even with FP8: if FP8 halved the footprint, roughly 27 GB would still be needed. Voxtral Mini 3B is the right variant for any consumer GPU under 24 GB.