How much VRAM does Voxtral Mini 3B need?

About 9.5 GB in BF16 per the model card. This recipe targets cards with at least 12 GB of VRAM.

How hard is this setup?

Intermediate: follow the steps below.

Voxtral Mini 3B on RX 7800 XT: local speech understanding on ROCm (~9.5 GB)

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. The model handles speech transcription, speech translation, audio Q&A, and summarization — it consumes audio and produces text. Per the model card it "excels at speech transcription, translation and audio understanding". It runs natively in Transformers, which on ROCm routes attention through PyTorch's scaled-dot-product attention (SDPA) — no FlashAttention build required.

Hardware data: RX 7800 XT (16 GB VRAM, RDNA3 / gfx1101) · ~9.5 GB peak in BF16 per the official model card · ROCm 7.2 · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it transcribes and reasons over speech, it does not synthesize speech. It is a multimodal audio+text LLM (audio-in → text-out), in the same family as ASR systems like Whisper, not text-to-speech models like Kokoro or VoxCPM. Voxtral sits in our tts vertical only because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio understanding.

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no CUDA wheel, no FlashAttention-2 build, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16/FP16 with no memory saving — and at ~9.5 GB the native BF16 weights fit the 16 GB card comfortably anyway. Run the native BF16 weights. The attention path is PyTorch SDPA (the Transformers default on ROCm). If a guide tells you to pip install a CUDA torch wheel or a FlashAttention wheel for this card, it's written for the wrong vendor.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM (ROCm-supported AMD card)	RX 7800 XT (16 GB, RDNA3 / gfx1101)
RAM	16 GB system	—
Storage	~10 GB for weights + cache	~9.36 GB of BF16 safetensors across two shards per the HF Files tab
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.10+, PyTorch (ROCm build), `transformers >= 4.54.0`, `mistral-common[audio] >= 1.8.1`	—

The model is released under the Apache-2.0 license.

Installation

1. Install PyTorch for ROCm

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux — the ROCm install system-requirements matrix lists it as gfx1101 with full support — so it uses the stable ROCm PyTorch wheel. Install torch against the ROCm wheel index (do this before transformers, so pip resolves the GPU build):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current stable line in the live PyTorch "Get Started" selector (pick Linux / Pip / ROCm) before running, and match it to your installed ROCm version. AMD also ships its own Radeon-tuned wheels via repo.radeon.com if you prefer the vendor build. There is also an experimental RDNA-3-specific nightly index (https://rocm.nightlies.amd.com/v2/gfx110X-all/) — its gfx110X-all name covers the whole RDNA3 family (gfx1100/gfx1101/gfx1102), so it is the right experimental index for the 7800 XT's gfx1101 if you ever need it. On officially-supported Linux you do not — the stable whl/rocm7.2 wheel above is the canonical path.

Confirm the build is the ROCm one and the GPU is visible:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

The version string should carry a +rocm7.2-style suffix and torch.cuda.is_available() should return True (ROCm masquerades as the cuda device namespace under HIP).
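The version suffix can be easy to misread, so a slightly more explicit check is sketched below. It relies on `torch.version.hip`, which is a version string on ROCm builds and `None` otherwise. The helper name `describe_build` is made up for this example, and the classification rules are a rough heuristic rather than an official test.

```python
from typing import Optional


def describe_build(torch_version: str, hip_version: Optional[str]) -> str:
    """Classify a PyTorch install from torch.__version__ and torch.version.hip."""
    if hip_version:  # only set on ROCm/HIP builds
        return f"ROCm build (torch {torch_version}, HIP {hip_version})"
    if "+cu" in torch_version:  # e.g. a "+cu128"-style suffix marks a CUDA wheel
        return f"CUDA build (torch {torch_version}): wrong wheel for an AMD GPU"
    return f"CPU-only or unknown build (torch {torch_version})"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(describe_build(torch.__version__, torch.version.hip))
        if torch.cuda.is_available():
            # On ROCm this reports the AMD card despite the "cuda" name.
            print(torch.cuda.get_device_name(0))
```

If the first line does not start with "ROCm build", reinstall from the ROCm wheel index before going further.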

2. Install Transformers and mistral-common

Voxtral runs natively in Transformers starting with transformers >= 4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Verify the installed mistral-common version:

python -c "import mistral_common; print(mistral_common.__version__)"

You should see 1.8.1 or newer; the model card lists 1.8.1 as the minimum mistral-common version.
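Printing the version does not by itself show that the optional audio dependencies are importable. A minimal sketch of that extra check follows. The module names in `AUDIO_EXTRA_MODULES` are an assumption about what the `[audio]` extra pulls in, so adjust them to whatever your installed mistral-common release actually declares.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: the [audio] extra installs these audio libraries.
AUDIO_EXTRA_MODULES = ("soundfile", "soxr")

if __name__ == "__main__":
    missing = missing_modules(("mistral_common",) + AUDIO_EXTRA_MODULES)
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("mistral_common and the assumed audio modules are importable")
```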

3. (Optional) Install vLLM for high-throughput serving

The Transformers backend is the recommended path for local desktop use; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API. vLLM has a ROCm build, but on RDNA3 you must disable its Triton FlashAttention path (it overflows the stack frame on gfx1101). Set the env flag before launching the server:

export VLLM_USE_TRITON_FLASH_ATTN=0

Install vLLM for ROCm per the vLLM GPU install docs (open the ROCm tab) — the docs document ROCm-specific prebuilt wheels, a build-from-source path, and a ROCm Docker image. See Troubleshooting for the --max-model-len cap and the Triton-FA flag.

Running

Transformers — audio Q&A (recommended path)

The canonical example adapted from the Voxtral model card loads the model in BF16 and feeds it an audio clip plus a text question. On ROCm the "cuda" device string is correct — HIP exposes the GPU under the cuda namespace:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"  # HIP/ROCm exposes the AMD GPU under the cuda namespace
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Loading in torch.bfloat16 is the right choice on RDNA3: BF16 is a native WMMA input format on this card, and there is no FP8 hardware to quantize down to. Per the model card's Key Features, Voxtral has a 32k-token context length and handles audio up to 30 minutes long for transcription, or 40 minutes for understanding (model card, Mistral announcement).

vLLM — server mode (optional)

For batched inference or multi-client setups, with the Triton-FA flag set first:

export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. The 16 GB 7800 XT has room for the weights, but vLLM's KV-cache reservation can still balloon past the 16 GB envelope — see Troubleshooting.
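As a rough illustration of the base64 route, the sketch below builds a chat-completions request body using only the standard library. The `input_audio` content-part shape follows the OpenAI chat-completions schema; whether your vLLM release accepts it for Voxtral is an assumption to verify, since the model card builds its requests with mistral_common instead. The function name `build_audio_chat_payload` is hypothetical.

```python
import base64
import json


def build_audio_chat_payload(audio_bytes: bytes, question: str,
                             model: str = "mistralai/Voxtral-Mini-3B-2507",
                             audio_format: str = "mp3") -> dict:
    """Build an OpenAI-style chat-completions body with inline base64 audio."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Assumed content-part shape (OpenAI "input_audio" schema).
                {"type": "input_audio",
                 "input_audio": {"data": encoded, "format": audio_format}},
                {"type": "text", "text": question},
            ],
        }],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real clip read with open(path, "rb").
    payload = build_audio_chat_payload(b"\x00\x01fake-audio", "Summarise this clip.")
    body = json.dumps(payload).encode("utf-8")
    print(f"request body: {len(body)} bytes")
    # To send it to a running server (not executed here):
    # import urllib.request
    # req = urllib.request.Request(
    #     "http://localhost:8000/v1/chat/completions", data=body,
    #     headers={"Content-Type": "application/json"})
    # print(urllib.request.urlopen(req).read().decode())
```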

Results

Speed: No Voxtral throughput benchmark measured on an RX 7800 XT could be verified at its source, and there is no measurement yet on /check/voxtral/rx-7800-xt. The 7800 XT has roughly two-thirds of the 7900 XTX's memory bandwidth (624 vs 960 GB/s) and fewer WMMA units, so borrowing a figure from an RDNA3 sibling, or from another vendor's GPU, would mislead. The Speed figure is therefore omitted. If you've measured Voxtral on a 7800 XT, please contribute it so it lands on /check/voxtral/rx-7800-xt.
VRAM usage: Running Voxtral-Mini-3B-2507 requires "~9.5 GB of GPU RAM in bf16 or fp16", per the model card (the figure is published in the card's vLLM serving section). The BF16 weights are ~9.36 GB on disk across two safetensors shards (HF Files tab), consistent with the ~9.5 GB resident figure. On the 16 GB RX 7800 XT that leaves about 6.5 GB free for activations and the KV-cache growth that comes with the 30-minute transcription window. The model fits comfortably with headroom, and there is no reason to quantize.
Quality notes: Mistral's announcement positions Voxtral as outperforming Whisper large-v3 on speech transcription. Quality does not depend on GPU vendor: the BF16 weights you run on ROCm are bit-for-bit the same weights as on any other card, so transcription accuracy should match the NVIDIA path apart from small numerical differences between kernel implementations. As with any ASR model, accuracy can slip on very noisy audio or recordings that mix multiple languages.
License: Apache-2.0 (model card).

For the full benchmark data once community submissions land, see /check/voxtral/rx-7800-xt.
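The headroom reasoning in the VRAM note can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this recipe (16 GB card, ~9.5 GB for Mini 3B, ~55 GB for Small 24B); the function names are made up for illustration, and the numbers are the model cards' approximations, not measurements.

```python
def headroom_gb(card_gb: float, model_gb: float) -> float:
    """VRAM left over after the resident model footprint (negative if it does not fit)."""
    return card_gb - model_gb


def oversize_ratio(card_gb: float, model_gb: float) -> float:
    """How many times larger the model footprint is than the card's VRAM."""
    return model_gb / card_gb


CARD_GB = 16.0        # RX 7800 XT
MINI_3B_GB = 9.5      # model-card figure for Voxtral Mini 3B in BF16/FP16
SMALL_24B_GB = 55.0   # model-card figure for Voxtral Small 24B in BF16/FP16

if __name__ == "__main__":
    print(f"Mini 3B headroom: {headroom_gb(CARD_GB, MINI_3B_GB):.1f} GB")
    print(f"Small 24B headroom: {headroom_gb(CARD_GB, SMALL_24B_GB):.1f} GB")
    print(f"Small 24B is {oversize_ratio(CARD_GB, SMALL_24B_GB):.1f}x the card's VRAM")
```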

Troubleshooting

"Torch not compiled with CUDA enabled"

This error means the installed PyTorch has no GPU backend compiled in, which usually happens when a CPU-only wheel was installed instead of the ROCm build. Uninstall and reinstall against the ROCm wheel index:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).

A library ships only gfx1100 kernels and won't load on the 7800 XT

The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31) — see the ROCm system-requirements matrix for the per-card gfx targets. Most of the ROCm stack ships kernels for both, but occasionally a library or prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:

HSA_OVERRIDE_GFX_VERSION=11.0.0 python your_script.py

This is a legacy fallback, not a default — current PyTorch on the stable ROCm wheel runs natively on gfx1101 without it. Only reach for it if you hit a "no kernel image is available" / missing-gfx1101-kernel error from a specific library.
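If you would rather apply the override from Python than from the shell, one option is to launch the affected script in a child process with the variable set. The sketch below does that; the helper names are hypothetical, and it assumes the variable has to be present before the HIP runtime initializes, which is why it is passed to a fresh process rather than set after importing torch.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping


def env_with_gfx_override(base: Mapping[str, str],
                          version: str = "11.0.0") -> Dict[str, str]:
    """Copy `base` and add HSA_OVERRIDE_GFX_VERSION unless it is already set."""
    env = dict(base)
    env.setdefault("HSA_OVERRIDE_GFX_VERSION", version)
    return env


def run_with_override(argv: List[str]) -> int:
    """Run a command in a child process that sees the override; return its exit code."""
    return subprocess.run(argv, env=env_with_gfx_override(os.environ)).returncode


if __name__ == "__main__":
    # Harmless demo: the child only echoes the variable it received.
    code = "import os; print(os.environ['HSA_OVERRIDE_GFX_VERSION'])"
    run_with_override([sys.executable, "-c", code])
```

In real use, replace the demo command with something like `[sys.executable, "your_script.py"]`.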

`ImportError` or version mismatch on import

Voxtral was added in transformers >= 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old — upgrade with pip install -U transformers.
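A quick way to see which of the two floors is not met is to compare installed versions against them. The sketch below uses only the standard library; the comparison is deliberately simple (numeric major.minor.patch only), so pre-release builds such as a `.dev0` of the floor version are treated as meeting it. All helper names are illustrative.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str, width: int = 3) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string, zero-padded to `width`."""
    parts = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    """True if `installed` is at least `floor`, comparing numerically, not as text."""
    return version_tuple(installed) >= version_tuple(floor)


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


# Floors quoted in this recipe.
FLOORS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}

if __name__ == "__main__":
    for dist, floor in FLOORS.items():
        found = installed_version(dist)
        if found is None:
            status = "not installed"
        elif meets_floor(found, floor):
            status = "ok"
        else:
            status = f"too old, need >= {floor}"
        print(f"{dist}: {found} ({status})")
```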

vLLM on ROCm: Triton FlashAttention crash or stack-frame overflow

The Triton FlashAttention path in vLLM overflows the stack frame on RDNA3. Always export VLLM_USE_TRITON_FLASH_ATTN=0 before launching vllm serve on the RX 7800 XT. With that flag set, vLLM falls back to a working attention backend on ROCm. (The Transformers path above does not need this — it uses PyTorch SDPA, which routes through AOTriton/the eager fallback automatically.)

vLLM consumes far more than 9.5 GB

The ~9.5 GB figure describes resident weight memory, not vLLM's pre-allocated KV-cache reservation. vLLM claims a fixed fraction of VRAM up front (set by --gpu-memory-utilization) and sizes its cache for the full 32k default context, so on the 16 GB 7800 XT it can fail at startup or run out of memory. To bring vLLM into a tighter budget, cap --max-model-len (start with --max-model-len 8192 and raise it only while the server still starts and runs without OOM) and consider --gpu-memory-utilization 0.85 to leave headroom for anything else using the GPU. For single-user desktop work, prefer the Transformers backend, which has no such reservation.
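The flags and the environment variable discussed in this section can be assembled in one place, which makes it easier to step --max-model-len up or down between attempts. The sketch below only builds the command line and environment. It uses the flags quoted in this recipe, the function name `build_vllm_launch` is hypothetical, and the actual launch is left commented out.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_vllm_launch(max_model_len: int = 8192,
                      gpu_memory_utilization: float = 0.85,
                      model: str = "mistralai/Voxtral-Mini-3B-2507",
                      ) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a capped `vllm serve` launch on RDNA3."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    argv = [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    env = dict(os.environ)
    env["VLLM_USE_TRITON_FLASH_ATTN"] = "0"  # disable the Triton FlashAttention path
    return argv, env


if __name__ == "__main__":
    argv, env = build_vllm_launch()
    print(shlex.join(argv))
    # On a machine with vLLM for ROCm installed:
    # import subprocess
    # subprocess.run(argv, env=env, check=True)
```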

Do not install a FlashAttention wheel or CUDA torch

HF guides written for NVIDIA frequently suggest a FlashAttention wheel or a CUDA torch build. On RDNA3 these are the wrong path: there is no consumer-card FlashAttention build for gfx1101, and a CUDA torch wheel will not see the AMD GPU at all. The Transformers backend already routes attention through PyTorch SDPA on this stack — that is the correct and only attention path you need here.

GGUF / llama.cpp builds

Voxtral's audio encoder is not yet covered by GGUF conversion (conversion currently targets decoder-only architectures), so a GGUF quant would drop the audio tower and break transcription. Stick with the Transformers (or vLLM-on-ROCm) BF16 path above. At ~9.5 GB on a 16 GB card there is no memory motivation to quantize anyway.

Should I use the 24B variant instead?

No — not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card quotes "~55 GB of GPU RAM in bf16 or fp16" — roughly 3.4× the RX 7800 XT's 16 GB envelope, and with no FP8 hardware on RDNA3 there is no quantization escape hatch to close that gap. Voxtral Mini 3B is the right variant for this card.