
Voxtral Mini 3B on RX 7900 XTX: local speech understanding on ROCm (~9.5 GB)

tts · intermediate · 10GB+ VRAM · Jun 17, 2026

This intermediate recipe sets up Voxtral Mini 3B on the RX 7900 XTX, needing about 10 GB of VRAM.

prerequisites
  • AMD Radeon RX 7900 XTX (24 GB VRAM, RDNA3 / Navi 31 / gfx1100) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed (ROCm 7.2.x)
  • Python 3.10+
  • PyTorch built for ROCm, transformers >= 4.54.0 and mistral-common[audio] >= 1.8.1

What You'll Build

A local audio-understanding pipeline running Mistral's Voxtral Mini 3B on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The model handles speech transcription, speech translation, audio Q&A, and summarization — it consumes audio and produces text. Per the model card it "excels at speech transcription, translation and audio understanding" with built-in Q&A and summarization directly from audio. It runs natively in Transformers, which on ROCm routes attention through PyTorch's scaled-dot-product attention (SDPA) — no FlashAttention build required.

Hardware data: RX 7900 XTX (24 GB VRAM, RDNA3 / gfx1100) · ~9.5 GB peak in BF16 per the official model card · ROCm 7.2 · See benchmark data

ℹ️ Not a TTS model. Voxtral understands audio — it transcribes and reasons over speech, it does not synthesize speech. It is a multimodal audio+text LLM (audio-in → text-out), in the same family as ASR systems like Whisper, not text-to-speech models like Kokoro or VoxCPM. Voxtral sits in our tts vertical only because the wider catalogue groups audio-input-or-output models together; the model card is explicit that this is speech-to-text + audio understanding.

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack — there is no CUDA wheel, no FlashAttention-2 build, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16/FP16 with no memory saving — and at 24 GB you don't need it anyway. Run the native BF16 weights. The attention path is PyTorch SDPA (the Transformers default on ROCm). If a guide tells you to pip install a CUDA torch wheel or a FlashAttention wheel for this card, it's written for the wrong vendor.

Requirements

Component | Minimum | Tested
GPU | 12 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB, RDNA3 / gfx1100)
RAM | 16 GB system | -
Storage | ~10 GB for weights + cache | ~9.4 GB of BF16 safetensors across two shards per the HF Files tab
Driver | AMD ROCm 7.2.x on Linux | -
Software | Python 3.10+, PyTorch (ROCm build), transformers >= 4.54.0, mistral-common[audio] >= 1.8.1 | -

The model is released under the Apache-2.0 license.

Installation

1. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Install torch against the ROCm wheel index (do this before transformers, so pip resolves the GPU build):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current stable line in the live PyTorch "Get Started" selector (pick Linux / Pip / ROCm) before running, and match it to your installed ROCm version. AMD also ships its own Radeon-tuned wheels via repo.radeon.com if you prefer the vendor build.

Confirm the build is the ROCm one and the GPU is visible:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

The version string should carry a +rocm7.2-style suffix and torch.cuda.is_available() should return True (ROCm masquerades as the cuda device namespace under HIP).
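
For a check that does not depend on reading the suffix by eye, the sketch below inspects torch.version.hip, which is a version string on ROCm builds and None on CUDA or CPU builds. The helper names (looks_like_rocm_build, describe_torch_build) are illustrative and not part of PyTorch, and the version strings in the comments are examples only.

```python
def looks_like_rocm_build(version: str) -> bool:
    """True if a torch version string carries a +rocm local tag (e.g. 'X.Y.Z+rocm7.2')."""
    _, _, local = version.partition("+")
    return local.startswith("rocm")


def describe_torch_build() -> str:
    """Summarise the installed torch build without crashing if torch is absent."""
    try:
        import torch  # imported lazily so the string helper works without torch
    except ImportError:
        return "torch is not installed"

    hip = torch.version.hip  # None on CUDA/CPU builds, a string on ROCm builds
    if hip is None:
        return f"torch {torch.__version__}: not a ROCm build"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} (HIP {hip}): ROCm build, but no GPU visible"
    return f"torch {torch.__version__} (HIP {hip}): {torch.cuda.get_device_name(0)}"
```

Calling print(describe_torch_build()) on the target machine should name the Radeon card. A "not a ROCm build" result means pip resolved the wrong wheel.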

2. Install Transformers and mistral-common

Voxtral runs natively in Transformers starting with transformers >= 4.54.0. Both packages are required — Voxtral uses mistral-common's audio tokenizer:

pip install -U transformers
pip install --upgrade "mistral-common[audio]"

Check the installed mistral-common version:

python -c "import mistral_common; print(mistral_common.__version__)"

You should see 1.8.1 or newer; the model card gives 1.8.1 as the minimum version for audio support.
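
Both version floors can be checked in one pass. This is a minimal sketch using only the standard library; the comparison is deliberately coarse (it looks at leading numeric components and ignores pre-release tags), and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Version floors quoted in this recipe.
FLOORS = {"transformers": "4.54.0", "mistral-common": "1.8.1"}


def release_tuple(ver: str) -> tuple:
    """Leading numeric components of a version string, e.g. '4.54.0.dev0' -> (4, 54, 0)."""
    parts = []
    for piece in ver.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component (dev/rc tags)
        parts.append(int(piece))
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    """Compare numerically, padding the shorter tuple with zeros."""
    a, b = release_tuple(installed), release_tuple(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_floors(floors=FLOORS) -> dict:
    """Map each package name to 'ok', 'too old (...)', or 'missing'."""
    report = {}
    for name, floor in floors.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_floor(installed, floor)
        report[name] = "ok" if ok else f"too old ({installed} < {floor})"
    return report
```

Running print(check_floors()) in the recipe's environment should report "ok" for both packages. It confirms versions only; it does not prove the optional audio dependencies were installed.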

3. (Optional) Install vLLM for high-throughput serving

The Transformers backend is the recommended path for local desktop use; reach for vLLM only when you need batched throughput or an OpenAI-compatible HTTP API. vLLM has a ROCm build, but on RDNA3 you must disable its Triton FlashAttention path (it overflows the stack frame on gfx1100). Set the env flag before launching the server:

export VLLM_USE_TRITON_FLASH_ATTN=0

Install the ROCm vLLM build per the vLLM ROCm install docs (the prebuilt PyPI vllm wheel targets CUDA — on AMD you build from source or use AMD's rocm/vllm Docker image rather than pip install vllm). See Troubleshooting for the --max-model-len cap and the Triton-FA flag.

Running

Transformers — audio Q&A (recommended path)

The canonical example adapted from the Voxtral model card loads the model in BF16 and feeds it an audio clip plus a text question. On ROCm the "cuda" device string is correct — HIP exposes the GPU under the cuda namespace:

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"  # HIP/ROCm exposes the AMD GPU under the cuda namespace
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map=device
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "your-clip.mp3"},
            {"type": "text", "text": "Transcribe and summarise this clip."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Loading in torch.bfloat16 is the right choice on RDNA3: BF16 is a native WMMA input format on this card, and there is no FP8 hardware to quantize down to. Per the model card's Key Features, Voxtral has a 32k-token context length and handles audio of up to 30 minutes for transcription, or 40 minutes for understanding (model card, Mistral announcement).
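
Recordings longer than those limits have to be split before they reach the model. The sketch below only plans the split: it turns a duration into consecutive windows no longer than the limit. plan_windows is a hypothetical helper, not part of Transformers or mistral-common, and it assumes fixed-length cuts are acceptable; a real pipeline would likely cut at silences and would still need a tool such as ffmpeg to produce the actual clips.

```python
def plan_windows(duration_s: float, max_minutes: float = 30.0) -> list:
    """Split [0, duration_s) into consecutive (start, end) windows in seconds.

    max_minutes defaults to the 30-minute transcription limit quoted above;
    pass 40 for audio-understanding prompts.
    """
    if duration_s <= 0:
        return []
    limit = max_minutes * 60.0
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + limit, float(duration_s))
        windows.append((start, end))
        start = end
    return windows
```

For example, a 75-minute recording (4500 seconds) yields three transcription windows, or two windows at the 40-minute understanding limit.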

vLLM — server mode (optional)

For batched inference or multi-client setups, with the Triton-FA flag set first:

export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral

The server exposes an OpenAI-compatible API on localhost:8000. Audio is sent as a URL or base64 string inside the standard chat-completions payload. The 24 GB 7900 XTX has ample room for the weights, but vLLM's KV-cache reservation can still balloon — see Troubleshooting.
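
A client for that endpoint might look like the sketch below, which uses only the standard library. It assumes the server accepts the OpenAI-style input_audio content part (base64 data plus a format field); the model card itself builds requests with mistral-common helpers, so treat the payload shape as an assumption to verify against your vLLM version. The function names are illustrative.

```python
import base64
import json
import urllib.request


def build_audio_request(audio_bytes: bytes, question: str,
                        model: str = "mistralai/Voxtral-Mini-3B-2507",
                        audio_format: str = "mp3") -> dict:
    """Build a chat-completions payload carrying one base64 audio clip and a question."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Assumed content-part shape (OpenAI audio-input convention).
                {"type": "input_audio",
                 "input_audio": {"data": encoded, "format": audio_format}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 500,
    }


def ask(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, read the clip's bytes from disk, pass them to build_audio_request along with the question, and hand the result to ask.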

Results

  • VRAM usage: Running Voxtral-Mini-3B-2507 requires ~9.5 GB of GPU RAM in BF16/FP16, per the model card (the figure is published in the card's vLLM serving section). The BF16 weights are ~9.4 GB on disk across two safetensors shards (HF Files tab), consistent with the ~9.5 GB resident figure. On the 24 GB RX 7900 XTX that leaves well over half the card free for long audio and the KV growth that comes with the 30-minute transcription window — there is no memory pressure and no reason to quantize.
  • Speed: We found no Voxtral throughput benchmark measured on an RX 7900 XTX that could be verified at its source, and /check/voxtral/rx-7900-xtx has no measurement yet. Rather than borrow a number from a different GPU or vendor, this recipe omits a speed figure. If you've measured Voxtral on a 7900 XTX, please contribute it so it lands on the check page.
  • Quality notes: Mistral's announcement positions Voxtral as outperforming Whisper large-v3 on speech transcription. Quality is independent of GPU vendor — the BF16 weights you run on ROCm are bit-for-bit the same weights as on any other card, so transcription accuracy matches the NVIDIA path. As with any ASR model, accuracy can slip on very noisy audio or recordings that mix multiple languages.
  • License: Apache-2.0 (model card).

For the full benchmark data once community submissions land, see /check/voxtral/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled"

This means a CUDA build of PyTorch got installed instead of the ROCm build. Uninstall and reinstall against the ROCm wheel index:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).

ImportError or version mismatch on import

Voxtral was added in transformers >= 4.54.0 and needs mistral-common[audio] >= 1.8.1. The HF card calls these out explicitly. If you see cannot import name 'VoxtralForConditionalGeneration', your transformers is too old — upgrade with pip install -U transformers.

vLLM on ROCm: Triton FlashAttention crash or stack-frame overflow

The Triton FlashAttention path in vLLM overflows the stack frame on gfx1100. Always export VLLM_USE_TRITON_FLASH_ATTN=0 before launching vllm serve on the RX 7900 XTX. With that flag set, vLLM falls back to a working attention backend on ROCm. (The Transformers path above does not need this — it uses PyTorch SDPA, which routes through AOTriton/the eager fallback automatically.)

vLLM consumes far more than 9.5 GB

The ~9.5 GB figure covers the loaded model, not vLLM's KV-cache pool. By default vLLM pre-allocates KV-cache blocks up to its --gpu-memory-utilization fraction of the card (0.9 unless overridden), so on a 24 GB card it claims most of the VRAM regardless of model size. To bring vLLM into a tighter budget, lower --gpu-memory-utilization (for example 0.85, or less if other processes share the GPU). If startup fails because the 32k context does not fit in the remaining cache space, cap --max-model-len (start with --max-model-len 8192 and raise it only as far as your audio lengths require). For single-user desktop work, prefer the Transformers backend, which has no such reservation.
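
To keep the memory flags and the Triton-FA flag together in one place, a small launcher can assemble them. This is a sketch under the recipe's own values (8192 and 0.85 are the starting points suggested above, not tuned numbers); vllm_serve_command and vllm_env are hypothetical helper names.

```python
import os


def vllm_serve_command(max_model_len: int = 8192,
                       gpu_memory_utilization: float = 0.85,
                       model: str = "mistralai/Voxtral-Mini-3B-2507") -> list:
    """Assemble the vllm serve argv with the memory caps from this section."""
    return [
        "vllm", "serve", model,
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


def vllm_env(base=None) -> dict:
    """Copy the environment and disable the Triton FlashAttention path."""
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_TRITON_FLASH_ATTN"] = "0"
    return env
```

Launch with subprocess.run(vllm_serve_command(), env=vllm_env(), check=True) from an environment where the ROCm vLLM build is installed.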

Do not install a FlashAttention wheel or CUDA torch

HF guides written for NVIDIA frequently suggest a FlashAttention wheel or a CUDA torch build. On RDNA3 these are the wrong path: there is no consumer-card FlashAttention build for gfx1100, and a CUDA torch wheel will not see the AMD GPU at all. The Transformers backend already routes attention through PyTorch SDPA on this stack — that is the correct and only attention path you need here.

GGUF / llama.cpp builds

Voxtral's audio encoder is not yet covered by GGUF conversion (conversion currently targets decoder-only architectures), so a GGUF quant would drop the audio tower and break transcription. Stick with the Transformers (or vLLM-on-ROCm) BF16 path above. At 24 GB on the 7900 XTX there is no memory motivation to quantize anyway.

Should I use the 24B variant instead?

No — not on this GPU. Voxtral Small 24B is the same architecture at a larger scale, but its model card quotes ~55 GB of GPU RAM in BF16/FP16 — roughly 2.3× the RX 7900 XTX's 24 GB envelope, and with no FP8 hardware on RDNA3 there is no quantization escape hatch to close that gap. Voxtral Mini 3B is the right variant for this card.
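
The arithmetic behind that answer, as a short sketch. The figures are the BF16/FP16 numbers this recipe quotes from the two model cards; fit_report is an illustrative helper, and it compares quoted memory against card capacity only, ignoring KV cache and activations.

```python
CARD_VRAM_GB = 24.0  # RX 7900 XTX

# BF16/FP16 GPU RAM figures quoted in this recipe from the model cards.
QUOTED_BF16_GB = {
    "Voxtral Mini 3B": 9.5,
    "Voxtral Small 24B": 55.0,
}


def fit_report(need_gb: float, have_gb: float = CARD_VRAM_GB) -> dict:
    """Compare a quoted memory requirement against the card's VRAM."""
    return {
        "fits": need_gb <= have_gb,
        "ratio": round(need_gb / have_gb, 2),      # requirement as a multiple of VRAM
        "headroom_gb": round(have_gb - need_gb, 1),  # negative means a shortfall
    }


for name, need in QUOTED_BF16_GB.items():
    print(name, fit_report(need))
```

Mini 3B comes out at about 0.4× of the card with 14.5 GB of headroom, while Small 24B comes out at about 2.29× with a 31 GB shortfall.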

common questions
How much VRAM does Voxtral Mini 3B need?

About 10 GB. The model card quotes ~9.5 GB in BF16/FP16, so this recipe lists 12 GB as the minimum card size.

Which GPUs is Voxtral Mini 3B tested on?

RX 7900 XTX (24 GB).

How hard is this setup?

Intermediate — follow the steps above.