How much VRAM does MOSS-Audio need?

About 11 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate. Follow the steps below.

MOSS-Audio 4B-Instruct on RX 7900 XTX: local audio understanding on ROCm (BF16)

What You'll Build

A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B backbone with a from-scratch MOSS-Audio-Encoder (~4.6B params total), per the model card and the upstream GitHub README.

Hardware data: RX 7900 XTX (24GB VRAM) · ~10.4 GB BF16 weights · PyTorch on ROCm 7.2 · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack, so there is no cu128 wheel here: you install the ROCm PyTorch wheel (--index-url https://download.pytorch.org/whl/rocm7.2) instead of the CUDA one. The good news: MOSS-Audio sets "_attn_implementation": "eager" for the audio encoder in its config.json. It relies on PyTorch's built-in attention paths (eager or SDPA) rather than FlashAttention-2, so it runs on ROCm with no FA2 build step. RDNA3 also has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so there is no FP8 quant path on this card, and at 24 GB you don't need one: run the native BF16 weights with room to spare. If a guide tells you to pip install flash-attn, pick a cu12x wheel, or load an FP8 checkpoint for this card, it's written for the wrong vendor.

ℹ️ Not a TTS model. MOSS-Audio understands audio — it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. The OpenMOSS team ships speech synthesis separately as MOSS-TTS; don't conflate the two. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is audio-to-text.

Requirements

Component	Minimum	Tested
GPU	11 GB VRAM (ROCm-supported AMD card)	RX 7900 XTX (24 GB)
RAM	16 GB	—
Storage	~11 GB for BF16 weights + cache	—
Driver	AMD ROCm 7.2.x on Linux	—
Software	Python 3.12, PyTorch (ROCm 7.2 build), ffmpeg 7, `huggingface-hub` CLI	—

The BF16 weight shards on the official HF model page total 10.44 GB across three .safetensors files (4.99 + 4.67 + 0.78 GB; 10,445,749,784 bytes per the HF tree API). That is the minimum VRAM the model needs just to load, before activations and KV cache; in practice, plan on ~11–12 GB for short clips, rising with audio length. The RX 7900 XTX's 24 GB leaves ample headroom: at this VRAM tier capacity is not the constraint for typical clips, and there is no quantization tradeoff to consider. Run the native BF16 weights.
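The arithmetic behind that envelope can be reproduced with a short sketch. The shard sizes and byte total are the figures quoted above; treating the card's 24 GB as decimal gigabytes is a simplifying assumption, so read the headroom as a rough estimate rather than a measured value.

```python
# Figures quoted in this recipe: per-shard sizes in GB and the byte total from the HF tree API.
SHARD_SIZES_GB = [4.99, 4.67, 0.78]
TOTAL_BYTES = 10_445_749_784
CARD_VRAM_GB = 24.0  # RX 7900 XTX; treated as decimal GB here (assumption)


def footprint(total_bytes: int, card_vram_gb: float) -> dict:
    """Return the weight footprint in decimal GB and binary GiB, plus rough headroom."""
    gb = total_bytes / 1e9
    gib = total_bytes / 2**30
    return {"gb": gb, "gib": gib, "headroom_gb": card_vram_gb - gb}


report = footprint(TOTAL_BYTES, CARD_VRAM_GB)
print(f"shard sum : {sum(SHARD_SIZES_GB):.2f} GB")
print(f"weights   : {report['gb']:.3f} GB ({report['gib']:.3f} GiB)")
print(f"headroom  : {report['headroom_gb']:.1f} GB left for activations and KV cache")
```

The GiB figure is the one most GPU monitoring tools report, which is why the loaded weights can appear smaller there than the decimal 10.44 GB.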

Installation

The official setup, from the model card and GitHub repo, uses a dedicated conda environment. The upstream README prescribes a CUDA-12.8 PyTorch wheel — that line is NVIDIA-specific. On the RX 7900 XTX you swap it for the ROCm wheel; everything else (Python 3.12, ffmpeg 7, the editable pip install -e) is unchanged, because the model is pure-PyTorch transformers with no CUDA-only kernels.

1. Clone the repo and create the environment

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y

2. Install PyTorch for ROCm

The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Install PyTorch first from the ROCm index, before the editable package, so the editable install doesn't pull a CUDA build:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. As of this writing the stable wheel is rocm7.2 — but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current "Install PyTorch for ROCm" line at pytorch.org/get-started/locally (or the ComfyUI README "AMD GPUs (Linux)" section, which currently also pins rocm7.2) before running. AMD also ships its own Radeon-tuned wheels at repo.radeon.com if you prefer the vendor build; the upstream whl/rocm7.2 wheel above is the canonical community path and is sufficient for this model.

3. Install the MOSS-Audio package

With the ROCm torch already in place, install the repo's runtime dependencies editable. Drop the --extra-index-url .../whl/cu128 flag from the upstream command — you do not want it to reach for CUDA wheels:

pip install -e ".[torch-runtime]"

ℹ️ Skip the optional FlashAttention extra. The upstream README lists flash-attn as an optional install for lower memory pressure on long audio — but the model already defaults to eager/SDPA attention ("_attn_implementation": "eager" in the audio encoder config), so it runs without it. On RDNA3 the upstream Dao-AILab flash-attn build is CDNA/MI-only and commonly fails to compile on gfx1100, so do not add the flash-attn extra here — PyTorch's SDPA on ROCm is the attention path, and it covers this model fully. Install plain:

# Correct for AMD: NO flash-attn extra, NO cu128 index
pip install -e ".[torch-runtime]"

4. Download the 4B-Instruct weights

hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
  --local-dir ./weights/MOSS-Audio-4B-Instruct

This pulls the three BF16 safetensors shards (~10.4 GB total) plus tokenizer and processor config.
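A quick way to sanity-check the download is to count the shards and sum their sizes on disk. The sketch below assumes the --local-dir path used above and compares against the byte total quoted in this recipe; if upstream re-uploads the weights, the expected figure may no longer match, so a mismatch is a prompt to look closer, not proof of corruption.

```python
from pathlib import Path

EXPECTED_BYTES = 10_445_749_784  # shard total quoted in this recipe (HF tree API)


def shard_report(weights_dir: str) -> tuple[int, int]:
    """Return (number of .safetensors shards, their combined size in bytes)."""
    root = Path(weights_dir)
    if not root.is_dir():
        return 0, 0
    shards = sorted(root.glob("*.safetensors"))
    return len(shards), sum(p.stat().st_size for p in shards)


count, total = shard_report("./weights/MOSS-Audio-4B-Instruct")
print(f"{count} shard(s), {total / 1e9:.2f} GB on disk")
if (count, total) != (3, EXPECTED_BYTES):
    print("mismatch: the download may be incomplete, or the upstream files changed")
```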

Running

The fastest path is the bundled infer.py script for one-shot inference:

# Edit MODEL_PATH and AUDIO_PATH inside infer.py, then:
python infer.py

The default prompt is "Describe this audio." The script also covers transcription and audio QA; change the prompt to switch tasks.

For an interactive UI:

python app.py

ROCm presents itself to PyTorch under the cuda device namespace (HIP masquerades as CUDA), so the upstream scripts that call .to("cuda") / torch.cuda.is_available() work unmodified — no code edits are needed for AMD. Confirm the runtime sees the GPU:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

The version string should carry a +rocm7.2-style suffix, and torch.cuda.is_available() should print True.
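For a slightly more explicit check than the one-liner, the sketch below classifies the installed build from its version suffix and from torch.version.hip, which is set on ROCm builds and None otherwise. The helper name classify_torch_build is illustrative, not part of PyTorch or the MOSS-Audio repo.

```python
import re
from typing import Optional


def classify_torch_build(version: str, hip_version: Optional[str]) -> str:
    """Classify a PyTorch build from its version string and torch.version.hip."""
    local = version.partition("+")[2]  # text after '+', e.g. 'rocm7.2' or 'cu128'
    if hip_version or local.startswith("rocm"):
        return "rocm"
    if re.match(r"cu\d", local):
        return "cuda"
    return "cpu-or-unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        kind = classify_torch_build(str(torch.__version__), torch.version.hip)
        print(torch.__version__, "->", kind)
        if torch.cuda.is_available():
            # On ROCm the AMD GPU is reported through the cuda namespace.
            print("device:", torch.cuda.get_device_name(0))
```

A result of "cuda" or "cpu-or-unknown" on this card means the wrong wheel is installed; reinstall from the ROCm index before going further.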

ℹ️ Note on the SGLang server path. The upstream usage guide offers an optional patched SGLang fork for batched/server-side serving, but its setup pins nvidia-cudnn-cu12 and a CUDA torch build — that path is NVIDIA-only and does not apply to the RX 7900 XTX. On AMD, stay on the transformers path (infer.py / app.py) shown above, which is fully supported on ROCm. If you need an OpenAI-compatible server on AMD, vLLM-on-ROCm lists gfx1100 — but treat that as out-of-scope here and verify separately.

Results

Speed: Not quoted. No published source benchmarks MOSS-Audio on a Radeon RX 7900 XTX (or a closely related AMD card), and we do not transfer throughput figures from a different card or vendor. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
VRAM usage: No vendor benchmark on this card exists yet. The concrete lower bound is the 10.44 GB BF16 weight footprint from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API). Plan on ~11–12 GB for short clips once activations and KV cache are included, rising with audio length and max_tokens. That is well within the 24 GB 7900 XTX budget and leaves large headroom for long-form audio. This is a derived envelope, not a measured peak; once community submissions land, the live number will appear on the check page below.
Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their diverse ASR benchmark suite, per the model card (the 11.30 figure on that card belongs to the larger 8B-Instruct). The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response; pick Thinking if you need step-by-step audio reasoning rather than direct instruction following. These are upstream-reported scores, not measurements on this card. Because the BF16 weights are identical across vendors, quality should carry over from the NVIDIA recipe, though small numerical differences between GPU backends are possible.
License: Apache-2.0.

For the full benchmark data once community submissions land, see /check/moss-audio/rx-7900-xtx.

Troubleshooting

"Torch not compiled with CUDA enabled" or a CUDA-build torch got installed

This means a CUDA wheel of PyTorch got pulled in instead of the ROCm build — most often because the editable pip install -e ".[torch-runtime]" ran with the upstream --extra-index-url .../whl/cu128 flag still attached, or before the ROCm torch was installed. Reinstall torch from the ROCm index:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (under HIP, ROCm exposes the GPU through the cuda device namespace).

`flash-attn` build fails on gfx1100

If you (or the editable install's optional extras) try to build the Dao-AILab flash-attn package, the C++/CK compile commonly fails on consumer RDNA3 cards — upstream FlashAttention's CK kernels target CDNA/MI accelerators, not gfx1100. You don't need it: MOSS-Audio defaults to eager/SDPA attention, so simply install without the flash-attn extra (pip install -e ".[torch-runtime]") and let PyTorch's ROCm SDPA handle attention. Remove any flash-attn line from a custom requirements file before installing.
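To see which attention implementation the downloaded checkpoint actually requests, you can search its config.json for the "_attn_implementation" key. The sketch below walks the JSON recursively so it does not assume where the key is nested; the path is the --local-dir from step 4, and the helper name find_key is illustrative.

```python
import json
from pathlib import Path


def find_key(node, key, path=""):
    """Yield (dotted_path, value) for every occurrence of `key` in nested JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            here = f"{path}.{k}" if path else k
            if k == key:
                yield here, v
            yield from find_key(v, key, here)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_key(item, key, f"{path}[{i}]")


config_path = Path("./weights/MOSS-Audio-4B-Instruct/config.json")
if config_path.exists():
    config = json.loads(config_path.read_text())
    for where, value in find_key(config, "_attn_implementation"):
        print(f"{where} = {value}")
else:
    print(f"{config_path} not found; download the weights first")
```

Any value other than a FlashAttention variant means the checkpoint does not ask for the flash-attn package.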

`Cannot import name` or version errors during install

Keep MOSS-Audio inside its own conda environment exactly as the README prescribes, but with the ROCm torch wheel substituted for the CUDA one. If you reuse a system Python with a mismatched (or CUDA) torch, imports can break and the audio encoder may fail to load. Recreate the env, install the ROCm wheel first (step 2), then the editable package (step 3).

Out of memory on long audio

The weight footprint (~10.4 GB BF16) is fixed, but activations and the KV cache scale with clip length and max_tokens. On the 24 GB 7900 XTX this is unlikely, but if a very long clip OOMs, shorten the audio segment or reduce max_tokens. No widely-reported OOMs on this card exist yet; report problems via the submission form.
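If shortening the clip is the chosen fix, one option is to plan fixed-length windows up front and run inference per window. This is a minimal sketch, not something the upstream scripts provide: segment_windows and its parameters are hypothetical, and the window length and overlap are values you would tune for your audio.

```python
def segment_windows(duration_s: float, max_len_s: float, overlap_s: float = 0.0):
    """Split [0, duration_s] into windows of at most max_len_s seconds."""
    if max_len_s <= 0 or not 0 <= overlap_s < max_len_s:
        raise ValueError("need max_len_s > 0 and 0 <= overlap_s < max_len_s")
    windows, start = [], 0.0
    step = max_len_s - overlap_s
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


for start, end in segment_windows(duration_s=150.0, max_len_s=60.0, overlap_s=5.0):
    # Each window can be cut with ffmpeg (-ss <start> -t <length>) before inference.
    print(f"{start:7.1f}s -> {end:7.1f}s")
```

When transcribing per window, any timestamps the model returns are relative to that window, so add the window's start offset to recover positions in the original file.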