What You'll Build
A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B backbone with a from-scratch MOSS-Audio-Encoder (~4.6B params total), per the model card and the upstream GitHub README.
Hardware data: RX 7900 XTX (24GB VRAM) · ~10.4 GB BF16 weights · PyTorch on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7900 XTX runs on AMD's ROCm/HIP stack, so there is no cu128 wheel here. You install the ROCm PyTorch wheel (--index-url https://download.pytorch.org/whl/rocm7.2) instead of the CUDA one. The good news: MOSS-Audio ships with "_attn_implementation": "eager" in the audio encoder block of its config.json. It is already SDPA/eager, not FlashAttention-2, so it ports cleanly to ROCm via PyTorch's scaled-dot-product attention with no FA2 build step. RDNA3 also has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so there is no FP8 quant path on this card, and at 24 GB you don't need one: run the native BF16 weights with room to spare. If a guide tells you to pip install flash-attn, pick a cu12x wheel, or load an FP8 checkpoint for this card, it's written for the wrong vendor.
ℹ️ Not a TTS model. MOSS-Audio understands audio; it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. The OpenMOSS team ships speech synthesis separately as MOSS-TTS; don't conflate the two. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is audio-to-text.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 11 GB VRAM (ROCm-supported AMD card) | RX 7900 XTX (24 GB) |
| RAM | 16 GB | — |
| Storage | ~11 GB for BF16 weights + cache | — |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | Python 3.12, PyTorch (ROCm 7.2 build), ffmpeg 7, huggingface-hub CLI | — |
The BF16 weight shards on the official HF model page total 10.44 GB across three .safetensors files (4.99 + 4.67 + 0.78 GB, per the HF tree API — 10,445,749,784 bytes). That is the minimum VRAM the model needs just to load before activations and KV cache; in practice, plan on ~11–12 GB for short clips and rising with audio length. The RX 7900 XTX's 24 GB budget leaves enormous headroom — at this VRAM tier the model is never memory-bound, and there is no quantization tradeoff to consider: run the native BF16 weights.
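The footprint arithmetic above can be reproduced in a few lines. This is a minimal sketch: the byte count is the shard total quoted in this guide, and treating the card's "24 GB" as 24 GiB is an assumption about how the VRAM figure is reported.

```python
# Shard total quoted in this guide (from the HF tree API).
WEIGHT_BYTES = 10_445_749_784

# Assumption: the card's "24 GB" means 24 GiB (24 * 2**30 bytes).
VRAM_BYTES = 24 * 2**30


def to_gb(n_bytes):
    """Decimal gigabytes (10**9 bytes), the unit file listings usually show."""
    return n_bytes / 1e9


def to_gib(n_bytes):
    """Binary gibibytes (2**30 bytes), the unit VRAM is usually counted in."""
    return n_bytes / 2**30


weights_gb = to_gb(WEIGHT_BYTES)
weights_gib = to_gib(WEIGHT_BYTES)
headroom_gib = to_gib(VRAM_BYTES - WEIGHT_BYTES)

print(f"weights:  {weights_gb:.2f} GB = {weights_gib:.2f} GiB")
print(f"headroom: {headroom_gib:.2f} GiB left for activations and KV cache")
```

The headroom figure is an upper bound on what activations and the KV cache can use; the desktop, the ROCm runtime, and allocator fragmentation all take a share of it.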
Installation
The official setup, from the model card and GitHub repo, uses a dedicated conda environment. The upstream README prescribes a CUDA-12.8 PyTorch wheel — that line is NVIDIA-specific. On the RX 7900 XTX you swap it for the ROCm wheel; everything else (Python 3.12, ffmpeg 7, the editable pip install -e) is unchanged, because the model is pure-PyTorch transformers with no CUDA-only kernels.
1. Clone the repo and create the environment
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
2. Install PyTorch for ROCm
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Install PyTorch first from the ROCm index, before the editable package, so the editable install doesn't pull a CUDA build:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the stable wheel is rocm7.2, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current "Install PyTorch for ROCm" line at pytorch.org/get-started/locally (or the ComfyUI README "AMD GPUs (Linux)" section, which currently also pins rocm7.2) before running. AMD also ships its own Radeon-tuned wheels at repo.radeon.com if you prefer the vendor build; the upstream whl/rocm7.2 wheel above is the canonical community path and is sufficient for this model.
3. Install the MOSS-Audio package
With the ROCm torch already in place, install the repo's runtime dependencies editable. Drop the --extra-index-url .../whl/cu128 flag from the upstream command — you do not want it to reach for CUDA wheels:
pip install -e ".[torch-runtime]"
ℹ️ Skip the optional FlashAttention extra. The upstream README lists flash-attn as an optional install for lower memory pressure on long audio, but the model already defaults to eager/SDPA attention ("_attn_implementation": "eager" in the audio encoder config), so it runs without it. On RDNA3 the upstream Dao-AILab flash-attn build is CDNA/MI-only and commonly fails to compile on gfx1100, so do not add the flash-attn extra here. PyTorch's SDPA on ROCm is the attention path, and it covers this model fully. Install plain:
# Correct for AMD: NO flash-attn extra, NO cu128 index
pip install -e ".[torch-runtime]"
4. Download the 4B-Instruct weights
hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
--local-dir ./weights/MOSS-Audio-4B-Instruct
This pulls the three BF16 safetensors shards (~10.4 GB total) plus tokenizer and processor config.
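To confirm the download completed, you can compare the on-disk shard total with the figure quoted in this guide. The sketch below is a hypothetical helper, not part of the MOSS-Audio repo; it assumes the shards sit directly under the --local-dir used above and that upstream has not re-uploaded the files since this was written.

```python
from pathlib import Path

# Shard total quoted in this guide (HF tree API); upstream may change it.
EXPECTED_BYTES = 10_445_749_784


def safetensors_total(weights_dir):
    """Return (file_count, total_bytes) for *.safetensors directly under weights_dir."""
    files = sorted(Path(weights_dir).glob("*.safetensors"))
    return len(files), sum(f.stat().st_size for f in files)


if __name__ == "__main__":
    count, total = safetensors_total("weights/MOSS-Audio-4B-Instruct")
    print(f"{count} shard(s), {total:,} bytes ({total / 1e9:.2f} GB)")
    if total != EXPECTED_BYTES:
        print("Size differs from the figure in this guide: the download may be "
              "incomplete, or the upstream files may have changed.")
```

A size match does not prove integrity; it only catches truncated or missing shards.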
Running
The fastest path is the bundled infer.py script for one-shot inference:
# Edit MODEL_PATH and AUDIO_PATH inside infer.py, then:
python infer.py
The default prompt is Describe this audio. and the script also covers transcription and audio QA — change the prompt to switch tasks.
For an interactive UI:
python app.py
ROCm presents itself to PyTorch under the cuda device namespace (HIP masquerades as CUDA), so the upstream scripts that call .to("cuda") / torch.cuda.is_available() work unmodified — no code edits are needed for AMD. Confirm the runtime sees the GPU:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
The version string should carry a +rocm7.2-style suffix, and torch.cuda.is_available() should print True.
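For a slightly fuller check than the one-liner, the sketch below labels the installed build and runs a small BF16 matmul on the GPU. classify_build is a hypothetical helper written for this guide; torch.version.hip is a version string on ROCm builds and None otherwise, and the matmul size is arbitrary.

```python
def classify_build(version, hip_version):
    """Label a PyTorch build from torch.__version__ and torch.version.hip."""
    if hip_version:
        return "rocm"
    if "+cu" in version:
        return "cuda"
    if "+cpu" in version:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        kind = classify_build(torch.__version__, torch.version.hip)
        print(f"torch {torch.__version__} -> {kind} build")
        if torch.cuda.is_available():
            print("device 0:", torch.cuda.get_device_name(0))
            # Small BF16 matmul as a smoke test that kernels actually launch.
            x = torch.randn(256, 256, device="cuda", dtype=torch.bfloat16)
            print("bf16 matmul finite:", bool(torch.isfinite(x @ x).all()))
        else:
            print("no GPU visible to PyTorch")
```

If this reports a cuda or cpu build, the ROCm wheel was replaced at some point during install.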
ℹ️ Note on the SGLang server path. The upstream usage guide offers an optional patched SGLang fork for batched/server-side serving, but its setup pins nvidia-cudnn-cu12 and a CUDA torch build, so that path is NVIDIA-only and does not apply to the RX 7900 XTX. On AMD, stay on the transformers path (infer.py / app.py) shown above, which is fully supported on ROCm. If you need an OpenAI-compatible server on AMD, vLLM-on-ROCm lists gfx1100, but treat that as out of scope here and verify separately.
Results
- Speed: Not quoted. No source benchmarks MOSS-Audio on a Radeon RX 7900 XTX (or a close AMD compute sibling), and we do not transfer throughput figures from a different card or vendor. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
- VRAM usage: No vendor benchmark on this card exists yet. The concrete lower bound is the 10.44 GB BF16 weight footprint computed from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API); plan on ~11–12 GB for short clips once activations and KV cache are included, rising with audio length and max_tokens. That sits comfortably within the 24 GB 7900 XTX budget, leaving large headroom for long-form audio. This is a derived envelope, not a measured peak; once community submissions land, the live number will appear on the check page below.
- Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their diverse ASR benchmark suite, per the model card (the 11.30 figure on that card belongs to the larger 8B-Instruct). The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response; pick Thinking if you need step-by-step audio reasoning rather than direct instruction following. Quality does not depend on the GPU vendor: the BF16 weights are identical, so these scores are expected to carry over from the NVIDIA recipe, apart from small numerical differences between backends.
- License: Apache-2.0.
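For readers unfamiliar with the metric in the quality note, character error rate is the character-level edit distance between the model's transcript and the reference, divided by the reference length. The sketch below shows that standard definition only; how the MOSS-Audio team normalizes text (case, punctuation, whitespace) before scoring is not stated here, and reading 11.58 as a percentage is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(f"{100 * cer('the quick brown fox', 'the quick brown fax'):.2f}% CER")
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis is much longer than the reference.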
For the full benchmark data once community submissions land, see /check/moss-audio/rx-7900-xtx.
Troubleshooting
"Torch not compiled with CUDA enabled" or a CUDA-build torch got installed
This means a CUDA wheel of PyTorch got pulled in instead of the ROCm build — most often because the editable pip install -e ".[torch-runtime]" ran with the upstream --extra-index-url .../whl/cu128 flag still attached, or before the ROCm torch was installed. Reinstall torch from the ROCm index:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() should return True (under HIP, ROCm exposes the GPU through the cuda device namespace).
flash-attn build fails on gfx1100
If you (or the editable install's optional extras) try to build the Dao-AILab flash-attn package, the C++/CK compile commonly fails on consumer RDNA3 cards — upstream FlashAttention's CK kernels target CDNA/MI accelerators, not gfx1100. You don't need it: MOSS-Audio defaults to eager/SDPA attention, so simply install without the flash-attn extra (pip install -e ".[torch-runtime]") and let PyTorch's ROCm SDPA handle attention. Remove any flash-attn line from a custom requirements file before installing.
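To see which attention implementation the downloaded checkpoint actually requests, you can scan its config.json for every "_attn_implementation" key. This is a hypothetical helper, not part of the MOSS-Audio repo; it searches recursively so it makes no assumption about where in the config the key is nested, and the path assumes the --local-dir from step 4.

```python
import json
from pathlib import Path


def find_attn_settings(node, path=""):
    """Yield (dotted_path, value) for every "_attn_implementation" key in node."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else key
            if key == "_attn_implementation":
                yield here, value
            else:
                yield from find_attn_settings(value, here)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_attn_settings(item, f"{path}[{i}]")


if __name__ == "__main__":
    # Assumed location: the --local-dir used in the download step.
    cfg = Path("weights/MOSS-Audio-4B-Instruct/config.json")
    if cfg.exists():
        config = json.loads(cfg.read_text(encoding="utf-8"))
        for where, value in find_attn_settings(config):
            print(f"{where} = {value}")
    else:
        print(f"{cfg} not found; download the weights first")
```

Any value of flash_attention_2 in the output would mean the checkpoint or a local edit is asking for the package this section tells you to avoid.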
Cannot import name or version errors during install
Keep MOSS-Audio inside its own conda environment exactly as the README prescribes — but with the ROCm torch wheel substituted for the CUDA one. If you reuse a system Python with a mismatched (or CUDA) torch, the audio encoder will fail to load. Recreate the env, install the ROCm wheel first (step 2), then the editable package (step 3).
Out of memory on long audio
The weight footprint (~10.4 GB BF16) is fixed, but activations and the KV cache scale with clip length and max_tokens. On the 24 GB 7900 XTX this is unlikely, but if a very long clip OOMs, shorten the audio segment or reduce max_tokens. No widely-reported OOMs on this card exist yet; report problems via the submission form.
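One way to shorten the audio segment is to cut a long recording into fixed windows and run each one separately. The sketch below only computes the windows and builds ffmpeg command lines (using the standard -ss and -t options); the helper names, the 60-second window, and the file names are illustrative assumptions, not values from the MOSS-Audio documentation.

```python
def split_windows(total_seconds, window_seconds, overlap_seconds=0.0):
    """Return (start, duration) pairs that cover the clip from 0 to total_seconds."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap_seconds must be in [0, window_seconds)")
    step = window_seconds - overlap_seconds
    windows = []
    start = 0.0
    while start < total_seconds:
        duration = min(window_seconds, total_seconds - start)
        windows.append((start, duration))
        if start + duration >= total_seconds:
            break  # this window already reaches the end of the clip
        start += step
    return windows


def ffmpeg_args(src, dst, start, duration):
    """Build an ffmpeg argv that cuts one window; nothing is executed here."""
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-t", f"{duration:.3f}",
            "-i", src, dst]


if __name__ == "__main__":
    # Hypothetical 150-second clip cut into 60-second windows.
    for n, (start, duration) in enumerate(split_windows(150, 60)):
        print(" ".join(ffmpeg_args("long.wav", f"chunk_{n:03d}.wav", start, duration)))
```

Timestamps the model reports for a chunk are relative to that chunk, so add the window's start offset when merging results. Cutting mid-sentence can hurt transcription at the boundaries; a small overlap_seconds reduces that at the cost of some duplicated text.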