What You'll Build
A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B backbone with a from-scratch MOSS-Audio-Encoder (~4.6B params total), per the model card and the upstream GitHub README.
Hardware data: RX 7800 XT (16GB VRAM) · ~10.4 GB BF16 weights · PyTorch on ROCm 7.2 · See benchmark data
⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack, so there is no cu128 wheel here. You install the ROCm PyTorch wheel (--index-url https://download.pytorch.org/whl/rocm7.2) instead of the CUDA one. The good news: MOSS-Audio ships with "_attn_implementation": "eager" in the audio encoder block of its config.json. It is already SDPA/eager, not FlashAttention-2, so it ports cleanly to ROCm via PyTorch's scaled-dot-product attention with no FA2 build step. RDNA3 also has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so there is no FP8 quant path on this card, and at 16 GB you don't need one: the ~10.4 GB BF16 weights fit comfortably with room for activations and the KV cache. If a guide tells you to pip install flash-attn, pick a cu12x wheel, or load an FP8 checkpoint for this card, it's written for the wrong vendor.
ℹ️ Not a TTS model. MOSS-Audio understands audio; it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. The OpenMOSS team ships speech synthesis separately as MOSS-TTS; don't conflate the two. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is audio-to-text.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 11 GB VRAM (ROCm-supported AMD card) | RX 7800 XT (16 GB) |
| RAM | 16 GB | — |
| Storage | ~11 GB for BF16 weights + cache | — |
| Driver | AMD ROCm 7.2.x on Linux | — |
| Software | Python 3.12, PyTorch (ROCm 7.2 build), ffmpeg 7, huggingface-hub CLI | — |
The BF16 weight shards on the official HF model page total 10.44 GB across three .safetensors files (4.99 + 4.67 + 0.78 GB, per the HF tree API — 10,445,749,784 bytes). That is the minimum VRAM the model needs just to load before activations and KV cache; in practice, plan on ~11–12 GB for short clips and rising with audio length. The RX 7800 XT's 16 GB budget fits the BF16 weights with comfortable headroom for short- to medium-length audio — at this size there is no quantization tradeoff to consider, so run the native BF16 weights.
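The headroom arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the shard sizes and byte total are the figures quoted in this section, while treating the card's "16 GB" as 16 GiB is an assumption (the usable amount reported by the driver may be slightly lower).

```python
# Figures quoted in this guide; VRAM_BYTES is an assumption, not a measurement.
SHARD_GB = [4.99, 4.67, 0.78]   # decimal GB per shard, as listed on the HF page
TOTAL_BYTES = 10_445_749_784    # exact shard total quoted from the HF tree API
VRAM_BYTES = 16 * 2**30         # assumption: the "16 GB" card exposes 16 GiB


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gibibytes."""
    return n_bytes / 2**30


weights_gib = to_gib(TOTAL_BYTES)
headroom_gib = to_gib(VRAM_BYTES - TOTAL_BYTES)

print(f"shard sum: {sum(SHARD_GB):.2f} GB (decimal)")
print(f"weights:   {TOTAL_BYTES / 1e9:.3f} GB = {weights_gib:.3f} GiB")
print(f"headroom:  {headroom_gib:.3f} GiB left for activations and KV cache")
```

The decimal/binary distinction matters here: the weights are about 10.4 GB in decimal units but under 10 GiB in the binary units most GPU tools report.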
Installation
The official setup, from the model card and GitHub repo, uses a dedicated conda environment. The upstream README prescribes a CUDA-12.8 PyTorch wheel — that line is NVIDIA-specific. On the RX 7800 XT you swap it for the ROCm wheel; everything else (Python 3.12, ffmpeg 7, the editable pip install -e) is unchanged, because the model is pure-PyTorch transformers with no CUDA-only kernels.
1. Clone the repo and create the environment
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
2. Install PyTorch for ROCm
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux, so it uses the stable ROCm PyTorch wheel. Install PyTorch first from the ROCm index, before the editable package, so the editable install doesn't pull a CUDA build:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. As of this writing the stable wheel is rocm7.2, but the rocmX.Y tag moves over time (6.3 → 6.4 → 7.x). Read the current "Install PyTorch for ROCm" line at pytorch.org/get-started/locally (or the ComfyUI README "AMD GPUs (Linux)" section, which currently also pins rocm7.2) before running. AMD also ships its own Radeon-tuned wheels at repo.radeon.com if you prefer the vendor build; the upstream whl/rocm7.2 wheel above is the canonical community path and is sufficient for this model.
3. Install the MOSS-Audio package
With the ROCm torch already in place, install the repo's runtime dependencies editable. Drop the --extra-index-url .../whl/cu128 flag from the upstream command — you do not want it to reach for CUDA wheels:
pip install -e ".[torch-runtime]"
ℹ️ Skip the optional FlashAttention extra. The upstream README lists flash-attn as an optional install (.[torch-runtime,flash-attn]) for lower memory pressure on long audio, but the model already defaults to eager/SDPA attention ("_attn_implementation": "eager" in the audio encoder config), so it runs without it. On RDNA3 the upstream Dao-AILab flash-attn build is CDNA/MI-only and commonly fails to compile on gfx1101, so do not add the flash-attn extra here. PyTorch's SDPA on ROCm is the attention path, and it covers this model fully. Install plain:
# Correct for AMD: NO flash-attn extra, NO cu128 index
pip install -e ".[torch-runtime]"
4. Download the 4B-Instruct weights
hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
--local-dir ./weights/MOSS-Audio-4B-Instruct
This pulls the three BF16 safetensors shards (~10.4 GB total) plus tokenizer and processor config.
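To confirm the download completed, the shard count and byte total on disk can be compared with the figures quoted earlier. This is a minimal sketch: `check_weights` is a hypothetical helper, it assumes the shards sit at the top level of the `--local-dir` used above, and the expected total will go stale if upstream re-uploads the weights.

```python
from pathlib import Path

EXPECTED_SHARDS = 3               # shard count quoted in this guide
EXPECTED_BYTES = 10_445_749_784   # shard total quoted in this guide


def check_weights(weights_dir: str) -> tuple[int, int]:
    """Return (shard_count, total_bytes) for *.safetensors files in weights_dir."""
    shards = sorted(Path(weights_dir).glob("*.safetensors"))
    return len(shards), sum(p.stat().st_size for p in shards)


if __name__ == "__main__":
    count, total = check_weights("./weights/MOSS-Audio-4B-Instruct")
    print(f"{count} shards, {total:,} bytes")
    if (count, total) != (EXPECTED_SHARDS, EXPECTED_BYTES):
        print("Mismatch: the download may be incomplete, or upstream changed the files.")
```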
Running
The fastest path is the bundled infer.py script for one-shot inference:
# Edit MODEL_PATH and AUDIO_PATH inside infer.py, then:
python infer.py
The default prompt is "Describe this audio." The script also covers transcription and audio QA; change the prompt to switch tasks.
For an interactive UI:
python app.py
ROCm presents itself to PyTorch under the cuda device namespace (HIP masquerades as CUDA), so the upstream scripts that call .to("cuda") / torch.cuda.is_available() work unmodified — no code edits are needed for AMD. Confirm the runtime sees the GPU:
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
The version string should carry a +rocm7.2-style suffix, and torch.cuda.is_available() should print True.
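A slightly fuller check distinguishes a ROCm build from a CUDA or CPU-only build, using `torch.version.hip` (which is `None` on non-ROCm builds). This is a sketch only; `describe_torch_build` and `verdict` are illustrative helper names, not part of MOSS-Audio or PyTorch.

```python
def describe_torch_build() -> dict:
    """Collect the facts that separate a ROCm build from a CUDA or CPU build."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    info = {
        "installed": True,
        "version": torch.__version__,
        "hip": torch.version.hip,    # None unless this is a ROCm build
        "cuda": torch.version.cuda,  # None unless this is a CUDA build
        "gpu_visible": torch.cuda.is_available(),
    }
    if info["gpu_visible"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


def verdict(info: dict) -> str:
    """Turn the collected facts into a one-line diagnosis."""
    if not info.get("installed"):
        return "torch is not installed in this environment"
    if info.get("hip") is None:
        return "not a ROCm build: reinstall torch from the ROCm index"
    if not info.get("gpu_visible"):
        return "ROCm build, but no GPU visible: check the ROCm driver install"
    return f"OK: ROCm {info['hip']} on {info.get('device', 'unknown device')}"


if __name__ == "__main__":
    info = describe_torch_build()
    print(info)
    print(verdict(info))
```

Run it inside the moss-audio environment; a "not a ROCm build" verdict points to the first Troubleshooting entry.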
ℹ️ Note on the SGLang server path. The upstream usage guide offers an optional patched SGLang fork for batched/server-side serving, but its setup pins nvidia-cudnn-cu12 and a CUDA torch build. That path is NVIDIA-only and does not apply to the RX 7800 XT. On AMD, stay on the transformers path (infer.py / app.py) shown above, which is fully supported on ROCm. If you need an OpenAI-compatible server on AMD, vLLM-on-ROCm lists gfx1101, but treat that as out-of-scope here and verify separately.
Results
- Speed: Not quoted. No source benchmarks MOSS-Audio on a Radeon RX 7800 XT (or a close AMD compute-sibling), and we do not transfer iterations-per-second from a different card or vendor. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
- VRAM usage: No vendor benchmark on this card exists yet. The concrete lower bound is the 10.44 GB BF16 weight footprint computed from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API). Plan on ~11–12 GB for short clips once activations and KV cache are included, rising with audio length and max_tokens. That fits the 16 GB 7800 XT with headroom, though long-form audio with a large max_tokens will eat into it (see Troubleshooting). This is a derived envelope, not a measured peak; once community submissions land, the live number will appear on the check page below.
- Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their diverse ASR benchmark suite, per the model card (the 11.30 figure on that card belongs to the larger 8B-Instruct). The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response; pick Thinking if you need step-by-step audio reasoning rather than direct instruction following. Numeric quality is architecture-independent: these scores carry over from the NVIDIA recipe unchanged, since the BF16 weights are identical across vendors.
- License: Apache-2.0.
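To get a feel for how the KV cache part of the VRAM envelope grows with context length, here is a rough estimate. It is a sketch under stated assumptions: the layer count, KV-head count, and head dimension are assumed values for a Qwen3-4B-style decoder and should be confirmed against the checkpoint's config.json. It counts only the BF16 KV cache, not audio-encoder activations or attention score buffers.

```python
# Assumed decoder geometry (Qwen3-4B-style); confirm against config.json.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # BF16


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes for keys plus values across all layers at a given context length."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (1_024, 8_192, 32_768):
    print(f"{tokens:>6} tokens -> {kv_cache_bytes(tokens) / 2**30:.3f} GiB")
```

Under these assumptions the cache costs 144 KiB per token, so it stays small for short clips and only becomes a meaningful share of the headroom at very long contexts.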
For the full benchmark data once community submissions land, see /check/moss-audio/rx-7800-xt.
Troubleshooting
"Torch not compiled with CUDA enabled" or a CUDA-build torch got installed
This means a CUDA wheel of PyTorch got pulled in instead of the ROCm build — most often because the editable pip install -e ".[torch-runtime]" ran with the upstream --extra-index-url .../whl/cu128 flag still attached, or before the ROCm torch was installed. Reinstall torch from the ROCm index:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm7.2-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP).
flash-attn build fails on gfx1101
If you (or the editable install's optional extras) try to build the Dao-AILab flash-attn package, the C++/CK compile commonly fails on consumer RDNA3 cards — upstream FlashAttention's CK kernels target CDNA/MI accelerators, not gfx1101. You don't need it: MOSS-Audio defaults to eager/SDPA attention, so simply install without the flash-attn extra (pip install -e ".[torch-runtime]") and let PyTorch's ROCm SDPA handle attention. Remove any flash-attn line from a custom requirements file before installing.
A library ships only gfx1100 kernels and won't load on the 7800 XT
The 7800 XT is gfx1101 (Navi 32), while the flagship 7900 XTX is gfx1100 (Navi 31). Most of the ROCm stack and transformers/PyTorch ship kernels for both, but occasionally a prebuilt extension only carries gfx1100 kernels and refuses to run on gfx1101. The standard Linux-only fallback is to mask the card as gfx1100 at runtime:
HSA_OVERRIDE_GFX_VERSION=11.0.0 python infer.py
This is a legacy fallback, not a default — the stable ROCm PyTorch wheel runs MOSS-Audio natively on gfx1101 without it. Only reach for it if you hit a "no kernel image is available" / missing-gfx1101-kernel error from a specific library.
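If you do need the override, scoping it to a single child process keeps it out of your shell and other jobs. This is a minimal sketch; `env_with_gfx_override` and `run_with_override` are hypothetical helper names, and `infer.py` is the upstream script referenced above.

```python
import os
import subprocess
import sys


def env_with_gfx_override(version: str = "11.0.0") -> dict:
    """Copy the current environment and add the override for a child process."""
    env = dict(os.environ)  # copy, so the parent environment is not modified
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env


def run_with_override(script: str = "infer.py") -> int:
    """Run the script with the override applied to that process only."""
    result = subprocess.run([sys.executable, script], env=env_with_gfx_override())
    return result.returncode
```

Calling `run_with_override()` from the repo root is equivalent to the one-line shell form above.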
"Cannot import name" or version errors during install
Keep MOSS-Audio inside its own conda environment exactly as the README prescribes — but with the ROCm torch wheel substituted for the CUDA one. If you reuse a system Python with a mismatched (or CUDA) torch, the audio encoder will fail to load. Recreate the env, install the ROCm wheel first (step 2), then the editable package (step 3).
Out of memory on long audio
The weight footprint (~10.4 GB BF16) is fixed, but activations and the KV cache scale with clip length and max_tokens. On the 16 GB 7800 XT short and medium clips fit comfortably, but a very long clip with a large max_tokens can push past the budget — if a clip OOMs, shorten the audio segment or reduce max_tokens. No widely-reported OOMs on this card exist yet; report problems via the submission form.
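One way to shorten the audio is to cut a long file into fixed-length windows and run each one separately. The sketch below computes the windows and builds an ffmpeg command for each, using the standard `-ss` (start) and `-t` (duration) options. The 30-second default and both function names are illustrative assumptions, not values from the MOSS-Audio docs.

```python
def segment_windows(duration_s: float, window_s: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into back-to-back (start, length) windows in seconds."""
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it ends exactly at duration_s.
        windows.append((start, min(window_s, duration_s - start)))
        start += window_s
    return windows


def ffmpeg_cut_command(src: str, dst: str, start: float, length: float) -> list[str]:
    """Build an ffmpeg argument list that extracts one window from src into dst."""
    return ["ffmpeg", "-y", "-i", src, "-ss", f"{start:.3f}", "-t", f"{length:.3f}", dst]


if __name__ == "__main__":
    for i, (start, length) in enumerate(segment_windows(70.0)):
        print(" ".join(ffmpeg_cut_command("long.wav", f"part_{i:03d}.wav", start, length)))
```

Timestamps the model returns for a segment are relative to that segment, so add the window's start offset when stitching transcripts back together.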