MOSS-Audio 4B-Instruct on RTX 5070: local audio understanding in a tight 12 GB

What You'll Build

A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on an RTX 5070. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B LLM backbone with a from-scratch MOSS-Audio-Encoder (12.5 Hz frame rate) and a DeepStack-inspired cross-layer adapter, per the model card and the upstream GitHub README.

Hardware data: RTX 5070 (12 GB VRAM) · ~10.45 GB BF16 weights from the HF Files listing · See benchmark data

ℹ️ Not a TTS model. MOSS-Audio understands audio — it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is an audio-understanding model (audio in, text out).

⚠️ 12 GB is tight on this card. The 4B-Instruct BF16 weights alone are 10.45 GB, and a desktop RTX 5070 with a monitor attached exposes only roughly 10.5–11.3 GB usable. Weights plus activations and KV cache push the realistic peak over that desktop ceiling, so on a 12 GB card you should run headless (no display on the GPU → ~11.6 GB usable) and keep clips short with a capped max_tokens. This is a derived envelope, not a measured peak — see Results. If you want comfortable margin, a 16 GB card (e.g. the RTX 5070 Ti) is the safer home for this model.

⚠️ This recipe is the 4B-Instruct variant. The OpenMOSS release ships four models — MOSS-Audio-4B-Instruct, 4B-Thinking, 8B-Instruct and 8B-Thinking. The two 4B variants share a Qwen3-4B backbone (~4.6B total) and the same ~10.45 GB BF16 footprint. The 8B variants use a Qwen3-8B backbone (~8.6B total) and roughly double the weight footprint — out of scope for a 12 GB card at BF16. Pick 4B-Thinking over 4B-Instruct only if you need chain-of-thought audio reasoning (at the cost of more tokens per response); the install path is otherwise identical.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM consumer card (headless recommended on 12 GB)	RTX 5070 (12 GB)
RAM	16 GB	—
Storage	~11 GB for BF16 weights + cache	—
Software	Python 3.12, PyTorch (CUDA 12.8 / cu128 build), ffmpeg 7, `huggingface-hub` CLI	—

The BF16 weight shards on the official HF model page total 10.45 GB across three .safetensors files (4.99 + 4.67 + 0.78 GB, summing to 10,445,749,784 bytes per the HF tree API). That is the minimum VRAM the model needs just to load before activations and KV cache. On the RTX 5070's 12 GB that leaves only a slim margin — fine for short clips when running headless, but easy to exhaust on long audio or large max_tokens (see Troubleshooting).
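The footprint arithmetic above can be reproduced in a few lines. This is a minimal sketch: the byte total is the figure quoted from the HF tree API, and the 11.6 and 11.3 budgets are this recipe's own usable-VRAM estimates, not measurements. Because nvidia-smi reports memory in MiB, the sketch prints the weight size in both decimal GB (the unit this recipe uses) and binary GiB.

```python
# Byte total quoted above from the HF tree API (three BF16 safetensors shards).
WEIGHT_BYTES = 10_445_749_784

GB = 10**9    # decimal gigabyte, the unit used for the 10.45 GB figure
GIB = 2**30   # binary gibibyte; nvidia-smi reports memory in MiB (2**20 bytes)


def to_gb(n_bytes: int) -> float:
    return n_bytes / GB


def to_gib(n_bytes: int) -> float:
    return n_bytes / GIB


def headroom_gb(budget_gb: float, weight_bytes: int = WEIGHT_BYTES) -> float:
    """Budget left for activations and KV cache, in decimal GB."""
    return budget_gb - to_gb(weight_bytes)


if __name__ == "__main__":
    print(f"weights: {to_gb(WEIGHT_BYTES):.2f} GB = {to_gib(WEIGHT_BYTES):.2f} GiB")
    # 11.6 is this recipe's headless estimate, 11.3 its best-case desktop estimate.
    for label, budget in [("headless", 11.6), ("desktop, best case", 11.3)]:
        print(f"{label}: {headroom_gb(budget):.2f} GB left for everything else")
```

Treat the headroom figures as rough: the source does not say whether its usable-VRAM ranges are decimal or binary, so compare like with like when reading them against your own nvidia-smi output.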

Installation

The official setup, from the model card and GitHub repo, uses a dedicated conda environment with the CUDA 12.8 PyTorch wheel. The RTX 5070 is a Blackwell (GB205, sm_120) card — the cu128 wheel the upstream README already prescribes ships sm_120 kernels, so the standard install needs no modification. The default inference path uses PyTorch SDPA attention (see Running below), which has full sm_120 support; the optional FlashAttention 2 step is the one place where the Blackwell kernel gap bites (see Troubleshooting).

1. Clone the repo and create the environment

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
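Once the install finishes, a quick check like the one below can confirm that the wheel you ended up with actually ships kernels for the card. It is a sketch rather than part of the upstream setup: the helper names are made up for illustration, and it only uses standard torch.cuda queries. On an RTX 5070 the compute capability should come back as (12, 0), which maps to sm_120.

```python
def sm_tag(capability):
    """Turn a (major, minor) compute capability into the matching 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def report_gpu():
    """Print the torch build, the device architecture, and current free VRAM."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    tag = sm_tag(torch.cuda.get_device_capability(0))
    built_for = torch.cuda.get_arch_list()  # architectures compiled into this wheel
    free_b, total_b = torch.cuda.mem_get_info(0)
    print(f"torch {torch.__version__}, CUDA build {torch.version.cuda}")
    print(f"device: {torch.cuda.get_device_name(0)} ({tag})")
    print(f"wheel ships {tag} kernels: {tag in built_for}")
    print(f"free / total VRAM: {free_b / 1e9:.2f} / {total_b / 1e9:.2f} GB")


if __name__ == "__main__":
    report_gpu()
```

The free-VRAM line is also a cheap way to see how much of the 12 GB is really available before loading the model, for example with and without a display attached.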

2. (Optional) FlashAttention 2 — skip it on the RTX 5070

The upstream README offers FlashAttention 2 as an optional swap for the last install command, gated on hardware support: "If your GPU supports FlashAttention 2" you can install the flash-attn extra instead. On Blackwell (sm_120) cards including the RTX 5070, FA2 prebuilt wheels still lack sm_120 kernels as of mid-2026 — skip this step. The default SDPA path runs fine without it (see Troubleshooting for the tracking issue).

# Only if you have a known-good sm_120 FlashAttention 2 build (most users will NOT):
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

3. Download the 4B-Instruct weights

hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
  --local-dir ./weights/MOSS-Audio-4B-Instruct

This pulls the three BF16 safetensors shards (~10.45 GB total) plus the tokenizer and processor config. The README's generic example, huggingface-cli download OpenMOSS-Team/MOSS-Audio, targets the umbrella collection slug rather than a specific variant; pin the explicit MOSS-Audio-4B-Instruct repo as above so you get exactly the variant this recipe targets.
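To confirm the download is complete before loading anything onto the GPU, you can total the shard sizes on disk and compare them with the 10,445,749,784 bytes quoted earlier. The sketch below uses only the standard library; shard_report is a hypothetical helper name, and the directory is the --local-dir used above.

```python
from pathlib import Path


def shard_report(weights_dir):
    """Return (shard_count, total_bytes) for the *.safetensors files in weights_dir."""
    shards = sorted(Path(weights_dir).glob("*.safetensors"))
    return len(shards), sum(p.stat().st_size for p in shards)


if __name__ == "__main__":
    count, total = shard_report("./weights/MOSS-Audio-4B-Instruct")
    print(f"{count} shards, {total:,} bytes ({total / 1e9:.2f} GB)")
```

A shard count below three or a total well under 10.45 GB most likely means an interrupted download; rerunning the same hf download command should fetch whatever is missing.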

Running

The fastest path is the bundled infer.py script for one-shot inference. It loads the model with MossAudioModel.from_pretrained(..., trust_remote_code=True, dtype="auto", device_map=...) and requests no attn_implementation — so transformers selects PyTorch SDPA, which works out of the box on the RTX 5070's sm_120 architecture. Edit the two path constants at the top of infer.py first — the upstream default MODEL_PATH points at weights/MOSS-Audio-4B-Thinking, so set it to your 4B-Instruct download:

# In infer.py set:
#   MODEL_PATH = "weights/MOSS-Audio-4B-Instruct"
#   AUDIO_PATH = "/path/to/your/audio.wav"
python infer.py

The default prompt is Describe this audio. and the script also covers transcription, audio QA, and speech captioning — change the prompt string to switch tasks.

For batched / server-side serving with the patched SGLang fork (per the official usage guide):

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio-4B-Instruct --trust-remote-code

The SGLang server exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts audio via audio_url parts inside the chat message:

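import requests

# Assumes the SGLang server started above is listening on its default port (30000).
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},
                    {"type": "text", "text": "Transcribe and summarise this clip."},
                ],
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.0,
    },
    timeout=300,  # arbitrary but generous; long clips take a while to process
)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError on the next line

print(resp.json()["choices"][0]["message"]["content"])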

On a 12 GB card, prefer the single-shot infer.py path over standing up an SGLang server — a long-lived server reserves a static KV-cache pool that eats into the slim free margin. If you do serve, keep max_tokens low and the audio segments short.
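One way to keep segments short is to cut a long recording into fixed-length pieces before sending them to the model. The following is a minimal sketch using the standard-library wave module, so it only handles uncompressed PCM WAV input. The split_wav name and the 30-second default are illustrative choices, not upstream recommendations.

```python
import wave


def split_wav(src_path, chunk_seconds=30.0, prefix="chunk"):
    """Write consecutive chunks of at most chunk_seconds each; return their paths."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width, and rate
                dst.writeframes(frames)  # header frame count is corrected on close
            paths.append(out_path)
            index += 1
    return paths


if __name__ == "__main__":
    for path in split_wav("/path/to/your/audio.wav", chunk_seconds=30.0):
        print(path)
```

Fixed-length cuts can land mid-word, so timestamps and transcripts near chunk boundaries deserve a second look. For compressed formats, converting to WAV with the ffmpeg already installed in the environment is a reasonable first step.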

Results

VRAM usage: No vendor or community benchmark on an RTX 5070 (or any 12 GB consumer card) exists yet. The concrete lower bound is the 10.45 GB BF16 weight footprint computed from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API). With activations and KV cache layered on top, plan for a realistic peak of roughly 11 GB or more, which is why this recipe recommends running headless on a 12 GB card (a desktop RTX 5070 with a monitor attached exposes only ~10.5–11.3 GB usable). This is a derived envelope, not a measured peak; once a community submission lands, the live number will appear on the check page below.
Speed: Not quoted — no source benchmarks MOSS-Audio on the RTX 5070 or a comparable consumer GPU, and the RTX 5070's lower memory bandwidth and core count versus larger Blackwell cards mean a figure from another card would not transfer. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their 12-dimension ASR benchmark suite, with standout scores on non-speech vocalizations (4.01) and code-switching (10.11), per the model card. The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response — pick Thinking if you need step-by-step audio reasoning rather than direct instruction following.
License: Apache-2.0.
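For readers who want to spot-check transcription quality on their own clips, character error rate is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below implements that textbook definition and reports it as a percentage. It assumes the 11.58 figure above is also a percentage, and it applies no text normalization; the MOSS-Audio team's exact scoring pipeline is not described here, so numbers from this sketch are not directly comparable with theirs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(f"{cer('turn the lights off', 'turn the light of'):.2f}")
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 100.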

For the full benchmark data once community submissions land, see /check/moss-audio/rtx-5070.

Troubleshooting

Out of memory on a 12 GB RTX 5070

The weight footprint (~10.45 GB BF16) is fixed and already close to the usable budget of a 12 GB card. To stay under the ceiling: (1) run headless, with no display attached to the GPU, which frees ~0.5–1 GB of usable VRAM; (2) keep audio clips short and cap max_tokens so the KV cache stays small; (3) prefer the single-shot infer.py path over a long-lived SGLang server (which reserves a static KV pool). No OOMs on 12 GB cards have been widely reported yet; report your results via the submission form so the live envelope on the check page can replace this derived estimate.
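To get a feel for how clip length and max_tokens drive the KV cache, a back-of-the-envelope estimate can help. The sketch below rests on loudly stated assumptions: the layer count, KV-head count, and head dimension are what I understand the Qwen3-4B configuration to be and should be checked against the config.json in your weights directory, and it assumes each 12.5 Hz encoder frame occupies one position in the LLM context, which the source does not confirm. It covers the KV cache only, not activations, encoder buffers, or allocator overhead.

```python
import math

ENCODER_FRAME_RATE_HZ = 12.5  # from the model card, as quoted in this recipe

# ASSUMPTIONS: Qwen3-4B-style attention shape. Verify against config.json
# (num_hidden_layers, num_key_value_heads, head_dim) before relying on this.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # BF16


def kv_bytes_per_token():
    # Keys and values (the factor 2) for every layer and every KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def audio_frames(clip_seconds):
    return math.ceil(clip_seconds * ENCODER_FRAME_RATE_HZ)


def kv_cache_bytes(clip_seconds, prompt_tokens, max_tokens):
    # ASSUMPTION: one LLM position per encoder frame.
    positions = audio_frames(clip_seconds) + prompt_tokens + max_tokens
    return positions * kv_bytes_per_token()


if __name__ == "__main__":
    for seconds in (30, 120, 600):
        size = kv_cache_bytes(seconds, prompt_tokens=32, max_tokens=1024)
        print(f"{seconds:>4} s clip: ~{size / 1e9:.2f} GB of KV cache")
```

The estimate grows linearly with both clip length and max_tokens, which is the reason those are the two knobs to turn first. It says nothing about the static pool an SGLang server reserves up front.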

FlashAttention 2 fails to build or crashes on the RTX 5070 (sm_120)

The RTX 5070 is a Blackwell (GB205, compute capability 12.0 / sm_120) card. As of mid-2026, prebuilt FlashAttention 2 wheels do not ship sm_120 kernels, so installing the optional .[torch-runtime,flash-attn] extra and running with FA2 enabled can fail at the first inference call on RTX 50-series hardware — tracked upstream at Dao-AILab/flash-attention#2168. You don't need FA2: the bundled infer.py requests no specific attention backend, so transformers selects PyTorch SDPA, which has full sm_120 support. Skip step 2 of Installation unless you have a known-good sm_120 FA2 build.
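If you write your own loading code instead of using infer.py, the same decision can be made explicit. The helper below is a hypothetical sketch: pick_attn_implementation is not part of MOSS-Audio, and whether the model's from_pretrained honours the standard transformers attn_implementation argument is an assumption to verify. The two strings it returns are the usual transformers values for FlashAttention 2 and PyTorch SDPA.

```python
import importlib.util


def pick_attn_implementation(capability, flash_attn_installed=None,
                             trust_sm120_build=False):
    """Return 'flash_attention_2' or 'sdpa' for a (major, minor) compute capability.

    flash_attn_installed: None means "detect by looking for the flash_attn package".
    trust_sm120_build: set True only with a known-good sm_120 FlashAttention 2 build.
    """
    if flash_attn_installed is None:
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    is_sm120 = tuple(capability) == (12, 0)
    if flash_attn_installed and (not is_sm120 or trust_sm120_build):
        return "flash_attention_2"
    return "sdpa"  # safe default, and the path this recipe uses on the RTX 5070


if __name__ == "__main__":
    # (12, 0) is the RTX 5070; pass torch.cuda.get_device_capability(0) in practice.
    print(pick_attn_implementation((12, 0)))
```

Defaulting to SDPA on sm_120 even when flash_attn is importable is deliberate: an installed package does not guarantee that it contains kernels for the card.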

`Cannot import name` or version errors during install

The official install is an editable install of the repo against the CUDA 12.8 PyTorch wheel: pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]". If you reuse a system Python with an older torch, the audio encoder can fail to load. Keep MOSS-Audio inside its own conda environment, exactly as the README prescribes.
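When an import error does show up, it helps to confirm which torch build the active environment actually contains. The sketch below reads the installed package metadata without importing torch, so it still works when the import itself is what fails. The helper names are illustrative.

```python
from importlib import metadata


def cuda_tag(torch_version):
    """Extract the local build tag, e.g. 'cu128' from '2.9.1+cu128' (None if absent)."""
    _, sep, local = torch_version.partition("+")
    return local if sep else None


def installed_torch_version():
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("torch is not installed in this environment")
    else:
        tag = cuda_tag(version)
        print(f"torch {version}: build tag {tag}, is cu128: {tag == 'cu128'}")
```

A missing tag is inconclusive rather than wrong, since some wheels carry no local version suffix. A tag such as cu121 or cpu, though, is a clear sign that the environment is not the one the README describes.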

SGLang refuses to start with a cuDNN error

The MOSS-Audio SGLang fork recommends pinning nvidia-cudnn-cu12==9.16.0.29 for the default torch==2.9.1+cu128 runtime. Install it explicitly after pip install -e "python[all]", per the official usage guide. Without that pin, SGLang can fail its cuDNN compatibility check on CUDA 12.8 builds and the server never comes up.