MOSS-Audio 4B-Instruct on RTX 4070: local audio understanding in a tight 12 GB

What You'll Build

A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on an RTX 4070. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B LLM backbone with a from-scratch MOSS-Audio-Encoder (12.5 Hz frame rate) and a DeepStack-inspired cross-layer adapter, per the model card and the upstream GitHub README.

Hardware data: RTX 4070 (12 GB VRAM) · ~10.45 GB BF16 weights from the HF Files listing · See benchmark data

ℹ️ Not a TTS model. MOSS-Audio understands audio — it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is an audio-understanding model (audio in, text out).

⚠️ 12 GB is tight on this card. The 4B-Instruct BF16 weights alone are 10.45 GB, and a desktop RTX 4070 with a monitor attached exposes only roughly 10.5–11.3 GB usable. Weights plus activations and KV cache push the realistic peak over that desktop ceiling, so on a 12 GB card you should run headless (no display on the GPU → ~11.6 GB usable) and keep clips short with a capped max_tokens. This is a derived envelope, not a measured peak — see Results. If you want comfortable margin, a 16 GB card (e.g. the RTX 4070 Ti SUPER) is the safer home for this model.

⚠️ This recipe is the 4B-Instruct variant. The OpenMOSS release ships four models — MOSS-Audio-4B-Instruct, 4B-Thinking, 8B-Instruct and 8B-Thinking. The two 4B variants share a Qwen3-4B backbone (~4.6B total per the model card) and the same ~10.45 GB BF16 footprint. The 8B variants use a Qwen3-8B backbone (~8.6B total) and roughly double the weight footprint — out of scope for a 12 GB card at BF16. Pick 4B-Thinking over 4B-Instruct only if you need chain-of-thought audio reasoning (at the cost of more tokens per response); the install path is otherwise identical.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM consumer card (headless recommended on 12 GB)	RTX 4070 (12 GB)
RAM	16 GB	—
Storage	~11 GB for BF16 weights + cache	—
Software	Python 3.12, PyTorch (CUDA 12.8 / cu128 build), ffmpeg 7, `huggingface-hub` CLI	—

The BF16 weight shards on the official HF model page total 10.45 GB across three .safetensors files (roughly 4.99 + 4.67 + 0.78 GB after rounding; the exact total is 10,445,749,784 bytes per the HF tree API). That is the minimum VRAM the model needs just to load, before activations and KV cache. On the RTX 4070's 12 GB that leaves only a slim margin: fine for short clips when running headless, but easy to exhaust on long audio or large max_tokens (see Troubleshooting).
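
The headroom arithmetic behind that "slim margin" can be sketched in a few lines. This is only an illustration: the byte total and the usable-VRAM scenarios are the figures quoted in this recipe, and the sketch assumes the usable figures are in the same decimal GB as the weight total, which the sources do not state. Note that nvidia-smi reports memory in MiB, so the same weights read as about 9.73 GiB in binary units.

```python
# Total size of the three BF16 shards, as quoted above from the HF tree API.
WEIGHT_BYTES = 10_445_749_784


def to_gb(n_bytes: int) -> float:
    """Decimal gigabytes (10**9 bytes), the unit used for the 10.45 GB figure."""
    return n_bytes / 10**9


def to_gib(n_bytes: int) -> float:
    """Binary gibibytes (2**30 bytes); nvidia-smi reports in MiB, a binary unit."""
    return n_bytes / 2**30


# Usable-VRAM scenarios quoted in this recipe.
# ASSUMPTION: these are decimal GB, like the weight total.
USABLE_GB = {
    "desktop, low end": 10.5,
    "desktop, high end": 11.3,
    "headless": 11.6,
}


def headroom_gb(usable_gb: float, weight_bytes: int = WEIGHT_BYTES) -> float:
    """VRAM left for activations and KV cache once the weights are loaded."""
    return usable_gb - to_gb(weight_bytes)


if __name__ == "__main__":
    print(f"weights: {to_gb(WEIGHT_BYTES):.2f} GB = {to_gib(WEIGHT_BYTES):.2f} GiB")
    for name, usable in USABLE_GB.items():
        print(f"{name:>18}: {headroom_gb(usable):+.2f} GB headroom")
```

Under these assumptions the low-end desktop case leaves well under 0.1 GB after the weights load, while headless leaves a little over 1 GB, which is the gap the headless recommendation is meant to buy.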

Installation

The official setup, from the model card and GitHub repo, uses a dedicated conda environment with the CUDA 12.8 PyTorch wheel. The RTX 4070 is an Ada Lovelace (AD104, sm_89) card — the cu128 wheel the upstream README already prescribes ships sm_89 kernels, so the standard install needs no modification and no special wheel selection. The default inference path uses PyTorch SDPA attention (see Running below), which has full sm_89 support. FlashAttention 2 is an optional upstream swap; on Ada (unlike on the Blackwell sm_120 cards) the prebuilt FA2 wheels already ship sm_89 kernels, so FA2 works if you want it — but it is not required and the default SDPA path is what this recipe uses.

1. Clone the repo and create the environment

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
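
After the install, a quick sanity check of the environment can save a failed model load later. The snippet below is a minimal sketch, not part of the upstream repo: the function name check_environment is hypothetical, and it only uses standard torch.cuda queries. On an RTX 4070 you would expect the compute capability to read sm_89.

```python
def check_environment() -> dict:
    """Collect basic facts about the PyTorch/CUDA setup, or say what is missing."""
    try:
        import torch  # imported lazily so the script still runs without torch
    except ImportError:
        return {"ok": False, "reason": "torch is not installed in this environment"}

    if not torch.cuda.is_available():
        return {"ok": False, "reason": "torch cannot see a CUDA device"}

    major, minor = torch.cuda.get_device_capability(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # (free, total) in bytes
    return {
        "ok": True,
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "device": torch.cuda.get_device_name(0),
        "compute_capability": f"sm_{major}{minor}",
        "bf16_supported": torch.cuda.is_bf16_supported(),
        "free_gib": round(free_bytes / 2**30, 2),
        "total_gib": round(total_bytes / 2**30, 2),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the moss-audio environment before downloading weights; the free_gib value is also a convenient way to see how much VRAM a display attached to the GPU is already using.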

2. (Optional) FlashAttention 2 — works on the RTX 4070, but not required

The upstream README offers FlashAttention 2 as an optional swap for the last install command, gated on hardware support: "If your GPU supports FlashAttention 2, you can replace the last install command with" the flash-attn extra. The RTX 4070 is Ada Lovelace (sm_89), which prebuilt FA2 wheels do support — so this step is genuinely optional here (it works, but the default SDPA path runs fine without it):

# Optional on Ada sm_89 — FA2 prebuilt wheels include sm_89 kernels:
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

3. Download the 4B-Instruct weights

huggingface-cli download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
  --local-dir ./weights/MOSS-Audio-4B-Instruct

This pulls the three BF16 safetensors shards (~10.45 GB total) plus tokenizer and processor config. (The README's generic huggingface-cli download OpenMOSS-Team/MOSS-Audio example downloads the umbrella collection slug; pin the explicit MOSS-Audio-4B-Instruct repo above so you get exactly the variant this recipe targets.)
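
One way to confirm the download finished is to compare the on-disk shard sizes with the byte total quoted earlier. This is a sketch under stated assumptions: the helper names are hypothetical, it assumes the shards sit directly in the --local-dir used above, and the expected total only holds for the repository revision the 10,445,749,784-byte figure was taken from.

```python
from pathlib import Path

# Total quoted in this recipe from the HF tree API (assumed unchanged upstream).
EXPECTED_WEIGHT_BYTES = 10_445_749_784


def safetensors_total(weights_dir: str) -> tuple[int, int]:
    """Return (shard_count, total_bytes) for *.safetensors files in a directory."""
    root = Path(weights_dir)
    if not root.is_dir():
        return 0, 0
    shards = sorted(root.glob("*.safetensors"))
    return len(shards), sum(p.stat().st_size for p in shards)


def download_looks_complete(
    weights_dir: str,
    expected_bytes: int = EXPECTED_WEIGHT_BYTES,
    expected_shards: int = 3,
) -> bool:
    """True when shard count and combined size match the expected values."""
    count, total = safetensors_total(weights_dir)
    return count == expected_shards and total == expected_bytes


if __name__ == "__main__":
    path = "./weights/MOSS-Audio-4B-Instruct"
    count, total = safetensors_total(path)
    print(f"{count} shard(s), {total:,} bytes")
    print("complete" if download_looks_complete(path) else "incomplete or different revision")
```

A mismatch does not necessarily mean corruption; it can also mean the upstream files were updated, so treat the check as a hint rather than a checksum.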

Running

The fastest path is the bundled infer.py script for one-shot inference. It loads the model with MossAudioModel.from_pretrained(MODEL_PATH, trust_remote_code=True, dtype="auto", device_map=...) and requests no attn_implementation — so transformers selects PyTorch SDPA, which works out of the box on the RTX 4070's sm_89 architecture. Edit the two path constants at the top of infer.py first — the upstream default MODEL_PATH points at weights/MOSS-Audio-4B-Thinking, so set it to your 4B-Instruct download:

# In infer.py set:
#   MODEL_PATH = "weights/MOSS-Audio-4B-Instruct"
#   AUDIO_PATH = "/path/to/your/audio.wav"
python infer.py

The default prompt is "Describe this audio." The script also covers transcription, audio QA, and speech captioning; change the prompt string to switch tasks.

For batched / server-side serving with the patched SGLang fork (per the official usage guide):

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio-4B-Instruct --trust-remote-code

The SGLang server exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts audio via audio_url parts inside the chat message:

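import requests

# SGLang's OpenAI-compatible chat endpoint (port 30000 is the server default).
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Audio part first, then the text instruction for that clip.
                    {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},
                    {"type": "text", "text": "Transcribe and summarise this clip."},
                ],
            }
        ],
        "max_tokens": 1024,  # lower this on a 12 GB card to keep the KV cache small
        "temperature": 0.0,  # greedy decoding for repeatable transcripts
    },
    timeout=300,  # client-side limit in seconds; tune for your clip length
)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError below

print(resp.json()["choices"][0]["message"]["content"])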

On a 12 GB card, prefer the single-shot infer.py path over standing up an SGLang server — a long-lived server reserves a static KV-cache pool that eats into the slim free margin. If you do serve, keep max_tokens low and the audio segments short.
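
To make "short" concrete before sending a clip, you can measure its duration and estimate how many encoder frames it produces. The sketch below uses only the standard library and the 12.5 Hz frame rate quoted for MOSS-Audio-Encoder. Two things are assumptions, not upstream facts: the 60-second default limit is an arbitrary illustrative threshold, and treating each encoder frame as one position in the LLM context is a simplification. The wave module reads PCM WAV only, so convert other formats with ffmpeg first.

```python
import math
import wave

ENCODER_FRAME_RATE_HZ = 12.5  # frame rate quoted in this recipe for the encoder


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def estimated_audio_frames(duration_s: float) -> int:
    """Encoder frames for a clip of this length, rounded up."""
    return math.ceil(duration_s * ENCODER_FRAME_RATE_HZ)


def check_clip(path: str, max_seconds: float = 60.0) -> dict:
    """Summarise a clip; max_seconds is a hypothetical limit, not an upstream value."""
    duration = wav_duration_seconds(path)
    return {
        "duration_s": duration,
        "audio_frames": estimated_audio_frames(duration),
        "within_limit": duration <= max_seconds,
    }


# Example use (path is a placeholder):
#   print(check_clip("/path/to/your/audio.wav"))
```

If a clip exceeds whatever limit works on your card, splitting it into segments and running them one at a time keeps the per-request context short.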

Results

VRAM usage: No vendor or community benchmark on an RTX 4070 (or any 12 GB consumer card) exists yet. The concrete lower bound is the 10.45 GB BF16 weight footprint computed from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API). With activations and KV cache layered on top, plan on a realistic peak of roughly 11 GB or more, which is why this recipe recommends running headless on a 12 GB card (a desktop RTX 4070 with a monitor exposes only ~10.5–11.3 GB usable). This is a derived envelope, not a measured peak; once a community submission lands, the live number will appear on the check page below.
Speed: Not quoted — no source benchmarks MOSS-Audio on the RTX 4070 or a comparable consumer GPU, so there is no number to attribute honestly. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their 12-dimension ASR benchmark suite, with standout scores on non-speech vocalizations (4.01) and code-switching (10.11), per the model card ASR table. The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response — pick Thinking if you need step-by-step audio reasoning rather than direct instruction following.
License: Apache-2.0, per the model card ("Models in MOSS-Audio are licensed under the Apache License 2.0.").

For the full benchmark data once community submissions land, see /check/moss-audio/rtx-4070.

Troubleshooting

Out of memory on a 12 GB RTX 4070

The weight footprint (~10.45 GB BF16) is fixed and already close to the usable budget of a 12 GB card. To stay under the ceiling: (1) run headless, with no display attached to the GPU, which frees ~0.5–1 GB of usable VRAM; (2) keep audio clips short and cap max_tokens so the KV cache stays small; (3) prefer the single-shot infer.py path over a long-lived SGLang server (which reserves a static KV pool). There are no widely reported OOMs on 12 GB cards yet; report your results via the submission form so the live envelope on the check page can replace this derived estimate.
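
For a rough sense of how clip length and max_tokens translate into KV-cache size, here is a back-of-the-envelope sketch. Everything in KVConfig is an assumption: the layer count, KV-head count, and head dimension are the values I recall from the public Qwen3-4B configuration, and MOSS-Audio may differ, so check them against the config.json in your download. The estimate also assumes one context position per 12.5 Hz encoder frame and a hypothetical 32-token text prompt. It covers a single sequence only; it does not model activations, the audio encoder, or the preallocated pool of a serving engine.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KVConfig:
    # ASSUMED values (recalled from the public Qwen3-4B config); verify locally.
    num_layers: int = 36
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # BF16


def kv_bytes_per_token(cfg: KVConfig = KVConfig()) -> int:
    """Bytes of KV cache per context position (factor 2 = keys plus values)."""
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value


def kv_cache_bytes(
    audio_seconds: float,
    max_tokens: int,
    prompt_tokens: int = 32,       # hypothetical text-prompt length
    frame_rate_hz: float = 12.5,   # encoder frame rate quoted in this recipe
    cfg: KVConfig = KVConfig(),
) -> int:
    """KV cache for one sequence: audio frames + prompt + generated tokens."""
    audio_tokens = math.ceil(audio_seconds * frame_rate_hz)
    return (audio_tokens + prompt_tokens + max_tokens) * kv_bytes_per_token(cfg)


if __name__ == "__main__":
    for seconds, cap in [(30, 256), (60, 1024), (600, 1024)]:
        gb = kv_cache_bytes(seconds, cap) / 10**9
        print(f"{seconds:>4} s clip, max_tokens={cap:<5} -> ~{gb:.2f} GB KV cache")
```

Under these assumptions each context position costs 147,456 bytes, so a 60-second clip with max_tokens=1024 needs roughly 0.27 GB of KV cache. Treat this as one component of the peak rather than the whole picture, since activation memory is not captured here and no measured peak exists yet.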

`Cannot import name` or version errors during install

The official install command, pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]", installs the MOSS-Audio repo in editable mode and pulls PyTorch from the CUDA 12.8 (cu128) wheel index. If you reuse a system Python with an older torch, the audio encoder will fail to load. Keep MOSS-Audio inside its own conda environment exactly as the README prescribes. The cu128 wheel already includes Ada sm_89 kernels, so no Ada-specific wheel swap is needed (this is the one place the Blackwell sm_120 recipes differ: they have a FlashAttention 2 kernel gap that Ada does not).

SGLang refuses to start with a CuDNN error

The MOSS-Audio SGLang fork recommends pinning nvidia-cudnn-cu12==9.16.0.29 for the default torch==2.9.1+cu128 runtime — install it explicitly after pip install -e "python[all]", per the official usage guide. Without that pin, SGLang can fail its CuDNN compatibility check on CUDA 12.8 builds and the server never comes up.