What You'll Build
A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on an RTX 5070 Ti. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese. Per the model card and the upstream GitHub README, it is built on a Qwen3-4B LLM backbone with a from-scratch MOSS-Audio-Encoder (12.5 Hz frame rate) and a DeepStack-inspired cross-layer adapter.
Hardware data: RTX 5070 Ti (16 GB VRAM) · ~10.4 GB BF16 weights from the HF Files listing · See benchmark data
ℹ️ Not a TTS model. MOSS-Audio understands audio; it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. The OpenMOSS team ships speech synthesis separately as MOSS-TTS; don't conflate the two. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is audio-to-text.
⚠️ This recipe is the 4B-Instruct variant. The OpenMOSS release ships four models — MOSS-Audio-4B-Instruct, 4B-Thinking, 8B-Instruct and 8B-Thinking. The two 4B variants share a Qwen3-4B backbone (~4.6B total) and the same ~10.4 GB BF16 footprint, so they fit the RTX 5070 Ti's 16 GB cleanly. The 8B variants use a Qwen3-8B backbone (~8.6B total) and roughly double the weight footprint — out of scope for a 16 GB card at BF16. Pick 4B-Thinking over 4B-Instruct only if you need chain-of-thought audio reasoning (at the cost of more tokens per response); the install path is otherwise identical.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM consumer card | RTX 5070 Ti (16 GB) |
| RAM | 16 GB | — |
| Storage | ~11 GB for BF16 weights + cache | — |
| Software | Python 3.12, PyTorch (CUDA 12.8 / cu128 build), ffmpeg 7, huggingface-hub CLI | — |
The BF16 weight shards on the official HF model page total 10.44 GB across three .safetensors files (4.99 + 4.67 + 0.78 GB, summing to 10,445,749,784 bytes per the HF tree API). That is the minimum VRAM the model needs just to load before activations and KV cache; in practice, plan on ~12 GB for short clips and rising with audio length. The RTX 5070 Ti's 16 GB budget leaves comfortable headroom, and matches the lower bound of the "16-24GB consumer-grade VRAM" envelope quoted in the SoftTechHub release coverage.
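The footprint arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the shard sizes and byte total quoted in this section; the function names are illustrative, and treating the card's "16 GB" as 16 GiB is an assumption, so read the headroom figure as approximate.

```python
# Shard sizes in decimal GB and the exact byte total, both quoted in the text above.
SHARD_SIZES_GB = [4.99, 4.67, 0.78]
TOTAL_BYTES = 10_445_749_784


def weights_gb(total_bytes: int) -> float:
    """Decimal gigabytes (10**9 bytes), the unit the quoted shard sizes use."""
    return total_bytes / 10**9


def weights_gib(total_bytes: int) -> float:
    """Binary gibibytes (2**30 bytes), the unit most GPU tools report."""
    return total_bytes / 2**30


def headroom_gib(total_bytes: int, vram_gib: float = 16.0) -> float:
    """VRAM left after loading weights only (no activations, no KV cache).

    Assumption: the card exposes vram_gib GiB; real usable VRAM is a bit lower.
    """
    return vram_gib - weights_gib(total_bytes)


if __name__ == "__main__":
    print(f"sum of listed shards: {sum(SHARD_SIZES_GB):.2f} GB")
    print(f"exact byte total:     {weights_gb(TOTAL_BYTES):.3f} GB")
    print(f"same total in GiB:    {weights_gib(TOTAL_BYTES):.3f} GiB")
    print(f"weights-only headroom on a 16 GiB card: {headroom_gib(TOTAL_BYTES):.2f} GiB")
```

The headroom number covers weights only; activations and the KV cache come out of it, which is why the text plans on roughly 12 GB for short clips.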
Installation
The official setup, from the model card and GitHub repo, uses a dedicated conda environment with the CUDA 12.8 PyTorch wheel. The RTX 5070 Ti is a Blackwell (GB203, sm_120) card — the cu128 wheel the upstream README already prescribes ships sm_120 kernels, so the standard install needs no modification. The default inference path uses PyTorch SDPA attention (see Running below), which has full sm_120 support; the optional FlashAttention 2 step is the one place where the Blackwell kernel gap bites (see Troubleshooting).
1. Clone the repo and create the environment
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
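Before moving on, it can help to confirm that the installed PyTorch build actually carries kernels for the card. The following is a rough sanity check, not part of the upstream README: `build_has_kernels` is a hypothetical helper, and it only looks for an exact `sm_XY` entry in the build's architecture list (it ignores PTX forward compatibility).

```python
def build_has_kernels(arch_list, capability):
    """True if the torch build lists native kernels for this compute capability.

    arch_list  : e.g. the result of torch.cuda.get_arch_list()
    capability : (major, minor), e.g. the result of torch.cuda.get_device_capability()
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def main():
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    if not torch.cuda.is_available():
        print("no CUDA device visible to torch")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0), "capability:", capability)
    print("native kernels for this device:", build_has_kernels(arch_list, capability))


if __name__ == "__main__":
    main()
```

On an RTX 5070 Ti the capability should read (12, 0); if the check prints False, the environment is probably picking up a torch wheel other than the cu128 build.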
2. (Optional) FlashAttention 2 — see the Blackwell caveat first
The upstream README lists FlashAttention 2 as an optional install for lower memory pressure on long audio, but gates it on hardware support ("If your GPU supports FlashAttention 2"). On Blackwell (sm_120) cards including the RTX 5070 Ti, FA2 prebuilt wheels still lack sm_120 kernels as of mid-2026 — skip this step unless you have a working sm_120 FA2 build (see Troubleshooting). The default SDPA path runs fine without it.
# Only if you have a working sm_120 FlashAttention 2 build:
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
3. Download the 4B-Instruct weights
hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
  --local-dir ./weights/MOSS-Audio-4B-Instruct
This pulls the three BF16 safetensors shards (~10.4 GB total) plus tokenizer and processor config.
Running
The fastest path is the bundled infer.py script for one-shot inference. It loads the model with MossAudioModel.from_pretrained(..., dtype="auto", device_map="cuda:0") and does not request FlashAttention — so it falls back to PyTorch SDPA, which works out of the box on the RTX 5070 Ti's sm_120 architecture:
# Edit MODEL_PATH and AUDIO_PATH inside infer.py, then:
python infer.py
The default prompt is "Describe this audio." The script also covers transcription, audio QA, and speech captioning; change the prompt to switch tasks.
For batched / server-side serving with the patched SGLang fork (per the official usage guide):
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio-4B-Instruct --trust-remote-code
The SGLang server exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts audio via audio_url parts inside the chat message:
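import requests

# 30000 is SGLang's default port; change it if the server was started on another one.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},
                    {"type": "text", "text": "Transcribe and summarise this clip."},
                ],
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.0,
    },
    timeout=600,  # long clips can take a while to process
)
resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
print(resp.json()["choices"][0]["message"]["content"])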
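Since switching tasks comes down to changing the text part of the message, the request body can be built by a small helper. This is a sketch under assumptions: `build_payload` and the `PROMPTS` table are hypothetical, and only the first two prompt strings appear in this guide; the third is illustrative wording that has not been checked against the model card.

```python
import json

# Only "describe" and "transcribe_summarise" use prompts quoted in this guide.
PROMPTS = {
    "describe": "Describe this audio.",
    "transcribe_summarise": "Transcribe and summarise this clip.",
    # Hypothetical wording, shown only to illustrate task switching:
    "transcribe_timestamps": "Transcribe this audio with sentence-level timestamps.",
}


def build_payload(audio_path: str, task: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions body with one audio part and one text part."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task {task!r}; choose from {sorted(PROMPTS)}")
    return {
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_path}},
                    {"type": "text", "text": PROMPTS[task]},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


if __name__ == "__main__":
    # Print the body; send it with requests.post(..., json=payload) as shown above.
    print(json.dumps(build_payload("/path/to/audio.wav", "describe"), indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to inspect or log the request without a running server.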
Results
- VRAM usage: No vendor or community benchmark on an RTX 5070 Ti exists yet. The concrete lower bound is the 10.44 GB BF16 weight footprint computed from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API); a third-party analysis estimates that the 4B variants fit on a single consumer-grade GPU with 16-24GB VRAM (SoftTechHub coverage of the MOSS-Audio release). Both figures are consistent with running cleanly on the RTX 5070 Ti's 16 GB budget for short-to-medium clips; long-form audio plus a large max_tokens will shrink that headroom. This is a derived envelope, not a measured peak. Once community submissions land, the live number will appear on the check page below.
- Speed: Not quoted, because no source benchmarks MOSS-Audio on a comparable consumer GPU. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
- Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their 12-dimension ASR benchmark suite, with standout scores on non-speech vocalizations (4.01) and code-switching (10.11), per the model card. The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response; pick Thinking if you need step-by-step audio reasoning rather than direct instruction following.
- License: Apache-2.0.
For the full benchmark data once community submissions land, see /check/moss-audio/rtx-5070-ti.
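For readers unfamiliar with the CER figures quoted in the quality notes, character error rate is conventionally the character-level edit distance between the reference and the model transcript, divided by the reference length. The sketch below shows that standard definition; reading the quoted scores as percentages, and any text normalisation the MOSS-Audio team applies before scoring, are assumptions not confirmed here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "turn the lights off"
    hypothesis = "turn the light of"
    print(f"CER: {cer_percent(reference, hypothesis):.2f}%")
```

Lower is better, and because insertions count as errors, CER can exceed 100% on very poor transcripts.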
Troubleshooting
FlashAttention 2 fails to build or crashes on the RTX 5070 Ti (sm_120)
The RTX 5070 Ti is a Blackwell (GB203, compute capability 12.0 / sm_120) card. As of mid-2026, prebuilt FlashAttention 2 wheels do not ship sm_120 kernels, so installing the optional .[torch-runtime,flash-attn] extra and running with FA2 enabled can fail at the first inference call on RTX 50-series hardware — tracked upstream at Dao-AILab/flash-attention#2168. You don't need FA2: the bundled infer.py requests no specific attention backend, so transformers selects PyTorch SDPA, which has full sm_120 support. Skip step 2 of Installation unless you have a known-good sm_120 FA2 build.
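If you write your own loading code instead of using infer.py, the same decision can be made explicit. This is a sketch only: `pick_attn_implementation` is a hypothetical helper, the returned strings are the Hugging Face transformers `attn_implementation` values, and whether MossAudioModel.from_pretrained accepts that argument is an assumption to check against the repo.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True if a flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(capability, has_flash_attn, trust_flash_on_blackwell=False):
    """Choose an attention backend name from the GPU compute capability.

    capability: (major, minor) tuple, e.g. (12, 0) for the RTX 5070 Ti.
    Falls back to "sdpa" unless FlashAttention 2 is installed and plausible:
    FA2 targets Ampere (8.x) and newer, and on 12.x it is only chosen when the
    caller states they have a known-good sm_120 build.
    """
    major = capability[0]
    if not has_flash_attn or major < 8:
        return "sdpa"
    if major >= 12 and not trust_flash_on_blackwell:
        return "sdpa"
    return "flash_attention_2"


if __name__ == "__main__":
    # (12, 0) stands in for torch.cuda.get_device_capability() on this card.
    print(pick_attn_implementation((12, 0), flash_attn_installed()))
```

Defaulting to SDPA on 12.x mirrors the advice above: FlashAttention 2 is opt-in on Blackwell rather than something the code should pick up just because the package happens to be importable.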
Cannot import name or version errors during install
The official install is an editable install of the repo against the CUDA 12.8 PyTorch build: pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]". If you reuse a system Python with an older torch, the audio encoder will fail to load. Keep MOSS-Audio inside its own conda environment, exactly as the README prescribes.
SGLang refuses to start with a CuDNN error
The MOSS-Audio SGLang fork recommends pinning nvidia-cudnn-cu12==9.16.0.29 for the default torch==2.9.1+cu128 runtime — install it explicitly after pip install -e "python[all]", per the official usage guide. Without that pin, SGLang can fail its CuDNN compatibility check on CUDA 12.8 builds and the server never comes up.
Out of memory on long audio
The weight footprint (~10.4 GB BF16) is fixed, but activations and the SGLang KV cache scale with clip length and max_tokens. If a long clip runs out of memory on 16 GB, shorten the audio segment, reduce max_tokens, or lower SGLang's static memory fraction (the --mem-fraction-static server flag in upstream SGLang). No OOMs on consumer cards have been widely reported yet; report problems via the submission form.
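One way to shorten the audio segment is to split a long clip into fixed-length pieces and process them one at a time. The sketch below plans the cuts and builds the matching ffmpeg commands; the helper names and the 60-second default are illustrative, and the frame estimate simply multiplies duration by the 12.5 Hz encoder frame rate quoted at the top of this guide (how frames map to LLM tokens is not assumed).

```python
import math

ENCODER_FRAME_RATE_HZ = 12.5  # encoder frame rate quoted in this guide


def encoder_frames(duration_s: float) -> int:
    """Approximate number of encoder frames for a clip of this length."""
    return math.ceil(duration_s * ENCODER_FRAME_RATE_HZ)


def plan_segments(duration_s: float, segment_s: float = 60.0):
    """Split [0, duration_s) into consecutive (start, length) pairs in seconds."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        segments.append((start, min(segment_s, duration_s - start)))
        start += segment_s
    return segments


def ffmpeg_command(src: str, dst: str, start_s: float, length_s: float):
    """Argument list for cutting one segment; pass it to subprocess.run."""
    return ["ffmpeg", "-y", "-ss", f"{start_s:.3f}", "-t", f"{length_s:.3f}",
            "-i", src, dst]


if __name__ == "__main__":
    for index, (start, length) in enumerate(plan_segments(150.0)):
        cmd = ffmpeg_command("long.wav", f"part_{index:03d}.wav", start, length)
        print(encoder_frames(length), "frames:", " ".join(cmd))
```

Fixed cuts can land in the middle of a word, and any timestamps the model returns are relative to the segment, so add each segment's start offset back when stitching results together.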