MOSS-Audio 4B-Instruct on RTX 5060 Ti: local audio understanding in ~12 GB

tts · intermediate · 12GB+ VRAM · May 19, 2026
prerequisites
  • NVIDIA RTX 5060 Ti (16 GB VRAM) or any consumer GPU with 12 GB+ VRAM
  • Python 3.12 (per the official conda recipe), CUDA 12.8-capable PyTorch
  • ffmpeg 7 (for audio decoding)

What You'll Build

A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on an RTX 5060 Ti. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B backbone with a from-scratch MOSS-Audio-Encoder, per the model card.

Hardware data: RTX 5060 Ti (16 GB VRAM) · ~10.4 GB BF16 weights from the HF Files listing · See benchmark data

ℹ️ Not a TTS model. MOSS-Audio understands audio — it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. The OpenMOSS team ships speech synthesis separately as MOSS-TTS; don't conflate the two. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is audio-to-text.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | 12 GB VRAM consumer card | RTX 5060 Ti (16 GB) |
| RAM | 16 GB | - |
| Storage | ~11 GB for BF16 weights + cache | - |
| Software | Python 3.12, PyTorch (CUDA 12.8 build), ffmpeg 7, huggingface-hub CLI | - |

The BF16 weight shards on the official HF model page total 10.44 GB across three .safetensors files. That is the VRAM floor: the model needs that much just to load, before activations and KV cache. In practice, plan on ~12 GB for short clips, with usage rising as audio length grows. The 5060 Ti's 16 GB leaves comfortable headroom.
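The budget above is simple subtraction, sketched here for convenience. The three constants come from the text; `headroom_gb` is a hypothetical helper name, and the result ignores whatever VRAM the desktop or other processes already hold.

```python
# Figures quoted in the text above; adjust for your own card.
WEIGHTS_GB = 10.44            # BF16 shards, per the HF Files listing
CARD_VRAM_GB = 16.0           # RTX 5060 Ti
PLANNED_SHORT_CLIP_GB = 12.0  # rough planning figure for short clips


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """Return the VRAM left over once `used_gb` is allocated."""
    return round(total_gb - used_gb, 2)


after_weights = headroom_gb(CARD_VRAM_GB, WEIGHTS_GB)
after_short_clip = headroom_gb(CARD_VRAM_GB, PLANNED_SHORT_CLIP_GB)

print(f"Headroom after loading weights: {after_weights} GB")
print(f"Headroom at the ~12 GB planning figure: {after_short_clip} GB")
```

The first number is what activations and KV cache have to fit into; the second is what remains for longer clips once the short-clip planning figure is spent.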

Installation

The official setup, from the model card and GitHub repo, uses a dedicated conda environment with the CUDA 12.8 PyTorch wheel.

1. Clone the repo and create the environment

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
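Before moving on, a quick sanity check of the environment can save a failed run later. The sketch below uses only the standard library; `check_prerequisites` is a hypothetical helper, and treating any Python at or above 3.12 as acceptable is an assumption (the official recipe pins 3.12).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 12)) -> dict:
    """Report whether the interpreter and ffmpeg look usable."""
    return {
        # Assumption: versions newer than the pinned 3.12 are tolerated here.
        "python_ok": sys.version_info[:2] >= min_python,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        # shutil.which returns None when ffmpeg is not on PATH.
        "ffmpeg_path": shutil.which("ffmpeg"),
    }


report = check_prerequisites()
for key, value in report.items():
    print(f"{key}: {value}")
```

Run it inside the activated moss-audio environment; a `None` for `ffmpeg_path` suggests the conda-forge ffmpeg install did not land in this environment.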

2. (Optional) Install FlashAttention 2

Recommended on Ada and Blackwell GPUs (the 5060 Ti is a Blackwell part) for lower memory pressure on long audio:

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

3. Download the 4B-Instruct weights

huggingface-cli download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
  --local-dir ./weights/MOSS-Audio-4B-Instruct

This pulls the three BF16 safetensors shards (~10.4 GB total) plus tokenizer and processor config.
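One way to confirm the download completed is to sum the shard sizes on disk and compare against the ~10.4 GB figure. This is a minimal sketch: `safetensors_total_gb` is a hypothetical helper, and it assumes the shards sit directly in the `--local-dir` folder and that sizes are counted in decimal gigabytes (10^9 bytes).

```python
from pathlib import Path


def safetensors_total_gb(weights_dir: str) -> float:
    """Sum the sizes of all *.safetensors files in weights_dir, in GB (10^9 bytes)."""
    total_bytes = sum(
        shard.stat().st_size for shard in Path(weights_dir).glob("*.safetensors")
    )
    return total_bytes / 1e9


# Example call (path taken from the download command above):
# print(safetensors_total_gb("./weights/MOSS-Audio-4B-Instruct"))
```

A total well below 10.4 GB usually means an interrupted download; rerunning the same `huggingface-cli download` command should resume it.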

Running

The fastest path is the bundled infer.py script for one-shot inference:

# Edit MODEL_PATH and AUDIO_PATH inside infer.py, then:
python infer.py

The default prompt is "Describe this audio." and the script also covers transcription and audio QA — change the prompt to switch tasks.

For an interactive UI:

python app.py

For batched / server-side serving with the patched SGLang fork (per the official usage guide):

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio-4B-Instruct --trust-remote-code

The SGLang server exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts audio via audio_url parts inside the chat message:

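import requests

# Assumes the SGLang server started above is listening on localhost:30000.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    # The audio part and the text prompt travel in the same user message.
                    {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},
                    {"type": "text", "text": "Transcribe and summarise this clip."},
                ],
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.0,  # deterministic decoding for transcription
    },
    timeout=300,  # long clips can take a while; adjust to taste
)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError below

print(resp.json()["choices"][0]["message"]["content"])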
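If you send several prompts against the same endpoint, it can help to factor the message shape into a small builder. The sketch below mirrors the payload shown above; `build_audio_request` is a hypothetical helper, and the second prompt string is only an illustration, not wording taken from the model card.

```python
def build_audio_request(audio_url: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions payload with one audio part and one text part."""
    return {
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


# "Describe this audio." is the default prompt quoted earlier;
# the second prompt is a made-up example of switching tasks.
caption_payload = build_audio_request("/path/to/audio.wav", "Describe this audio.")
qa_payload = build_audio_request(
    "/path/to/audio.wav", "How many speakers are in this clip?", max_tokens=256
)
```

Either payload can be passed as the `json=` argument of the `requests.post` call above.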

Results

  • VRAM usage: No vendor benchmark on a consumer GPU exists yet. The concrete lower bound is the 10.44 GB BF16 weight footprint from the HF Files page; a third-party analysis estimates "the 4B variants will fit on a single consumer-grade GPU with 16-24GB VRAM" (SoftTechHub coverage of the MOSS-Audio release). Both numbers are consistent with running cleanly on the 5060 Ti's 16 GB budget for short-to-medium clips; long-form audio + large max_tokens will shrink that headroom.
  • Speed: Not quoted — no source benchmarks MOSS-Audio on a comparable consumer GPU. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
  • Quality notes: The MOSS-Audio team reports timestamp ASR scores of 76.96 AAS on AISHELL-1 and 358.13 AAS on LibriSpeech for the 4B-Instruct, materially better than Qwen3-Omni's 833.66 / 646.95 on the same metric (lower is better), per the model card. The 4B-Thinking sibling at the same parameter count averages 68.37 vs the Instruct's 64.04 across the team's reasoning suite — pick Thinking if you need chain-of-thought audio QA at the cost of more tokens per response.
  • License: Apache-2.0.

For the full benchmark data once community submissions land, see /check/moss-audio/rtx-5060-ti.

Troubleshooting

"Cannot import name" or version errors during install

The official setup is an editable install of the repo against the CUDA 12.8 PyTorch wheel index: pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]". If you reuse a system Python with an older torch, the audio encoder will fail to load. Keep MOSS-Audio inside its own conda environment, exactly as the README prescribes.
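To see which torch build the active environment actually picked up, a short check like the following may help. It is a sketch, not part of the official tooling: `torch_build_info` and `is_cuda_12_8` are hypothetical helpers, and the check relies on `torch.version.cuda`, which is `None` on CPU-only builds.

```python
import importlib


def torch_build_info():
    """Return (torch_version, cuda_version), or None if torch is not importable."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    # torch.version.cuda is None when torch was built without CUDA support.
    return torch.__version__, torch.version.cuda


def is_cuda_12_8(cuda_version) -> bool:
    """True when the reported CUDA version string is a 12.8 build."""
    return cuda_version is not None and cuda_version.startswith("12.8")


info = torch_build_info()
if info is None:
    print("torch is not installed in this environment")
else:
    print(f"torch {info[0]}, CUDA build: {info[1]}, 12.8 build: {is_cuda_12_8(info[1])}")
```

If this reports a different CUDA version or no torch at all, you are probably not inside the moss-audio environment.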

SGLang refuses to start with a CuDNN error

The MOSS-Audio SGLang fork pins nvidia-cudnn-cu12==9.16.0.29 — install it explicitly after pip install -e "python[all]", per the official usage guide. Without that pin, SGLang fails its CuDNN compatibility check on CUDA 12.8 builds and the server never comes up.
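Whether the pin took effect can be checked from Python with the standard library's package metadata. This is a minimal sketch: `installed_version` and `cudnn_pin_ok` are hypothetical helpers, and the pinned version string is the one quoted above.

```python
from importlib import metadata

PINNED_CUDNN = "9.16.0.29"  # pin quoted in the text above


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cudnn_pin_ok(found, pinned: str = PINNED_CUDNN) -> bool:
    """True only when the installed version matches the pin exactly."""
    return found == pinned


found = installed_version("nvidia-cudnn-cu12")
print(f"nvidia-cudnn-cu12: {found} (pin satisfied: {cudnn_pin_ok(found)})")
```

A `None` or a mismatched version here suggests another install step replaced or skipped the pinned wheel; reinstalling it last restores the pin.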

Out of memory on long audio

The weight footprint (~10.4 GB BF16) is fixed, but activations and the SGLang KV cache scale with clip length and max_tokens. If a long clip OOMs on 16 GB, shorten the audio segment, reduce max_tokens, or pass a smaller --mem-fraction-static to SGLang. No widely-reported OOMs on consumer cards exist yet; report problems via the submission form.
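Shortening the audio can be done by cutting a long recording into fixed-length chunks with ffmpeg before sending each one. The sketch below only plans the cuts and builds the commands: `chunk_bounds` and `ffmpeg_cut_command` are hypothetical helpers, the 30-second chunk length is an arbitrary assumption, and chunking this way loses context across chunk boundaries, so any timestamps the model returns are relative to each chunk's start.

```python
def chunk_bounds(duration_s: float, chunk_s: float):
    """Split [0, duration_s) into consecutive (start, end) windows of at most chunk_s."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def ffmpeg_cut_command(src: str, dst: str, start: float, end: float):
    """Build an ffmpeg argument list that extracts [start, end) from src into dst."""
    return ["ffmpeg", "-y", "-ss", str(start), "-t", str(end - start), "-i", src, dst]


# Plan cuts for a hypothetical 70-second recording in 30-second chunks.
for index, (start, end) in enumerate(chunk_bounds(70, 30)):
    print(" ".join(ffmpeg_cut_command("long.wav", f"chunk_{index:03d}.wav", start, end)))
```

Each printed command can be run as-is or passed to `subprocess.run`; add each chunk's start offset back to any returned timestamps to recover positions in the original file.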