MOSS-Audio 4B-Instruct on RTX 4060 Ti 16GB: local audio understanding in ~12 GB

What You'll Build

A local audio-understanding pipeline running OpenMOSS MOSS-Audio 4B-Instruct on an RTX 4060 Ti 16GB. The model handles speech transcription (with word- and sentence-level timestamps), environmental sound understanding, music understanding, audio captioning, audio QA, and time-aware reasoning in English and Chinese — built on a Qwen3-4B backbone with a from-scratch MOSS-Audio-Encoder, per the model card and the upstream GitHub README.

Hardware data: RTX 4060 Ti 16GB · ~10.4 GB BF16 weights from the HF Files listing · See benchmark data

ℹ️ Not a TTS model. MOSS-Audio understands audio — it does not synthesize speech. Inputs are audio (speech, sound, music); outputs are text. The OpenMOSS team ships speech synthesis separately as MOSS-TTS; don't conflate the two. MOSS-Audio sits in our tts vertical because the wider catalogue groups audio-input-or-output models there; the model card is explicit that this is audio-to-text.

Requirements

Component	Minimum	Tested
GPU	12 GB VRAM consumer card	RTX 4060 Ti 16GB
RAM	16 GB	—
Storage	~11 GB for BF16 weights + cache	—
Software	Python 3.12, PyTorch (CUDA 12.8 build), ffmpeg 7, `huggingface-hub` CLI	—

The BF16 weight shards on the official HF model page total 10.44 GB across three .safetensors files (4.99 + 4.67 + 0.78 GB, per the HF tree API). That is the floor: the VRAM needed just to load the model, before activations and KV cache. In practice, plan on ~12 GB for short clips, with usage rising as audio length grows. The 4060 Ti's 16 GB leaves comfortable headroom and sits at the lower bound of the "16–24GB consumer-grade VRAM" envelope quoted in the SoftTechHub release coverage.
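The budget arithmetic above can be reproduced in a few lines. This is a minimal sketch using only figures quoted in this guide (the three shard sizes, the exact byte total from the HF tree API, and the ~12 GB planning figure); the variable and function names are illustrative.

```python
# Shard sizes in decimal GB, as quoted from the HF file listing.
SHARD_SIZES_GB = [4.99, 4.67, 0.78]
# Exact byte total for the three shards, as reported by the HF tree API.
TOTAL_BYTES = 10_445_749_784

GPU_VRAM_GB = 16.0      # nominal capacity of the RTX 4060 Ti 16GB
PLANNED_PEAK_GB = 12.0  # this guide's planning figure for short clips


def bytes_to_gb(n: int) -> float:
    """Decimal gigabytes (1 GB = 10**9 bytes)."""
    return n / 1e9


def bytes_to_gib(n: int) -> float:
    """Binary gibibytes (1 GiB = 2**30 bytes)."""
    return n / 2**30


weights_gb = sum(SHARD_SIZES_GB)

print(f"weights, sum of shards: {weights_gb:.2f} GB")
print(f"weights, exact bytes:   {bytes_to_gb(TOTAL_BYTES):.2f} GB "
      f"= {bytes_to_gib(TOTAL_BYTES):.2f} GiB")
print(f"headroom after weights: {GPU_VRAM_GB - weights_gb:.2f} GB")
print(f"headroom at ~12 GB:     {GPU_VRAM_GB - PLANNED_PEAK_GB:.2f} GB")
```

The shard sum and the exact byte total differ in the second decimal only because each shard size is rounded before adding. Tools such as nvidia-smi report memory in MiB, so the GiB figure is the one to compare against on-device readings.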

Installation

The official setup, from the model card and GitHub repo, uses a dedicated conda environment with the CUDA 12.8 PyTorch wheel. The 4060 Ti is an Ada Lovelace (sm_89) card, and cu128 wheels include sm_89 kernels, so the install needs no wheel selection beyond what the upstream README prescribes.

1. Clone the repo and create the environment

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"

2. (Optional) Install FlashAttention 2

FlashAttention 2 ships mature sm_89 kernels for Ada cards like the 4060 Ti; the upstream README lists this as an optional install for lower memory pressure on long audio:

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
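Before moving on, it can help to confirm what actually landed in the environment. The following is a minimal sketch, not part of the upstream repo: it checks whether the torch and flash_attn modules are importable and, if torch is present, prints the CUDA status and compute capability. The helper names are illustrative.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def report(modules=("torch", "flash_attn")) -> dict:
    """Print and return a found/missing status for each module name."""
    status = {name: module_available(name) for name in modules}
    for name, ok in status.items():
        print(f"{name:12s} {'found' if ok else 'missing'}")
    return status


if __name__ == "__main__":
    status = report()
    if status["torch"]:
        import torch

        print("CUDA available:", torch.cuda.is_available())
        if torch.cuda.is_available():
            # Ada Lovelace cards such as the 4060 Ti should report (8, 9).
            print("Compute capability:", torch.cuda.get_device_capability(0))
```

A missing flash_attn is not an error, since that install is optional.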

3. Download the 4B-Instruct weights

hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct \
  --local-dir ./weights/MOSS-Audio-4B-Instruct

This pulls the three BF16 safetensors shards (~10.4 GB total) plus tokenizer and processor config.
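A quick way to check that the download completed is to count the shards and sum their sizes. This sketch assumes the --local-dir path used above; the shard_report helper is illustrative, and it only inspects file sizes, not checksums.

```python
from pathlib import Path


def shard_report(weights_dir):
    """Return (shard_count, total_bytes) for *.safetensors files in weights_dir."""
    shards = sorted(Path(weights_dir).glob("*.safetensors"))
    total = sum(p.stat().st_size for p in shards)
    return len(shards), total


if __name__ == "__main__":
    count, total = shard_report("./weights/MOSS-Audio-4B-Instruct")
    print(f"{count} shard(s), {total / 1e9:.2f} GB")
    # The HF listing shows three shards totalling roughly 10.4 GB.
    if count != 3:
        print("Shard count differs from the listing; the download may be incomplete.")
```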

Running

The fastest path is the bundled infer.py script for one-shot inference:

# Edit MODEL_PATH and AUDIO_PATH inside infer.py, then:
python infer.py

The default prompt is "Describe this audio." and the script also covers transcription and audio QA — change the prompt to switch tasks.

For an interactive UI:

python app.py

For batched / server-side serving with the patched SGLang fork (per the official usage guide):

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio-4B-Instruct --trust-remote-code

The SGLang server exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts audio via audio_url parts inside the chat message:

import requests

resp = requests.post("http://localhost:30000/v1/chat/completions", json={
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": "/path/to/audio.wav"}},
                {"type": "text", "text": "Transcribe and summarise this clip."},
            ],
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.0,
})

resp.raise_for_status()  # surface HTTP errors instead of a KeyError on the JSON body
print(resp.json()["choices"][0]["message"]["content"])
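Since switching tasks comes down to changing the text prompt, the request can be wrapped in a small helper. This is a sketch under assumptions: the payload shape mirrors the request above, the "describe" and "transcribe_summarise" prompts are the ones quoted in this guide, and the "qa" prompt, the function names, and the 300-second timeout are hypothetical choices for illustration.

```python
def build_payload(audio_path, prompt, max_tokens=1024, temperature=0.0):
    """Build a chat-completions body with one audio part and one text part."""
    return {
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_path}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


PROMPTS = {
    "describe": "Describe this audio.",
    "transcribe_summarise": "Transcribe and summarise this clip.",
    "qa": "How many speakers are in this recording?",  # hypothetical example
}


def ask(audio_path, prompt, base_url="http://localhost:30000", timeout=300):
    """POST one request to the local server and return the reply text."""
    import requests  # imported here so build_payload works without requests

    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json=build_payload(audio_path, prompt),
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

With the server running, ask("/path/to/audio.wav", PROMPTS["describe"]) would return the caption text for that clip.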

Results

VRAM usage: No vendor benchmark on a consumer GPU exists yet. The concrete lower bound is the 10.44 GB BF16 weight footprint computed from the HF Files page (three safetensors shards totalling 10,445,749,784 bytes via the HF tree API); a third-party analysis estimates "the 4B variants will fit on a single consumer-grade GPU with 16-24GB VRAM" (SoftTechHub coverage of the MOSS-Audio release). Both numbers are consistent with running cleanly on the 4060 Ti 16GB's 16 GB budget for short-to-medium clips; long-form audio + large max_tokens will shrink that headroom. This is a derived envelope, not a measured peak — once community submissions land, the live number will appear on the check page below.
Speed: Not quoted — no source benchmarks MOSS-Audio on a comparable consumer GPU. Submit results via /contribute once you've run it; live numbers will appear on the check page below.
Quality notes: The MOSS-Audio team reports an overall CER of 11.58 for the 4B-Instruct across their 12-dimension ASR benchmark suite, with best-in-class scores on non-speech vocalizations (4.01) and code-switching (10.11), per the model card. The 4B-Thinking sibling at the same parameter count adds chain-of-thought reasoning for harder audio QA at the cost of more tokens per response — pick Thinking if you need step-by-step audio reasoning rather than direct instruction following.
License: Apache-2.0.
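For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below implements that conventional definition as a percentage. It assumes the reported 11.58 is a percentage, and it is not the MOSS-Audio team's evaluation code; their text normalization is not described here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Lower is better, and because insertions count as errors, the value can exceed 100 on badly hallucinated output.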

For the full benchmark data once community submissions land, see /check/moss-audio/rtx-4060-ti-16gb.

Troubleshooting

`Cannot import name` or version errors during install

The official install is an editable install of the repo that pulls PyTorch from the CUDA 12.8 wheel index: pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]". If you reuse a system Python with an older torch, the audio encoder will fail to load. Keep MOSS-Audio inside its own conda environment exactly as the README prescribes.

SGLang refuses to start with a CuDNN error

The MOSS-Audio SGLang fork pins nvidia-cudnn-cu12==9.16.0.29 — install it explicitly after pip install -e "python[all]", per the official usage guide. Without that pin, SGLang fails its CuDNN compatibility check on CUDA 12.8 builds and the server never comes up.
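To see whether the pin is in place before launching the server, the installed package version can be read from the environment's metadata. A minimal sketch using the standard library; the helper names are illustrative, and the pinned version string is the one quoted above.

```python
from importlib import metadata

PINNED_CUDNN = "9.16.0.29"  # version pinned in the usage guide, as quoted above


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cudnn_pin_ok(found, pinned=PINNED_CUDNN) -> bool:
    """True only when the installed version matches the pin exactly."""
    return found == pinned


if __name__ == "__main__":
    found = installed_version("nvidia-cudnn-cu12")
    if cudnn_pin_ok(found):
        print(f"nvidia-cudnn-cu12 {found}: matches the pin")
    else:
        print(f"nvidia-cudnn-cu12 is {found!r}, expected {PINNED_CUDNN}; "
              f"run: pip install nvidia-cudnn-cu12=={PINNED_CUDNN}")
```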

Out of memory on long audio

The weight footprint (~10.4 GB BF16) is fixed, but activations and the SGLang KV cache scale with clip length and max_tokens. If a long clip OOMs on 16 GB, shorten the audio segment, reduce max_tokens, or pass a smaller --mem-fraction-static to SGLang, which reserves less memory up front and leaves more room for activations. No widely-reported OOMs on consumer cards exist yet; report problems via the submission form.
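One way to shorten the audio segment is to cut a long recording into fixed-length chunks and send them one at a time. The sketch below computes chunk boundaries and builds an ffmpeg command for each. The 30-second chunk length and the helper names are hypothetical, not values recommended by the MOSS-Audio team; -ss, -t, -i and -y are standard ffmpeg options.

```python
def segment_bounds(duration_s: float, chunk_s: float = 30.0):
    """Split [0, duration_s] into consecutive (start, end) windows of chunk_s."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    bounds, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def ffmpeg_cut_command(src: str, dst: str, start_s: float, end_s: float):
    """Build an ffmpeg argument list that extracts [start_s, end_s] from src."""
    return [
        "ffmpeg", "-y",                    # overwrite dst if it already exists
        "-ss", f"{start_s:.3f}",           # seek to the chunk start
        "-t", f"{end_s - start_s:.3f}",    # keep this many seconds
        "-i", src,
        dst,
    ]


if __name__ == "__main__":
    for i, (start, end) in enumerate(segment_bounds(65.0)):
        print(" ".join(ffmpeg_cut_command("long.wav", f"chunk_{i:03d}.wav", start, end)))
```

Each list can be passed to subprocess.run. If you request timestamps, add each chunk's start offset back to the model's output, since every chunk is transcribed as if it began at zero.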