MiniMind-O on RTX 4080: 0.1B Omni Model with Headroom to Spare

What You'll Build

A from-scratch, end-to-end omni model running locally on the RTX 4080: MiniMind-O accepts text, speech, and image inputs and produces text plus streaming 24 kHz speech output. The dense minimind-3o variant has ~113M trainable backbone parameters and is intended as an educational, research-grade reference implementation rather than a production omni model. With a derived ~4 GB envelope on a 16 GB card, this is the easiest entry point for multimodal experimentation in our catalogue: roughly 12 GB of VRAM stays free for a second model, longer context, or batch experimentation.

Hardware data: RTX 4080 16GB · ~113M backbone parameters, bfloat16 inference · See benchmark data

Note: As of this writing the backend has no measured benchmarks for this pair (/check/minimind-o/rtx-4080 returns verdict: unknown). The min_vram_gb: 4 figure is a conservative envelope derived from the official weight sizes and architecture (see the Results section); revisit /check/minimind-o/rtx-4080 once a community benchmark lands.

ℹ️ GitHub-hosted, with a HuggingFace weights mirror. The canonical project lives on GitHub; the from_pretrained-style weights are mirrored at jingyaogong/minimind-3o. Load the model with the repo's own eval_omni.py script — not transformers.pipeline (see Troubleshooting).

Requirements

Component	Minimum	Tested
GPU	4GB VRAM CUDA GPU	RTX 4080 16GB — pair not yet benchmarked, see /check/
RAM	8GB	—
Storage	~3 GB (code + all sub-model weights)	—
Software	Python 3.10, PyTorch with CUDA, Git	—

The RTX 4080 is an Ada Lovelace (sm_89) card. The default pip install of PyTorch already ships CUDA kernels that run on sm_89, so no special wheel index, no cu128 build, and no attention-kernel workarounds are required. This recipe transfers directly from the previously published RTX 4060 Ti 16GB recipe (same Ada sm_89 architecture, same 16 GB VRAM tier), so the install and the VRAM floor are identical; only the spare-headroom story changes with the card.
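
To confirm that an installed PyTorch build can actually run on an sm_89 card, a check along the following lines may help. It is a minimal sketch: parse_sm and runs_on are hypothetical helper names, and the rule they encode is CUDA's binary compatibility (a kernel built for sm_8x runs on a device with the same major version and an equal or higher minor version). PTX JIT fallback is ignored, so a False result is a prompt to investigate rather than proof of incompatibility.

```python
import re


def parse_sm(tag):
    """Turn 'sm_86' into (8, 6); return None for entries such as 'compute_90'."""
    match = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def runs_on(device_cc, arch_list):
    """True if a compiled arch has the device's major version and a minor
    version no newer than the device's (CUDA binary compatibility)."""
    major, minor = device_cc
    compiled = [cc for cc in map(parse_sm, arch_list) if cc is not None]
    return any(m == major and n <= minor for m, n in compiled)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cc = torch.cuda.get_device_capability(0)
            print("device:", torch.cuda.get_device_name(0), "capability:", cc)
            print("kernels usable:", runs_on(cc, torch.cuda.get_arch_list()))
        else:
            print("CUDA is not available to this PyTorch build")
```

On an RTX 4080 the reported capability should be (8, 9); the arch list itself varies by PyTorch release, which is why the check compares versions instead of looking for the literal string sm_89.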

Installation

1. Clone the official repo

git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o

2. Install Python dependencies

The repo pins Python==3.10. Create a fresh environment, then install the project requirements (the README uses the Tsinghua mirror for faster downloads in China — drop the -i flag if you're outside that region):

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
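
Because the repo pins Python 3.10, it can be worth checking the active interpreter before installing. The snippet below is an illustrative sketch, not part of the repo; matches_pin and PINNED are hypothetical names.

```python
import sys

PINNED = (3, 10)  # taken from the repo's stated Python==3.10 pin


def matches_pin(version_info, pinned=PINNED):
    """Compare only major.minor, so 3.10.4 and 3.10.14 both pass."""
    return tuple(version_info[:2]) == pinned


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "ok" if matches_pin(sys.version_info) else "differs from the 3.10 pin"
    print(f"Python {found}: {status}")
```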

3. Download the model weights

MiniMind-O is an assembly of a 113M dense LLM backbone trained from scratch plus several frozen pre-trained encoders/decoders. Pull them all with modelscope (the official downloader for this project), exactly as the README documents:

modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/mimi --local_dir ./model/mimi
modelscope download --model gongjy/campplus --local_dir ./model/campplus
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth --local_dir ./out

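SenseVoice-Small handles speech input, SigLIP2 handles image input, Mimi is the streaming 24 kHz audio codec for speech output, CAM++ provides speaker embeddings, and llm_768.pth is the 113M-parameter language backbone trained from scratch by the project author. The run command in the next section passes --weight sft_omni; the raw .pth mirror also carries an sft_omni_768.pth checkpoint (see Results), so fetch that file into ./out as well if the script cannot find the SFT weights.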
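
After the downloads finish, a quick layout check can catch a missing or half-finished artifact before the first run. This is a sketch under the assumption that the --local_dir targets above are used unchanged; missing_artifacts is a hypothetical helper, and it only checks that paths exist, not that the files are complete.

```python
from pathlib import Path

# Paths copied from the --local_dir arguments of the download commands.
EXPECTED_DIRS = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
]
EXPECTED_FILES = ["out/llm_768.pth"]


def missing_artifacts(repo_root="."):
    """Return the expected paths that are absent (or empty) under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # An empty directory usually means the download did not finish.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    for rel in EXPECTED_FILES:
        if not (root / rel).is_file():
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_artifacts(".")
    print("all artifacts present" if not gaps else f"missing: {gaps}")
```

Run it from the minimind-o directory; an empty result means every directory and the backbone checkpoint are where the download commands put them.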

Running

Run the omni inference script bundled with the repo:

python eval_omni.py --load_from model --weight sft_omni

This loads the SFT-tuned omni checkpoint and starts an interactive CLI session that accepts text, speech, or image inputs and replies in text plus 24 kHz audio. If you prefer the from_pretrained-format weights from the HuggingFace mirror, download the jingyaogong/minimind-3o model directory and point the same script at it. This is the README's documented transformers-format path; a generic pipeline("text-generation") call does not apply because the model is a custom architecture. Install git-lfs before cloning so that the weight files are fetched rather than left as LFS pointer files:

git clone https://huggingface.co/jingyaogong/minimind-3o
python eval_omni.py --load_from minimind-3o
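
If you want to script either invocation, for example from a test harness, the two forms can be wrapped as shown here. This is only a sketch: eval_command and launch are hypothetical helpers, and the only flags used are the two that appear in the commands above.

```python
import subprocess
import sys


def eval_command(load_from, weight=None):
    """Build the argv for eval_omni.py from the two flags shown above."""
    argv = [sys.executable, "eval_omni.py", "--load_from", load_from]
    if weight is not None:
        argv += ["--weight", weight]
    return argv


NATIVE = eval_command("model", weight="sft_omni")  # raw .pth checkpoint
MIRROR = eval_command("minimind-3o")               # HuggingFace-format folder


def launch(argv, repo_dir="."):
    """Run the script from the repo root; raises CalledProcessError on failure."""
    return subprocess.run(argv, cwd=repo_dir, check=True)
```

Calling launch(NATIVE, repo_dir="minimind-o") starts the same interactive session as the first command; the session still reads from the terminal, so this suits wrappers more than unattended jobs.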

A browser WebUI is also available; after copying a transformers-format model folder into ./scripts/, run cd scripts && python web_demo_omni.py.

Results

Speed: Not measured on an RTX 4080 by any cited source. The only hardware the official GitHub README names is an RTX 3090 (24GB), where the mini-dataset SFT training run takes about 2 hours — a training figure on a different-generation card (Ampere sm_86), not an RTX 4080 inference benchmark, so we do not transfer it as a speed number. Report a measured figure via /contribute.
VRAM usage: Approximate envelope ~4 GB peak, derived from the official artifact sizes. The HuggingFace mirror jingyaogong/minimind-3o ships a single pytorch_model.bin of ~216 MB for the 113M backbone; the raw .pth mirror jingyaogong/minimind-3o-pytorch carries the llm_768.pth / sft_omni_768.pth checkpoints. The frozen encoder/decoder bundles (SenseVoice-Small, SigLIP2, Mimi, CAM++) load alongside at modest sizes, and the backbone uses the 768-hidden-size MiniMind architecture per config.json — so KV cache and decode buffers stay well under 1 GB for short conversations. 4 GB is a conservative ceiling. This is a derived envelope, not a measured peak; see /check/minimind-o/rtx-4080 for live data once a benchmark lands. On a 16 GB RTX 4080 this leaves roughly 12 GB free.
Quality notes: This is a from-scratch ~113M-parameter omni model, deliberately small. The README frames it as the smallest complete open Omni implementation, useful as a reference for understanding the full Thinker–Talker pipeline end to end — not for comparison against frontier omni models like Qwen3-Omni or GPT-4o. The repo also ships a 0.3B MoE variant (minimind-3o-moe, ~0.3B total / ~0.1B active) if you want to test a slightly larger configuration; even then the 16 GB card has comfortable headroom.
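
The derived figures above can be reproduced with a few lines of arithmetic. The sketch below assumes 2 bytes per parameter (bfloat16) and a rounded 113M parameter count. The KV-cache shape is an assumption made loudly: the 16-layer depth and the 4096-token session are hypothetical values chosen for illustration, and kv_dim=768 treats the full hidden size as the key/value width, which is an upper bound if the model uses grouped-query attention.

```python
MIB = 1024 ** 2


def weight_bytes(n_params, bytes_per_param=2):
    """bfloat16 stores each parameter in 2 bytes."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers, seq_len, kv_dim, bytes_per_value=2, batch=1):
    """Keys and values (the factor 2) for every layer and cached token."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value * batch


backbone = weight_bytes(113_000_000)

# ASSUMPTION: n_layers=16 and seq_len=4096 are hypothetical, not read from config.json.
cache = kv_cache_bytes(n_layers=16, seq_len=4096, kv_dim=768)

print(f"backbone weights: {backbone / MIB:.0f} MiB")
print(f"KV cache (assumed shape): {cache / MIB:.0f} MiB")
print(f"headroom on 16 GB with a 4 GB envelope: {16 - 4} GB")
```

With these inputs the backbone comes out at about 216 MiB, in line with the ~216 MB pytorch_model.bin, and the assumed cache at 192 MiB, which is consistent with the "well under 1 GB" claim for short conversations.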

For the full benchmark data, see /check/minimind-o/rtx-4080.

The Real Use Case on a 16 GB Card: Headroom

At ~113M backbone parameters, MiniMind-O needs only a small fraction of the RTX 4080's 16 GB: the dense backbone weights are a few hundred MB and the whole stack stays within the ~4 GB envelope above. The card's remaining ~12 GB is the genuinely interesting per-GPU story:

Colocate a second model. Run a 7B LLM at Q4 (~5–6 GB) alongside MiniMind-O for a local toolchain, or load a Whisper-class ASR model — both fit comfortably in the spare VRAM.
Test the 0.3B MoE variant. Swap to minimind-3o-moe for a slightly larger configuration (~0.3B total / ~0.1B active parameters); the 16 GB card barely notices.
Push context and batch. Raise --max_seq_len and batch size in the train_sft_omni.py flow, or run longer multimodal sessions, without approaching the 16 GB ceiling.
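
A rough budget for the colocation case can be tallied as follows. The numbers are the estimates quoted above (the derived 4 GB envelope and the upper end of the 5–6 GB range for a 7B model at Q4), not measurements; plan is a hypothetical helper, and real usage adds allocator and CUDA-context overhead on top.

```python
def plan(total_gb, loads):
    """Sum the per-model VRAM estimates and report what is left."""
    used = sum(loads.values())
    return {"used_gb": used, "free_gb": total_gb - used, "fits": used <= total_gb}


colocated = plan(
    16,
    {
        "minimind-o (derived envelope)": 4,
        "7B LLM at Q4 (upper estimate)": 6,
    },
)
print(colocated)
```

Under these estimates the pair uses about 10 GB and leaves about 6 GB spare, which is the margin available for longer context or larger batches.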

Troubleshooting

transformers.pipeline / AutoModel errors

MiniMind-O does not use the HuggingFace transformers.pipeline API for inference. Per the README, the key modules are implemented from scratch in native PyTorch (model/model_omni.py, model/model_minimind.py) and are only transformers-compatible at the tokenizer level. Don't try to load the model with pipeline() or AutoModel.from_pretrained — run the repo's eval_omni.py --load_from model --weight sft_omni instead. The --load_from minimind-3o form loads the HuggingFace-mirror weights through the same script.

pip install is slow or fails on the Tsinghua mirror

The README defaults to a China-region PyPI mirror. Outside China, drop the -i https://pypi.tuna.tsinghua.edu.cn/simple flag and let pip use its default index:

pip install -r requirements.txt

ModelScope is unfamiliar — can I use Hugging Face instead?

Yes. The same checkpoints are mirrored on Hugging Face: jingyaogong/minimind-3o-pytorch holds the raw .pth files and jingyaogong/minimind-3o holds the from_pretrained-style packaging. Replace the modelscope commands with huggingface-cli download jingyaogong/minimind-3o-pytorch llm_768.pth --local-dir ./out (and analogous calls for the encoders). Note that huggingface-cli spells the flag --local-dir, with a hyphen, where modelscope uses --local_dir.
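
The same download can be done from Python with the huggingface_hub library. This is a sketch under assumptions, not the README's method: it covers only the backbone checkpoint named above (the encoder repositories on Hugging Face are not listed here, so they are left out), and BACKBONE and fetch are hypothetical names.

```python
# Repo id, filename, and target directory are the ones quoted in the text above.
BACKBONE = {
    "repo_id": "jingyaogong/minimind-3o-pytorch",
    "filename": "llm_768.pth",
    "local_dir": "./out",
}


def fetch(spec=BACKBONE):
    """Download one file from the Hub and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=spec["repo_id"],
        filename=spec["filename"],
        local_dir=spec["local_dir"],
    )
```

Calling fetch() from the minimind-o directory should place llm_768.pth in ./out, the same location the modelscope command uses.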

"What does 'omni' actually mean here?"

Three input modalities (text, speech, image) and two output modalities (text, streaming 24 kHz speech). Image output is not supported despite the "omni" branding — the official README frames the goal as a model that can listen, see, think, and speak (audio and image in, text and audio out). If you need image generation as an output modality, this is not the right model.

No widely reported issues yet

The GitHub Issues tab surfaces no RTX 4080 / 16 GB failure reports as of this writing. Report problems via the submission form.