What You'll Build
A from-scratch, end-to-end omni model running locally on the RTX 4080: MiniMind-O accepts text, speech, and image inputs and produces text plus streaming 24 kHz speech output. The dense minimind-3o variant has ~113M trainable backbone parameters and is intended as an educational / research-grade reference implementation rather than a production-grade omni model. With a derived ~4 GB envelope on a 16 GB card, this is the easiest entry point for multimodal experimentation in our catalogue: most of the VRAM stays free for a second model, longer context, or batch experimentation.
Hardware data: RTX 4080 16GB · ~113M backbone parameters, bfloat16 inference · See benchmark data
Note: As of this writing the backend has no measured benchmarks for this pair (/check/minimind-o/rtx-4080 returns verdict: unknown). The min_vram_gb: 4 figure is a conservative envelope derived from the official weight sizes and architecture (see the Results section); revisit /check/minimind-o/rtx-4080 once a community benchmark lands.
ℹ️ GitHub-hosted, with a HuggingFace weights mirror. The canonical project lives on GitHub; the from_pretrained-style weights are mirrored at jingyaogong/minimind-3o. Load the model with the repo's own eval_omni.py script, not transformers.pipeline (see Troubleshooting).
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4GB VRAM CUDA GPU | RTX 4080 16GB — pair not yet benchmarked, see /check/ |
| RAM | 8GB | — |
| Storage | ~3 GB (code + all sub-model weights) | — |
| Software | Python 3.10, PyTorch with CUDA, Git | — |
The RTX 4080 is an Ada Lovelace (sm_89) card. The default pip install of PyTorch already supports sm_89 GPUs, so no special wheel index, no cu128, and no attention-kernel workarounds are required. This recipe transfers directly from the previously published RTX 4060 Ti 16GB recipe (same Ada sm_89 architecture, same 16 GB VRAM tier), so the install and the VRAM floor are identical; only the spare-headroom story changes with the card.
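To confirm that an installed PyTorch build can actually drive the card, a small check along these lines may help. It is a sketch, not part of the MiniMind-O repo: the helper names are hypothetical, and it assumes CUDA's same-major binary compatibility (for example, sm_86 kernels running on an sm_89 device), so a build counts as usable even if sm_89 is not listed by name.

```python
import re


def parse_arch(arch):
    """Turn 'sm_86' into (8, 6); return None for anything else (e.g. 'compute_90')."""
    match = re.fullmatch(r"sm_(\d+)(\d)", arch)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def build_can_run_on(capability, arch_list):
    """True if any compiled arch shares the device's major version
    and has a minor version no newer than the device's."""
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is not None and parsed[0] == major and parsed[1] <= minor:
            return True
    return False


def describe_gpu():
    """Report what PyTorch sees on device 0 as a short status string."""
    try:
        import torch  # lazy import so the helpers above work without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed but no CUDA device is visible"
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    usable = build_can_run_on(cap, torch.cuda.get_arch_list())
    return f"{name}: sm_{cap[0]}{cap[1]}, usable kernels in this build: {usable}"


if __name__ == "__main__":
    print(describe_gpu())
```

On an RTX 4080 the reported capability should be sm_89; if the last field prints False, the installed wheel is the first thing to check.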
Installation
1. Clone the official repo
git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o
2. Install Python dependencies
The repo pins Python==3.10. Create a fresh environment, then install the project requirements (the README uses the Tsinghua mirror for faster downloads in China — drop the -i flag if you're outside that region):
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
3. Download the model weights
MiniMind-O is an assembly of a 113M dense LLM backbone trained from scratch plus several frozen pre-trained encoders/decoders. Pull them all with modelscope (the official downloader for this project), exactly as the README documents:
modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/mimi --local_dir ./model/mimi
modelscope download --model gongjy/campplus --local_dir ./model/campplus
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth --local_dir ./out
SenseVoice-Small handles speech input, SigLIP2 handles image input, Mimi is the streaming 24 kHz audio codec for speech output, CAM++ provides speaker embeddings, and llm_768.pth is the 113M-parameter language backbone trained from scratch by the project author.
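Before running inference it can be worth checking that every download landed where the commands above put it. The following is a minimal sketch, not a script shipped with the repo: the function name is hypothetical, and the paths are simply the --local_dir targets from the modelscope commands.

```python
from pathlib import Path

# Targets taken from the modelscope commands above, relative to the repo root.
EXPECTED_DIRS = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
]
EXPECTED_FILES = ["out/llm_768.pth"]


def missing_artifacts(repo_root="."):
    """Return the expected paths that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # An existing but empty directory usually means an interrupted download.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_artifacts()
    print("All weights present" if not gaps else f"Missing or empty: {gaps}")
```

Run it from the minimind-o directory; it only checks for presence and does not validate file contents.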
Running
Run the omni inference script bundled with the repo:
python eval_omni.py --load_from model --weight sft_omni
This loads the SFT-tuned omni checkpoint and starts an interactive CLI session that accepts text, speech, or image inputs and replies in text + 24 kHz audio. If you prefer the from_pretrained-format weights from the HuggingFace mirror, download the jingyaogong/minimind-3o model directory and point the same script at it (this is the README's documented transformers-format path — not a generic pipeline("text-generation") call, because the model is a custom architecture):
git clone https://huggingface.co/jingyaogong/minimind-3o
python eval_omni.py --load_from minimind-3o
A browser WebUI is also available; after copying a transformers-format model folder into ./scripts/, run cd scripts && python web_demo_omni.py.
Results
- Speed: Not measured on an RTX 4080 by any cited source. The only hardware the official GitHub README names is an RTX 3090 (24GB), where the mini-dataset SFT training run takes about 2 hours — a training figure on a different-generation card (Ampere sm_86), not an RTX 4080 inference benchmark, so we do not transfer it as a speed number. Report a measured figure via /contribute.
- VRAM usage: Approximate envelope ~4 GB peak, derived from the official artifact sizes. The HuggingFace mirror jingyaogong/minimind-3o ships a single pytorch_model.bin of ~216 MB for the 113M backbone; the raw .pth mirror jingyaogong/minimind-3o-pytorch carries the llm_768.pth / sft_omni_768.pth checkpoints. The frozen encoder/decoder bundles (SenseVoice-Small, SigLIP2, Mimi, CAM++) load alongside at modest sizes, and the backbone uses the 768-hidden-size MiniMind architecture per config.json, so KV cache and decode buffers stay well under 1 GB for short conversations. 4 GB is a conservative ceiling. This is a derived envelope, not a measured peak; see /check/minimind-o/rtx-4080 for live data once a benchmark lands. On a 16 GB RTX 4080 this leaves roughly 12 GB free.
- Quality notes: This is a from-scratch ~113M-parameter omni model, deliberately small. The README frames it as the smallest complete open Omni implementation, useful as a reference for understanding the full Thinker–Talker pipeline end to end, not for comparison against frontier omni models like Qwen3-Omni or GPT-4o. The repo also ships a 0.3B MoE variant (minimind-3o-moe, ~0.3B total / ~0.1B active) if you want to test a slightly larger configuration; even then the 16 GB card has comfortable headroom.
For the full benchmark data, see /check/minimind-o/rtx-4080.
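The arithmetic behind the derived envelope can be sketched as follows. The parameter count, bfloat16 width, and 768 hidden size come from the figures above; the layer count, head layout, and sequence length in the KV-cache estimate are assumptions chosen for illustration, not values read from config.json.

```python
BYTES_PER_BF16 = 2
MIB = 2**20


def weight_bytes(n_params, bytes_per_param=BYTES_PER_BF16):
    """Memory needed to hold the weights alone."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=BYTES_PER_BF16):
    """Keys plus values (the leading 2) for every layer, head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# ~113M backbone parameters stored in bfloat16.
backbone = weight_bytes(113_000_000)
print(f"backbone weights: {backbone / MIB:.0f} MiB")

# ASSUMED layout: 8 layers, 8 KV heads x 96 dims (= 768 hidden, no KV sharing),
# 2048 cached tokens. Check config.json for the real values.
kv = kv_cache_bytes(n_layers=8, n_kv_heads=8, head_dim=96, seq_len=2048)
print(f"KV cache (assumed layout): {kv / MIB:.0f} MiB")
```

The weight figure comes out at about 216 MiB, in line with the pytorch_model.bin size quoted above, and under these assumptions the KV cache is tens of MiB rather than anything near 1 GB. Neither number accounts for the frozen encoders, activations, or the CUDA context, which is why the envelope is set far higher than the backbone alone.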
The Real Use Case on a 16 GB Card: Headroom
At ~113M backbone parameters, MiniMind-O leaves most of the RTX 4080's 16 GB unused: the dense backbone weights are a few hundred MB and the whole stack stays within the ~4 GB envelope above. The card's remaining ~12 GB is the more interesting per-GPU story:
- Colocate a second model. Run a 7B LLM at Q4 (~5–6 GB) alongside MiniMind-O for a local toolchain, or load a Whisper-class ASR model — both fit comfortably in the spare VRAM.
- Test the 0.3B MoE variant. Swap to minimind-3o-moe for slightly better quality; even at ~0.3B total parameters the 16 GB card barely notices.
- Push context and batch. Raise --max_seq_len and batch size in the train_sft_omni.py flow, or run longer multimodal sessions, without approaching the 16 GB ceiling.
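A colocation plan like the ones above can be sanity-checked with simple budgeting. This is a rough sketch: the 4 GB and 6 GB entries are the estimates quoted in this recipe, while the 1 GB reserve for CUDA context and allocator overhead is an assumption, and none of the numbers are measured peaks.

```python
def fits(budget_gb, workloads, reserve_gb=1.0):
    """Return (total_gb, free_gb, ok) for a dict of name -> estimated peak GB.

    reserve_gb is an ASSUMED allowance for CUDA context and fragmentation.
    """
    total = sum(workloads.values())
    free = budget_gb - total - reserve_gb
    return total, free, free >= 0


plan = {
    "MiniMind-O (derived envelope)": 4.0,
    "7B LLM at Q4 (upper estimate)": 6.0,
}
total, free, ok = fits(16.0, plan)
print(f"planned: {total:.1f} GB, spare after reserve: {free:.1f} GB, fits: {ok}")
```

Swapping in measured peaks from nvidia-smi once both models are loaded gives a more trustworthy answer than these estimates.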
Troubleshooting
transformers.pipeline / AutoModel errors
MiniMind-O does not use the HuggingFace transformers.pipeline API for inference. Per the README, the key modules are implemented from scratch in native PyTorch (model/model_omni.py, model/model_minimind.py) and are only transformers-compatible at the tokenizer level. Don't try to load the model with pipeline() or AutoModel.from_pretrained — run the repo's eval_omni.py --load_from model --weight sft_omni instead. The --load_from minimind-3o form loads the HuggingFace-mirror weights through the same script.
pip install is slow or fails on the Tsinghua mirror
The README defaults to a China-region PyPI mirror. Outside China, drop the -i https://pypi.tuna.tsinghua.edu.cn/simple flag and let pip use its default index:
pip install -r requirements.txt
ModelScope is unfamiliar — can I use Hugging Face instead?
Yes. The same checkpoints are mirrored on Hugging Face under jingyaogong/minimind-3o-pytorch for the raw .pth files and jingyaogong/minimind-3o for the from_pretrained-style packaging. Substitute huggingface-cli download jingyaogong/minimind-3o-pytorch llm_768.pth --local-dir ./out (and analogous calls for the encoders) for the modelscope commands.
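The same download can also be scripted from Python with the huggingface_hub library, which provides hf_hub_download. The wrapper below is a hypothetical convenience function, not something the repo provides, and it covers only the backbone checkpoint, since the encoder repositories on Hugging Face are not named here.

```python
REPO_ID = "jingyaogong/minimind-3o-pytorch"  # raw .pth mirror named above


def download_backbone(local_dir="./out", filename="llm_768.pth"):
    """Fetch one checkpoint file from the mirror and return its local path."""
    # Imported lazily so this module loads even if huggingface_hub is absent.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=local_dir)
```

Calling download_backbone() from the repo root mirrors the huggingface-cli command above; pass filename="sft_omni_768.pth" to fetch the SFT checkpoint from the same mirror.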
"What does 'omni' actually mean here?"
Three input modalities (text, speech, image) and two output modalities (text, streaming 24 kHz speech). Image output is not supported despite the "omni" branding — the official README frames the goal as a model that can listen, see, think, and speak (audio and image in, text and audio out). If you need image generation as an output modality, this is not the right model.
No widely-reported issues yet
The GitHub Issues tab surfaces no RTX 4080 / 16 GB failure reports as of this writing. Report problems via the submission form.