How much VRAM does MiniMind-O need?

About 4 GB — the minimum this recipe targets.

How hard is this setup?

Intermediate: follow the installation steps below.

MiniMind-O on RTX 4060 Ti 16GB: 0.1B Omni Model with Headroom to Spare

What You'll Build

A from-scratch, end-to-end omni model running locally on the RTX 4060 Ti 16GB: MiniMind-O accepts text, speech, and image inputs and produces text plus streaming 24 kHz speech output. The dense 0.1B variant totals ~1.7 GB of downloaded weights (the bfloat16 LLM backbone plus its pre-trained encoders and codec) and is intended as an educational, research-grade reference implementation rather than a production omni model. With a derived ~4 GB envelope on a 16 GB card, this is the easiest entry point for multimodal experimentation in our catalogue: most of the VRAM stays free for a second model, longer context, or batch experimentation.

Hardware data: RTX 4060 Ti 16GB · 113M trainable parameters, bfloat16 inference · See benchmark data

Note: As of this writing the backend has no measured benchmarks for this pair (/check/ returns verdict: unknown). The min_vram_gb: 4 figure is a conservative envelope derived from the official weight sizes and architecture (see the Results section); revisit /check/minimind-o/rtx-4060-ti-16gb once a community benchmark lands.

Requirements

Component	Minimum	Tested
GPU	CUDA GPU with 4 GB VRAM	RTX 4060 Ti 16GB (pair not yet benchmarked, see /check/)
RAM	8GB	—
Storage	~3 GB (code + all sub-model weights)	—
Software	Python 3.10, PyTorch with CUDA, Git	—

Installation

1. Clone the official repo

git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o

2. Install Python dependencies

The repo pins Python==3.10. Create a fresh environment, then install the project requirements (the README uses the Tsinghua mirror for faster downloads in China — drop the -i flag if you're outside that region):

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
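
Before moving on to the weights, a quick preflight check can confirm that the interpreter matches the pinned version and that PyTorch can see the GPU. The following is a minimal sketch, not part of the official repo: the helper names python_ok and cuda_summary are hypothetical, and the CUDA check only runs if torch is already installed in the environment.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 10)  # the repo pins Python 3.10


def python_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter's major.minor matches the pinned version."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def cuda_summary() -> str:
    """Describe the first CUDA device, or explain why none is visible."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also works before installation

    if not torch.cuda.is_available():
        return "PyTorch is installed but sees no CUDA device"
    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM"


if __name__ == "__main__":
    print("Python 3.10:", "yes" if python_ok() else "no")
    print("CUDA:", cuda_summary())
```

If the CUDA line reports no device, the usual culprit is a CPU-only PyTorch wheel rather than the driver.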

3. Download the model weights

MiniMind-O is an assembly of five components: a 113M dense LLM trained from scratch plus four pre-trained encoders/decoders. Pull them all with modelscope (the official downloader for this project):

modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/mimi --local_dir ./model/mimi
modelscope download --model gongjy/campplus --local_dir ./model/campplus
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth --local_dir ./out

The five downloads together total ~1.7 GB. SenseVoiceSmall handles speech input, SigLIP2 handles image input, Mimi is the streaming 24 kHz audio codec for speech output, CamPlus provides speaker embeddings, and llm_768.pth is the 113M-parameter language backbone trained from scratch on a small dataset by the project author.
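
Interrupted downloads are a common source of confusing load errors, so it can help to confirm that every artifact landed where the commands above put it. This is a small sketch under one assumption: that modelscope places llm_768.pth directly inside ./out and each encoder inside its --local_dir. The function name missing_artifacts is hypothetical, and the check only looks for non-empty paths, not for valid checkpoints.

```python
from pathlib import Path

# Paths taken from the --local_dir arguments of the download commands.
EXPECTED_DIRS = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
]
# Assumption: the single-file download lands directly in ./out.
EXPECTED_FILES = ["out/llm_768.pth"]


def missing_artifacts(root: Path) -> list[str]:
    """Return the expected paths that are absent or empty under root."""
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory counts as present only if it holds at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_artifacts(Path("."))
    print("all downloads present" if not gaps else f"missing: {gaps}")
```

Run it from the repository root (the minimind-o directory) so the relative paths resolve.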

Running

Run the omni inference script bundled with the repo:

python eval_omni.py --load_from model --weight sft_omni

This loads the SFT-tuned omni checkpoint and starts an interactive session that accepts text, speech, or image inputs and replies in text plus 24 kHz audio. The --weight sft_omni flag appears to correspond to the sft_omni_768.pth file listed in the minimind-3o-pytorch repository, while step 3 downloads only llm_768.pth; if the script reports a missing checkpoint, fetch that file as well. If you prefer the transformers-format weights instead of the raw .pth checkpoints, download the jingyaogong/minimind-3o model directory and point the same script at it. The README documents this path as eval_omni.py --load_from <dir>. A generic pipeline("text-generation") call is not the supported entry point, because the model is a custom MiniMindOmni architecture loaded via trust_remote_code. The transformers-format invocation is:

python eval_omni.py --load_from minimind-3o

Results

Speed: Not yet measured on a 4060 Ti 16GB by any cited source. The project's documented training run fits on a single RTX 3090 (24GB) in ~2 hours per the official GitHub README, which suggests inference on a 4060 Ti 16GB should be interactive, but a training figure is not a substitute for an empirical inference benchmark. The 4060 Ti 16GB shares the Ada Lovelace (sm_89) architecture with the previously-published RTX 4060 sibling, so kernel compatibility should be near-identical. See /check/ for live data.
VRAM usage: Approximate envelope of ~4 GB peak, derived from the official artifact sizes: llm_768.pth is 138 MB and sft_omni_768.pth is 236 MB per the HF Files listing for jingyaogong/minimind-3o-pytorch. Per config.json, the architecture has 8 hidden layers, a hidden size of 768, and a 32K maximum context in bfloat16, which keeps the KV cache and decode buffers well under 1 GB for typical short conversations. The encoder/decoder bundles (SenseVoice, SigLIP2, Mimi, CamPlus) load alongside at modest sizes, so 4 GB is a conservative ceiling. On a 16 GB 4060 Ti this leaves roughly 12 GB free, enough headroom to co-locate a second small model (e.g. a small LLM for tool-calling) or to run longer-context sessions without watching the meter.
Quality notes: This is a from-scratch 113M-parameter omni model, deliberately small. The reported average character error rate (CER) on English text-to-audio is 0.0897 for the Dense 768 configuration, per the Talker Hidden Size Ablation table in the GitHub README. Treat it as a reference for what a tiny, hobbyist-trained omni stack achieves, not as a comparison against frontier omni models like Gemini or Qwen3-Omni. The repo also ships a 0.3B MoE variant (minimind-3o-moe, 315M trainable, ~115M active) if you want to test a slightly larger configuration; even at 315M total parameters the 16 GB card still has comfortable headroom.
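
The VRAM reasoning above can be reproduced with back-of-envelope arithmetic. The sketch below rests on stated assumptions, not on measurements: batch size 1, bfloat16 (2 bytes per value), "32K" read as 32,768 tokens, and full multi-head attention where the key/value width equals the hidden size. If the model uses grouped-query attention, the real KV cache would be smaller, so treat the result as an upper bound. It also ignores activations and the encoder/decoder bundles.

```python
BYTES_PER_BF16 = 2
MIB = 1024**2


def weight_bytes(num_params: int) -> int:
    """Memory for the parameters alone, stored in bfloat16."""
    return num_params * BYTES_PER_BF16


def kv_cache_bytes(layers: int, hidden: int, tokens: int) -> int:
    """Upper-bound KV cache for one sequence.

    Two tensors (keys and values) per layer, each of shape tokens x hidden.
    Assumes the K/V width equals the hidden size (no grouped-query attention).
    """
    return 2 * layers * hidden * tokens * BYTES_PER_BF16


if __name__ == "__main__":
    # 113M parameters, 8 layers, hidden size 768: figures quoted in the text.
    print(f"LLM weights:       {weight_bytes(113_000_000) / MIB:.1f} MiB")
    print(f"KV cache @ 2K ctx: {kv_cache_bytes(8, 768, 2_048) / MIB:.0f} MiB")
    print(f"KV cache @ 32K:    {kv_cache_bytes(8, 768, 32_768) / MIB:.0f} MiB")
```

Under these assumptions the cache costs 24,576 bytes per token, which works out to 48 MiB at 2,048 tokens and 768 MiB at the full 32,768-token context, consistent with the "well under 1 GB for short conversations" estimate.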

For the full benchmark data, see /check/minimind-o/rtx-4060-ti-16gb.
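
Until a measured figure exists for this pair, one way to sanity-check the ~4 GB envelope on your own card is to poll nvidia-smi while eval_omni.py runs in another terminal. This is a rough sketch rather than a benchmark harness: the function names are hypothetical, and memory.used is a device-wide figure, so anything else using the GPU (a desktop session, a second model) is included in the reading.

```python
import subprocess
import time

# Standard nvidia-smi query flags; prints one bare MiB value per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used",
    "--format=csv,noheader,nounits",
]


def parse_used_mib(output: str) -> int:
    """Parse nvidia-smi output and return the first GPU's used memory in MiB."""
    return int(output.strip().splitlines()[0])


def poll_peak_mib(seconds: float = 60.0, interval: float = 0.5) -> int:
    """Sample memory.used repeatedly and return the highest value seen."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        peak = max(peak, parse_used_mib(result.stdout))
        time.sleep(interval)
    return peak
```

Calling poll_peak_mib() during an interactive session and subtracting an idle baseline reading gives an approximate peak for the model alone.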

Troubleshooting

`pip install` is slow or fails on the Tsinghua mirror

The README defaults to a China-region PyPI mirror. Outside China, drop the -i https://pypi.tuna.tsinghua.edu.cn/simple flag and let pip use its default index:

pip install -r requirements.txt

ModelScope is unfamiliar — can I use Hugging Face instead?

Yes. The same checkpoints are mirrored on Hugging Face under jingyaogong/minimind-3o-pytorch for the raw .pth files and jingyaogong/minimind-3o for the from_pretrained-style packaging. Substitute huggingface-cli download jingyaogong/minimind-3o-pytorch llm_768.pth --local-dir ./out (and analogous calls for the encoders, which live under the gongjy/* mirrors) for the modelscope commands.

What does "omni" actually mean here?

Three input modalities (text, speech, image) and two output modalities (text, streaming 24 kHz speech). Image output is not supported despite the "omni" branding — the official README frames the project's goal as training, from scratch, a model that can listen, see, think, and speak (能听、能看、能思考、能说), i.e. audio and image in, text and audio out. If you need image generation as an output modality, this is not the right model.

What should I do with the spare 12 GB on this card?

A 16 GB card is wildly over-provisioned for the dense 0.1B variant. Productive uses for the headroom: load the 0.3B MoE variant (minimind-3o-moe) for slightly better quality; run a second small model in parallel (e.g. a small chat LLM for routing or tool-calling); or push the context window toward the 32K maximum advertised in config.json for long multimodal sessions. None of these will stress a 16 GB envelope.

No widely-reported issues yet

The HF discussions tab for jingyaogong/minimind-3o is empty as of this writing (model published ~12 days ago). Report problems via the submission form.