
MiniMind-O on RTX 5070: 0.1B Omni Model with Headroom to Spare

multimodal · intermediate · 4GB+ VRAM · Jun 4, 2026
Prerequisites
  • NVIDIA RTX 5070 (12GB VRAM) or any CUDA GPU with at least 4GB free VRAM
  • Python 3.10
  • Git, pip, and ~3 GB free disk for code + weights
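The prerequisites above can be checked before cloning anything. The following is a minimal preflight sketch, not part of the MiniMind-O repo: the helper names (gib, python_ok, enough, free_vram_bytes) are hypothetical, the thresholds are copied from the list above, and the VRAM check only runs if a CUDA-enabled PyTorch is already installed.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 10)   # the repo pins Python 3.10
REQUIRED_DISK_GB = 3.0      # code + all sub-model weights
REQUIRED_VRAM_GB = 4.0      # derived envelope from this recipe, not a measurement


def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def python_ok(version_info=sys.version_info):
    """True if the interpreter is Python 3.10.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def enough(available_bytes, required_gb):
    """True if the available byte count covers the requirement in GiB."""
    return gib(available_bytes) >= required_gb


def free_vram_bytes():
    """Free VRAM on the current CUDA device, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free, _total = torch.cuda.mem_get_info()
    return free


if __name__ == "__main__":
    print("Python 3.10:", python_ok())
    print("Disk >= 3 GB:", enough(shutil.disk_usage(".").free, REQUIRED_DISK_GB))
    vram = free_vram_bytes()
    if vram is None:
        print("VRAM >= 4 GB: unknown (no CUDA-enabled PyTorch found)")
    else:
        print("VRAM >= 4 GB:", enough(vram, REQUIRED_VRAM_GB))
```

Run it from the directory where you plan to clone the repo so the disk check looks at the right volume.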

What You'll Build

A from-scratch, end-to-end omni model running locally on the RTX 5070. MiniMind-O accepts text, speech, and image inputs and produces text plus streaming 24 kHz speech output. The dense 0.1B variant ships as ~374 MB of bfloat16 PyTorch checkpoints: the 137.7 MB llm_768.pth backbone plus the 236.2 MB sft_omni_768.pth omni checkpoint. The bundled encoders/decoders are separate downloads on top of that. The project is Apache-2.0 licensed and intended as an educational, research-grade reference implementation rather than a production-grade omni model. With a derived ~4 GB envelope on a 12 GB card, it is one of the easiest entry points for multimodal experimentation in our catalogue: roughly two-thirds of the VRAM stays free for a second model, a longer context, or batch experimentation.

Hardware data: RTX 5070 (12GB VRAM) · 115.29M-parameter Dense variant, bfloat16 inference · See benchmark data

ℹ️ Three inputs, two outputs — no image generation. "Omni" here means text, speech, and image in and text plus streaming 24 kHz speech out. The model does not generate images. If you need image output, this is not the right model — see the catalogue's image vertical.

Note: As of this writing the backend has no measured benchmarks for this pair (/check/ returns verdict: unknown). The min_vram_gb: 4 figure is a conservative envelope derived from the official weight sizes and architecture (see the Results section); revisit /check/minimind-o/rtx-5070 once a community benchmark lands.

Requirements

Component | Minimum                              | Tested
GPU       | 4GB VRAM CUDA GPU                    | RTX 5070 (12GB); pair not yet benchmarked, see /check/
RAM       | 8GB                                  | -
Storage   | ~3 GB (code + all sub-model weights) | -
Software  | Python 3.10, PyTorch with CUDA, Git  | -

Installation

1. Clone the official repo

git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o

2. Install Python dependencies

The repo pins Python==3.10. Create a fresh environment, then install the project requirements (the README uses the Tsinghua mirror for faster downloads in China — drop the -i flag if you're outside that region):

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

The RTX 5070 is a Blackwell card (GB205, compute capability 12.0, i.e. sm_120). A current pip install torch ships the cu128 wheels that include sm_120 kernels, so no special wheel selection is required; just make sure the installed PyTorch is recent enough to target Blackwell. MiniMind-O's bundled attention layer calls PyTorch's built-in F.scaled_dot_product_attention, with a manual softmax fallback when SDPA is unavailable. It does not depend on the external flash-attention package, so there is no flash_attention_2 sm_120 wheel gap to work around on this GPU, and the SDPA kernels for sm_120 come with the same cu128 PyTorch build.
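To confirm that the installed PyTorch actually targets this GPU, a short check along the following lines may help. It is a sketch under assumptions: arch_supported and report are hypothetical helper names, and the match is deliberately strict (it looks for an exact sm_XY entry and ignores any PTX forward-compatibility path). The torch calls used are torch.cuda.get_device_capability, torch.cuda.get_arch_list, and a hasattr probe for F.scaled_dot_product_attention.

```python
def arch_supported(arch_list, capability):
    """True if a (major, minor) capability has a matching sm_XY entry."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Summarize whether this PyTorch build has kernels for the local GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch cannot see a CUDA device"
    cap = torch.cuda.get_device_capability(0)  # (12, 0) is expected on an RTX 5070
    archs = torch.cuda.get_arch_list()         # architectures compiled into this build
    has_sdpa = hasattr(torch.nn.functional, "scaled_dot_product_attention")
    return (
        f"torch {torch.__version__}, capability {cap}, "
        f"kernels built: {arch_supported(archs, cap)}, SDPA available: {has_sdpa}"
    )


if __name__ == "__main__":
    print(report())
```

If "kernels built" comes back False on a 5070, the usual cause is an older wheel; reinstalling a recent PyTorch is the first thing to try.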

3. Download the model weights

MiniMind-O is an assembly of sub-models — a 113M-class dense LLM trained from scratch plus pre-trained encoders/decoders. Pull them all with modelscope (the official downloader for this project):

modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/mimi --local_dir ./model/mimi
modelscope download --model gongjy/campplus --local_dir ./model/campplus
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth sft_omni_768.pth --local_dir ./out

SenseVoiceSmall handles speech input, SigLIP2 handles image input, Mimi is the streaming 24 kHz audio codec for speech output, CamPlus provides speaker embeddings, and llm_768.pth is the dense language backbone trained from scratch by the project author. The gongjy/* slugs above are ModelScope repositories — they are the canonical download path documented in the README.
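Before moving on, it can be worth confirming that everything landed where the run script will look for it. The sketch below only inspects the local directory layout implied by the --local_dir arguments above, plus the sft_omni_768.pth file that the run command loads from ./out. The function name missing_artifacts is hypothetical, and the check does not validate file contents or sizes.

```python
from pathlib import Path

# Directories created by the download commands above.
EXPECTED_DIRS = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
]
# Checkpoints expected in ./out: the backbone, plus the omni checkpoint
# that the run command loads with --weight sft_omni.
EXPECTED_FILES = ["out/llm_768.pth", "out/sft_omni_768.pth"]


def missing_artifacts(root="."):
    """Return the expected paths that are absent (or empty directories)."""
    root = Path(root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    for rel in EXPECTED_FILES:
        if not (root / rel).is_file():
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_artifacts()
    print("all artifacts present" if not gaps else f"missing: {gaps}")
```

Run it from the minimind-o checkout; an empty result means the layout matches what this recipe expects.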

Running

Run the omni inference script bundled with the repo:

python eval_omni.py --load_from model --weight sft_omni

This loads the SFT-tuned omni checkpoint (sft_omni_768.pth) from ./out and starts an interactive session. The evaluation modality is selected with --mode: per the script's argparse help, -1=all, 0=text, 1=multi, 2=audio, 3=clone, 4=image, 5=mix (combinable, e.g. 2,5).
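As an illustration of how the combinable --mode values read, the sketch below expands a mode string into the names listed in the argparse help. This is not the script's own parsing code, which is not shown here; parse_modes and the MODES table are hypothetical and only mirror the mapping quoted above.

```python
# Mode ids and names as quoted from the script's argparse help.
MODES = {0: "text", 1: "multi", 2: "audio", 3: "clone", 4: "image", 5: "mix"}


def parse_modes(spec):
    """Expand a --mode string such as "2,5" or "-1" into mode names."""
    ids = [int(part) for part in spec.split(",") if part.strip()]
    if -1 in ids:
        return list(MODES.values())  # -1 selects every mode
    unknown = [i for i in ids if i not in MODES]
    if unknown:
        raise ValueError(f"unknown mode id(s): {unknown}")
    return [MODES[i] for i in ids]


if __name__ == "__main__":
    print(parse_modes("2,5"))  # ['audio', 'mix']
    print(parse_modes("-1"))   # all six modes
```

So --mode 2,5 would cover the audio and mix evaluations, while --mode -1 runs everything.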

If you prefer the transformers-format weights to the raw .pth checkpoints, clone the jingyaogong/minimind-3o model directory and point the same script at it. The README documents this path as eval_omni.py --load_from <dir>, not as a generic pipeline("text-generation") call: the model is a custom MiniMindOmni architecture loaded via trust_remote_code, so a stock text-generation pipeline is not the supported entry point. Install Git LFS before cloning, otherwise the weight files may arrive as small pointer stubs:

git clone https://huggingface.co/jingyaogong/minimind-3o
python eval_omni.py --load_from minimind-3o

Results

  • Speed: Not yet measured on an RTX 5070 by any cited source. The project's documented mini-training run fits on a single RTX 3090 (24GB) in about 2 hours per the official GitHub README — a training-cost data point on a different card, not an inference benchmark, and not transferable to RTX 5070 inference. We will not quote an inference number until a 5070-named measurement exists; contribute one via the submission form and track live data at /check/.
  • VRAM usage: Approximate peak envelope of ~4 GB, derived from the official artifact sizes: llm_768.pth is 137.7 MB and sft_omni_768.pth is 236.2 MB per the HF Files listing for jingyaogong/minimind-3o-pytorch. Per config.json, the architecture has 8 hidden layers, a hidden size of 768, and a 32,768-token maximum context in bfloat16, which keeps the KV cache and decode buffers well under 1 GB for typical short conversations. The encoder/decoder bundles (SenseVoice, SigLIP2, Mimi, CamPlus) load alongside at modest sizes, so 4 GB is a conservative ceiling. On a 12 GB RTX 5070 that leaves roughly 8 GB free, enough headroom to co-locate a second small model or run longer-context sessions without watching the meter.
  • Quality notes: This is a deliberately small, from-scratch 0.1B omni model (115.29M parameters for the Dense 768 configuration). The reported average character error rate (CER) on English text-to-audio is 0.0897 for the Dense 768 configuration, per the Talker hidden size ablation table in the GitHub README. Treat it as a reference for what a tiny, hobbyist-trained omni stack achieves, not as a comparison against frontier omni models like Gemini or Qwen3-Omni. The repo also ships a 0.3B-class MoE variant (minimind-3o-moe, 317.05M total / ~115.33M active per the same ablation table) if you want to test a larger configuration; even at 317M total parameters the 12 GB card still has comfortable headroom.
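The KV-cache claim in the VRAM bullet can be sanity-checked with back-of-the-envelope arithmetic. The sketch below computes an upper bound from the three config.json figures quoted above. It assumes batch size 1 and full multi-head attention, so each layer stores one key and one value vector as wide as the hidden size; grouped-query attention, if the model uses it, would shrink the result. It ignores activations and the encoder/decoder bundles, so it is not a full VRAM estimate.

```python
LAYERS = 8            # hidden layers, per config.json
HIDDEN = 768          # hidden size, per config.json
MAX_CONTEXT = 32_768  # maximum context length, per config.json
BYTES_PER_VALUE = 2   # bfloat16


def kv_cache_bytes(tokens, layers=LAYERS, kv_width=HIDDEN,
                   bytes_per_value=BYTES_PER_VALUE):
    """Upper bound: one key and one value vector per layer per token."""
    return 2 * layers * kv_width * bytes_per_value * tokens


per_token_kib = kv_cache_bytes(1) / 1024
short_chat_mib = kv_cache_bytes(2_048) / 1024**2
full_context_mib = kv_cache_bytes(MAX_CONTEXT) / 1024**2

print(f"{per_token_kib:.0f} KiB per token")                  # 24 KiB
print(f"{short_chat_mib:.0f} MiB at 2,048 tokens")           # 48 MiB
print(f"{full_context_mib:.0f} MiB at the full 32,768 tokens")  # 768 MiB

# The two checkpoint files quoted above add up to the ~374 MB headline figure.
print(round(137.7 + 236.2, 1), "MB of checkpoints")          # 373.9
```

Under these assumptions even a full-length context stays below 1 GB of KV cache, which is consistent with treating 4 GB as a conservative ceiling.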

For the full benchmark data, see /check/minimind-o/rtx-5070.

Troubleshooting

pip install is slow or fails on the Tsinghua mirror

The README defaults to a China-region PyPI mirror. Outside China, drop the -i https://pypi.tuna.tsinghua.edu.cn/simple flag and let pip use its default index:

pip install -r requirements.txt

ModelScope is unfamiliar — can I use Hugging Face instead?

Yes. The same checkpoints are mirrored on Hugging Face. The raw .pth files live under jingyaogong/minimind-3o-pytorch, and the from_pretrained-style packaging lives under jingyaogong/minimind-3o. The encoders/decoders are mirrored under the jingyaogong/ HF namespace too. Note that the gongjy/* slugs in the install commands are ModelScope usernames; on Hugging Face the same artifacts sit under jingyaogong/. So substitute:

huggingface-cli download jingyaogong/minimind-3o-pytorch llm_768.pth sft_omni_768.pth --local-dir ./out
huggingface-cli download jingyaogong/SenseVoiceSmall --local-dir ./model/SenseVoiceSmall

Make analogous calls for the other three encoders/decoders in place of the remaining modelscope commands. You can browse the full file set from the Hugging Face Collection.
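The substitution can be spelled out mechanically. The sketch below rewrites each ModelScope slug from the install commands into a huggingface-cli call. It assumes the Hugging Face repository names match the ModelScope names exactly once the namespace is swapped; that is confirmed above only for SenseVoiceSmall, so check the other three against the Collection before running the output. The hf_command helper is hypothetical.

```python
# ModelScope slug -> local directory, copied from the install commands.
ENCODERS = {
    "gongjy/SenseVoiceSmall": "./model/SenseVoiceSmall",
    "gongjy/siglip2-base-p32-256-ve": "./model/siglip2-base-p32-256-ve",
    "gongjy/mimi": "./model/mimi",
    "gongjy/campplus": "./model/campplus",
}


def hf_command(modelscope_slug, local_dir, hf_namespace="jingyaogong"):
    """Build a huggingface-cli download command for a ModelScope slug.

    Assumption: the repository name is identical on both hubs.
    """
    name = modelscope_slug.split("/", 1)[1]
    return f"huggingface-cli download {hf_namespace}/{name} --local-dir {local_dir}"


if __name__ == "__main__":
    for slug, local_dir in ENCODERS.items():
        print(hf_command(slug, local_dir))
```

The script only prints the commands; paste them into a shell once the repository names have been verified.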

"What does 'omni' actually mean here?"

Three input modalities (text, speech, image) and two output modalities (text, streaming 24 kHz speech). Image output is not supported despite the "omni" branding — the official README frames the project's goal as training, from scratch, a model that can listen, see, think, and speak (能听、能看、能思考、能说), i.e. audio and image in, text and audio out. If you need image generation as an output modality, this is not the right model.

What should I do with the spare ~8 GB on this card?

The dense 0.1B variant is over-provisioned for a 12 GB card. Productive uses for the headroom: load the 0.3B-class MoE variant (minimind-3o-moe) for slightly better quality on longer answers; run a second small model in parallel (e.g. a small chat LLM for routing or tool-calling); or push the context window toward the 32,768-token maximum advertised in config.json for long multimodal sessions. None of these will stress a 12 GB envelope.

No widely-reported issues yet

The HF discussions tab for jingyaogong/minimind-3o is empty as of this writing. Report problems via the submission form.