What You'll Build
A from-scratch, end-to-end omni model running locally on the RTX 4070: MiniMind-O accepts text, speech, and image inputs and produces text plus streaming 24 kHz speech output. The dense 0.1B variant ships as a pair of bfloat16 PyTorch checkpoints (the 137.7 MB llm_768.pth backbone plus the 236.2 MB sft_omni_768.pth omni checkpoint, alongside the bundled encoders/decoders). It is Apache-2.0 licensed and intended as an educational, research-grade reference implementation rather than a production-grade omni model. The author frames the project's goal as training, from the first line of code, a model that can "能听、能看、能思考、能说" (listen, see, think, and speak). With a derived ~4 GB envelope on a 12 GB card, this is one of the easiest entry points for multimodal experimentation in our catalogue: roughly two-thirds of the VRAM stays free for a second model, a longer context, or batch experimentation.
Hardware data: RTX 4070 (12GB VRAM) · 115.29M-parameter Dense variant, bfloat16 inference · See benchmark data
ℹ️ Three inputs, two outputs, no image generation. "Omni" here means text, speech, and image in and text plus streaming 24 kHz speech out. The model does not generate images. If you need image output, this is not the right model; see the catalogue's image vertical.
Note: As of this writing the backend has no measured benchmarks for this pair (/check/minimind-o/rtx-4070 returns verdict: unknown). The min_vram_gb: 4 figure is a conservative envelope derived from the official weight sizes and architecture (see the Results section); revisit /check/minimind-o/rtx-4070 once a community benchmark lands.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | 4GB VRAM CUDA GPU | RTX 4070 (12GB) — pair not yet benchmarked, see /check/ |
| RAM | 8GB | — |
| Storage | ~3 GB (code + all sub-model weights) | — |
| Software | Python 3.10, PyTorch with CUDA, Git | — |
The RTX 4070 is an Ada Lovelace card (AD104, compute capability sm_89). The default pip install torch already ships sm_89 CUDA kernels — no special wheel index, no cu128, and no attention-kernel workarounds are required. MiniMind-O's attention layer calls PyTorch's built-in F.scaled_dot_product_attention (gated on hasattr(torch.nn.functional, 'scaled_dot_product_attention') in model_minimind.py), so it does not depend on the external flash-attention package and there is no flash_attention_2 toggle to reason about on this GPU.
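To confirm the points above on your own machine, a minimal preflight sketch is shown here. It is not part of the repo: `describe_torch_env` is a hypothetical helper name, and it only uses standard PyTorch calls (`torch.cuda.is_available`, `torch.cuda.get_device_capability`) plus the same `hasattr` check the guide attributes to model_minimind.py. It degrades gracefully if PyTorch is not installed yet.

```python
import importlib.util


def describe_torch_env():
    """Summarize the local PyTorch install; also works when torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch
    import torch.nn.functional as F

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Same hasattr gate the guide describes for the attention layer.
        "has_sdpa": hasattr(F, "scaled_dot_product_attention"),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # An RTX 4070 is expected to report (8, 9), i.e. sm_89.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `has_sdpa` is False, the installed PyTorch predates the built-in attention kernel and is worth upgrading before running the model.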
Installation
1. Clone the official repo
git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o
2. Install Python dependencies
The repo pins Python==3.10. Create a fresh environment, then install the project requirements (the README uses the Tsinghua mirror for faster downloads in China — drop the -i flag if you're outside that region):
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
3. Download the model weights
MiniMind-O is an assembly of sub-models — a 113M-class dense LLM trained from scratch plus pre-trained encoders/decoders. Pull them all with modelscope (the official downloader for this project), exactly as the README documents:
modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/mimi --local_dir ./model/mimi
modelscope download --model gongjy/campplus --local_dir ./model/campplus
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth --local_dir ./out
SenseVoiceSmall handles speech input, SigLIP2 handles image input, Mimi is the streaming 24 kHz audio codec for speech output, CAM++ provides speaker embeddings, and llm_768.pth is the dense language backbone trained from scratch by the project author. The gongjy/* slugs above are ModelScope repositories — they are the canonical download path documented in the README. (Prefer Hugging Face? See the Troubleshooting section for the huggingface-cli equivalents.)
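Before moving on, it can help to verify that every download landed where the commands above put it. The following is a small sketch, not a script from the repo: `missing_artifacts` is a hypothetical helper, the directory list is copied from the `--local_dir` arguments above, and the inclusion of sft_omni_768.pth is an assumption based on the Running step, which loads that file from ./out.

```python
from pathlib import Path

# Directories taken from the --local_dir arguments in the download commands.
EXPECTED_DIRS = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
]
# llm_768.pth comes from the download step; sft_omni_768.pth is listed because
# the Running step loads it from ./out (assumption: it must be fetched as well).
EXPECTED_FILES = ["out/llm_768.pth", "out/sft_omni_768.pth"]


def missing_artifacts(repo_root="."):
    """Return the relative paths that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # An empty directory usually means the download was interrupted.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    for rel in EXPECTED_FILES:
        if not (root / rel).is_file():
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_artifacts(".")
    print("all artifacts present" if not problems else f"missing: {problems}")
```

Run it from the minimind-o checkout; an empty result means the layout matches what the guide expects.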
Running
Run the omni inference script bundled with the repo:
python eval_omni.py --load_from model --weight sft_omni
This loads the SFT-tuned omni checkpoint (sft_omni_768.pth) from ./out and starts an interactive session. Step 3 above fetches only llm_768.pth, so make sure sft_omni_768.pth is also present in ./out before running. The evaluation modality is selected with --mode: per the script's argparse help, -1=all, 0=text, 1=multi, 2=audio, 3=clone, 4=image, 5=mix (combinable, e.g. 2,5).
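To make the `--mode` codes easier to read, here is an illustrative sketch that maps a mode string to the names from the argparse help quoted above. It is not the script's own parser: `parse_modes` is a hypothetical helper, and treating `-1` as "every other mode" is an assumption about what "all" means.

```python
# Mode codes as listed in the argparse help quoted in this guide.
MODE_NAMES = {
    -1: "all",
    0: "text",
    1: "multi",
    2: "audio",
    3: "clone",
    4: "image",
    5: "mix",
}


def parse_modes(spec):
    """Turn a --mode string such as "2,5" into a list of mode names."""
    modes = []
    for token in spec.split(","):
        code = int(token.strip())
        if code not in MODE_NAMES:
            raise ValueError(f"unknown mode code: {code}")
        if code == -1:
            # Assumption: -1 expands to every concrete mode.
            return [name for key, name in MODE_NAMES.items() if key != -1]
        modes.append(MODE_NAMES[code])
    return modes


if __name__ == "__main__":
    print(parse_modes("2,5"))
```

For example, `--mode 2,5` would correspond to the audio and mix evaluations under this reading.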
If you prefer the transformers-format weights instead of the raw .pth checkpoints, download the jingyaogong/minimind-3o model directory and point the same script at it. MiniMind-O is a custom MiniMindOmni architecture (see config.json), loaded via AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True) inside eval_omni.py — not a generic transformers.pipeline("text-generation") call, which is not the supported entry point for this model:
git clone https://huggingface.co/jingyaogong/minimind-3o
python eval_omni.py --load_from minimind-3o
Results
- Speed: Not yet measured on an RTX 4070 by any cited source. The only hardware the official GitHub README names is an RTX 3090 (24GB), where the mini-dataset SFT training run takes about 2 hours — a training figure on a different-generation card (Ampere sm_86), not an RTX 4070 inference benchmark, so we do not transfer it as a speed number. Report a measured figure via /contribute and track live data at /check/.
- VRAM usage: Approximate envelope ~4 GB peak, derived from the official artifact sizes: llm_768.pth is 137.7 MB and sft_omni_768.pth is 236.2 MB per the HF Files listing for jingyaogong/minimind-3o-pytorch. The architecture is 8 hidden layers × 768 hidden size × 32,768 max context in bfloat16 per the config.json, which keeps KV cache and decode buffers well under 1 GB for typical short conversations. The frozen encoder/decoder bundles (SenseVoice, SigLIP2, Mimi, CAM++) load alongside but at modest sizes, so 4 GB is a conservative ceiling. This is a derived envelope, not a measured peak; see /check/minimind-o/rtx-4070 for live data once a benchmark lands. On a 12 GB RTX 4070 that leaves roughly 8 GB free.
- Quality notes: This is a from-scratch 0.1B omni model (115.29M parameters for the Dense 768 configuration), deliberately small. The reported average character error rate (CER) on English text-to-audio is 0.0897 for the Dense 768 configuration, per the Talker hidden size ablation table in the GitHub README. That figure is useful as a reference for what a tiny, hobbyist-trained omni stack achieves, not for comparison against frontier omni models like Gemini or Qwen3-Omni. The repo also ships a 0.3B-class MoE variant (317.05M-A115.33M total/active per the same ablation table) if you want to test a larger configuration; even at 317M total parameters the 12 GB card still has comfortable headroom.
For the full benchmark data, see /check/minimind-o/rtx-4070.
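The KV-cache part of the VRAM envelope can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions, not a measurement: it assumes batch size 1 and that the key/value width equals the 768 hidden size (plain multi-head attention). If the model shares KV heads, the real cache is smaller, so treat the result as an upper bound. `kv_cache_bytes` is a hypothetical helper.

```python
# Architecture figures quoted in the Results section (from config.json).
NUM_LAYERS = 8
HIDDEN_SIZE = 768
MAX_CONTEXT = 32_768
BYTES_PER_VALUE = 2  # bfloat16

# Checkpoint sizes quoted from the Hugging Face file listing, in MB.
CHECKPOINT_MB = {"llm_768.pth": 137.7, "sft_omni_768.pth": 236.2}


def kv_cache_bytes(num_tokens, kv_width=HIDDEN_SIZE):
    """Upper-bound KV cache size for one sequence.

    Assumption: kv_width equals the hidden size (no grouped-query sharing).
    The leading 2 accounts for one key and one value tensor per layer.
    """
    return 2 * NUM_LAYERS * kv_width * BYTES_PER_VALUE * num_tokens


if __name__ == "__main__":
    per_token_kib = kv_cache_bytes(1) / 1024
    full_context_mib = kv_cache_bytes(MAX_CONTEXT) / 1024**2
    print(f"KV cache per token: {per_token_kib:.0f} KiB")
    print(f"KV cache at {MAX_CONTEXT} tokens: {full_context_mib:.0f} MiB")
    print(f"Checkpoint files combined: {sum(CHECKPOINT_MB.values()):.1f} MB")
```

Under these assumptions the cache costs 24 KiB per token and 768 MiB at the full 32,768-token window, which is consistent with the "well under 1 GB for typical short conversations" estimate above.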
Spending the Headroom on a 12 GB Card
At ~113M backbone parameters, MiniMind-O fits inside roughly a third of the RTX 4070's 12 GB. The remaining ~8 GB is the genuinely interesting per-GPU story:
- Test the 0.3B MoE variant. Swap to the MoE configuration (317.05M-A115.33M total/active per the ablation table). Its reported CER (0.0900) is essentially level with the Dense 768 figure (0.0897), so treat it as a larger configuration to experiment with rather than a quality upgrade; even at 317M total parameters the 12 GB card barely notices the difference.
- Push context. The backbone is configured for a 32,768-token context window (config.json); longer multimodal sessions stay well within the 12 GB ceiling.
- Colocate a second model. Load a small ASR model (Whisper-class) or a quantized 7B LLM alongside MiniMind-O; the ~8 GB of spare VRAM accommodates a second lightweight model for a local toolchain.
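The colocation claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: `weight_footprint_gb` and `fits_alongside` are hypothetical helpers, the 4 GB figure is the derived envelope from this guide, and the 1.5 GB `overhead_gb` default is a placeholder assumption for the second model's KV cache and runtime buffers, not a measured value.

```python
TOTAL_VRAM_GB = 12.0          # RTX 4070
MINIMIND_O_ENVELOPE_GB = 4.0  # derived envelope from this guide, not measured


def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate size of a model's weights in decimal GB."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_alongside(num_params, bits_per_weight, overhead_gb=1.5):
    """Check whether a second model fits in the VRAM left by MiniMind-O.

    overhead_gb is an assumed allowance for cache and buffers.
    """
    headroom_gb = TOTAL_VRAM_GB - MINIMIND_O_ENVELOPE_GB
    needed_gb = weight_footprint_gb(num_params, bits_per_weight) + overhead_gb
    return needed_gb <= headroom_gb


if __name__ == "__main__":
    print("7B @ 4-bit fits:", fits_alongside(7e9, 4))
    print("7B @ 16-bit fits:", fits_alongside(7e9, 16))
```

A 7B model at 4 bits per weight needs about 3.5 GB for weights, so it fits the spare ~8 GB under these assumptions, while the same model at 16 bits (about 14 GB) does not.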
Troubleshooting
transformers.pipeline / AutoModel errors
MiniMind-O does not load through a stock transformers.pipeline("text-generation") call. It is a custom MiniMindOmni architecture; the repo's eval_omni.py loads it with AutoModelForCausalLM.from_pretrained(args.load_from, trust_remote_code=True). Run python eval_omni.py --load_from model --weight sft_omni (raw .pth weights) or python eval_omni.py --load_from minimind-3o (transformers-format mirror) instead of hand-rolling a pipeline() call.
pip install is slow or fails on the Tsinghua mirror
The README defaults to a China-region PyPI mirror. Outside China, drop the -i https://pypi.tuna.tsinghua.edu.cn/simple flag and let pip use its default index:
pip install -r requirements.txt
ModelScope is unfamiliar — can I use Hugging Face instead?
Yes. The same checkpoints are mirrored on Hugging Face. The raw .pth files live under jingyaogong/minimind-3o-pytorch, and the from_pretrained-style packaging lives under jingyaogong/minimind-3o. Note that the gongjy/* slugs in the install commands are ModelScope repositories under the gongjy account; on Hugging Face the same artifacts sit under jingyaogong/. Substitute:
huggingface-cli download jingyaogong/minimind-3o-pytorch llm_768.pth --local-dir ./out
(and analogous calls for the encoders) for the modelscope commands. You can browse the full file set in the project's Hugging Face collection.
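If you would rather script the download than use the CLI, the same fetch can be sketched with the `huggingface_hub` Python package (assumed to be installed, for example via `pip install huggingface_hub`). `fetch_backbone` is a hypothetical wrapper; only the repository and file name already quoted in this guide are used, and the encoder repositories are left out because their Hugging Face names are not given here.

```python
REPO_ID = "jingyaogong/minimind-3o-pytorch"
FILENAME = "llm_768.pth"


def fetch_backbone(local_dir="./out"):
    """Download the dense backbone checkpoint and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print("saved to:", fetch_backbone())
```

The call mirrors the `huggingface-cli download` command above, placing the file under ./out where the inference script looks for it.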
No widely reported issues yet
The GitHub Issues tab surfaces no RTX 4070 / 12 GB failure reports as of this writing. Report problems via the submission form.