
MiniMind-O on RX 7800 XT: 0.1B Omni Model on ROCm (BF16)

multimodal · intermediate · 4GB+ VRAM · Jun 19, 2026

This intermediate recipe sets up MiniMind-O on the RX 7800 XT, needing about 4 GB of VRAM.

tools
  • PyTorch
prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed
  • Python 3.10
  • Git, pip, and ~4 GB free disk for code + all sub-model weights

What You'll Build

A from-scratch, end-to-end omni model running locally on the Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack: MiniMind-O accepts text, speech, and image inputs and produces text plus streaming 24 kHz speech output. The dense minimind-3o variant has ~113M trainable backbone parameters and is an educational, research-grade reference implementation (the smallest complete open Omni model) rather than a production-grade system. With a derived ~4 GB envelope on a 16 GB card, it fits trivially: the card is heavily over-provisioned for this model, and most of the VRAM stays free.

Hardware data: RX 7800 XT (16 GB VRAM) · ~113M backbone parameters, BF16 inference · ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no flash-attn install, and no FP8/FP4 path here. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so an FP8 checkpoint would just upcast to BF16/FP16 with no memory saving — and at 16 GB you don't need it anyway. MiniMind-O is pure PyTorch (the model is implemented from scratch in model/model_omni.py and model/model_minimind.py), so its attention path is PyTorch SDPA / eager attention, not FlashAttention and not xformers. If a guide tells you to pip install flash-attn or pick a cu12x wheel for this card, it's written for the wrong vendor.

⚠️ AMD support is untested — verify on first run. No AMD/ROCm benchmark or success report exists for MiniMind-O yet. We scanned the repository and found no custom CUDA op, no .cu file, no setup.py building extensions, and no cpp_extension/load_inline usage — it's plain PyTorch (SDPA + standard ops) plus standard pip packages (no flash-attn, triton, or deepspeed in requirements.txt). On that basis it should inherit a clean ROCm run the same way any pure-PyTorch model does. But "should" is not "verified": treat this as PASS-untested and confirm it works on your card before relying on it. If you hit a wall, please contribute your result.

ℹ️ GitHub-hosted, with a HuggingFace weights mirror. The canonical project lives on GitHub under the Apache-2.0 license; the from_pretrained-style weights are mirrored at jingyaogong/minimind-3o. Load the model with the repo's own eval_omni.py script — not transformers.pipeline (see Troubleshooting).

Note: As of this writing the backend has no measured benchmarks for this pair (/check/minimind-o/rx-7800-xt returns verdict: unknown). The min_vram_gb: 4 figure is a conservative envelope derived from the official weight sizes and architecture (see the Results section); revisit /check/minimind-o/rx-7800-xt once a community benchmark lands.

Requirements

Component | Minimum | Tested
GPU | 4 GB VRAM (ROCm-supported AMD card) | RX 7800 XT (16 GB); pair not yet benchmarked, see /check/
RAM | 8 GB | -
Storage | ~4 GB (code + all sub-model weights) | -
Driver | AMD ROCm on Linux | -
Software | Python 3.10, PyTorch (ROCm build), Git | -

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU on Linux — the ROCm install-on-linux system-requirements matrix lists the RX 7800 XT under the gfx1101 LLVM target — so the install uses the stable ROCm PyTorch wheel, not a CUDA wheel, and no HSA_OVERRIDE_GFX_VERSION masquerade is needed. MiniMind-O does not bundle any compiled extension, so once PyTorch is installed against ROCm there is nothing else to build for your GPU.

Installation

1. Clone the official repo

git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o

2. Install PyTorch for ROCm

MiniMind-O's requirements.txt ships with the torch lines commented out (# torch==2.6, # torchaudio==2.6, # torchvision==0.21.0) precisely so you install the build that matches your hardware. On the RX 7800 XT, that means the ROCm wheel — install it first, before the rest of the requirements, so pip doesn't pull a default CUDA build:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in the index URL moves over time (6.3 → 6.4 → 7.x). Read the current "ROCm" line in the live PyTorch "Get Started" selector and use whatever stable tag it shows for your installed ROCm version. AMD also publishes its own Radeon-recommended wheels at repo.radeon.com if you prefer the vendor build.
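
Because the tag moves, it can help to build the install command from a single variable instead of copying a URL. The following is a minimal sketch, not part of the MiniMind-O repo: rocm_pip_command is a hypothetical helper name, and the tag you pass in must be the one the live PyTorch selector shows for your ROCm install. Invoking pip through sys.executable keeps the wheel in the same interpreter you will run the model with.

```python
import re
import shlex
import sys


def rocm_pip_command(rocm_tag: str) -> list:
    """Build the pip command for the PyTorch ROCm wheel index.

    rocm_tag is the version part of the index URL, e.g. "6.3".
    Hypothetical helper: check the tag against the live PyTorch selector.
    """
    if not re.fullmatch(r"\d+\.\d+(\.\d+)?", rocm_tag):
        raise ValueError(f"expected a ROCm version like '6.3', got {rocm_tag!r}")
    index_url = f"https://download.pytorch.org/whl/rocm{rocm_tag}"
    return [
        sys.executable, "-m", "pip", "install",
        "torch", "torchvision", "torchaudio",
        "--index-url", index_url,
    ]


if __name__ == "__main__":
    # Print the command for review; run it yourself once the tag is confirmed.
    print(shlex.join(rocm_pip_command("6.3")))
```

Passing a CUDA-style tag such as "cu124" raises ValueError, which guards against pasting the wrong vendor's tag.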

3. Install the remaining Python dependencies

The repo pins Python==3.10. With ROCm PyTorch already in place, install the rest of the project requirements (the README uses the Tsinghua mirror for faster downloads in China — drop the -i flag if you're outside that region):

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

None of these packages require custom CUDA compilation — there is no flash-attn, triton, or deepspeed in the list, so there's no NVIDIA-only build step to fail on AMD.
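
Before running the install, you can check that no torch line in requirements.txt is active, since an uncommented pin would let pip replace the ROCm build. This is a rough sketch under the assumption that the file uses ordinary requirement lines; active_torch_pins is a hypothetical helper, and the non-torch package in the sample text is a placeholder rather than the repo's real dependency list.

```python
import re
from pathlib import Path

TORCH_PACKAGES = {"torch", "torchvision", "torchaudio"}


def active_torch_pins(requirements_text: str) -> list:
    """Return uncommented requirement lines that name a torch package."""
    active = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented-out line: pip ignores it
        # The package name ends at the first version/extras/marker character.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0].lower()
        if name in TORCH_PACKAGES:
            active.append(line)
    return active


if __name__ == "__main__":
    path = Path("requirements.txt")
    if path.is_file():
        pins = active_torch_pins(path.read_text(encoding="utf-8"))
        print("active torch pins:", pins or "none (safe to install)")
    else:
        # Sample mirroring the commented-out lines quoted in step 2;
        # "numpy" is a placeholder entry.
        sample = "# torch==2.6\n# torchaudio==2.6\n# torchvision==0.21.0\nnumpy\n"
        print(active_torch_pins(sample))
```

An empty result means pip install -r requirements.txt will leave the ROCm torch build untouched.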

4. Download the model weights

MiniMind-O is an assembly of a 113M dense LLM backbone trained from scratch plus several frozen pre-trained encoders/decoders. Pull them all with modelscope (the official downloader for this project), exactly as the README documents:

modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/mimi --local_dir ./model/mimi
modelscope download --model gongjy/campplus --local_dir ./model/campplus
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth --local_dir ./out

SenseVoice-Small handles speech input, SigLIP2 handles image input, Mimi is the streaming 24 kHz audio codec for speech output, CAM++ provides speaker embeddings, and llm_768.pth is the 113M-parameter language backbone trained from scratch by the project author.
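
A quick way to confirm the downloads landed where the commands above put them is to check the five target paths. The sketch below assumes it runs from the repo root and only tests for presence (a non-empty directory or a non-empty file); it does not validate checksums, and missing_artifacts is a hypothetical helper name.

```python
from pathlib import Path

# Paths taken from the --local_dir arguments of the download commands.
EXPECTED = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
    "out/llm_768.pth",
]


def missing_artifacts(repo_root: str = ".") -> list:
    """Return the expected paths that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED:
        path = root / rel
        if rel.endswith(".pth"):
            ok = path.is_file() and path.stat().st_size > 0
        else:
            ok = path.is_dir() and any(path.iterdir())
        if not ok:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_artifacts(".")
    print("all artifacts present" if not gaps else f"missing: {gaps}")
```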

Running

Run the omni inference script bundled with the repo:

python eval_omni.py --load_from model --weight sft_omni

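This loads the SFT-tuned omni checkpoint and starts an interactive CLI session that accepts text, speech, or image inputs and replies with text plus 24 kHz audio. The --weight sft_omni flag presumably resolves to the sft_omni_768.pth checkpoint named in the Results section, whereas the download step above fetches only llm_768.pth; if the script reports a missing weight file, download sft_omni_768.pth from the same minimind-3o-pytorch repository into ./out. If you prefer the from_pretrained-format weights from the HuggingFace mirror, download the jingyaogong/minimind-3o model directory and point the same script at it. This is the README's documented transformers-format path, not a generic pipeline("text-generation") call, because the model is a custom architecture: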

git clone https://huggingface.co/jingyaogong/minimind-3o
python eval_omni.py --load_from minimind-3o

A browser WebUI is also available; after copying a transformers-format model folder into ./scripts/, run cd scripts && python web_demo_omni.py.

Before the first inference, confirm that PyTorch actually sees the GPU: python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))" should print True and your card name. ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP, so the cuda calls are the correct ones on AMD. The model runs in BF16, which RDNA3 supports natively in its WMMA units, so this is the right precision floor; there is no FP8/FP4 path to consider on this card, and at 16 GB no quantization is needed.
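
The one-liner can be extended into a small smoke test that also exercises BF16. This is a sketch, not something shipped with the repo: gpu_report is a hypothetical helper, and it uses only standard PyTorch calls (torch.version.hip is a version string on ROCm builds and None otherwise). It returns early if torch is not installed or no GPU is visible.

```python
def gpu_report() -> dict:
    """Collect basic facts about the PyTorch build and the visible GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Set on ROCm builds, None on CUDA and CPU builds.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_visible": torch.cuda.is_available(),
    }
    if report["gpu_visible"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["bf16_supported"] = torch.cuda.is_bf16_supported()
        # Tiny BF16 matmul on the GPU to confirm the kernel path runs.
        a = torch.randn(256, 256, device="cuda", dtype=torch.bfloat16)
        b = torch.randn(256, 256, device="cuda", dtype=torch.bfloat16)
        c = a @ b
        torch.cuda.synchronize()
        report["bf16_matmul_ok"] = bool(torch.isfinite(c).all().item())
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

On a working setup you would expect hip_version to be set, gpu_visible to be True, and the device name to identify the RX 7800 XT; anything else points back to the install steps.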

Results

  • Speed: Not measured on an RX 7800 XT by any cited source. The only hardware the official GitHub README names is an RTX 3090 (24 GB, NVIDIA Ampere), where the mini-dataset SFT training run takes about 2 hours — a training figure on a different-vendor card, not an RX 7800 XT inference benchmark, so we do not transfer it as a speed number. The Speed figure is therefore omitted. If you've measured MiniMind-O inference latency or tokens/s on a 7800 XT, please contribute it so it lands on /check/minimind-o/rx-7800-xt.
  • VRAM usage: Approximate envelope ~4 GB peak, derived from the official artifact sizes. The HuggingFace mirror jingyaogong/minimind-3o ships a single pytorch_model.bin of ~226 MB for the 113M backbone; the raw .pth mirror jingyaogong/minimind-3o-pytorch carries the llm_768.pth / sft_omni_768.pth checkpoints. The frozen encoder/decoder bundles (SenseVoice-Small, SigLIP2, Mimi, CAM++) load alongside at modest sizes, and the backbone uses the 768-hidden-size MiniMind architecture per config.json — so KV cache and decode buffers stay well under 1 GB for short conversations. 4 GB is a conservative ceiling. This is a derived envelope, not a measured peak; see /check/minimind-o/rx-7800-xt for live data once a benchmark lands. On a 16 GB RX 7800 XT this leaves roughly 12 GB free.
  • Quality notes: This is a from-scratch ~113M-parameter omni model, deliberately small. The README frames it as the smallest complete open Omni implementation, useful as a reference for understanding the full Thinker–Talker pipeline end to end — not for comparison against frontier omni models like Qwen3-Omni or GPT-4o. The repo also ships a 0.3B MoE variant (minimind-3o-moe, ~0.3B total / ~0.1B active) if you want to test a slightly larger configuration; even then the 16 GB card has ample headroom.
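
The arithmetic behind the VRAM bullet can be checked directly. The sketch below reproduces the ~226 MB backbone figure from 113M parameters at 2 bytes each (BF16) and gives a rough KV-cache upper bound. The layer count and token count in the KV example are hypothetical placeholders, not values read from config.json, and the K/V width is taken as the full 768 hidden size, which overestimates if the model uses fewer KV heads.

```python
def weight_megabytes(params: float, bytes_per_param: int) -> float:
    """Size of the weights in MB (10^6 bytes)."""
    return params * bytes_per_param / 1e6


def kv_cache_megabytes(layers: int, tokens: int, kv_dim: int,
                       bytes_per_value: int = 2) -> float:
    """KV-cache size in MB: one K and one V tensor per layer."""
    return 2 * layers * tokens * kv_dim * bytes_per_value / 1e6


if __name__ == "__main__":
    # 113M parameters in BF16 (2 bytes each), matching the ~226 MB file.
    print(f"backbone weights: {weight_megabytes(113e6, 2):.0f} MB")

    # ASSUMPTION: 8 layers and a 2048-token session are illustrative only.
    kv = kv_cache_megabytes(layers=8, tokens=2048, kv_dim=768)
    print(f"KV cache (hypothetical config): {kv:.1f} MB")

    # Headroom on the card if the whole stack stays inside the 4 GB envelope.
    print(f"free VRAM: {16 - 4} GB")
```

Swapping in the real layer count and KV width from config.json turns the second figure into an actual estimate; the frozen encoders and codec are not modelled here.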

For the full benchmark data, see /check/minimind-o/rx-7800-xt.

The Real Use Case on a 16 GB Card: Headroom

At ~113M backbone parameters, MiniMind-O leaves the RX 7800 XT's 16 GB heavily over-provisioned: the dense backbone weights are a few hundred MB and the whole stack stays within the ~4 GB envelope above. The card's remaining ~12 GB is the more interesting per-GPU story:

  • Colocate a second model. Run an 8B LLM (Q4 quant) via Ollama or llama.cpp (the most reliable LLM paths on ROCm) alongside MiniMind-O for a local toolchain, or load a Whisper-class ASR model — both fit comfortably in the spare VRAM.
  • Test the 0.3B MoE variant. Swap to minimind-3o-moe for slightly better quality; even at ~0.3B total parameters the 16 GB card barely notices.
  • Push context and batch. Raise --max_seq_len and batch size in the train_sft_omni.py flow, or run longer multimodal sessions, without approaching the 16 GB ceiling.
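
To sanity-check a colocation plan, a simple budget helps. This sketch estimates weights-only size from parameter count and bits per weight and subtracts the loads from the card's 16 GB. The figures are lower bounds: real Q4 formats store somewhat more than 4 bits per weight, and a second model also needs KV cache and runtime buffers. Both helper names are hypothetical.

```python
def quantized_weight_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10^9 bytes) for a quantized model."""
    return params * bits_per_weight / 8 / 1e9


def remaining_gb(total_gb: float, loads: dict) -> float:
    """VRAM left after subtracting every planned load."""
    return total_gb - sum(loads.values())


if __name__ == "__main__":
    loads = {
        # Envelope from this recipe, derived rather than measured.
        "MiniMind-O (recipe envelope)": 4.0,
        # Idealized 4 bits/weight; treat as a floor, not a measurement.
        "8B LLM, 4-bit weights only": quantized_weight_gb(8e9, 4),
    }
    for name, gb in loads.items():
        print(f"{name}: {gb:.1f} GB")
    print(f"remaining on a 16 GB card: {remaining_gb(16.0, loads):.1f} GB")
```

The remaining figure is what is left for the second model's KV cache, longer multimodal sessions, or a larger batch.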

Troubleshooting

"Torch not compiled with CUDA enabled" / torch.cuda.is_available() is False

This means a CUDA build of PyTorch got installed instead of the ROCm build. Uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm6.x-style suffix, and torch.cuda.is_available() should return True (ROCm builds of PyTorch expose the GPU through the torch.cuda namespace via HIP). Because the torch lines in requirements.txt are commented out, the most common trap is letting a later pip install pull a default CUDA wheel, so install the ROCm torch first and keep it pinned.
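
The suffix check can be scripted so it is easy to rerun after any later pip install. This is a minimal sketch that classifies a version string by its local-version tag; classify_torch_build is a hypothetical helper, and a string with no suffix is reported as unknown rather than guessed.

```python
import re


def classify_torch_build(version: str) -> str:
    """Classify a torch.__version__ string as rocm, cuda, cpu, or unknown."""
    match = re.search(r"\+([a-z]+)", version)
    if match is None:
        return "unknown"  # no local-version suffix to inspect
    tag = match.group(1)
    if tag.startswith("rocm"):
        return "rocm"
    if tag.startswith("cu"):
        return "cuda"
    if tag == "cpu":
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this interpreter")
    else:
        kind = classify_torch_build(torch.__version__)
        print(f"{torch.__version__} -> {kind} build")
        if kind != "rocm":
            print("Reinstall from the ROCm wheel index before running MiniMind-O.")
```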

transformers.pipeline / AutoModel errors

MiniMind-O does not use the HuggingFace transformers.pipeline API for inference. Per the README, the key modules are implemented from scratch in native PyTorch (model/model_omni.py, model/model_minimind.py) and are only transformers-compatible at the tokenizer level. Don't try to load the model with pipeline() or AutoModel.from_pretrained — run the repo's eval_omni.py --load_from model --weight sft_omni instead. The --load_from minimind-3o form loads the HuggingFace-mirror weights through the same script.

Don't install flash-attn or xformers on this card

HF and GitHub guides written for NVIDIA frequently suggest pip install flash-attn or pip install xformers. On RDNA3 these are the wrong path: upstream FlashAttention's CK build targets CDNA/MI accelerators and commonly fails on gfx110x, and the ROCm xformers fork is limited. MiniMind-O is pure PyTorch and routes attention through PyTorch SDPA by default — that already maps to the best available kernel on ROCm. Don't add an attention package; there's nothing to install.

pip install is slow or fails on the Tsinghua mirror

The README defaults to a China-region PyPI mirror. Outside China, drop the -i https://pypi.tuna.tsinghua.edu.cn/simple flag and let pip use its default index:

pip install -r requirements.txt

ModelScope is unfamiliar — can I use Hugging Face instead?

Yes. The same checkpoints are mirrored on Hugging Face under jingyaogong/minimind-3o-pytorch for the raw .pth files and jingyaogong/minimind-3o for the from_pretrained-style packaging. Substitute huggingface-cli download jingyaogong/minimind-3o-pytorch llm_768.pth --local-dir ./out (and analogous calls for the encoders) for the modelscope commands.
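
The same substitution can be done from Python with the huggingface_hub package. The sketch below covers only the two repositories named above; it does not guess Hugging Face names for the encoder bundles. hf_hub_download and snapshot_download are real huggingface_hub functions, while PLAN and fetch are hypothetical names, and fetch defaults to a dry run so nothing is downloaded until you opt in.

```python
# Only the two Hugging Face repos named in this recipe.
PLAN = [
    {"repo_id": "jingyaogong/minimind-3o-pytorch",
     "filename": "llm_768.pth", "local_dir": "./out"},
    # filename None means: fetch the whole from_pretrained-style folder.
    {"repo_id": "jingyaogong/minimind-3o",
     "filename": None, "local_dir": "./minimind-3o"},
]


def fetch(plan=PLAN, dry_run=True) -> list:
    """Download each plan entry, or just describe it when dry_run is True."""
    if dry_run:
        return [
            f"{item['repo_id']}:{item['filename'] or '<all files>'} -> {item['local_dir']}"
            for item in plan
        ]
    # Imported lazily so the dry run works without the package installed.
    from huggingface_hub import hf_hub_download, snapshot_download

    paths = []
    for item in plan:
        if item["filename"]:
            paths.append(hf_hub_download(repo_id=item["repo_id"],
                                         filename=item["filename"],
                                         local_dir=item["local_dir"]))
        else:
            paths.append(snapshot_download(repo_id=item["repo_id"],
                                           local_dir=item["local_dir"]))
    return paths


if __name__ == "__main__":
    for line in fetch(dry_run=True):
        print(line)
```

Call fetch(dry_run=False) from the repo root to perform the downloads; the second entry produces the minimind-3o folder that eval_omni.py --load_from minimind-3o expects.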

"What does 'omni' actually mean here?"

Three input modalities (text, speech, image) and two output modalities (text, streaming 24 kHz speech). Image output is not supported despite the "omni" branding — the official README frames the goal as a model that can listen, see, think, and speak (audio and image in, text and audio out). If you need image generation as an output modality, this is not the right model.

No AMD reports yet

The GitHub Issues tab surfaces no RX 7800 XT / ROCm success or failure reports as of this writing — this recipe's AMD path is reasoned from the repo being pure PyTorch with no custom kernels, not from a confirmed run. If you get it working (or hit a snag), report it via the submission form so the next person has real data.

common questions
How much VRAM does MiniMind-O need?

About 4 GB. This is a conservative envelope derived from the official weight sizes, not a measured peak.

Which GPUs is MiniMind-O tested on?

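None on AMD yet. This recipe targets the RX 7800 XT (16 GB), but the pair has not been benchmarked; the only hardware the official README names is an RTX 3090, and that is for a training run.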

How hard is this setup?

Intermediate — follow the steps above.