KiMoDo on RX 7800 XT: Text-to-3D-Motion Generation on ROCm

specialized · intermediate · 3 GB+ VRAM · Jun 19, 2026

This intermediate recipe sets up KiMoDo on the RX 7800 XT; with the text encoder offloaded to the CPU, it needs about 3 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or equivalent ROCm-supported card
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed — this is a ROCm recipe, not CUDA
  • Python 3.10 (per official installation docs)
  • PyTorch 2.0+ built for ROCm (install from the ROCm wheel index, NOT a CUDA cuXXX wheel — see below)
  • HuggingFace account with approved access to meta-llama/Meta-Llama-3-8B-Instruct (gated)

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. You type a prompt like "a person walks forward" and get back a 3D skeletal animation — joint rotations and root translations — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). KiMoDo is a small 282M-parameter (0.3B) diffusion model that is pure PyTorch — there is no custom CUDA kernel, no compiled rasterizer, and no FlashAttention dependency anywhere in its generation or rendering path, which is exactly why it runs cleanly on AMD ROCm. This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RX 7800 XT (16 GB VRAM) · KiMoDo (282M text-to-3D-motion diffusion + Llama-3-8B text encoder) · BF16 on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA. The RX 7800 XT runs on AMD's ROCm/HIP stack — there is no cu124/cu128 wheel, no FlashAttention-2/3, no FP8/FP4 path, and nothing to compile. KiMoDo is plain PyTorch, so the only stack change versus the NVIDIA build is the PyTorch wheel: install the ROCm build instead of a CUDA one. RDNA3 has no FP8/FP4 hardware (its WMMA units accept FP16, BF16, INT8, INT4 only), so this model runs in its native BF16/FP16 precision. If a guide tells you to pick a cu12x wheel, install flash-attn, or use an FP8 checkpoint for this card, it's written for the wrong vendor.

ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name and the nvidia org, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. "SOMA" / "RP" here refer to the skeleton-and-retarget variant family, not chat or roleplay. The model is in our specialized vertical; its only language component is the frozen Llama-3-8B used (via LLM2Vec) as a text encoder — it never generates text.

Note: as of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rx-7800-xt returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | ROCm-supported AMD card with ≥3 GB free VRAM (with CPU text encoder) | RX 7800 XT (16 GB); pair not yet benchmarked, see /check/ |
| VRAM (with TEXT_ENCODER_DEVICE=cpu) | <3 GB (canonical README; maintainer-confirmed in Issue #27) | |
| RAM | 16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) | |
| Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint ~1.13 GB) | |
| Driver | AMD ROCm on Linux (Ubuntu 24.04 / 22.04 or RHEL) | |
| Software | Python 3.10, PyTorch 2.0+ built for ROCm (installation docs) | |

VRAM note: use the CPU text-encoder offload on a 16 GB card. The KiMoDo diffusion model itself is tiny (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). The default all-on-GPU footprint, however, is about 17 GB, "primarily due to the text embedding model" (canonical README), which does not fit the RX 7800 XT's 16 GB. The fix is the model's own offload setting: TEXT_ENCODER_DEVICE=cpu moves the Llama-3-8B text encoder to system RAM and drops GPU usage to under 3 GB (canonical README) at a small speed cost. This recipe therefore leads with that setting and treats about 3 GB as the minimum VRAM for this pair.
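The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a measurement. The Llama-3-8B parameter count (about 8.03 billion) and the 2 bytes per parameter (BF16) used for it are assumptions that do not come from the KiMoDo docs. The 4-bytes-per-parameter reading of the KiMoDo checkpoint is only an inference from its 1.13 GB file size. Activations, buffers, and allocator overhead are ignored.

```python
# Rough weight-size arithmetic. Parameter counts are approximate, and the
# bytes-per-parameter values are assumptions (see the paragraph above).
GB = 1e9  # decimal gigabytes


def weight_gb(params: float, bytes_per_param: int) -> float:
    """Size of a model's weights in decimal GB."""
    return params * bytes_per_param / GB


KIMODO_PARAMS = 282e6      # from the HF model card
LLAMA3_8B_PARAMS = 8.03e9  # assumption: approximate Llama-3-8B parameter count

kimodo_fp32 = weight_gb(KIMODO_PARAMS, 4)      # assumed 4 bytes per parameter
encoder_bf16 = weight_gb(LLAMA3_8B_PARAMS, 2)  # assumed 2 bytes per parameter
encoder_share = encoder_bf16 / (encoder_bf16 + kimodo_fp32)

print(f"KiMoDo weights at 4 bytes/param: {kimodo_fp32:.2f} GB")
print(f"Text encoder at 2 bytes/param:   {encoder_bf16:.2f} GB")
print(f"Encoder share of weight bytes:   {encoder_share:.0%}")
```

Under these assumptions the first figure lines up with the 1.13 GB checkpoint, and the encoder accounts for over 90% of the weight bytes, which is consistent with the README attributing the ~17 GB footprint primarily to the text embedding model.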

Installation

Installation steps below come from the canonical NVIDIA sources — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README — with one substitution for AMD: the PyTorch wheel is the ROCm build, not a CUDA one (KiMoDo's docs install a "GPU-capable version of PyTorch" without mandating CUDA, so this is a drop-in change). The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

The installation docs specify Python 3.10:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch for ROCm — first, before Kimodo

The installation docs recommend installing "the best version of PyTorch for you before installing Kimodo" (anything over PyTorch 2.0, GPU-capable). On the RX 7800 XT that means the ROCm wheel — the RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y tag in that index URL moves over time (6.2 → 6.3 → 6.4 → 7.x). Read the current "ROCm" line in the live PyTorch "Get Started" selector and use whatever stable ROCm version it shows for your installed ROCm stack. Do not substitute a cu124/cu128 CUDA wheel here — a CUDA build will not see the AMD GPU. After install, confirm: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (under HIP, ROCm masquerades as the cuda device namespace, so KiMoDo's cuda-targeted code runs unchanged). The RX 7800 XT is officially ROCm-supported as gfx1101 (ROCm system-requirements matrix), so you do not need to set HSA_OVERRIDE_GFX_VERSION — that legacy gfx-masquerade is only for cards ROCm doesn't build kernels for natively.
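The same check can be written as a short script. This is a minimal sketch: it relies on torch.version.hip and torch.version.cuda, which PyTorch sets to a version string or None depending on the build, and the helper name classify_torch_build is hypothetical.

```python
from typing import Optional


def classify_torch_build(hip_version: Optional[str],
                         cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from torch.version.hip / torch.version.cuda."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def main() -> None:
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return

    build = classify_torch_build(torch.version.hip, torch.version.cuda)
    print(f"torch {torch.__version__} -> {build} build")
    if build != "rocm":
        print("Not a ROCm build: reinstall from the ROCm wheel index.")
        return
    if torch.cuda.is_available():
        # ROCm devices are exposed through the torch.cuda namespace.
        print("GPU visible:", torch.cuda.get_device_name(0))
    else:
        print("ROCm build installed, but no GPU is visible to PyTorch.")


if __name__ == "__main__":
    main()
```

A "cuda" or "cpu" result means the wrong wheel was installed. A "rocm" result with no visible GPU points at the driver or ROCm stack, not at PyTorch.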

3. Install Kimodo

Two options, per the installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

No custom CUDA extension is built by either command — KiMoDo is pure PyTorch, so the pip install completes without a compile step. This is the key reason the model is unproblematic on ROCm: there is no custom rasterizer, no FlashAttention build, and nothing that would need a gfx1101 port.

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder (LLM2Vec) relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — the installation docs instruct you to request access to it, create a read token, and authenticate. Request access on its HF page, wait for approval, then authenticate locally:

hf auth login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated (gated: false per the HF model card); they download automatically on first use.
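Before the first run it can help to confirm that a token is where huggingface_hub will look for it. This is a minimal sketch, assuming the library's usual lookup locations: the HF_TOKEN environment variable, then a token file under HF_HOME, which defaults to ~/.cache/huggingface. The helper name find_hf_token is hypothetical. It only checks that a token exists, not that your Llama-3 access request has been approved.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(env: Mapping[str, str], home: Path) -> Optional[str]:
    """Return a description of where a Hugging Face token was found, or None."""
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    # Assumption: XDG_CACHE_HOME and HF_TOKEN_PATH overrides are not handled.
    hf_home = Path(env["HF_HOME"]) if env.get("HF_HOME") else home / ".cache" / "huggingface"
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


if __name__ == "__main__":
    source = find_hf_token(os.environ, Path.home())
    if source is None:
        print("No Hugging Face token found; run `hf auth login` first.")
    else:
        print(f"Token found via: {source}")
```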

Running

Generate a motion from a text prompt (RX 7800 XT path)

This recipe leads with TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs in system RAM and the diffusion model has the card to itself, using under 3 GB of VRAM. The all-on-GPU footprint (~17 GB) exceeds the RX 7800 XT's 16 GB, so keep the variable set:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Argument reference (from the Quick Start and CLI docs):

  • --model — checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published tiers are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid-robot retarget).
  • --duration — motion length in seconds. The README documents an adjustable duration.
  • --output — output stem name; the motion file is written as output.npz.
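To generate several motions in a row, the same command can be driven from a small script. This is a sketch that uses only the arguments listed above. The helper names and the job list are hypothetical, and each call is a separate process, so model loading is repeated per prompt. If kimodo_gen is not on PATH, the script only prints the commands it would run.

```python
import os
import shlex
import shutil
import subprocess
from typing import Dict, List, Mapping


def build_kimodo_command(prompt: str, output: str, duration: float = 5.0,
                         model: str = "Kimodo-SOMA-RP-v1.1") -> List[str]:
    """Build a kimodo_gen argv list from the arguments documented above."""
    return ["kimodo_gen", prompt, "--model", model,
            "--duration", str(duration), "--output", output]


def offload_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy an environment and force the text encoder onto the CPU."""
    env = dict(base)
    env["TEXT_ENCODER_DEVICE"] = "cpu"
    return env


if __name__ == "__main__":
    # Hypothetical job list: output stem -> prompt.
    jobs = {"walk_forward": "a person walks forward"}
    have_cli = shutil.which("kimodo_gen") is not None
    for stem, prompt in jobs.items():
        cmd = build_kimodo_command(prompt, stem)
        print("TEXT_ENCODER_DEVICE=cpu", shlex.join(cmd))
        if have_cli:
            subprocess.run(cmd, env=offload_env(os.environ), check=True)
```

Passing the environment explicitly keeps the offload setting attached to every call, so a forgotten export in the shell cannot send the text encoder back to the GPU.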

Use the env var name TEXT_ENCODER_DEVICE exactly as written. The reporter in Issue #27 hit an out-of-memory error because the text encoder was loading onto the GPU even when CPU execution was requested; the maintainer added the TEXT_ENCODER_DEVICE variable that independently controls the text-encoder device in the merged PR #32 ("Fixes to multi-prompt handling and add support for TEXT_ENCODER_DEVICE"). See Troubleshooting.

Attention runs through PyTorch SDPA — nothing to install

KiMoDo's diffusion transformer uses standard PyTorch attention. On ROCm, that resolves to PyTorch's scaled-dot-product attention (SDPA), which is the correct and only attention path to assume on RDNA3 — do not install flash-attn or xformers for this card. The model is small enough that attention is never the bottleneck.

Also export BVH for Blender/Maya/MotionBuilder

KiMoDo's CLI reference documents BVH as an export format for the SOMA skeleton. Add --bvh to also write a standard BVH file alongside the NPZ — the format you want for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
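A quick sanity check on the exported file before importing it into a DCC tool might look like the sketch below. It relies only on the standard BVH text layout (ROOT/JOINT lines in the hierarchy, then "Frames:" and "Frame Time:" in the MOTION section) and makes no assumption about KiMoDo's joint names. The script name check_bvh.py and the helper summarize_bvh are hypothetical, and the exact filename that --bvh writes is not stated here, so pass the path explicitly.

```python
import os
import sys
from typing import Dict, Optional, Union

Number = Union[int, float]


def summarize_bvh(text: str) -> Dict[str, Optional[Number]]:
    """Count joints and read frame count / frame time from BVH text."""
    joints = 0
    frames: Optional[int] = None
    frame_time: Optional[float] = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] in ("ROOT", "JOINT"):
            joints += 1  # "End Site" terminators are not counted as joints
        elif tokens[0] == "Frames:":
            frames = int(tokens[1])
        elif tokens[:2] == ["Frame", "Time:"]:
            frame_time = float(tokens[2])
            break  # only motion data rows follow; no need to scan them
    duration = None
    if frames is not None and frame_time is not None:
        duration = frames * frame_time
    return {"joints": joints, "frames": frames,
            "frame_time": frame_time, "duration_s": duration}


if __name__ == "__main__":
    if len(sys.argv) > 1 and os.path.isfile(sys.argv[1]):
        with open(sys.argv[1], encoding="utf-8") as fh:
            print(summarize_bvh(fh.read()))
    else:
        print("usage: python check_bvh.py <file.bvh>")
```

For a clip generated with --duration 5.0, the reported duration_s would be expected to land near 5 seconds. A very different value suggests the wrong file or a truncated export.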

Or launch the interactive web demo

If you installed with kimodo[all], the Quick Start docs document a web UI. Keep the CPU text-encoder flag set so the demo fits the 16 GB card:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open the local demo URL it prints. You can load and visualize a motion you generated earlier from its .npz path; see the CLI docs for the demo's load/save workflow.

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin wraps the same upstream kimodo package, so the same Llama-3 text-encoder behavior applies. On a 16 GB card you'll want the CPU text-encoder offload here too: set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py. (Make sure ComfyUI itself is running on a ROCm PyTorch build — same wheel as step 2.)

Results

  • Output (NPZ): the default output is an .npz 3D skeletal animation containing posed joints, rotation matrices, foot contacts, and trajectory information (README).
  • Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI reference). This is the format for Blender/Maya/MotionBuilder.
  • Output (G1 / SMPL-X): the G1 humanoid-robot and SMPL-X checkpoints instead support MuJoCo qpos CSV and AMASS NPZ output (README).
  • Precision: native BF16/FP16 — the only sensible path on RDNA3. There is no FP8/FP4 hardware on this card and no reason to quantize a 282M model that already fits in under 3 GB of VRAM.
  • VRAM (with TEXT_ENCODER_DEVICE=cpu): <3 GB per the README, and the path that fits the 16 GB RX 7800 XT. A Kimodo maintainer (davrempe, a repo collaborator) confirmed in Issue #27 that running TEXT_ENCODER_DEVICE=cpu kimodo_gen on a 16 GB card uses only about 3 GB of VRAM.
  • VRAM (default, all-on-GPU): ~17 GB (README) — listed for reference only; this exceeds the RX 7800 XT's 16 GB, which is why the CPU text-encoder offload above is the path to use on this card.
  • Speed: not quoted. No source reports KiMoDo generation throughput on an RX 7800 XT by name, and the backend has no benchmark for this pair yet (/check/ returns unknown). The 7800 XT's ~624 GB/s memory bandwidth should make it comfortably fast for a 282M diffusion model — but quoting a number without a 7800-XT-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rx-7800-xt; please /contribute yours.
  • Model size: 282M parameters (0.3B) for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
  • License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License (a permissive, commercial-use-permitted license).
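To see what a generated .npz actually contains, a short NumPy script can list every array with its shape and dtype. This sketch deliberately assumes nothing about KiMoDo's key names: it prints whatever the file holds. The helper name summarize_npz is hypothetical, and output.npz is just the filename from the generation example.

```python
import os
import sys
from typing import Dict, Tuple

import numpy as np


def summarize_npz(path: str) -> Dict[str, Tuple[tuple, str]]:
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    summary = {}
    with np.load(path, allow_pickle=False) as data:
        for name in data.files:
            array = data[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "output.npz"
    if os.path.isfile(path):
        for name, (shape, dtype) in summarize_npz(path).items():
            print(f"{name}: shape={shape} dtype={dtype}")
    else:
        print(f"{path} not found; generate a motion first.")
```

If the file stores any pickled object arrays, loading them with allow_pickle=False raises an error instead of executing pickled data. That is the safer default for files you did not create yourself.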

Once benchmarks are submitted for this pair, they will be listed at /check/kimodo/rx-7800-xt.

Troubleshooting

"Torch not compiled with CUDA enabled" or the GPU isn't detected

This means a CUDA build of PyTorch got installed instead of the ROCm build (KiMoDo's cuda-targeted code finds no usable device). Uninstall and reinstall against the ROCm wheel index:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

Confirm the installed build is the ROCm one: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP). Use the current stable ROCm tag from the PyTorch selector — the rocm6.3 above is illustrative and moves over time.

Out of memory at the text-encoder load step

The default pipeline loads the Llama-3 text encoder onto the GPU (~17 GB all-on-GPU), which exceeds the RX 7800 XT's 16 GB and will OOM. Set TEXT_ENCODER_DEVICE=cpu to move the text encoder to system RAM and drop GPU usage to under 3 GB — this is the required path on a 16 GB card. The maintainer davrempe (a repo collaborator) added this variable in the merged PR #32 and confirmed the ~3 GB figure in Issue #27. If you installed before PR #32 landed, reinstall from git+https://github.com/nv-tlabs/kimodo.git.

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated — only the Llama-3 dependency is.

--bvh flag is ignored / output has no .bvh file

The CLI reference documents BVH export under the SOMA skeleton. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA path is skipped, so pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. The G1 and SMPL-X checkpoints instead emit MuJoCo qpos CSV and AMASS NPZ.

Slow generation with the CPU text encoder

Expected, and a documented trade-off: the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the text-encoding step runs on the CPU, once per prompt; the diffusion sampling stays on the GPU. On a 16 GB RX 7800 XT the offload is required (the all-on-GPU ~17 GB path doesn't fit), so this cost is the price of running the model on this card.

Don't install flash-attn or xformers

Guides written for NVIDIA frequently suggest a FlashAttention or xformers install. On RDNA3 these are the wrong path: upstream FlashAttention is not built for consumer gfx1101, and KiMoDo already routes attention through PyTorch SDPA, which works on ROCm out of the box. KiMoDo needs neither — it has no FlashAttention dependency at all.

For other issues, file a report via the submission form.

common questions
How much VRAM does KiMoDo need?

About 3 GB with TEXT_ENCODER_DEVICE=cpu, which is the minimum this recipe targets; about 17 GB with everything on the GPU.

Which GPUs is KiMoDo tested on?

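This recipe targets the RX 7800 XT (16 GB), but no measured benchmark exists for the pair yet; the VRAM figures come from the Kimodo README and Issue #27.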

How hard is this setup?

Intermediate — follow the steps above.