
KiMoDo on RTX 5080: Text-to-3D-Motion Generation Guide

specialized · intermediate · 3GB+ VRAM · May 29, 2026
prerequisites
  • NVIDIA RTX 5080 (16 GB VRAM) or equivalent — see VRAM note below
  • Python 3.10 (per official installation docs)
  • PyTorch built for CUDA 12.8 or newer (the cu128 wheels ship Blackwell sm_120 kernels; stable cu128 wheels start at PyTorch 2.7, well above Kimodo's own 2.0+ floor)
  • HuggingFace account with approved access to meta-llama/Meta-Llama-3-8B-Instruct (gated)

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 5080. You'll type a prompt like "A person walks forward." and get back a 3D skeleton animation — joint rotations and root translations at 30 fps, up to 10 s long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 5080 (16 GB VRAM) · KiMoDo (282M-param diffusion model + LLM2Vec/Llama-3-8B text encoder) · See benchmark data

⚠️ Known issue (RTX 5080-specific): The default pipeline loads the Llama-3-8B text encoder on the GPU and needs ~17 GB — more than the 5080's 16 GB. A 5080 owner hit exactly this on the canonical repo (Issue #27); a Kimodo maintainer confirmed the fix below. Set TEXT_ENCODER_DEVICE=cpu and the pipeline runs in ~3 GB.

Note: As of this writing the backend has no measured benchmarks for this pair (/check/ returns unknown). The VRAM figures below come from the official Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component | Minimum | Tested
GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ (see VRAM rows) | RTX 5080 (16 GB); pair not yet benchmarked, see /check/
VRAM (default, all on GPU) | ~17 GB (official README); does not fit a 16 GB 5080 out of the box |
VRAM (with TEXT_ENCODER_DEVICE=cpu) | <3 GB (official README; maintainer-confirmed on a 5080 in Issue #27); comfortably fits a 5080 |
RAM | 16 GB (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) |
Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint ~1.13 GB) |
Software | Python 3.10, PyTorch 2.0+ (installation docs) |

VRAM caveat — this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282 M params, HF model card; checkpoint ~1.13 GB on disk), but the default pipeline loads a Llama-3-8B-based text encoder (LLM2Vec) on the same GPU. The README states it plainly: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." (official README). That ~17 GB exceeds the 5080's 16 GB. The official fix is TEXT_ENCODER_DEVICE=cpu, which moves the text encoder to system RAM and drops GPU VRAM to <3 GB — "This is slightly slower but reduces VRAM usage to <3 GB." (official README). This is the path documented below.
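The arithmetic behind those figures can be checked with a rough back-of-envelope sketch. It counts weights only and ignores activations, CUDA context, and allocator overhead. The parameter count of ~8.03e9 for Llama-3-8B and the bytes-per-parameter choices (16-bit for the encoder, 32-bit for the diffusion model) are assumptions for illustration, not figures from the Kimodo docs.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Raw weight size in GiB (weights only, no activations or overhead)."""
    return n_params * bytes_per_param / GIB


# Assumptions (not from the Kimodo docs): ~8.03e9 parameters for Llama-3-8B
# held as 16-bit floats, and 282e6 Kimodo parameters held as 32-bit floats.
encoder_gib = weights_gib(8.03e9, 2)
diffusion_gib = weights_gib(282e6, 4)

print(f"text encoder weights: {encoder_gib:.1f} GiB")
print(f"diffusion weights:    {diffusion_gib:.2f} GiB")
print(f"both fit in 16 GiB:   {encoder_gib + diffusion_gib < 16}")
```

Under these assumptions the encoder alone lands close to the ~14.7 GiB allocation reported in Issue #27, which is why moving it off the GPU frees almost the whole card. Likewise, 282e6 parameters at 4 bytes each is about 1.128e9 bytes, which is consistent with (though it does not prove) the ~1.13 GB checkpoint being stored in 32-bit precision.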

Installation

Installation steps below come from the canonical NVIDIA sources only — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. No third-party walkthrough is required because the install path is the upstream-supported one. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

Per the official installation docs:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

Kimodo's docs say to install a compatible PyTorch manually before Kimodo, optimized for your CUDA version — "Anything over PyTorch 2.0 is sufficient." (installation docs). The RTX 5080 is a Blackwell (sm_120) card, so install a wheel that ships sm_120 kernels — that means the CUDA 12.8 (cu128) build or newer. Pick the matching index URL at pytorch.org/get-started/locally:

pip install torch --index-url https://download.pytorch.org/whl/cu128

The stable cu128 (and newer) PyTorch wheels include Blackwell sm_120 kernels. Older cu121/cu124 wheels predate sm_120 and will fall back to slow JIT or fail outright on a 5080 — do not use them on this card.
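Before installing Kimodo, it can help to confirm that the installed wheel really carries kernels for the card. The sketch below uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`; the helper name `has_native_kernels` is made up for this example, and the script degrades to a message if PyTorch or CUDA is missing.

```python
def has_native_kernels(arch_list, capability):
    """True if the wheel's compiled arch list contains sm_<major><minor>."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)   # e.g. (12, 0) on Blackwell
            archs = torch.cuda.get_arch_list()          # archs compiled into the wheel
            print("torch", torch.__version__, "cuda", torch.version.cuda)
            print("device capability:", cap)
            print("compiled archs:", archs)
            print("native kernels for this GPU:", has_native_kernels(archs, cap))
        else:
            print("CUDA is not available to this PyTorch build")
```

If the last line prints `False` on a 5080, the wheel most likely came from an older CUDA index and should be reinstalled from the cu128 (or newer) index.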

3. Install Kimodo

Two options from the official installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — you must request access on its HF page and then authenticate locally, per the installation docs:

hf auth login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use.
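A quick way to see whether a token is visible to the process before the first (large) download is to look in the two usual places. This is a minimal sketch: `find_hf_token` is a hypothetical helper, the `HF_TOKEN` environment variable is a Hugging Face convention rather than something the Kimodo docs mention, and a relocated cache (for example via `HF_HOME`) is not handled. Finding a token does not prove that Llama-3 access has been approved.

```python
import os
from pathlib import Path


def find_hf_token(environ=os.environ, home=None):
    """Return a description of where a Hugging Face token was found, or None."""
    # Assumption: HF_TOKEN is honoured by the Hugging Face client libraries.
    if environ.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    home = Path(home) if home is not None else Path.home()
    # Token path quoted in the installation step above.
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


if __name__ == "__main__":
    found = find_hf_token()
    print("token found in:", found if found else "nowhere (run `hf auth login`)")
```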

Running

Generate a motion from a text prompt (RTX 5080 path)

This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on CPU and the diffusion model has the GPU to itself (official README):

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Arguments are from the Quick Start docs and the CLI docs:

  • --model: checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published checkpoints include Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid-robot retarget).
  • --duration — motion length in seconds. Cap is 10 s / 300 frames at 30 fps (HF model card).
  • --output — output stem name. The motion file is written as output.npz.

Use the canonical env var name TEXT_ENCODER_DEVICE, not KIMODO_TEXT_ENCODER_DEVICE. The 5080 owner in Issue #27 hit a CUDA OOM precisely because the KIMODO_-prefixed name did not control the encoder device; the maintainer's merged fix (PR #32) wires up the unprefixed TEXT_ENCODER_DEVICE. See Troubleshooting.
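For batch generation it may be convenient to drive the same command from Python. The sketch below only assembles the argument list and environment using the flags quoted in this recipe (`--model`, `--duration`, `--output`, `--bvh`); `build_kimodo_gen` is a hypothetical helper, and the `"SOMA" in model` test is a naming heuristic assumed here, not an official check. The actual `subprocess.run` call is left commented out.

```python
import os
import shlex
import subprocess

MAX_DURATION_S = 10.0  # cap quoted from the model card (300 frames at 30 fps)


def build_kimodo_gen(prompt, model="Kimodo-SOMA-RP-v1.1", duration=5.0,
                     output="output", bvh=False):
    """Return (argv, env) for a kimodo_gen run with the text encoder on CPU."""
    if not 0 < duration <= MAX_DURATION_S:
        raise ValueError(f"duration must be in (0, {MAX_DURATION_S}] seconds")
    # Heuristic (assumption): SOMA checkpoints carry "SOMA" in their name.
    if bvh and "SOMA" not in model:
        raise ValueError("--bvh is documented for SOMA models only")
    argv = ["kimodo_gen", prompt, "--model", model,
            "--duration", str(duration), "--output", output]
    if bvh:
        argv.append("--bvh")
    # Copy the current environment and force the unprefixed variable.
    env = dict(os.environ, TEXT_ENCODER_DEVICE="cpu")
    return argv, env


if __name__ == "__main__":
    argv, env = build_kimodo_gen("A person walks forward.",
                                 output="walk_forward", bvh=True)
    print("TEXT_ENCODER_DEVICE=cpu", shlex.join(argv))
    # subprocess.run(argv, env=env, check=True)  # uncomment to actually generate
```

Validating the duration and the BVH/model pairing up front fails fast, before the slow text-encoder load.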

Also export BVH for Blender/Maya/MotionBuilder

The SOMA-family checkpoints support direct BVH export via the --bvh flag — "also export BVH (SOMA models only)" (CLI docs) — the most useful format for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh

Or launch the interactive web demo

If you installed with kimodo[all], the Quick Start docs document a web UI:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open http://localhost:7860. To visualize a motion you generated earlier, the CLI docs say to go under "Load/Save" > "Motion", type the path of the generated output .npz file, then click "Load Motion".

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (Apache-2.0, not an NVIDIA-official wrapper), exposes KiMoDo as nodes and adds FBX (Mixamo-rigged) export on top of the upstream NPZ/BVH:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin's README documents NPZ, BVH (SOMA only), and FBX-via-Mixamo retarget as export options. It does not document the TEXT_ENCODER_DEVICE=cpu workaround — if you launch ComfyUI on a 16 GB 5080 without it you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.

Results

  • Output (NPZ): the Kimodo NPZ contains root positions ([T, 3]), global and local joint rotation matrices ([T, J, 3, 3]), foot contacts ([T, 4]), and global root heading ([T, 2]), at 30 fps, max 10 s / 300 frames (official README; HF model card).
  • Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format you want for Blender/Maya/MotionBuilder — Kimodo does not emit MP4 video; visualization happens in the demo UI.
  • Output (FBX): only available via the community ComfyUI plugin's Mixamo-retarget node (ComfyUI-Kimodo README).
  • VRAM (default, all-on-GPU): ~17 GB (official README) — exceeds the 5080's 16 GB.
  • VRAM (with TEXT_ENCODER_DEVICE=cpu): <3 GB (official README) — what this recipe uses. A Kimodo maintainer confirmed this number specifically on an RTX 5080 in Issue #27: "Running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM now."
  • Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 5080 by name, and the model card's tested-hardware list names RTX 3090 / 4090 / 5090 but not the 5080 (HF model card). The 5080 has roughly 2× the memory bandwidth of the RTX 5060 Ti 16 GB (~960 GB/s vs ~448 GB/s), so it should be faster than the smaller Blackwell cards, but quoting a number without a 5080-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-5080; please /contribute yours.
  • Model size: 282M parameters for the diffusion model itself (HF model card; ~1.13 GB checkpoint on disk); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
  • License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License and the model card states it "is ready for commercial use." — no non-commercial restriction.
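To see what a generated file actually contains, the NPZ can be opened with NumPy. This sketch deliberately does not assume any array names, since the recipe lists the contents but not the keys; it simply reports every array and its shape. `summarize_npz` and `duration_seconds` are hypothetical helper names, and the default path `output.npz` is the file from the Running section.

```python
import sys

import numpy as np

FPS = 30  # frame rate stated in the Results list


def summarize_npz(path):
    """Return {array_name: shape} for every array stored in the NPZ file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


def duration_seconds(n_frames, fps=FPS):
    """Convert a frame count T into seconds at the given frame rate."""
    return n_frames / fps


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "output.npz"
    try:
        for name, shape in summarize_npz(path).items():
            print(f"{name:30s} {shape}")
    except FileNotFoundError:
        print(f"{path} not found; generate a motion first")
```

The leading dimension of the per-frame arrays is T, so a 5-second clip should show 150 frames and the 10-second cap corresponds to 300.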

For the full benchmark data, see /check/kimodo/rtx-5080.

Troubleshooting

"CUDA out of memory" even though you set CPU mode

This is the default failure on a 16 GB 5080, and it was reported directly on the canonical repo by a 5080 owner in Issue #27: the Llama-3 text encoder loaded onto the GPU (~14.7 GiB) and left no room for the diffusion model. Two things to check:

  1. Use the unprefixed variable. Set TEXT_ENCODER_DEVICE=cpu, not KIMODO_TEXT_ENCODER_DEVICE=cpu. The prefixed name does not control the encoder device; the maintainer's merged fix (PR #32) wires up TEXT_ENCODER_DEVICE. The maintainer (davrempe, a repo collaborator) confirmed on that issue: "Running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM now." (comment).
  2. Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed.
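The first check is easy to automate. The following sketch inspects an environment mapping and reports the misnaming described above; `encoder_env_problems` is a hypothetical helper, and it only looks at variable names and values, not at which Kimodo version is installed.

```python
import os


def encoder_env_problems(environ=os.environ):
    """Return a list of human-readable problems with the encoder device setting."""
    problems = []
    device = environ.get("TEXT_ENCODER_DEVICE")
    if device is None and "KIMODO_TEXT_ENCODER_DEVICE" in environ:
        problems.append("KIMODO_TEXT_ENCODER_DEVICE is set, but the unprefixed "
                        "TEXT_ENCODER_DEVICE is the name to use")
    elif device is None:
        problems.append("TEXT_ENCODER_DEVICE is unset; the default path loads "
                        "the text encoder on the GPU")
    elif device != "cpu":
        problems.append(f"TEXT_ENCODER_DEVICE={device!r}; this recipe expects 'cpu'")
    return problems


if __name__ == "__main__":
    issues = encoder_env_problems()
    print("\n".join(issues) if issues else "TEXT_ENCODER_DEVICE=cpu is set")
```

Run it in the same shell that will launch kimodo_gen, kimodo_demo, or ComfyUI, because the variable has to be visible to that process.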

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs).

--bvh flag is ignored / output has no .bvh file

BVH export is SOMA-only (CLI docs). If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA skeleton path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. G1 / SMPL-X models instead support MuJoCo qpos CSV and AMASS NPZ output (CLI docs).

Slow generation on the CPU text encoder

Expected, and the documented trade-off ("This is slightly slower...", official README). The diffusion sampling stays on the GPU and is fast; only the one-time text-encoding step is on the CPU. If the CPU is your bottleneck, the Quick Start docs describe running kimodo_textencoder as a standalone service: on a GPU with under 16 GB of VRAM you can launch TEXT_ENCODER_DEVICE=cpu kimodo_textencoder in one terminal and kimodo_demo in another. A 5080 owner found this two-terminal pattern resolved the OOM in Issue #27.

Plugin: kimodo package not found inside ComfyUI

The ComfyUI plugin's README says "The kimodo package itself will be auto-installed on first launch if needed." (ComfyUI-Kimodo README). If the auto-install silently fails (common when ComfyUI uses an embedded Python without internet egress), install it explicitly into the same interpreter ComfyUI uses: <comfy-python> -m pip install git+https://github.com/nv-tlabs/kimodo.git.
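To find out which interpreter ComfyUI is really using and whether the package is visible to it, a short check can be run with that same interpreter. This is a sketch: the import name `kimodo` is assumed to match the pip package name, and `is_importable` is a hypothetical helper built on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
import sys


def is_importable(module_name):
    """True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # sys.executable is the path to pass as <comfy-python> in the pip command.
    print("interpreter:", sys.executable)
    # Assumption: the package installs under the import name "kimodo".
    print("kimodo importable:", is_importable("kimodo"))
```

If it prints `False`, the interpreter path on the first line is the one to use for the explicit pip install.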

For other issues, file a report via the submission form.