KiMoDo on RTX 4080 SUPER: Text-to-3D-Motion Generation Guide

specialized · intermediate · 3GB+ VRAM · Jun 2, 2026
Prerequisites
  • NVIDIA RTX 4080 SUPER (16 GB VRAM) or equivalent — see VRAM note below
  • Python 3.10 (per official installation docs)
  • PyTorch 2.0+ with CUDA support (default cu124/cu121 stable wheels include Ada sm_89 kernels)
  • HuggingFace account with approved access to meta-llama/Meta-Llama-3-8B-Instruct (gated)

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 4080 SUPER. You type a prompt like "a person walks forward" and get back a 3D skeletal animation — joint positions and rotations plus the root trajectory — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 4080 SUPER (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · See benchmark data

⚠️ Known issue (16 GB cards): The default pipeline loads the Llama-3-8B text encoder on the GPU and needs about 17 GB, more than the 4080 SUPER's 16 GB. The fix is one environment variable: set TEXT_ENCODER_DEVICE=cpu to move the text encoder to system RAM, and the pipeline then runs in under 3 GB of VRAM. Both numbers come from the canonical README (see VRAM caveat below).

ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. "SOMA" / "RP" here refer to the skeleton-and-dataset variant family (SOMA skeleton, trained on the Bones Rigplay dataset), not chat or roleplay. The model is in our specialized vertical; its only language component is the frozen Llama-3-8B used as a text encoder — it never generates text.

Note: as of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rtx-4080-super returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component | Minimum | Tested
GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ (see VRAM rows) | RTX 4080 SUPER (16 GB); pair not yet benchmarked, see /check/
VRAM (default, all on GPU) | ~17 GB (canonical README); does not fit a 16 GB 4080 SUPER out of the box |
VRAM (with TEXT_ENCODER_DEVICE=cpu) | <3 GB (canonical README; maintainer-confirmed in Issue #27); comfortably fits a 4080 SUPER |
RAM | 16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) |
Storage | ~17 GB (Llama-3-8B weights ~16 GB + KiMoDo checkpoint ~1.13 GB) |
Software | Python 3.10, PyTorch 2.0+ (installation docs) |
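
The storage requirement can be checked before the first download. The following is a minimal sketch using only the Python standard library. It assumes the weights land somewhere under your home directory (the Hugging Face cache defaults to ~/.cache/huggingface); the helper names free_gb and has_room are illustrative and not part of Kimodo.

```python
import shutil
from pathlib import Path

# Storage figure from the requirements table: ~16 GB Llama-3-8B + ~1.13 GB KiMoDo.
REQUIRED_GB = 17.0


def free_gb(path: Path) -> float:
    """Free space, in decimal gigabytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: Path, required_gb: float = REQUIRED_GB) -> bool:
    """True when `path` has at least `required_gb` of free space."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # Assumption: downloads are cached under the home directory.
    target = Path.home()
    print(f"{free_gb(target):.1f} GB free at {target}; "
          f"enough for ~{REQUIRED_GB:.0f} GB of weights: {has_room(target)}")
```

If your cache lives on another volume, pass that path instead of Path.home().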

VRAM caveat — this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). But the default pipeline loads a Llama-3-8B-based text encoder onto the same GPU. The README states verbatim that "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model" (canonical README). That exceeds the 4080 SUPER's 16 GB. The documented fix is TEXT_ENCODER_DEVICE=cpu: the README states "If you have a smaller card, set TEXT_ENCODER_DEVICE=cpu when running Kimodo commands to force text encoding to the CPU. This is slightly slower but reduces VRAM usage to <3 GB" (canonical README). That is the path documented below.
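
A rough back-of-the-envelope calculation shows why the text encoder dominates. This sketch counts raw weight bytes only (no activations or CUDA overhead), and the storage precisions are assumptions: 16-bit weights for a roughly 8.03-billion-parameter Llama-3-8B, and 32-bit floats for the 282M-parameter KiMoDo checkpoint. Those assumptions happen to line up with the ~16 GB and 1.13 GB on-disk sizes quoted above, but they are not confirmed by the README.

```python
def weights_gb(params: float, bytes_per_param: int) -> float:
    """Raw weight footprint in decimal GB (weights only, no activations)."""
    return params * bytes_per_param / 1e9


# Assumption: ~8.03e9 parameters stored as 16-bit values.
llama_gb = weights_gb(8.03e9, 2)
# Assumption: 282M parameters stored as 32-bit floats.
kimodo_gb = weights_gb(282e6, 4)

GPU_VRAM_GB = 16.0  # RTX 4080 SUPER

print(f"text encoder weights: ~{llama_gb:.1f} GB")
print(f"KiMoDo weights:       ~{kimodo_gb:.2f} GB")
print(f"both on GPU fit in 16 GB: {llama_gb + kimodo_gb <= GPU_VRAM_GB}")
print(f"KiMoDo alone fits in 16 GB: {kimodo_gb <= GPU_VRAM_GB}")
```

Under these assumptions the two sets of weights alone already exceed 16 GB, which is consistent with the README's ~17 GB figure, while the diffusion model by itself leaves ample headroom.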

Installation

Installation steps below come from the canonical NVIDIA sources — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

The installation docs specify Python 3.10:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

The installation docs recommend installing a compatible PyTorch (2.0 or newer) before installing Kimodo. The RTX 4080 SUPER is an Ada Lovelace (AD103, sm_89) card, so no special CUDA index URL is required — the default stable PyTorch wheels (cu124 / cu121) already ship sm_89 kernels. A plain install is enough:

pip install torch

This differs from Blackwell GPUs (RTX 50-series, sm_120), which do need a specific wheel. If you maintain a multi-card box and want to pin a CUDA build explicitly, pick the matching index URL at pytorch.org/get-started/locally.
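
To confirm that the installed wheel really carries kernels for your card, you can compare the GPU's compute capability against the architectures PyTorch was built for. This is a sketch, not an official check: arch_tag and wheel_supports are hypothetical helper names, and the comparison looks for an exact sm_XX match only (it ignores forward compatibility through PTX).

```python
from __future__ import annotations


def arch_tag(capability: tuple[int, int]) -> str:
    """Turn a CUDA compute capability such as (8, 9) into 'sm_89'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_supports(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """True when the wheel lists kernels for this exact architecture."""
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print(torch.cuda.get_device_name(0), arch_tag(cap))
            print("wheel has matching kernels:",
                  wheel_supports(cap, torch.cuda.get_arch_list()))
        else:
            print("PyTorch is installed but sees no CUDA device")
```

On an RTX 4080 SUPER the capability should map to sm_89, the architecture named above.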

3. Install Kimodo

Two options, per the installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — the installation docs state Kimodo requires it for text-conditioned generation. Request access on its HF page, wait for approval, then authenticate locally:

hf auth login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated (gated: false per the HF API); they download automatically on first use.
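
Before the first run it can help to confirm that a token is actually visible. The sketch below checks the two places mentioned or commonly used: the HF_TOKEN environment variable and the token file at ~/.cache/huggingface/token. It assumes the default cache location (a custom HF_HOME would move the file), and it only detects that a token exists; it cannot tell whether your Llama-3 access request has been approved. The function name find_hf_token is illustrative.

```python
from __future__ import annotations

import os
from pathlib import Path


def find_hf_token(env: dict[str, str], home: Path) -> str | None:
    """Return where a Hugging Face token was found, or None if none is visible."""
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


if __name__ == "__main__":
    source = find_hf_token(dict(os.environ), Path.home())
    if source:
        print(f"token found via: {source}")
    else:
        print("no token found; run `hf auth login` first")
```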

Running

Generate a motion from a text prompt (RTX 4080 SUPER path)

This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on the CPU and the diffusion model has the GPU to itself:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Argument reference (from the Quick Start and CLI docs):

  • --model — checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published SOMA tiers include Kimodo-SOMA-RP-v1 and Kimodo-SOMA-SEED-v1.1; Kimodo-G1-RP-v1 is the Unitree-G1 robot retarget.
  • --duration — motion length in seconds. The CLI docs document --duration as the motion duration in seconds; there is no documented hard cap, so tune it to your clip.
  • --output — output stem name; with a single sample the motion is written as output.npz (CLI docs).

Use the environment variable name TEXT_ENCODER_DEVICE exactly as written. The reporter in Issue #27 hit a CUDA OOM because the text encoder loaded onto the GPU even when CPU execution was requested. In response, the maintainer added TEXT_ENCODER_DEVICE, which sets the text-encoder device independently of the Kimodo model, in the merged PR #32 ("Fixes to multi-prompt handling and add support for TEXT_ENCODER_DEVICE"). See Troubleshooting.

Also export BVH for Blender/Maya/MotionBuilder

The CLI docs document a --bvh flag that "also export[s] BVH (SOMA models only) using the same stem as --output" — the format you want for downstream DCC tools. Because this recipe pins a SOMA checkpoint, the flag applies:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
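
If you want to drive these commands from a script (for example, to batch several prompts), one option is a thin subprocess wrapper. The sketch below uses only the flags shown in this recipe (--model, --duration, --output, --bvh) and sets TEXT_ENCODER_DEVICE=cpu in the child environment. The function names kimodo_command and offload_env are hypothetical, and the wrapper calls the CLI rather than any Kimodo Python API.

```python
from __future__ import annotations

import os
import shutil
import subprocess


def kimodo_command(prompt: str, output: str, duration: float = 5.0,
                   model: str = "Kimodo-SOMA-RP-v1.1",
                   bvh: bool = False) -> list[str]:
    """Build a kimodo_gen argument list from the flags shown in this recipe."""
    cmd = ["kimodo_gen", prompt, "--model", model,
           "--duration", str(duration), "--output", output]
    if bvh:
        cmd.append("--bvh")  # SOMA models only, per the CLI docs
    return cmd


def offload_env(base: dict[str, str]) -> dict[str, str]:
    """Copy an environment and force the text encoder onto the CPU."""
    env = dict(base)
    env["TEXT_ENCODER_DEVICE"] = "cpu"
    return env


if __name__ == "__main__":
    cmd = kimodo_command("a person walks forward", output="walk_forward", bvh=True)
    if shutil.which("kimodo_gen"):
        subprocess.run(cmd, env=offload_env(dict(os.environ)), check=True)
    else:
        print("kimodo_gen not on PATH; would run:", " ".join(cmd))
```

Passing the environment explicitly keeps the offload setting scoped to the child process instead of changing your shell.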

Or launch the interactive web demo

If you installed with kimodo[all], the README documents a web UI launched with kimodo_demo, running locally on http://127.0.0.1:7860. Keep TEXT_ENCODER_DEVICE=cpu set so it fits a 16 GB card:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open http://127.0.0.1:7860 in your browser. The demo lets you author motions on a timeline of text prompts and kinematic constraints, then export the result; see the demo docs.

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin wraps the same upstream kimodo package, so the 16 GB VRAM constraint is identical: if you launch ComfyUI without the CPU text-encoder offload you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.

Results

  • Output (NPZ): the default output is an .npz 3D skeletal animation (global/local joint rotations, joint positions, root trajectory, foot contacts), compatible with the web demo (README default-NPZ format list).
  • Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format for Blender/Maya/MotionBuilder.
  • Output (G1 / SMPL-X): the G1 humanoid-robot checkpoint instead emits MuJoCo qpos CSV, and the SMPL-X checkpoint emits AMASS NPZ (README).
  • VRAM (default, all-on-GPU): ~17 GB (README) — exceeds the 4080 SUPER's 16 GB.
  • VRAM (with TEXT_ENCODER_DEVICE=cpu): <3 GB — what this recipe uses. The README documents this CPU-offload path for smaller cards, and a Kimodo maintainer (davrempe, a repo collaborator) confirmed the same ~3 GB figure on a 16 GB card in Issue #27: running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only about 3 GB of VRAM. The 4080 SUPER is the same 16 GB tier as the card in that report, so the same default-overflow and the same fix apply here.
  • Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 4080 SUPER by name; the README notes the model was most extensively tested on RTX 3090, RTX 4090, and A100, and the backend has no benchmark for this pair yet (/check/ returns unknown). The 4080 SUPER's ~736 GB/s memory bandwidth and 10240 CUDA cores should make it comfortably fast for a 282M diffusion model — but quoting a number without a 4080-SUPER-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-4080-super; please /contribute yours.
  • Model size: 282M parameters for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~16 GB on disk) dwarfs it.
  • License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License (a permissive, commercial-use-permitted license).
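
To see what a generated .npz actually contains, you can list its arrays with NumPy. This sketch deliberately assumes nothing about the array names Kimodo writes; it simply reports whatever names, shapes, and dtypes are in the file. The helper summarize_npz is illustrative, and it uses NumPy's default allow_pickle=False, so it would raise an error if an archive stored Python objects rather than plain arrays.

```python
from __future__ import annotations

from pathlib import Path

import numpy as np


def summarize_npz(path: Path) -> dict[str, tuple[tuple[int, ...], str]]:
    """Map each array name in an .npz archive to its (shape, dtype)."""
    summary = {}
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


if __name__ == "__main__":
    motion = Path("output.npz")  # stem used in the first kimodo_gen example
    if motion.exists():
        for name, (shape, dtype) in summarize_npz(motion).items():
            print(f"{name:24s} shape={shape} dtype={dtype}")
    else:
        print(f"{motion} not found; generate a motion first")
```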

Benchmark data for this pair, once contributed, will be listed at /check/kimodo/rtx-4080-super.

Troubleshooting

"CUDA out of memory" on a 16 GB card

This is the default failure on a 16 GB card and was reported directly on the canonical repo by a 16 GB-GPU owner in Issue #27: the Llama-3 text encoder loaded onto the GPU and left no room for the diffusion model. Two things to check:

  1. Set TEXT_ENCODER_DEVICE=cpu. The maintainer davrempe (a repo collaborator) replied on that issue that the merged PR #32 adds a TEXT_ENCODER_DEVICE environment variable, which controls the LLM2Vec text-encoder device independently of the Kimodo model, and that running TEXT_ENCODER_DEVICE=cpu kimodo_gen then uses only about 3 GB of VRAM (comment).
  2. Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed.

Note: that issue was filed by an RTX 5080 owner, but the 5080 and 4080 SUPER are the same 16 GB VRAM tier and the default ~17 GB footprint overflows both identically — the fix is the same on the 4080 SUPER.
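
The same decision can be made programmatically, which is handy on machines with mixed GPUs. The sketch below compares the card's total VRAM with the README's ~17 GB all-on-GPU figure and suggests the offload when the card is smaller. The threshold constant and the helper names are assumptions for illustration; the VRAM query uses PyTorch's torch.cuda.get_device_properties and falls back gracefully when PyTorch or CUDA is missing.

```python
from __future__ import annotations

ALL_ON_GPU_GB = 17.0  # README figure for running everything on the GPU


def needs_cpu_offload(total_vram_gb: float,
                      required_gb: float = ALL_ON_GPU_GB) -> bool:
    """True when the card cannot hold the default all-on-GPU pipeline."""
    return total_vram_gb < required_gb


def detect_vram_gb() -> float | None:
    """Total VRAM of GPU 0 in GB, or None when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


if __name__ == "__main__":
    vram = detect_vram_gb()
    if vram is None:
        print("no CUDA device visible to PyTorch")
    elif needs_cpu_offload(vram):
        print(f"{vram:.1f} GB VRAM: run with TEXT_ENCODER_DEVICE=cpu")
    else:
        print(f"{vram:.1f} GB VRAM: the all-on-GPU default should fit")
```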

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated — only the Llama-3 dependency is.

--bvh flag is ignored / output has no .bvh file

The CLI docs document BVH export for SOMA models only. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA path is skipped; pass --model Kimodo-SOMA-RP-v1.1 (or Kimodo-SOMA-RP-v1) to get BVH. The G1 checkpoint instead emits MuJoCo qpos CSV, and the SMPL-X checkpoint emits AMASS NPZ.

Slow generation with the CPU text encoder

Expected, and a documented trade-off — the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the one-time text-encoding step runs on the CPU; the diffusion sampling stays on the GPU. If the CPU step is your bottleneck and you have spare VRAM, you can leave the text encoder on the GPU — but on a 16 GB 4080 SUPER the all-on-GPU path overflows, so CPU offload is the recommended default.

For other issues, file a report via the submission form.