KiMoDo on RTX 4080: Text-to-3D-Motion Generation Guide

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 4080. You type a prompt like "a person walks forward" and get back a 3D skeletal animation — joint rotations and root translations, up to 10 seconds long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 4080 (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · Benchmark data: /check/kimodo/rtx-4080

⚠️ Known issue (16 GB cards): The default pipeline loads the Llama-3-8B text encoder on the GPU and needs about 17 GB — more than the 4080's 16 GB. The fix is one environment variable: set TEXT_ENCODER_DEVICE=cpu to move the text encoder to system RAM and the pipeline runs in roughly 3 GB of VRAM. Both numbers come from the canonical README (see VRAM caveat below).

ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. "SOMA" / "RP" here refer to the skeleton-and-retarget variant family, not chat or roleplay. The model is in our specialized vertical; its only language component is the frozen Llama-3-8B used as a text encoder — it never generates text.

Note: as of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rtx-4080 returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component	Minimum	Tested
GPU	NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM rows	RTX 4080 (16 GB) — pair not yet benchmarked, see /check/
VRAM (default, all on GPU)	~17 GB (canonical README) — does not fit a 16 GB 4080 out of the box	—
VRAM (with `TEXT_ENCODER_DEVICE=cpu`)	~3 GB (canonical README; maintainer-confirmed in Issue #27) — comfortably fits a 4080	—
RAM	16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom)	—
Storage	~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint ~1.13 GB)	—
Software	Python 3.10, PyTorch 2.0+ (installation docs)	—
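A quick preflight check against the table above can be sketched in Python. The thresholds are copied from the table's minimums (16 GB RAM, ~16 GB storage); the function names and the choice to probe the home directory's filesystem are illustrative assumptions, and the RAM probe relies on os.sysconf names that exist on Linux but not on every platform.

```python
import os
import shutil

GB = 1000 ** 3  # decimal gigabytes, matching the units used in the table


def free_disk_gb(path="."):
    """Free space on the filesystem that holds `path`, in GB."""
    return shutil.disk_usage(path).free / GB


def total_ram_gb():
    """Physical RAM in GB, or None when the platform does not expose it."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / GB
    except (AttributeError, ValueError, OSError):
        return None


def preflight(disk_gb, ram_gb, need_disk_gb=16.0, need_ram_gb=16.0):
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if disk_gb < need_disk_gb:
        problems.append(f"disk: {disk_gb:.1f} GB free, want {need_disk_gb:.0f} GB")
    if ram_gb is not None and ram_gb < need_ram_gb:
        problems.append(f"ram: {ram_gb:.1f} GB total, want {need_ram_gb:.0f} GB")
    return problems


if __name__ == "__main__":
    home = os.path.expanduser("~")  # assumption: the HF cache lives under home
    report = preflight(free_disk_gb(home), total_ram_gb())
    for line in report or ["minimum disk and RAM requirements look satisfied"]:
        print(line)
```

The check only compares against the published minimums; it says nothing about VRAM, which is covered separately.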

VRAM caveat: this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). But the default pipeline loads a Llama-3-8B-based text encoder onto the same GPU, and the README states the all-on-GPU footprint is about 17 GB, "primarily due to the text embedding model" (canonical README). That exceeds the 4080's 16 GB. The documented fix is TEXT_ENCODER_DEVICE=cpu, which the README describes as running the text encoder on the CPU for GPUs with less than 16 GB of VRAM: slightly slower, but it drops GPU usage to under 3 GB (canonical README). A card with exactly 16 GB still falls short of the ~17 GB default, so the same offload applies to the 4080. That is the path documented below.
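The arithmetic behind those figures can be sanity-checked with a back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter) and a nominal 8.0B parameter count for the text encoder, 32-bit weights for the diffusion model (which is consistent with 282M parameters occupying 1.13 GB on disk), and a card that exposes a nominal 16 GiB. These are illustrative assumptions, not numbers from the README, and the estimate ignores activations, the CUDA context, and allocator overhead.

```python
GIB = 1024 ** 3


def weights_gib(params, bytes_per_param):
    """Memory needed for the raw weights alone, in GiB."""
    return params * bytes_per_param / GIB


ENCODER_PARAMS = 8.0e9    # assumption: nominal size of an "8B" model
DIFFUSION_PARAMS = 282e6  # from the HF model card cited above
CARD_GIB = 16.0           # assumption: nominal capacity of a 16 GB card

encoder = weights_gib(ENCODER_PARAMS, bytes_per_param=2)      # assumed 16-bit
diffusion = weights_gib(DIFFUSION_PARAMS, bytes_per_param=4)  # assumed 32-bit
headroom = CARD_GIB - (encoder + diffusion)

print(f"text encoder weights: ~{encoder:.1f} GiB")
print(f"diffusion weights:    ~{diffusion:.2f} GiB")
print(f"left for everything else with both on GPU: ~{headroom:.2f} GiB")
print(f"left with the encoder on CPU: ~{CARD_GIB - diffusion:.1f} GiB")
```

Under these assumptions the weights alone nearly fill the card, which is why the default path overflows once runtime overhead is added, while the offloaded path leaves most of the GPU free.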

Installation

Installation steps below come from the canonical NVIDIA sources — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

The README lists Python 3.10 as the requirement:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

The installation docs recommend installing a compatible PyTorch (2.0 or newer) before installing Kimodo. The RTX 4080 is an Ada Lovelace (sm_89) card, so no special CUDA index URL is required: the default stable PyTorch wheels (cu124 / cu121) already support it. A plain install is enough:

pip install torch

Unlike Blackwell GPUs (RTX 50-series, sm_120), the RTX 4080 needs no special wheel selection: the standard pip install torch already supports Ada (sm_89) cards. (If you maintain a multi-card box and want to pin a CUDA build explicitly, pick the matching index URL at pytorch.org/get-started/locally.)
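One way to confirm that the install actually sees the card is a short check like the sketch below. It uses standard torch.cuda calls; the helper names are illustrative, and the script falls back to a plain message when PyTorch or a CUDA device is missing.

```python
import importlib.util


def sm_label(capability):
    """Turn a (major, minor) compute capability into an sm_XY label."""
    major, minor = capability
    return f"sm_{major}{minor}"


def gpu_report():
    """Describe GPU 0 as PyTorch sees it, or explain what is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    label = sm_label(torch.cuda.get_device_capability(0))
    vram_gib = props.total_memory / 1024 ** 3
    return (
        f"{props.name}: {label}, {vram_gib:.1f} GiB, "
        f"PyTorch {torch.__version__} (CUDA {torch.version.cuda})"
    )


print(gpu_report())
```

On an RTX 4080 the capability should come back as (8, 9), which the helper renders as sm_89.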

3. Install Kimodo

Two options, per the installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — the README states Kimodo requires it for text-conditioned generation. Request access on its HF page, wait for approval, then authenticate locally:

hf auth login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated (gated: false per the HF API); they download automatically on first use.
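Before the first run it can help to confirm that a token is discoverable at all. The sketch below checks only two places: the HF_TOKEN environment variable (an assumption here; it is the variable huggingface_hub reads) and the token file at ~/.cache/huggingface/token mentioned above. It does not verify that access to the gated repo has been granted, and it ignores a relocated cache directory.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return "env" or "file" depending on where a token was found, else None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("HF_TOKEN", "").strip():
        return "env"
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return "file"
    return None


if __name__ == "__main__":
    source = find_hf_token()
    if source:
        print(f"token found via: {source}")
    else:
        print("no token found; run `hf auth login` first")
```

A found token only rules out the simplest failure; a 401 on the Llama-3 repo can still mean the access request has not been approved yet.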

Running

Generate a motion from a text prompt (RTX 4080 path)

This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on the CPU and the diffusion model has the GPU to itself:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Argument reference (from the Quick Start and CLI docs):

--model: checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published checkpoints are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid-robot retarget).
--duration: motion length in seconds. The README documents an adjustable duration up to 10 seconds.
--output: output stem name; the motion file is written as <stem>.npz (output.npz for the command above).

Use the env var name TEXT_ENCODER_DEVICE exactly as written. The reporter in Issue #27 hit a CUDA OOM because the text encoder was loading onto the GPU even when CPU execution was requested. In response, the maintainer added the TEXT_ENCODER_DEVICE variable, which controls the text-encoder device independently of the Kimodo model, in the merged PR #32 ("Fixes to multi-prompt handling and add support for TEXT_ENCODER_DEVICE"). See Troubleshooting.
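For batch runs, the same invocation can be assembled from Python so the environment variable is never forgotten. This is a minimal sketch: the helper names are hypothetical, it uses only the flags listed above, and the 10-second guard mirrors the README limit quoted here rather than any check known to exist in the CLI.

```python
import os
import shlex


def kimodo_gen_command(prompt, output, duration=5.0, model="Kimodo-SOMA-RP-v1.1"):
    """Build the kimodo_gen argument list for one prompt."""
    if not 0 < duration <= 10:
        # Guard added by this sketch; the documented maximum is 10 seconds.
        raise ValueError("duration must be in (0, 10] seconds")
    return [
        "kimodo_gen", prompt,
        "--model", model,
        "--duration", str(duration),
        "--output", output,
    ]


def offload_env(base=None):
    """Copy the environment and force the text encoder onto the CPU."""
    env = dict(os.environ if base is None else base)
    env["TEXT_ENCODER_DEVICE"] = "cpu"
    return env


cmd = kimodo_gen_command("a person walks forward", "output")
# Print the equivalent shell line instead of running it.
print("TEXT_ENCODER_DEVICE=cpu " + shlex.join(cmd))
# To execute: subprocess.run(cmd, env=offload_env(), check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or punctuation.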

Also export BVH for Blender/Maya/MotionBuilder

The README lists BVH as an export format for the SOMA skeleton. Add --bvh to also write a standard BVH file alongside the NPZ — the format you want for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
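A quick way to sanity-check an exported file before importing it into a DCC tool is to read the BVH header. The sketch below relies only on the standard BVH text layout (ROOT/JOINT entries, then Frames: and Frame Time: lines); the joint names in the embedded sample are made up for illustration and are not the SOMA skeleton's, and the file name in the usage note is an assumption based on the --output stem.

```python
SAMPLE_BVH = """
HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0.0 10.0 0.0
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.0 5.0 0.0
    }
  }
}
MOTION
Frames: 2
Frame Time: 0.0333333
0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
"""


def summarize_bvh(text):
    """Collect joint names, frame count, and frame time from BVH text."""
    joints, frames, frame_time = [], None, None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[0] in ("ROOT", "JOINT") and len(parts) > 1:
            joints.append(parts[1])
        elif parts[0] == "Frames:":
            frames = int(parts[1])
        elif parts[:2] == ["Frame", "Time:"]:
            frame_time = float(parts[2])
            break  # only numeric motion rows follow
    return {"joints": joints, "frames": frames, "frame_time": frame_time}


info = summarize_bvh(SAMPLE_BVH)
print(len(info["joints"]), "joints,", info["frames"], "frames,",
      f"{info['frames'] * info['frame_time']:.3f} s")
```

For a real export, read the file first, for example summarize_bvh(open("walk_forward.bvh").read()), and compare frames × frame time against the requested --duration.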

Or launch the interactive web demo

If you installed with kimodo[all], the Quick Start docs document a web UI. Keep the CPU text-encoder flag set so it fits a 16 GB card:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open the local demo URL it prints. You can load and visualize a motion you generated earlier from its .npz path; see the CLI docs for the demo's load/save workflow.

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin wraps the same upstream kimodo package, so the 16 GB VRAM constraint is identical: if you launch ComfyUI without the CPU text-encoder offload you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.

Results

Output (NPZ): the default output is an .npz 3D skeletal animation, adjustable up to 10 seconds (README).
Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (README export-formats list). This is the format for Blender/Maya/MotionBuilder.
Output (G1 / SMPL-X): the G1 humanoid-robot checkpoint instead supports MuJoCo qpos CSV and AMASS NPZ output (README).
VRAM (default, all-on-GPU): ~17 GB (README) — exceeds the 4080's 16 GB.
VRAM (with TEXT_ENCODER_DEVICE=cpu): ~3 GB — what this recipe uses. The README documents this CPU-offload path for GPUs under 16 GB, and a Kimodo maintainer (davrempe, a repo collaborator) confirmed the same ~3 GB figure on a 16 GB card in Issue #27: running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only about 3 GB of VRAM. The 4080 is the same 16 GB tier as the card in that report, so the same default-overflow and the same fix apply here.
Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 4080 by name, and the backend has no benchmark for this pair yet (/check/ returns unknown). The 4080's 716.8 GB/s memory bandwidth suggests a 282M-parameter diffusion model should run comfortably, but quoting a number without a 4080-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-4080; please /contribute yours.
Model size: 282M parameters for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License (a permissive, commercial-use-permitted license).
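The array names stored inside the .npz are not documented in this recipe, so the sketch below lists whatever is present instead of assuming keys. With NumPy installed, numpy.load("output.npz").files gives the same names; this version uses only the standard library and reads each member's .npy header, so it makes no assumption about which NumPy version wrote the file.

```python
import ast
import struct
import zipfile


def read_npy_header(stream):
    """Parse the header of one .npy member and return its metadata dict."""
    if stream.read(6) != b"\x93NUMPY":
        raise ValueError("member is not in .npy format")
    major, _minor = stream.read(2)
    size_format = "<H" if major == 1 else "<I"  # v1: 2-byte length, v2+: 4-byte
    (header_len,) = struct.unpack(
        size_format, stream.read(struct.calcsize(size_format))
    )
    return ast.literal_eval(stream.read(header_len).decode("latin1").strip())


def describe_npz(path):
    """Map each array name in the archive to (dtype string, shape)."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            with archive.open(member) as stream:
                header = read_npy_header(stream)
            name = member[:-4] if member.endswith(".npy") else member
            summary[name] = (header["descr"], header["shape"])
    return summary


# Usage, once a motion has been generated:
# for name, (dtype, shape) in describe_npz("output.npz").items():
#     print(name, dtype, shape)
```

Checking the leading dimension of the arrays against the requested duration is a cheap way to confirm that the generation ran to completion.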

For the full benchmark data, see /check/kimodo/rtx-4080.
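Until a measurement exists, a simple wall-clock harness is enough to produce a number worth contributing. The sketch below times any command end to end; the function name is illustrative, and the demonstration runs a no-op Python process so it does not depend on Kimodo being installed. Note that an end-to-end time includes model loading and the CPU text-encoding step, so it is not a pure sampling-throughput figure.

```python
import subprocess
import sys
import time


def time_command(cmd, env=None):
    """Run `cmd` once and return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, env=env)
    return completed.returncode, time.perf_counter() - start


# Demonstration with a harmless command; swap in the kimodo_gen argument
# list and an environment containing TEXT_ENCODER_DEVICE=cpu for a real run.
code, seconds = time_command([sys.executable, "-c", "pass"])
print(f"exit code {code}, {seconds:.2f} s")
```

Running the same prompt a few times and reporting the spread, along with the duration and model name, makes the result easier to compare with other submissions.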

Troubleshooting

"CUDA out of memory" on a 16 GB card

This is the default failure on a 16 GB card and was reported directly on the canonical repo by a 16 GB-GPU owner in Issue #27: the Llama-3 text encoder loaded onto the GPU (~14.7 GiB) and left no room for the diffusion model. Two things to check:

Set TEXT_ENCODER_DEVICE=cpu. The maintainer davrempe (a repo collaborator) responded on that issue that they added a TEXT_ENCODER_DEVICE environment variable — which controls the LLM2Vec text-encoder device independently of the Kimodo model — in the merged PR #32, and that running TEXT_ENCODER_DEVICE=cpu kimodo_gen then uses only about 3 GB of VRAM (comment).
Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed.

Note: that issue was filed by an RTX 5080 owner, but the 5080 and 4080 are the same 16 GB VRAM tier and the default ~17 GB footprint overflows both identically — the fix is the same on the 4080.

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated — only the Llama-3 dependency is.

--bvh flag is ignored / output has no .bvh file

The README lists BVH export under the SOMA skeleton. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA path is skipped; pass --model Kimodo-SOMA-RP-v1.1 (or Kimodo-SOMA-RP-v1) to get BVH. The G1 / SMPL-X checkpoint instead emits MuJoCo qpos CSV and AMASS NPZ.

Slow generation with the CPU text encoder

Expected, and a documented trade-off: the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the one-time text-encoding step runs on the CPU; the diffusion sampling stays on the GPU. If the CPU step is your bottleneck and your card has more than the ~17 GB the default path needs, you can leave the text encoder on the GPU. On a 16 GB 4080 the all-on-GPU path overflows, so CPU offload is the recommended default.

For other issues, file a report via the submission form.