What You'll Build
A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 4080. You type a prompt like "a person walks forward" and get back a 3D skeletal animation — joint rotations and root translations, up to 10 seconds long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.
Hardware data: RTX 4080 (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · See benchmark data
⚠️ Known issue (16 GB cards): The default pipeline loads the Llama-3-8B text encoder on the GPU and needs about 17 GB, more than the 4080's 16 GB. The fix is one environment variable: set TEXT_ENCODER_DEVICE=cpu to move the text encoder to system RAM, and the pipeline runs in roughly 3 GB of VRAM. Both numbers come from the canonical README (see VRAM caveat below).
ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. "SOMA" / "RP" here refer to the skeleton-and-retarget variant family, not chat or roleplay. The model is in our specialized vertical; its only language component is the frozen Llama-3-8B used as a text encoder, which never generates text.
Note: as of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rtx-4080 returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM rows | RTX 4080 (16 GB) — pair not yet benchmarked, see /check/ |
| VRAM (default, all on GPU) | ~17 GB (canonical README) — does not fit a 16 GB 4080 out of the box | — |
| VRAM (with TEXT_ENCODER_DEVICE=cpu) | ~3 GB (canonical README; maintainer-confirmed in Issue #27), which comfortably fits a 4080 | — |
| RAM | 16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) | — |
| Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint ~1.13 GB) | — |
| Software | Python 3.10, PyTorch 2.0+ (installation docs) | — |
VRAM caveat: this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). But the default pipeline loads a Llama-3-8B-based text encoder onto the same GPU, and the README states the all-on-GPU footprint is about 17 GB, "primarily due to the text embedding model" (canonical README). That exceeds the 4080's 16 GB. The documented fix is TEXT_ENCODER_DEVICE=cpu, which the README describes as running the text encoder on the CPU for GPUs with less than 16 GB of VRAM. It is slightly slower, but it drops GPU usage to about 3 GB (canonical README). That is the path documented below.
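The decision described above reduces to one comparison. The following is a minimal sketch, assuming the README's approximate 17 GB all-on-GPU figure as the threshold; `needs_cpu_offload` is a hypothetical helper for illustration, not part of Kimodo.

```python
import os

# Approximate all-on-GPU footprint quoted from the README (an assumption
# used as a threshold here, not a measured value).
ALL_ON_GPU_GB = 17.0


def needs_cpu_offload(total_vram_gb: float, required_gb: float = ALL_ON_GPU_GB) -> bool:
    """Return True when the card cannot hold the default all-on-GPU pipeline."""
    return total_vram_gb < required_gb


if needs_cpu_offload(16.0):  # RTX 4080: 16 GB
    # Child processes started from this interpreter inherit the variable.
    os.environ["TEXT_ENCODER_DEVICE"] = "cpu"

print(os.environ.get("TEXT_ENCODER_DEVICE"))
```

Only the documented value `cpu` is set here; on a card that passes the check, the variable is simply left alone.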
Installation
Installation steps below come from the canonical NVIDIA sources: the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.
1. Set up a Python environment
The README lists Python 3.10 as the requirement:
conda create -n kimodo python=3.10
conda activate kimodo
2. Install PyTorch first, matched to your CUDA
The installation docs recommend installing a compatible PyTorch (2.0 or newer) before installing Kimodo. The RTX 4080 is an Ada Lovelace (sm_89) card, so no special CUDA index URL is required — the default stable PyTorch wheels (cu124 / cu121) already ship sm_89 kernels. A plain install is enough:
pip install torch
Unlike Blackwell GPUs (RTX 50-series, sm_120), the RTX 4080 needs no special wheel selection: the standard pip install torch includes the Ada sm_89 kernels this card uses. (If you maintain a multi-card box and want to pin a CUDA build explicitly, pick the matching index URL at pytorch.org/get-started/locally.)
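Before installing Kimodo, it can help to confirm that the freshly installed PyTorch actually sees the card. This is a small sketch using standard `torch.cuda` calls; `cuda_summary` is a hypothetical helper name, and it returns None rather than failing when PyTorch or a CUDA device is missing.

```python
def cuda_summary():
    """Return (device_name, (major, minor)) for GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0)


summary = cuda_summary()
if summary is None:
    print("PyTorch is missing or cannot see a CUDA device")
else:
    name, capability = summary
    # An Ada card such as the RTX 4080 should report capability (8, 9).
    print(name, capability)
```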
3. Install Kimodo
Two options, per the installation docs:
# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git
Or, with the interactive demo web UI:
pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"
4. Authenticate with HuggingFace and request Llama-3 access
KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — the README states Kimodo requires it for text-conditioned generation. Request access on its HF page, wait for approval, then authenticate locally:
hf auth login
# or place a token at ~/.cache/huggingface/token
KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated (gated: false per the HF API); they download automatically on first use.
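A quick local check can confirm that a token is in place before the first run. The sketch below assumes the default token location named above and the `HF_TOKEN` environment variable honored by `huggingface_hub`; it does not account for a relocated cache (for example via `HF_HOME`). `find_hf_token` is a hypothetical helper. It only verifies that a token exists, not that Llama-3 access has been approved.

```python
import os
from pathlib import Path
from typing import Optional


def find_hf_token(home: Optional[Path] = None) -> Optional[str]:
    """Return a token from HF_TOKEN or the default token file, else None."""
    env_token = os.environ.get("HF_TOKEN", "").strip()
    if env_token:
        return env_token
    token_file = (home or Path.home()) / ".cache" / "huggingface" / "token"
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None


if find_hf_token():
    print("Hugging Face token found")
else:
    print("No token found: run `hf auth login` first")
```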
Running
Generate a motion from a text prompt (RTX 4080 path)
This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on the CPU and the diffusion model has the GPU to itself:
TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output output
Argument reference (from the Quick Start and CLI docs):
- --model: checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published tiers are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid-robot retarget).
- --duration: motion length in seconds. The README documents an adjustable duration up to 10 seconds.
- --output: output stem name; with the command above, the motion file is written as output.npz.
Use the env var name TEXT_ENCODER_DEVICE exactly as written. The reporter in Issue #27 hit a CUDA OOM because the text encoder was loading onto the GPU even when CPU execution was requested; the maintainer added the TEXT_ENCODER_DEVICE variable, which independently controls the text-encoder device, in the merged PR #32 ("Fixes to multi-prompt handling and add support for TEXT_ENCODER_DEVICE"). See Troubleshooting.
Also export BVH for Blender/Maya/MotionBuilder
The README lists BVH as an export format for the SOMA skeleton. Add --bvh to also write a standard BVH file alongside the NPZ — the format you want for downstream DCC tools:
TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output walk_forward \
--bvh
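For batch generation, the two commands above can be driven from Python. This is a sketch, not an official API: `build_kimodo_cmd` and `run_kimodo` are hypothetical helpers that only assemble the flags shown in this recipe (--model, --duration, --output, --bvh) and launch the kimodo_gen CLI with the CPU text-encoder variable set. The SOMA-only check for --bvh is a name-based heuristic drawn from the README's export-formats description.

```python
import os
import shlex
import subprocess


def build_kimodo_cmd(prompt, model="Kimodo-SOMA-RP-v1.1", duration=5.0,
                     output="output", bvh=False):
    """Assemble a kimodo_gen argument list from the flags used in this recipe."""
    if not 0 < duration <= 10:
        raise ValueError("duration must be in (0, 10] seconds")
    if bvh and "SOMA" not in model:
        # Heuristic: BVH export is documented for the SOMA skeleton only.
        raise ValueError("BVH export is documented for SOMA checkpoints only")
    cmd = ["kimodo_gen", prompt, "--model", model,
           "--duration", str(duration), "--output", output]
    if bvh:
        cmd.append("--bvh")
    return cmd


def run_kimodo(cmd):
    """Run the command with the text encoder forced onto the CPU."""
    env = {**os.environ, "TEXT_ENCODER_DEVICE": "cpu"}
    return subprocess.run(cmd, env=env, check=True)


cmd = build_kimodo_cmd("a person walks forward", output="walk_forward", bvh=True)
print(shlex.join(cmd))
# run_kimodo(cmd)  # uncomment once Kimodo is installed and Llama-3 access is granted
```

Passing the environment through `env=` keeps the offload setting scoped to the child process instead of changing the parent shell.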
Or launch the interactive web demo
If you installed with kimodo[all], the Quick Start docs document a web UI. Keep the CPU text-encoder flag set so it fits a 16 GB card:
TEXT_ENCODER_DEVICE=cpu kimodo_demo
Then open the local demo URL it prints. You can load and visualize a motion you generated earlier from its .npz path; see the CLI docs for the demo's load/save workflow.
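Outside the demo, a generated .npz can also be inspected directly with NumPy. The sketch below lists whatever arrays the archive contains without assuming Kimodo's key names; it assumes the file holds plain numeric arrays. `describe_npz` is a hypothetical helper, and the stand-in file with its `rotations` and `root` names is made up purely so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Map each array name in an .npz archive to its (shape, dtype)."""
    summary = {}
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Stand-in archive with placeholder array names (not Kimodo's actual keys).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example_motion.npz")
    np.savez(demo_path,
             rotations=np.zeros((4, 2, 3), dtype=np.float32),
             root=np.zeros((4, 3), dtype=np.float32))
    for name, (shape, dtype) in describe_npz(demo_path).items():
        print(name, shape, dtype)
```

To inspect a real result, call `describe_npz` with the path of the file written by kimodo_gen, for example output.npz.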
Optional: run inside ComfyUI
A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:
cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt
The plugin wraps the same upstream kimodo package, so the 16 GB VRAM constraint is identical: if you launch ComfyUI without the CPU text-encoder offload you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.
Results
- Output (NPZ): the default output is an .npz 3D skeletal animation, adjustable up to 10 seconds (README).
- Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (README export-formats list). This is the format for Blender/Maya/MotionBuilder.
- Output (G1 / SMPL-X): the G1 humanoid-robot checkpoint instead supports MuJoCo qpos CSV and AMASS NPZ output (README).
- VRAM (default, all-on-GPU): ~17 GB (README), which exceeds the 4080's 16 GB.
- VRAM (with TEXT_ENCODER_DEVICE=cpu): ~3 GB, which is what this recipe uses. The README documents this CPU-offload path for GPUs under 16 GB, and a Kimodo maintainer (davrempe, a repo collaborator) confirmed the same ~3 GB figure on a 16 GB card in Issue #27: running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only about 3 GB of VRAM. The 4080 is the same 16 GB tier as the card in that report, so the same default-overflow and the same fix apply here.
- Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 4080 by name, and the backend has no benchmark for this pair yet (/check/ returns unknown). The 4080's ~716.8 GB/s memory bandwidth should make it comfortably fast for a 282M diffusion model, but quoting a number without a 4080-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-4080; please /contribute yours.
- Model size: 282M parameters for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
- License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License (a permissive, commercial-use-permitted license).
For the full benchmark data, see /check/kimodo/rtx-4080.
Troubleshooting
"CUDA out of memory" on a 16 GB card
This is the default failure on a 16 GB card and was reported directly on the canonical repo by a 16 GB-GPU owner in Issue #27: the Llama-3 text encoder loaded onto the GPU (~14.7 GiB) and left no room for the diffusion model. Two things to check:
- Set TEXT_ENCODER_DEVICE=cpu. The maintainer davrempe (a repo collaborator) responded on that issue that they added a TEXT_ENCODER_DEVICE environment variable, which controls the LLM2Vec text-encoder device independently of the Kimodo model, in the merged PR #32, and that running TEXT_ENCODER_DEVICE=cpu kimodo_gen then uses only about 3 GB of VRAM (comment).
- Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed.
Note: that issue was filed by an RTX 5080 owner, but the 5080 and 4080 are the same 16 GB VRAM tier and the default ~17 GB footprint overflows both identically — the fix is the same on the 4080.
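To confirm the offload took effect, GPU memory can be sampled from a second terminal while kimodo_gen is running. The sketch below shells out to nvidia-smi with its standard query flags and parses the result; the helper names are hypothetical. With the CPU text encoder, usage attributable to Kimodo should land near the ~3 GB figure quoted above, although the reading also includes anything else using the GPU.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_line(line):
    """Parse one 'used, total' line (values in MiB) into a pair of ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib(index=0):
    """Return (used, total) in MiB for one GPU, or None if it cannot be read."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        lines = [line for line in out.splitlines() if line.strip()]
        return parse_memory_line(lines[index])
    except (FileNotFoundError, subprocess.CalledProcessError, IndexError, ValueError):
        return None


reading = gpu_memory_mib()
if reading is None:
    print("nvidia-smi not available or returned no usable data")
else:
    used, total = reading
    print(f"{used} MiB used of {total} MiB")
```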
"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"
KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated — only the Llama-3 dependency is.
--bvh flag is ignored / output has no .bvh file
The README lists BVH export under the SOMA skeleton. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. The G1 / SMPL-X checkpoint instead emits MuJoCo qpos CSV and AMASS NPZ.
Slow generation with the CPU text encoder
Expected, and a documented trade-off — the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the one-time text-encoding step runs on the CPU; the diffusion sampling stays on the GPU. If the CPU step is your bottleneck and you have spare VRAM, you can leave the text encoder on the GPU — but on a 16 GB 4080 the all-on-GPU path overflows, so CPU offload is the recommended default.
For other issues, file a report via the submission form.