What You'll Build
A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 5060 Ti. You'll type a prompt like "A person walks forward." and get back a 3D skeleton animation — joint rotations and root translations at 30 fps, up to 10 s long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.
Hardware data: RTX 5060 Ti (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · See benchmark data
Note: As of this writing the backend has no measured benchmarks for this pair. The VRAM figures below come from the official Kimodo GitHub README; treat them as the published baseline and report your own numbers via /contribute.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM row | RTX 5060 Ti (16 GB) — pair not yet benchmarked, see /check/ |
| VRAM (default, all on GPU) | ~17 GB (official README) — does not fit a 16 GB 5060 Ti out of the box | — |
| VRAM (with TEXT_ENCODER_DEVICE=cpu) | <3 GB (official README); comfortably fits a 5060 Ti | — |
| RAM | 16 GB (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) | — |
| Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint) | — |
| Software | Python 3.10, PyTorch 2.0+ (installation docs) | — |
VRAM caveat: this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282M params, HF model card), but the default pipeline loads Meta-Llama-3-8B-Instruct on the same GPU for text encoding. That pushes peak VRAM to ~17 GB, which exceeds the 5060 Ti's 16 GB by a small margin. The official fix is TEXT_ENCODER_DEVICE=cpu, which moves Llama-3 to system RAM and drops GPU VRAM to <3 GB at the cost of "slightly slower" text encoding (official README). This is the path documented below.
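To see why the default configuration is so tight on a 16 GB card, it can help to estimate the weight footprint alone. The sketch below is a back-of-the-envelope calculation, not the official breakdown of the ~17 GB figure: the Llama-3-8B parameter count (about 8.03 billion) and the bytes-per-parameter choices are assumptions made for illustration.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB (no activations)."""
    return num_params * bytes_per_param / GIB


# Assumptions (not taken from the Kimodo docs): Llama-3-8B has roughly
# 8.03e9 parameters stored at 2 bytes each (fp16/bf16); the 282M-parameter
# diffusion model is counted at 4 bytes each (fp32) as a pessimistic case.
LLAMA_PARAMS = 8.03e9
KIMODO_PARAMS = 282e6

llama_gib = weights_gib(LLAMA_PARAMS, 2)
kimodo_gib = weights_gib(KIMODO_PARAMS, 4)

print(f"Llama-3-8B weights: ~{llama_gib:.1f} GiB")
print(f"KiMoDo weights:     ~{kimodo_gib:.1f} GiB")
print(f"Both on one GPU:    ~{llama_gib + kimodo_gib:.1f} GiB before activations")
```

Under these assumptions the text encoder accounts for nearly all of the footprint, which is consistent with the ~15 GB on-disk size quoted for Llama-3-8B and with the large drop in VRAM once it is moved to the CPU.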
Installation
Installation steps below come from the canonical NVIDIA sources only: the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. No third-party walkthrough is required because the install path is the upstream-supported one. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.
1. Set up a Python environment
Per the official installation docs:
conda create -n kimodo python=3.10
conda activate kimodo
2. Install PyTorch first, matched to your CUDA
Kimodo's docs explicitly say to install a "compatible PyTorch (version 2.0+) manually before Kimodo, optimized for your CUDA version" (installation docs). On a Blackwell-class card like the 5060 Ti, treat 2.0 as Kimodo's floor rather than a sufficient version: Blackwell consumer GPUs (compute capability 12.0) need a PyTorch build compiled against CUDA 12.8 or newer, so pick a cu128-or-later index URL at pytorch.org/get-started/locally. Older cu121 and cu124 wheels do not ship kernels for this architecture.
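A PyTorch wheel can import cleanly and still lack kernels for the installed GPU, so it may be worth checking before installing Kimodo. The following is a minimal sketch that compares the device's compute capability against the architectures compiled into the wheel. It uses standard torch.cuda calls; the helper names arch_supported and report are illustrative, and the check ignores PTX forward compatibility, so a negative result is a strong hint rather than proof.

```python
def arch_supported(capability, arch_list):
    """True if the wheel lists native kernels (sm_XY) for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Describe the installed PyTorch build and whether it targets GPU 0."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment."
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} sees no CUDA device."
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    verdict = "OK" if arch_supported(cap, archs) else "no native kernels for this GPU"
    return (
        f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"{torch.cuda.get_device_name(0)} sm_{cap[0]}{cap[1]}: {verdict}"
    )


if __name__ == "__main__":
    print(report())
```

If the verdict is negative, reinstalling PyTorch from a newer CUDA index URL is the usual remedy.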
3. Install Kimodo
Two options from the official installation docs:
# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git
Or, with the interactive demo web UI:
pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"
4. Authenticate with HuggingFace and request Llama-3 access
KiMoDo's text encoder is Meta-Llama-3-8B-Instruct, which is a gated HuggingFace model — you must request access on its HF page and then authenticate locally, per the installation docs:
hf auth login
# or place a token at ~/.cache/huggingface/token
KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use.
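Before the first generation run, it can save a failed multi-gigabyte download to confirm that a token is actually visible. This sketch looks in the token file mentioned above and, as an assumption based on huggingface_hub conventions rather than the Kimodo docs, also honors the HF_TOKEN and HF_HOME environment variables. It only checks that a token exists; it cannot tell whether Llama-3 access has been approved.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return a label for where a Hugging Face token was found, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    # HF_HOME, when set, relocates the cache directory that holds the token file.
    if env.get("HF_HOME"):
        cache_dir = Path(env["HF_HOME"])
    else:
        cache_dir = home / ".cache" / "huggingface"
    token_file = cache_dir / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


if __name__ == "__main__":
    source = find_hf_token()
    print(f"Token found via: {source}" if source else "No token found; run: hf auth login")
```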
Running
Generate a motion from a text prompt (RTX 5060 Ti path)
This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so Llama-3 runs on CPU and the diffusion model has the GPU to itself (official README):
TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output output
Arguments are from the Quick Start docs:
- --model: checkpoint name. For the canonical SOMA tier used in this recipe, swap in Kimodo-SOMA-RP-v1.1. Other published tiers are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid robot retarget).
- --duration: motion length in seconds. Cap is 10 s / 300 frames at 30 fps (HF model card).
- --output: output stem name. The motion file is written as output.npz.
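The relationship between --duration and the number of frames is simple arithmetic at 30 fps, and the sketch below makes the 10 s / 300 frame cap explicit. Whether kimodo_gen itself clamps or rejects longer durations is not stated in the sources cited here, so raising an error is a choice made for this example; frame_count is a hypothetical helper, not part of Kimodo.

```python
FPS = 30            # output frame rate, per the HF model card
MAX_SECONDS = 10.0  # documented cap: 10 s, i.e. 300 frames


def frame_count(duration_s: float) -> int:
    """Number of frames for a requested duration, enforcing the documented cap."""
    if not 0 < duration_s <= MAX_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_SECONDS}] seconds, got {duration_s}")
    return round(duration_s * FPS)


print(frame_count(5.0))   # the --duration 5.0 used in this recipe
print(frame_count(10.0))  # the maximum length
```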
Also export BVH for Blender/Maya/MotionBuilder
The SOMA-family checkpoints support direct BVH export via the --bvh flag (CLI docs) — the most useful format for downstream DCC tools:
TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output walk_forward \
--bvh
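For more than a handful of prompts, wrapping the command in a small script avoids retyping the environment variable and flags. The sketch below builds the same invocation shown above using only the flags documented in this recipe (--model, --duration, --output, --bvh); build_command and generate are illustrative names, and the wrapper assumes kimodo_gen is on the PATH of the active environment.

```python
import os
import subprocess

MODEL = "Kimodo-SOMA-RP-v1.1"


def build_command(prompt, output_stem, duration=5.0, bvh=False):
    """Assemble the kimodo_gen argument list used in this recipe."""
    cmd = [
        "kimodo_gen", prompt,
        "--model", MODEL,
        "--duration", str(duration),
        "--output", output_stem,
    ]
    if bvh:
        cmd.append("--bvh")
    return cmd


def generate(prompt, output_stem, **kwargs):
    """Run kimodo_gen with the text encoder forced onto the CPU."""
    env = dict(os.environ, TEXT_ENCODER_DEVICE="cpu")
    return subprocess.run(build_command(prompt, output_stem, **kwargs), env=env, check=True)
```

Calling generate("A person walks forward.", "walk_forward", bvh=True) would run the same command as the shell example. Each call starts a fresh process, so the text encoder is presumably loaded again every time; for large batches that overhead may dominate.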
Or launch the interactive web demo
If you installed with kimodo[all], the Quick Start docs document a Gradio-style web UI:
TEXT_ENCODER_DEVICE=cpu kimodo_demo
Then open http://localhost:7860. The demo UI is also where you can re-load an .npz file generated by kimodo_gen to visualize the skeleton (CLI docs).
Optional: run inside ComfyUI
A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (Apache-2.0, not an NVIDIA-official wrapper), exposes KiMoDo as nodes and adds FBX (Mixamo-rigged) export on top of the upstream NPZ/BVH:
cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt
The plugin's README documents NPZ, BVH (SOMA only), and FBX-via-Mixamo retarget as export options, plus an in-graph 2D skeleton preview. It does not document the TEXT_ENCODER_DEVICE=cpu workaround — if you launch ComfyUI on a 16 GB 5060 Ti without further changes you will hit OOM at the Llama-3 load step. Set the env var in the shell where you start ComfyUI (or pass it through Docker), e.g. TEXT_ENCODER_DEVICE=cpu python main.py.
Results
- Output (NPZ): the Kimodo NPZ contains root translation (num_frames × 3) and joint rotations (num_frames × 30 × 3 × 3 rotation matrices), at 30 fps, max 10 s / 300 frames (HF model card). NPZ also stores foot contacts and heading data (official README).
- Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format you want for Blender/Maya/MotionBuilder. Kimodo does not emit MP4 video; visualization happens in the demo UI.
- Output (FBX): only available via the community ComfyUI plugin's Mixamo-retarget node (ComfyUI-Kimodo README).
- VRAM (default, all-on-GPU): ~17 GB (official README), which exceeds the 5060 Ti's 16 GB.
- VRAM (with TEXT_ENCODER_DEVICE=cpu): <3 GB (official README); this is what this recipe uses.
- Model size: 282M parameters for the diffusion model itself (HF model card); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
- License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License and is marked "ready for commercial use."
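To check what a generated file actually contains, the NPZ can be opened with NumPy. The sketch below deliberately makes no assumption about the array names inside the archive, since they are not listed in the sources cited here; it reports whatever is present. The second helper is an optional sanity check that an array shaped like the joint rotations (trailing 3 × 3 blocks) holds valid rotation matrices. Both function names are illustrative.

```python
import numpy as np


def summarize_npz(path):
    """Map each array name in the archive to its (shape, dtype)."""
    summary = {}
    with np.load(path) as data:
        for name in data.files:
            arr = data[name]
            summary[name] = (arr.shape, str(arr.dtype))
    return summary


def is_rotation_stack(mats, atol=1e-4):
    """True if every trailing 3x3 block is orthonormal with determinant +1."""
    mats = np.asarray(mats, dtype=np.float64)
    if mats.shape[-2:] != (3, 3):
        return False
    gram = mats @ np.swapaxes(mats, -1, -2)  # R @ R^T should be the identity
    return bool(
        np.allclose(gram, np.eye(3), atol=atol)
        and np.allclose(np.linalg.det(mats), 1.0, atol=atol)
    )
```

For a 5-second clip, one would expect summarize_npz("output.npz") to show a leading dimension of 150 on the per-frame arrays, given the 30 fps rate described above.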
Once empirical 5060 Ti numbers are seeded, they will appear at /check/kimodo/rtx-5060-ti.
Troubleshooting
"CUDA out of memory" loading the text encoder
This is the default failure on a 16 GB 5060 Ti — the Llama-3-8B text encoder doesn't fit alongside the diffusion model. Set TEXT_ENCODER_DEVICE=cpu (per the official README) before kimodo_gen, kimodo_demo, or — for the ComfyUI route — python main.py. Generation gets "slightly slower" but VRAM drops to <3 GB.
"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"
You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs).
--bvh flag is ignored / output has no .bvh file
BVH export is SOMA-only (CLI docs). If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA skeleton path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. For G1 you instead get a MuJoCo qpos CSV.
Slow generation on the CPU text encoder
Expected. The diffusion sampling stays on the GPU and is fast; only the one-time text-encoding step is on the CPU. If your CPU is the bottleneck and you have a second machine, the docs also describe running kimodo_textencoder as a standalone service (Quick Start docs) — out of scope here, but documented upstream.
Plugin: kimodo package not found inside ComfyUI
The ComfyUI plugin's README says "The kimodo package itself will be auto-installed on first launch if needed" (ComfyUI-Kimodo README). If the auto-install silently fails (common when ComfyUI uses an embedded Python without internet egress), install it explicitly into the same interpreter ComfyUI uses: <comfy-python> -m pip install git+https://github.com/nv-tlabs/kimodo.git.
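A common cause of this error is installing into a different interpreter than the one ComfyUI runs. The short script below prints which interpreter is executing it and whether a package can be found from there. It assumes the import name is kimodo, matching the package name in the plugin's README; package_status is a hypothetical helper.

```python
import importlib.util
import sys


def package_status(name):
    """Return (interpreter path, whether `name` is importable from it)."""
    return sys.executable, importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    exe, found = package_status("kimodo")
    print(f"{exe}: kimodo {'importable' if found else 'NOT importable'}")
```

Run it with the same interpreter ComfyUI uses (the <comfy-python> placeholder above) so the result reflects that environment rather than your shell's default Python.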
For other issues, file a report via the submission form.