KiMoDo on RTX 5060 Ti: Text-to-3D-Motion Generation Guide

specialized · intermediate · 3 GB+ VRAM · May 18, 2026
Prerequisites
  • NVIDIA GPU with 16 GB VRAM (RTX 5060 Ti) — see VRAM note below
  • Python 3.10 (per official installation docs)
  • PyTorch 2.0+ built for your CUDA version (installed before kimodo)
  • HuggingFace account with approved access to meta-llama/Meta-Llama-3-8B-Instruct (gated)

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 5060 Ti. You'll type a prompt like "A person walks forward." and get back a 3D skeleton animation — joint rotations and root translations at 30 fps, up to 10 s long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 5060 Ti (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · See benchmark data

Note: As of this writing, no measured benchmarks exist for this GPU and model pair. The VRAM figures below come from the official Kimodo GitHub README; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component | Minimum | Tested
GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ (see VRAM rows) | RTX 5060 Ti (16 GB); pair not yet benchmarked, see /check/
VRAM (default, all on GPU) | ~17 GB (official README); does not fit a 16 GB 5060 Ti out of the box | -
VRAM (with TEXT_ENCODER_DEVICE=cpu) | <3 GB (official README); comfortably fits a 5060 Ti | -
RAM | 16 GB (Llama-3-8B sits in system RAM when offloaded; budget ~16 GB for it, plus headroom) | -
Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint) | -
Software | Python 3.10, PyTorch 2.0+ (installation docs) | -

VRAM caveat — this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282 M params, HF model card), but the default pipeline loads Meta-Llama-3-8B-Instruct on the same GPU for text encoding. That pushes peak VRAM to ~17 GB, which exceeds the 5060 Ti's 16 GB by a small margin. The official fix is TEXT_ENCODER_DEVICE=cpu, which moves Llama-3 to system RAM and drops GPU VRAM to <3 GB at the cost of "slightly slower" text encoding (official README). This is the path documented below.
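
The ~17 GB figure is easier to trust once the arithmetic is visible. The sketch below is a rough estimate only: it assumes 16-bit weights (2 bytes per parameter) and a parameter count of about 8.03 billion for Llama-3-8B. Both are assumptions made here for illustration, not figures from the Kimodo docs, and the estimate ignores activations and CUDA context overhead.

```shell
# Rough weight-memory estimate: parameters x bytes per parameter, in GiB.
# Assumption: 16-bit weights (2 bytes each); the real dtype may differ.
weights_gib() {
    # $1 = parameter count, $2 = bytes per parameter
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 1073741824 }'
}

weights_gib 8030000000 2   # Llama-3-8B text encoder (assumed ~8.03B params): prints 15.0
weights_gib 282000000 2    # KiMoDo diffusion model (282M params): prints 0.5
```

Under these assumptions the text encoder's weights alone take about 15 GiB, which leaves almost nothing on a 16 GB card for the diffusion model, activations, and the CUDA context.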

Installation

Installation steps below come from the canonical NVIDIA sources only — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. No third-party walkthrough is required because the install path is the upstream-supported one. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

Per the official installation docs:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

Kimodo's docs explicitly say to install a "compatible PyTorch (version 2.0+) manually before Kimodo, optimized for your CUDA version" (installation docs). For a Blackwell-class card like the 5060 Ti, pick a recent CUDA 12.x wheel from the official PyTorch index at pytorch.org/get-started/locally. Blackwell GPUs generally need a build targeting CUDA 12.8 or newer (cu128); older wheels such as cu121 or cu124 typically ship without Blackwell kernels and can fail at runtime, so confirm the selector's recommendation for your driver before installing.
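
A minimal sketch of the install and a follow-up check, assuming the cu128 wheel index is the right one for your driver. CUDA_TAG is a variable name chosen here for illustration. The commands are printed rather than executed so you can review them first.

```shell
# CUDA_TAG is a hypothetical helper variable; override it to match your setup.
CUDA_TAG="${CUDA_TAG:-cu128}"
INDEX_URL="https://download.pytorch.org/whl/${CUDA_TAG}"

# Print the install command for review, then run it by hand.
echo "pip install torch --index-url ${INDEX_URL}"

# Print a one-line check to run after installing: it reports the PyTorch
# version, the CUDA version the wheel was built against, and GPU visibility.
echo "python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'"
```

If the check reports False for GPU visibility, the wheel and driver most likely do not match; revisit the selector on pytorch.org before installing Kimodo.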

3. Install Kimodo

Two options from the official installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder is Meta-Llama-3-8B-Instruct, which is a gated HuggingFace model — you must request access on its HF page and then authenticate locally, per the installation docs:

hf auth login
# or place a token at ~/.cache/huggingface/token
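
Before the first generation run, it can help to confirm that a token is actually on disk. The sketch below assumes the default Hugging Face cache layout, with HF_HOME relocating it when set. The function names are made up for this example. It only checks that the file exists; it cannot tell whether your Llama-3 access request has been approved.

```shell
# Path where the Hugging Face CLI stores the login token by default.
# HF_HOME, when set, relocates the cache directory.
hf_token_path() {
    printf '%s/token\n' "${HF_HOME:-$HOME/.cache/huggingface}"
}

# Succeeds only if the token file exists and is not empty.
hf_token_present() {
    [ -s "$(hf_token_path)" ]
}

if hf_token_present; then
    echo "Hugging Face token found at $(hf_token_path)"
else
    echo "No token at $(hf_token_path); run: hf auth login"
fi
```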

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use.

Running

Generate a motion from a text prompt (RTX 5060 Ti path)

This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so Llama-3 runs on CPU and the diffusion model has the GPU to itself (official README):

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Arguments are from the Quick Start docs:

  • --model: checkpoint name. This recipe uses Kimodo-SOMA-RP-v1.1 throughout. Other published checkpoints are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid robot retarget).
  • --duration: motion length in seconds. The cap is 10 s, or 300 frames at 30 fps (HF model card).
  • --output: output stem name. With --output output, the motion file is written as output.npz.
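
The relationship between --duration and frame count is simple arithmetic: seconds times 30 fps, capped at 10 s. A small sketch of that calculation follows. The function name and the choice to reject out-of-range values are assumptions of this example; how kimodo_gen itself handles a duration over the cap is not covered here.

```shell
FPS=30          # output frame rate (HF model card)
MAX_SECONDS=10  # maximum motion length (HF model card)

# Print the frame count for a duration in seconds; fail if out of range.
frames_for_duration() {
    awk -v d="$1" -v fps="$FPS" -v max="$MAX_SECONDS" 'BEGIN {
        if (d <= 0 || d > max) { exit 1 }
        printf "%d\n", d * fps + 0.5
    }'
}

frames_for_duration 5.0   # prints 150
```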

Also export BVH for Blender/Maya/MotionBuilder

The SOMA-family checkpoints support direct BVH export via the --bvh flag (CLI docs) — the most useful format for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
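
To generate several motions in one go, the same command can be wrapped in a loop. This is a sketch under stated assumptions: slugify and run are helper names invented here, the second prompt is only an example, and the script defaults to a dry run that prints each command. It uses only the flags shown above.

```shell
# Turn a prompt into a safe output stem: lowercase, non-alphanumerics to "_".
slugify() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '_' | sed 's/^_//; s/_$//'
}

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@" </dev/null   # keep the command from consuming the prompt list
    fi
}

while IFS= read -r prompt; do
    [ -n "$prompt" ] || continue
    run env TEXT_ENCODER_DEVICE=cpu kimodo_gen "$prompt" \
        --model Kimodo-SOMA-RP-v1.1 --duration 5.0 \
        --output "$(slugify "$prompt")" --bvh
done <<'EOF'
A person walks forward.
A person waves with the right hand.
EOF
```

Each prompt gets its own output stem, so "A person walks forward." would be written as a_person_walks_forward.npz plus the matching .bvh file.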

Or launch the interactive web demo

If you installed with kimodo[all], you also get the Gradio-style web UI described in the Quick Start docs:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open http://localhost:7860. The demo UI is also where you can re-load an .npz file generated by kimodo_gen to visualize the skeleton (CLI docs).

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (Apache-2.0, not an NVIDIA-official wrapper), exposes KiMoDo as nodes and adds FBX (Mixamo-rigged) export on top of the upstream NPZ/BVH:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin's README documents NPZ, BVH (SOMA only), and FBX-via-Mixamo retarget as export options, plus an in-graph 2D skeleton preview. It does not document the TEXT_ENCODER_DEVICE=cpu workaround — if you launch ComfyUI on a 16 GB 5060 Ti without further changes you will hit OOM at the Llama-3 load step. Set the env var in the shell where you start ComfyUI (or pass it through Docker), e.g. TEXT_ENCODER_DEVICE=cpu python main.py.
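
One way to make the setting stick is to export it once in the shell that launches ComfyUI. A minimal sketch follows; the echo line only demonstrates that child processes inherit the exported value.

```shell
# Default to CPU text encoding unless the caller already chose a device.
export TEXT_ENCODER_DEVICE="${TEXT_ENCODER_DEVICE:-cpu}"

# Exported variables are inherited by child processes, which is what lets
# ComfyUI (and the plugin inside it) see the setting.
sh -c 'echo "child sees TEXT_ENCODER_DEVICE=$TEXT_ENCODER_DEVICE"'

# Start ComfyUI from this same shell afterwards, for example:
#   python main.py
# With Docker, the equivalent is the -e flag (image name left as a placeholder):
#   docker run -e TEXT_ENCODER_DEVICE=cpu <image>
```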

Results

  • Output (NPZ): the Kimodo NPZ contains root translation (num_frames × 3) and joint rotations (num_frames × 30 × 3 × 3 rotation matrices), at 30 fps, max 10 s / 300 frames (HF model card). NPZ also stores foot contacts and heading data (official README).
  • Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format you want for Blender/Maya/MotionBuilder — Kimodo does not emit MP4 video; visualization happens in the demo UI.
  • Output (FBX): only available via the community ComfyUI plugin's Mixamo-retarget node (ComfyUI-Kimodo README).
  • VRAM (default, all-on-GPU): ~17 GB (official README) — exceeds the 5060 Ti's 16 GB.
  • VRAM (with TEXT_ENCODER_DEVICE=cpu): <3 GB (official README) — what this recipe uses.
  • Model size: 282M parameters for the diffusion model itself (HF model card); Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
  • License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License and is marked "ready for commercial use."

Once empirical 5060 Ti numbers are seeded, they will appear at /check/kimodo/rtx-5060-ti.

Troubleshooting

"CUDA out of memory" loading the text encoder

This is the default failure on a 16 GB 5060 Ti — the Llama-3-8B text encoder doesn't fit alongside the diffusion model. Set TEXT_ENCODER_DEVICE=cpu (per the official README) before kimodo_gen, kimodo_demo, or — for the ComfyUI route — python main.py. Generation gets "slightly slower" but VRAM drops to <3 GB.
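
If you are unsure whether a given card needs the offload, its total memory can be compared against the README's ~17 GB figure. A rough sketch: treating "17 GB" as 17 × 1024 MiB is an approximation, and needs_cpu_offload is a helper name made up for this example.

```shell
# Threshold from the README's ~17 GB all-on-GPU figure, expressed in MiB.
# Treating "17 GB" as 17 * 1024 MiB is an approximation.
DEFAULT_PIPELINE_MIB=$((17 * 1024))

# Succeeds when a GPU with the given total memory (MiB) needs CPU offload.
needs_cpu_offload() {
    [ "$1" -lt "$DEFAULT_PIPELINE_MIB" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Total memory of the first GPU, in MiB, without header or units.
    total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    if needs_cpu_offload "$total_mib" 2>/dev/null; then
        echo "GPU has ${total_mib} MiB: export TEXT_ENCODER_DEVICE=cpu"
    else
        echo "GPU reports '${total_mib}' MiB: the default pipeline may fit, or the query failed"
    fi
else
    echo "nvidia-smi not found; cannot read GPU memory"
fi
```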

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs).

--bvh flag is ignored / output has no .bvh file

BVH export is SOMA-only (CLI docs). If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA skeleton path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. For G1 you instead get a MuJoCo qpos CSV.
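
A script that passes --bvh can guard against this with a name check. The sketch below is a heuristic, not an upstream API: it assumes that every BVH-capable checkpoint has "SOMA" in its name, which holds for the checkpoints listed in this recipe.

```shell
# Heuristic: treat any checkpoint whose name contains "SOMA" as BVH-capable.
supports_bvh() {
    case "$1" in
        *SOMA*) return 0 ;;
        *) return 1 ;;
    esac
}

for model in Kimodo-SOMA-RP-v1.1 Kimodo-G1-RP-v1; do
    if supports_bvh "$model"; then
        echo "$model: --bvh should produce a .bvh file"
    else
        echo "$model: no BVH export; expect the G1 output format instead"
    fi
done
```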

Slow generation on the CPU text encoder

Expected. The diffusion sampling stays on the GPU and is fast; only the one-time text-encoding step is on the CPU. If your CPU is the bottleneck and you have a second machine, the docs also describe running kimodo_textencoder as a standalone service (Quick Start docs) — out of scope here, but documented upstream.

Plugin: kimodo package not found inside ComfyUI

The ComfyUI plugin's README says "The kimodo package itself will be auto-installed on first launch if needed" (ComfyUI-Kimodo README). If the auto-install silently fails (common when ComfyUI uses an embedded Python without internet egress), install it explicitly into the same interpreter ComfyUI uses: <comfy-python> -m pip install git+https://github.com/nv-tlabs/kimodo.git.
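
To check which case you are in, ask the interpreter's own pip whether the package is present. In this sketch COMFY_PYTHON is a hypothetical variable that you point at ComfyUI's interpreter, and has_dist is a helper name invented here.

```shell
# COMFY_PYTHON is a hypothetical variable: point it at ComfyUI's interpreter.
COMFY_PYTHON="${COMFY_PYTHON:-python3}"

# Succeeds if the given interpreter's pip knows the named distribution.
has_dist() {
    "$1" -m pip show "$2" >/dev/null 2>&1
}

if has_dist "$COMFY_PYTHON" kimodo; then
    echo "kimodo is installed for $COMFY_PYTHON"
else
    echo "kimodo is missing; install it with:"
    echo "  $COMFY_PYTHON -m pip install git+https://github.com/nv-tlabs/kimodo.git"
fi
```

Running the check with the exact interpreter ComfyUI uses matters, because a kimodo install in a different environment is invisible to the plugin.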

For other issues, file a report via the submission form.