What You'll Build
A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 4070 Ti Super. You type a prompt like "a person walks forward" and get back a 3D skeletal animation — joint rotations and root translations, up to 10 seconds long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.
Hardware data: RTX 4070 Ti Super (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · See benchmark data
⚠️ Known issue (16 GB cards): The default pipeline loads the Llama-3-8B-based text encoder on the GPU and needs about 17 GB, more than the 4070 Ti Super's 16 GB. The fix is one environment variable: set TEXT_ENCODER_DEVICE=cpu to move the text encoder to system RAM, and the pipeline runs in under 3 GB of VRAM. Both numbers come from the canonical README (see VRAM caveat below).
ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. The "SOMA" / "RP" in the name refer to the skeleton (the SOMA 77-joint body model) and the Rigplay training data, not chat or roleplay. The model is in our specialized vertical; its only language component is the frozen Llama-3-8B used as a text encoder (via LLM2Vec), which never generates text.
Note: as of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rtx-4070-ti-super returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM rows | RTX 4070 Ti Super (16 GB) — pair not yet benchmarked, see /check/ |
| VRAM (default, all on GPU) | ~17 GB (canonical README) — does not fit a 16 GB 4070 Ti Super out of the box | — |
| VRAM (with TEXT_ENCODER_DEVICE=cpu) | under 3 GB (canonical README; maintainer-confirmed ~3 GB in Issue #27); comfortably fits a 4070 Ti Super | — |
| RAM | 16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) | — |
| Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint 1.13 GB) | — |
| Software | Python 3.10, PyTorch 2.0+ (installation docs) | — |
VRAM caveat: this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). But the default pipeline loads a Llama-3-8B-based text encoder onto the same GPU, and the README states the all-on-GPU footprint is ~17 GB, "primarily due to the text embedding model" (canonical README). That exceeds the 4070 Ti Super's 16 GB. The documented fix is TEXT_ENCODER_DEVICE=cpu, which the README describes as forcing text encoding to the CPU for smaller cards: slightly slower, but it reduces VRAM usage to under 3 GB (canonical README). That is the path documented below.
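The arithmetic behind the caveat above can be sketched in a few lines of Python. This is a back-of-the-envelope check, not a measurement: the ~17 GB and ~3 GB footprints are the README figures quoted above, while the function names, the returned labels, and the 1 GB headroom default are illustrative assumptions.

```python
"""Back-of-the-envelope VRAM check for the two KiMoDo configurations."""

# Footprints quoted from the Kimodo README (approximate, not measured here).
ALL_ON_GPU_GB = 17.0           # diffusion model + text encoder on the GPU
TEXT_ENCODER_ON_CPU_GB = 3.0   # README says "under 3 GB"; 3.0 is an upper bound


def fits(card_vram_gb: float, footprint_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the footprint plus some headroom fits in card VRAM.

    headroom_gb is a hypothetical safety margin (desktop compositor, CUDA
    context); adjust it for your own system.
    """
    return footprint_gb + headroom_gb <= card_vram_gb


def recommended_mode(card_vram_gb: float) -> str:
    """Pick the configuration that fits a card of the given size."""
    if fits(card_vram_gb, ALL_ON_GPU_GB):
        return "all-on-gpu"
    if fits(card_vram_gb, TEXT_ENCODER_ON_CPU_GB):
        return "text-encoder-on-cpu"
    return "does-not-fit"


# A 16 GB card such as the RTX 4070 Ti Super.
print(recommended_mode(16.0))
```

With these figures a 16 GB card lands on the text-encoder-on-cpu branch, which corresponds to setting TEXT_ENCODER_DEVICE=cpu.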
Installation
Installation steps below come from the canonical NVIDIA sources: the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.
1. Set up a Python environment
The installation docs suggest a fresh Python 3.10 environment:
conda create -n kimodo python=3.10
conda activate kimodo
2. Install PyTorch first, matched to your CUDA
The installation docs recommend installing a compatible PyTorch (2.0 or newer) before installing Kimodo. The RTX 4070 Ti Super is an Ada Lovelace (sm_89) card, so no special CUDA index URL is required — the default stable PyTorch wheels (cu124 / cu121) already ship sm_89 kernels. A plain install is enough:
pip install torch
Unlike Blackwell GPUs (RTX 50-series, sm_120), the RTX 4070 Ti Super needs no special wheel selection: the standard pip install torch includes the Ada sm_89 kernels this card uses. (If you maintain a multi-card box and want to pin a CUDA build explicitly, pick the matching index URL at pytorch.org/get-started/locally.)
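Before moving on, it can help to confirm that the installed PyTorch actually sees the card. The following is a minimal sketch using standard torch.cuda calls; the describe_cuda helper and its dictionary keys are illustrative names, and the script is written so that it also runs (and simply reports that torch is missing) if PyTorch is not installed yet.

```python
"""Report what PyTorch can see; safe to run even before torch is installed."""
import importlib.util


def describe_cuda() -> dict:
    """Collect basic facts about the installed PyTorch and the first GPU."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch  # imported lazily so the script still runs without torch

    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["device_name"] = props.name
        # An Ada Lovelace card such as the RTX 4070 Ti Super reports (8, 9).
        info["compute_capability"] = (props.major, props.minor)
        info["total_vram_gib"] = round(props.total_memory / 1024**3, 1)
    return info


for key, value in describe_cuda().items():
    print(f"{key}: {value}")
```

If cuda_available comes back False on a machine with a working NVIDIA driver, the usual cause is a CPU-only wheel, in which case reinstalling from the matching index URL is the place to start.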
3. Install Kimodo
Two options, per the installation docs:
# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git
Or, with the interactive demo web UI:
pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"
4. Authenticate with HuggingFace and request Llama-3 access
KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — the installation docs state the Kimodo text encoder requires it. Request access on its HF page, wait for approval, then authenticate locally:
hf auth login
# or place a token at ~/.cache/huggingface/token
KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use (the README notes models download automatically when you first generate).
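A quick pre-flight check for the token can save a failed first run. This sketch looks in the two places a token normally lives: the HF_TOKEN environment variable and the token file named above. The helper name is hypothetical, and it assumes the default cache location (a custom HF_HOME would move the file). Finding a token only shows that you are logged in, not that Llama-3 access has been approved.

```python
"""Check for a Hugging Face token before the first generation run."""
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token_source(env: Mapping[str, str], home: Path) -> Optional[str]:
    """Return a short description of where a token was found, or None.

    Checks the HF_TOKEN environment variable first, then the default
    token file at ~/.cache/huggingface/token.
    """
    if env.get("HF_TOKEN", "").strip():
        return "HF_TOKEN environment variable"
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


source = find_hf_token_source(os.environ, Path.home())
print(f"token found in: {source}" if source else "no token found; run `hf auth login`")
```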
Running
Generate a motion from a text prompt (RTX 4070 Ti Super path)
This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on the CPU and the diffusion model has the GPU to itself:
TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output output
Argument reference (from the Quick Start and CLI docs):
- --model: checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published tiers include Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (Unitree G1 humanoid-robot retarget).
- --duration: motion length in seconds. The README documents an adjustable duration up to 10 seconds.
- --output: output stem name; the motion file is written as output.npz.
Use the env var name TEXT_ENCODER_DEVICE exactly as written. The reporter in Issue #27 hit a CUDA OOM because the text encoder was loading onto the GPU even when CPU execution was requested; the maintainer added the TEXT_ENCODER_DEVICE variable that independently controls the text-encoder (LLM2Vec) device in the merged PR #32. See Troubleshooting.
Also export BVH for Blender/Maya/MotionBuilder
The CLI docs document a --bvh flag to also export a BVH file (SOMA models only) alongside the NPZ — the format you want for downstream DCC tools:
TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output walk_forward \
--bvh
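The two commands above can also be driven from Python, which is handy for batch-generating several prompts. The sketch below is a thin subprocess wrapper, not part of Kimodo: the function names are hypothetical, it uses only the flags shown in this recipe, and the SOMA check is a heuristic on the checkpoint name rather than something the CLI exposes.

```python
"""Build and optionally run a kimodo_gen call with the text encoder on the CPU."""
import os
import shlex
import subprocess
from typing import Dict, List, Mapping


def build_command(prompt: str, output: str, duration: float = 5.0,
                  model: str = "Kimodo-SOMA-RP-v1.1", bvh: bool = False) -> List[str]:
    """Assemble the argv list using only the flags shown in this recipe."""
    if not 0 < duration <= 10:
        # The README documents durations of up to 10 seconds.
        raise ValueError("duration must be in the range (0, 10] seconds")
    if bvh and "SOMA" not in model:
        # Heuristic on the checkpoint name (an assumption): the CLI docs say
        # BVH export is available for SOMA models only.
        raise ValueError(f"--bvh is documented for SOMA models only, got {model!r}")
    cmd = ["kimodo_gen", prompt, "--model", model,
           "--duration", str(duration), "--output", output]
    if bvh:
        cmd.append("--bvh")
    return cmd


def build_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy the environment and force the text encoder onto the CPU."""
    env = dict(base)
    env["TEXT_ENCODER_DEVICE"] = "cpu"
    return env


def generate(prompt: str, output: str, **kwargs) -> None:
    """Run kimodo_gen; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_command(prompt, output, **kwargs),
                   env=build_env(os.environ), check=True)


# Print the command instead of running it, so the sketch is safe to execute.
print(shlex.join(build_command("a person walks forward", "walk_forward", bvh=True)))
```

Calling generate("a person walks forward", "walk_forward", bvh=True) would then run the same command as the shell example, with TEXT_ENCODER_DEVICE=cpu always set.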
Or launch the interactive web demo
If you installed with kimodo[all], the README documents a web UI launched with kimodo_demo (it serves locally at http://127.0.0.1:7860). Keep the CPU text-encoder flag set so it fits a 16 GB card:
TEXT_ENCODER_DEVICE=cpu kimodo_demo
Then open the local demo URL it prints. You can load and visualize a motion you generated earlier from its .npz path; see the CLI docs for the demo's load/save workflow.
Optional: run inside ComfyUI
A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:
cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt
The plugin wraps the same upstream kimodo package, so the 16 GB VRAM constraint is identical: if you launch ComfyUI without the CPU text-encoder offload you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.
Results
- Output (NPZ): the default output is an .npz 3D skeletal animation, adjustable up to 10 seconds (README). The NPZ holds global + local joint rotation matrices, foot-contact labels, and root trajectory (README output-format spec).
- Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format for Blender/Maya/MotionBuilder.
- Output (G1 / SMPL-X): the G1 humanoid-robot checkpoint instead supports MuJoCo qpos CSV and AMASS NPZ output (README).
- VRAM (default, all-on-GPU): ~17 GB (README), which exceeds the 4070 Ti Super's 16 GB.
- VRAM (with TEXT_ENCODER_DEVICE=cpu): under 3 GB, which is what this recipe uses. The README documents this CPU-offload path for smaller cards, and a Kimodo maintainer (davrempe, a repo collaborator) confirmed running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB of VRAM in Issue #27. That report was filed by a 16 GB-card owner, so the same default-overflow and the same fix apply to the 4070 Ti Super's 16 GB.
- Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 4070 Ti Super by name (the README's most-tested cards are the RTX 3090, RTX 4090, and A100), and the backend has no benchmark for this pair yet (/check/ returns unknown). Quoting a number without a 4070-Ti-Super-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-4070-ti-super; please /contribute yours.
- Model size: 282M parameters for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
- License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License (a permissive, commercial-use-permitted license), per the model table in the README.
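To see what a generated NPZ actually contains, you can list its arrays and their shapes. An .npz file is a zip archive of .npy members, so the sketch below reads only each member's header with the standard library; the array names are whatever Kimodo wrote and are not assumed here. The helper names are hypothetical, and numpy.load is the usual route once you want the data itself.

```python
"""List the arrays inside a .npz file without loading the data."""
import ast
import struct
import zipfile
from typing import Dict, Tuple


def read_npy_header(stream) -> Tuple[tuple, str]:
    """Parse the header of one .npy member and return (shape, dtype string)."""
    if stream.read(6) != b"\x93NUMPY":
        raise ValueError("not a .npy member")
    major, _minor = stream.read(2)
    # Format 1.0 stores the header length in 2 bytes, 2.0 and 3.0 in 4 bytes.
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
    else:
        (header_len,) = struct.unpack("<I", stream.read(4))
    # The header is a Python dict literal with descr, fortran_order and shape.
    header = ast.literal_eval(stream.read(header_len).decode("utf-8"))
    return tuple(header["shape"]), str(header["descr"])


def summarize_npz(path: str) -> Dict[str, Tuple[tuple, str]]:
    """Map each array name in the archive to its (shape, dtype)."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as stream:
                summary[member[: -len(".npy")]] = read_npy_header(stream)
    return summary


# Example (assumes output.npz exists in the current directory):
# for name, (shape, dtype) in summarize_npz("output.npz").items():
#     print(name, shape, dtype)
```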
For the full benchmark data, see /check/kimodo/rtx-4070-ti-super.
Troubleshooting
"CUDA out of memory" on a 16 GB card
This is the default failure on a 16 GB card and was reported directly on the canonical repo by a 16 GB-GPU owner in Issue #27: the Llama-3 text encoder loaded onto the GPU and left no room for the diffusion model. Two things to check:
- Set TEXT_ENCODER_DEVICE=cpu. The maintainer davrempe (a repo collaborator) responded on that issue that they added a TEXT_ENCODER_DEVICE environment variable, which controls the LLM2Vec text-encoder device independently of the Kimodo model, in the merged PR #32, and that running TEXT_ENCODER_DEVICE=cpu kimodo_gen then uses only ~3 GB of VRAM (comment).
- Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed (the README changelog dates the TEXT_ENCODER_DEVICE support to 2026-04-24).
Note: that issue was filed by an RTX 5080 owner, but the 5080 and the 4070 Ti Super are the same 16 GB VRAM tier and the default ~17 GB footprint overflows both identically — the fix is the same on the 4070 Ti Super.
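If the OOM persists with the variable set, it is worth checking how much VRAM other processes are already holding. This sketch shells out to nvidia-smi with its standard --query-gpu options and parses the result; the helper names are hypothetical, and the values nvidia-smi prints with nounits are in MiB.

```python
"""Check GPU memory headroom via nvidia-smi before launching a generation."""
import subprocess
from typing import List, Tuple

QUERY = ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text: str) -> List[Tuple[int, int]]:
    """Turn nvidia-smi CSV output into (total_mib, used_mib) pairs, one per GPU."""
    pairs = []
    for line in text.strip().splitlines():
        total, used = (int(field.strip()) for field in line.split(","))
        pairs.append((total, used))
    return pairs


def free_gib_per_gpu() -> List[float]:
    """Run nvidia-smi and return free memory in GiB for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [(total - used) / 1024 for total, used in parse_memory_csv(result.stdout)]


# Example (requires an NVIDIA driver with nvidia-smi on PATH):
# print(free_gib_per_gpu())
```

If the free figure is below the roughly 3 GB the offloaded path needs, closing other GPU applications is the first thing to try.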
"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"
KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated — only the Llama-3 dependency is.
--bvh flag is ignored / output has no .bvh file
The CLI docs note that --bvh exports BVH for SOMA models only. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA BVH path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. The G1 / SMPL-X checkpoints instead emit MuJoCo qpos CSV and AMASS NPZ (CLI docs).
Slow generation with the CPU text encoder
Expected, and a documented trade-off — the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the one-time text-encoding step runs on the CPU; the diffusion sampling stays on the GPU. If the CPU step is your bottleneck and you have spare VRAM, you can leave the text encoder on the GPU — but on a 16 GB 4070 Ti Super the all-on-GPU path overflows (~17 GB), so CPU offload is the recommended default.
For other issues, file a report via the submission form.