KiMoDo on RTX 4070 Ti Super: Text-to-3D-Motion Generation Guide

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 4070 Ti Super. You type a prompt like "a person walks forward" and get back a 3D skeletal animation — joint rotations and root translations, up to 10 seconds long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 4070 Ti Super (16 GB VRAM) · KiMoDo (282M-param diffusion model + Llama-3-8B text encoder) · See benchmark data

⚠️ Known issue (16 GB cards): The default pipeline loads the Llama-3-8B-based text encoder on the GPU and needs about 17 GB — more than the 4070 Ti Super's 16 GB. The fix is one environment variable: set TEXT_ENCODER_DEVICE=cpu to move the text encoder to system RAM and the pipeline runs in under 3 GB of VRAM. Both numbers come from the canonical README (see VRAM caveat below).

ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. The "SOMA" / "RP" in the name refer to the skeleton (the SOMA 77-joint body model) and the Rigplay training data — not chat or roleplay. The model is in our specialized vertical; its only language component is the frozen Llama-3-8B used as a text encoder (via LLM2Vec) — it never generates text.

Note: as of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rtx-4070-ti-super returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component	Minimum	Tested
GPU	NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM rows	RTX 4070 Ti Super (16 GB) — pair not yet benchmarked, see /check/
VRAM (default, all on GPU)	~17 GB (canonical README) — does not fit a 16 GB 4070 Ti Super out of the box	—
VRAM (with `TEXT_ENCODER_DEVICE=cpu`)	under 3 GB (canonical README; maintainer-confirmed ~3 GB in Issue #27) — comfortably fits a 4070 Ti Super	—
RAM	16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom)	—
Storage	~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint 1.13 GB)	—
Software	Python 3.10, PyTorch 2.0+ (installation docs)	—
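
The storage row is simple arithmetic, and it can be checked against the free space on the drive before starting the downloads. The sketch below is illustrative only: the two sizes are the approximate figures from the table, and checking the home directory is an assumption (it is where the Hugging Face cache lives by default, but your cache may be elsewhere).

```python
import shutil
from pathlib import Path

# Approximate on-disk sizes from the requirements table, in GB.
LLAMA_WEIGHTS_GB = 15.0
KIMODO_CHECKPOINT_GB = 1.13


def required_download_gb() -> float:
    """Total size of both downloads, in GB."""
    return LLAMA_WEIGHTS_GB + KIMODO_CHECKPOINT_GB


def has_room(path: str, needed_gb: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    needed = required_download_gb()
    # Assumption: the model cache sits under the home directory.
    ok = has_room(str(Path.home()), needed)
    print(f"need about {needed:.2f} GB; enough free space: {ok}")
```

Leave some margin beyond the computed total, since temporary files during download are not counted here.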

VRAM caveat — this is the load-bearing detail for a 16 GB card. The KiMoDo diffusion model itself is small (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). But the default pipeline loads a Llama-3-8B-based text encoder onto the same GPU, and the README states the all-on-GPU footprint is ~17 GB, "primarily due to the text embedding model" (canonical README). That exceeds the 4070 Ti Super's 16 GB. The documented fix is TEXT_ENCODER_DEVICE=cpu, which the README describes as forcing text encoding to the CPU for smaller cards — slightly slower, but it reduces VRAM usage to under 3 GB (canonical README). That is the path documented below.
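
The comparison behind this caveat can be written out as a few lines of Python. This is a sketch of the arithmetic only, using the README's published approximate figures; it does not measure anything, and the function name is hypothetical.

```python
# Published, approximate footprints from the Kimodo README, in GB.
ALL_ON_GPU_GB = 17.0      # default: text encoder and diffusion model on the GPU
ENCODER_ON_CPU_GB = 3.0   # upper bound with TEXT_ENCODER_DEVICE=cpu


def fits(card_vram_gb: float, text_encoder_on_cpu: bool) -> bool:
    """Compare the published footprint against the card's nominal VRAM."""
    needed = ENCODER_ON_CPU_GB if text_encoder_on_cpu else ALL_ON_GPU_GB
    return needed <= card_vram_gb


if __name__ == "__main__":
    for on_cpu in (False, True):
        print(f"text encoder on CPU = {on_cpu}: fits a 16 GB card = {fits(16.0, on_cpu)}")
```

In practice the usable VRAM is somewhat below the nominal figure because the desktop and the CUDA context take a share, which only widens the gap in the default case.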

Installation

Installation steps below come from the canonical NVIDIA sources — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

The installation docs suggest a fresh Python 3.10 environment:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

The installation docs recommend installing a compatible PyTorch (2.0 or newer) before installing Kimodo. The RTX 4070 Ti Super is an Ada Lovelace (sm_89) card, so no special CUDA index URL is required — the default stable PyTorch wheels (cu124 / cu121) already ship sm_89 kernels. A plain install is enough:

pip install torch

Blackwell GPUs (RTX 50-series, sm_120) are the cards that need special wheel selection; the RTX 4070 Ti Super does not. If you maintain a multi-card box and want to pin a CUDA build explicitly, pick the matching index URL at pytorch.org/get-started/locally.
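
After installing, it is worth confirming that the PyTorch build you got can actually see the GPU, since a CPU-only build would install without complaint. The following is a minimal sketch using standard torch.cuda calls; the helper names are hypothetical, and on this card the expected architecture string is sm_89.

```python
def arch_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an sm_XY string."""
    return f"sm_{major}{minor}"


def report() -> str:
    """Describe what the installed PyTorch build can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but CUDA is not available"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return f"PyTorch {torch.__version__}: {name} ({arch_name(major, minor)})"


if __name__ == "__main__":
    print(report())
```

If the report says CUDA is not available, reinstall PyTorch with an explicit CUDA index URL before installing Kimodo.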

3. Install Kimodo

Two options, per the installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model, which the installation docs list as a requirement. Request access on its HF page, wait for approval, then authenticate locally:

hf auth login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use (the README notes models download automatically when you first generate).
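
Before the first generation run, you can check that a token is actually in place. The sketch below looks at the token file mentioned above and, as an assumption not taken from the Kimodo docs, also at the HF_TOKEN environment variable that the Hugging Face libraries read. It reports only where a token was found and never prints the token, and it cannot tell whether access to the gated model has been approved.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(home: Path, environ: Mapping[str, str]) -> Optional[str]:
    """Return a description of where a token was found, or None."""
    # Assumption: HF_TOKEN, if set, takes the place of the token file.
    if environ.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


if __name__ == "__main__":
    source = find_hf_token(Path.home(), os.environ)
    if source:
        print(f"token found in: {source}")
    else:
        print("no token found; run `hf auth login`")
```

If you have relocated the Hugging Face cache, the token file will not be at this default path and the check will miss it.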

Running

Generate a motion from a text prompt (RTX 4070 Ti Super path)

This is the load-bearing command for a 16 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on the CPU and the diffusion model has the GPU to itself:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Argument reference (from the Quick Start and CLI docs):

--model: checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published checkpoints include Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (Unitree G1 humanoid-robot retarget).
--duration: motion length in seconds. The README documents an adjustable duration up to 10 seconds.
--output: output stem name; the motion file is written as <stem>.npz (output.npz for the command above).

Use the env var name TEXT_ENCODER_DEVICE exactly as written. The reporter in Issue #27 hit a CUDA OOM because the text encoder was loading onto the GPU even when CPU execution was requested; the maintainer added the TEXT_ENCODER_DEVICE variable that independently controls the text-encoder (LLM2Vec) device in the merged PR #32. See Troubleshooting.

Also export BVH for Blender/Maya/MotionBuilder

The CLI docs document a --bvh flag to also export a BVH file (SOMA models only) alongside the NPZ — the format you want for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "a person walks forward" \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
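
If you generate many clips, the two commands above can be assembled from Python instead of typed by hand. This is a sketch under stated assumptions: it uses only the flags shown in this guide, the function name is hypothetical, and the rule that a SOMA model has "SOMA" in its name is a heuristic of this example rather than something the CLI docs state.

```python
import os
import shlex
from typing import Dict, List, Mapping, Tuple

MAX_DURATION_S = 10.0  # README: duration is adjustable up to 10 seconds


def build_kimodo_gen(
    prompt: str,
    output: str,
    duration: float = 5.0,
    model: str = "Kimodo-SOMA-RP-v1.1",
    bvh: bool = False,
    base_env: Mapping[str, str] = os.environ,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a kimodo_gen run with the text encoder on the CPU."""
    if not 0 < duration <= MAX_DURATION_S:
        raise ValueError(f"duration must be in (0, {MAX_DURATION_S}] seconds")
    # Heuristic (assumption): SOMA checkpoints carry "SOMA" in their name.
    if bvh and "SOMA" not in model:
        raise ValueError("--bvh is documented for SOMA models only")
    argv = [
        "kimodo_gen", prompt,
        "--model", model,
        "--duration", str(duration),
        "--output", output,
    ]
    if bvh:
        argv.append("--bvh")
    env = dict(base_env)
    env["TEXT_ENCODER_DEVICE"] = "cpu"  # keep the Llama-3 encoder off the GPU
    return argv, env


if __name__ == "__main__":
    argv, env = build_kimodo_gen("a person walks forward", "walk_forward", bvh=True)
    print("TEXT_ENCODER_DEVICE=cpu", shlex.join(argv))
```

To actually run the command, pass the pair to subprocess.run(argv, env=env, check=True) in an environment where kimodo_gen is on the PATH.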

Or launch the interactive web demo

If you installed with kimodo[all], the README documents a web UI launched with kimodo_demo (it serves locally at http://127.0.0.1:7860). Keep TEXT_ENCODER_DEVICE=cpu set so it fits a 16 GB card:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open the local demo URL it prints. You can load and visualize a motion you generated earlier from its .npz path; see the CLI docs for the demo's load/save workflow.

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin wraps the same upstream kimodo package, so the 16 GB VRAM constraint is identical: if you launch ComfyUI without the CPU text-encoder offload you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.
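
Because the variable has to be set in the shell that launches ComfyUI, it is easy to forget. A small preflight check, sketched below, can be run in that same shell first. The function name is hypothetical, and the exact-match comparison against "cpu" is an assumption about how the value is read.

```python
import os
from typing import Mapping, Optional


def offload_warning(environ: Mapping[str, str]) -> Optional[str]:
    """Return a warning if the CPU text-encoder offload is not requested."""
    if environ.get("TEXT_ENCODER_DEVICE") == "cpu":
        return None
    return (
        "TEXT_ENCODER_DEVICE is not set to 'cpu'; on a 16 GB card the "
        "text encoder may not fit on the GPU"
    )


if __name__ == "__main__":
    print(offload_warning(os.environ) or "TEXT_ENCODER_DEVICE=cpu is set")
```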

Results

Output (NPZ): the default output is an .npz file containing a 3D skeletal animation of adjustable length, up to 10 seconds (README). The NPZ holds global and local joint rotation matrices, foot-contact labels, and the root trajectory (README output-format spec).
Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format for Blender/Maya/MotionBuilder.
Output (G1 / SMPL-X): the G1 and SMPL-X checkpoints instead support MuJoCo qpos CSV and AMASS NPZ output (README).
VRAM (default, all-on-GPU): ~17 GB (README) — exceeds the 4070 Ti Super's 16 GB.
VRAM (with TEXT_ENCODER_DEVICE=cpu): under 3 GB — what this recipe uses. The README documents this CPU-offload path for smaller cards, and a Kimodo maintainer (davrempe, a repo collaborator) confirmed running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB of VRAM in Issue #27. That report was filed by a 16 GB-card owner, so the same default-overflow and the same fix apply to the 4070 Ti Super's 16 GB.
Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 4070 Ti Super by name (the README's most-tested cards are the RTX 3090, RTX 4090, and A100), and the backend has no benchmark for this pair yet (/check/ returns unknown). Quoting a number without a 4070-Ti-Super-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-4070-ti-super; please /contribute yours.
Model size: 282M parameters for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License (a permissive, commercial-use-permitted license), per the model table in the README.
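
To see what a generated NPZ actually contains, you can list its arrays without assuming any key names. The sketch below relies only on the fact that an .npz file is a zip archive with one .npy member per array; the function name is hypothetical, and the exact array names Kimodo writes are not reproduced here because this guide does not list them.

```python
import zipfile
from typing import List


def npz_array_names(path: str) -> List[str]:
    """List the arrays stored in an .npz file without loading them."""
    suffix = ".npy"
    with zipfile.ZipFile(path) as archive:
        return [
            name[: -len(suffix)]
            for name in archive.namelist()
            if name.endswith(suffix)
        ]
```

For example, npz_array_names("output.npz") returns the stored array names for the file written by the first command in this guide. With NumPy installed, numpy.load("output.npz").files gives the same names, and indexing the loaded object by name returns the array itself.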

Benchmark data for this pair will appear at /check/kimodo/rtx-4070-ti-super once it has been measured.

Troubleshooting

"CUDA out of memory" on a 16 GB card

This is the default failure on a 16 GB card and was reported directly on the canonical repo by a 16 GB-GPU owner in Issue #27: the Llama-3 text encoder loaded onto the GPU and left no room for the diffusion model. Two things to check:

Set TEXT_ENCODER_DEVICE=cpu. The maintainer davrempe (a repo collaborator) responded on that issue that they added a TEXT_ENCODER_DEVICE environment variable — which controls the LLM2Vec text-encoder device independently of the Kimodo model — in the merged PR #32, and that running TEXT_ENCODER_DEVICE=cpu kimodo_gen then uses only ~3 GB of VRAM (comment).
Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed (the README changelog dates the TEXT_ENCODER_DEVICE support to 2026-04-24).

Note: that issue was filed by an RTX 5080 owner, but the 5080 and the 4070 Ti Super are the same 16 GB VRAM tier and the default ~17 GB footprint overflows both identically — the fix is the same on the 4070 Ti Super.

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on that model's HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated; only the Llama-3 dependency is.

`--bvh` flag is ignored / output has no `.bvh` file

The CLI docs note that --bvh exports BVH for SOMA models only. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA BVH path is skipped; pass --model Kimodo-SOMA-RP-v1.1 (or --model Kimodo-SOMA-RP-v1) to get BVH. The G1 / SMPL-X checkpoints instead emit MuJoCo qpos CSV and AMASS NPZ (CLI docs).

Slow generation with the CPU text encoder

Expected, and a documented trade-off — the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the one-time text-encoding step runs on the CPU; the diffusion sampling stays on the GPU. If the CPU step is your bottleneck and you have spare VRAM, you can leave the text encoder on the GPU — but on a 16 GB 4070 Ti Super the all-on-GPU path overflows (~17 GB), so CPU offload is the recommended default.

For other issues, file a report via the submission form.