KiMoDo on RTX 4070: Text-to-3D-Motion Generation Guide

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 4070. You type a prompt like "A person walks forward." and get back a 3D skeletal animation — joint rotations and root translations at 30 fps, up to 10 seconds long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 4070 (12 GB VRAM) · KiMoDo (282M-param diffusion model + LLM2Vec/Llama-3-8B text encoder) · See benchmark data

⚠️ Known issue (12 GB-card-specific): The default pipeline loads the Llama-3-8B-based text encoder on the GPU and needs ~17 GB — far more than the 4070's 12 GB. A 16 GB-card owner hit exactly this on the canonical repo (Issue #27); a Kimodo maintainer confirmed the fix below. Set TEXT_ENCODER_DEVICE=cpu and the pipeline runs in ~3 GB — leaving the 12 GB card with room to spare.

ℹ️ Not a roleplay / language model. Despite the -RP in the checkpoint name, KiMoDo is a motion-generation model: text in, a 3D skeleton animation out. The "RP" refers to the Bones Rigplay training dataset and "SOMA" to the skeleton, not to chat or roleplay. We file the model under our specialized vertical; its only language component is the frozen Llama-3-8B used as a text encoder (via LLM2Vec), which never generates text.

Note: As of this writing the backend has no measured benchmarks for this pair (/check/kimodo/rtx-4070 returns unknown). The VRAM figures below come from the canonical Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component	Minimum	Tested
GPU	NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM rows	RTX 4070 (12 GB) — pair not yet benchmarked, see /check/
VRAM (default, all on GPU)	~17 GB (canonical README) — does not fit a 12 GB 4070 out of the box	—
VRAM (with `TEXT_ENCODER_DEVICE=cpu`)	under 3 GB (canonical README; maintainer-confirmed ~3 GB in Issue #27) — comfortably fits a 4070	—
RAM	16 GB+ (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom)	—
Storage	~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint 1.13 GB)	—
Software	Python 3.10, PyTorch 2.0+ (installation docs)	—

VRAM caveat — this is the load-bearing detail for a 12 GB card. The KiMoDo diffusion model itself is small (282M params per the HF model card; the model.safetensors checkpoint is 1.13 GB on disk per the HF file tree). But the default pipeline loads a Llama-3-8B-based text encoder (LLM2Vec) onto the same GPU. The README states it plainly: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." (canonical README). That ~17 GB blows well past the 4070's 12 GB. The documented fix is TEXT_ENCODER_DEVICE=cpu, which moves the text encoder to system RAM — "This is slightly slower but reduces VRAM usage to <3 GB." (canonical README). At ~3 GB resident, the 12 GB 4070 has plenty of room even with a display attached. That is the path documented below.
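As a rough cross-check on these figures, weight memory is approximately parameter count times bytes per parameter. The sketch below assumes 4-byte (fp32) storage for the KiMoDo checkpoint and 2-byte (16-bit) storage for a roughly 8.03B-parameter Llama-3-8B. Both precisions are assumptions chosen because they land close to the sizes quoted above; they are not values taken from the Kimodo docs. The fits helper and its 1 GB reserve are hypothetical illustrations.

```python
def weight_gb(params, bytes_per_param):
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def fits(required_gb, vram_gb, reserve_gb=1.0):
    """True if a footprint fits in VRAM with a reserve kept free (assumed 1 GB)."""
    return required_gb + reserve_gb <= vram_gb


# Assumed precisions: fp32 for KiMoDo, 16-bit for the Llama-3-8B encoder.
kimodo_gb = weight_gb(282e6, 4)     # close to the 1.13 GB checkpoint on disk
encoder_gb = weight_gb(8.03e9, 2)   # the bulk of the ~17 GB default footprint

RTX_4070_VRAM_GB = 12
print(f"KiMoDo weights:  {kimodo_gb:.2f} GB")
print(f"Encoder weights: {encoder_gb:.2f} GB")
print("default (~17 GB) fits 12 GB:", fits(17, RTX_4070_VRAM_GB))
print("CPU offload (~3 GB) fits 12 GB:", fits(3, RTX_4070_VRAM_GB))
```

Under these assumptions the text encoder accounts for nearly all of the default footprint, which is why moving it to the CPU changes the picture so sharply.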

A note on the offload path and the 4070's PCIe Gen4 link. With TEXT_ENCODER_DEVICE=cpu, the text encoder runs on the CPU and its embeddings are transferred to the GPU over the PCIe bus. The RTX 4070 uses a PCIe Gen4 x16 link, roughly half the host bandwidth of the Gen5 cards (e.g. RTX 5070) in the same 12 GB tier, so the transfer itself is slower here than on a Gen5 card. Text embeddings are small, though, so the one-time cost of this stage is likely dominated by running the 8B encoder on the CPU rather than by the link. Either way, only the CPU-offloaded text-encoder stage is affected; the diffusion sampling itself runs entirely on the GPU. The VRAM saving is identical regardless of link generation.
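To put the link-generation difference in perspective, transfer time can be estimated as payload size divided by bandwidth. The sketch below is a back-of-the-envelope calculation, not a measurement: the 512-token payload is hypothetical, the 4096 hidden size is Llama-3-8B's, the exact tensor Kimodo sends to the GPU is an assumption, and the bandwidths are rounded theoretical x16 figures that real systems do not reach.

```python
def transfer_ms(num_bytes, bandwidth_gb_s):
    """Idealised transfer time in milliseconds at a given bandwidth in GB/s."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Hypothetical payload: 512 tokens x 4096 hidden dims x 2 bytes (16-bit values).
TOKENS, HIDDEN, BYTES_PER_VALUE = 512, 4096, 2
payload = TOKENS * HIDDEN * BYTES_PER_VALUE

# Rounded theoretical per-direction x16 bandwidths; real throughput is lower.
PCIE_GEN4_X16_GB_S = 32
PCIE_GEN5_X16_GB_S = 64

gen4_ms = transfer_ms(payload, PCIE_GEN4_X16_GB_S)
gen5_ms = transfer_ms(payload, PCIE_GEN5_X16_GB_S)
print(f"payload: {payload / 2**20:.0f} MiB")
print(f"Gen4 x16: {gen4_ms:.3f} ms, Gen5 x16: {gen5_ms:.3f} ms")
```

Even with a generous payload, both estimates come out well under a millisecond, which suggests the CPU encoding time rather than the bus is what a user would notice.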

Installation

Installation steps below come from the canonical NVIDIA sources only — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. No third-party walkthrough is required because the install path is the upstream-supported one. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

Per the official installation docs:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

Kimodo's docs recommend installing a compatible PyTorch manually before installing Kimodo: "Anything over PyTorch 2.0 is sufficient." (installation docs). The RTX 4070 is an Ada Lovelace (AD104, sm_89) card, which the default stable PyTorch CUDA wheels (cu124 / cu121) already support, so no special CUDA index URL is required. A plain install is enough:

pip install torch

Unlike Blackwell GPUs (RTX 50-series, sm_120), the RTX 4070 needs no special wheel selection. Do not copy a --index-url .../cu128 line from a Blackwell guide; the default CUDA wheel is correct for the 4070. One caveat: on Windows, a plain pip install torch from PyPI typically installs a CPU-only build, so pick the matching CUDA index URL at pytorch.org/get-started/locally there. The same page applies if you maintain a multi-card box and want to pin a CUDA build explicitly.
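Before installing Kimodo it can be worth confirming that the PyTorch build you ended up with actually sees the GPU. The following is a minimal sketch using standard torch.cuda calls; the describe_cuda helper and its report keys are hypothetical names introduced here for illustration.

```python
def describe_cuda():
    """Return a small report on the installed PyTorch build and the first GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        props = torch.cuda.get_device_properties(0)
        report["device"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = f"sm_{major}{minor}"
        report["total_vram_gib"] = round(props.total_memory / 2**30, 1)
    return report


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```

On an RTX 4070 you would expect cuda_available to be True and the compute capability to read sm_89. If cuda_build is None, the installed wheel is CPU-only and needs replacing before Kimodo will use the GPU.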

3. Install Kimodo

Two options from the official installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — the installation docs state your HF account must be granted access to that model page and you must provide a token at runtime. Request access on its HF page, wait for approval, then authenticate locally:

hf auth login
# on older huggingface_hub releases the equivalent command is: huggingface-cli login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated. The README notes that "models will be downloaded automatically when attempting to generate from the CLI or Interactive Demo, so there is no need to download them manually" (canonical README).
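A quick local check for a visible token can save a failed first run. The sketch below looks only at the HF_TOKEN environment variable and the default token path named above; it does not verify that the token is valid or that Llama-3 access has been granted, and it ignores a relocated cache (for example via HF_HOME). The find_hf_token helper is a hypothetical name.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return where a HuggingFace token was found, or None if none is visible."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if env.get("HF_TOKEN", "").strip():
        return "HF_TOKEN environment variable"

    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


if __name__ == "__main__":
    source = find_hf_token()
    if source:
        print("token found in:", source)
    else:
        print("no token found; run: hf auth login")
```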

Running

Generate a motion from a text prompt (RTX 4070 path)

This is the load-bearing command for a 12 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on the CPU and the diffusion model has the GPU to itself (canonical README):

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Arguments are from the Quick Start docs and the CLI docs:

--model — checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published tiers include Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (Unitree G1 humanoid-robot retarget).
--duration — motion duration in seconds (default 5.0). The maximum is 10 seconds / 300 frames at 30 fps (HF model card).
--output — output stem name; the motion file is written as output.npz.

Use the canonical env var name TEXT_ENCODER_DEVICE — not the KIMODO_-prefixed KIMODO_TEXT_ENCODER_DEVICE. The 16 GB-card owner in Issue #27 hit a CUDA OOM precisely because the KIMODO_-prefixed name they exported did not control the text-encoder device, so the Llama fallback loaded onto the GPU anyway. The maintainer's merged fix (PR #32, merged 2026-04-25) wires up the unprefixed TEXT_ENCODER_DEVICE. See Troubleshooting.
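If you drive generation from a script, one way to avoid the variable-name pitfall is to set the environment in code. The sketch below wraps the same kimodo_gen invocation with only the flags documented above. The build_kimodo_run and generate helpers are hypothetical, and the duration guard is an addition of this sketch based on the 10-second limit, not something the CLI is known to enforce.

```python
import os
import subprocess

MAX_DURATION_S = 10.0  # 300 frames at 30 fps, per the HF model card


def build_kimodo_run(prompt, output, duration=5.0,
                     model="Kimodo-SOMA-RP-v1.1", bvh=False, base_env=None):
    """Build the argv list and environment for a CPU-offloaded kimodo_gen call."""
    if not 0 < duration <= MAX_DURATION_S:
        raise ValueError(f"duration must be in (0, {MAX_DURATION_S}] seconds")

    cmd = ["kimodo_gen", prompt, "--model", model,
           "--duration", str(duration), "--output", output]
    if bvh:
        cmd.append("--bvh")

    env = dict(os.environ if base_env is None else base_env)
    env["TEXT_ENCODER_DEVICE"] = "cpu"           # the unprefixed, canonical name
    env.pop("KIMODO_TEXT_ENCODER_DEVICE", None)  # drop the ineffective prefixed name
    return cmd, env


def generate(prompt, output, **kwargs):
    """Run kimodo_gen; raises CalledProcessError if the CLI exits non-zero."""
    cmd, env = build_kimodo_run(prompt, output, **kwargs)
    subprocess.run(cmd, env=env, check=True)
```

A call such as generate("A person walks forward.", "walk_forward", bvh=True) would then mirror the shell commands in this section. Keeping command construction separate from execution makes the environment easy to inspect before anything is launched.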

Also export BVH for Blender/Maya/MotionBuilder

The CLI docs document a --bvh flag that, when set, will also export BVH (SOMA models only) using the same stem as --output — the format you want for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
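Before importing the export into a DCC tool, the BVH header can be sanity-checked against what was requested. The sketch below reads the standard Frames and Frame Time fields from the MOTION section of a BVH file. The helper names are hypothetical, and the one-frame tolerance is an assumption, since this guide does not state whether a 5-second clip is written as exactly 150 frames.

```python
def bvh_timing(bvh_text):
    """Return (frame_count, frame_time_seconds) from a BVH file's MOTION header."""
    frames = frame_time = None
    in_motion = False
    for line in bvh_text.splitlines():
        stripped = line.strip()
        if stripped == "MOTION":
            in_motion = True
        elif in_motion and stripped.startswith("Frames:"):
            frames = int(stripped.split(":", 1)[1])
        elif in_motion and stripped.startswith("Frame Time:"):
            frame_time = float(stripped.split(":", 1)[1])
            break
    if frames is None or frame_time is None:
        raise ValueError("no MOTION header with Frames / Frame Time found")
    return frames, frame_time


def matches_request(bvh_text, duration_s, fps=30, tolerance_frames=1):
    """True if the BVH frame rate and length agree with the requested clip."""
    frames, frame_time = bvh_timing(bvh_text)
    expected_frames = round(duration_s * fps)
    return (round(1.0 / frame_time) == fps
            and abs(frames - expected_frames) <= tolerance_frames)
```

Assuming the export lands at walk_forward.bvh, reading it with pathlib.Path("walk_forward.bvh").read_text() and passing the text to matches_request(text, 5.0) gives a quick pass or fail.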

Or launch the interactive web demo

If you installed with kimodo[all], the README documents a web UI launched with kimodo_demo; it runs locally at http://127.0.0.1:7860. Keep the CPU text-encoder flag set so it fits a 12 GB card:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open the local demo URL it prints. You can author motions with a timeline of text prompts and kinematic constraints, then export the result; see the Interactive Demo docs.
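If the browser cannot reach the demo, a quick TCP probe shows whether anything is listening on the port. This is a minimal sketch using only the standard library; the host and port defaults come from the address quoted above, and demo_is_listening is a hypothetical helper name. It confirms that a socket accepts connections, not that the demo itself is healthy.

```python
import socket


def demo_is_listening(host="127.0.0.1", port=7860, timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("demo reachable:", demo_is_listening())
```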

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (not an NVIDIA-official wrapper), exposes KiMoDo as nodes:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin wraps the same upstream kimodo package, so the 12 GB VRAM constraint is identical: if you launch ComfyUI without the CPU text-encoder offload you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.

Results

Output (NPZ): the default output is a custom .npz 3D skeletal animation at 30 fps, adjustable up to 10 seconds (README; HF model card). The NPZ holds global + local joint rotation matrices, foot-contact labels, and root trajectory (README output-format spec).
Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format for Blender/Maya/MotionBuilder.
Output (G1 / SMPL-X): the G1 humanoid-robot checkpoint instead supports MuJoCo qpos CSV; Kimodo-SMPLX supports AMASS NPZ output (README).
VRAM (default, all-on-GPU): ~17 GB (README) — exceeds the 4070's 12 GB.
VRAM (with TEXT_ENCODER_DEVICE=cpu): under 3 GB — what this recipe uses. The README documents this CPU-offload path for smaller cards, and a Kimodo maintainer (davrempe, a repo collaborator) confirmed on Issue #27 that running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM. That report was filed by a 16 GB-card owner, so the same default-overflow and the same fix apply to the 4070's 12 GB.
Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 4070 by name. The README states "The model has been most extensively tested on GeForce RTX 3090, GeForce RTX 4090, and NVIDIA A100 GPUs" (README), and the backend has no benchmark for this pair yet (/check/ returns unknown). With the text encoder offloaded, timing also depends on the host CPU, RAM, and the 4070's PCIe Gen4 link, so borrowing another system's number would be doubly misleading. Quoting any figure without a 4070-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-4070; please /contribute yours.
Model size: 282M parameters for the diffusion model itself (HF model card; 1.13 GB checkpoint on disk per the HF tree); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
License — commercial use is permitted. The codebase is Apache-2.0 (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License, and the HF model card states "This model is ready for commercial use." (HF model card). (Note the SMPL-X variant is the exception — it carries NVIDIA's research-only license; this recipe pins the commercially-usable SOMA checkpoint.)
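To see what a generated NPZ actually contains without guessing at array names, the archive can be inspected directly. An .npz file is a zip of .npy members, so the sketch below reads only each member's header using the standard library. The npz_summary helper is hypothetical, and no array names are assumed; whatever names the file holds are reported as found.

```python
import ast
import struct
import zipfile

NPY_MAGIC = b"\x93NUMPY"


def npz_summary(path):
    """Map each array name in an .npz archive to its (shape, dtype string)."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as fh:
                if fh.read(6) != NPY_MAGIC:
                    raise ValueError(f"{member} is not an .npy payload")
                major, _minor = fh.read(2)
                # Format 1.x stores a 2-byte header length; 2.x and 3.x use 4 bytes.
                if major == 1:
                    (header_len,) = struct.unpack("<H", fh.read(2))
                else:
                    (header_len,) = struct.unpack("<I", fh.read(4))
                # The header is a Python dict literal with descr, fortran_order, shape.
                header = ast.literal_eval(fh.read(header_len).decode("latin1"))
            summary[member[:-4]] = (header["shape"], header["descr"])
    return summary
```

Calling npz_summary("output.npz") lists every stored array with its shape and dtype. To work with the values themselves, numpy.load("output.npz") is the usual route.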

For benchmark data on this pair as it is contributed, see /check/kimodo/rtx-4070.
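To gather a VRAM figure of your own for a contribution, nvidia-smi can be queried from a script while generation runs. The sketch below uses nvidia-smi's CSV query output for memory.used and memory.total, which it reports in MiB; parse_memory and read_gpu_memory are hypothetical helper names, and the polling loop is left to the reader.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse nvidia-smi CSV rows into a list of (used_mib, total_mib) per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Query nvidia-smi once; requires the NVIDIA driver tools on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)
```

Calling read_gpu_memory() every second or so during a kimodo_gen run and keeping the largest used value gives an approximate peak. Note that it counts everything on the GPU, including the desktop, so a baseline reading taken before the run helps isolate Kimodo's share.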

Troubleshooting

"CUDA out of memory" on a 12 GB card

This is the default failure on a 12 GB card. The same root cause was reported on the canonical repo by a 16 GB-GPU owner in Issue #27: the Llama-3 text encoder loaded onto the GPU (~14.7 GiB) and left no room for the diffusion model. Three things to check:

Set TEXT_ENCODER_DEVICE=cpu. The maintainer davrempe (a repo collaborator) responded on that issue that they added a TEXT_ENCODER_DEVICE environment variable — which controls the LLM2Vec text-encoder device independently of the Kimodo model — in the merged PR #32, and that running it then uses only ~3 GB of VRAM (comment).
Use the unprefixed variable name. The reporter exported KIMODO_TEXT_ENCODER_DEVICE=cpu (with the KIMODO_ prefix) and still hit OOM — that prefixed name is not the one the fix wires up. The canonical name is TEXT_ENCODER_DEVICE, with no prefix.
Make sure you're on a recent Kimodo. PR #32 merged on 2026-04-25; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed (the README changelog dates the TEXT_ENCODER_DEVICE support to 2026-04-24).

Note: that issue was filed by an RTX 5080 (Blackwell, 16 GB) owner, but the default ~17 GB footprint exceeds a 12 GB 4070 by an even wider margin. The fix (TEXT_ENCODER_DEVICE=cpu) is the same on the 4070, and at ~3 GB resident the 4070 has ample headroom.
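The variable-name checks above are easy to automate before launching a run. The sketch below inspects the current environment for the two names discussed in this section; check_encoder_env is a hypothetical helper, and it only reports problems rather than changing anything.

```python
import os


def check_encoder_env(env=None):
    """Return a list of human-readable problems with the text-encoder env setup."""
    env = os.environ if env is None else env
    problems = []
    if "KIMODO_TEXT_ENCODER_DEVICE" in env:
        problems.append("KIMODO_TEXT_ENCODER_DEVICE is set but is not the name "
                        "the fix reads; use TEXT_ENCODER_DEVICE instead")
    value = env.get("TEXT_ENCODER_DEVICE")
    if value is None:
        problems.append("TEXT_ENCODER_DEVICE is unset; by default the encoder "
                        "loads on the GPU")
    elif value.strip().lower() != "cpu":
        problems.append(f"TEXT_ENCODER_DEVICE={value!r}; this recipe expects 'cpu'")
    return problems


if __name__ == "__main__":
    for problem in check_encoder_env():
        print("warning:", problem)
```

An empty result means the environment matches this recipe. It does not confirm that the installed Kimodo is recent enough to read the variable, so the reinstall step still applies.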

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

KiMoDo's text encoder pulls the gated meta-llama/Meta-Llama-3-8B-Instruct. You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs). KiMoDo's own checkpoint is not gated — only the Llama-3 dependency is.

`--bvh` flag is ignored / output has no `.bvh` file

The CLI docs note that --bvh exports BVH for SOMA models only. If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA BVH path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. The G1 checkpoint instead emits MuJoCo qpos CSV, and Kimodo-SMPLX emits AMASS NPZ (CLI docs).
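The per-model export rules above can be captured in a small lookup so a script can warn before a run rather than after. This is a sketch under one loud assumption: that the skeleton family is the second dash-separated token of the checkpoint name, which holds for the names used in this guide but is not a documented convention. All helper names are hypothetical.

```python
EXTRA_EXPORT_BY_FAMILY = {
    "SOMA": "BVH via --bvh",
    "G1": "MuJoCo qpos CSV",
    "SMPLX": "AMASS NPZ",
}


def model_family(model_name):
    """Guess the skeleton family from a name such as Kimodo-SOMA-RP-v1.1."""
    parts = model_name.split("-")
    if len(parts) < 2 or parts[0].lower() != "kimodo":
        raise ValueError(f"unrecognised checkpoint name: {model_name!r}")
    return parts[1].upper()


def supports_bvh(model_name):
    """True only for SOMA checkpoints, the family the --bvh flag applies to."""
    return model_family(model_name) == "SOMA"


if __name__ == "__main__":
    for name in ("Kimodo-SOMA-RP-v1.1", "Kimodo-G1-RP-v1", "Kimodo-SMPLX"):
        family = model_family(name)
        print(f"{name}: {EXTRA_EXPORT_BY_FAMILY.get(family, 'unknown')}")
```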

Slow generation with the CPU text encoder

Expected, and a documented trade-off: the README describes the CPU-offload path as slightly slower in exchange for the large VRAM saving. Only the one-time text-encoding step runs on the CPU; the diffusion sampling stays on the GPU. Most of the extra time is likely the 8B encoder running on the CPU; the embeddings then cross the 4070's PCIe Gen4 link, which is slower than Gen5 but moves only a small amount of data. The saving is what lets the model fit 12 GB at all, since leaving the text encoder on the GPU overflows the card (~17 GB).

For other issues, file a report via the submission form.