What You'll Build
A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 5070. You'll type a prompt like "A person walks forward." and get back a 3D skeleton animation — joint rotations and root translations at 30 fps, up to 10 s long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.
Hardware data: RTX 5070 (12 GB VRAM) · KiMoDo (282M-param diffusion model + LLM2Vec/Llama-3-8B text encoder) · See benchmark data
⚠️ Known issue (12 GB-card-specific): The default pipeline loads the Llama-3-8B text encoder on the GPU and needs ~17 GB, far more than the 5070's 12 GB. A 16 GB Blackwell-card owner hit exactly this on the canonical repo (Issue #27); a Kimodo maintainer confirmed the fix below. Set TEXT_ENCODER_DEVICE=cpu and the pipeline runs in ~3 GB, leaving the 12 GB card with room to spare.
Note: As of this writing the backend has no measured benchmarks for this pair (/check/ returns unknown). The VRAM figures below come from the official Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM row | RTX 5070 (12 GB) — pair not yet benchmarked, see /check/ |
| VRAM (default, all on GPU) | ~17 GB (official README) — does not fit a 12 GB 5070 out of the box | — |
| VRAM (with TEXT_ENCODER_DEVICE=cpu) | ~3 GB (official README; maintainer-confirmed on a 16 GB Blackwell card in Issue #27); fits a 5070 with wide headroom | — |
| RAM | 16 GB (Llama-3-8B sits on the CPU when offloaded; budget ~16 GB for it plus headroom) | — |
| Storage | ~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint ~1.13 GB) | — |
| Software | Python 3.10, PyTorch 2.0+ (installation docs) | — |
VRAM caveat: this is the load-bearing detail for a 12 GB card. The KiMoDo diffusion model itself is small (282 M params, HF model card; checkpoint ~1.13 GB on disk), but the default pipeline loads a Llama-3-8B-based text encoder (LLM2Vec) on the same GPU. The README states it plainly: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." (official README). That ~17 GB blows well past the 5070's 12 GB. The official fix is TEXT_ENCODER_DEVICE=cpu, which moves the text encoder to system RAM and drops GPU VRAM to ~3 GB: "This is slightly slower but reduces VRAM usage to <3 GB." (official README). At ~3 GB resident, the 12 GB 5070 has plenty of room even with a display attached. This is the path documented below.
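The published figures line up with simple back-of-envelope arithmetic. The sketch below is only an estimate of weight memory: the ~8.03 billion parameter count for Llama-3-8B and the 2-bytes-per-parameter (16-bit) encoder precision are assumptions rather than numbers from the Kimodo docs, and activations and CUDA context overhead are ignored.

```python
GIB = 1024 ** 3

def weight_bytes(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory taken by model weights alone, in bytes."""
    return num_params * bytes_per_param

# ASSUMPTION: ~8.03e9 parameters held in 16-bit precision for the text encoder.
encoder_bytes = weight_bytes(8.03e9, 2)
# 282e6 parameters is from the model card; 4 bytes/param (fp32) is an
# assumption that happens to reproduce the ~1.13 GB checkpoint size.
diffusion_bytes = weight_bytes(282e6, 4)

print(f"text encoder weights: ~{encoder_bytes / GIB:.1f} GiB")
print(f"diffusion checkpoint: ~{diffusion_bytes / 1e9:.2f} GB")
print("both fit in 12 GiB of VRAM:", (encoder_bytes + diffusion_bytes) / GIB < 12)
```

Under these assumptions the encoder alone is larger than the card's VRAM, which is why moving only the encoder to the CPU is enough.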
Installation
Installation steps below come from the canonical NVIDIA sources only: the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. No third-party walkthrough is required because the install path is the upstream-supported one. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.
1. Set up a Python environment
Per the official installation docs:
conda create -n kimodo python=3.10
conda activate kimodo
2. Install PyTorch first, matched to your CUDA
Kimodo's docs say to install a compatible PyTorch manually before Kimodo, optimized for your CUDA version — "Anything over PyTorch 2.0 is sufficient." (installation docs). The RTX 5070 is a Blackwell (GB205, sm_120) card, so install a wheel that ships sm_120 kernels — that means the CUDA 12.8 (cu128) build or newer. Pick the matching index URL at pytorch.org/get-started/locally:
pip install torch --index-url https://download.pytorch.org/whl/cu128
The stable cu128 (and newer) PyTorch wheels include Blackwell sm_120 kernels. Older cu121/cu124 wheels predate sm_120 and will fall back to slow JIT or fail outright on a 5070; do not use them on this card.
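To confirm that the installed wheel actually ships kernels for the card, one option is to compare the GPU's compute capability with the architectures the wheel was compiled for. This is a minimal sketch using `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`; the helper name `supports_device` is made up for illustration, and the exact-match rule is a deliberately strict simplification (PyTorch can sometimes run on a compatible or PTX-compiled architecture as well).

```python
def supports_device(arch_list, capability):
    """True if the wheel's compiled arch list names the GPU's exact compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            print("CUDA is not available to PyTorch")
        else:
            cap = torch.cuda.get_device_capability(0)  # (12, 0) expected on a 5070
            archs = torch.cuda.get_arch_list()         # e.g. ['sm_80', ..., 'sm_120']
            print(f"device capability: {cap}")
            print(f"wheel architectures: {archs}")
            print("native kernels present:", supports_device(archs, cap))
```

If the last line reports False on a 5070, reinstalling from the cu128 index is the first thing to try.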
3. Install Kimodo
Two options from the official installation docs:
# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git
Or, with the interactive demo web UI:
pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"
4. Authenticate with HuggingFace and request Llama-3 access
KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — you must request access on its HF page and then authenticate locally, per the installation docs:
hf auth login
# or place a token at ~/.cache/huggingface/token
KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use.
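Before the first generation it can save time to check that a token is actually visible to this environment. The sketch below looks in two places: the `HF_TOKEN` environment variable and the default token file path mentioned above. It assumes the default cache location (a custom `HF_HOME` would move the file), the helper name `find_hf_token` is hypothetical, and finding a token does not prove that Llama-3 access has been approved.

```python
import os
from pathlib import Path

def find_hf_token(env=None, home=None):
    """Return a description of where a Hugging Face token was found, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    # ASSUMPTION: default cache location; HF_HOME is not consulted here.
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None

if __name__ == "__main__":
    source = find_hf_token()
    if source:
        print(f"token found in: {source}")
    else:
        print("no token found; run `hf auth login` first")
```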
Running
Generate a motion from a text prompt (RTX 5070 path)
This is the load-bearing command for a 12 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on CPU and the diffusion model has the GPU to itself (official README):
TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output output
Arguments are from the Quick Start docs and the CLI docs:
- --model: checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published tiers are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid robot retarget).
- --duration: motion length in seconds. The cap is 10 s / 300 frames at 30 fps (HF model card).
- --output: output stem name. The motion file is written as output.npz.
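The relationship between --duration and frame count is plain arithmetic: seconds times 30. A small sketch, assuming the frame count is rounded to the nearest integer (how Kimodo itself handles fractional frames is not stated in the sources, and `frames_for` is a hypothetical helper):

```python
FPS = 30          # output frame rate, per the model card
MAX_SECONDS = 10  # documented cap (300 frames)

def frames_for(duration_s: float) -> int:
    """Frame count for a duration, rejecting values beyond the documented cap."""
    if not 0 < duration_s <= MAX_SECONDS:
        raise ValueError(f"duration must be in (0, {MAX_SECONDS}] seconds, got {duration_s}")
    return round(duration_s * FPS)

print(frames_for(5.0))   # 150
print(frames_for(10.0))  # 300
```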
Use the canonical env var name TEXT_ENCODER_DEVICE, not KIMODO_TEXT_ENCODER_DEVICE. The 16 GB-card owner in Issue #27 hit a CUDA OOM precisely because the KIMODO_-prefixed name did not control the encoder device; the maintainer's merged fix (PR #32) wires up the unprefixed TEXT_ENCODER_DEVICE. See Troubleshooting.
Also export BVH for Blender/Maya/MotionBuilder
The SOMA-family checkpoints support direct BVH export via the --bvh flag. Per the CLI docs, when this flag is set Kimodo also exports BVH (SOMA models only) to the same output stem. BVH is the most useful format for downstream DCC tools:
TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
--model Kimodo-SOMA-RP-v1.1 \
--duration 5.0 \
--output walk_forward \
--bvh
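For batch jobs, the same command can be assembled from Python. This is a rough sketch, not part of Kimodo: `build_kimodo_gen` is a hypothetical helper that uses only the flags shown above, always sets TEXT_ENCODER_DEVICE=cpu, and refuses --bvh for non-SOMA models by checking the model name for "SOMA" (a naming heuristic, assumed here). It prints the command by default and runs it only when the script is given `--run`.

```python
import os
import shlex
import subprocess
import sys

def build_kimodo_gen(prompt, model="Kimodo-SOMA-RP-v1.1", duration=5.0,
                     output="output", bvh=False):
    """Build argv and environment for kimodo_gen with the text encoder on the CPU."""
    # ASSUMPTION: SOMA checkpoints are recognisable by "SOMA" in their name.
    if bvh and "SOMA" not in model:
        raise ValueError("BVH export is documented for SOMA models only")
    argv = ["kimodo_gen", prompt, "--model", model,
            "--duration", str(duration), "--output", output]
    if bvh:
        argv.append("--bvh")
    env = dict(os.environ, TEXT_ENCODER_DEVICE="cpu")
    return argv, env

if __name__ == "__main__":
    argv, env = build_kimodo_gen("A person walks forward.",
                                 output="walk_forward", bvh=True)
    print("TEXT_ENCODER_DEVICE=cpu", shlex.join(argv))
    if "--run" in sys.argv[1:]:
        subprocess.run(argv, env=env, check=True)
```

Passing the environment explicitly keeps the CPU-offload setting attached to every invocation, so it cannot be forgotten in a loop over prompts.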
Or launch the interactive web demo
If you installed with kimodo[all], the Quick Start docs document a web UI:
TEXT_ENCODER_DEVICE=cpu kimodo_demo
Then open http://localhost:7860. To visualize a motion you generated earlier, the CLI docs say to go under "Load/Save" > "Motion", type the path of the generated output .npz file, then click "Load Motion".
Optional: run inside ComfyUI
A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (a third-party wrapper, not an NVIDIA-official one), exposes KiMoDo as nodes and adds FBX (Mixamo-rigged) export on top of the upstream NPZ/BVH:
cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt
The plugin's README documents NPZ, BVH (SOMA skeletons only), and FBX-via-Mixamo retarget as export options. It does not document the TEXT_ENCODER_DEVICE=cpu workaround — if you launch ComfyUI on a 12 GB 5070 without it you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.
Results
- Output (NPZ): the Kimodo NPZ contains global and local joint rotation matrices, foot contacts, root positions, and the global root heading at 30 fps, max 10 s / 300 frames (official README; HF model card).
- Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format you want for Blender/Maya/MotionBuilder. Kimodo does not emit MP4 video; visualization happens in the demo UI.
- Output (FBX): only available via the community ComfyUI plugin's Mixamo-retarget node (ComfyUI-Kimodo README).
- VRAM (default, all-on-GPU): ~17 GB (official README) — exceeds the 5070's 12 GB.
- VRAM (with TEXT_ENCODER_DEVICE=cpu): ~3 GB (official README), which is what this recipe uses. A Kimodo maintainer confirmed this number on a 16 GB Blackwell card in Issue #27: "Running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM now." That 16 GB card is the same Blackwell sm_120 compute generation as the 5070, so the ~3 GB resident footprint should carry over; the 5070's 12 GB leaves ample headroom.
- Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 5070 by name, and the model card's tested-hardware list names RTX 3090 / 4090 / 5090 (and several datacenter cards) but not the 5070 (HF model card). Quoting a number without a 5070-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-5070; please /contribute yours.
- Model size: 282M parameters for the diffusion model itself (HF model card; ~1.13 GB checkpoint on disk); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
- License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License and the model card states it "is ready for commercial use.", so there is no non-commercial restriction.
For the full benchmark data, see /check/kimodo/rtx-5070.
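To see what a generated NPZ actually contains, it helps to list its arrays before writing any import code. An .npz file is a zip archive with one .npy member per array, so the names can be read with the standard library alone; shapes come from `numpy.load`. The sketch below assumes the file is named output.npz as in the command above and makes no assumption about the array names Kimodo uses. `npz_array_names` is a hypothetical helper.

```python
import zipfile
from pathlib import Path

def npz_array_names(path):
    """List the array names stored in an .npz file (a zip of .npy members)."""
    with zipfile.ZipFile(path) as archive:
        # Strip the 4-character ".npy" suffix from each member name.
        return [n[:-4] for n in archive.namelist() if n.endswith(".npy")]

if __name__ == "__main__":
    path = Path("output.npz")
    if not path.is_file():
        print(f"{path} not found; generate a motion first")
    else:
        import numpy as np  # third-party; needed only for shapes and dtypes
        with np.load(path) as data:
            for name in npz_array_names(path):
                try:
                    arr = data[name]
                    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
                except ValueError:
                    # Object arrays need allow_pickle=True, which is off by default.
                    print(f"{name}: not loadable without allow_pickle")
```

For a 5-second clip, arrays indexed by frame would be expected to have a leading dimension of 150 (5 s at 30 fps), which is a quick sanity check on the output.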
Troubleshooting
"CUDA out of memory" even though you set CPU mode
This is the default failure on a small-VRAM card, and it was reported directly on the canonical repo by a 16 GB Blackwell-card owner in Issue #27: the Llama-3 text encoder loaded onto the GPU (~14.7 GiB) and left no room for the diffusion model. On a 12 GB 5070 the same load is even more out of reach. Two things to check:
- Use the unprefixed variable. Set TEXT_ENCODER_DEVICE=cpu, not KIMODO_TEXT_ENCODER_DEVICE=cpu. The prefixed name does not control the encoder device; the maintainer's merged fix (PR #32) wires up TEXT_ENCODER_DEVICE. The maintainer (davrempe, a repo collaborator) confirmed on that issue: "Running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM now." (comment).
- Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed. A separate external PR (#12) proposes a quantized LLM2Vec encoder, but the maintainer notes on Issue #27 that it is untested; the CPU-offload path above is the supported fix.
"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"
You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs).
--bvh flag is ignored / output has no .bvh file
BVH export is SOMA-only (CLI docs). If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA skeleton path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. G1 / SMPL-X models instead support MuJoCo qpos CSV and AMASS NPZ output (CLI docs).
Slow generation on the CPU text encoder
Expected, and the documented trade-off ("This is slightly slower but reduces VRAM usage to <3 GB." — official README). The diffusion sampling stays on the GPU and is fast; only the one-time text-encoding step is on the CPU. If the CPU is your bottleneck, the Quick Start docs describe running kimodo_textencoder as a standalone service: on a smaller-VRAM card you can launch TEXT_ENCODER_DEVICE=cpu kimodo_textencoder in one terminal and kimodo_demo in another. A 16 GB-card owner found this two-terminal pattern resolved the OOM in Issue #27.
Plugin: kimodo package not found inside ComfyUI
The ComfyUI plugin's README says "The kimodo package itself will be auto-installed on first launch if needed." (ComfyUI-Kimodo README). If the auto-install silently fails (common when ComfyUI uses an embedded Python without internet egress), install it explicitly into the same interpreter ComfyUI uses: <comfy-python> -m pip install git+https://github.com/nv-tlabs/kimodo.git.
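A quick way to find out which interpreter needs the package is to run a short check with ComfyUI's own Python. The sketch below assumes the import name is `kimodo` (matching the package name used above); `pip_command_if_missing` is a hypothetical helper that only prints the install command for the running interpreter and never installs anything itself.

```python
import importlib.util
import sys

def pip_command_if_missing(module_name, requirement):
    """Return a pip argv for the running interpreter, or None if the module is importable."""
    if importlib.util.find_spec(module_name) is not None:
        return None
    return [sys.executable, "-m", "pip", "install", requirement]

if __name__ == "__main__":
    cmd = pip_command_if_missing(
        "kimodo", "git+https://github.com/nv-tlabs/kimodo.git")
    if cmd is None:
        print(f"kimodo is importable from {sys.executable}")
    else:
        print("kimodo is missing here; run:")
        print(" ".join(cmd))
```

Save it as a script and run it as `<comfy-python> script.py`; because the printed command uses `sys.executable`, it targets the same interpreter that ComfyUI launches.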
For other issues, file a report via the submission form.