KiMoDo on RTX 5070: Text-to-3D-Motion Generation Guide

What You'll Build

A local text-to-3D-motion pipeline using NVIDIA's KiMoDo (Kinematic Motion Diffusion) on an RTX 5070. You'll type a prompt like "A person walks forward." and get back a 3D skeleton animation — joint rotations and root translations at 30 fps, up to 10 s long — saved as .npz (and optionally .bvh for direct import into Blender/Maya/MotionBuilder). This recipe covers both the official CLI (kimodo_gen / kimodo_demo) and the community ComfyUI plugin.

Hardware data: RTX 5070 (12 GB VRAM) · KiMoDo (282M-param diffusion model + LLM2Vec/Llama-3-8B text encoder) · See benchmark data

⚠️ Known issue (12 GB-card-specific): The default pipeline loads the Llama-3-8B text encoder on the GPU and needs ~17 GB — far more than the 5070's 12 GB. A 16 GB Blackwell-card owner hit exactly this on the canonical repo (Issue #27); a Kimodo maintainer confirmed the fix below. Set TEXT_ENCODER_DEVICE=cpu and the pipeline runs in ~3 GB — leaving the 12 GB card with room to spare.

Note: As of this writing the backend has no measured benchmarks for this pair (/check/ returns unknown). The VRAM figures below come from the official Kimodo GitHub README and a maintainer comment on Issue #27; treat them as the published baseline and report your own numbers via /contribute.

Requirements

Component	Minimum	Tested
GPU	NVIDIA, CUDA-capable, PyTorch 2.0+ — see VRAM row	RTX 5070 (12 GB) — pair not yet benchmarked, see /check/
VRAM (default, all on GPU)	~17 GB (official README) — does not fit a 12 GB 5070 out of the box	—
VRAM (with TEXT_ENCODER_DEVICE=cpu)	~3 GB (official README; maintainer-confirmed on a 16 GB Blackwell card in Issue #27); fits a 5070 with wide headroom	—
RAM	More than 16 GB (when offloaded, Llama-3-8B sits in system RAM and needs roughly 16 GB by itself; leave headroom for the OS and Python on top)	—
Storage	~16 GB (Llama-3-8B weights ~15 GB + KiMoDo checkpoint ~1.13 GB)	—
Software	Python 3.10; PyTorch 2.0+ per the installation docs, which on the RTX 5070 must be a CUDA 12.8 (cu128) or newer build	—
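Before downloading anything, it can help to confirm that the drive holding the Hugging Face cache has room for the ~16 GB in the Storage row. The following is a minimal sketch using only the Python standard library. The helper name has_free_space and the cache path ~/.cache/huggingface are assumptions for illustration; the cache lives elsewhere if you have relocated it.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 16.0  # Storage row: ~15 GB Llama-3-8B weights + ~1.13 GB KiMoDo checkpoint


def has_free_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    probe = Path(path).expanduser()
    # Walk up to the nearest existing ancestor so a not-yet-created cache dir still works.
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    cache_dir = "~/.cache/huggingface"  # assumed default cache location
    print("enough disk space:", has_free_space(cache_dir))
```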

VRAM caveat: this is the detail that decides whether a 12 GB card works at all. The KiMoDo diffusion model itself is small (282M params, HF model card; checkpoint ~1.13 GB on disk), but the default pipeline loads a Llama-3-8B-based text encoder (LLM2Vec) onto the same GPU. The README states it plainly: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." (official README). That is about 5 GB more than the 5070 has. The official fix is TEXT_ENCODER_DEVICE=cpu, which moves the text encoder to system RAM and drops GPU usage to ~3 GB: "This is slightly slower but reduces VRAM usage to <3 GB." (official README). At ~3 GB resident, the 12 GB 5070 has plenty of room even with a display attached. This is the path documented below.
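The arithmetic behind that caveat is simple enough to write down. This sketch uses only the two README figures quoted above and the card's nominal 12 GB; the dictionary keys and the function name are made up for illustration, and real usage will vary with drivers, display load, and other processes.

```python
CARD_VRAM_GB = 12.0  # RTX 5070, nominal

# Published footprints from the official README (not measured on a 5070).
FOOTPRINT_GB = {
    "all_on_gpu": 17.0,      # default pipeline, text encoder on the GPU
    "encoder_on_cpu": 3.0,   # TEXT_ENCODER_DEVICE=cpu, README says "<3 GB"
}


def headroom_gb(mode, card_vram_gb=CARD_VRAM_GB):
    """Return spare VRAM in GB for a mode; a negative value means it does not fit."""
    return card_vram_gb - FOOTPRINT_GB[mode]


for mode in FOOTPRINT_GB:
    spare = headroom_gb(mode)
    verdict = "fits" if spare >= 0 else "does not fit"
    print(f"{mode}: {verdict} ({spare:+.1f} GB)")
```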

Installation

Installation steps below come from the canonical NVIDIA sources only — the official Kimodo installation docs, the Quick Start docs, and the nv-tlabs/kimodo README. No third-party walkthrough is required because the install path is the upstream-supported one. The ComfyUI section at the end uses the community jtydhr88/ComfyUI-Kimodo plugin and is clearly labelled as such.

1. Set up a Python environment

Per the official installation docs:

conda create -n kimodo python=3.10
conda activate kimodo

2. Install PyTorch first, matched to your CUDA

Kimodo's docs say to install a compatible PyTorch manually before Kimodo, optimized for your CUDA version — "Anything over PyTorch 2.0 is sufficient." (installation docs). The RTX 5070 is a Blackwell (GB205, sm_120) card, so install a wheel that ships sm_120 kernels — that means the CUDA 12.8 (cu128) build or newer. Pick the matching index URL at pytorch.org/get-started/locally:

pip install torch --index-url https://download.pytorch.org/whl/cu128

The stable cu128 (and newer) PyTorch wheels include Blackwell sm_120 kernels. Older cu121/cu124 wheels predate sm_120 and will fall back to slow JIT or fail outright on a 5070 — do not use them on this card.
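To confirm that the wheel you installed actually ships Blackwell kernels, you can compare the GPU's compute capability with the architectures the wheel was built for. This is a hedged sketch: torch.cuda.get_device_capability and torch.cuda.get_arch_list are existing PyTorch calls, while supports_capability is a hypothetical helper written for this guide. It checks only for native kernels and ignores any PTX fallback.

```python
def supports_capability(arch_list, capability):
    """True if a wheel's compiled arch list has native kernels for `capability`.

    arch_list:  e.g. the result of torch.cuda.get_arch_list(), like ["sm_90", "sm_120"]
    capability: e.g. the result of torch.cuda.get_device_capability(), like (12, 0)
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
            print("device capability:", cap, "wheel archs:", archs)
            print("native kernels for this GPU:", supports_capability(archs, cap))
        else:
            print("CUDA is not available to PyTorch")
```

On a correctly installed cu128 build, the arch list for an RTX 5070 should include sm_120.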

3. Install Kimodo

Two options from the official installation docs:

# Minimal install (CLI generation only)
pip install git+https://github.com/nv-tlabs/kimodo.git

Or, with the interactive demo web UI:

pip install "kimodo[all] @ git+https://github.com/nv-tlabs/kimodo.git"

4. Authenticate with HuggingFace and request Llama-3 access

KiMoDo's text encoder relies on the gated meta-llama/Meta-Llama-3-8B-Instruct model — you must request access on its HF page and then authenticate locally, per the installation docs:

hf auth login
# or place a token at ~/.cache/huggingface/token

KiMoDo's own weights (the 282M-param diffusion model) live at nvidia/Kimodo-SOMA-RP-v1.1 and are not gated; they download automatically on first use.
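A quick way to see whether a token is visible before the first run is to look in the two usual places. This sketch checks the token file named above and, as an addition not mentioned in the Kimodo docs, the HF_TOKEN environment variable that huggingface_hub also reads. The function name is hypothetical, and finding a token does not prove that your Llama-3 access request has been approved.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return where a Hugging Face token was found: "env", "file", or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("HF_TOKEN"):  # assumption: token supplied via environment variable
        return "env"
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return "file"
    return None


if __name__ == "__main__":
    print("token source:", find_hf_token() or "none found, run `hf auth login`")
```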

Running

Generate a motion from a text prompt (RTX 5070 path)

This is the load-bearing command for a 12 GB card. Set TEXT_ENCODER_DEVICE=cpu so the Llama-3 text encoder runs on CPU and the diffusion model has the GPU to itself (official README):

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output output

Arguments are from the Quick Start docs and the CLI docs:

--model — checkpoint name. This recipe pins Kimodo-SOMA-RP-v1.1. Other published tiers are Kimodo-SOMA-RP-v1 and Kimodo-G1-RP-v1 (humanoid robot retarget).
--duration — motion length in seconds. The cap is 10 s / 300 frames at 30 fps (HF model card).
--output — output stem name. The motion file is written as output.npz.

Use the canonical env var name TEXT_ENCODER_DEVICE — not KIMODO_TEXT_ENCODER_DEVICE. The 16 GB-card owner in Issue #27 hit a CUDA OOM precisely because the KIMODO_-prefixed name did not control the encoder device; the maintainer's merged fix (PR #32) wires up the unprefixed TEXT_ENCODER_DEVICE. See Troubleshooting.
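If you drive generation from a script, it is easy to bake both the duration cap and the correct variable name into one helper. The sketch below only builds the command and environment; it uses the flags documented above and nothing else. The names build_kimodo_gen and frame_count are hypothetical, and the duration check mirrors the 10 s cap from the model card rather than any validation Kimodo itself performs.

```python
import os

FPS = 30               # output frame rate stated in this guide
MAX_DURATION_S = 10.0  # cap from the HF model card: 10 s / 300 frames


def frame_count(duration):
    """Number of frames a clip of `duration` seconds should contain at 30 fps."""
    return round(duration * FPS)


def build_kimodo_gen(prompt, output, duration=5.0, model="Kimodo-SOMA-RP-v1.1"):
    """Return (argv, env) for a CPU-offloaded kimodo_gen run; nothing is executed."""
    if not 0 < duration <= MAX_DURATION_S:
        raise ValueError(f"duration must be in (0, {MAX_DURATION_S}] seconds")
    argv = ["kimodo_gen", prompt, "--model", model,
            "--duration", str(duration), "--output", output]
    # Unprefixed variable name; the KIMODO_-prefixed one does not move the encoder.
    env = dict(os.environ, TEXT_ENCODER_DEVICE="cpu")
    return argv, env


if __name__ == "__main__":
    argv, env = build_kimodo_gen("A person walks forward.", "output")
    print(argv)
    print("expected frames:", frame_count(5.0))
    # To actually run it: import subprocess; subprocess.run(argv, env=env, check=True)
```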

Also export BVH for Blender/Maya/MotionBuilder

The SOMA-family checkpoints support direct BVH export via the --bvh flag. Per the CLI docs, when this flag is set Kimodo will also export BVH (SOMA models only) to the same output stem. BVH is the most useful format for downstream DCC tools:

TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward." \
    --model Kimodo-SOMA-RP-v1.1 \
    --duration 5.0 \
    --output walk_forward \
    --bvh
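After a run, a small check can tell you whether the files you expected were written. This sketch assumes the outputs are named <stem>.npz and <stem>.bvh, which follows from the "same output stem" wording but is not spelled out in the docs quoted here. Detecting a SOMA checkpoint by looking for "SOMA" in the model name is a heuristic of this guide, not a Kimodo API.

```python
from pathlib import Path


def expected_outputs(stem, model="Kimodo-SOMA-RP-v1.1", bvh=False):
    """List the files a run should leave behind, assuming <stem>.npz / <stem>.bvh naming."""
    files = [Path(f"{stem}.npz")]
    # BVH export is SOMA-only; matching on the model name is a heuristic.
    if bvh and "SOMA" in model:
        files.append(Path(f"{stem}.bvh"))
    return files


def missing_outputs(stem, **kwargs):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_outputs(stem, **kwargs) if not p.is_file()]


if __name__ == "__main__":
    for path in missing_outputs("walk_forward", bvh=True):
        print("missing:", path)
```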

Or launch the interactive web demo

If you installed with kimodo[all], the Quick Start docs document a web UI:

TEXT_ENCODER_DEVICE=cpu kimodo_demo

Then open http://localhost:7860. To visualize a motion you generated earlier, the CLI docs say to go under "Load/Save" > "Motion", type the path of the generated output .npz file, then click "Load Motion".

Optional: run inside ComfyUI

A community ComfyUI plugin, jtydhr88/ComfyUI-Kimodo (a third-party wrapper, not an NVIDIA-official one), exposes KiMoDo as nodes and adds FBX (Mixamo-rigged) export on top of the upstream NPZ/BVH:

cd ComfyUI/custom_nodes
git clone https://github.com/jtydhr88/ComfyUI-Kimodo.git
cd ComfyUI-Kimodo
pip install -r requirements.txt

The plugin's README documents NPZ, BVH (SOMA skeletons only), and FBX-via-Mixamo retarget as export options. It does not document the TEXT_ENCODER_DEVICE=cpu workaround — if you launch ComfyUI on a 12 GB 5070 without it you will hit OOM at the text-encoder load step (the same failure as Issue #27). Set the env var in the shell where you start ComfyUI, e.g. TEXT_ENCODER_DEVICE=cpu python main.py.

Results

Output (NPZ): the Kimodo NPZ contains global and local joint rotation matrices, foot contacts, root positions, and the global root heading at 30 fps, max 10 s / 300 frames (official README; HF model card).
Output (BVH, SOMA models only): add --bvh to write a standard BVH alongside the NPZ (CLI docs). This is the format you want for Blender/Maya/MotionBuilder — Kimodo does not emit MP4 video; visualization happens in the demo UI.
Output (FBX): only available via the community ComfyUI plugin's Mixamo-retarget node (ComfyUI-Kimodo README).
VRAM (default, all-on-GPU): ~17 GB (official README) — exceeds the 5070's 12 GB.
VRAM (with TEXT_ENCODER_DEVICE=cpu): ~3 GB (official README); this is what the recipe uses. A Kimodo maintainer confirmed this number on a 16 GB Blackwell card in Issue #27: "Running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM now." That 16 GB card belongs to the same Blackwell sm_120 compute generation as the 5070, so the ~3 GB footprint should carry over, although it has not yet been measured on a 5070; the card's 12 GB leaves ample headroom.
Speed: not quoted. No source reports KiMoDo generation throughput on an RTX 5070 by name, and the model card's tested-hardware list names RTX 3090 / 4090 / 5090 (and several datacenter cards) but not the 5070 (HF model card). Quoting a number without a 5070-named measurement would be a guess. Once a community benchmark lands it will appear at /check/kimodo/rtx-5070; please /contribute yours.
Model size: 282M parameters for the diffusion model itself (HF model card; ~1.13 GB checkpoint on disk); the Llama-3-8B text encoder (~15 GB on disk) dwarfs it.
License: Apache-2.0 for the codebase (repo); the checkpoint at nvidia/Kimodo-SOMA-RP-v1.1 is under the NVIDIA Open Model License and the model card states it "is ready for commercial use." — no non-commercial restriction.
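To see what a generated NPZ actually contains without assuming any array names, you can read just the headers. An .npz file is a zip archive of .npy members, and each member starts with a small header describing shape and dtype. The sketch below parses those headers with the standard library only; the function names are made up for this guide. In practice numpy.load is the usual way to get at the arrays themselves.

```python
import ast
import struct
import zipfile

NPY_MAGIC = b"\x93NUMPY"


def read_npy_header(fileobj):
    """Parse the header of one .npy stream and return (shape, dtype description)."""
    if fileobj.read(6) != NPY_MAGIC:
        raise ValueError("not a .npy stream")
    major, _minor = fileobj.read(2)
    # Format 1.x stores the header length as uint16; 2.x and 3.x use uint32.
    if major == 1:
        (header_len,) = struct.unpack("<H", fileobj.read(2))
    else:
        (header_len,) = struct.unpack("<I", fileobj.read(4))
    # The header is a Python dict literal with 'descr', 'fortran_order', 'shape'.
    header = ast.literal_eval(fileobj.read(header_len).decode("utf-8"))
    return header["shape"], header["descr"]


def summarize_npz(path):
    """Map each array name in an .npz archive to (shape, dtype) without loading data."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if member.endswith(".npy"):
                with archive.open(member) as stream:
                    summary[member[:-4]] = read_npy_header(stream)
    return summary
```

Calling summarize_npz("output.npz") and printing the result lists every array with its shape. If the per-frame arrays are stored frame-first, which is an assumption here, a 5 s clip should show a leading dimension of 150.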

For the current benchmark status of this pair, see /check/kimodo/rtx-5070.

Troubleshooting

"CUDA out of memory" even though you set CPU mode

This is the default failure on a small-VRAM card, and it was reported directly on the canonical repo by a 16 GB Blackwell-card owner in Issue #27: the Llama-3 text encoder loaded onto the GPU (~14.7 GiB) and left no room for the diffusion model. On a 12 GB 5070 the same load is even more out of reach. Two things to check:

Use the unprefixed variable. Set TEXT_ENCODER_DEVICE=cpu, not KIMODO_TEXT_ENCODER_DEVICE=cpu. The prefixed name does not control the encoder device; the maintainer's merged fix (PR #32) wires up TEXT_ENCODER_DEVICE. The maintainer (davrempe, a repo collaborator) confirmed on that issue: "Running TEXT_ENCODER_DEVICE=cpu kimodo_gen uses only ~3 GB VRAM now." (comment).
Make sure you're on a recent Kimodo. PR #32 is merged; reinstall from git+https://github.com/nv-tlabs/kimodo.git if you installed before it landed. A separate external PR (#12) proposes a quantized LLM2Vec encoder, but the maintainer notes on Issue #27 that it is untested — the CPU-offload path above is the supported fix.
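The first check is easy to automate. This sketch inspects the environment for the two variable names discussed above and reports which situation you are in; diagnose_encoder_env is a hypothetical helper, and it looks only at variable names and values, not at what Kimodo does with them.

```python
import os


def diagnose_encoder_env(env=None):
    """Return a short message about the text-encoder device variables in `env`."""
    env = os.environ if env is None else env
    value = env.get("TEXT_ENCODER_DEVICE")
    prefixed = env.get("KIMODO_TEXT_ENCODER_DEVICE")
    if value == "cpu":
        return "ok: TEXT_ENCODER_DEVICE=cpu is set"
    if value is None and prefixed is not None:
        return "problem: only KIMODO_TEXT_ENCODER_DEVICE is set; use TEXT_ENCODER_DEVICE"
    if value is None:
        return "problem: TEXT_ENCODER_DEVICE is unset, so the default GPU path is used"
    return f"check: TEXT_ENCODER_DEVICE={value!r} does not offload the encoder to the CPU"


if __name__ == "__main__":
    print(diagnose_encoder_env())
```

Run it in the same shell, and with the same exported variables, that you use to launch kimodo_gen or ComfyUI.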

"401 / gated model" or "Access to model Meta-Llama-3-8B-Instruct is restricted"

You need to (a) request access on the meta-llama/Meta-Llama-3-8B-Instruct HF page and wait for approval, then (b) run hf auth login or drop a token at ~/.cache/huggingface/token (installation docs).

--bvh flag is ignored / output has no .bvh file

BVH export is SOMA-only (CLI docs). If you ran with --model Kimodo-G1-RP-v1 (the humanoid-robot retarget), the SOMA skeleton path is skipped — pass --model Kimodo-SOMA-RP-v1.1 (or -v1) to get BVH. G1 / SMPL-X models instead support MuJoCo qpos CSV and AMASS NPZ output (CLI docs).

Slow generation on the CPU text encoder

Expected, and the documented trade-off ("This is slightly slower but reduces VRAM usage to <3 GB.", official README). The diffusion sampling stays on the GPU and is fast; only the one-time text-encoding step runs on the CPU. If that step dominates your run time, the Quick Start docs describe running kimodo_textencoder as a standalone service: on a smaller-VRAM card you can launch TEXT_ENCODER_DEVICE=cpu kimodo_textencoder in one terminal and kimodo_demo in another. A 16 GB-card owner found that this two-terminal pattern resolved the OOM in Issue #27.

Plugin: kimodo package not found inside ComfyUI

The ComfyUI plugin's README says "The kimodo package itself will be auto-installed on first launch if needed." (ComfyUI-Kimodo README). If the auto-install silently fails (common when ComfyUI uses an embedded Python without internet egress), install it explicitly into the same interpreter ComfyUI uses: <comfy-python> -m pip install git+https://github.com/nv-tlabs/kimodo.git.
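When it is unclear which interpreter ComfyUI is using, a short script run with that interpreter settles the question. This sketch relies on the standard importlib machinery and assumes the import name is kimodo, matching the package name; package_location is a hypothetical helper.

```python
import importlib.util
import sys


def package_location(name):
    """Return the import origin of top-level package `name`, or None if absent."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin or "namespace package"


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("kimodo:", package_location("kimodo") or "not installed here")
```

Save it as a file and run it with the same <comfy-python> executable; if it reports "not installed here", the explicit pip install above is targeting the right place.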

For other issues, file a report via the submission form.