What You'll Build
A local install of Waypoint 1.5, Overworld's 1.2B-parameter real-time interactive video world model, running 720p generation driven by keyboard and mouse input on a 24 GB Radeon RX 7900 XTX (RDNA3, Navi 31, gfx1100) through the ROCm stack. The controller loop (button presses, mouse deltas) is part of the inference loop, not a post-hoc edit. With 24 GB of VRAM the BF16 weights (~11.19 GB on disk) fit unquantized with wide headroom. That makes BF16 the right lead path on this card, since RDNA3 has no FP8/FP4 hardware to exploit the model's NVIDIA-oriented quant configs.
Hardware data: RX 7900 XTX (24GB VRAM) · 720p target · BF16 on ROCm · See benchmark data
⚠️ This is a ROCm recipe, not CUDA, and it is AMD-unverified. Overworld documents Waypoint only on NVIDIA ("Desktop RTX 30 Series and later"); the model card and the world_engine README name no AMD/ROCm/Radeon support, and no community report of Waypoint running on a 7900 XTX (gfx1100) or any AMD card was found at the time of writing. What makes this a plausible AMD target rather than a dead end: the inference stack is pure PyTorch + Triton, not a custom CUDA C++ extension (see "Why this is hardware-plausible on RDNA3" below). Treat every step here as verify-on-run. If you get it working, or hit a wall, please report it.
ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj/.glb/.ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2.1 on this GPU instead.
Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for higher-performance systems (this recipe) and a 360p fork described as "designed around local, real-time generation on Nvidia laptop GPUs, and (soon) Apple Silicon." The 720p model's stated support window is NVIDIA "Desktop RTX 30 Series and later" — the 24 GB RX 7900 XTX is well outside that documented window (it is an AMD card), but it has more VRAM than any card in the published reference set, so memory is not the constraint here; runtime portability is. The 720p 1B checkpoint is the target for this 24 GB card.
Why this is hardware-plausible on RDNA3
The decision to attempt this on AMD rests on what the inference engine is actually built from, not on any AMD endorsement (there is none). From the world_engine pyproject.toml, the runtime dependencies are torch==2.11.0, diffusers, transformers, accelerate, tensordict, triton==3.6.0 (Linux), and gemlite (a low-bit GEMM library). Crucially:
- No flash-attn, no xformers, no CUTLASS C++ extension, no bitsandbytes, no compiled .cu kernels appear in the dependency list. The custom remote code shipped on the model card (transformer/model.py, modular_blocks.py, vae/ae_model.py) is pure-PyTorch modular-diffusers code, loaded via trust_remote_code=True, not a compiled CUDA op.
- The three quant configs the world_engine README documents are intw8a8 (INT8 weights + INT8 dynamic per-token activations, "NVIDIA 30xx, 40xx, Ampere+"), fp8w8a8 (FP8 via torch._scaled_mm, "Ada Lovelace / Hopper+"), and nvfp4 (NVFP4 via FlashInfer/CUTLASS, "Blackwell"). On RDNA3 the fp8w8a8 and nvfp4 paths have no hardware to run on: WMMA on RDNA3 accepts only FP16 / BF16 / INT8 / INT4, with no native FP8 or FP4, so both are dropped here. The intw8a8 INT8 path is the format that maps to RDNA3's WMMA INT8 units, and its INT8 GEMM comes from GemLite, which is implemented in Triton ("The project started with CUDA kernels, but we have switched to Triton") plus torch.compile fusion. Triton runs on ROCm/gfx1100 (it is the same Triton that backs PyTorch SDP-FlashAttention on AMD), so the INT8 path has no hard custom-CUDA blocker and is plausible. However, GemLite documents no AMD testing and Triton on gfx1100 has known kernel-compile rough edges, so INT8 is the experiment, not the safe default.
The safe lead path is therefore BF16, unquantized (quant=None), via the diffusers ModularPipeline: pure PyTorch matmuls routed through ROCm's hipBLAS/rocBLAS and PyTorch SDPA attention, with the 24 GB envelope swallowing the ~11.19 GB weights whole. The INT8 GemLite path is a throughput lever to try second, not a fit requirement (you are not memory-bound at 24 GB).
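The quant reasoning above can be restated as a small lookup. This is a minimal sketch, not part of world_engine: the `QUANT_MODES` table and the `check_quant_for_rdna3` helper are hypothetical names, and the "rdna3" column only encodes this recipe's untested expectations.

```python
# Quant modes named in the world_engine README, plus this recipe's
# (unverified) expectation for each on RDNA3 / gfx1100.
QUANT_MODES = {
    None: {"needs": "BF16 matmul", "rdna3": "lead path"},
    "intw8a8": {"needs": "INT8 GEMM via GemLite/Triton", "rdna3": "experiment"},
    "fp8w8a8": {"needs": "native FP8 hardware", "rdna3": "unsupported"},
    "nvfp4": {"needs": "native FP4 hardware", "rdna3": "unsupported"},
}


def check_quant_for_rdna3(quant):
    """Hypothetical guard: return `quant` if it is worth trying on RDNA3, else raise."""
    if quant not in QUANT_MODES:
        raise ValueError(f"unknown quant mode: {quant!r}")
    entry = QUANT_MODES[quant]
    if entry["rdna3"] == "unsupported":
        raise ValueError(
            f"{quant!r} needs {entry['needs']}, which RDNA3 WMMA does not provide"
        )
    return quant
```

A guard like this could sit in front of the `quant=` argument in your own launcher script so that an FP8 or FP4 mode copied from an NVIDIA guide fails fast with a readable message.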
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | ROCm-supported AMD card, RDNA3 (gfx1100) | RX 7900 XTX (24 GB) — pair not benchmarked and AMD-unverified, see /check/ |
| VRAM | Not stated on the model card; the 1.2B BF16 weights are ~11.19 GB on disk per the HF tree API (3.72 GB fused model.safetensors + 7.44 GB modular transformer/diffusion_pytorch_model.safetensors + 22.76 MB VAE), so BF16 plus activations and the 512-frame context fits inside 24 GB with wide headroom. | 24 GB |
| Driver | AMD ROCm 7.2.x on Linux | — |
| RAM | 16 GB system RAM | — |
| Storage | ~12 GB for BF16 safetensors + caches | — |
| Software | Python 3.10+, PyTorch built for ROCm, HuggingFace diffusers (or world_engine) | — |
The model is released under the Apache-2.0 License per the model card; the world_engine README still expects export HF_TOKEN=<token> before the first run, so export a token even though the weights are not RAIL-gated. The card publishes its performance reference points on an RTX 3090 and an RTX 5090 (both NVIDIA) — there is no AMD entry — so the 7900 XTX has no comparable measured baseline; this recipe treats it as an open question.
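The VRAM row above is simple arithmetic, sketched here for transparency. It assumes, conservatively, that every listed file is resident at once and treats the on-disk GB figures and the card's 24 GB as the same unit; activations and the frame context are not modelled. The helper names are illustrative only.

```python
# On-disk sizes quoted in the Requirements table (GB).
WEIGHT_FILES_GB = {
    "model.safetensors (fused)": 3.72,
    "transformer/diffusion_pytorch_model.safetensors": 7.44,
    "VAE": 0.02276,
}
VRAM_GB = 24.0  # RX 7900 XTX


def weight_envelope_gb(files=WEIGHT_FILES_GB):
    """Sum of all listed weight files: a conservative upper bound on weight memory."""
    return sum(files.values())


def headroom_gb(vram_gb=VRAM_GB, files=WEIGHT_FILES_GB):
    """VRAM left for activations and context after the weight envelope."""
    return vram_gb - weight_envelope_gb(files)


print(f"weights  ~{weight_envelope_gb():.2f} GB")
print(f"headroom ~{headroom_gb():.2f} GB")
```

The per-file figures sum to slightly under the ~11.19 GB total quoted above because each is itself rounded; either way roughly half the card remains free, which is an estimate rather than a measured figure.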
Installation
Two paths cover the canonical entry points the Overworld team documents. On this AMD card, Path A (diffusers, BF16) is the recommended lead — it is the most ROCm-portable. Path B (world_engine) unlocks the INT8 GemLite lever but adds the AMD-unverified Triton-kernel surface.
Step 1 — Install PyTorch for ROCm (shared by both paths)
The RX 7900 XTX (gfx1100) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel. Per the PyTorch "Get Started" selector and the ComfyUI README "AMD GPUs (Linux)" section:
python3 -m venv .env
source .env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live PyTorch selector before running. Confirm the install is the ROCm build: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP, so Waypoint's device="cuda" / device_map="cuda" calls work unchanged on ROCm).
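The same check can be scripted. The sketch below relies on `torch.version.hip` being a version string on ROCm builds and `None` otherwise, with `torch.version.cuda` as the CUDA counterpart; the function names are hypothetical and the version strings in the usage note are placeholders.

```python
from typing import Optional


def classify_torch_build(hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from its torch.version.hip / torch.version.cuda values."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def report_installed_torch() -> str:
    """Print which build is installed and whether a GPU is visible; return the build kind."""
    import torch  # imported lazily so the classifier above works without torch installed

    kind = classify_torch_build(torch.version.hip, torch.version.cuda)
    print(f"torch {torch.__version__} -> {kind} build")
    if torch.cuda.is_available():
        # On ROCm the HIP device is exposed through the torch.cuda namespace.
        print("device 0:", torch.cuda.get_device_name(0))
    else:
        print("no GPU visible to PyTorch")
    return kind
```

Call `report_installed_torch()` inside the activated venv; anything other than `"rocm"` means the wrong wheel is installed.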
Path A — HuggingFace diffusers ModularPipeline (recommended, BF16)
Install diffusers and its companions against the ROCm torch you already have — do not let pip pull a CUDA torch wheel over it:
pip install --upgrade diffusers transformers accelerate safetensors
# Do NOT run `pip install torch` here — it would replace the ROCm build with a CUDA wheel.
export HF_TOKEN=<your_huggingface_access_token>
Path B — world_engine (INT8 GemLite lever — AMD-unverified)
world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library." Per the official README:
pip install --upgrade --ignore-installed \
"world_engine @ git+https://github.com/Overworldai/world_engine.git"
export HF_TOKEN=<your_huggingface_access_token>
This pulls triton==3.6.0 and gemlite as dependencies (per the pyproject.toml). On RDNA3 you can run world_engine with quant=None (BF16) exactly like Path A; the only AMD-relevant reason to use it over plain diffusers is to try the quant="intw8a8" INT8 GemLite/Triton path — and that path is unverified on gfx1100. Do not pass quant="fp8w8a8" or quant="nvfp4" on this card — RDNA3 has no FP8/FP4 hardware (see Troubleshooting).
Running
Path A — diffusers ModularPipeline (BF16 — start here)
The model card ships this canonical snippet; it runs unchanged on ROCm because device_map="cuda" resolves to the HIP device and torch.bfloat16 is native on RDNA3 WMMA:
import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()

# torch.compile routes to Triton-ROCm on gfx1100. If the compile step errors or hangs
# (Triton kernel compile is the weakest link on RDNA3), comment this line out and run eager:
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

# The first call seeds the rollout from the starter image with no input held.
state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")
state.values["image"] = None

frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)
button={87} is the W key (walk forward). Replace it with your input loop for real-time controllable rollouts. Output lands in waypoint-v1-5.mp4.
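One way to start on that input loop is a helper that turns held keys into the `button` set. This is a minimal sketch: `KEYCODES` and `buttons_from_keys` are hypothetical names, the codes are the W/A/S/D values that appear in this recipe, and how you poll the keyboard is left to your input library.

```python
# Keycodes used in this recipe: 87 = W, 65 = A, 83 = S, 68 = D.
KEYCODES = {"W": 87, "A": 65, "S": 83, "D": 68}


def buttons_from_keys(pressed):
    """Map currently held key names (e.g. ["w", "d"]) to a set of button codes.

    Keys without an entry in KEYCODES are ignored rather than raising.
    """
    return {KEYCODES[key.upper()] for key in pressed if key.upper() in KEYCODES}


# Inside the rollout loop this would replace the hard-coded {87}, for example:
#   state = pipe(state, button=buttons_from_keys(held_keys),
#                mouse=(dx, dy), output_type="pil")
# where held_keys, dx and dy come from your own input polling.
```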
Path B — world_engine with a scripted controller sequence
Adapted from examples/gen_sample.py in the world_engine repo — note quant=None (BF16) is the safe AMD start; switch to "intw8a8" only as an experiment:
# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
import cv2, sys
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput
# RDNA3 (RX 7900 XTX): start unquantized (BF16). "intw8a8" routes the INT8 GEMM through
# GemLite/Triton — plausible on gfx1100 WMMA-INT8 but AMD-UNVERIFIED. Do NOT use
# "fp8w8a8" or "nvfp4" — no FP8/FP4 hardware on RDNA3.
engine = WorldEngine(sys.argv[1], quant=None, device="cuda")
controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={87})] * 10 +  # W: forward (87, the same W keycode used in Path A)
    [CtrlInput(button={65})] * 10 +  # A: left
    [CtrlInput(button={68})] * 10 +  # D: right
    [CtrlInput(button={83})] * 10    # S: back
)
seed_frame = cv2.imread("starter.png")
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))
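with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    # Seed the engine with the 4-frame starter chunk and write it as the first frames.
    engine.append_frame(seed_frame_x4)
    out.write(seed_frame_x4, fps=60, codec="libx264")
    # Each controller input yields one 4-frame chunk.
    for ctrl in controller_sequence:
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())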
Note the 4-frame chunking — gen_frame() returns four frames per call. The world_engine README documents that the model "generates 4 frames for every controller input" via temporal compression, producing an output of shape [4, 720, 1280, 3].
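To make the chunking concrete, the arithmetic for the scripted sequence above is sketched below. The only inputs are numbers already stated here (4 frames per controller input, a 4-frame seed, 60 FPS output); `rollout_stats` is an illustrative helper, not a world_engine function.

```python
FRAMES_PER_INPUT = 4  # per the world_engine README: 4 frames per controller input


def rollout_stats(num_inputs, seed_frames=4, fps=60):
    """Return (total frames written, clip duration in seconds) for a scripted rollout."""
    total_frames = seed_frames + num_inputs * FRAMES_PER_INPUT
    return total_frames, total_frames / fps


# The script above queues 5 + 4 * 10 = 45 controller inputs.
frames, seconds = rollout_stats(45)
print(frames, "frames,", round(seconds, 2), "s")
```

So out.mp4 should hold 184 frames, or just over three seconds of video at 60 FPS.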
Results
- Speed: No speed figure is quoted for this card — by design. The model card publishes two NVIDIA reference points at 720p, 4-step: 30 FPS on an RTX 3090 (w8a8 quantized) and 72 FPS on an RTX 5090 (w8a8; 56 FPS unquantized). Neither transfers to the RX 7900 XTX: the RTX 3090 is Ampere (sm_86, CUDA) and the RTX 5090 is Blackwell (sm_120, CUDA), and the 7900 XTX is a different vendor and architecture entirely (RDNA3 / ROCm), with different memory (24 GB GDDR6, 960 GB/s) and a different software path (BF16-on-SDPA / INT8-via-GemLite-Triton vs NVIDIA's CUDA quant kernels). Relabeling any NVIDIA FPS as a 7900 XTX number would be inventing data. The actual throughput is unknown until a community submission lands at /check/waypoint-1-5/rx-7900-xtx. If you run it, please submit your numbers.
- VRAM usage: The model card does not state a VRAM figure. As a derived envelope, the BF16 weights are ~11.19 GB on disk per the HF tree API (3.72 GB fused + 7.44 GB modular transformer + 22.76 MB VAE), so the 7900 XTX's 24 GB holds the BF16 weights plus activations and the 512-frame context with wide headroom — there is no memory pressure on this card, and therefore no fit reason to reach for INT8 (the INT8 path is a throughput experiment, not a survival requirement). Live measurements will appear at /check/waypoint-1-5/rx-7900-xtx.
- Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window (model card). At 60 FPS that window covers roughly 8.5 seconds of rollout (512 / 60), or about 10 seconds in round terms. Whether RDNA3 reaches that target with the BF16 or INT8 path is unverified.
- Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy"; the model card warns that "long interactive rollouts may drift, collapse, or become inconsistent" (model card). These are model-level behaviours, independent of GPU vendor.
For the full benchmark data, see /check/waypoint-1-5/rx-7900-xtx.
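If you want to contribute a number, a timing harness along these lines is one way to get a comparable figure. It is a sketch under assumptions: `measure_fps` is a hypothetical helper, `step` is whatever callable produces one chunk of frames, and `sync` should be something like `torch.cuda.synchronize` so that queued GPU work is finished before the clock is read.

```python
import time


def measure_fps(step, num_calls, frames_per_call=4, warmup=3, sync=None):
    """Time `num_calls` invocations of `step` and return generated frames per second.

    warmup calls run first and are not timed (they absorb compile and cache effects).
    sync, if given, is called before starting and before stopping the timer.
    """
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(num_calls):
        step()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return num_calls * frames_per_call / elapsed


# Example wiring for Path B (names from the script above; not run here):
#   fps = measure_fps(lambda: engine.gen_frame(ctrl=CtrlInput(button={87})),
#                     num_calls=50, sync=torch.cuda.synchronize)
```

Report the resolution, quant mode, and whether torch.compile was enabled alongside the number, since each of those changes what is being measured.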
Troubleshooting
Do NOT use the fp8w8a8 or nvfp4 quant paths on this card
The world_engine README maps fp8w8a8 to Ada/Hopper (FP8 via torch._scaled_mm) and nvfp4 to Blackwell (FP4 via FlashInfer/CUTLASS). RDNA3's WMMA units accept only FP16, BF16, INT8, INT4 — no native FP8 or FP4 (AMD GPUOpen, "WMMA on RDNA3"). An FP8 path would either fail outright or silently upcast with no benefit, and the nvfp4 FlashInfer/CUTLASS kernels are CUDA-only. Use quant=None (BF16) as the default; try quant="intw8a8" (INT8) only as the experiment below.
The intw8a8 INT8 path errors or compiles slowly
This is the expected weak point on AMD. The INT8 GEMM comes from GemLite, which is Triton-based, and Triton on gfx1100 has documented kernel-compile fragility (e.g. vLLM #4514 forces VLLM_USE_TRITON_FLASH_ATTN=0 on RDNA3 for stack-frame overflow). If quant="intw8a8" raises a Triton compile error or produces garbage frames, fall back to quant=None (BF16) — you have 24 GB, so there is no memory penalty for doing so. If you do get INT8 working on a 7900 XTX, that result is novel — please report it.
torch.compile / apply_inference_patches() hangs or errors
pipe.transformer.compile(...) routes through Triton-ROCm (Inductor) on gfx1100, which works for mainstream transformer blocks but can stall or error on exotic fused ops. If the compile step misbehaves, comment out the pipe.transformer.compile(...) line and run eager — correctness first, then re-introduce compilation once the BF16 eager path is confirmed. The attention path is PyTorch SDPA on this stack (do not install flash-attn or xformers — the ROCm forks are limited and the engine does not require them).
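Rather than editing the script each time, the compile step can sit behind a switch. The sketch below uses a made-up environment variable, `WAYPOINT_COMPILE`, which neither diffusers nor world_engine recognises; it only decides whether the same `.compile(...)` call from Path A runs.

```python
import os


def compile_enabled(env=os.environ):
    """Hypothetical switch: WAYPOINT_COMPILE=0 (or false/no) skips torch.compile."""
    value = env.get("WAYPOINT_COMPILE", "1").strip().lower()
    return value not in {"0", "false", "no"}


def maybe_compile(transformer, env=os.environ):
    """Compile the transformer unless the switch is off; return True if compiled."""
    if compile_enabled(env):
        # Same arguments as the Path A snippet.
        transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)
        return True
    return False


# Usage in the Path A script, replacing the direct compile call:
#   maybe_compile(pipe.transformer)
# then run with WAYPOINT_COMPILE=0 to confirm the eager BF16 path first.
```

Note that compile failures often surface on the first forward pass rather than at the `.compile(...)` call itself, so a simple try/except around that line may not catch them; an explicit off switch is the more predictable fallback.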
Torch not compiled with CUDA enabled
A CUDA build of PyTorch got installed instead of the ROCm build — most often because a later pip install torch (or a CUDA-pinned dependency) overwrote it. Reinstall the ROCm wheel:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
HF_TOKEN errors / 401 on download
Per the world_engine README, world_engine requires export HF_TOKEN=<your_huggingface_access_token> before the first run even though the weights are Apache-2.0. Export your token and re-run.
Confusion with the 360P variant or other "Waypoint" projects
Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop / Apple Silicon tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor. Unrelated "Waypoint" libraries (game-dev navigation, robotics path planning) are different projects — don't conflate them.
For other issues, file a report via the submission form.