What You'll Build
A local install of Waypoint 1.5, Overworld's 1.2B-parameter real-time interactive video world model, running 720p generation driven by keyboard and mouse input on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. The controller loop (button presses, mouse deltas) is part of the inference loop, not a post-hoc edit. With 16 GB of VRAM the BF16 weights (~11.19 GB on disk) still fit unquantized, but they leave only about 5 GB for activations and the 512-frame context, so this card sits closer to the edge of the envelope than a 24 GB Radeon does. BF16 is still the right lead path here, since RDNA3 has no FP8/FP4 hardware to exploit the model's NVIDIA-oriented quant configs. On 16 GB the INT8 GemLite path also gains a second rationale that it lacks on larger cards: memory relief, not just throughput.
Hardware data: RX 7800 XT (16GB VRAM) · 720p target · BF16 on ROCm · See benchmark data
⚠️ This is a ROCm recipe, not CUDA, and it is AMD-unverified. Overworld documents Waypoint only on NVIDIA ("Desktop RTX 30 Series and later"); the model card and the world_engine README name no AMD/ROCm/Radeon support, and no community report of Waypoint running on a 7800 XT (gfx1101) or any AMD card was found at the time of writing. What makes this a plausible AMD target rather than a dead end: the inference stack is pure PyTorch + Triton, not a custom CUDA C++ extension (see "Why this is hardware-plausible on RDNA3" below). The extra wrinkle on this card vs a 24 GB Radeon is VRAM headroom, not architecture (see the Requirements and Troubleshooting sections). Treat every step here as verify-on-run. If you get it working, or hit a wall, please report it.
ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2.1 on this GPU instead.
Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for higher-performance systems (this recipe) and a 360p fork described as "designed around local, real-time generation on Nvidia laptop GPUs, and (soon) Apple Silicon." The 720p model's stated support window is NVIDIA "Desktop RTX 30 Series and later" — the 16 GB RX 7800 XT is outside that documented window (it is an AMD card). Unlike the 24 GB Radeon, the 7800 XT does not have surplus memory to spare: the BF16 weights fit, but with a tighter activation budget. The 720p 1B checkpoint is still the target for this 16 GB card; if you run out of memory on long rollouts, the INT8 lever below — or the 360p fork — is the fallback.
Why this is hardware-plausible on RDNA3
The decision to attempt this on AMD rests on what the inference engine is actually built from, not on any AMD endorsement (there is none). From the world_engine pyproject.toml, the runtime dependencies are torch==2.11.0, diffusers, transformers, accelerate, tensordict, triton==3.6.0 (Linux), and gemlite (a low-bit GEMM library). Crucially:
- No flash-attn, no xformers, no CUTLASS C++ extension, no bitsandbytes, and no compiled .cu kernels appear in the dependency list. The custom remote code shipped on the model card (transformer/model.py, modular_blocks.py, vae/ae_model.py) is pure-PyTorch modular-diffusers code, loaded via trust_remote_code=True, not a compiled CUDA op.
- The three quant configs the world_engine README documents are intw8a8 (INT8 weights + INT8 dynamic per-token activations, "NVIDIA 30xx, 40xx, Ampere+"), fp8w8a8 (FP8 via torch._scaled_mm, "Ada Lovelace / Hopper+"), and nvfp4 (NVFP4 via FlashInfer/CUTLASS, "Blackwell"). On RDNA3 the fp8w8a8 and nvfp4 paths have no hardware: WMMA on RDNA3 accepts only FP16 / BF16 / INT8 / INT4, with no native FP8 or FP4, so both are dropped here. The intw8a8 INT8 path is the format that maps to RDNA3's WMMA INT8 (IU8) units. Its INT8 GEMM comes from GemLite, which is implemented in Triton (the GemLite README notes the project began with CUDA kernels but switched to Triton for cross-platform flexibility), plus torch.compile fusion. Triton runs on ROCm/gfx1101 (it is the same Triton that backs PyTorch SDP-FlashAttention on AMD). So the INT8 path has no hard custom-CUDA blocker and is plausible, but GemLite documents no AMD testing and Triton on RDNA3 has known kernel-compile rough edges, so INT8 is the experiment, not the safe default.
The safe lead path is therefore BF16, unquantized (quant=None), via the diffusers ModularPipeline: pure PyTorch matmuls routed through ROCm's hipBLAS/rocBLAS and PyTorch SDPA attention, with the 16 GB envelope holding the ~11.19 GB weights and leaving roughly 5 GB for activations and the frame context. The INT8 GemLite path is the lever to try second — and on this 16 GB card it doubles as memory relief (INT8 roughly halves the weight footprint vs BF16) if the BF16 path runs tight on long rollouts.
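The headroom arithmetic above can be sketched in a few lines of Python. The component sizes are the on-disk figures quoted in this recipe (they sum to about 11.18 GB, in line with the ~11.19 GB quoted). Treating on-disk size as resident VRAM and INT8 as exactly half of BF16 are simplifying assumptions, and the helper name is hypothetical.

```python
# Rough VRAM envelope for Waypoint 1.5 on a 16 GB card.
# Sizes are on-disk figures quoted in this recipe (GB). Real resident memory
# depends on the runtime and is NOT measured here.
WEIGHT_FILES_GB = {
    "fused model.safetensors": 3.72,
    "modular transformer": 7.44,
    "vae": 0.02276,
}


def headroom_gb(vram_gb: float, weights_gb: float) -> float:
    """VRAM left for activations and frame context once weights are loaded."""
    return vram_gb - weights_gb


bf16_weights = sum(WEIGHT_FILES_GB.values())
int8_weights = bf16_weights / 2  # assumption: INT8 is half the BF16 footprint

for label, weights in (("BF16", bf16_weights), ("INT8 (assumed)", int8_weights)):
    for vram in (16, 24):
        print(f"{label:>15} on {vram} GB: "
              f"{weights:.2f} GB weights, {headroom_gb(vram, weights):.2f} GB left")
```

The useful number is the BF16 row for 16 GB: a little under 5 GB has to cover activations, the frame context, and any allocator overhead.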
Requirements
| Component | Minimum | Tested |
|---|---|---|
| GPU | ROCm-supported AMD card, RDNA3 (gfx1101) | RX 7800 XT (16 GB) — pair not benchmarked and AMD-unverified, see /check/ |
| VRAM | Not stated on the model card; the 1.2B BF16 weights are ~11.19 GB on disk per the HF tree API (3.72 GB fused model.safetensors + 7.44 GB modular transformer/diffusion_pytorch_model.safetensors + 22.76 MB VAE), so BF16 plus activations and the 512-frame context fits inside 16 GB — but with a tight (~5 GB) margin, not the wide headroom a 24 GB card has. | 16 GB |
| Driver | AMD ROCm 7.2.x on Linux | — |
| RAM | 16 GB system RAM | — |
| Storage | ~12 GB for BF16 safetensors + caches | — |
| Software | Python 3.10+, PyTorch built for ROCm, HuggingFace diffusers (or world_engine) | — |
The model is released under the Apache-2.0 License per the model card; the world_engine README still expects export HF_TOKEN=<token> before the first run, so export a token even though the weights are not RAIL-gated. The card publishes its performance reference points on an RTX 3090 and an RTX 5090 (both NVIDIA) — there is no AMD entry, and no card in its published reference set is a 16 GB Radeon — so the 7800 XT has no comparable measured baseline; this recipe treats it as an open question.
Installation
Two paths cover the canonical entry points the Overworld team documents. On this AMD card, Path A (diffusers, BF16) is the recommended lead — it is the most ROCm-portable. Path B (world_engine) unlocks the INT8 GemLite lever — which is more interesting on 16 GB than on a 24 GB card because it also relieves memory pressure — but adds the AMD-unverified Triton-kernel surface.
Step 1 — Install PyTorch for ROCm (shared by both paths)
The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — no HSA_OVERRIDE_GFX_VERSION masquerade is required. Per the PyTorch "Get Started" selector and the ComfyUI README "AMD GPUs (Linux)" section:
python3 -m venv .env
source .env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live PyTorch selector before running. Confirm the install is the ROCm build: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() should return True (ROCm masquerades as the cuda device namespace under HIP, so Waypoint's device="cuda" / device_map="cuda" calls work unchanged on ROCm).
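The same check can be written as a small script, which is easier to rerun after every pip install. This is a minimal sketch: the function names are hypothetical, and it relies only on torch.__version__, torch.version.hip (None on non-ROCm builds), torch.cuda.is_available(), and torch.cuda.get_device_name().

```python
import importlib.util


def looks_like_rocm_build(version: str) -> bool:
    """True if a torch version string carries a '+rocm...' local suffix."""
    _, _, local = version.partition("+")
    return local.startswith("rocm")


def describe_torch() -> str:
    """Summarise the installed torch build without assuming a GPU is present."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    lines = [
        f"torch version      : {torch.__version__}",
        f"+rocm wheel suffix : {looks_like_rocm_build(torch.__version__)}",
        f"HIP runtime        : {torch.version.hip}",  # None on CUDA/CPU builds
        f"device visible     : {torch.cuda.is_available()}",
    ]
    if torch.cuda.is_available():
        lines.append(f"device name        : {torch.cuda.get_device_name(0)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_torch())
```

If the suffix check is False or the HIP runtime prints None, a CUDA or CPU wheel has replaced the ROCm build.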
Path A — HuggingFace diffusers ModularPipeline (recommended, BF16)
Install diffusers and its companions against the ROCm torch you already have — do not let pip pull a CUDA torch wheel over it:
pip install --upgrade diffusers transformers accelerate safetensors
# Do NOT run `pip install torch` here — it would replace the ROCm build with a CUDA wheel.
export HF_TOKEN=<your_huggingface_access_token>
Path B — world_engine (INT8 GemLite lever — AMD-unverified)
world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library." Per the official README:
pip install --upgrade --ignore-installed \
"world_engine @ git+https://github.com/Overworldai/world_engine.git"
export HF_TOKEN=<your_huggingface_access_token>
This pulls triton==3.6.0 and gemlite as dependencies (per the pyproject.toml). On RDNA3 you can run world_engine with quant=None (BF16) exactly like Path A; the AMD-relevant reasons to use it over plain diffusers are (a) to try the quant="intw8a8" INT8 GemLite/Triton path for throughput, and (b) on this 16 GB card, to fall back to INT8 if the BF16 context runs out of memory. Both INT8 uses are unverified on gfx1101. Do not pass quant="fp8w8a8" or quant="nvfp4" on this card — RDNA3 has no FP8/FP4 hardware (see Troubleshooting).
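One way to keep the unsupported modes out of a script is a small guard in front of the quant argument. The sketch below is hypothetical helper code, not part of world_engine; the verdicts restate this recipe's reasoning about RDNA3 and are not a tested support matrix.

```python
# Quant modes named in this recipe, with this recipe's verdict for RDNA3.
# None means unquantized BF16.
RDNA3_QUANT_VERDICT = {
    None: "ok: BF16, the recommended start",
    "intw8a8": "experimental: INT8 via GemLite/Triton, AMD-unverified",
    "fp8w8a8": "reject: no FP8 hardware on RDNA3",
    "nvfp4": "reject: no FP4 hardware on RDNA3",
}


def check_quant_for_rdna3(quant):
    """Return quant unchanged if it is worth trying on RDNA3, else raise."""
    if quant not in RDNA3_QUANT_VERDICT:
        raise ValueError(f"unknown quant mode: {quant!r}")
    verdict = RDNA3_QUANT_VERDICT[quant]
    if verdict.startswith("reject"):
        raise ValueError(f"{quant!r} is not usable here ({verdict})")
    if verdict.startswith("experimental"):
        print(f"warning: {quant!r} is {verdict}")
    return quant
```

In a Path B script the guard would wrap the argument, for example quant=check_quant_for_rdna3(None), so a mistyped or NVIDIA-only mode fails before any weights are loaded.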
Running
Path A — diffusers ModularPipeline (BF16 — start here)
The model card ships this canonical snippet; it runs unchanged on ROCm because device_map="cuda" resolves to the HIP device and torch.bfloat16 is native on RDNA3 WMMA:
import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()

# torch.compile routes to Triton-ROCm on gfx1101. If the compile step errors or hangs
# (Triton kernel compile is the weakest link on RDNA3), comment this line out and run eager:
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")
state.values["image"] = None

frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)
button={87} is the W key (walk forward). Replace it with your input loop for real-time controllable rollouts. Output lands in waypoint-v1-5.mp4.
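A real input loop needs a translation from held keys to the integer codes the pipeline expects. The sketch below uses only the codes that appear in this recipe (W=87, A=65, S=83, D=68); the table and function names are hypothetical, and any other key would need its code confirmed against the model card.

```python
# Key codes taken from this recipe's examples; extend only with verified codes.
KEY_CODES = {"W": 87, "A": 65, "S": 83, "D": 68}


def buttons_for(held_keys):
    """Translate held key names into the set passed as the button argument."""
    codes = set()
    for key in held_keys:
        name = key.upper()
        if name not in KEY_CODES:
            raise KeyError(f"no key code known for {key!r}")
        codes.add(KEY_CODES[name])
    return codes


print(buttons_for(["w"]))       # walk forward
print(buttons_for(["w", "a"]))  # forward and left held together
```

Inside the generation loop the call would then read pipe(state, button=buttons_for(held), mouse=(dx, dy), output_type="pil"), where held, dx and dy come from whatever input library you use.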
Path B — world_engine with a scripted controller sequence
Adapted from examples/gen_sample.py in the world_engine repo — note quant=None (BF16) is the safe AMD start; switch to "intw8a8" only as an experiment (and as a memory fallback on this 16 GB card):
# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
import cv2, sys
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput

# RDNA3 (RX 7800 XT, 16 GB): start unquantized (BF16). "intw8a8" routes the INT8 GEMM
# through GemLite/Triton, which is plausible on gfx1101 WMMA-INT8 but AMD-UNVERIFIED.
# On 16 GB it also roughly halves the weight footprint if BF16 runs tight. Do NOT use
# "fp8w8a8" or "nvfp4": there is no FP8/FP4 hardware on RDNA3.
engine = WorldEngine(sys.argv[1], quant=None, device="cuda")

# One CtrlInput per gen_frame() call.
controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={32})] * 10 +  # forward in this example (Path A uses 87, the W key)
    [CtrlInput(button={65})] * 10 +  # A: left
    [CtrlInput(button={68})] * 10 +  # D: right
    [CtrlInput(button={83})] * 10    # S: back
)

# Seed the context with four copies of a 1280x720 RGB starter frame.
seed_frame = cv2.imread("starter.png")
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))

with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    engine.append_frame(seed_frame_x4)
    out.write(seed_frame_x4, fps=60, codec="libx264")
    for ctrl in controller_sequence:
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())
Note the 4-frame chunking — gen_frame() returns four frames per call. The world_engine README documents that the model generates four frames for every controller input via temporal compression, producing an output of shape [4, 720, 1280, 3].
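The chunking makes it easy to work out how long a scripted rollout will be. The sketch below is plain arithmetic under the figures stated in this recipe (four frames per call, four seed frames, 60 FPS output, 512-frame context); the function names are hypothetical.

```python
FRAMES_PER_CALL = 4    # gen_frame() output per controller input, per this recipe
CONTEXT_FRAMES = 512   # context window quoted from the model card


def rollout_frames(n_inputs: int, seed_frames: int = 4) -> int:
    """Total frames written: the seed frames plus four per controller input."""
    return seed_frames + n_inputs * FRAMES_PER_CALL


def seconds_at(frames: int, fps: float = 60.0) -> float:
    """Playback length of a clip at the given frame rate."""
    return frames / fps


def calls_to_fill_context(context_frames: int = CONTEXT_FRAMES) -> int:
    """Number of gen_frame() calls that produce one full context of frames."""
    return context_frames // FRAMES_PER_CALL


n_inputs = 5 + 4 * 10  # length of controller_sequence in the Path B script
frames = rollout_frames(n_inputs)
print(f"{n_inputs} inputs -> {frames} frames -> {seconds_at(frames):.2f} s at 60 FPS")
print(f"{calls_to_fill_context()} calls fill the {CONTEXT_FRAMES}-frame context")
```

The Path B script therefore writes a clip of only a few seconds, well short of filling the context window, so it is a smoke test and not a long-rollout memory test.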
Results
- Speed: No speed figure is quoted for this card — by design. The model card publishes two NVIDIA reference points at 720p, 4-step: 30 FPS on an RTX 3090 (w8a8 quantized) and 72 FPS on an RTX 5090 (w8a8; 56 FPS unquantized). Neither transfers to the RX 7800 XT: the RTX 3090 is Ampere (sm_86, CUDA) and the RTX 5090 is Blackwell (sm_120, CUDA), and the 7800 XT is a different vendor and architecture entirely (RDNA3 / ROCm), with different memory (16 GB GDDR6, 624 GB/s) and a different software path (BF16-on-SDPA / INT8-via-GemLite-Triton vs NVIDIA's CUDA quant kernels). Relabeling any NVIDIA FPS as a 7800 XT number would be inventing data. The actual throughput is unknown until a community submission lands at /check/waypoint-1-5/rx-7800-xt. If you run it, please submit your numbers.
- VRAM usage: The model card does not state a VRAM figure. As a derived envelope, the BF16 weights are ~11.19 GB on disk per the HF tree API (3.72 GB fused + 7.44 GB modular transformer + 22.76 MB VAE), so the 7800 XT's 16 GB holds the BF16 weights plus activations and the 512-frame context — but with a tight ~5 GB margin, not the wide headroom a 24 GB Radeon has. On this card the INT8 GemLite path is therefore both a throughput experiment and a memory-relief option (INT8 roughly halves the ~11.19 GB weight footprint), unlike on a 24 GB card where INT8 is purely about speed. Live measurements will appear at /check/waypoint-1-5/rx-7800-xt.
- Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window: about 10 seconds of rollout at 60 FPS (model card), or closer to 8.5 s by straight division (512 / 60). Whether RDNA3 reaches that target with the BF16 or INT8 path is unverified.
- Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy"; the model card warns that "Long interactive rollouts may drift, collapse, or become inconsistent" (model card). These are model-level behaviours, independent of GPU vendor.
For the full benchmark data, see /check/waypoint-1-5/rx-7800-xt.
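If you do measure throughput for a submission, a timing harness along these lines keeps the numbers comparable. It is a minimal sketch with hypothetical names: step is any zero-argument callable that produces one chunk of frames, and sync is an optional callable that waits for the GPU to finish (torch.cuda.synchronize is the usual choice, since GPU work is asynchronous and unsynchronized timings can overstate speed).

```python
import time


def measure_fps(step, n_calls, frames_per_call=4, warmup=3, sync=None):
    """Average frames per second over n_calls invocations of step()."""
    for _ in range(warmup):      # exclude compile and cache warm-up from timing
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_calls):
        step()
    if sync is not None:
        sync()                   # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return n_calls * frames_per_call / elapsed


# Stand-in workload so the sketch runs without a GPU.
fps = measure_fps(lambda: time.sleep(0.01), n_calls=5, warmup=1)
print(f"dummy workload: {fps:.0f} FPS")
```

With Path B, step would wrap a call such as engine.gen_frame(ctrl=CtrlInput()). Each call advances the engine's context, so report the number of calls alongside the FPS figure.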
Troubleshooting
Out of memory on long rollouts (16 GB-specific)
This is the headline risk on a 16 GB card that does not exist on a 24 GB Radeon. The BF16 weights are ~11.19 GB, leaving roughly 5 GB for activations and the 512-frame context — enough to load and run, but the long-rollout context growth is where a tight envelope bites. If you hit a HIP out-of-memory error during extended generation, the fallbacks, in order: (a) shorten the rollout / reduce the number of retained context frames; (b) switch Path B to quant="intw8a8", which roughly halves the weight footprint via INT8 (AMD-unverified, but it is the WMMA-mapped format on RDNA3); (c) drop to the 360p fork, which targets lighter hardware. There is no first-party 16 GB measurement yet — if you find the real BF16 ceiling, please report it.
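The fallback order above can be expressed as a small ladder that tries each configuration until one runs. This is a sketch with hypothetical names. On a PyTorch ROCm build the out-of-memory error is assumed to surface as torch.cuda.OutOfMemoryError, which you would pass as oom_types; the default of MemoryError only keeps the sketch runnable without torch.

```python
def first_that_fits(attempts, oom_types=(MemoryError,)):
    """Run (label, callable) pairs in order; return the first that succeeds.

    Only exceptions listed in oom_types trigger the next fallback. Any other
    error propagates so that real bugs are not hidden.
    """
    for label, run in attempts:
        try:
            return label, run()
        except oom_types as exc:
            print(f"{label}: out of memory ({exc}); trying the next fallback")
    raise RuntimeError("every fallback ran out of memory")


# Stand-in callables: the first simulates an OOM, the second succeeds.
def _bf16_long_rollout():
    raise MemoryError("simulated HIP out of memory")


label, result = first_that_fits([
    ("BF16, full rollout", _bf16_long_rollout),
    ("BF16, shorter rollout", lambda: "ran"),
])
print(label, "->", result)
```

In a real run each callable would build the engine with the matching settings and generate frames. Freeing the failed attempt first (dropping references and calling torch.cuda.empty_cache()) is likely needed before the next attempt can allocate.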
Do NOT use the fp8w8a8 or nvfp4 quant paths on this card
The world_engine README maps fp8w8a8 to Ada/Hopper (FP8 via torch._scaled_mm) and nvfp4 to Blackwell (FP4 via FlashInfer/CUTLASS). RDNA3's WMMA units accept only FP16, BF16, INT8 (IU8), INT4 (IU4) — no native FP8 or FP4 (AMD GPUOpen, "WMMA on RDNA3"). An FP8 path would either fail outright or silently upcast with no benefit (and no memory saving — the opposite of what you want on 16 GB), and the nvfp4 FlashInfer/CUTLASS kernels are CUDA-only. Use quant=None (BF16) as the default; try quant="intw8a8" (INT8) only as the experiment / memory-fallback below.
The intw8a8 INT8 path errors or compiles slowly
This is the expected weak point on AMD. The INT8 GEMM comes from GemLite, which is Triton-based, and Triton on RDNA3 (gfx1101) has documented kernel-compile fragility (e.g. vLLM #4514 forces VLLM_USE_TRITON_FLASH_ATTN=0 on RDNA3 for stack-frame overflow). If quant="intw8a8" raises a Triton compile error or produces garbage frames, fall back to quant=None (BF16) — though note that on this 16 GB card BF16 has less memory slack than INT8, so if you reached for INT8 to relieve OOM, the fallback is the 360p fork rather than BF16. If you do get INT8 working on a 7800 XT, that result is novel — please report it.
torch.compile / apply_inference_patches() hangs or errors
pipe.transformer.compile(...) routes through Triton-ROCm (Inductor) on gfx1101, which works for mainstream transformer blocks but can stall or error on exotic fused ops. If the compile step misbehaves, comment out the pipe.transformer.compile(...) line and run eager — correctness first, then re-introduce compilation once the BF16 eager path is confirmed. The attention path is PyTorch SDPA on this stack (do not install flash-attn or xformers — the ROCm forks are limited and the engine does not require them).
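Commenting a line in and out gets tedious while debugging, so a switch can help. The sketch below is hypothetical glue code: WAYPOINT_COMPILE is an invented environment variable, and maybe_compile only assumes the transformer exposes the compile(...) call shown in Path A. Two limits apply: a hang cannot be caught by try/except (hence the opt-out switch), and torch.compile is lazy, so some kernel errors appear on the first generation call and not at the compile(...) call itself.

```python
import os


def compile_enabled(env=os.environ) -> bool:
    """Hypothetical opt-out: set WAYPOINT_COMPILE=0 to force eager mode."""
    return env.get("WAYPOINT_COMPILE", "1") != "0"


def maybe_compile(transformer, enabled=True) -> str:
    """Try to compile; report which mode the model ended up in."""
    if not enabled:
        return "eager"
    try:
        transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)
    except Exception as exc:  # a broad catch is deliberate in this diagnostic helper
        print(f"compile failed, staying eager: {exc}")
        return "eager"
    return "compiled"
```

Usage in Path A would be mode = maybe_compile(pipe.transformer, compile_enabled()). If an error only shows up during the first frame, the simplest recovery is to rerun with WAYPOINT_COMPILE=0.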
Torch not compiled with CUDA enabled
A CUDA build of PyTorch got installed instead of the ROCm build — most often because a later pip install torch (or a CUDA-pinned dependency) overwrote it. Reinstall the ROCm wheel:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2
HF_TOKEN errors / 401 on download
Per the world_engine README, world_engine requires export HF_TOKEN=<your_huggingface_access_token> before the first run even though the weights are Apache-2.0. Export your token and re-run.
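A pre-flight check at the top of a script turns a confusing download failure into a clear message. This is a minimal sketch; the function name is hypothetical and it only inspects the HF_TOKEN environment variable named above, without validating the token against the Hub.

```python
import os


def require_hf_token(env=os.environ) -> str:
    """Return the token from HF_TOKEN, or fail early with a clear message."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before the first run"
        )
    return token
```

Calling require_hf_token() before constructing the pipeline or engine stops the script before any download is attempted.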
Confusion with the 360P variant or other "Waypoint" projects
Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop / Apple Silicon tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor. Unrelated "Waypoint" libraries (game-dev navigation, robotics path planning) are different projects — don't conflate them.
For other issues, file a report via the submission form.