
Waypoint 1.5 on RX 7800 XT: Real-Time Interactive World Model on ROCm (BF16)

3d · advanced · 12GB+ VRAM · Jun 19, 2026

This advanced recipe sets up Waypoint 1.5 on the RX 7800 XT, needing about 12 GB of VRAM.

prerequisites
  • AMD Radeon RX 7800 XT (16 GB VRAM, RDNA3 / Navi 32 / gfx1101) or equivalent ROCm-supported card — the 1.2B BF16 weights are ~11.19 GB on disk, so the 16 GB envelope holds them unquantized, but the activation + 512-frame-context headroom is tight (~5 GB), not generous
  • Linux (Ubuntu 24.04 / 22.04 or RHEL) with the AMD ROCm stack installed and PyTorch built for ROCm
  • Python 3.10+ and a PyTorch build with BF16 support (RDNA3 WMMA accepts BF16 natively)
  • A HuggingFace access token (`HF_TOKEN`) — the model is Apache-2.0 but `world_engine` expects the token exported before first run
  • Keyboard / mouse input loop (or a scripted `CtrlInput` sequence) — Waypoint is interactive, not text-to-video

What You'll Build

A local install of Waypoint 1.5 (Overworld's 1.2B-parameter real-time interactive video world model) running 720p generation driven by keyboard and mouse input on a 16 GB Radeon RX 7800 XT (RDNA3, Navi 32, gfx1101) through the ROCm stack. "Interactive" here means the controller loop (button presses, mouse deltas) is part of the inference loop, not a post-hoc edit. With 16 GB of VRAM the BF16 weights (~11.19 GB on disk) still fit unquantized, but the margin for activations and the 512-frame context is tight, so this card sits closer to the edge of the envelope than a 24 GB Radeon does. BF16 is still the right lead path here, since RDNA3 has no FP8/FP4 hardware to exploit the model's NVIDIA-oriented quant configs; on 16 GB the INT8 GemLite path picks up a second rationale (memory relief, not just throughput) that it doesn't have on larger cards.

Hardware data: RX 7800 XT (16GB VRAM) · 720p target · BF16 on ROCm · See benchmark data

⚠️ This is a ROCm recipe, not CUDA — and it is AMD-unverified. Overworld documents Waypoint only on NVIDIA ("Desktop RTX 30 Series and later"); the model card and the world_engine README name no AMD/ROCm/Radeon support, and no community report of Waypoint running on a 7800 XT (gfx1101) or any AMD card was found at the time of writing. What makes this a plausible AMD target rather than a dead end: the inference stack is pure PyTorch + Triton, not a custom CUDA C++ extension (see "Why this is hardware-plausible on RDNA3" below). The extra wrinkle on this card vs a 24 GB Radeon is VRAM headroom, not architecture — see the Requirements and Troubleshooting sections. Treat every step here as verify-on-run. If you get it working — or hit a wall — please report it.

ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2.1 on this GPU instead.

Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for higher-performance systems (this recipe) and a 360p fork described as "designed around local, real-time generation on Nvidia laptop GPUs, and (soon) Apple Silicon." The 720p model's stated support window is NVIDIA "Desktop RTX 30 Series and later" — the 16 GB RX 7800 XT is outside that documented window (it is an AMD card). Unlike the 24 GB Radeon, the 7800 XT does not have surplus memory to spare: the BF16 weights fit, but with a tighter activation budget. The 720p 1B checkpoint is still the target for this 16 GB card; if you run out of memory on long rollouts, the INT8 lever below — or the 360p fork — is the fallback.

Why this is hardware-plausible on RDNA3

The decision to attempt this on AMD rests on what the inference engine is actually built from, not on any AMD endorsement (there is none). From the world_engine pyproject.toml, the runtime dependencies are torch==2.11.0, diffusers, transformers, accelerate, tensordict, triton==3.6.0 (Linux), and gemlite (a low-bit GEMM library). Crucially:

  • No flash-attn, no xformers, no CUTLASS C++ extension, no bitsandbytes, no compiled .cu kernels appear in the dependency list. The custom remote code shipped on the model card (transformer/model.py, modular_blocks.py, vae/ae_model.py) is pure-PyTorch modular-diffusers code, loaded via trust_remote_code=True — not a compiled CUDA op.
  • The three quant configs the world_engine README documents are intw8a8 (INT8 weights + INT8 dynamic per-token activations, "NVIDIA 30xx, 40xx, Ampere+"), fp8w8a8 (FP8 via torch._scaled_mm, "Ada Lovelace / Hopper+"), and nvfp4 (NVFP4 via FlashInfer/CUTLASS, "Blackwell"). On RDNA3 the fp8w8a8 and nvfp4 paths have no hardware — WMMA on RDNA3 accepts only FP16 / BF16 / INT8 / INT4, no native FP8 or FP4 — so both are dropped here. The intw8a8 INT8 path is the format that maps to RDNA3's WMMA INT8 (IU8) units, and its INT8 GEMM comes from GemLite, which is implemented in Triton — the GemLite README notes the project began with CUDA kernels but switched to Triton for cross-platform flexibility — plus torch.compile fusion. Triton runs on ROCm/gfx1101 (it is the same Triton that backs PyTorch SDP-FlashAttention on AMD). So the INT8 path has no hard custom-CUDA blocker; it is plausible — but GemLite documents no AMD testing and Triton on RDNA3 has known kernel-compile rough edges, so INT8 is the experiment, not the safe default.

The safe lead path is therefore BF16, unquantized (quant=None), via the diffusers ModularPipeline: pure PyTorch matmuls routed through ROCm's hipBLAS/rocBLAS and PyTorch SDPA attention, with the 16 GB envelope holding the ~11.19 GB weights and leaving roughly 5 GB for activations and the frame context. The INT8 GemLite path is the lever to try second — and on this 16 GB card it doubles as memory relief (INT8 roughly halves the weight footprint vs BF16) if the BF16 path runs tight on long rollouts.
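The envelope arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the on-disk sizes quoted in this recipe. Its assumptions are simplifications rather than measurements: every weight file is resident at once, GB means 10^9 bytes, and INT8 halves the weight bytes and nothing else. The rounded component sizes sum to about 11.18 GB, within rounding of the ~11.19 GB total.

```python
# Back-of-envelope VRAM budget for Waypoint 1.5 on a 16 GB card.
# Sizes are the on-disk figures quoted in this recipe, not measured VRAM usage.
WEIGHT_FILES_GB = {
    "model.safetensors (fused)": 3.72,
    "transformer/diffusion_pytorch_model.safetensors": 7.44,
    "VAE": 22.76 / 1000,  # 22.76 MB expressed in GB
}
VRAM_GB = 16.0


def weight_total_gb(files=WEIGHT_FILES_GB):
    """Sum the on-disk weight sizes (assumes every file is resident at once)."""
    return sum(files.values())


def headroom_gb(vram_gb, weights_gb):
    """VRAM left for activations and the frame context after loading weights."""
    return vram_gb - weights_gb


bf16 = weight_total_gb()
int8 = bf16 / 2  # assumption: INT8 stores one byte per weight instead of two

print(f"BF16 weights: {bf16:.2f} GB, headroom: {headroom_gb(VRAM_GB, bf16):.2f} GB")
print(f"INT8 weights: {int8:.2f} GB, headroom: {headroom_gb(VRAM_GB, int8):.2f} GB")
```

Under these assumptions the BF16 headroom comes out near 4.8 GB and the INT8 headroom near 10.4 GB. Real usage will differ once activations, the frame context, and allocator overhead are counted.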

Requirements

Component | Minimum | Tested
GPU | ROCm-supported AMD card, RDNA3 (gfx1101) | RX 7800 XT (16 GB); pair not benchmarked and AMD-unverified, see /check/
VRAM | Not stated on the model card; the 1.2B BF16 weights are ~11.19 GB on disk per the HF tree API (3.72 GB fused model.safetensors + 7.44 GB modular transformer/diffusion_pytorch_model.safetensors + 22.76 MB VAE), so BF16 plus activations and the 512-frame context fits inside 16 GB, but with a tight (~5 GB) margin, not the wide headroom a 24 GB card has. | 16 GB
Driver | AMD ROCm 7.2.x on Linux |
RAM | 16 GB system RAM |
Storage | ~12 GB for BF16 safetensors + caches |
Software | Python 3.10+, PyTorch built for ROCm, HuggingFace diffusers (or world_engine) |

The model is released under the Apache-2.0 License per the model card; the world_engine README still expects export HF_TOKEN=<token> before the first run, so export a token even though the weights are not RAIL-gated. The card publishes its performance reference points on an RTX 3090 and an RTX 5090 (both NVIDIA) — there is no AMD entry, and no card in its published reference set is a 16 GB Radeon — so the 7800 XT has no comparable measured baseline; this recipe treats it as an open question.

Installation

Two paths cover the canonical entry points the Overworld team documents. On this AMD card, Path A (diffusers, BF16) is the recommended lead — it is the most ROCm-portable. Path B (world_engine) unlocks the INT8 GemLite lever — which is more interesting on 16 GB than on a 24 GB card because it also relieves memory pressure — but adds the AMD-unverified Triton-kernel surface.

Step 1 — Install PyTorch for ROCm (shared by both paths)

The RX 7800 XT (gfx1101) is an officially ROCm-supported GPU, so it uses the stable ROCm PyTorch wheel — no HSA_OVERRIDE_GFX_VERSION masquerade is required. Per the PyTorch "Get Started" selector and the ComfyUI README "AMD GPUs (Linux)" section:

python3 -m venv .env
source .env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

ℹ️ Verify the ROCm tag before you copy it. The rocmX.Y wheel tag moves over time (6.3 → 6.4 → 7.x). Read the current line in the live PyTorch selector before running. Confirm the install is the ROCm build: python -c "import torch; print(torch.__version__)" should print a +rocm-style suffix, and torch.cuda.is_available() returns True (ROCm masquerades as the cuda device namespace under HIP — Waypoint's device="cuda" / device_map="cuda" calls work unchanged on ROCm).
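The same check can be scripted so it is easy to repeat after every later pip install. This is a minimal sketch: classify_build and describe_install are hypothetical helper names, while torch.version.hip and torch.version.cuda are the real fields that hold a version string on ROCm and CUDA builds respectively, and None otherwise.

```python
# Classify which PyTorch build is installed.
# torch.version.hip is a version string on ROCm builds and None otherwise;
# torch.version.cuda is a version string on CUDA builds and None otherwise.
def classify_build(hip_version, cuda_version):
    """Return 'rocm', 'cuda', or 'cpu' from the two torch.version fields."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def describe_install():
    """Inspect the local torch install and return a one-line summary string."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    kind = classify_build(torch.version.hip, torch.version.cuda)
    usable = torch.cuda.is_available()  # True on a working ROCm build as well
    return f"torch {torch.__version__}: {kind} build, device available: {usable}"


print(describe_install())
```

Anything other than a rocm build with an available device means the later steps will not reach the GPU.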

Path A — HuggingFace diffusers ModularPipeline (recommended, BF16)

Install diffusers and its companions against the ROCm torch you already have — do not let pip pull a CUDA torch wheel over it:

pip install --upgrade diffusers transformers accelerate safetensors
# Do NOT run `pip install torch` here — it would replace the ROCm build with a CUDA wheel.
export HF_TOKEN=<your_huggingface_access_token>

Path B — world_engine (INT8 GemLite lever — AMD-unverified)

world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library." Per the official README:

pip install --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"
export HF_TOKEN=<your_huggingface_access_token>

This pulls triton==3.6.0 and gemlite as dependencies (per the pyproject.toml). Because the same pyproject.toml pins torch==2.11.0 and the command uses --ignore-installed, pip may replace the ROCm torch from Step 1 with a CUDA wheel; re-run the version check from Step 1 afterwards and reinstall the ROCm wheel if the +rocm suffix is gone. On RDNA3 you can run world_engine with quant=None (BF16) exactly like Path A; the AMD-relevant reasons to use it over plain diffusers are (a) to try the quant="intw8a8" INT8 GemLite/Triton path for throughput, and (b) on this 16 GB card, to fall back to INT8 if the BF16 context runs out of memory. Both INT8 uses are unverified on gfx1101. Do not pass quant="fp8w8a8" or quant="nvfp4" on this card: RDNA3 has no FP8/FP4 hardware (see Troubleshooting).

Running

Path A — diffusers ModularPipeline (BF16 — start here)

The model card ships this canonical snippet. It should run unchanged on ROCm, because device_map="cuda" resolves to the HIP device and torch.bfloat16 is native on RDNA3 WMMA, but like every step here it is unverified on gfx1101:

import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()
# torch.compile routes to Triton-ROCm on gfx1101. If the compile step errors or hangs
# (Triton kernel compile is the weakest link on RDNA3), comment this line out and run eager:
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")

state.values["image"] = None
frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)

button={87} is the W key (walk forward). Replace it with your input loop for real-time controllable rollouts. Output lands in waypoint-v1-5.mp4.
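One way to feed an input loop into the button argument is a small key-to-id mapping. This is a minimal sketch: buttons_from_keys is a hypothetical helper, and it assumes button ids equal uppercase ASCII codes, which matches the W=87, A=65, S=83, D=68 values used in this recipe but is not confirmed for any other key.

```python
# Map held keys to the integer button ids passed as `button=...`.
# Assumption: ids equal uppercase ASCII codes, consistent with W=87, A=65,
# S=83, D=68 in this recipe. Ids for other keys need checking upstream.
MOVE_KEYS = {"W": "forward", "A": "left", "S": "back", "D": "right"}


def buttons_from_keys(held):
    """Convert an iterable of key characters into a set of button ids."""
    ids = set()
    for key in held:
        key = key.upper()
        if key not in MOVE_KEYS:
            raise ValueError(f"unmapped key: {key!r}")
        ids.add(ord(key))
    return ids


print(buttons_from_keys("w"))
print(sorted(buttons_from_keys(["w", "d"])))
```

Inside the rollout loop the result would be passed as button=buttons_from_keys(held), with an empty set when no key is held.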

Path B — world_engine with a scripted controller sequence

Adapted from examples/gen_sample.py in the world_engine repo — note quant=None (BF16) is the safe AMD start; switch to "intw8a8" only as an experiment (and as a memory fallback on this 16 GB card):

# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
import cv2, sys
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput

# RDNA3 (RX 7800 XT, 16 GB): start unquantized (BF16). "intw8a8" routes the INT8 GEMM
# through GemLite/Triton — plausible on gfx1101 WMMA-INT8 but AMD-UNVERIFIED — and on
# 16 GB it also roughly halves the weight footprint if BF16 runs tight. Do NOT use
# "fp8w8a8" or "nvfp4" — no FP8/FP4 hardware on RDNA3.
engine = WorldEngine(sys.argv[1], quant=None, device="cuda")

controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={87})] * 10 +  # W: forward (keycode 87, as in Path A)
    [CtrlInput(button={65})] * 10 +  # A — left
    [CtrlInput(button={68})] * 10 +  # D — right
    [CtrlInput(button={83})] * 10    # S — back
)

seed_frame = cv2.imread("starter.png")
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))

with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    engine.append_frame(seed_frame_x4)
    out.write(seed_frame_x4, fps=60, codec="libx264")
    for ctrl in controller_sequence:
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())

Note the 4-frame chunking: gen_frame() returns four frames per call. The world_engine README documents that the model generates four frames for every controller input via temporal compression, producing an output of shape [4, 720, 1280, 3].
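Because each call yields four frames, frame counts and clip lengths follow from simple arithmetic. The sketch below uses hypothetical helper names and assumes 60 FPS output, as in the scripts above.

```python
import math

FRAMES_PER_CALL = 4  # per the world_engine README: four frames per controller input


def frames_generated(num_inputs, frames_per_call=FRAMES_PER_CALL):
    """Frames produced by `num_inputs` gen_frame() calls."""
    return num_inputs * frames_per_call


def calls_needed(seconds, fps=60, frames_per_call=FRAMES_PER_CALL):
    """gen_frame() calls needed to cover `seconds` of video at `fps`, rounded up."""
    total_frames = math.ceil(round(seconds * fps, 6))  # round() guards float noise
    return math.ceil(total_frames / frames_per_call)


# The scripted sequence above has 5 + 4 * 10 = 45 controller inputs.
generated = frames_generated(45)
total = generated + 4  # plus the four repeated seed frames written first
print(f"{generated} generated frames, {total} frames written, {total / 60:.2f} s at 60 FPS")
print(f"calls for a 10 s clip: {calls_needed(10)}")
```

The same arithmetic puts the 512-frame context at 128 gen_frame() calls.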

Results

  • Speed: No speed figure is quoted for this card — by design. The model card publishes two NVIDIA reference points at 720p, 4-step: 30 FPS on an RTX 3090 (w8a8 quantized) and 72 FPS on an RTX 5090 (w8a8; 56 FPS unquantized). Neither transfers to the RX 7800 XT: the RTX 3090 is Ampere (sm_86, CUDA) and the RTX 5090 is Blackwell (sm_120, CUDA), and the 7800 XT is a different vendor and architecture entirely (RDNA3 / ROCm), with different memory (16 GB GDDR6, 624 GB/s) and a different software path (BF16-on-SDPA / INT8-via-GemLite-Triton vs NVIDIA's CUDA quant kernels). Relabeling any NVIDIA FPS as a 7800 XT number would be inventing data. The actual throughput is unknown until a community submission lands at /check/waypoint-1-5/rx-7800-xt. If you run it, please submit your numbers.
  • VRAM usage: The model card does not state a VRAM figure. As a derived envelope, the BF16 weights are ~11.19 GB on disk per the HF tree API (3.72 GB fused + 7.44 GB modular transformer + 22.76 MB VAE), so the 7800 XT's 16 GB holds the BF16 weights plus activations and the 512-frame context — but with a tight ~5 GB margin, not the wide headroom a 24 GB Radeon has. On this card the INT8 GemLite path is therefore both a throughput experiment and a memory-relief option (INT8 roughly halves the ~11.19 GB weight footprint), unlike on a 24 GB card where INT8 is purely about speed. Live measurements will appear at /check/waypoint-1-5/rx-7800-xt.
  • Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window (model card). At 60 FPS, 512 frames is about 8.5 seconds of rollout (512 / 60 ≈ 8.53). Whether RDNA3 reaches that target with the BF16 or INT8 path is unverified.
  • Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy"; the model card warns that "Long interactive rollouts may drift, collapse, or become inconsistent" (model card). These are model-level behaviours, independent of GPU vendor.

For the full benchmark data, see /check/waypoint-1-5/rx-7800-xt.
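If you do measure throughput on this card, a small timing harness keeps the numbers comparable. This is a minimal sketch, not an official benchmark method: measure_fps is a hypothetical helper, the stand-in step only sleeps, and on a GPU you would pass torch.cuda.synchronize as sync so queued kernels are included in the timing.

```python
import time


def measure_fps(step, n_calls, frames_per_call=4, warmup=2, sync=None):
    """Time `n_calls` invocations of `step()` and return frames per second.

    `sync` is an optional callable run before each clock read; on a GPU pass
    torch.cuda.synchronize so queued kernels are included in the timing.
    """
    for _ in range(warmup):  # warm-up calls absorb compile and cache effects
        step()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(n_calls):
        step()
    if sync:
        sync()
    elapsed = time.perf_counter() - start
    return (n_calls * frames_per_call) / elapsed


# Stand-in for engine.gen_frame(ctrl=...): sleeps 10 ms per call.
fps = measure_fps(lambda: time.sleep(0.01), n_calls=20)
print(f"{fps:.1f} frames/s")
```

When reporting, state the path (BF16 or INT8), whether torch.compile was enabled, and the number of timed calls alongside the FPS figure.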

Troubleshooting

Out of memory on long rollouts (16 GB-specific)

This is the headline risk on a 16 GB card that does not exist on a 24 GB Radeon. The BF16 weights are ~11.19 GB, leaving roughly 5 GB for activations and the 512-frame context — enough to load and run, but the long-rollout context growth is where a tight envelope bites. If you hit a HIP out-of-memory error during extended generation, the fallbacks, in order: (a) shorten the rollout / reduce the number of retained context frames; (b) switch Path B to quant="intw8a8", which roughly halves the weight footprint via INT8 (AMD-unverified, but it is the WMMA-mapped format on RDNA3); (c) drop to the 360p fork, which targets lighter hardware. There is no first-party 16 GB measurement yet — if you find the real BF16 ceiling, please report it.
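To find where the BF16 ceiling actually sits, it helps to log free VRAM as the rollout grows. This is a minimal sketch: headroom_line and current_headroom are hypothetical helpers, the 1 GB warning threshold is arbitrary, and torch.cuda.mem_get_info() is the real PyTorch call that returns free and total bytes for the active device.

```python
GB = 1e9  # decimal gigabytes, matching the sizes quoted in this recipe


def headroom_line(free_bytes, total_bytes, warn_below_gb=1.0):
    """Format a one-line free-VRAM report and flag when it drops below a threshold."""
    free_gb = free_bytes / GB
    used_gb = (total_bytes - free_bytes) / GB
    flag = " LOW" if free_gb < warn_below_gb else ""
    return f"used {used_gb:.2f} GB, free {free_gb:.2f} GB{flag}"


def current_headroom():
    """Query the active device if a GPU-enabled torch is present, else return None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    return headroom_line(free_bytes, total_bytes)


print(headroom_line(4_800_000_000, 16_000_000_000))
print(current_headroom())
```

Calling current_headroom() every few dozen frames shows whether memory keeps climbing with rollout length or plateaus once the context is full.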

Do NOT use the fp8w8a8 or nvfp4 quant paths on this card

The world_engine README maps fp8w8a8 to Ada/Hopper (FP8 via torch._scaled_mm) and nvfp4 to Blackwell (FP4 via FlashInfer/CUTLASS). RDNA3's WMMA units accept only FP16, BF16, INT8 (IU8), INT4 (IU4) — no native FP8 or FP4 (AMD GPUOpen, "WMMA on RDNA3"). An FP8 path would either fail outright or silently upcast with no benefit (and no memory saving — the opposite of what you want on 16 GB), and the nvfp4 FlashInfer/CUTLASS kernels are CUDA-only. Use quant=None (BF16) as the default; try quant="intw8a8" (INT8) only as the experiment / memory-fallback below.
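A guard in front of the engine constructor keeps an unsupported setting from being passed by accident. This is a minimal sketch: check_quant is a hypothetical helper and not part of world_engine, and the accepted values are only the quant names this recipe cites from the README.

```python
# Which WorldEngine `quant` values make sense on RDNA3, per this recipe.
RDNA3_QUANT = {
    None: "BF16, unquantized: the default here",
    "intw8a8": "INT8 via GemLite/Triton: experimental, AMD-unverified",
}
UNSUPPORTED_ON_RDNA3 = {
    "fp8w8a8": "needs FP8 hardware (Ada Lovelace / Hopper+)",
    "nvfp4": "needs NVFP4 hardware and CUDA-only kernels (Blackwell)",
}


def check_quant(quant):
    """Return a description for a usable setting, or raise for one RDNA3 cannot run."""
    if quant in UNSUPPORTED_ON_RDNA3:
        reason = UNSUPPORTED_ON_RDNA3[quant]
        raise ValueError(f"quant={quant!r} is not usable on RDNA3: {reason}")
    if quant not in RDNA3_QUANT:
        raise ValueError(f"unknown quant setting: {quant!r}")
    return RDNA3_QUANT[quant]


print(check_quant(None))
print(check_quant("intw8a8"))
```

Calling check_quant(quant) before constructing the engine turns a silent misconfiguration into an immediate error.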

The intw8a8 INT8 path errors or compiles slowly

This is the expected weak point on AMD. The INT8 GEMM comes from GemLite, which is Triton-based, and Triton on RDNA3 (gfx1101) has documented kernel-compile fragility (e.g. vLLM #4514 forces VLLM_USE_TRITON_FLASH_ATTN=0 on RDNA3 for stack-frame overflow). If quant="intw8a8" raises a Triton compile error or produces garbage frames, fall back to quant=None (BF16) — though note that on this 16 GB card BF16 has less memory slack than INT8, so if you reached for INT8 to relieve OOM, the fallback is the 360p fork rather than BF16. If you do get INT8 working on a 7800 XT, that result is novel — please report it.

torch.compile / apply_inference_patches() hangs or errors

pipe.transformer.compile(...) routes through Triton-ROCm (Inductor) on gfx1101, which works for mainstream transformer blocks but can stall or error on exotic fused ops. If the compile step misbehaves, comment out the pipe.transformer.compile(...) line and run eager — correctness first, then re-introduce compilation once the BF16 eager path is confirmed. The attention path is PyTorch SDPA on this stack (do not install flash-attn or xformers — the ROCm forks are limited and the engine does not require them).

Torch not compiled with CUDA enabled

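A non-ROCm build of PyTorch got installed instead of the ROCm build, most often because a later pip install torch (or a dependency that pins torch) overwrote it. A CPU-only wheel raises exactly this message; a CUDA wheel on an AMD-only machine fails similarly, reporting that no NVIDIA driver was found. Reinstall the ROCm wheel: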

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2

HF_TOKEN errors / 401 on download

Per the world_engine README, world_engine requires export HF_TOKEN=<your_huggingface_access_token> before the first run even though the weights are Apache-2.0. Export your token and re-run.
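A preflight check catches a missing token, or a placeholder pasted verbatim, before the download starts. This is a minimal sketch: hf_token_status is a hypothetical helper, and the "hf_" prefix test assumes the current format of Hugging Face user access tokens.

```python
import os


def hf_token_status(env=os.environ):
    """Return 'missing', 'suspicious', or 'ok' for the HF_TOKEN variable."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "missing"
    # Assumption: current Hugging Face user access tokens start with "hf_".
    if not token.startswith("hf_"):
        return "suspicious"
    return "ok"


print(f"HF_TOKEN: {hf_token_status()}")
```

A "suspicious" result usually means the placeholder text from the export line was copied without substituting a real token.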

Confusion with the 360P variant or other "Waypoint" projects

Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop / Apple Silicon tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor. Unrelated "Waypoint" libraries (game-dev navigation, robotics path planning) are different projects — don't conflate them.

For other issues, file a report via the submission form.

common questions
How much VRAM does Waypoint 1.5 need?

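About 12 GB is the minimum this recipe targets. The model card states no VRAM figure; the estimate derives from the ~11.19 GB of BF16 weights on disk, before activations and the frame context.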

Which GPUs is Waypoint 1.5 tested on?

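Overworld documents Waypoint 1.5 only on NVIDIA, with published reference points on an RTX 3090 and an RTX 5090. This recipe targets the RX 7800 XT (16 GB), a pairing that is AMD-unverified and not yet benchmarked.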

How hard is this setup?

Advanced — follow the steps above.