Waypoint 1.5 on RTX 5090: Real-Time Interactive World Model at 720p, 72 FPS

3d · intermediate · 12GB+ VRAM · May 25, 2026
Prerequisites
  • NVIDIA RTX 5090 (32GB VRAM) — the card Overworld names as the 'Recommended GPU / device' on the model card
  • Python 3.10+ with PyTorch built against CUDA 12.8 (`cu128` wheels for sm_120 Blackwell)
  • (Optional) A HuggingFace token — Waypoint-1.5 weights are public (Apache-2.0); no gating or license acceptance required
  • Keyboard / mouse input loop (or scripted `CtrlInput` sequence) — Waypoint is interactive, not text-to-video

What You'll Build

A local install of Waypoint 1.5 — Overworld's 1.2B-parameter real-time interactive video world model — running 720p generation driven by keyboard and mouse input on an RTX 5090, at the 56 FPS (BF16) or 72 FPS (w8a8) numbers the model card explicitly publishes for this exact GPU.

Hardware data: RTX 5090 (32GB VRAM) · 720p @ 56 FPS BF16 / 72 FPS w8a8 (4-step) · See benchmark data

ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2 on this GPU instead.

Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for "Desktop RTX 30 Series and later" (this recipe) and a 360P variant for laptop GPUs and Apple Silicon. The RTX 5090 is Overworld's named reference desktop card, so the 720p checkpoint (Overworld/Waypoint-1.5-1B) is the right target.

The 5090 is Overworld's named reference card. The model card names the RTX 5090 directly under "Recommended setup" — Overworld's headline 56 FPS unquantized BF16 number is the 5090 number. No cross-card extrapolation involved. The w8a8 path takes it to 72 FPS on the same card per the published performance table.

Requirements

| Component | Minimum | Tested |
| --- | --- | --- |
| GPU | NVIDIA RTX 30-series or later desktop card | RTX 5090 (32GB); pair not yet benchmarked in our DB, see /check/ |
| VRAM | Not stated explicitly; the 1.2B BF16 weights are ~11.18 GB on disk per the HF tree API (3.72 GB fused model.safetensors + 7.44 GB modular transformer/diffusion_pytorch_model.safetensors + 22.76 MB VAE); derived envelope ~12 GB at BF16, well below the 5090's 32 GB | 32GB |
| RAM | 16GB system RAM | |
| Storage | ~12 GB for BF16 safetensors + caches | |
| Software | Python 3.10+, PyTorch + cu128 (Blackwell), HuggingFace diffusers (or world_engine) | |

The official Waypoint-1.5-1B model card does not publish an explicit VRAM number, but it does publish a per-GPU FPS table that names the RTX 5090 directly (see Results below). The 5090's 32 GB envelope is roughly 3× the ~11 GB derived BF16 footprint, so memory is not a binding constraint on this card, and the published FPS rows put it at 56 FPS unquantized and 72 FPS with w8a8 against the family-level 720p / 60 FPS target.
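The derived envelope is plain arithmetic over the file sizes quoted in the Requirements table. A minimal sketch, assuming those sizes are accurate and counting weights only; activations, caches and the CUDA context are not included, so the headroom it prints is an upper bound rather than a measurement:

```python
# File sizes in GB as quoted in this recipe (BF16 weights only).
WEIGHT_FILES_GB = {
    "model.safetensors (fused)": 3.72,
    "transformer/diffusion_pytorch_model.safetensors": 7.44,
    "VAE": 22.76 / 1000,  # 22.76 MB expressed in GB
}
CARD_VRAM_GB = 32.0  # RTX 5090

footprint_gb = sum(WEIGHT_FILES_GB.values())
headroom_gb = CARD_VRAM_GB - footprint_gb
ratio = CARD_VRAM_GB / footprint_gb

print(f"weights on disk: {footprint_gb:.2f} GB")
print(f"headroom on card: {headroom_gb:.2f} GB")
print(f"card / weights:  {ratio:.2f}x")
```

The ratio comes out a little under 3, which is where the "roughly 3×" in the text comes from.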

Installation

Two paths cover the canonical entry points the Overworld team documents:

Path A — world_engine (recommended for interactive use)

world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library". Per the official Overworldai/world_engine README:

python3 -m venv .env
source .env/bin/activate
pip install --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"
# Optional: the weights are public, so a token is only needed if anonymous downloads fail
export HF_TOKEN=<your_huggingface_access_token>

The README maps inference quantization paths by GPU architecture:

| Config | Description | Supported GPUs (per the README) |
| --- | --- | --- |
| intw8a8 | INT8 weights + INT8 dynamic per-token activations | NVIDIA (30xx, 40xx, Ampere+) |
| fp8w8a8 | FP8 (e4m3) weights + FP8 per-tensor activations via torch._scaled_mm | NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100) |
| nvfp4 | NVFP4 weights + FP4 activations via FlashInfer/CUTLASS | NVIDIA Blackwell (B100, B200, RTX 5090) |

The RTX 5090 is Blackwell (sm_120), the only consumer card listed in the nvfp4 row. All three quant paths are architecturally available on this card, but the model card's headline 72 FPS number is taken under the w8a8 configuration (the model card's 4-step w8a8 quantized (5090) → 72 FPS row); the 56 FPS number is unquantized BF16 (4-step unquantized (5090) → 56 FPS). The Blackwell-only nvfp4 path is supported by the library but not the configuration the published headline numbers use — see Troubleshooting for guidance on which to try first.
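The README table can be read as a set of minimum compute capabilities. The sketch below encodes that reading; the thresholds (8.0 for Ampere, 8.9 for Ada Lovelace, 10.0 for Blackwell) are an interpretation of the table, not values taken from world_engine, and `supported_quants` is a hypothetical helper rather than part of the library:

```python
from typing import List, Tuple

# Minimum CUDA compute capability per quant config.
# Assumption: derived from the README's GPU column, not from world_engine code.
MIN_CAPABILITY = {
    "intw8a8": (8, 0),   # Ampere and newer
    "fp8w8a8": (8, 9),   # Ada Lovelace / Hopper and newer
    "nvfp4": (10, 0),    # Blackwell (B100/B200 are 10.x, RTX 5090 is 12.0)
}


def supported_quants(capability: Tuple[int, int]) -> List[str]:
    """Return the quant configs whose minimum capability the device meets."""
    return [name for name, minimum in MIN_CAPABILITY.items()
            if capability >= minimum]


print("RTX 5090 (sm_120):", supported_quants((12, 0)))
print("RTX 4090 (sm_89): ", supported_quants((8, 9)))
print("RTX 3090 (sm_86): ", supported_quants((8, 6)))
```

On a real machine the capability tuple can be obtained with `torch.cuda.get_device_capability()`.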

FP8 / NVFP4 on Blackwell are real speedups, not just memory tricks (reverses the Ampere lesson). The 5090's Tensor Cores have native FP8 (E4M3 / E5M2) and FP4 (NVFP4 microscaling) acceleration. Ampere cards (3090, A100) have no FP8 Tensor Core path, so FP8 weights are dequantized back to BF16 at compute time and the throughput win is lost; Blackwell runs them in their native format. Note that the published 56 → 72 FPS lift on the model card (BF16 → w8a8) is the figure this recipe reproduces with intw8a8: it shows the throughput dividend of 8-bit Tensor Core math on this card, but it is not an FP8 or NVFP4 measurement.

Path B — HuggingFace diffusers (modular pipelines API)

Per the official model card, Waypoint also ships as a ModularPipeline. Install the latest diffusers and PyTorch built against CUDA 12.8 — cu128 is the Blackwell-friendly index for the RTX 5090 (sm_120):

pip install --upgrade diffusers transformers accelerate safetensors
pip install torch --index-url https://download.pytorch.org/whl/cu128

Using the default cu121 / cu126 index on a 5090 risks kernel-launch failures or a silent fallback to code paths without Blackwell acceleration. Always pin cu128 on this card.
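One way to confirm the wheel is the right one is to compare the device's compute capability with the architectures the installed PyTorch build was compiled for. This is a sketch, not an official check: `arch_tag` and `build_targets` are hypothetical helpers, and a missing `sm_120` entry is a strong hint rather than proof, since a build can sometimes still run through PTX forward compatibility.

```python
from typing import Iterable, Tuple


def arch_tag(capability: Tuple[int, int]) -> str:
    """Format a compute capability such as (12, 0) as 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_targets(arch_list: Iterable[str], capability: Tuple[int, int]) -> bool:
    """True if the build lists native kernels for this capability."""
    return arch_tag(capability) in set(arch_list)


try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    print("CUDA runtime in this build:", torch.version.cuda)
    print("native kernels for", arch_tag(cap), ":",
          build_targets(torch.cuda.get_arch_list(), cap))
else:
    print("No CUDA-enabled PyTorch found; nothing to check.")
```

On a correctly installed cu128 wheel, the reported CUDA runtime should read 12.8 and the second line should be True on an RTX 5090.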

Running

Path A — world_engine with a scripted controller sequence

Adapted from examples/gen_sample.py in the world_engine repo. To hit the model card's published 72 FPS RTX 5090 throughput, pass quant="intw8a8" (the w8a8 row in the card's perf table); to reproduce the 56 FPS unquantized BF16 baseline, leave quant=None:

# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
import sys
import cv2
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput

# Pick the configuration matching the model-card row you want to reproduce:
#   quant=None      -> 56 FPS at 720p, 4-step unquantized BF16 (model card row)
#   quant="intw8a8" -> 72 FPS at 720p, 4-step w8a8 quantized   (model card row)
engine = WorldEngine(sys.argv[1], quant="intw8a8", device="cuda")

# Build a small controller programme: a mouse nudge, Space (32) presses, then W/A/S/D walks
controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={87})] * 10 +  # W: forward
    [CtrlInput(button={65})] * 10 +  # A: left
    [CtrlInput(button={68})] * 10 +  # D: right
    [CtrlInput(button={83})] * 10    # S: back
)

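# Seed frame (any RGB image works; it is resized to 1280x720)
seed_frame = cv2.imread("starter.png")
if seed_frame is None:  # cv2.imread returns None instead of raising on a missing or unreadable file
    raise FileNotFoundError("starter.png not found or not a readable image")
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))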

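with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    engine.append_frame(seed_frame_x4)
    # Hand NumPy arrays to the writer; the first write also sets codec and fps
    out.write(seed_frame_x4.numpy(), fps=60, codec="libx264")
    for ctrl in controller_sequence:
        # Each call returns a 4-frame chunk, written as a batch
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())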

Note the 4-frame chunking: gen_frame() returns four frames per call. The world_engine README documents this explicitly: Waypoint-1.5 "applies temporal compression and generates 4 frames for every controller input", with output shape [4, 720, 1280, 3]. Each CtrlInput in the scripted sequence therefore produces four video frames. This is separate from the "4-step" label in the model card's performance rows, which presumably refers to denoising steps rather than frames per call.
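The chunking makes clip length and per-call time budgets easy to work out. A small sketch, assuming the published FPS figures count output frames (so one gen_frame() call contributes four of them) and that the scripted sequence has 5 opening inputs plus four 10-step walks:

```python
FRAMES_PER_CALL = 4             # per the world_engine README quote
SEED_FRAMES = 4                 # the seed image repeated four times
CONTROLLER_INPUTS = 5 + 4 * 10  # opening inputs plus four 10-step walks
PLAYBACK_FPS = 60               # fps passed to the video writer

total_frames = SEED_FRAMES + CONTROLLER_INPUTS * FRAMES_PER_CALL
clip_seconds = total_frames / PLAYBACK_FPS


def call_budget_ms(target_fps: float) -> float:
    """Wall-clock time one gen_frame() call may take to sustain target_fps."""
    return 1000.0 * FRAMES_PER_CALL / target_fps


print(f"{total_frames} frames, {clip_seconds:.2f} s of video at {PLAYBACK_FPS} FPS")
for fps in (56, 60, 72):
    print(f"{fps} FPS -> {call_budget_ms(fps):.1f} ms per gen_frame() call")
```

Under these assumptions, a real-time input loop has to poll the controller and finish a gen_frame() call within roughly 56 to 71 ms to stay in the 56 to 72 FPS range.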

Path B — diffusers ModularPipeline

The model card ships this canonical snippet:

import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")

state.values["image"] = None
frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)

button={87} is the W key (walk forward). Replace with your input loop for real-time controllable rollouts. Note: the diffusers path here loads at BF16, which on the 5090 reproduces the model card's 56 FPS reference number; for the 72 FPS w8a8 number, use Path A with quant="intw8a8".
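When wiring up a live input loop, it can help to translate key names into the numeric codes the samples use. The sketch below is a hypothetical helper, not part of diffusers or world_engine; it covers only the codes that appear in this recipe (which coincide with the ASCII values of the uppercase letters and the space character) and assumes Space (32) is the jump input, as the sample's comment implies:

```python
from typing import Iterable, Set

# Key names mapped to the key codes used in this recipe's samples.
KEY_CODES = {"W": 87, "A": 65, "S": 83, "D": 68, "SPACE": 32}


def buttons_for(keys: Iterable[str]) -> Set[int]:
    """Convert held key names (case-insensitive) into a button-code set."""
    codes = set()
    for key in keys:
        name = key.upper()
        if name not in KEY_CODES:
            raise KeyError(f"no key code known for {key!r}")
        codes.add(KEY_CODES[name])
    return codes


print(buttons_for(["w"]))           # forward
print(buttons_for(["w", "space"]))  # forward while pressing Space
print(buttons_for([]))              # no keys held
```

The resulting set can be passed as the `button=` argument in the loop above, for example `pipe(state, button=buttons_for(held_keys), mouse=(0.0, 0.0), output_type="pil")`, where `held_keys` comes from whatever input library you use.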

Results

  • Speed: 56 FPS at 720p (4-step unquantized BF16) / 72 FPS at 720p (4-step w8a8 quantized). Both are published directly by the Waypoint 1.5 model card as reference data points in its "Waypoint 1 vs Waypoint 1.5" performance table. Verbatim rows from the card: 4-step unquantized (5090) → 56 FPS and 4-step w8a8 quantized (5090) → 72 FPS. The card also names the 5090 directly under "Recommended setup" with "Expected FPS on reference hardware: 56 FPS". These are first-party numbers on the exact target GPU, with no cross-card extrapolation involved. For comparison, the same table publishes 30 FPS for the RTX 3090 at w8a8 (4-step), so the 5090's w8a8 frame rate is 2.4× the 3090's. That ratio falls between the cards' memory-bandwidth gap (1792 GB/s GDDR7 vs 936 GB/s GDDR6X, about 1.9×) and their compute gap (Blackwell sm_120 ≈ 105 TFLOPS vs Ampere sm_86 ≈ 35 TFLOPS FP16 dense, about 3×). If you measure something different, please submit your numbers; a community benchmark in /check/waypoint-1-5/rtx-5090 will sit alongside the first-party citation.
  • VRAM usage: The model card does not state a VRAM figure. As a derived envelope: the BF16 weights are ~3.72 GB for the fused model.safetensors plus ~7.44 GB for the modular transformer plus ~22.76 MB for the VAE per the HF tree API — roughly 11.2 GB on-disk BF16. The w8a8 INT8 path reduces this further at runtime. The 32 GB envelope of an RTX 5090 has roughly 20 GB of headroom over the BF16 footprint even before quantization — see Spending the headroom below. Live measurements: /check/waypoint-1-5/rtx-5090.
  • Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window, about 10 seconds of rollout at 60 FPS per the model card (512 / 60 ≈ 8.5 s if the window is counted in output frames). On the RTX 5090, the BF16 number (56 FPS) sits just under the family-level 60 FPS target, at about 93% of it, and the w8a8 number (72 FPS) is comfortably above it. The 5090 is the only consumer card the published performance table shows exceeding the 60 FPS family target.
  • Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy" — design priorities are "Real-time interaction rather than offline batch generation, Low-latency responsiveness to user inputs, Local execution on consumer hardware, Persistent world rollouts where coherence across time matters as much as single-frame fidelity" (model card).

For the full benchmark data, see /check/waypoint-1-5/rtx-5090.
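The cross-card comparison in the Speed bullet reduces to three ratios. A short sketch using only the figures quoted in this section (the TFLOPS and bandwidth values are the recipe's own approximations, not measurements):

```python
# Figures as quoted in the Results list.
FPS_W8A8 = {"rtx_5090": 72, "rtx_3090": 30}
COMPUTE_TFLOPS = {"rtx_5090": 105, "rtx_3090": 35}
BANDWIDTH_GBPS = {"rtx_5090": 1792, "rtx_3090": 936}
FAMILY_TARGET_FPS = 60


def ratio(table: dict) -> float:
    """RTX 5090 value divided by RTX 3090 value."""
    return table["rtx_5090"] / table["rtx_3090"]


print(f"FPS ratio:       {ratio(FPS_W8A8):.2f}x")
print(f"compute ratio:   {ratio(COMPUTE_TFLOPS):.2f}x")
print(f"bandwidth ratio: {ratio(BANDWIDTH_GBPS):.2f}x")
print(f"BF16 vs target:  {56 / FAMILY_TARGET_FPS:.0%}")
print(f"w8a8 vs target:  {72 / FAMILY_TARGET_FPS:.0%}")
```

The measured FPS ratio sitting between the bandwidth and compute ratios is what you would expect from a workload that is partly limited by each, though the table alone cannot tell you the split.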

Spending the headroom

The 5090's 32 GB envelope is roughly 3× the Waypoint 1.5 BF16 on-disk footprint (~11.2 GB), leaving ~20 GB of VRAM beyond the weights (activations and caches take a share of that at runtime). Concrete ways to use that headroom on this card:

  • Co-locate a control / dialogue LLM. Run a Qwen3-8B Q4_K_M (~5 GB) or Llama-3.1-8B (~6 GB) alongside Waypoint to drive prompt updates, NPC dialogue, or scene-level scripting in the same process. See /check/qwen3-8b/rtx-5090 and /check/llama-3-1-8b/rtx-5090 for the sibling LLM benchmarks on this card.
  • Run a Whisper / Kokoro voice loop. A Whisper-large + Kokoro 82M TTS pair fits in roughly 2 GB with Whisper quantized to INT8 (closer to 3 GB at FP16) and turns Waypoint into a voice-controllable world without a second GPU.
  • Keep the unquantized BF16 path resident. On a 24 GB card you might cap the context window or drop to intw8a8 to fit headroom for activations on long rollouts; on the 5090 you can stay on quant=None BF16 (matches the 56 FPS reference number) without the per-token activation quantization cost, and still have 20+ GB free.
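Before co-locating models, it is worth adding up the planned residents against the card. The sketch below uses this recipe's own size estimates; `RESERVE_GB` is a hypothetical margin for activations, caches and the CUDA context, chosen for illustration rather than measured:

```python
CARD_VRAM_GB = 32.0  # RTX 5090

# Approximate resident sizes in GB, taken from the estimates in this recipe.
RESIDENTS_GB = {
    "Waypoint 1.5 BF16 weights": 11.2,
    "Qwen3-8B Q4_K_M": 5.0,
    "Whisper + Kokoro voice loop": 2.0,
}
RESERVE_GB = 4.0  # hypothetical margin, not a measured value


def remaining_gb(residents: dict, card_gb: float = CARD_VRAM_GB,
                 reserve_gb: float = RESERVE_GB) -> float:
    """VRAM left after the listed residents and the reserve are accounted for."""
    return card_gb - reserve_gb - sum(residents.values())


left = remaining_gb(RESIDENTS_GB)
print(f"planned residents: {sum(RESIDENTS_GB.values()):.1f} GB")
print(f"left after reserve: {left:.1f} GB")
print("fits" if left >= 0 else "over budget")
```

For a live check once everything is loaded, `torch.cuda.mem_get_info()` returns the free and total bytes on the current device.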

Troubleshooting

Throughput is below the 56 / 72 FPS the model card publishes

The 56 / 72 FPS figures are specifically for unquantized BF16 (56 FPS) and the intw8a8 quant path (72 FPS), both at 720p, 4-step. If you're seeing materially lower numbers, check in order: (1) you are on cu128 PyTorch wheels (pip install torch --index-url https://download.pytorch.org/whl/cu128), since the default index can silently miss the sm_120 Blackwell kernels; (2) for the BF16 path, pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False) is enabled in the diffusers snippet, as on the model card; (3) for the w8a8 path, you passed quant="intw8a8" to WorldEngine(...) rather than leaving it at None. The world_engine README quant-path table confirms intw8a8 is supported on Ampere+ (which includes Blackwell sm_120).
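To compare your own throughput with the published rows, time a fixed number of generation calls after a warm-up. The harness below is a generic sketch: `measure_fps` is a hypothetical helper, `step` stands for whatever callable produces one chunk (for example a lambda wrapping the recipe's `engine.gen_frame(...).cpu()`), and the `time.sleep` stand-in only exists so the example runs without a GPU. CUDA work is asynchronous, so `step` must block until the frames are ready (copying to CPU does that) or the timing will be too optimistic.

```python
import time
from typing import Callable


def measure_fps(step: Callable[[], object], calls: int = 50, warmup: int = 5,
                frames_per_call: int = 4) -> float:
    """Average output frames per second over `calls` invocations of `step`."""
    for _ in range(warmup):  # let compilation and autotuning settle first
        step()
    start = time.perf_counter()
    for _ in range(calls):
        step()
    elapsed = time.perf_counter() - start
    return calls * frames_per_call / elapsed


# Stand-in workload: pretend each call takes about 10 ms.
fps = measure_fps(lambda: time.sleep(0.01), calls=20, warmup=2)
print(f"measured: {fps:.0f} FPS (stand-in workload)")
```

The warm-up matters most on the compiled diffusers path, where the first calls include compilation time.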

Picking a quant path on Blackwell (RTX 5090)

All three quant paths, plus unquantized BF16, are architecturally supported on this card. In order of preference:

  • quant=None (unquantized BF16) — matches the model card's 56 FPS reference number, uses ~11 GB of the 5090's 32 GB, and avoids any quantization-induced quality drift. The natural starting point.
  • quant="intw8a8" — matches the model card's 72 FPS published number. Best throughput per the published table.
  • quant="fp8w8a8" — listed for "NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100)" in the world_engine README quant-path table. The 5090 has native FP8 Tensor Cores (E4M3 / E5M2) so the path is hardware-supported, but the model card's reference numbers do NOT use this configuration on the 5090 — they use BF16 (56 FPS) and w8a8 (72 FPS). Treat FP8 as an experimental tier on this card.
  • quant="nvfp4" — Blackwell-exclusive (B100, B200, RTX 5090 per the README). PR #3 added the initial NVFP4 benchmarking and generation code. The model card's headline numbers do not currently report an NVFP4 row for the 5090, so treat it as an experimental path: a further throughput uplift over w8a8 is plausible, but there is no published reference figure to confirm it yet.
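If you want to try an experimental path but fall back automatically when it fails to load, a small wrapper can walk a preference list. This is a sketch under assumptions: `first_working` is a hypothetical helper, the failure modes of each quant path are not documented here (hence the broad `except`), and `fake_build` is a stand-in so the example runs without a GPU.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_working(build: Callable[[Optional[str]], T],
                  candidates: Sequence[Optional[str]]) -> Tuple[Optional[str], T]:
    """Try each quant setting in order; return the first that builds."""
    errors = []
    for quant in candidates:
        try:
            return quant, build(quant)
        except Exception as exc:  # broad on purpose: failure modes are unknown
            errors.append(f"{quant!r}: {exc}")
    raise RuntimeError("no quant path loaded: " + "; ".join(errors))


def fake_build(quant: Optional[str]) -> str:
    """Stand-in for engine construction; pretends nvfp4 is unavailable."""
    if quant == "nvfp4":
        raise NotImplementedError("kernel not available")
    return f"engine({quant})"


chosen, engine = first_working(fake_build, ["nvfp4", "intw8a8", None])
print("loaded with quant =", chosen)
```

With the real library, `build` would be something like `lambda q: WorldEngine(model_id, quant=q, device="cuda")`, using the constructor arguments shown in Path A.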

CUDA-wheel mismatch on Blackwell (5090)

The 5090 is Blackwell-class (sm_120). Use the cu128 PyTorch wheels (pip install torch --index-url https://download.pytorch.org/whl/cu128) rather than the default cu121 / cu126 index to avoid kernel-launch failures or silent fall-back to compute paths that miss Blackwell-specific acceleration.

FlashAttention-2 on Blackwell

The model card and the world_engine README do not require flash_attention_2; the codebase uses PyTorch's SDPA path. If you nevertheless layer FlashAttention-2 on top (e.g. via a custom inference fork), note that FA2 sm_120 kernel support for Blackwell is still tracked at Dao-AILab/flash-attention#2168 — when in doubt, stay with the default SDPA backend that the canonical world_engine library uses.

HF_TOKEN errors / 401 on download

Per the world_engine README, world_engine reads the HF_TOKEN environment variable (export HF_TOKEN=<your_huggingface_access_token>) if one is set. The Waypoint-1.5-1B weights are public and Apache-2.0, with no licence acceptance or access request required, so a token is only needed if an anonymous download hits a rate-limit or transient 401.
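To rule out token problems separately from the engine, the weights can be pre-fetched with huggingface_hub directly. A minimal sketch: `resolve_hf_token` and `fetch_weights` are hypothetical helpers, and whether world_engine reuses the same local cache is an assumption. An empty or whitespace-only HF_TOKEN is treated as unset so that a stale `export HF_TOKEN=` line does not cause authentication errors.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return HF_TOKEN if it is set and non-empty, else None (anonymous)."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def fetch_weights(repo_id: str = "Overworld/Waypoint-1.5-1B") -> str:
    """Download the repo snapshot and return its local path."""
    # Imported lazily so the token helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, token=resolve_hf_token())


print("token configured:", resolve_hf_token() is not None)
# Call fetch_weights() when you actually want to download (roughly 11 GB).
```

If `fetch_weights()` succeeds with no token configured, any remaining 401 is unlikely to be a licence or gating issue.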

Confusion with the 360P variant, the SPar3D successor rumour, or other "Waypoint" projects

Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor — spar3d/Waypoint-1.5 does not resolve on the Hub. Unrelated "Waypoint" libraries (e.g. game-dev navigation, robotics path planning) are different projects — don't conflate them.

For other issues, file a report via the submission form.