
Waypoint 1.5 on RTX 5070 Ti: Real-Time Interactive World Model at 720p

3d · intermediate · 12GB+ VRAM · Jun 4, 2026
Prerequisites
  • NVIDIA RTX 5070 Ti (16GB VRAM) or another desktop RTX 30/40/50-series card
  • Python 3.10+ with PyTorch built against CUDA 12.8 (`cu128` wheels for sm_120 Blackwell)
  • Keyboard / mouse input loop (or scripted `CtrlInput` sequence) — Waypoint is interactive, not text-to-video

What You'll Build

A local install of Waypoint 1.5, Overworld's 1.2B-parameter real-time interactive video world model, running 720p generation driven by keyboard and mouse input on an RTX 5070 Ti. The controller loop (button presses, mouse deltas) is part of the inference loop, not a post-hoc edit. The model card describes the family as designed "around local, real-time generation on consumer hardware ranging from the most advanced RTX 50 series cards, to older RTX 30 series cards"; the 5070 Ti is a desktop Blackwell RTX 50-series card and sits squarely in that supported range.

Hardware data: RTX 5070 Ti (16GB VRAM) · 720p target resolution · See benchmark data

ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see Hunyuan3D-2 on this GPU instead.

Variant note: The Waypoint 1.5 family ships two model tiers: a 720p model for "higher-performance systems" (this recipe) and a 360p fork "intended to run smoothly across a broader range of gaming PCs and Apple Silicon Macs". The RTX 5070 Ti is a high-performance desktop Blackwell card, so the 720p 1B checkpoint (Overworld/Waypoint-1.5-1B) is the right target.

Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA RTX 30-series or later desktop card | RTX 5070 Ti (16GB GDDR7, 256-bit, ~896 GB/s, 8960 CUDA cores, GB203 sm_120, 300 W); pair not yet benchmarked in our DB, see /check/ |
| VRAM | Not stated explicitly; the 1.2B BF16 weights are ~11.2 GB on disk per the HF tree API (3.72 GB fused model.safetensors + 7.44 GB modular transformer/diffusion_pytorch_model.safetensors + 22.76 MB VAE); derived BF16 envelope ~12 GB, inside the 5070 Ti's 16 GB | 16GB |
| RAM | 16GB system RAM | |
| Storage | ~12 GB for BF16 safetensors + caches | |
| Software | Python 3.10+, PyTorch + cu128 (Blackwell), HuggingFace diffusers (or world_engine) | |

The official Waypoint-1.5-1B model card does not publish an explicit VRAM number, and its per-GPU performance table names only the RTX 5090 and RTX 3090 — the RTX 5070 Ti is not named in any published row (see Results). As a derived envelope, the ~11.2 GB on-disk BF16 footprint leaves roughly 5 GB of headroom on the 5070 Ti's 16 GB; the INT8 intw8a8 quant path reduces the resident footprint further at runtime.
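The envelope arithmetic above can be reproduced in a few lines. This is a rough sketch, not a measurement: it assumes every listed file is resident at once (an upper bound on weights) and ignores activations, caches, and compile workspace. The function name `vram_headroom_gb` is ours, for illustration only.

```python
# File sizes in GB, copied from the figures quoted above (HF tree API).
WEIGHT_FILES_GB = {
    "model.safetensors (fused)": 3.72,
    "transformer/diffusion_pytorch_model.safetensors": 7.44,
    "VAE": 22.76 / 1000,  # 22.76 MB expressed in GB
}


def vram_headroom_gb(card_vram_gb: float, files_gb: dict[str, float]) -> tuple[float, float]:
    """Return (total weight footprint, remaining headroom) in GB.

    Assumption: on-disk size is used as a proxy for the resident footprint.
    """
    total = sum(files_gb.values())
    return total, card_vram_gb - total


total_gb, headroom_gb = vram_headroom_gb(16.0, WEIGHT_FILES_GB)
print(f"weights ~{total_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
# prints: weights ~11.2 GB, headroom ~4.8 GB
```

Swap in a different `card_vram_gb` to sanity-check other cards; a measured figure from `nvidia-smi` during a real run should always take precedence over this estimate.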

Installation

Two paths cover the canonical entry points the Overworld team documents.

Path A — world_engine (recommended for interactive use)

world_engine is Overworld's reference inference library, named on the model card as the "Core Inference Library". Per the official Overworldai/world_engine README:

# Recommended: set up venv
python3 -m venv .env
source .env/bin/activate

# Install
pip install --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"

# Optional: a HuggingFace token avoids download rate limits
# (the Waypoint-1.5-1B repo is public, Apache-2.0 — no licence acceptance required)
export HF_TOKEN=<your access token>

The README maps inference quantization paths by GPU architecture:

| Config | Description | Supported GPUs (per the README) |
|---|---|---|
| intw8a8 | INT8 weights + INT8 dynamic per-token activations | NVIDIA (30xx, 40xx, Ampere+) |
| fp8w8a8 | FP8 (e4m3) weights + FP8 per-tensor activations via torch._scaled_mm | NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100) |
| nvfp4 | NVFP4 weights + FP4 activations via FlashInfer/CUTLASS | NVIDIA Blackwell (B100, B200, RTX 5090) |

The RTX 5070 Ti is Blackwell (sm_120). It is built on the GB203 die rather than the RTX 5090's GB202, but it belongs to the same consumer Blackwell architecture family the README lists in the nvfp4 row. All three quant paths are therefore architecturally available on the 5070 Ti; intw8a8 is the broadest-supported and the configuration the model card's published 5090/3090 numbers use, so it is the safest first quantized target. See Picking a quant path on Blackwell for guidance.

FP8 / NVFP4 on Blackwell are real speedups, not just memory tricks. The 5070 Ti's Tensor Cores have native FP8 (E4M3 / E5M2) and FP4 (NVFP4 microscaling) acceleration. Ampere cards (3090, A100) have no FP8 Tensor Core path, so FP8 weights are dequantized back to BF16 at compute time and the throughput win is lost; Blackwell runs them in their native format. Note that the model card's own 56 → 72 FPS lift on the 5090 (BF16 → w8a8) was measured on the INT8 intw8a8 path: it shows that quantization buys throughput as well as memory, but it is not an FP8 or NVFP4 measurement.

Path B — HuggingFace diffusers (modular pipelines API)

Per the official model card, Waypoint also ships as a ModularPipeline. Install the latest diffusers and PyTorch built against CUDA 12.8 — cu128 is the Blackwell-friendly index for the RTX 5070 Ti (sm_120):

pip install --upgrade diffusers transformers accelerate safetensors
pip install torch --index-url https://download.pytorch.org/whl/cu128

Using the default cu121 / cu126 index on a 5070 Ti risks kernel-launch failures or a silent fall-back to compute paths that miss Blackwell-specific acceleration — always pin cu128 on this card.

Running

Path A — world_engine

Per the world_engine README, load the model, set a prompt, and step the engine with controller inputs. Pass quant="intw8a8" to reproduce the INT8 path the model card's quantized numbers use; leave quant unset for the unquantized BF16 baseline:

from world_engine import WorldEngine, CtrlInput

# quant="intw8a8" -> INT8 weights + activations (the model card's w8a8 path)
# omit quant      -> unquantized BF16 baseline
engine = WorldEngine("Overworld/Waypoint-1.5-1B", quant="intw8a8", device="cuda")

engine.set_prompt("A fun game")

# Generate frames conditioned on controller inputs
for controller_input in [
    CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
    CtrlInput(mouse=[0.1, 0.2]),
    CtrlInput(button={95, 32, 105}),
]:
    img = engine.gen_frame(ctrl=controller_input)  # shape [4, 720, 1280, 3]

Note the 4-frame chunking: the world_engine README documents that "Waypoint-1.5 applies temporal compression and generates 4 frames for every controller input", so gen_frame() returns a [4, 720, 1280, 3] batch rather than a single frame. For smooth playback, space each 4-frame batch evenly across time while the next batch generates in parallel — the README ships a generation_loop / render_batch helper for exactly this. See examples/gen_sample.py for a complete reference.
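The pacing idea described above can be sketched without the library. This is a minimal illustration, not the README's `generation_loop` / `render_batch` helper: the names `frame_deadlines` and `present_batch` are hypothetical, and `show` stands in for whatever your display code does with a frame.

```python
import time
from typing import Callable, Sequence


def frame_deadlines(batch_start: float, batch_period: float, n_frames: int = 4) -> list[float]:
    """Evenly spaced presentation times for one batch of frames."""
    step = batch_period / n_frames
    return [batch_start + i * step for i in range(n_frames)]


def present_batch(
    frames: Sequence,
    batch_period: float,
    show: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Show each frame at its deadline so the batch fills batch_period evenly."""
    start = clock()
    for frame, due in zip(frames, frame_deadlines(start, batch_period, len(frames))):
        delay = due - clock()
        if delay > 0:
            sleep(delay)  # wait until this frame is due
        show(frame)


if __name__ == "__main__":
    # At a 60 FPS target, one 4-frame batch covers 4 / 60 s of playback.
    present_batch(["f0", "f1", "f2", "f3"], batch_period=4 / 60, show=print)
```

In a real loop you would run `present_batch` on one thread while the next `gen_frame` call runs on another, and set `batch_period` to the measured time per generation call so playback neither stalls nor races ahead.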

Path B — diffusers ModularPipeline

The model card ships this canonical snippet (the card's final line references an undefined outputs variable — corrected here to frames, the list the loop actually builds):

import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

# Seed the world with an image
image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))
state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")

# Generate subsequent frames with controller inputs
state.values["image"] = None
frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.extend(state.values["images"])  # each call returns a list of PIL images; keep frames flat

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)

button={87} is the W key (walk forward). Replace it with your own input loop for real-time controllable rollouts. The diffusers path loads at BF16. The pipe.transformer.compile(...) call is kept from the card snippet because it is load-bearing for throughput.

Results

  • Speed: The model card does not publish an RTX 5070 Ti figure. Its "Waypoint 1 vs Waypoint 1.5" performance table names only two cards, quoted here for context, all at 720p: 4-step unquantized (5090): 56 FPS; 4-step w8a8 quantized (5090): 72 FPS; 4-step w8a8 (3090): 30 FPS. The 5070 Ti's throughput will most likely land between those bookends, but the gap is large and architecture-dependent (the 5070 Ti has materially less memory bandwidth and fewer cores than the 5090), so we do not assign it a number; extrapolating across that gap would be a guess, not a measurement. Once a community submission lands it appears at /check/waypoint-1-5/rtx-5070-ti. If you run it, please submit your numbers.
  • VRAM usage: The model card does not state a VRAM figure. As a derived envelope: the BF16 weights are ~3.72 GB for the fused model.safetensors plus ~7.44 GB for the modular transformer plus ~22.76 MB for the VAE per the HF tree API — roughly 11.2 GB on-disk BF16, with ~5 GB of headroom on the 5070 Ti's 16 GB before activations. The intw8a8 INT8 path reduces the resident footprint further at runtime. Live measurements: /check/waypoint-1-5/rtx-5070-ti.
  • Latency target: Family-level target is "up to 720p and 60 FPS" with a 512-frame context, which works out to roughly 8.5 seconds of rollout at 60 FPS (512 / 60) (model card).
  • Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy"; per the model card, "Long interactive rollouts may drift, collapse, or become inconsistent." Design priorities are real-time interaction over offline batch generation, low-latency responsiveness, local execution on consumer hardware, and persistent world rollouts where coherence across time matters as much as single-frame fidelity (model card).
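If you want a number to submit, a small timing harness is enough. The sketch below is an assumption-laden illustration rather than an official benchmark: `measure_fps` is a hypothetical helper, `step` is any zero-argument callable that performs one generation call, and the default of 4 frames per call follows the chunking described in the Running section.

```python
import time
from typing import Callable


def measure_fps(
    step: Callable[[], object],
    calls: int = 50,
    warmup: int = 5,
    frames_per_call: int = 4,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average frames per second over `calls` timed invocations of `step`."""
    for _ in range(warmup):
        step()  # untimed: lets compilation and caches settle
    start = clock()
    for _ in range(calls):
        step()
    elapsed = clock() - start
    return calls * frames_per_call / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with a closure around your generation call.
    print(f"{measure_fps(lambda: time.sleep(0.01), calls=20, warmup=2):.1f} FPS")
```

GPU work is asynchronous, so for an honest figure the `step` closure should wait for the device to finish (for example by calling `torch.cuda.synchronize()` after the generation call) before returning.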

For the full benchmark data, see /check/waypoint-1-5/rtx-5070-ti.

Troubleshooting

No RTX 5070 Ti throughput figure is published — what should I expect?

The model card's performance table benchmarks only the RTX 5090 (56 FPS BF16 / 72 FPS w8a8) and RTX 3090 (30 FPS w8a8) at 720p. The 5070 Ti is a supported card ("Desktop RTX 30 Series and later") but has no first-party number. Run the intw8a8 path first for the best chance of hitting the family-level 60 FPS target, confirm you are on cu128 wheels (below), and then submit your measured FPS so the figure lands at /check/waypoint-1-5/rtx-5070-ti for the next reader.

Picking a quant path on Blackwell (RTX 5070 Ti)

All three quantized configurations in the world_engine README quant-path table are architecturally supported on Blackwell sm_120, alongside the unquantized BF16 baseline. Suggested order:

  • quant="intw8a8": INT8 weights + activations; listed for "NVIDIA (30xx, 40xx, Ampere+)", which includes Blackwell. This is the configuration behind the model card's published 5090 (72 FPS) and 3090 (30 FPS) quantized numbers, so it is the most-tested path and the natural default on the 5070 Ti.
  • Unquantized BF16 (omit quant): ~11.2 GB resident, no quantization-induced quality drift, fits the 16 GB envelope with headroom. Use this if you want maximum fidelity and the BF16 throughput is acceptable.
  • quant="fp8w8a8": listed for "NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100)". The 5070 Ti has native FP8 Tensor Cores so the path is hardware-supported, but the model card's reference numbers do not use it; treat it as an experimental tier.
  • quant="nvfp4": Blackwell-exclusive (the README lists "B100, B200, RTX 5090"); architecturally available on the 5070 Ti's sm_120 Tensor Cores (GB203 die, the same consumer Blackwell generation as the 5090's GB202). PR #3 added the initial NVFP4 benchmarking and generation code. No published reference FPS exists for any 50-series card other than the 5090 on this path yet, so treat it as experimental.
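The suggested order can be expressed as a small lookup keyed on CUDA compute capability. This is a sketch under assumptions: the README lists GPU families, not capability numbers, so the thresholds below (8.0 for Ampere, 8.9 for Ada Lovelace, 10.0 and up for Blackwell) are our mapping, and `candidate_quants` is a hypothetical helper, not part of world_engine. `None` stands for omitting `quant` (unquantized BF16).

```python
from typing import Optional


def candidate_quants(capability: tuple[int, int]) -> list[Optional[str]]:
    """Quant settings to try, in the suggested order, for a compute capability."""
    order: list[Optional[str]] = []
    if capability >= (8, 0):    # Ampere and later: INT8 path
        order.append("intw8a8")
    order.append(None)          # unquantized BF16 baseline
    if capability >= (8, 9):    # Ada Lovelace / Hopper and later: FP8 path
        order.append("fp8w8a8")
    if capability >= (10, 0):   # Blackwell (sm_100 datacenter, sm_120 consumer): NVFP4 path
        order.append("nvfp4")
    return order


print(candidate_quants((12, 0)))  # RTX 5070 Ti reports compute capability 12.0
# prints: ['intw8a8', None, 'fp8w8a8', 'nvfp4']
```

On a live system the capability tuple can be read with `torch.cuda.get_device_capability()`; the list only says which paths are worth attempting, not that each one will run or be faster on your card.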

CUDA-wheel mismatch on Blackwell (5070 Ti)

The 5070 Ti is Blackwell-class (sm_120). Use the cu128 PyTorch wheels (pip install torch --index-url https://download.pytorch.org/whl/cu128) rather than the default cu121 / cu126 index to avoid kernel-launch failures or a silent fall-back to compute paths that miss Blackwell-specific acceleration.
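A quick way to confirm the installed wheel actually targets your card is to compare the device's compute capability with the architectures the PyTorch build was compiled for. The sketch below uses `torch.cuda.get_arch_list()`, `torch.cuda.get_device_capability()`, and `torch.version.cuda`; the helper name `wheel_targets_arch` is ours. It only checks for a native `sm_XY` entry, so treat a negative result as a prompt to reinstall from the cu128 index rather than as proof of failure.

```python
def wheel_targets_arch(arch_list: list[str], major: int, minor: int) -> bool:
    """True if the PyTorch build lists a native kernel target for this capability."""
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability(0)
            ok = wheel_targets_arch(torch.cuda.get_arch_list(), major, minor)
            print(f"CUDA {torch.version.cuda}, device sm_{major}{minor}, native kernels: {ok}")
        else:
            print("No CUDA device visible to PyTorch")
```

On a correctly installed cu128 build with a 5070 Ti you would expect the device to report sm_120 and the check to come back true.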

FlashAttention-2 on Blackwell

The model card and the world_engine README do not require flash_attention_2; the canonical library uses PyTorch's SDPA path. If you layer FlashAttention-2 on top via a custom fork, note that FA2 sm_120 kernel support for Blackwell is still tracked at Dao-AILab/flash-attention#2168 — when in doubt, stay with the default SDPA backend the world_engine library uses.

HF_TOKEN errors / rate-limited download

The Overworld/Waypoint-1.5-1B repo is public and Apache-2.0 — no licence acceptance is required to download. If you hit HTTP 429 rate limits or want authenticated pulls, set export HF_TOKEN=<your access token> (from huggingface.co/settings/tokens) before the first run, as the world_engine README suggests.

Confusion with the 360P variant, the SPar3D successor rumour, or other "Waypoint" projects

Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop / Apple-Silicon tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor — spar3d/Waypoint-1.5 does not resolve on the Hub. Unrelated "Waypoint" libraries (game-dev navigation, robotics path planning) are different projects — don't conflate them.

For other issues, file a report via the submission form.