How much VRAM does Waypoint 1.5 need?

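About 12 GB is the minimum this recipe targets. That figure is derived from the ~11.18 GB on-disk BF16 footprint; the model card itself does not publish a VRAM number.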

How hard is this setup?

Intermediate. Follow the steps below.

Waypoint 1.5 on RTX 4090: Real-Time Interactive World Model at 720p

What You'll Build

A local install of Waypoint 1.5, Overworld's 1.2B-parameter real-time interactive video world model, running 720p generation driven by keyboard and mouse input on an RTX 4090. The controller loop (button presses, mouse deltas) is part of the inference loop, not a post-hoc edit.

Hardware data: RTX 4090 (24GB VRAM) · 720p target resolution · See benchmark data

ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2 on this GPU instead.

Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for "desktop RTX 30 series through RTX 50 series cards" (this recipe) and a 360P variant for laptop GPUs and Apple Silicon. The RTX 4090 is a desktop Ada Lovelace card, so the 720p checkpoint (Overworld/Waypoint-1.5-1B) is the right target.

Requirements

Component	Minimum	Tested
GPU	NVIDIA RTX 30-series or later desktop card	RTX 4090 (24GB) — pair not yet benchmarked, see /check/
VRAM	Not stated in the model card; the 1.2B BF16 weights are ~11.18 GB on disk per the HF tree API (3.72 GB fused `model.safetensors` + 7.44 GB modular `transformer/diffusion_pytorch_model.safetensors` + 22.76 MB VAE), so the 24 GB envelope of a 4090 has ample headroom even before quantization	24GB
RAM	16GB system RAM	—
Storage	~12 GB for BF16 safetensors + caches	—
Software	Python 3.10+, PyTorch with CUDA + BF16, HuggingFace `diffusers` (or `world_engine`)	—

The official Waypoint-1.5-1B model card does not publish an explicit VRAM number. It does explicitly support "Desktop RTX 30 Series and later" and publishes a headline performance table with the 1B BF16 model running 720p at 56 FPS on an RTX 5090 (unquantized, 4-step) and 30 FPS on an RTX 3090 (w8a8 quantized, 4-step). The RTX 4090 (Ada Lovelace, sm_89, 24 GB, ~1008 GB/s memory bandwidth) sits inside the supported envelope between those two reference points. Its 24 GB of VRAM matches the 3090 (the 5090 has 32 GB) and is well above the 11.18 GB on-disk BF16 footprint.
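The headroom claim is simple arithmetic on the file sizes quoted in the requirements table. A minimal sketch, assuming decimal gigabytes throughout and treating on-disk size as a rough stand-in for resident weight size (real VRAM use also includes activations and caches, which this does not model):

```python
# File sizes as quoted in this recipe (GB, from the HF tree API).
FILES_GB = {
    "model.safetensors (fused)": 3.72,
    "transformer/diffusion_pytorch_model.safetensors (modular)": 7.44,
    "VAE": 22.76 / 1000,  # 22.76 MB expressed in GB
}
GPU_VRAM_GB = 24.0  # RTX 4090

total_gb = sum(FILES_GB.values())
headroom_gb = GPU_VRAM_GB - total_gb

print(f"on-disk total: {total_gb:.2f} GB")
print(f"headroom on a 24 GB card: {headroom_gb:.2f} GB")
```

If only one of the fused or modular copies is loaded at a time, the resident footprint would be lower than this sum. That is an assumption on our part, not something the model card states.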

Installation

Two paths cover the canonical entry points the Overworld team documents:

Path A — `world_engine` (recommended for interactive use)

world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library". Per the official Wayfarer-Labs/world_engine README:

python3 -m venv .env
source .env/bin/activate
pip install --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"
export HF_TOKEN=<your_huggingface_access_token>

The README lists three inference quantization paths by GPU architecture:

Config	Description	Supported GPUs (per the README)
`intw8a8`	INT8 weights + INT8 dynamic per-token activations	NVIDIA (30xx, 40xx, Ampere+)
`fp8w8a8`	FP8 (e4m3) weights + FP8 per-tensor activations via `torch._scaled_mm`	NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100)
`nvfp4`	NVFP4 weights + FP4 activations via FlashInfer/CUTLASS	NVIDIA Blackwell (B100, B200, RTX 5090)

The RTX 4090 is Ada Lovelace (sm_89), so the recommended fast paths are fp8w8a8 (Ada has 4th-gen Tensor Cores with native FP8 support) or intw8a8 (broadly compatible). The Blackwell-only nvfp4 path does not apply on a 4090: it needs Blackwell-class hardware (sm_100 on B100/B200, sm_120 on the RTX 5090). Start with quant=None (unquantized BF16), since the 24 GB envelope has enough headroom, then drop to fp8w8a8 for throughput.
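The architecture-to-config mapping in the README table can be written as a small lookup. This is a sketch only: `suggest_quant` is a hypothetical helper, not part of world_engine, and it returns the fastest quantized config listed for an architecture rather than the unquantized default.

```python
from typing import Optional, Tuple


def suggest_quant(capability: Tuple[int, int]) -> Optional[str]:
    """Map a CUDA compute capability to a quant config from the README table.

    Hypothetical helper for illustration. Returns None for architectures
    the table does not list.
    """
    major, minor = capability
    if major >= 10:  # Blackwell: sm_100 (B100/B200), sm_120 (RTX 5090)
        return "nvfp4"
    if major == 9 or (major, minor) == (8, 9):  # Hopper sm_90, Ada sm_89
        return "fp8w8a8"
    if major == 8:  # Ampere: sm_80, sm_86
        return "intw8a8"
    return None  # pre-Ampere


print(suggest_quant((8, 9)))   # RTX 4090
print(suggest_quant((8, 6)))   # RTX 3090
print(suggest_quant((12, 0)))  # RTX 5090
```

On a real machine the tuple would come from torch.cuda.get_device_capability(). The result can be passed as the quant argument to WorldEngine, or ignored in favour of quant=None while VRAM headroom allows.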

Path B — HuggingFace `diffusers` (modular pipelines API)

Per the official model card, Waypoint also ships as a ModularPipeline. Install the latest diffusers against a standard CUDA wheel. Ada is fully covered by the default pip install torch wheel, so no special index is needed (unlike Blackwell, which requires cu128):

pip install --upgrade diffusers transformers accelerate safetensors
pip install torch  # default cu124/cu126 index is fine on Ada sm_89
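After installing, it can help to confirm what actually landed in the environment before downloading weights. A minimal sketch using only the standard library for the package check; the GPU section runs only if torch is importable, and the package list simply mirrors the pip commands in this section:

```python
from importlib import metadata

REQUIRED = ["torch", "diffusers", "transformers", "accelerate", "safetensors"]


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:12s} {version or 'MISSING'}")

# Optional GPU check; only runs when torch is installed.
if report["torch"] is not None:
    import torch

    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("capability:", torch.cuda.get_device_capability(0))  # (8, 9) on Ada
        print("bf16 supported:", torch.cuda.is_bf16_supported())
    else:
        print("CUDA not available to this torch build")
```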

Running

Path A — `world_engine` with a scripted controller sequence

Adapted from examples/gen_sample.py in the world_engine repo:

# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
import sys
import cv2
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput

engine = WorldEngine(sys.argv[1], quant=None, device="cuda")

# Build a small controller programme: mouse, jump, walk W/A/S/D
controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={87})] * 10 +  # forward (W = 87; 32 is Space/jump)
    [CtrlInput(button={65})] * 10 +  # A — left
    [CtrlInput(button={68})] * 10 +  # D — right
    [CtrlInput(button={83})] * 10    # S — back
)

# Seed frame (any 1280x720 RGB image works)
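seed_frame = cv2.imread("starter.png")
if seed_frame is None:  # cv2.imread returns None instead of raising
    raise FileNotFoundError("starter.png is missing or unreadable")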
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))

with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    engine.append_frame(seed_frame_x4)
    out.write(seed_frame_x4.numpy(), fps=60, codec="libx264")  # write as a NumPy array, like the generated frames
    for ctrl in controller_sequence:
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())

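Note the 4-frame chunking: gen_frame() returns four frames per call. The world_engine README documents that Waypoint-1.5 "applies temporal compression and generates 4 frames for every controller input", with output shape [4, 720, 1280, 3]. At 60 FPS playback, that works out to 15 controller inputs per second. This chunk size is a separate quantity from the "4-step" label in the model card's performance table, which appears to describe the sampling schedule rather than frames per call.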
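The chunking makes the length of the scripted clip easy to work out. A short worked example, assuming the controller sequence from the script in this section (5 opening inputs plus four blocks of 10) and the 60 FPS rate passed to out.write():

```python
FRAMES_PER_CALL = 4   # per the world_engine README
FPS = 60              # playback rate passed to out.write()

scripted_inputs = 5 + 4 * 10   # 5 opening inputs + four blocks of 10
seed_frames = 4                # the seed image repeated four times

generated_frames = scripted_inputs * FRAMES_PER_CALL
total_frames = seed_frames + generated_frames
duration_s = total_frames / FPS
inputs_per_second = FPS / FRAMES_PER_CALL

print(f"{scripted_inputs} inputs -> {generated_frames} generated frames")
print(f"clip length: {total_frames} frames, {duration_s:.2f} s at {FPS} FPS")
print(f"controller rate: {inputs_per_second:.0f} inputs per second")
```

So the scripted run produces a clip of roughly three seconds, and a live input loop would need to sample the controller about 15 times per second to keep up with 60 FPS output.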

Path B — `diffusers` ModularPipeline

The model card ships this canonical snippet:

import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")

state.values["image"] = None
frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)

button={87} holds the W key (walk forward) on every step. For real-time controllable rollouts, replace the fixed button and mouse arguments with values read from your own input loop.
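One way to feed an input loop into the button argument is to translate pressed keys into the key codes this recipe uses (W=87, A=65, S=83, D=68, Space=32). The sketch below is illustrative only: `keys_to_buttons` and `ALLOWED_KEYS` are hypothetical names, and how you capture key presses is left to whatever input library you use.

```python
# Key codes used in this recipe: W=87, A=65, S=83, D=68, Space=32.
# For letters these equal the ASCII code of the uppercase character.
ALLOWED_KEYS = {"w", "a", "s", "d", " "}


def keys_to_buttons(pressed):
    """Convert an iterable of pressed key characters to a set of key codes.

    Hypothetical helper for illustration; keys outside ALLOWED_KEYS are ignored.
    """
    return {ord(k.upper()) for k in pressed if k.lower() in ALLOWED_KEYS}


print(sorted(keys_to_buttons({"w", "d"})))  # [68, 87]
print(sorted(keys_to_buttons({" "})))       # [32]
print(keys_to_buttons(set()))               # set()
```

Inside the rollout loop, the returned set would take the place of the hard-coded {87}. The expected scale of the mouse tuple is not documented in this recipe, so treat any mouse mapping as something to verify against the model card.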

Results

Speed: No community benchmark exists on RTX 4090 yet. The model card publishes two reference data points: 56 FPS at 720p on RTX 5090 (unquantized BF16, 4-step) and 30 FPS at 720p on RTX 3090 (w8a8 quantized, 4-step). Neither figure can be relabeled as a 4090 measurement: the 5090 is Blackwell (sm_120, GDDR7) and the 3090 is Ampere (sm_86, GDDR6X), so each belongs to a different architecture generation from Ada Lovelace (sm_89). The 4090's actual throughput on this model is unknown until a community submission lands at /check/waypoint-1-5/rtx-4090. If you run it, please submit your numbers.
VRAM usage: The model card does not state a VRAM figure. As a derived envelope: the BF16 weights are ~11.18 GB on disk per the HF tree API (3.72 GB fused + 7.44 GB modular transformer + 22.76 MB VAE), so the 24 GB envelope of a 4090 has ample headroom for activations, the 512-frame context window, and the autoencoder pipeline even at unquantized BF16. Live measurements: /check/waypoint-1-5/rtx-4090.
Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window (model card). At 60 FPS, 512 frames is about 8.5 seconds of rollout.
Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy" — design priorities are "Real-time interaction rather than offline batch generation, Low-latency responsiveness to user inputs, Local execution on consumer hardware, Persistent world rollouts where coherence across time matters as much as single-frame fidelity" (model card).
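Since no 4090 figure exists yet, a small timing harness is a reasonable way to produce one. The sketch below is an assumption-laden starting point, not an official benchmark method: `measure_fps` is a hypothetical helper, the 4 frames per call comes from the world_engine README, and the stand-in workload is a sleep so the code runs without a GPU.

```python
import time


def measure_fps(step, n_calls=50, warmup=5, frames_per_call=4, sync=None):
    """Time `step()` and return frames per second.

    `step` is any zero-argument callable that generates one chunk.
    `sync` is an optional zero-argument callable run before each clock read;
    on a GPU you would pass torch.cuda.synchronize so queued work is counted.
    """
    for _ in range(warmup):  # keep compile/autotune time out of the measurement
        step()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(n_calls):
        step()
    if sync:
        sync()
    elapsed = time.perf_counter() - start
    return n_calls * frames_per_call / elapsed


# Stand-in workload so the sketch runs anywhere; swap in a real generation call.
fps = measure_fps(lambda: time.sleep(0.01), n_calls=20, warmup=2)
print(f"{fps:.1f} frames/s")
```

With world_engine, the step would be something like a lambda wrapping engine.gen_frame(ctrl=...), with sync=torch.cuda.synchronize. Report the quant setting and resolution alongside the number so it can be compared with the model card's reference points.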

For the full benchmark data, see /check/waypoint-1-5/rtx-4090.

Troubleshooting

Frame rate is below your expectation

The model card's two reference data points (56 FPS 5090 BF16 / 30 FPS 3090 w8a8) cover the boundaries of Overworld's "RTX 30 Series and later" support envelope. On a 4090 you have two levers per the world_engine README: try quant="fp8w8a8" (Ada has 4th-gen Tensor Cores with native FP8 support, distinct from the Blackwell-only nvfp4 path) or quant="intw8a8" (the broadly-compatible Ampere+ path). In the diffusers path, also confirm that pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False) is enabled, as in the model card snippet; without it the transformer runs uncompiled. Expect the first few calls after compiling to be slow while compilation and autotuning run, and judge frame rate only after that warm-up.

`HF_TOKEN` errors / 401 on download

Per the world_engine README, world_engine picks up the HF_TOKEN environment variable if one is set (export HF_TOKEN=<your_huggingface_access_token>). The Waypoint-1.5-1B weights are public and Apache-2.0, with no licence acceptance or access request required, so a token is only needed if an anonymous download hits a rate limit or a transient 401.
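To check whether a token is actually visible to the Python process (a common cause of confusion when the export was done in a different shell), a quick sketch follows. `hf_token_or_none` is a hypothetical helper; it only inspects the environment and does not contact the Hub.

```python
import os


def hf_token_or_none(env=os.environ):
    """Return the HF_TOKEN value, or None to signal an anonymous download."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = hf_token_or_none()
print("authenticated download" if token else "anonymous download")
```

Because the weights are public, "anonymous download" is an acceptable result here. If you script the download yourself with huggingface_hub, the same value can be passed as its token argument.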

Confusion with the 360P variant, the SPar3D successor rumour, or other "Waypoint" projects

Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor — spar3d/Waypoint-1.5 does not resolve on the Hub. Unrelated "Waypoint" libraries (e.g. game-dev navigation, robotics path planning) are different projects — don't conflate them.

Picking a quant path on Ada (RTX 4090)

The world_engine README maps nvfp4 to Blackwell only (B100, B200, RTX 5090). On a 4090, do not attempt the nvfp4 path: it relies on Blackwell-only FP4 kernels (via FlashInfer/CUTLASS) that Ada Tensor Cores do not implement. Use unquantized BF16 as the default (24 GB has the headroom), fp8w8a8 for a throughput uplift (Ada's native FP8), or intw8a8 if you want the most broadly-compatible quantized path.

For other issues, file a report via the submission form.