How much VRAM does Waypoint 1.5 need?

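About 12 GB at BF16. The model card states no VRAM figure; this is an envelope derived from the published weight files, and it is the minimum this recipe targets.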

How hard is this setup?

Intermediate. Follow the steps below.

Waypoint 1.5 on RTX 3090: Real-Time Interactive World Model at 720p, 30 FPS

What You'll Build

A local install of Waypoint 1.5 — Overworld's 1.2B-parameter real-time interactive video world model — running 720p generation driven by keyboard and mouse input on an RTX 3090, at the 30 FPS target the model card explicitly publishes for this exact GPU.

Hardware data: RTX 3090 (24GB VRAM) · 720p @ 30 FPS (w8a8 quantized, 4-step) · See benchmark data

ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2 on this GPU instead.

Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for "desktop RTX 30 series through RTX 50 series cards" (this recipe) and a 360p variant (Overworld/Waypoint-1.5-1B-360P) for laptop GPUs and Apple Silicon. The RTX 3090 is a desktop Ampere card squarely inside the 720p tier.

Requirements

Component	Minimum	Tested
GPU	NVIDIA RTX 30-series or later desktop card	RTX 3090 (24GB) — pair not yet benchmarked in our DB, see /check/
VRAM	Not stated in the model card; the 1.2B BF16 weights are ~3.72 GB for the fused `model.safetensors` per the HF tree API, plus a separate modular transformer at ~7.44 GB and a small VAE at ~22.76 MB — derived envelope ~12 GB at BF16 (well below the 3090's 24 GB)	24GB
RAM	16GB system RAM	—
Storage	~12 GB for BF16 safetensors + caches	—
Software	Python 3.10+, PyTorch with CUDA + BF16, HuggingFace `diffusers` (or `world_engine`)	—

The official Waypoint-1.5-1B model card does not publish an explicit VRAM number, but it DOES publish a per-GPU FPS table that names the RTX 3090 directly — see Results below. The 3090's 24 GB envelope is comfortably above the ~12 GB derived BF16 footprint; the binding constraint on Ampere is compute (INT8 path required, see Installation), not memory.
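The derived envelope is plain arithmetic over the file sizes quoted in the Requirements table. A minimal sketch, assuming decimal units (1 GB = 1000 MB) and treating the sum as on-disk size only, not a measured runtime footprint; the dictionary keys and helper names are illustrative:

```python
# File sizes in GB as quoted in the Requirements table (HF tree API figures).
WEIGHT_FILES_GB = {
    "model.safetensors (fused)": 3.72,
    "modular transformer": 7.44,
    "VAE": 22.76 / 1000,  # 22.76 MB expressed in GB (decimal units assumed)
}


def on_disk_total_gb(files):
    """Sum the per-file sizes to get the on-disk BF16 total."""
    return sum(files.values())


def headroom_gb(total_vram_gb, footprint_gb):
    """VRAM left over once the assumed footprint is resident."""
    return total_vram_gb - footprint_gb


total = on_disk_total_gb(WEIGHT_FILES_GB)
print(f"on-disk BF16 weights: {total:.2f} GB")
print(f"headroom on a 24 GB card at a 12 GB envelope: {headroom_gb(24, 12):.0f} GB")
```

The sum comes to about 11.18 GB, which this recipe rounds up to a ~12 GB envelope. Activations, the context window, and compile caches are not included, so treat it as a lower bound.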

Installation

Two paths cover the canonical entry points the Overworld team documents:

Path A — `world_engine` (recommended for interactive use)

world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library". Per the official Wayfarer-Labs/world_engine README:

python3 -m venv .env
source .env/bin/activate
pip install --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"
export HF_TOKEN=<your_huggingface_access_token>

The README maps inference quantization paths by GPU architecture:

Config	Description	Supported GPUs (per the README)
`intw8a8`	INT8 weights + INT8 dynamic per-token activations	NVIDIA (30xx, 40xx, Ampere+)
`fp8w8a8`	FP8 (e4m3) weights + FP8 per-tensor activations via `torch._scaled_mm`	NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100)
`nvfp4`	NVFP4 weights + FP4 activations via FlashInfer/CUTLASS	NVIDIA Blackwell (B100, B200, RTX 5090)

The RTX 3090 is Ampere (sm_86), so the only quantized fast path is intw8a8, which is exactly the path behind the model card's headline RTX 3090 figure ("4-step w8a8 (3090) → 30 FPS"). Per the README, fp8w8a8 is Ada Lovelace / Hopper+ only (the table excludes Ampere); on a 3090 it will at best fall back to BF16 dequantization and lose the throughput benefit. The nvfp4 path does not apply on a 3090 either: it requires Blackwell hardware (sm_100 for B100/B200, sm_120 for the RTX 5090).

⚠️ FP8 weights ≠ FP8 compute on Ampere. Ampere (sm_86) has no FP8 tensor cores; an FP8 weight file will load, but the runtime dequantizes to BF16 on the fly, so the 3090 pays the conversion cost without the throughput uplift Ada/Blackwell cards get. Default to intw8a8 on the 3090: it is both the world_engine README's recommendation and the configuration the model card's 30 FPS measurement was taken under.
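The architecture mapping can be expressed as a small lookup on CUDA compute capability. This is a sketch, not part of world_engine: pick_quant is a hypothetical helper, and the numeric thresholds (8.x Ampere, 8.9 Ada, 9.0 Hopper, 10.x and later Blackwell) are an assumption layered on the README table, which lists GPUs by name rather than by capability.

```python
def pick_quant(capability):
    """Map a CUDA compute capability (major, minor) to a quant config name.

    Returns None when no quantized path from the README table applies,
    in which case unquantized BF16 is the fallback.
    """
    major, minor = capability
    if major >= 10:                # Blackwell: sm_100 (B100/B200), sm_120 (RTX 5090)
        return "nvfp4"
    if (major, minor) >= (8, 9):   # Ada Lovelace (sm_89) and Hopper (sm_90)
        return "fp8w8a8"
    if major == 8:                 # Ampere: sm_80, sm_86 (RTX 3090)
        return "intw8a8"
    return None                    # pre-Ampere: no listed quant path


print(pick_quant((8, 6)))  # RTX 3090
```

On a live system the tuple can come from torch.cuda.get_device_capability(), which reports (8, 6) on an RTX 3090. The helper picks the fastest listed path per architecture; intw8a8 remains usable on newer cards per the README table.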

Path B — HuggingFace `diffusers` (modular pipelines API)

Per the official model card, Waypoint also ships as a ModularPipeline. Install the latest diffusers against a standard CUDA wheel — Ampere is fully covered by the default pip install torch (no special index needed, unlike Blackwell which requires cu128):

pip install --upgrade diffusers transformers accelerate safetensors
pip install torch  # default cu124/cu126 index is fine on Ampere sm_86
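Before loading any weights, it can help to confirm that the installed torch wheel actually sees the GPU and supports BF16. A minimal sanity-check sketch; describe_cuda is a hypothetical helper name, and it only reports what PyTorch exposes rather than validating the Waypoint install itself:

```python
def describe_cuda():
    """Return a small dict describing the CUDA setup PyTorch can see."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {"torch_installed": True, "cuda_available": torch.cuda.is_available()}
    if info["cuda_available"]:
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        info.update(
            name=torch.cuda.get_device_name(0),
            capability=torch.cuda.get_device_capability(0),  # (8, 6) on an RTX 3090
            bf16_supported=torch.cuda.is_bf16_supported(),
            total_vram_gb=round(total_bytes / 1e9, 1),
            free_vram_gb=round(free_bytes / 1e9, 1),
        )
    return info


print(describe_cuda())
```

If cuda_available comes back False on a machine with a 3090, the usual suspects are a CPU-only wheel or a driver older than the wheel's CUDA runtime.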

Running

Path A — `world_engine` with a scripted controller sequence

Adapted from examples/gen_sample.py in the world_engine repo. To hit the model card's published 30 FPS RTX 3090 throughput, pass quant="intw8a8":

# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
import sys
import cv2
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput

# Use intw8a8 on the RTX 3090 — matches the model card's 30 FPS benchmark config.
engine = WorldEngine(sys.argv[1], quant="intw8a8", device="cuda")

# Build a small controller programme: mouse look, jump, then walk W/A/D/S.
# Key codes as used in this recipe: 87 = W, 65 = A, 68 = D, 83 = S (32 is used for jump).
controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={87})] * 10 +  # W: forward
    [CtrlInput(button={65})] * 10 +  # A: left
    [CtrlInput(button={68})] * 10 +  # D: right
    [CtrlInput(button={83})] * 10    # S: back
)

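# Seed frame (any RGB image works; it is resized to 1280x720 below)
seed_frame = cv2.imread("starter.png")
if seed_frame is None:  # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError("starter.png not found or unreadable")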
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))

with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    engine.append_frame(seed_frame_x4)
    out.write(seed_frame_x4, fps=60, codec="libx264")
    for ctrl in controller_sequence:
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())

Note the 4-frame chunking: each gen_frame() call returns four frames, with shape [4, 720, 1280, 3]. The world_engine README documents this explicitly: Waypoint-1.5 "applies temporal compression and generates 4 frames for every controller input". This is why the seed frame is repeated four times before it is appended, and it is consistent with the 60 FPS / 4-step schedule of the model card's reference performance numbers.
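Because every controller input yields four frames, the length of the output clip follows directly from the number of inputs. A small sketch of that arithmetic for the script above (5 warm-up inputs plus 4 × 10 walking inputs); rollout_stats is an illustrative helper, not a world_engine function:

```python
def rollout_stats(num_inputs, frames_per_input=4, seed_frames=4, fps=60):
    """Return (total_frames, clip_seconds) for a scripted rollout."""
    total_frames = seed_frames + num_inputs * frames_per_input
    return total_frames, total_frames / fps


frames, seconds = rollout_stats(num_inputs=5 + 4 * 10)
print(frames, round(seconds, 2))  # 184 3.07
```

The 45 inputs therefore produce a clip of roughly three seconds when written at fps=60. Generation itself may take longer than that in wall-clock time if the engine runs below 60 FPS.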

Path B — `diffusers` ModularPipeline

The model card ships this canonical snippet:

import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")

state.values["image"] = None
frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)

button={87} is the W key (walk forward). Replace the fixed input with your own input loop for real-time controllable rollouts. Note that this diffusers path loads the model in BF16, so on Ampere it does NOT get the INT8 throughput uplift of Path A's intw8a8 config. For the 30 FPS target, prefer Path A.
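When wiring up an input loop, a small translation layer from key names to button codes keeps magic numbers out of the rollout code. The sketch below is an assumption-heavy illustration: the codes are the ones this recipe uses (87 for W is stated above; 65, 68, 83 and 32 are taken from the Path A script's comments), and KEY_CODES and keys_to_buttons are hypothetical names, not part of diffusers or world_engine.

```python
# Key name -> button code, as used in this recipe's examples (assumed mapping).
KEY_CODES = {"W": 87, "A": 65, "S": 83, "D": 68, "SPACE": 32}


def keys_to_buttons(pressed):
    """Translate an iterable of key names into the set passed as button=..."""
    names = [key.upper() for key in pressed]
    unknown = [name for name in names if name not in KEY_CODES]
    if unknown:
        raise KeyError(f"unmapped keys: {unknown}")
    return {KEY_CODES[name] for name in names}


print(keys_to_buttons(["w"]))  # {87}
```

Inside the rollout loop, the result would replace the hard-coded set, for example button=keys_to_buttons(currently_pressed), where currently_pressed comes from whatever input library you use.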

Results

Speed: 30 FPS at 720p, w8a8 quantized, 4-step — published directly by the Waypoint 1.5 model card as one of three reference data points in its "Waypoint 1 vs Waypoint 1.5" performance table. Verbatim row: "4-step w8a8 (3090) → 30 FPS". This is a first-party number on the exact target GPU — no cross-card extrapolation involved. For comparison, the same table publishes 56 FPS (5090 BF16 unquantized, 4-step) and 72 FPS (5090 w8a8 quantized, 4-step) — the 3090's INT8 throughput sits at roughly 42% of the 5090's INT8 number, which is consistent with the generational compute gap (Ampere sm_86 vs Blackwell sm_120). If you measure something different, please submit your numbers — a community benchmark in /check/waypoint-1-5/rtx-3090 will replace this single first-party citation.
VRAM usage: The model card does not state a VRAM figure. As a derived envelope: the BF16 weights are ~3.72 GB for the fused model.safetensors plus ~7.44 GB for the modular transformer plus ~22.76 MB for the VAE per the HF tree API — roughly 12 GB on-disk BF16. The w8a8 INT8 path reduces this further at runtime. The 24 GB envelope of an RTX 3090 has ample headroom for the 512-frame context window and the autoencoder pipeline even at unquantized BF16. Live measurements: /check/waypoint-1-5/rtx-3090.
Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window (model card). At 60 FPS, 512 frames cover roughly 8.5 seconds of rollout (512 / 60). On the RTX 3090 with intw8a8, expect 30 FPS at 720p per the card's published number, which is half the 60 FPS family-level target.
Quality notes: Waypoint is "a generative world model, not a simulator with guaranteed physical accuracy" — design priorities are "Real-time interaction rather than offline batch generation, Low-latency responsiveness to user inputs, Local execution on consumer hardware, Persistent world rollouts where coherence across time matters as much as single-frame fidelity" (model card).
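The ratios and durations quoted in these results are simple to re-derive. A minimal sketch using only the figures cited above (30, 56 and 72 FPS, and the 512-frame context window); the dictionary keys and helper names are illustrative:

```python
# FPS figures as cited above from the model card's performance table.
FPS = {"3090_w8a8": 30, "5090_bf16": 56, "5090_w8a8": 72}
CONTEXT_FRAMES = 512  # context window length in frames


def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def context_seconds(fps, frames=CONTEXT_FRAMES):
    """Wall-clock span covered by the context window at a given frame rate."""
    return frames / fps


ratio = FPS["3090_w8a8"] / FPS["5090_w8a8"]
print(f"3090 vs 5090 (both w8a8): {ratio:.1%}")
print(f"frame budget at 30 FPS: {frame_budget_ms(30):.1f} ms")
print(f"512-frame context at 60 FPS: {context_seconds(60):.1f} s")
print(f"512-frame context at 30 FPS: {context_seconds(30):.1f} s")
```

At 30 FPS the per-frame budget is about 33 ms, and the same 512-frame window spans about twice as much wall-clock time as it does at 60 FPS.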

For the full benchmark data, see /check/waypoint-1-5/rtx-3090.

Troubleshooting

Throughput is below the 30 FPS the model card publishes

The 30 FPS figure is specifically for the intw8a8 quant path at 720p, 4-step. If you're running unquantized BF16 (the default in Path B's diffusers snippet), expect lower throughput on the 3090, because the BF16 path never uses Ampere's INT8 tensor cores. Install world_engine as shown in Path A and pass quant="intw8a8". If you stay on the diffusers path, confirm that pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False) is enabled; the model card snippet includes it for a reason.
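To compare your own throughput against the published figure, time a fixed number of generation calls after a warm-up. The harness below is a generic sketch, not a world_engine utility: measure_fps and its parameters are hypothetical, and frames_per_call=4 reflects the 4-frame chunking described in the Running section.

```python
import time


def measure_fps(step, calls=30, warmup=5, frames_per_call=4, clock=time.perf_counter):
    """Time `calls` invocations of `step` and return frames per second.

    `step` must block until its frames are actually ready; otherwise
    asynchronous GPU work makes the result look faster than it is.
    """
    for _ in range(warmup):  # let compilation and caches settle first
        step()
    start = clock()
    for _ in range(calls):
        step()
    elapsed = clock() - start
    return calls * frames_per_call / elapsed


# Stand-in workload so the sketch runs anywhere; swap in a real generation call.
fps = measure_fps(lambda: time.sleep(0.001), calls=20, warmup=2)
print(f"measured: {fps:.0f} FPS")
```

With Path A, step could be something like lambda: engine.gen_frame(ctrl=CtrlInput()).cpu(), since copying the result to the CPU forces the GPU work to finish; calling torch.cuda.synchronize() inside step serves the same purpose.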

`fp8w8a8` errors or no speed uplift

The world_engine README explicitly limits fp8w8a8 to Ada Lovelace and Hopper+; Ampere sm_86 has no FP8 tensor cores. On the RTX 3090, do not attempt quant="fp8w8a8"; use quant="intw8a8" instead. The Blackwell-only nvfp4 path does not apply either: it relies on FP4 kernels via FlashInfer/CUTLASS that Ampere Tensor Cores do not implement.

`HF_TOKEN` errors / 401 on download

Per the world_engine README, world_engine reads the HF_TOKEN environment variable when it is set (export HF_TOKEN=<your_huggingface_access_token>). The Waypoint-1.5-1B weights are public and Apache-2.0, so no licence acceptance or access request is required; a token is only needed if an anonymous download hits a rate limit or a transient 401.

Confusion with the 360P variant, the SPar3D successor rumour, or other "Waypoint" projects

Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor — spar3d/Waypoint-1.5 does not resolve on the Hub. Unrelated "Waypoint" libraries (e.g. game-dev navigation, robotics path planning) are different projects — don't conflate them.

Picking a quant path on Ampere (RTX 3090)

Default to intw8a8 on the 3090. The unquantized BF16 path works (the ~11.18 GB of on-disk BF16 weights, rounded up to ~12 GB elsewhere in this recipe, fit comfortably in 24 GB) and is the right starting point if you want to verify install correctness before introducing quantization, but the 30 FPS headline number requires INT8. The fp8w8a8 path is Ada/Hopper-only and the nvfp4 path is Blackwell-only, so neither applies on Ampere. See the world_engine README quant-path table for the architecture mapping.

For other issues, file a report via the submission form.