Waypoint 1.5 on RTX 3090 Ti: Real-Time Interactive World Model at 720p, ~32 FPS

What You'll Build

A local install of Waypoint 1.5 — Overworld's 1.2B-parameter real-time interactive video world model — running 720p generation driven by keyboard and mouse input on an RTX 3090 Ti, on the same intw8a8 quant path the model card uses to publish its first-party 30 FPS RTX 3090 benchmark. The 3090 Ti shares the 3090's Ampere sm_86 arch and 24 GB envelope with ~8% more memory bandwidth and ~12% more FP16 compute, so expect a small uplift on the Ti at the same settings — see Results for the close-sibling forward-statement.

Hardware data: RTX 3090 Ti (24GB VRAM) · 720p, w8a8 quantized, 4-step — close-sibling forward-statement from the model card's published 3090 row · See benchmark data

ℹ️ Not an image-to-3D-mesh model. Despite being grouped in our 3d vertical (which spans both classical mesh-output models like TRELLIS / Hunyuan3D and the newer generative-worlds class), Waypoint 1.5 does not produce .obj / .glb / .ply mesh files. It produces an interactive video stream conditioned on live controller inputs, with success measured in frames-per-second and rollout coherence rather than mesh topology. If you need a mesh output, see TRELLIS or Hunyuan3D-2 on this GPU instead.

Variant note: The Waypoint 1.5 family ships two tiers: a 720p model for "desktop RTX 30 series through RTX 50 series cards" (this recipe) and a 360P variant for laptop GPUs and Apple Silicon. The RTX 3090 Ti is a desktop Ampere card, so it falls squarely within the range the 720p Waypoint-1.5-1B model targets.

Requirements

Component	Minimum	Tested
GPU	NVIDIA RTX 30-series or later desktop card	RTX 3090 Ti (24GB) — pair not yet benchmarked in our DB, see /check/
VRAM	Not stated in the model card; the 1.2B BF16 weights are ~3.72 GB for the fused `model.safetensors` per the HF tree API, plus a separate modular transformer at ~7.44 GB and a small VAE at ~22.76 MB — derived envelope ~12 GB at BF16 (well below the 3090 Ti's 24 GB)	24GB
RAM	16GB system RAM	—
Storage	~12 GB for BF16 safetensors + caches	—
Software	Python 3.10+, PyTorch with CUDA + BF16, HuggingFace `diffusers` (or `world_engine`)	—

The official Waypoint-1.5-1B model card does not publish an explicit VRAM number, but it does publish a per-GPU FPS table that names the RTX 3090 directly (see Results for the close-sibling forward-statement to the 3090 Ti). The 3090 Ti's 24 GB is comfortably above the ~12 GB derived BF16 footprint, so the binding constraint on Ampere is compute rather than memory: the published 30 FPS figure depends on the INT8 path (see Installation).
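The ~12 GB figure is a rounded-up sum of the three file sizes quoted in the Requirements table. A minimal sketch of that arithmetic follows, using only those sizes. The dictionary keys are illustrative labels rather than real repository paths, and on-disk size is only a rough proxy for runtime VRAM.

```python
# On-disk BF16 sizes quoted in the Requirements table (GB, per the HF tree API).
# The keys are illustrative labels, not actual file paths in the repo.
weights_gb = {
    "fused model.safetensors": 3.72,
    "modular transformer": 7.44,
    "vae": 22.76 / 1000,  # 22.76 MB expressed in GB
}

total_gb = sum(weights_gb.values())
headroom_gb = 24 - total_gb  # against the 3090 Ti's 24 GB of VRAM

print(f"on-disk BF16 total: {total_gb:.2f} GB")
print(f"headroom on 24 GB:  {headroom_gb:.2f} GB")
```

The total comes to about 11.18 GB, which this recipe rounds up to ~12 GB. Activations, the context window and compile caches come on top of the weights, so treat the headroom as an upper bound.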

Installation

Two paths cover the canonical entry points the Overworld team documents:

Path A — `world_engine` (recommended for interactive use)

world_engine is Overworld's reference inference library, linked from the model card as the "Core Inference Library". Per the official Wayfarer-Labs/world_engine README:

python3 -m venv .env
source .env/bin/activate
pip install --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"
export HF_TOKEN=<your_huggingface_access_token>

The README maps inference quantization paths by GPU architecture:

Config	Description	Supported GPUs (per the README)
`intw8a8`	INT8 weights + INT8 dynamic per-token activations	NVIDIA (30xx, 40xx, Ampere+)
`fp8w8a8`	FP8 (e4m3) weights + FP8 per-tensor activations via `torch._scaled_mm`	NVIDIA Ada Lovelace / Hopper+ (RTX 40xx, H100)
`nvfp4`	NVFP4 weights + FP4 activations via FlashInfer/CUTLASS	NVIDIA Blackwell (B100, B200, RTX 5090)

The RTX 3090 Ti is Ampere (sm_86), the same arch generation as the RTX 3090, so the only quantized fast path is intw8a8. This is also the path behind the model card's headline RTX 3090 figure (4-step w8a8 (3090) → 30 FPS). The README table lists fp8w8a8 for Ada Lovelace / Hopper+ only, which does not include Ampere; attempting it on a 3090 Ti will at best fall back to BF16 dequantization and lose the throughput benefit. The nvfp4 path does not apply either: the README lists it for Blackwell only (sm_100 / sm_120 class hardware).

⚠️ FP8 weight ≠ FP8 compute on Ampere. Ampere (sm_86) has no FP8 tensor cores — this is true on both the RTX 3090 and the RTX 3090 Ti. An FP8 weight file will load but the runtime dequantizes to BF16 on the fly, costing you the speed win without giving the Ti the throughput uplift Ada/Blackwell cards get. Default to intw8a8 on the 3090 Ti — that's both the world_engine README's recommendation and the configuration the model card's RTX 3090 measurement was taken under.
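The architecture mapping above can be expressed as a small lookup on CUDA compute capability. This is a sketch, not part of world_engine: `pick_quant` and `detect_capability` are hypothetical helper names, and the thresholds assume the usual capability numbers (Ampere 8.0/8.6, Ada 8.9, Hopper 9.0, Blackwell 10.0 and above). It picks the newest path the README table lists for each architecture, which is an assumption about preference rather than a documented rule.

```python
from typing import Optional, Tuple


def pick_quant(capability: Tuple[int, int]) -> Optional[str]:
    """Map a CUDA compute capability to a world_engine quant config name.

    Hypothetical helper based on the README table quoted above.
    Returns None when no quantized path is listed (run unquantized BF16).
    """
    major, _minor = capability
    if major >= 10:            # Blackwell (sm_100 / sm_120)
        return "nvfp4"
    if capability >= (8, 9):   # Ada Lovelace (8.9) and Hopper (9.0)
        return "fp8w8a8"
    if major >= 8:             # Ampere (8.0 / 8.6), e.g. RTX 3090 / 3090 Ti
        return "intw8a8"
    return None


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the first GPU's compute capability, or None without CUDA/PyTorch."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(0))


print("RTX 3090 Ti (sm_86):", pick_quant((8, 6)))
print("this machine:", detect_capability())
```

On a 3090 Ti the detected capability should be (8, 6), which maps to intw8a8, the value passed as `quant` in the Running section.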

Path B — HuggingFace `diffusers` (modular pipelines API)

Per the official model card, Waypoint also ships as a ModularPipeline. Install the latest diffusers against a standard CUDA wheel — Ampere is fully covered by the default pip install torch (no special index needed, unlike Blackwell which requires cu128):

pip install --upgrade diffusers transformers accelerate safetensors
pip install torch  # default cu124/cu126 index is fine on Ampere sm_86

Running

Path A — `world_engine` with a scripted controller sequence

Adapted from examples/gen_sample.py in the world_engine repo. To match the model card's published 30 FPS RTX 3090 throughput configuration (and pick up the small Ti uplift on top — see Results), pass quant="intw8a8":

# uv run --dev examples/gen_sample.py Overworld/Waypoint-1.5-1B
# Extra dependencies for this script: opencv-python, imageio, av (PyAV)
import sys
import cv2
import numpy as np
import imageio.v3 as iio
import torch
from world_engine import WorldEngine, CtrlInput

# Use intw8a8 on the RTX 3090 Ti — same config as the model card's 3090 benchmark.
engine = WorldEngine(sys.argv[1], quant="intw8a8", device="cuda")

# Build a small controller programme: mouse look, jump (button 32), then walk W/A/D/S
controller_sequence = [
    CtrlInput(mouse=[0.2, 0.2]), CtrlInput(button={32}), CtrlInput(),
    CtrlInput(button={1, 32}), CtrlInput(),
]
controller_sequence += (
    [CtrlInput(button={87})] * 10 +  # W: forward
    [CtrlInput(button={65})] * 10 +  # A: left
    [CtrlInput(button={68})] * 10 +  # D: right
    [CtrlInput(button={83})] * 10    # S: back
)

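# Seed frame (any RGB image works; it is resized to 1280x720)
seed_frame = cv2.imread("starter.png")
if seed_frame is None:
    raise FileNotFoundError("starter.png not found or unreadable")
seed_frame = cv2.cvtColor(cv2.resize(seed_frame, (1280, 720)), cv2.COLOR_BGR2RGB)
# Repeat the seed image 4x to form one 4-frame chunk: [4, 720, 1280, 3]
seed_frame_x4 = torch.from_numpy(np.repeat(seed_frame[None], 4, axis=0))

with iio.imopen("out.mp4", "w", plugin="pyav") as out:
    engine.append_frame(seed_frame_x4)
    # Write NumPy arrays to the encoder, matching the generated frames below
    out.write(seed_frame_x4.numpy(), fps=60, codec="libx264")
    for ctrl in controller_sequence:
        out.write(engine.gen_frame(ctrl=ctrl).cpu().numpy())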

Note the 4-frame chunking: gen_frame() returns four frames per call. The world_engine README documents this explicitly: Waypoint-1.5 "applies temporal compression and generates 4 frames for every controller input", with output shape [4, 720, 1280, 3]. The script writes these chunks at 60 FPS, matching the 60 FPS / 4-step schedule of the model card's reference performance numbers.
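Because each controller input yields four frames, the length of the output clip follows directly from the length of the scripted sequence. A quick sketch of that arithmetic for the script above (the constant names are illustrative):

```python
FRAMES_PER_INPUT = 4  # per the world_engine README quote above
OUTPUT_FPS = 60       # the fps passed to out.write() in the script
SEED_FRAMES = 4       # the seed image repeated four times

n_inputs = 5 + 4 * 10  # 5 opening inputs + four runs of 10 held keys
generated_frames = n_inputs * FRAMES_PER_INPUT
total_frames = SEED_FRAMES + generated_frames
duration_s = total_frames / OUTPUT_FPS

print(f"{n_inputs} inputs -> {generated_frames} generated frames")
print(f"{total_frames} frames written, {duration_s:.2f} s at {OUTPUT_FPS} FPS")
```

The 45 inputs produce 180 generated frames, so out.mp4 holds 184 frames, or roughly three seconds of video. Lengthen the runs of held keys if you want a longer rollout.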

Path B — `diffusers` ModularPipeline

The model card ships this canonical snippet:

import torch
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

pipe = ModularPipeline.from_pretrained(
    "Overworld/Waypoint-1.5-1B", trust_remote_code=True
)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)

image = load_image(
    "https://huggingface.co/spaces/Overworld/waypoint-1-small/resolve/main/starter_18.png"
).resize((1024, 512))

state = pipe(image=image, prompt="An explorable world",
             button=set(), mouse=(0.0, 0.0), output_type="pil")

state.values["image"] = None
frames = []
for _ in range(150):
    state = pipe(state, button={87}, mouse=(0.0, 0.0), output_type="pil")
    frames.append(state.values["images"])

export_to_video(frames, "waypoint-v1-5.mp4", fps=60)

button={87} is the W key (walk forward). Replace it with your own input loop for real-time controllable rollouts. Note that this diffusers path loads at BF16, which on Ampere does not get the INT8 throughput uplift of Path A's intw8a8 config. To get as close as possible to the 30 FPS reference on the 3090 Ti, prefer Path A.
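The button codes that appear in this recipe (87 for W, 65 for A, 68 for D, 83 for S) coincide with the ASCII codes of the uppercase letters. Assuming that convention holds for the other keys you need, a small helper can translate key names into the sets passed as `button`. The helper name `keys_to_buttons` is hypothetical, and the assumption rests only on the codes shown here, not on documented API behaviour.

```python
def keys_to_buttons(keys: str) -> set[int]:
    """Map key characters such as "wa" to a set of button codes.

    Hypothetical helper: assumes each button code equals the ASCII code of
    the uppercase key, which matches the 87/65/68/83 values in this recipe.
    """
    return {ord(k.upper()) for k in keys}


print(keys_to_buttons("w"))   # the set used for walking forward above
print(keys_to_buttons("wd"))  # forward + right held together
print(keys_to_buttons(""))    # no keys held
```

In an input loop you would build the set from whatever keys are currently held and pass it as `button=keys_to_buttons(held)` on each pipeline call.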

Results

Speed: The Waypoint 1.5 model card publishes a first-party performance row for the RTX 3090: in the "Waypoint 1 vs Waypoint 1.5" table, the 4-step w8a8 (3090) row reports 30 FPS for Waypoint 1.5. It does not publish a row for the RTX 3090 Ti. The Ti is a close sibling of the 3090: same Ampere sm_86 arch, same 24 GB GDDR6X envelope, with ~8% more memory bandwidth (1008 GB/s vs 936 GB/s) and ~12% more FP16 dense compute (~40 TFLOPS vs ~35.6 TFLOPS) per TechPowerUp's specs. The close-sibling rule covers same-arch, same-VRAM-tier pairs with a ≤10% compute delta; the Ti's ~12% FP16 figure sits slightly above that threshold, so treat the estimate as approximate. Forward-statement: expect a small uplift over the 3090's 30 FPS, roughly 32-33 FPS at the same 4-step w8a8 720p settings. The actual delta depends on which pipeline stage is binding: memory-bound stages benefit from the +8% bandwidth, compute-bound stages from the +12% TFLOPS. This is a forward-statement, not a measurement; we have not measured the 3090 Ti directly. If you run the benchmark yourself, please submit your numbers: a community measurement on the Ti will replace this forward-statement at /check/waypoint-1-5/rtx-3090-ti. For reference, the same model-card table publishes 56 FPS (5090 BF16 unquantized, 4-step) and 72 FPS (5090 w8a8 quantized, 4-step).
VRAM usage: The model card does not state a VRAM figure. As a derived envelope, per the HF tree API the BF16 weights are ~3.72 GB for the fused model.safetensors, ~7.44 GB for the modular transformer, and ~22.76 MB for the VAE: about 11.2 GB on disk, rounded up to ~12 GB in this recipe. The w8a8 INT8 path should reduce the weight footprint further at runtime. The RTX 3090 Ti's 24 GB leaves ample headroom for the 512-frame context window and the autoencoder pipeline even at unquantized BF16. Live measurements: /check/waypoint-1-5/rtx-3090-ti.
Latency target: The family-level target is "up to 720p and 60 FPS" with a 512-frame context window (model card); at 60 FPS, 512 frames is about 8.5 seconds of rollout. On the RTX 3090 Ti with intw8a8, expect roughly the 3090's 30 FPS plus the small close-sibling uplift discussed above, which is well below the family-level 60 FPS target (only the 5090 rows in the model card's table approach or exceed it).
Quality notes: Per the model card "Limitations" section, Waypoint "is a generative world model, not a simulator with guaranteed physical accuracy." The card's "Compared with a conventional video generation workflow" section lists the family's design priorities as: real-time interaction rather than offline batch generation; low-latency responsiveness to user inputs; local execution on consumer hardware; and persistent world rollouts where coherence across time matters as much as single-frame fidelity.
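The forward-statement can be reproduced with simple scaling. The sketch below assumes throughput scales linearly with whichever resource is binding, which is a simplification rather than a measured property of this model, and uses only the spec figures quoted in this section.

```python
BASE_FPS = 30.0  # model card: 4-step w8a8 (3090)

# Spec figures quoted above (TechPowerUp): memory bandwidth and FP16 dense compute.
BW_3090, BW_3090_TI = 936.0, 1008.0       # GB/s
TFLOPS_3090, TFLOPS_3090_TI = 35.6, 40.0  # approximate

# Two bounding cases: fully memory-bound vs fully compute-bound.
bandwidth_bound_fps = BASE_FPS * BW_3090_TI / BW_3090
compute_bound_fps = BASE_FPS * TFLOPS_3090_TI / TFLOPS_3090

# Rollout covered by the 512-frame context window at the 60 FPS family target.
context_seconds = 512 / 60

print(f"bandwidth-bound estimate: {bandwidth_bound_fps:.1f} FPS")
print(f"compute-bound estimate:   {compute_bound_fps:.1f} FPS")
print(f"context window at 60 FPS: {context_seconds:.1f} s")
```

The two bounds land at about 32.3 and 33.7 FPS, which is where the rough 32-33 FPS figure comes from. Real throughput can fall outside this range, since clocks, thermals and kernel selection are not captured by a linear model.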

For the full benchmark data, see /check/waypoint-1-5/rtx-3090-ti.

Troubleshooting

Throughput is below the ~32 FPS close-sibling forward-statement

The 30 FPS first-party RTX 3090 figure (and the ~32 FPS forward-statement for the Ti) is specifically for the intw8a8 quant path at 720p, 4-step. If you're running unquantized BF16 (Path B's diffusers default), you'll see lower throughput on the 3090 Ti because that path never uses the INT8 tensor-core kernels. Install world_engine from the Git URL shown under Installation (Path A) and run with quant="intw8a8". If you stay on the diffusers path, confirm that pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False) is enabled; the model card snippet includes it for a reason.
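Before comparing against the 30 FPS reference, it helps to measure throughput the same way on every run. The sketch below is a generic timing helper, not part of world_engine. The name `measure_fps` is hypothetical, and counting four output frames per call is an assumption about how the model card's FPS figures are defined.

```python
import time
from typing import Callable


def measure_fps(step: Callable[[], object], n_calls: int = 20,
                frames_per_call: int = 4, warmup: int = 3) -> float:
    """Time repeated calls to `step` and return output frames per second.

    Warm-up calls are excluded so one-off costs (compilation, autotuning,
    cache allocation) do not distort the steady-state number.
    """
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(n_calls):
        step()
    elapsed = time.perf_counter() - start
    return n_calls * frames_per_call / elapsed


# Stand-in workload so the sketch runs without a GPU. With Path A you would
# pass something like: lambda: engine.gen_frame(ctrl=CtrlInput()).cpu()
# (the .cpu() copy forces the GPU work to finish before the timer stops).
fps = measure_fps(lambda: time.sleep(0.01), n_calls=10)
print(f"{fps:.1f} FPS")
```

If the measured number stays well below the forward-statement, check that the quantized path is actually in use before looking at other causes.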

`fp8w8a8` errors or no speed uplift

The world_engine README limits fp8w8a8 to Ada Lovelace and Hopper+, and Ampere sm_86 has no FP8 tensor cores. On the RTX 3090 Ti (same Ampere sm_86 as the 3090), do not attempt quant="fp8w8a8"; use quant="intw8a8" instead. The Blackwell-only nvfp4 path does not apply either: it relies on FlashInfer/CUTLASS kernels for Blackwell-class hardware (sm_100 / sm_120) that Ampere Tensor Cores do not implement.

`HF_TOKEN` errors / 401 on download

Per the world_engine README, world_engine reads the HF_TOKEN environment variable if one is set (export HF_TOKEN=<your_huggingface_access_token>). The Waypoint-1.5-1B weights are public and Apache-2.0, so no licence acceptance or access request is required; a token is only needed if an anonymous download hits a rate limit or a transient 401.
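When debugging a 401, a first step is to confirm whether the process can actually see a token. A minimal sketch follows; the helper name `hf_token_status` is hypothetical, and it deliberately reports only the token's length so the secret never ends up in logs.

```python
import os
from typing import Mapping


def hf_token_status(env: Mapping[str, str] = os.environ) -> str:
    """Report whether HF_TOKEN is visible, without printing its value."""
    token = env.get("HF_TOKEN", "")
    if not token:
        return "unset (downloads will be anonymous)"
    return f"set ({len(token)} chars)"


print("HF_TOKEN:", hf_token_status())
```

If this reports "unset" even though you exported the variable, the export probably happened in a different shell or outside the activated virtual environment's session.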

Confusion with the 360P variant, the SPar3D successor rumour, or other "Waypoint" projects

Only the Overworld/Waypoint-1.5-1B repo (720p) and its Overworld/Waypoint-1.5-1B-360P sibling (laptop tier) are the canonical world-model weights. Despite the name overlap, this is not an image-to-3D mesh model and not a SPar3D successor — spar3d/Waypoint-1.5 does not resolve on the Hub. Unrelated "Waypoint" libraries (e.g. game-dev navigation, robotics path planning) are different projects — don't conflate them.

Picking a quant path on Ampere (RTX 3090 Ti)

Default to intw8a8 on the 3090 Ti. The unquantized BF16 path works (the ~11.2 GB of on-disk weights, rounded to ~12 GB in Requirements, fits comfortably in 24 GB) and is the right starting point if you want to verify install correctness before introducing quantization. However, the model card's headline 3090 throughput requires INT8, and the Ti's small uplift over the 3090 only materializes on the same quant path. The fp8w8a8 path is Ada/Hopper-only and the nvfp4 path is Blackwell-only, so neither applies on Ampere. See the world_engine README quant-path table for the architecture mapping.

For other issues, file a report via the submission form.