Waypoint-1: Real-time Interactive Video Diffusion from Overworld

Waypoint-1 is Overworld’s new real-time interactive video diffusion model, controllable through text, mouse, and keyboard inputs. The model was trained on 10,000 hours of video game footage and lets users create and interact with generated worlds with minimal latency.

The model addresses two limitations of existing world models: high latency and overly simple controls. Where those models typically accept only limited camera movement once every few frames, Waypoint-1 permits free camera movement, full keyboard input, and zero-latency frame generation, providing a truly interactive experience.

Why This Matters

Current AI video generation often trades interactivity for realism: it requires significant computational resources and exhibits delays that are unacceptable for real-time applications. Existing solutions often rely on fine-tuning pre-trained video models with simplistic controls, which limits user agency and makes them feel sluggish. The result is a disconnect between what the user intends and what they experience, which hinders applications such as game development and virtual environments.

Key Insights

Diffusion Forcing: Waypoint-1 is pre-trained using diffusion forcing, a technique that trains the model to denoise future frames given past frames.
Self-Forcing: Post-training with self-forcing addresses error accumulation and noise in long rollouts, improving output quality.
WorldEngine: Overworld’s Python inference library, WorldEngine, is optimized for interactive applications and reaches ~30,000 token-passes/sec on an NVIDIA RTX 5090.
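
To make the diffusion forcing idea above more concrete, here is a minimal toy sketch of the data preparation step. It is not Overworld's training code. It assumes the usual diffusion forcing setup, in which every frame of a clip receives its own independently sampled noise level and the model denoises a frame while seeing only that frame and earlier ones. The linear mixing rule, the function names `noise_frames` and `causal_context`, and the toy frames are all assumptions made for illustration.

```python
import random


def noise_frames(frames, rng):
    """Corrupt each frame with its own independently sampled noise level.

    frames: list of frames, each a flat list of floats (toy stand-in for pixels).
    Returns (noisy_frames, noise_levels, noises).
    """
    noisy_frames, levels, noises = [], [], []
    for frame in frames:
        t = rng.random()  # per-frame noise level in [0, 1); 0 means clean
        eps = [rng.gauss(0.0, 1.0) for _ in frame]
        # Assumption: linear interpolation between the clean frame and noise.
        noisy = [(1.0 - t) * x + t * e for x, e in zip(frame, eps)]
        noisy_frames.append(noisy)
        levels.append(t)
        noises.append(eps)
    return noisy_frames, levels, noises


def causal_context(noisy_frames, index):
    """Frames visible when denoising frame `index`: that frame and earlier ones."""
    return noisy_frames[: index + 1]


rng = random.Random(0)
clip = [[float(i)] * 4 for i in range(5)]  # 5 toy "frames" of 4 values each
noisy, levels, _ = noise_frames(clip, rng)
for i, t in enumerate(levels):
    ctx = causal_context(noisy, i)
    print(f"frame {i}: noise level {t:.2f}, context size {len(ctx)}")
```

In a real training step, a network would receive each frame's causal context and noise level and be trained to recover the clean frame (or the noise). Because noise levels differ across frames, the model learns to denoise a heavily corrupted future frame while conditioning on cleaner past frames, which is the situation it faces when generating frame by frame.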

Working Example

from world_engine import WorldEngine, CtrlInput

# Create the inference engine
engine = WorldEngine("Overworld/Waypoint-1-Small", device="cuda")

# Set the text prompt that describes the world
engine.set_prompt("A game where you herd goats in a beautiful valley")

# Optional: force the next frame to be a specific image
# img = engine.append_frame(uint8_img)  # uint8 image, shape (H, W, 3)

# Generate 3 video frames, each conditioned on one controller input
for controller_input in [
    CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
    CtrlInput(mouse=[0.1, 0.2]),
    CtrlInput(button={95, 32, 105}),
]:
    img = engine.gen_frame(ctrl=controller_input)
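
Since the point of the model is real-time interaction, it can be useful to check whether a frame generator actually keeps up with a target frame rate. The sketch below is one way to do that with the standard library only. `run_frames` and `fake_gen_frame` are hypothetical helpers, not part of WorldEngine, and the 30 FPS target is an arbitrary illustrative value. To time the real model, one would pass something like `lambda c: engine.gen_frame(ctrl=c)` in place of the stand-in.

```python
import time


def run_frames(gen_frame, inputs, target_fps=30.0):
    """Call gen_frame once per input and summarize per-frame latency."""
    if not inputs:
        raise ValueError("inputs must contain at least one controller input")
    budget = 1.0 / target_fps  # seconds available per frame
    latencies = []
    for ctrl in inputs:
        start = time.perf_counter()
        gen_frame(ctrl)
        latencies.append(time.perf_counter() - start)
    worst = max(latencies)
    return {
        "frames": len(latencies),
        "mean_latency_s": sum(latencies) / len(latencies),
        "worst_latency_s": worst,
        "within_budget": worst <= budget,
    }


def fake_gen_frame(ctrl):
    """Hypothetical stand-in for a real frame generator."""
    time.sleep(0.001)  # pretend to do about 1 ms of work


stats = run_frames(fake_gen_frame, inputs=[None, None, None], target_fps=30.0)
print(stats)
```

Reporting the worst-case latency alongside the mean matters for interactive use, because a single slow frame is perceived as a stutter even when the average looks fine.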

Practical Applications

Game Development: Enable rapid prototyping and interactive world design with real-time feedback.
Virtual Environments: Create immersive, controllable virtual spaces for training, simulation, or entertainment.
